Parsing in EBNFParser

EBNFParser is a parser generator framework that parses raw strings into structured nested lists (ASTs).

Parsing in EBNFParser involves the following steps:

Tokenizing

Tokenizing is the very first step: it splits the input string into a sequence of Ruikowa.ObjectRegex.Tokenizer objects.

A Ruikowa.ObjectRegex.Tokenizer has the following readonly attributes:

  • name : str
    the type of the tokenizer.
  • string : str
    the string content (taken from the raw input) of the tokenizer.
  • colno : int
    the column number in the current file.
  • lineno : int
    the line number in the current file.
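To make these attributes concrete, here is a minimal Python sketch of a Tokenizer-like record. This is an illustration of the four readonly attributes only, not EBNFParser's actual implementation:

```python
from typing import NamedTuple

# Illustrative sketch only -- not EBNFParser's actual Tokenizer class.
# A NamedTuple gives us readonly attributes, matching the description above.
class Tokenizer(NamedTuple):
    name: str    # type of the tokenizer, e.g. "identifier"
    string: str  # matched substring from the raw input
    lineno: int  # line number in the current file
    colno: int   # column number in the current file

tok = Tokenizer(name="identifier", string="abc", lineno=0, colno=0)
print(tok.name, tok.string)  # -> identifier abc
```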

Example:

  • parsing_tokenizing.ruiko
MyTokenType  := 'abc' '233';
# The above syntax defines a literal parser that parses strings like "abc" or "233".
# "abc" and "233" will be added into `token_table` to generate the automatic tokenizing function.

parserToTest ::= MyTokenType+;
# The above syntax defines a combined parser with `MyTokenType`.

# A combined parser is composed of several literal parsers and other combined parsers,
#   and can handle very complicated sequences of `Ruikowa.ObjectRegex.Tokenizer` objects.
  • compile it
ruiko parsing_tokenizing.ruiko parsing_tokenizing.py --test
  • test it
python test_lang.py parserToTest "abc233233"
=========================ebnfparser test script================================
parserToTest[
    [name: MyTokenType, string: "abc"]
    [name: MyTokenType, string: "233"]
    [name: MyTokenType, string: "233"]
]
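The automatic tokenizing function built from `token_table` behaves roughly like the regex scan below. This is a hedged sketch of the idea, assuming a token table holding the literals "abc" and "233" from `MyTokenType`; it is not the code that `ruiko` actually generates:

```python
import re

# Assumed token table: (tokenizer name, pattern) pairs built from the grammar.
token_table = [("MyTokenType", re.compile(r"abc|233"))]

def tokenize(source: str):
    """Scan the source left to right, emitting (name, string) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in token_table:
            m = pattern.match(source, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"cannot tokenize at position {pos}")
    return tokens

print(tokenize("abc233233"))
# -> [('MyTokenType', 'abc'), ('MyTokenType', '233'), ('MyTokenType', '233')]
```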

Take care that if you’re using anonymous literal patterns when defining a combined parser, like the following:

Just ::= 'just'+;

then the name of all the anonymous tokenizers is just "auto_const":

ruiko just.ruiko just.py --test
python test_lang.py Just "justjustjust"
=========================ebnfparser test script================================
Just[
    [name: auto_const, string: "just"]
    [name: auto_const, string: "just"]
    [name: auto_const, string: "just"]
]

CastMap (Optional)

Sometimes we need to handle special cases; a vivid instance is the keyword.

The string content of a keyword can also be matched by identifier (most programming languages have identifiers), just as in the following case:

  • parsing_CastMap.ruiko
ignore [space]
space        := R'\s+';
# ignore the whitespace characters.


identifier   := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword      := 'def' 'for' 'public';

parserToTest ::= (identifier | keyword)+;

There is no doubt that identifier will cover the cases of keyword:

ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public"
=========================ebnfparser test script================================
parserToTest[
    [name: identifier, string: "def"]
    [name: identifier, string: "for"]
    [name: identifier, string: "public"]
]

Take care that all of the Tokenizers have the name identifier, not keyword! As a result, keywords could be used in some illegal places, just like:

for = 1
for for <- [for] do
    for

The above example might not trouble you, but of course there could be something more severe.

I’d like to give a solution adopted by EBNFParser’s auto-token mechanism.

Modify parsing_CastMap.ruiko:

identifier   := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword cast := 'def' 'for' 'public';

Here we define a cast map that maps the strings tokenized by identifier (like "def", "for" and "public") to const strings, and outputs a Ruikowa.ObjectRegex.Tokenizer whose name is the const string "keyword".

ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public other"
=========================ebnfparser test script================================
parserToTest[
    [name: keyword, string: "def"]
    [name: keyword, string: "for"]
    [name: keyword, string: "public"]
    [name: identifier, string: "other"]
]

Perfect!
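Conceptually, the cast map is a post-tokenizing renaming step. The sketch below illustrates the idea under that assumption; it is not EBNFParser's internal implementation:

```python
import re

# Assumed cast set, mirroring: keyword cast := 'def' 'for' 'public';
cast_map = {"def", "for", "public"}
ident = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

def tokenize(source: str):
    """Tokenize everything as `identifier`, then rename any token whose
    string appears in the cast set to the const name `keyword`."""
    tokens = []
    for m in ident.finditer(source):
        word = m.group()
        name = "keyword" if word in cast_map else "identifier"
        tokens.append((name, word))
    return tokens

print(tokenize("def for public other"))
# -> [('keyword', 'def'), ('keyword', 'for'), ('keyword', 'public'), ('identifier', 'other')]
```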

ReStructure Tokenizers

This is what the word “parsing” accurately means.

Maybe you’ve heard of sequence operations like flatMap (Scala-flatMap), collect (FSharp-collect), or selectMany (Linq-SelectMany); that’s great, because parsing is their inverse!
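As a tiny illustration of that inverse relationship: flatMap collapses nested structure into a flat sequence, while parsing rebuilds structure from one. The grouping rule below (by parentheses) is a made-up toy, not part of EBNFParser:

```python
# flatMap collapses nested structure into a flat sequence...
nested = [["def", "f"], ["(", "x", ")"], ["=", "x"]]
flat = [word for group in nested for word in group]
print(flat)  # -> ['def', 'f', '(', 'x', ')', '=', 'x']

# ...and parsing goes the other way: it recovers nested structure from a
# flat token sequence (here, trivially, by grouping on parentheses).
def group_parens(words):
    result, stack = [], []
    for w in words:
        if w == "(":
            stack.append(result)
            result = []
        elif w == ")":
            inner, result = result, stack.pop()
            result.append(inner)
        else:
            result.append(w)
    return result

print(group_parens(flat))  # -> ['def', 'f', ['x'], '=', 'x']
```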

raw words:

    ["def", "f", "(", "x", ")", "=", "x"]

after parsing, there is an AST:

    FunctionDef[
        "f"
              # "def" is thrown away because it's useless to the semantics, but
              # you can preserve it at the cost of extra noise. The same below.
        ArgList[
            "x"
        ],

        Expression[
            "x"
        ]
    ]

And the structure of the parse result matches exactly what you defined with EBNF.
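For intuition, a hand-written restructuring of the flat word list above into that AST could look like the following sketch. This is a plain recursive-descent illustration, not the code EBNFParser generates:

```python
# Illustrative sketch: rebuild the nested AST from the flat word list.
def parse_function_def(words):
    it = iter(words)
    assert next(it) == "def"    # consume and discard the keyword
    name = next(it)             # function name, e.g. "f"
    assert next(it) == "("
    arglist = ["ArgList"]
    tok = next(it)
    while tok != ")":           # collect argument names up to ")"
        arglist.append(tok)
        tok = next(it)
    assert next(it) == "="
    expr = ["Expression", next(it)]
    return ["FunctionDef", name, arglist, expr]

print(parse_function_def(["def", "f", "(", "x", ")", "=", "x"]))
# -> ['FunctionDef', 'f', ['ArgList', 'x'], ['Expression', 'x']]
```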

Here is an example that generates the above AST using ruiko, an EBNF idiom proposed by EBNFParser to extend primary EBNF.

keyword     cast as K       := 'def';
identifier                  := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
FunctionDef throw ['def']   ::= K'def' identifier '(' ArgList ')' '=' Expression;
Expression                  ::= ... # omit
ArgList                     ::= ... # omit

What’s more, EBNFParser supports unlimited left recursion.
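For instance, a directly left-recursive rule like the following (a hypothetical sketch in ruiko notation, not taken from the examples above) is accepted:

```
Atom := R'\d+';
Expr ::= Expr '-' Atom | Atom;
# Parsing "1-2-3" yields a left-leaning tree, ((1 - 2) - 3), which a naive
# recursive-descent parser could not produce without looping forever.
```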