Parsing in EBNFParser¶
EBNFParser is a parser generator framework that parses raw strings into structured nested lists (ASTs).
Parsing in EBNFParser has the following steps:
Tokenizing¶
Tokenizing is the very first step: it splits the input string into a sequence of Ruikowa.ObjectRegex.Tokenizer objects.
A Ruikowa.ObjectRegex.Tokenizer has the following read-only attributes:
- name : str
  - the type of the tokenizer.
- string : str
  - the string content (taken from the raw input) of the tokenizer.
- colno : int
  - the column number in the current file.
- lineno : int
  - the line number in the current file.
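The attributes above can be pictured with a minimal Python stand-in. This is a hypothetical sketch for illustration only, not the actual Ruikowa.ObjectRegex.Tokenizer implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tokenizer:
    # Illustrative stand-in for Ruikowa.ObjectRegex.Tokenizer;
    # the real class may differ in fields and behavior.
    name: str     # type of the tokenizer, e.g. "MyTokenType"
    string: str   # matched slice of the raw input
    lineno: int   # row number in the current file
    colno: int    # column number in the current file

t = Tokenizer(name="MyTokenType", string="abc", lineno=0, colno=0)
print(t)
```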
Example:
- parsing_tokenizing.ruiko
MyTokenType := 'abc' '233';
# The above syntax defines a literal parser to parse strings like "abc" or "233".
# "abc" and "233" will be added into `token_table` to generate the automatic tokenizing function.
parserToTest ::= MyTokenType+;
# The above syntax defines a combined parser with `MyTokenType`.
# A combined parser is composed of several literal parsers and other combined parsers,
# which can handle very complicated cases of a sequence of `Ruikowa.ObjectRegex.Tokenizer` objects.
- compile it
ruiko parsing_tokenizing.ruiko parsing_tokenizing.py --test
- test it
python test_lang.py parserToTest "abc233233"
=========================ebnfparser test script================================
parserToTest[
[name: MyTokenType, string: "abc"]
[name: MyTokenType, string: "233"]
[name: MyTokenType, string: "233"]
]
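To make the generated tokenizing step concrete, here is a hypothetical Python sketch. The names `token_table` and `tokenize` are illustrative assumptions; the real code emitted into parsing_tokenizing.py may differ, and this sketch ignores line breaks:

```python
import re

# Literal patterns collected from the grammar ("abc", "233") are
# gathered into a token table and tried in order at each position.
token_table = [("MyTokenType", re.compile(r"abc|233"))]

def tokenize(source):
    tokens, pos, lineno, colno = [], 0, 0, 0
    while pos < len(source):
        for name, pattern in token_table:
            m = pattern.match(source, pos)
            if m:
                # record the token with its position, then advance
                tokens.append((name, m.group(), lineno, colno))
                colno += m.end() - pos
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character at {lineno}:{colno}")
    return tokens

print(tokenize("abc233233"))
```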
Take care that if you use an anonymous literal pattern when defining a combined parser, like the following:
Just ::= 'just'+;
then the name of every anonymous tokenizer is just "auto_const":
ruiko just.ruiko just --test
python test_lang.py Just "justjustjust"
=========================ebnfparser test script================================
Just[
[name: auto_const, string: "just"]
[name: auto_const, string: "just"]
[name: auto_const, string: "just"]
]
CastMap (Optional)¶
Sometimes we need to handle special cases; a vivid instance is keyword.
The string content of a keyword can also be matched by identifier (most programming languages have identifiers), just as in the following case:
- parsing_CastMap.ruiko
ignore [space]
space := R'\s+';
# ignore the whitespace characters.
identifier := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword := 'def' 'for' 'public';
parserToTest ::= (identifier | keyword)+;
There is no doubt that identifier will cover the cases of keyword:
ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public"
=========================ebnfparser test script================================
parserToTest[
[name: identifier, string: "def"]
[name: identifier, string: "for"]
[name: identifier, string: "public"]
]
Take care that all of the Tokenizers have the name identifier, not keyword! As a result, keywords could be used in some illegal places, just like:
for = 1
for for <- [for] do
for
The above example might not trouble you, but of course there could be more severe problems.
Here is a solution adopted by EBNFParser auto-token.
- modify parsing_CastMap.ruiko
identifier := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
keyword cast := 'def' 'for' 'public';
Here we define a cast map that maps the strings tokenized by identifier (like "def", "for" and "public") to a constant, and outputs a Ruikowa.ObjectRegex.Tokenizer whose name is the constant string "keyword".
ruiko parsing_CastMap.ruiko parsing_CastMap.py --test
python test.py parserToTest "def for public other"
=========================ebnfparser test script================================
parserToTest[
[name: keyword, string: "def"]
[name: keyword, string: "for"]
[name: keyword, string: "public"]
[name: identifier, string: "other"]
]
Perfect!
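The cast-map idea can be sketched in plain Python. The names `cast_map` and `cast` here are illustrative assumptions, not EBNFParser's actual internals: after a token matches identifier, its string is looked up in the cast set and the token is re-labelled keyword on a hit:

```python
# Strings that should be re-labelled when matched as `identifier`.
cast_map = {"def", "for", "public"}

def cast(token):
    # token is a (name, string) pair in this simplified sketch
    name, string = token
    if name == "identifier" and string in cast_map:
        return ("keyword", string)
    return token

tokens = [("identifier", word) for word in "def for public other".split()]
result = [cast(t) for t in tokens]
print(result)
```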
ReStructure Tokenizers¶
This is what the word "parsing" accurately means.
You may have heard of sequence operations like flatMap (Scala), collect (F#), or selectMany (LINQ); that is great, because parsing is their inverse!
Raw words:
["def", "f", "(", "x", ")", "=", "x"]
After parsing, we get an AST:
FunctionDef[
    "f"
    # "def" is thrown away because it is useless to the semantics,
    # but you can preserve it at the cost of extra noise. The same below.
    ArgList[
        "x"
    ],
    Expression[
        "x"
    ]
]
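The inverse relationship can be sketched in Python: flattening the AST with a flatMap-like fold (the helper `flatten` below is a hypothetical illustration) recovers the surviving raw words, while parsing goes the opposite direction. Note that "def", "(", ")" and "=" were thrown away during parsing, so they do not come back:

```python
# A named node is (name, children); a leaf is a raw token string.
ast = ("FunctionDef", ["f", ("ArgList", ["x"]), ("Expression", ["x"])])

def flatten(node):
    if isinstance(node, tuple):  # named node: recurse into children
        _, children = node
        return [leaf for child in children for leaf in flatten(child)]
    return [node]                # leaf: keep the raw word

print(flatten(ast))  # -> ['f', 'x', 'x']
```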
The structure of the parse result matches exactly what you defined with EBNF.
Here is an example that generates the above AST using ruiko, an EBNF idiom proposed by EBNFParser to extend standard EBNF.
keyword cast as K := 'def';
identifier := R'[a-zA-Z_]{1}[a-zA-Z_0-9]*';
FunctionDef throw ['def'] ::= K'def' identifier '(' ArgList ')' '=' Expression;
Expression ::= ... # omit
ArgList ::= ... # omit
What’s more, EBNFParser supports unlimited left recursion.
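EBNFParser's own left-recursion machinery is not shown here; as a rough intuition only, a left-recursive rule such as `Expr ::= Expr '+' Term | Term` builds a left-leaning tree, which the following hypothetical Python sketch (names `parse_expr` and `parse_term` are made up for illustration) evaluates by parsing one Term and then repeatedly extending to the left:

```python
def parse_expr(tokens, pos, parse_term):
    # Equivalent to the left-recursive rule Expr ::= Expr '+' Term | Term:
    # parse one Term, then fold further Terms in from the left.
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("Add", node, right)  # left-associative nesting
    return node, pos

def parse_term(tokens, pos):
    # Trivial Term parser: consume a single token.
    return tokens[pos], pos + 1

print(parse_expr("1 + 2 + 3".split(), 0, parse_term))
# -> (('Add', ('Add', '1', '2'), '3'), 5)
```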