Use separate Tokenizer/Lexer? #241

Philipp-M · 2023-03-29T20:08:18Z

Hi,

I ran into a few issues when trying to parse the IFC EXPRESS schema.

One is that something like TrueNorth is parsed (e.g. in an expression rule) as a boolean literal, because the literal rule has higher priority than the simple_id rule. Swapping the order of these rules doesn't help, though, because then something like True is no longer parsed as a boolean literal.
So I think there are two options to solve this at its core (I think it's just a sign of further issues that may arise because of ambiguous parsing):

Either every basic parsing rule checks that it does not match another basic parsing rule (e.g. simple_id checks that it isn't a literal or anything else that could also be read as a simple_id), or a separate lexer/tokenizer weeds these cases out beforehand.
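To illustrate the first option, here is a minimal sketch of a literal rule that refuses to match when the keyword is immediately followed by an identifier character. The function names `logical_literal` and `is_id_continue` are hypothetical and not taken from the existing parser, and the keyword set and case-insensitive matching are my assumptions about EXPRESS:

```rust
/// Returns true if `c` may continue an identifier (letter, digit or underscore).
/// Assumption: this mirrors the characters allowed in an EXPRESS simple_id.
fn is_id_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Hypothetical literal rule: matches a logical keyword at the start of `input`,
/// but only if the keyword is not followed by another identifier character.
/// Returns the matched text and the remaining input.
fn logical_literal(input: &str) -> Option<(&str, &str)> {
    for kw in ["TRUE", "FALSE", "UNKNOWN"] {
        // `get` returns None if the input is too short or not on a char boundary.
        if let Some(head) = input.get(..kw.len()) {
            if head.eq_ignore_ascii_case(kw) {
                let rest = &input[kw.len()..];
                // Word-boundary check: reject `TrueNorth`, accept `True` and `TRUE;`.
                if !rest.chars().next().map_or(false, is_id_continue) {
                    return Some((head, rest));
                }
            }
        }
    }
    None
}
```

With this check `TrueNorth` falls through to the simple_id rule, but every keyword-like rule would need the same guard, which is why this approach feels fragile to me.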

I personally prefer using a lexer: it's easier to restrict the problem space and abstract the parser on top of that. I have also had issues with weird parsing ambiguities in the past when not using a separate lexer (in way simpler languages). I think the BNF grammar of STEP and EXPRESS should allow tokenizing/lexing the whole input without having to think about modal lexing etc., but I'm not sure yet.
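As a rough sketch of the lexer approach (not the code from my lexer; `Token` and `lex_word` are hypothetical names, and the keyword set is an assumption): the tokenizer first consumes the longest possible word and only then decides whether it is a keyword or an identifier, so the ambiguity never reaches the parser.

```rust
/// Hypothetical token type covering only the two cases discussed here.
#[derive(Debug, PartialEq)]
enum Token {
    Logical(String),
    SimpleId(String),
}

/// Lexes one word from the start of `input` using longest match,
/// then classifies the whole word. Returns the token and the remaining input.
fn lex_word(input: &str) -> Option<(Token, &str)> {
    // Assumption: a word has to start with an ASCII letter.
    let first = input.chars().next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    // Consume letters, digits and underscores as far as possible.
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    let (word, rest) = input.split_at(end);
    // Classify only after the full word is known, ignoring case.
    let token = match word.to_ascii_uppercase().as_str() {
        "TRUE" | "FALSE" | "UNKNOWN" => Token::Logical(word.to_string()),
        _ => Token::SimpleId(word.to_string()),
    };
    Some((token, rest))
}
```

Here `TrueNorth` becomes a single `SimpleId` token and `True` a `Logical` token, regardless of the order in which the parser later tries its rules.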

I have actually started writing a parser/lexer for the EXPRESS language. I'm not sure yet whether I will take this project much further, though (I guess I underestimated the scope of supporting STEP completely).
My original motivation was better error recovery/messages (by using something like chumsky as the parser combinator library).

I think the lexer is almost complete, so you may be interested in this:
https://github.com/Philipp-M/express-parser/blob/6464b29e5eb14d70b0445b84567ed58fdfd144b6/src/lexer.rs

Philipp-M · 2023-03-29T21:43:59Z

Btw. this may also be helpful in case you want to go with a lexer:

https://github.com/stepcode/stepcode/blob/develop/src/express/expscan.l
