Lexing does not appear to respect the declared order of lexer rules. #164

NigelWSewell · 2023-07-16T14:50:14Z

Description

While writing a JavaDoc extractor, I noticed that the lexing rules do not appear to follow the description in the documentation, which states:

The order in which terminal rules are defined is critical as the lexer will always return the first match.

In the first screenshot, the grammar extracts the correct text into the syntax tree, so the remaining task is to define terminal rules that ignore everything else.

After adding the 'IGNORE' rule, the syntax tree no longer contains the earlier matches; the later 'IGNORE' rule wins instead.

This seems to contradict the documented requirement about the order of terminal rules.

Grammar Used

grammar JavaDocExtractor

entry Model: (docs+=JDoc)*;

terminal JDoc: ('/**' -> '*/');

hidden terminal CR: '\r'+;
hidden terminal LF: '\n'+;
//hidden terminal IGNORE: /.+?/;

Test Input


/** foo 1 */
person John
person Jane

/* foo 2*/


Hello John!
Hello Jane!

/** foo 4*/


msujew · 2023-07-16T15:17:51Z

@NigelWSewell It seems like the documentation skipped over the small detail that we move terminals that can potentially match whitespace characters to the front as a performance optimization. See here.

Note that unlike in Xtext, it's not recommended in Langium to have a catch-all terminal. Langium's underlying lexer implementation (Chevrotain) works quite differently from ANTLR and catch-all terminals will always lead to trouble (even if the order of tokens is correct). A catch-all token will always consume the rest of the input, as even making it non-greedy doesn't work.

Instead, lexer errors are dealt with on a diagnostics level, and unexpected characters are simply omitted from the token stream.
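To make the effect concrete, here is a minimal sketch of a first-match lexer. It is a simplified stand-in model written for this thread, not Chevrotain's or Langium's actual implementation; the `Rule` and `lex` names and the regex versions of the terminals are assumptions made for illustration. It shows two things: without a catch-all, unexpected characters are simply skipped and the JDoc tokens survive; with the catch-all tried first (roughly what the reordering does, since `/.+?/` can match whitespace), it wins at every offset and no JDoc token is ever produced.

```typescript
// Simplified first-match lexer: a stand-in model, NOT Chevrotain's implementation.
interface Rule { name: string; pattern: RegExp; hidden?: boolean }
interface Token { type: string; text: string }

// At each offset, try the rules in list order; the first rule that matches wins.
// Characters that no rule matches are skipped and counted.
function lex(input: string, rules: Rule[]): { tokens: Token[]; skipped: number } {
    const tokens: Token[] = [];
    let skipped = 0;
    let offset = 0;
    while (offset < input.length) {
        let matched = false;
        for (const rule of rules) {
            const re = new RegExp(rule.pattern.source, 'y'); // sticky: match only at offset
            re.lastIndex = offset;
            const m = re.exec(input);
            if (m && m[0].length > 0) {
                if (!rule.hidden) tokens.push({ type: rule.name, text: m[0] });
                offset += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) { skipped++; offset++; }
    }
    return { tokens, skipped };
}

// Regex equivalents of the terminals in the grammar above (assumed translations).
const JDoc: Rule = { name: 'JDoc', pattern: /\/\*\*[\s\S]*?\*\// };
const CR: Rule = { name: 'CR', pattern: /\r+/, hidden: true };
const LF: Rule = { name: 'LF', pattern: /\n+/, hidden: true };
const IGNORE: Rule = { name: 'IGNORE', pattern: /.+?/, hidden: true };

const sample = '/** foo 1 */\nperson John\n/* foo 2*/\n/** foo 4*/\n';

// No catch-all: unmatched characters are dropped, both JDoc comments are found.
const withoutCatchAll = lex(sample, [JDoc, CR, LF]);

// Catch-all tried first: it consumes one character at a time, so JDoc never matches.
const catchAllFirst = lex(sample, [IGNORE, LF, CR, JDoc]);

console.log(withoutCatchAll.tokens.map(t => t.text), withoutCatchAll.skipped);
console.log(catchAllFirst.tokens.length);
```

In this model the first call yields the two `/** ... */` comments, while the second yields no visible tokens at all, which matches the behaviour reported in the description.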

NigelWSewell · 2023-07-16T16:49:58Z

@msujew That would explain the behaviour well enough.

Is there a workaround to this? Either:

A way of forcing strict declaration order.
Ignoring other syntax errors.
A complete non-whitespace character set to catch other unwanted text.
Something else I've not thought of.

Either way, I'm sure this is a question/mistake many people coming from ANTLR/Xtext will encounter, so this could be a good opportunity to improve the documentation.

p.s.: Thanks for working on Sunday!

msujew · 2023-07-16T18:22:28Z

Is there a workaround to this?

Not directly in the grammar, though you can override the DefaultTokenBuilder to prevent the behavior. We should probably add a flag to disable the optimization.
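As a rough illustration of the override idea, the sketch below uses hypothetical stand-in classes (`ReorderingTokenBuilder`, `DeclaredOrderTokenBuilder`, `TerminalSpec`); they are not Langium's real classes or signatures. In an actual project you would subclass `DefaultTokenBuilder` and register the subclass in your language module, and the exact method to override depends on the Langium version, so check the source before relying on this.

```typescript
// Hypothetical stand-ins for illustration only; NOT Langium's real API.
interface TerminalSpec { name: string; pattern: RegExp }

// Models the optimization described above: terminals whose pattern can match
// a whitespace character are moved ahead of all other terminals.
class ReorderingTokenBuilder {
    buildTokens(terminals: TerminalSpec[]): TerminalSpec[] {
        const tokens: TerminalSpec[] = [];
        for (const terminal of terminals) {
            if (this.canMatchWhitespace(terminal.pattern)) {
                tokens.unshift(terminal); // whitespace-capable: goes to the front
            } else {
                tokens.push(terminal);    // everything else: stays at the back
            }
        }
        return tokens;
    }

    protected canMatchWhitespace(pattern: RegExp): boolean {
        return [' ', '\t', '\r', '\n'].some(ws => new RegExp(pattern.source).test(ws));
    }
}

// The override: keep the terminals exactly in the order they were declared.
class DeclaredOrderTokenBuilder extends ReorderingTokenBuilder {
    buildTokens(terminals: TerminalSpec[]): TerminalSpec[] {
        return [...terminals];
    }
}

// Terminals in grammar order (regex translations are assumptions).
const declared: TerminalSpec[] = [
    { name: 'JDoc', pattern: /\/\*\*[\s\S]*?\*\// },
    { name: 'CR', pattern: /\r+/ },
    { name: 'LF', pattern: /\n+/ },
    { name: 'IGNORE', pattern: /.+?/ },
];

const reordered = new ReorderingTokenBuilder().buildTokens(declared).map(t => t.name);
const strict = new DeclaredOrderTokenBuilder().buildTokens(declared).map(t => t.name);

console.log(reordered.join(', '));
console.log(strict.join(', '));
```

Note that this only restores declaration order; it does not address the separate warning above that catch-all terminals are not recommended.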

Either way I'm sure this is a question/mistake many people from ANTLR/XText will encounter so this can be a good opportunity to improve the documentation.

I assume so as well. We should probably mention that in the docs.

msujew added the recipe and documentation labels and removed the recipe label · Jul 16, 2023
