General strategy for handling extras #3196
Unanswered
michaelfortunato asked this question in Ideas
Replies: 0 comments
Hi,
What is a good way of handling a language which in general is newline insensitive, but in certain rules is not?
Disclaimer
I have looked at #884, and I know that my question is basically a repeat of #1421. I also know that #931 has been discussed.
I am posting because I was still hoping to get some feedback on my particular language, if possible, and to continue the extras conversation in general.
Background/Language Spec
I am working on a tree-sitter grammar for a language in which
if 1 let x = 1 let x = 2
gets parsed as (if-stmt (let-stmt) (let-stmt))
as opposed to (if-stmt (let-stmt)) (let-stmt),
and gets parsed as
(if-stmt (let-stmt)) (let-stmt)
Examples
Here are some examples of strings my language accepts and rejects
(let-stmt) (let-stmt)
:(if-stmt (let-stmt) (let-stmt)) (let-stmt)
:if-stmt (let-stmt)
(this is what I call a "block-type" if-statement, as opposed to a single-line one)
What is a "good way" to define such an if statement?
Solutions
I have come up with two solutions so far.
1: Using an external scanner
The idea here is to use a zero-width external token to start the single-line if statement, which tells the scanner struct that we are in "single-line mode", and then to terminate the rule with a zero-width single-line-end token that toggles the scanner back to its original state.
Then, in my scanner.c, my void * payload would be a struct Scanner { bool is_newline_sensitive; };
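For illustration, here is a minimal sketch of how such a payload could be created, serialized, and restored. It is not the code from this post. The language name mylang is a placeholder, and the scan function is left out because it depends on the TSLexer type from tree_sitter/parser.h. As far as I understand, tree-sitter calls deserialize with a length of 0 when there is no stored state, so the sketch resets the flag before reading the buffer.

```c
#include <stdbool.h>
#include <stdlib.h>

// Scanner state as described in the post: a single mode flag.
typedef struct {
    bool is_newline_sensitive;
} Scanner;

// "mylang" is a hypothetical language name; tree-sitter expects these
// entry points to be named tree_sitter_<language>_external_scanner_*.
void *tree_sitter_mylang_external_scanner_create(void) {
    // calloc zero-initializes, so the scanner starts newline-insensitive.
    return calloc(1, sizeof(Scanner));
}

void tree_sitter_mylang_external_scanner_destroy(void *payload) {
    free(payload);
}

// Write the state into the buffer and return the number of bytes used.
unsigned tree_sitter_mylang_external_scanner_serialize(void *payload,
                                                       char *buffer) {
    Scanner *scanner = (Scanner *)payload;
    buffer[0] = (char)scanner->is_newline_sensitive;
    return 1;
}

// Restore the state from a buffer previously filled by serialize.
void tree_sitter_mylang_external_scanner_deserialize(void *payload,
                                                     const char *buffer,
                                                     unsigned length) {
    Scanner *scanner = (Scanner *)payload;
    // Reset to the default first, so an empty buffer means "not sensitive".
    scanner->is_newline_sensitive = false;
    if (length > 0) {
        scanner->is_newline_sensitive = buffer[0] != 0;
    }
}
```

Resetting before reading keeps the restored state a pure function of the serialized bytes, which is one way to avoid stale state leaking across re-parses.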
Advantages
One of the main advantages, if this approach can work at all (I have yet to get it working, and will say more about that below), is that it allows me to keep \n in the extras rule. The other advantage, which I would argue is more attractive, is that I could potentially reuse this construct in other parts of my language which are whitespace insensitive.
Disadvantages
First off, I have not been able to get this to work completely. I can attach my full code if people become interested in this post, but I figured I would ask for feedback first. Another disadvantage, which is more about my skill level, is that I do not know how to handle error recovery gracefully in my external scanner, and I find that a stateful scanner gets hard to maintain. On that second point: my understanding is that tree-sitter serializes the state of the scanner inside each token that is recognized, which makes sense, but it is hard for me to keep track of when, or whether, I need to reset my state after an error. For example, say tree-sitter has the following sequence of tokens:
if ... single_line_sensitive_mode_start ... single_line_sensitive_mode_end
I understand that the state of my scanner is is_newline_sensitive = true in the token range [single_line_sensitive_mode_start, single_line_sensitive_mode_end]. But what would happen if I deleted the single_line_sensitive_mode_start token from the token sequence? Then the scanner would be is_newline_sensitive = true from the first token in the consequence statement to $.single_line_sensitive_mode_end. It is unclear to me what the state of my scanner should be in this case. I also know that this is a general problem. As such, I try to avoid state in my external scanner in general.
2: Remove \n from extras and handle it manually everywhere in my grammar.js
This approach is easier for me to reason about but becomes harder to maintain as more statements get added.
The idea is that in my grammar.js, we would instead have:
What I would do in this case is have a function which takes in the no_newline_stmt rule, traverses it recursively, and inserts MAYBE_NEWLINE tokens between all terminals.
In fact, I have done this, and it wasn't too bad, but it still adds a lot of new rules to the parse table.
Update on the above: It seems like masaeedu, #931 (comment), also had this generator idea!
Thanks and thanks for the wonderful software!
Michael