external scanner: support mark_begin() instead of advance(skip=true) #2985

michael-db · 2024-02-09T18:38:52Z

Problem

I'm developing a small language where it is not known whether whitespace should be skipped or form part of a token until what is following it is consumed: if it is trailing whitespace on a line then it should be skipped, but in other cases it can be part of a token with internal whitespace. Now I can imagine that this is very unusual and I might even be the only one ever to ask this, but before I decide whether to use tree-sitter, could it be a problem?

I'm provisionally assuming that I will work around this by putting a token for insignificant separator-only whitespace tokens in the grammar (though I don't really want to) and so the fix for this bug report is just to mention that in the docs https://tree-sitter.github.io/tree-sitter/creating-parsers#external-scanners

It would be nice if the "skip" function was separated from the "advance" mechanism (advance, advance, advance, then decide to skip what's been consumed), but of course I don't really expect the API to be changed for this.

Excuse me if I've missed or misunderstood something.

Steps to reproduce

None, doc issue/question.

Expected behavior

Support skip_consumed_chars(). No, OK, doc update then.

Tree-sitter version (tree-sitter --version)

0.20.9

Operating system/version

Linux

michael-db · 2024-02-10T10:50:55Z

From #2315, I'm not the only one affected insofar as that "skip" parameter and its description is misleading. Having read that and looked at the code to confirm that "skip" doesn't really equate to "skip this character", but rather "exclude this and all preceding characters from the token", the current API is less limited than it appears to be.

Having said that, I think the best way to fix that confusion would be to add a function to the API that complements the "mark_end" function and eliminate that misleading notion of "skipping" individual characters.

This is how it is:

void (*advance)(TSLexer *, bool skip)
    A function for advancing to the next character.
    If you pass true for the second argument, the current character
    will be treated as whitespace; whitespace won’t be included in the
    text range associated with tokens emitted by the external scanner.
void (*mark_end)(TSLexer *)
    ...

Keeping that boolean argument only for compatibility, this is how it could be:

void (*advance)(TSLexer *, bool mark_begin)
    Advance to the next character.
    Passing true for the second argument also causes mark_begin() to
    be called, i.e., it is equivalent to
    'advance(lexer, false); mark_begin(lexer);'
void (*mark_begin)(TSLexer *)
    Indicate that the first character of the token will be the lookahead
    character, i.e., the next character to be consumed by advance().
    Normally used to discard whitespace characters which precede a token.
    If this is not called then the token begins with the first character
    to be consumed.
void (*mark_end)(TSLexer *)
    ...

Moreover, that API would make it possible to check the lookahead codepoint before deciding whether to call mark_begin() on the codepoint just advanced over, which I would like to be able to do.

maxbrunsfeld · 2024-02-25T21:45:11Z

Yes, I agree 100% @michael-db. This would be a better API, and ideally, we should not have a skip parameter. The skip thing is something that existed before mark_end was added.

I'm not sure when and how we should roll out this change, but maybe we should do it before 1.0.

maxbrunsfeld · 2024-02-25T21:57:11Z

Actually, it may be possible to do this in a backward-compatible manner. Here's what I think we could do.

Change the external scanner API in the way that you've described
- Remove the bool skip parameter from the advance function on TSLexer. Change that function's behavior so that nothing is skipped by default.
- Add a mark_start function that updates the token start position to the lexer's current position.
- Update the generated code to use mark_start. Because native calls have a small cost in the case of WASM builds, we'll have ts_lex call mark_start at most one time. We can use a local bool variable to recall if it's been called yet.
Bump the language ABI version.
Maintain compatibility with previous ABI versions by conditionally assigning the advance function. We'll keep a shim for the old two-parameter advance function, and use that shim when loading parsers that use the previous ABI version.

/cc @amaanq - just wanted to get your thoughts on this, since I think it would be a good pre-1.0 change.

@michael-db are you interested in working on this?

michael-db · 2024-02-26T16:07:32Z

Thank you for looking into this, @maxbrunsfeld. I had read only a little of the code - not enough to understand what would be involved in changing the API in a safe way.

I was researching Tree Sitter in anticipation of using it. That might be a few weeks away, and then I'll have some learning to do, e.g., I haven't worked with WASM before.

However, I'd be happy to contribute to a clever and useful tool if I could. How soon would you want a pull request? If it suited, I could take this on provisionally and warn you if I'm unlikely to be ready to start on it within, say, a month.

michael-db added the bug label Feb 9, 2024

michael-db changed the title ~~what to do when skip is unknown on calling advance()~~ external scanner: support mark_begin() instead of advance(skip=true) Feb 12, 2024

ObserverOfTime added enhancement Feature request c-library external-scanner Related to external token scanners and removed bug labels Apr 12, 2024

ObserverOfTime added this to the 1.0 milestone Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

external scanner: support mark_begin() instead of advance(skip=true) #2985

external scanner: support mark_begin() instead of advance(skip=true) #2985

michael-db commented Feb 9, 2024

michael-db commented Feb 10, 2024 •

edited

maxbrunsfeld commented Feb 25, 2024 •

edited

maxbrunsfeld commented Feb 25, 2024

michael-db commented Feb 26, 2024

external scanner: support mark_begin() instead of advance(skip=true) #2985

external scanner: support mark_begin() instead of advance(skip=true) #2985

Comments

michael-db commented Feb 9, 2024

Problem

Steps to reproduce

Expected behavior

Tree-sitter version (tree-sitter --version)

Operating system/version

michael-db commented Feb 10, 2024 • edited

maxbrunsfeld commented Feb 25, 2024 • edited

maxbrunsfeld commented Feb 25, 2024

michael-db commented Feb 26, 2024

michael-db commented Feb 10, 2024 •

edited

maxbrunsfeld commented Feb 25, 2024 •

edited