Req: option for watching token streams and candidate targets #1320

Open
davidmcnabnz opened this issue Aug 14, 2023 · 4 comments
Labels
add to docs (TODO: Include this information in the docs) · enhancement

Comments

@davidmcnabnz
Sponsor

Suggestion
Requesting constructor keyword options to allow logging of the lexer's token stream and, if feasible, of the candidate target fulfilments in the current context.

Describe alternatives you've considered
The PyCharm debugger has sophisticated breakpoint options, including the ability to set a breakpoint to:

  1. Stay dormant until a specific other breakpoint is reached, then become active
  2. Execute an arbitrary Python statement each time the breakpoint is reached (in my case, a print statement for the token)

Additional context
Printing the token stream, via the above IDE debugger breakpoint technique, has been a huge support in my current project.

(FYI, this requires carefully retro-implementing a parser for an archaic, convoluted and very non-standard programming/configuration language from the 1980s, whose parser was originally implemented in hand-crafted C, incrementally coded/patched/extended in a silo over the decades, and with no formal grammar specification, not even YACC. Getting its various cryptic nuances to parse and correctly feed into my transformer is a massively challenging undertaking, but I'm getting there.)

I would really like to be able to watch or log the Lark parser's token stream without relying on the IDE. Even a constructor option that accepted an open writable file object, a logger object, and/or the pathname of a file to write to would be very helpful.

In a perfect world, each token fetched and logged would also show the line/column numbers in the input at which it was matched.

I acknowledge that logging parser state would be a much harder venture, especially doing it in a readable manner. So even just token-stream logging would be quite a boost.

@erezsh
Member

erezsh commented Aug 14, 2023

For LALR, this is very easy to do using the Lark.parse_interactive() method and then calling iter_parse() on the result.

See this recipe for an example: https://lark-parser.readthedocs.io/en/latest/recipes.html#adding-a-progress-bar-to-parsing-with-tqdm

I'm not sure this is relevant for Earley, since it matches and considers many different tokens that are eventually thrown away; i.e., it's not exactly a stream of tokens.
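
For illustration, here is a minimal sketch of that approach (the grammar and input are placeholders, not from this thread; retrieving the final tree from the interactive parser is discussed further down):

    from lark import Lark

    # toy grammar purely for illustration; parser="lalr" is required
    # for parse_interactive()
    parser = Lark(r"""
        start: WORD+
        %import common.WORD
        %ignore " "
    """, parser="lalr")

    pi = parser.parse_interactive("hello world")
    for token in pi.iter_parse():
        # each Token carries its own line/column position in the input
        print(f"{token.line}:{token.column} {token.type} = {token.value!r}")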

@davidmcnabnz
Sponsor Author

@erezsh thanks for that. I have been meaning to check out .parse_interactive(), but I've been a bit too tight-looped in my current project (well, that's the official excuse anyway ;)).

I'll make a point of trying it out today. I'm guessing it will have big payoffs when I hit a lot more of the ancient syntax's nitty-gritties.

@davidmcnabnz
Sponsor Author

I just tried it out, but noticed that .result wasn't getting assigned the parse tree. After some stepping through Lark internals, I saw the parser wasn't seeing EOF. I sorted it by manually calling .feed_eof() after the iterator loop finishes.

Sample code is below, and has got me exactly where I need to be :) :

    def parseInteractive(self, raw, *args, **kw):
        pi = self.parser.parse_interactive(raw, *args, **kw)
        for token in pi.iter_parse():
            if self.debug:
                ctr = pi.lexer_state.state.line_ctr
                line = ctr.line
                column = ctr.column
                tokType = token.type.split('__')[-1]
                tokVal = token.value
                print(f"TOKEN:{line}:{column}:{tokType}={repr(tokVal)}")

        # apparently the interactive parser never sees $END, so we have to
        # feed it in explicitly
        pi.feed_eof()

        # now we can harvest the transformed tree
        result = pi.result
        return result

As a takeaway, there might be merit in adding a couple of properties to the InteractiveParser object to allow cleaner, more future-proof access to line, column and character position values, covering the case where access paths like .lexer_state.state.line_ctr.line get broken by future updates.

But for now, I'm delighted to have so much transparency in the parser's activity. Thanks again!

@erezsh
Member

erezsh commented Aug 15, 2023

It looks like the recipe isn't entirely correct. To get the result, you should call pi.resume_parse(). See example here: https://github.com/geographika/mappyfile/blob/master/mappyfile/parser.py#L218

As for the line and column numbers, why don't you just take them from the token?
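
Putting those two corrections together, the earlier sample might be reworked roughly like this (a sketch only; self.parser, self.debug and the method name are carried over from the snippet above):

    def parseInteractive(self, raw, *args, **kw):
        pi = self.parser.parse_interactive(raw, *args, **kw)
        for token in pi.iter_parse():
            if self.debug:
                # tokens carry their own position metadata, so there is
                # no need to reach into pi.lexer_state internals
                tokType = token.type.split('__')[-1]
                print(f"TOKEN:{token.line}:{token.column}:{tokType}={token.value!r}")

        # resume_parse() runs the parse to completion (feeding the
        # end-of-input itself) and returns the resulting tree
        return pi.resume_parse()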

erezsh added the add to docs label on Aug 15, 2023