Custom Lexer allowing for parsing non-standard spreadsheets from xlsx, csv, etc. #1298

Open
rooterkyberian opened this issue Jul 13, 2023 · 1 comment

@rooterkyberian

Suggestion

I'm dealing with various spreadsheets in different formats: text-based (csv, tsv) but also binary (xls, xlsx).

They can look like this:

"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "a10000", "b10000"
...
10000, "a10000", "b10000"

So lark can handle csv fine (albeit with a performance hit compared with the csv module), but xlsx is a no-go.

Now the idea is to do the initial tokenization with the csv or xlrd module, which I guess would mean writing a custom Lexer.
Right now Lexers are not really advertised as something users can replace, i.e. I think I can do it, but it seems like the interface could break at any time. For example, I don't think there is support for a custom LexerState.

So my question is: is implementing a custom Lexer to solve this issue "supported", or should I expect it to break with any upgrade, with the lark development team having no plans to "stabilize" the lexer API for such a use case?

Describe alternatives you've considered
My alternative is building my own finite-state machine (basically a parser) on top of the rows produced by the csv/xlrd libraries.
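
A rough sketch of that alternative, using only the standard library. The state names, the `parse_sheet` function and the assumed layout (two-cell metadata rows, one blank row, a title row, a header row, then records) are illustrative guesses based on the sample above, not a general solution.

```python
import csv
import io
from enum import Enum, auto


class State(Enum):
    METADATA = auto()  # "key:", "value" rows before the blank row
    TITLE = auto()     # single-cell row naming the table
    HEADER = auto()    # column names
    RECORDS = auto()   # data rows until the end of the input


def parse_sheet(rows):
    """Walks rows (lists of cell strings) through a small state machine."""
    state = State.METADATA
    sheet = {"metadata": {}, "title": None, "header": None, "records": []}
    for row in rows:
        if state is State.METADATA:
            if not any(row):  # a blank row ends the metadata block
                state = State.TITLE
            else:
                sheet["metadata"][row[0].rstrip(":")] = row[1]
        elif state is State.TITLE:
            sheet["title"] = row[0]
            state = State.HEADER
        elif state is State.HEADER:
            sheet["header"] = row
            state = State.RECORDS
        else:
            sheet["records"].append(dict(zip(sheet["header"], row)))
    return sheet


SAMPLE = '''"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "c", "d"
'''

# skipinitialspace lets csv treat the quote after ", " as a real quote char
sheet = parse_sheet(csv.reader(io.StringIO(SAMPLE), skipinitialspace=True))
print(sheet["title"], len(sheet["records"]))
```

Because `parse_sheet` only needs an iterable of rows, rows read from an xls/xlsx file could be fed to it in the same way once they are converted to lists of strings.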


@MegaIng
Member

MegaIng commented Jul 13, 2023

XML-based formats are not context-free. The ability to parse them using lark, which primarily supports CFGs, is and will always be quite limited. It's possible to account for some context-sensitive conditions, like Python-style indentation. To a limited degree, this also applies to XML. What you can try is to use a Postlexer. That interface is guaranteed to be stable.

The interface for the lexer is also quite stable. It hasn't seen much development because there haven't been many use cases. If your problem can't be solved with a Postlexer, but could be solved with an improved Lexer interface (solved, not just maybe possible), I am sure we could consider updating it. But I doubt you will manage to coerce it into a working solution.
