Custom Lexer allowing for parsing non-standard spreadsheets from xlsx, csv etc #1298

rooterkyberian · 2023-07-13T11:11:59Z

Suggestion

I'm dealing with various spreadsheets in different formats: text-based (csv, tsv) but also binary (xls, xlsx).

They can look like this:

"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "a10000", "b10000"
...
10000, "a10000", "b10000"

So lark can handle csv fine (albeit with a performance hit compared with the csv module), but xlsx files are a no-go.

Now the idea is to do the initial tokenization with the csv or xlrd module, which I guess would mean writing a custom Lexer.
Right now Lexers are not really advertised as something user-replaceable, i.e. I think I can do it, but it seems like the interface can break at any time. For example, I don't think there is support for a custom LexerState.

So my question is: is implementing custom Lexers to solve this issue "supported", or should I expect it to break with any upgrade, with the lark development team not planning to "stabilize" the lexer API for such a use case?

Describe alternatives you've considered
My alternative is to build my own finite-state machine (basically a parser) on top of the parsing done by the csv/xlrd libraries.
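A rough sketch of what that alternative might look like, using only the standard library. The state names, the `parse_rows` function and the shape of the returned dict are my own assumptions for this example, not anything lark provides. The function takes any iterable of rows (lists of cell values), so the rows could come from `csv.reader` or from a spreadsheet reader.

```python
import csv
import io
from enum import Enum, auto


class State(Enum):
    METADATA = auto()  # "key:", "value" pairs at the top of the sheet
    TITLE = auto()     # single-cell table title
    HEADER = auto()    # column names
    ROWS = auto()      # data rows


def parse_rows(rows):
    """Walk rows through a small state machine and collect the sheet."""
    state = State.METADATA
    sheet = {"metadata": {}, "title": None, "header": [], "rows": []}
    for row in rows:
        if state is State.METADATA:
            if not row:
                state = State.TITLE  # a blank row ends the metadata block
            elif len(row) == 2:
                sheet["metadata"][row[0].rstrip(":")] = row[1]
            else:
                raise ValueError(f"unexpected metadata row: {row!r}")
        elif not row:
            continue  # ignore blank rows after the metadata block
        elif state is State.TITLE:
            if len(row) != 1:
                raise ValueError(f"expected a one-cell title, got: {row!r}")
            sheet["title"] = row[0]
            state = State.HEADER
        elif state is State.HEADER:
            sheet["header"] = list(row)
            state = State.ROWS
        else:
            if len(row) != len(sheet["header"]):
                raise ValueError(f"row width does not match header: {row!r}")
            sheet["rows"].append(dict(zip(sheet["header"], row)))
    return sheet


SAMPLE = '''"Name:", "Just some spreadsheet"
"Additional metadata:", "some more information about this sheet"

"Table"
"no", "value1", "value2"
1, "a", "b"
2, "c", "d"
'''

sheet = parse_rows(csv.reader(io.StringIO(SAMPLE), skipinitialspace=True))
```

This works, but every new sheet layout means another hand-written state machine, which is what I was hoping a grammar would spare me.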


MegaIng · 2023-07-13T11:17:16Z

XML-based formats are not context-free. The ability to parse them using lark, which primarily supports CFGs, is and will always be quite limited. It's possible to account for some context-sensitive conditions, like Python-style indentation. To a limited degree, this also applies to XML. What you can try is to use a Postlexer. That interface is guaranteed to be stable.

The interface for the lexer is also quite stable. It hasn't seen much development because there haven't been many use cases. If your problem can't be solved with a Postlexer, but could be solved with an improved Lexer interface (solved, not just maybe possible), I am sure we could consider updating it. But I doubt you will manage to coerce it into a working solution.

rooterkyberian added the enhancement label Jul 13, 2023
