
stricter lark grammar for notebook documents #143

Open · wants to merge 11 commits into base: main
Conversation

tonyfast
Member

this pull request modifies the json parser to use explicit properties from the notebook format to constrain the parser.

the top-level notebook keys and cell keys are known ahead of time because of the schema that notebooks conform to. these changes hard code the notebook and cell properties into the parser.
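not the grammar itself, but a minimal python sketch of the constraint being described, assuming illustrative key sets (the real ones would come from the nbformat schema's top-level and cell properties):

```python
import json

# Illustrative key sets; in the actual grammar these would be hard coded
# from the nbformat JSON schema's notebook and cell properties.
NOTEBOOK_KEYS = {"cells", "metadata", "nbformat", "nbformat_minor"}
CELL_KEYS = {"cell_type", "source", "metadata", "id", "outputs",
             "execution_count", "attachments"}

def check_strict(nb: dict) -> None:
    """Reject notebooks with keys the strict grammar does not know about."""
    extra = set(nb) - NOTEBOOK_KEYS
    if extra:
        raise ValueError(f"unexpected notebook keys: {sorted(extra)}")
    for i, cell in enumerate(nb.get("cells", [])):
        extra = set(cell) - CELL_KEYS
        if extra:
            raise ValueError(f"cell {i} has unexpected keys: {sorted(extra)}")

nb = json.loads('{"cells": [{"cell_type": "code", "source": []}],'
                ' "metadata": {}, "nbformat": 4, "nbformat_minor": 5}')
check_strict(nb)  # conforming notebooks pass
```

in the grammar itself the same effect comes from enumerating the allowed keys as terminals instead of accepting any string key.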

this change will NOT work for arbitrary data formats, only for documents conforming to subsets of the specified schema. otherwise the parser raises errors.

this change streamlines the transformation logic.

@tonyfast
Member Author

tonyfast commented Nov 8, 2023

i'm worried about merging this one when it does land. i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.

@tonyfast
Member Author

tonyfast commented Nov 8, 2023

this parser is super strict at the notebook and cell levels. after that the parser is free to parse, but it will only pick up the source parts at the moment.

i'd like to harden it up with more tests. some questions i have are:

  • are there more tests i could add to check more cases or provide assurance?
  • does introducing hypothesis change the tune when considering parallelism in the tests?

@bollwyvl
Contributor

bollwyvl commented Nov 8, 2023

i'm worried about merging this one when it does land.

If not architecturally impossible, a common pattern when making something stricter would be to build, test, and ship both parsers, and initially make the strict one opt-in, then the default, then, maybe, deprecate the old one.

The parsers could be registered with entry_points, which would also allow a downstream to ship their own parser, prebuilt or otherwise. Having lark in the loop would be very interesting for the interactive case.

In addition to clear docs on how that works in e.g. README.md (and probably a section in CHANGELOG.md), the API might then look like:

with Notebook():
    import a_somehow_bad_notebook
    # UserWarning: 
    #    a_somehow_bad_notebook failed strict parsing, on line {line}, column {column}
    #
    #        {lark error}
    # 
    #    The loose parser was used, but {features} will not be available. Silence this warning with:
    #    
    #        with Notebook(parser='loose'):
    #
    #    see https://github.com/deathbeds/importnb/issues/{number}

Then offer:

with Notebook(parser="loose"):
    import a_somehow_bad_notebook

And allow opting-in to hard fails:

with Notebook(parser="strict"):
    import a_somehow_bad_notebook
    # ImportnbParseError: a_somehow_bad_notebook.ipynb failed to parse on line {line}, column {column}
    # {lark error}

Where ImportnbParseError is a subclass of ImportError (or something).
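That error type could be as small as the following sketch; the constructor signature is an assumption, not an importnb API:

```python
class ImportnbParseError(ImportError):
    """Raised when a notebook fails strict parsing.

    Subclassing ImportError keeps existing `except ImportError:`
    fallbacks in user code working unchanged.
    """

    def __init__(self, name, line, column, cause):
        super().__init__(
            f"{name} failed to parse on line {line}, column {column}\n{cause}"
        )
        self.line = line
        self.column = column
```

Because it is an ImportError, import machinery and user code that already guards imports degrade gracefully while still exposing the lark diagnostics.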

i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.

hypothesis is always a win.

does introducing hypothesis change the tune when considering parallelism in the tests?

Yes, hypothesis-based suites can benefit from parallelism with pytest-xdist, if run together with other suites.
In CI, I'd probably treat it as a separate set of tests, either in a different folder or behind markers, which wouldn't even run if the fast tests fail, likely only as part of the fastest job (e.g. only on linux/3.12), and also enable a --cov-fail-under threshold.

But the above is irrelevant until the existing tests all run cleanly, ideally without doing anything to disk, sys.path tricks, etc.
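A sketch of the marker-based split described above, as a pyproject.toml fragment; the marker name and coverage threshold are illustrative, and `--cov-fail-under` assumes pytest-cov is installed:

```toml
# pyproject.toml (sketch): keep hypothesis tests behind a marker so
# `pytest -m "not hypothesis_suite"` runs the fast suite first, and
# `pytest -m hypothesis_suite` only runs when the fast suite passes.
[tool.pytest.ini_options]
markers = [
    "hypothesis_suite: slow schema-driven property tests",
]
addopts = "--cov-fail-under=90"
```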

@tonyfast
Member Author

tonyfast commented Nov 9, 2023

yea i love this idea. i think i can manage it in this PR. we'll have to build two different standalone parsers to ship for a few months then sunset the more permissive one.

ultimately, i think this should work out because both the notebook and cell levels define additionalProperties: False. as long as all the properties are captured in the grammar we should be good. there is a case where folks synthetically generate notebook-like structures or introduce future keys like $schema or @graph/@context. in these cases, we'd need a permissive flag turned on if and only if this synthetic data is stored on disk. in theory, the stricter grammar could host both a strict and permissive version. in the permissive version we allow for additionalProperties by supporting generic keys.
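a sketch of one checker hosting both modes, mirroring a strict grammar plus a permissive variant with generic keys; the key set and flag name are assumptions:

```python
# One checker, two modes: the strict mode enforces additionalProperties:
# False, the permissive mode tolerates generic extra keys such as
# "$schema" on synthetically generated notebook-like data.
NOTEBOOK_KEYS = {"cells", "metadata", "nbformat", "nbformat_minor"}

def validate_keys(nb: dict, permissive: bool = False) -> None:
    extra = set(nb) - NOTEBOOK_KEYS
    if extra and not permissive:
        raise ValueError(f"additionalProperties violated: {sorted(extra)}")

synthetic = {"cells": [], "metadata": {}, "nbformat": 4,
             "nbformat_minor": 5, "$schema": "https://example.invalid"}
validate_keys(synthetic, permissive=True)  # tolerated in permissive mode
```

in the grammar the same split would be a rule that matches the enumerated keys, plus an alternative generic-key rule enabled only in the permissive build.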

@bollwyvl
Contributor

bollwyvl commented Nov 9, 2023

two different standalone parsers to ship

Yeah, this doesn't seem like a huge hit to avoid breaking things, and getting new features.

months then sunset the more permissive one.

Right, as long as there is clear messaging from the as-installed software that this is going to happen, and a timeline/trigger (e.g. first release that supports 3.13) for when. The ad-hoc CalVer scheme can make it hard to plan for these things in any way other than after this day... but... can't go back now without a package rename or creating huge problems downstream.

generate notebook like structure or

Sure, as long as we're not losing features for the 99% use case of nbformat-compliant structures. And further: by formalizing and shipping the .g files, it would be possible to make extensible grammars in interesting ways.

So if some use case requires going wild, it could subclass the machinery (even interactively), by overloading the right level of the grammar/decoder/parser/transformer/whatever (these need some .mmd to explain them, perhaps) and passing that instead of an entry_point name, without having to mess about with entry_points or the standalone tooling generator.

If the anticipated needs are very predictable, a more constrained API like Notebook(additional_properties={"$schema": "object"}), etc. could make this feasible for a few simple cases, provided importnb[lark] is installed.

A more formalized system could package and register it by normal means, which should be kept if added.

introduce future key

Changes like that happen slowly, but they're not worth losing features over. We could add an nbformat canary build excursion that installs from upstream's main.
