feat: add custom cell magic parser #213

plamut · 2020-08-04T14:02:13Z

Fixes #166.
Fixes #108.

This PR is to preview the custom cell magic line parser.

PR checklist

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

plamut · 2020-08-04T14:31:10Z

@shollyman I would like to hear your thoughts on this.

The current cell magic line parser is not aware of Python structures (dicts, lists ...) and can thus incorrectly parse the --params option (example issues: #166 and #108). The default parser breaks these structures on whitespace, and while we stitch these parts back together and operate on the result, this only happens after the parsing has already taken place and bugs have sneaked in.

I took a stab at it and came up with a POC custom parser that successfully parses non-trivial input lines such as the following:

%%bigquery    target_var --use_legacy_sql --project my.project.name --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal',} --verbose

The following test snippet demonstrates that the parser correctly recognizes the options:

from google.cloud.bigquery.ipython_magics import line_arg_parser as lap

example_line = (
    "  target_var "
    "--use_legacy_sql "
    "--project my.project.name "
    "--params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "
    "\"answer\": -42, 'array': ['aaa bbb', [\"nested\", 'list'], 'ccc ddd'], "
    "'--params': 'tricky --inner-option string literal',} "
    "--verbose  "
)

lexer = lap.lexer.Lexer(example_line)
parser = lap.parser.Parser(lexer)
parse_tree = parser.input_line()

printer = lap.visitors.TreePrinter()
printer.visit(parse_tree)

Output:

Input:
    destination_var:    target_var
    Command options:
        --use_legacy_sql 
        --project my.project.name
        --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal'}
        --verbose

The custom parser could be used to extract out the --params <...> part, evaluate it separately, and only feed the remainder of the input to the default parser.
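The extract-and-evaluate idea can be illustrated with a minimal standalone sketch. It is not the lexer/parser from this PR: `split_params` is a hypothetical helper that assumes the `--params` value is a bracketed Python literal and that the first occurrence of `--params` in the line is the option itself (not text inside a string). It uses only `ast.literal_eval` and `shlex.split` from the standard library.

```python
import ast
import shlex

# Maps each opening bracket to the closer that must eventually match it.
_CLOSER_FOR = {"{": "}", "[": "]", "(": ")"}


def split_params(line):
    """Return (params, rest): the evaluated --params value and the other text.

    Hypothetical helper for illustration; assumes a bracketed literal value.
    """
    start = line.find("--params")
    if start == -1:
        return None, line

    pos = start + len("--params")
    while pos < len(line) and line[pos].isspace():
        pos += 1
    value_start = pos

    stack = []    # closers we still expect to see
    quote = None  # the quote character of the string we are inside, if any
    while pos < len(line):
        char = line[pos]
        if quote:
            if char == "\\":
                pos += 1  # skip the escaped character
            elif char == quote:
                quote = None
        elif char in "'\"":
            quote = char
        elif char in _CLOSER_FOR:
            stack.append(_CLOSER_FOR[char])
        elif stack and char == stack[-1]:
            stack.pop()
            if not stack:
                pos += 1  # include the final closing bracket
                break
        pos += 1
    else:
        raise ValueError("--params value is missing or unbalanced")

    params = ast.literal_eval(line[value_start:pos])
    rest = line[:start] + line[pos:]
    return params, rest
```

The remainder can then be handed to a whitespace-based splitter, for example `shlex.split(rest)`, without the dictionary getting in the way.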

Do you think we should pursue this further?

The grammar used is of course a simplification that aims to be "good enough" to reliably extract the --params part in typical use cases, but it does not try to be bullet-proof. For instance, it currently does not support non-string dict keys, sets, floats in exponential notation, etc., although all of that can be added depending on how much we want to complicate things. :)

cc: @tswast Do customers complain about the default parser's limitations often? Do you think there even exist reasonable use cases where the --params dictionary would be incredibly complex?

tswast · 2020-08-04T15:53:57Z

Do customers complain about the default parser's limitations often?

I wouldn't say often, but it has come up more than once.

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py

cguardia

This looks pretty good. Thanks.

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py

The parser should accept as wide a range of values as possible and let the code that deals with the semantics decide whether the values are good or not.

The --params option spec must be followed by a non-alphanumeric character, otherwise it's a different option spec (e.g. --paramsX).
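One way to express this rule is a negative lookahead. The sketch below is an assumption for illustration, not the pattern actually used in lexer.py; `PARAMS_OPT` and `starts_with_params_option` are hypothetical names.

```python
import re

# "--params" only counts as the params option when the next character is not
# alphanumeric (or when the input ends right after it).
PARAMS_OPT = re.compile(r"--params(?![A-Za-z0-9])")


def starts_with_params_option(text):
    """Return True if text begins with the --params option spec."""
    return PARAMS_OPT.match(text) is not None
```

With this pattern `--paramsX` falls through to whatever rule handles generic option specs, while `--params {...}` and a bare `--params` at the end of the line are both recognized.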

cguardia

I think this is looking good.

tswast

This is awesome! It does require a bit of refactoring, though. We need to keep the google.cloud.bigquery.magics module/package name because of google.cloud.bigquery.magics.context.

I'd also like to see if we can lean on pyparsing for a lot of the lexing / parsing work. I worry that those modules are going to be particularly challenging to maintain without either "literate programming" level comments or relying on common modules to make the parser easier to understand.

docs/magics.rst

google/cloud/bigquery/ipython_magics/__init__.py

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py

The context still needs to be importable from the old path

tswast · 2020-09-08T15:49:49Z

google/cloud/bigquery/magics/line_arg_parser/lexer.py

+Token = namedtuple("Token", ("type_", "lexeme", "pos"))
+StateTransition = namedtuple("StateTransition", ("new_state", "total_offset"))
+
+# Pattern matching is done with regexes, and the order in which the token patterns are


Thanks so much for this explanation and the clearer state names!

Fun fact: anything that can be parsed with a regular expression can also be parsed by a finite state machine and vice versa. It's called a regular language. That means we might have been able to convert this into one giant regular expression, but I like having these explicit states better.

I recall that lookahead and backtracking are not part of regular expressions in the computer science sense, so I don't think this is technically a "regular" language, though.

plamut

Indeed, regular expressions correspond to finite automata, although the regex implementation is a bit more powerful than that (AFAIK), as it supports things like lookahead and lookbehind.

The language itself recognized by the parser is not regular, true, as the grammar allows for recursive patterns such as nested dicts. A proof of non-regularity is left as an exercise to the reader. 😆

(hint: pumping lemma).
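The intuition behind the non-regularity remark can be shown with a small sketch: checking arbitrarily deep nesting needs memory that grows with the input (a stack here), which a finite automaton does not have. `is_balanced` is a hypothetical helper, and it ignores brackets inside string literals for brevity.

```python
def is_balanced(text):
    """Return True if all brackets in text are properly nested and closed."""
    opener_for = {")": "(", "]": "[", "}": "{"}
    stack = []  # grows with nesting depth, so it is unbounded in general
    for char in text:
        if char in "([{":
            stack.append(char)
        elif char in opener_for:
            if not stack or stack.pop() != opener_for[char]:
                return False
    return not stack
```

A fixed set of states can only track nesting up to a fixed depth, which is why the grammar needs a parser on top of the regex-based lexer.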

google/cloud/bigquery/magics/line_arg_parser/lexer.py

google/cloud/bigquery/magics/line_arg_parser/parser.py

google/cloud/bigquery/magics/line_arg_parser/visitors.py

tests/unit/test_magics.py

Apparently black just places all implicitly concatenated string literals in a single line when short enough without replacing them with a single string literal.

This is necessary to retain Python 2 compatibility.

The tokens are designed in a way that the scanner *always* returns some match, even if just UNKNOWN or EOL. The "no matches" code path can thus never be taken, but the coverage check can't know that.
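A minimal sketch of the "always returns some match" design, assuming a much smaller token set than the real lexer. The `Token` tuple mirrors the one in the diff; the `WS` and `OPTION_SPEC` groups and the combined pattern are assumptions for illustration. The start position is passed to `finditer()` positionally, as in the commit above.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ("type_", "lexeme", "pos"))

# Order matters: specific patterns come first, the catch-all UNKNOWN and the
# empty-width EOL come last, so every position yields some match.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<WS>\s+)
    | (?P<OPTION_SPEC>--\w+)
    | (?P<UNKNOWN>\S+)
    | (?P<EOL>$)
    """,
    re.VERBOSE,
)


def tokenize(text, start=0):
    """Yield tokens from text, skipping whitespace."""
    for match in TOKEN_PATTERN.finditer(text, start):
        type_ = match.lastgroup  # name of the alternative that matched
        if type_ == "WS":
            continue
        yield Token(type_, match.group(), match.start())
```

Because `UNKNOWN` accepts any run of non-whitespace and `EOL` matches the empty string at the end, a branch that handles "no match at all" would be dead code, which is the situation the coverage tool complains about.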

tswast

Great work! Thanks for the clear explanations.

google-cla bot added the "cla: yes" label Aug 4, 2020

plamut commented Aug 4, 2020

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py (outdated)

plamut force-pushed the fix-cellmagic-parser branch 2 times, most recently from 5aa7f07 to e778cf2 Compare August 7, 2020 14:26

plamut force-pushed the fix-cellmagic-parser branch from e778cf2 to 4e30e4c Compare August 17, 2020 16:30

product-auto-label bot added the "api: bigquery" label Aug 21, 2020

plamut force-pushed the fix-cellmagic-parser branch 8 times, most recently from 5359b3a to b514850 Compare August 27, 2020 19:06

plamut requested a review from cguardia August 28, 2020 10:01

plamut added 4 commits August 28, 2020 16:40

chore: Move cell magic code into its own directory

c5e63c6

Add custom argument parser for cell magic

199974f

Add AST node visitor

d3525d4

Use a custom parser for cell magic arguments

5690b81

plamut force-pushed the fix-cellmagic-parser branch from b514850 to 5a6257a Compare August 28, 2020 14:41

plamut requested a review from shollyman August 28, 2020 15:10

Improve cell magic parser test coverage

976567c

plamut force-pushed the fix-cellmagic-parser branch from 5a6257a to 976567c Compare August 28, 2020 15:15

plamut marked this pull request as ready for review August 28, 2020 15:16

plamut commented Aug 31, 2020

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py (outdated)

shollyman requested a review from tswast September 1, 2020 22:36

cguardia approved these changes Sep 2, 2020

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py (outdated)

google/cloud/bigquery/ipython_magics/line_arg_parser/lexer.py (outdated)

Generalize valid option values

f57c6cd

The parser should accept as wide a range of values as possible and let the code that deals with the semantics decide whether the values are good or not.

plamut added 3 commits September 2, 2020 13:07

Fix recognizing --params option in state 3

c7e4420

The --params option spec must be followed by a non-alphanumeric character, otherwise it's a different option spec (e.g. --paramsX).

Fix typo in comment

90c504e

Merge branch 'master' into fix-cellmagic-parser

3b94a55

plamut requested a review from cguardia September 2, 2020 12:02

cguardia approved these changes Sep 3, 2020

Cover missing parser code path with a test

5249c6a

plamut force-pushed the fix-cellmagic-parser branch from 767e60b to 5249c6a Compare September 3, 2020 07:09

Merge branch 'master' into fix-cellmagic-parser

83218a2

tswast requested changes Sep 3, 2020

Preserve the cell magic context's import path

2c74ec8

The context still needs to be importable from the old path

plamut requested a review from tswast September 4, 2020 23:21

Clarify lexer states

19af056

plamut force-pushed the fix-cellmagic-parser branch from 1a97836 to 19af056 Compare September 4, 2020 23:41

tswast requested changes Sep 8, 2020

plamut added 5 commits September 9, 2020 11:28

Replace re.scanner with finditer()

4cbf7f2

Fix typo in docstring

64f19b5

Simplify string literal in a single line

5d10d36

Apparently black just places all implicitly concatenated string literals in a single line when short enough without replacing them with a single string literal.

Explain the visitors module.

46d1def

Merge branch 'master' into fix-cellmagic-parser

ed01a66

plamut force-pushed the fix-cellmagic-parser branch from 806a4ad to ed01a66 Compare September 9, 2020 10:12

plamut requested a review from tswast September 9, 2020 10:14

plamut added 2 commits September 9, 2020 12:29

Pass pos as a positional arg to finditer()

d9dd3e0

This is necessary to retain Python 2 compatibility.

Resolve coverage complaint about a code path

415c608

The tokens are designed in a way that the scanner *always* returns some match, even if just UNKNOWN or EOL. The "no matches" code path can thus never be taken, but the coverage check can't know that.

tswast approved these changes Sep 9, 2020

tswast merged commit dcfbac2 into googleapis:master Sep 9, 2020

plamut deleted the fix-cellmagic-parser branch September 10, 2020 11:20

release-please bot mentioned this pull request Jan 4, 2022

chore(main): release python-bigquery 1.27.1 #1097

Closed
