Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add custom cell magic parser #213

Merged
merged 20 commits into from Sep 9, 2020

Conversation

plamut
Copy link
Contributor

@plamut plamut commented Aug 4, 2020

Fixes #166.
Fixes #108.

This PR is to preview the custom cell magic line parser.

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 4, 2020
@plamut
Copy link
Contributor Author

plamut commented Aug 4, 2020

@shollyman I would like to hear your thoughts on this.

The current cell magic line parser is not aware of Python structures (dicts, lists ...) and can thus incorrectly parse the --params option (example issues: #166 and #108). The default parser breaks these structures on whitespace, and while we stitch these parts back together and operate on the result, this only happens after the parsing has already taken place and bugs have sneaked in.

I took a stab at it and came up with a POC custom parser that successfully parses non-trivial input lines such as the following:

%%bigquery    target_var --use_legacy_sql --project my.project.name --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal',} --verbose  

The following test snippets demonstrates that the parser correctly recognized the options:

from google.cloud.bigquery.ipython_magics import line_arg_parser as lap

example_line = (
    "  target_var "
    "--use_legacy_sql "
    "--project my.project.name "
    "--params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "
    "\"answer\": -42, 'array': ['aaa bbb', [\"nested\", 'list'], 'ccc ddd'], "
    "'--params': 'tricky --inner-option string literal',} "
    "--verbose  "
)

lexer = lap.lexer.Lexer(line)
parser = lap.parser.Parser(lexer)
parse_tree = parser.input_line()

printer = lap.visitors.TreePrinter()
printer.visit(parse_tree)

Output:

Input:
    destination_var:    target_var
    Command options:
        --use_legacy_sql 
        --project my.project.name
        --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal'}
        --verbose 

The custom parser could be used to extract out the --params <...> part, evaluate it separately, and only feed the remainder of the input to the default parser.

Do you think we should pursue this further?

The grammar used is of course a simplification that is aimed to be "good enough" to reliably extract the --params part in typical uses cases, but it does not try to be bullet-proof. For instance, it currently does not support non-string dict keys, sets, floats in exponential notation, etc., although that can all be added depending on how much we want to complicate. :)


cc: @tswast Do customers complain about the default parser's limitations often? Do you think there even exist reasonable use cases where --params`dictionary would be incredibly complex?

@tswast
Copy link
Contributor

tswast commented Aug 4, 2020

Do customers complain about the default parser's limitations often?

I wouldn't say often, but it has come up more than once

@plamut plamut force-pushed the fix-cellmagic-parser branch 2 times, most recently from 5aa7f07 to e778cf2 Compare August 7, 2020 14:26
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Aug 21, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch 8 times, most recently from 5359b3a to b514850 Compare August 27, 2020 19:06
@plamut plamut requested a review from cguardia August 28, 2020 10:01
@plamut plamut marked this pull request as ready for review August 28, 2020 15:16
Copy link
Contributor

@cguardia cguardia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. Thanks.

The parser should accept as wide a range of values as possible and let
the code that delas with the semantics to decide whether the values are
good or not.
The --params option spec must be followed by a non-alphanumeric
character, otherwise it's a different option spec (e.g. --paramsX).
@plamut plamut requested a review from cguardia September 2, 2020 12:02
Copy link
Contributor

@cguardia cguardia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good.

Copy link
Contributor

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! It does require a bit of refactoring, though. We need to keep google.cloud.bigquery.magics module/package name due to google.cloud.bigquery.magics.context

I'd also like to see if we can lean on pyparsing for a lot of the lexing / parsing work. I worry that those modules are going to be particularly challenging to maintain without either "literate programming" level comments or relying on common modules to make the parser easier to understand.

docs/magics.rst Outdated Show resolved Hide resolved
google/cloud/bigquery/ipython_magics/__init__.py Outdated Show resolved Hide resolved
The context still needs to be importable from the old path
@plamut plamut requested a review from tswast September 4, 2020 23:21
Token = namedtuple("Token", ("type_", "lexeme", "pos"))
StateTransition = namedtuple("StateTransition", ("new_state", "total_offset"))

# Pattern matching is done with regexes, and the order in which the token patterns are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this explanation and the clearer state names!

Fun fact: anything that can be parsed with a regular expression can also be parsed by a finite state machine and vice versa. It's called a regular language. That means we might have been able to convert this into one giant regular expression, but I like having these explicit states better.

I recall that lookahead & backtracking isn't a regular expression in the computer science definition, so I don't think this is technically a "regular" language, though.

Copy link
Contributor Author

@plamut plamut Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, regular expressions correspond to finite automata, although the regex implementation is a bit more powerful than that (AFAIK), as it supports things like lookahead and lookbehind.

The language itself recognized by the parser is not regular, true, as the grammar allows for recursive patterns such as nested dicts. A proof of non-regularity is left as an exercise to the reader. 😆

(hint: pumping lemma).

google/cloud/bigquery/magics/line_arg_parser/lexer.py Outdated Show resolved Hide resolved
google/cloud/bigquery/magics/line_arg_parser/parser.py Outdated Show resolved Hide resolved
tests/unit/test_magics.py Outdated Show resolved Hide resolved
Apparently black just places all implicitly concatenated string
literals in a single line when short enough without replacing them
with a single string literal.
This is necessary to retain Python 2 compatibility.
The tokens are designed in a way that the scanner *always* returns
some match, even if just UNKNOWN or EOL. The "no matches" code path
can thus never be taken, but the coverage check can't know that.
Copy link
Contributor

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for the clear explanations.

@tswast tswast merged commit dcfbac2 into googleapis:master Sep 9, 2020
@plamut plamut deleted the fix-cellmagic-parser branch September 10, 2020 11:20
gcf-merge-on-green bot pushed a commit that referenced this pull request Sep 22, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.28.0](https://www.github.com/googleapis/python-bigquery/compare/v1.27.2...v1.28.0) (2020-09-22)


### Features

* add custom cell magic parser to handle complex `--params` values ([#213](https://www.github.com/googleapis/python-bigquery/issues/213)) ([dcfbac2](https://www.github.com/googleapis/python-bigquery/commit/dcfbac267fbf66d189b0cc7e76f4712122a74b7b))
* add instrumentation to list methods ([#239](https://www.github.com/googleapis/python-bigquery/issues/239)) ([fa9f9ca](https://www.github.com/googleapis/python-bigquery/commit/fa9f9ca491c3f9954287102c567ec483aa6151d4))
* add opentelemetry tracing ([#215](https://www.github.com/googleapis/python-bigquery/issues/215)) ([a04996c](https://www.github.com/googleapis/python-bigquery/commit/a04996c537e9d8847411fcbb1b05da5f175b339e))
* expose require_partition_filter for hive_partition ([#257](https://www.github.com/googleapis/python-bigquery/issues/257)) ([aa1613c](https://www.github.com/googleapis/python-bigquery/commit/aa1613c1bf48c7efb999cb8b8c422c80baf1950b))


### Bug Fixes

* fix dependency issue in fastavro ([#241](https://www.github.com/googleapis/python-bigquery/issues/241)) ([2874abf](https://www.github.com/googleapis/python-bigquery/commit/2874abf4827f1ea529519d4b138511d31f732a50))
* update minimum dependency versions ([#263](https://www.github.com/googleapis/python-bigquery/issues/263)) ([1be66ce](https://www.github.com/googleapis/python-bigquery/commit/1be66ce94a32b1f924bdda05d068c2977631af9e))
* validate job_config.source_format in load_table_from_dataframe ([#262](https://www.github.com/googleapis/python-bigquery/issues/262)) ([6160fee](https://www.github.com/googleapis/python-bigquery/commit/6160fee4b1a79b0ea9031cc18caf6322fe4c4084))


### Documentation

* recommend insert_rows_json to avoid call to tables.get ([#258](https://www.github.com/googleapis/python-bigquery/issues/258)) ([ae647eb](https://www.github.com/googleapis/python-bigquery/commit/ae647ebd68deff6e30ca2cffb5b7422c6de4940b))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
3 participants