Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/sql parsing #157

Open
wants to merge 100 commits into
base: main
Choose a base branch
from
Open

Conversation

tools4origins
Copy link
Collaborator

This PR implements an ANTLR-based logic to parse SQL.

ANTLR is a parser generator used by Apache Spark. As such, we are able to use the exact spark SQL syntax.

SQL strings are converted into an abstract syntax tree (AST) by this project https://github.com/pysparkling/python-sql-parser.

These AST are then converted into Pysparkling object using the pysparkling/sql/ast/ast_to_python.py module, in particular via its entry points parse_xxx, e.g. parse_sql or parse_schema.

This PR only exposes SQL parsing on SQL schemas via a refactoring of StructType.fromDDL.

It also contains part of the logic that will be used to handle other types of SQL statements.

Copy link
Contributor

@svaningelgem svaningelgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just working on something else and realized: why would you import parameterized? Why not use "@pytest.mark.parametrize" instead? That is one less dependency?

@svaningelgem
Copy link
Contributor

svaningelgem commented Mar 5, 2021

        "struct<`x+y`:int, `!@#$%^&*()`:string, `1_2.345<>:\"`:varchar(20)>": StructType([
            StructField("x+y", IntegerType(), True),
            StructField("!@#$%^&*()", StringType(), True),
            # StructField("1_2.345<>:\"", VarcharType(20), True)])
            StructField("1_2.345<>:\"", StringType(), True)
        ]),

==> This test would fail because the backquotes are not removed.

I solved this one as:

    'QuotedIdentifierAlternativeContext': partial(unwrap, has_quotes=True),

[...]


def unwrap(*children, has_quotes=False):
    check_children(1, children)
    tmp = convert_tree(children[0])

    if has_quotes:
        return tmp[1:-1]  # Strip the quotes.

    return tmp

@svaningelgem
Copy link
Contributor

svaningelgem commented Mar 5, 2021

        """
        struct<
            struct:struct<deciMal:DECimal, anotherDecimal:decimAL(5,2)>,
            MAP:Map<timestamp, varchar(10)>,
            arrAy:Array<double>,
            anotherArray:Array<char(9)>
        >
        """: StructType([
            StructField("struct", StructType([
                StructField("deciMal", DecimalType()),
                StructField("anotherDecimal", DecimalType(5, 2))
            ])),
            # StructField("MAP", MapType(TimestampType(), VarcharType(10))),
            StructField("MAP", MapType(TimestampType(), StringType())),
            StructField("arrAy", ArrayType(DoubleType())),
            # StructField("anotherArray", ArrayType(CharType(9)))])
            StructField("anotherArray", ArrayType(StringType()))
        ]),

    "strUCt<>": StructType([]),
    # DataType parser accepts certain reserved keywords.
    "Struct<TABLE: string, DATE:boolean>": StructType([
        StructField("TABLE", StringType()),
        StructField("DATE", BooleanType())
    ]),
    "struct<end: long, select: int, from: string>": StructType([
        StructField("end", LongType()),
        StructField("select", IntegerType()),
        StructField("from", StringType()),
    ]),
    "Struct<x: INT, y: STRING COMMENT 'test'>": StructType([
        StructField("x", IntegerType()),
        StructField("y", StringType(), {'comment': 'test'}),
    ])

Also gives an error.

@svaningelgem
Copy link
Contributor

svaningelgem commented Mar 6, 2021

And the following 2 do not fail correctly when they should:
struct<x: int
gives StructType(List(StructField(x,IntegerType,true))) when it should fail (no ending >).

struct<x+y: int, 1.1:timestamp>
fails with a TypeError where it should fail with SqlParserError (because of the unquoted identifier 1+1 which is invalid without quotes.

@svenkreiss
Copy link
Owner

Wow, this is fantastic @tools4origins . And thanks for the prompt review @svaningelgem . On the SQL sub package, I'll let you two discuss. I am currently very busy. For me the only concerns are changes to setup.py that I would like to understand why they are necessary.

Copy link
Owner

@svenkreiss svenkreiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too me, everything looks good.

I have only two comments about dependencies. Once resolved, I'll wait for your ready signal to merge.

setup.py Outdated
@@ -31,14 +36,15 @@
'backports.tempfile==1.0rc1',
'cloudpickle>=0.1.0',
'futures>=3.0.1',
'pylint',
'pylint~=2.7',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added restrictive version requirements on all dependencies. I once again had several tests fail due to a change in the behavior of multiple dependencies (both isort and pylint). The tests did not behave the same on my laptop with an older version of isort/pylint and in the CI where the latest versions were downloaded.

Unless there are specific tests for multiple versions of a dependency, I don't think we should assume we support it.

setup.py Outdated
'pylzma',
'memory-profiler>=0.47',
'pycodestyle',
'pytest',
'pytest-cov',
'isort',
'tornado>=4.3',
'parameterized>=0.7.4',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as @svaningelgem . Is it possible to avoid this extra dependency using pytest mechanisms?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked around and did not find something that would do what I was trying to achieve. It is very useful as there are a lot of cases to test for many SQL functions so I really appreciate the added value of the lib

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants