Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48342][SQL] Introduction of SQL Scripting Parser #46665

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

davidm-db
Copy link
Contributor

@davidm-db davidm-db commented May 20, 2024

What changes were proposed in this pull request?

This PR proposes changes to SQL parser to introduce support for SQL scripting statements:

  • Introduces BEGIN keyword to the lexer.
  • Changes grammar to support compound statements.
  • Exposes new parseScript method in ParserInterface.
  • Implements visitor functions for new nodes in the syntax tree in AstBuilder.
  • Introduces new logical operators in SqlScriptingLogicalOperators for compound statements that are created by visiting functions and that will be used during interpretation phase.

In order to simplify the process, in this PR we only introduce the support for compound statements to the SQL parser.
Follow-up PRs will introduce interpreter, further statements, support for exceptions thrown from parser/interpreter, etc.
More details can be found in Jira item for this task and its parent (where the design doc is uploaded as well).

Why are the changes needed?

The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.

Until now, users had to write verbose SQL statements or combine SQL + Python to efficiently write the logic. This is an effort to breach that gap and enable complex logic to be written entirely in SQL.

Does this PR introduce any user-facing change?

No.
This PR is a first in series of PRs that will introduce changes to sql() API to add support for SQL scripting, but for now, the API remains unchanged.
In the future, the API will remain the same as well, but it will have new possibility to execute SQL scripts.

How was this patch tested?

There are tests in SqlScriptingParserSuite that test the newly introduced parser changes.

Was this patch authored or co-authored using generative AI tooling?

No.

@davidm-db davidm-db changed the title SQL Batch Lang Parser [SPARK-48342] SQL Batch Lang Parser May 20, 2024
@davidm-db davidm-db changed the title [SPARK-48342] SQL Batch Lang Parser [SPARK-48342] Introduction of SQL Scripting Parser May 21, 2024
@davidm-db davidm-db changed the title [SPARK-48342] Introduction of SQL Scripting Parser [SPARK-48342][SQL] Introduction of SQL Scripting Parser May 21, 2024
@davidm-db davidm-db marked this pull request as ready for review May 21, 2024 15:17
@github-actions github-actions bot added the DOCS label May 22, 2024
Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

@melin
Copy link

melin commented May 27, 2024

Multiple statements include set statements if the next statement of the set statement has a syntax error. The analysis is wrong,

BEGIN
    SET hive.exec.dynamic.partition.mode = 'nonstrict';
    CREATE TABLE IF NOT EXISTS dws_spu_user_type_agg
    (
        spu_code                   STRING
        ,detail_channelSTRING
    )
    PARTITIONED BY (pt STRING)
    STORED AS ORC
END;

image

@davidm-db
Copy link
Contributor Author

@melin we are aware of this problem, it already exists without SQL scripting... example: using SET VAR to set a SQL variable with syntax error after the = (let's say SELECT query with some syntax error) won't throw syntax error, but will rather match the SET .*? rule of setConfiguration and return completely unrelated exception...

We have a follow-up PR, that will define behavior of SET within the SQL scripts. Per standard, we should make VAR | VARIABLE to not be required in SQL scripts for SET statement. SET .*? rule creates a big problem for this approach, so we decided that for the start we will remove those rules from SQL scripts and allow only SET for SQL variables. If anyone wants to set config, they can do it through EXECUTE IMMEDIATE.

@melin
Copy link

melin commented May 29, 2024

Can remove: SET.*? , or optimize the set syntax definition?

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the discussion!

val visitedChild = visit(ctx.getChild(i))
visitedChild match {
case statement: CompoundPlanStatement => buff += statement
case null if child.isInstanceOf[TerminalNodeImpl] =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would the child not be an instance of TerminalNodeImpl? If this is enforced by the parser, then you can just remove the check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I agree. We had only case null => but it was very unclear what it meant, so we decided to add this check.

This way, if anything changes in the future, we would get exceptions in our tests if this null does not happen because the child is instance of TerminalNodeImpl.

Not sure if this is the best way to do it though. I guess that it's probably better to not visit child at all if it's a terminal node. I don't know how better to achieve this. Or maybe we are fine with leaving only case null => and adding comment that explains it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example SQL to trigger this null case?

Copy link
Contributor Author

@davidm-db davidm-db Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically happens within every SQL statement - TerminalNodeImpl corresponds to keywords from the SQL grammar, so for example if the query is like:
image

Then, on the top level, TerminalNodeImpl are semicolons:
image

And then deeper in the tree it could be a SELECT:
image

Or whichever other keyword.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved with the latest changes in visitCompoundBodyImpl.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM!

}

override def visitCompoundStatement(ctx: CompoundStatementContext): CompoundPlanStatement = {
val child = visit(ctx.getChild(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks very weird, do you mean ctx.statement() != null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser rule is

compoundStatement
    : statement
    | beginEndCompoundBlock
    ;

I think it should be very clear which one is matched.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like checking if we are visiting statement or beginEndCompoundBlock first and then do the visit, instead of visiting first and than matching on the output type?
I will change it to this anyways because I think it makes more sense and it cleaner than getChild(0) etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

;

compoundBody
: (compoundStatement SEMICOLON)*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify the scala side, (statements+=compoundStatement SEMICOLON)*

You can check out similar rules in this rule that use +=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is helpful... This will help us resolve the other comment for null check in visitCompoundBodyImpl as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

case body: CompoundBody => body
case _ =>
val position = Origin(None, None)
throw QueryParsingErrors.sqlStatementUnsupportedError(sqlScriptText, position)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this really happen? Or it means bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, we did the same as in parsePlan, but in my opinion this means bug... This would happen if parsing is successful, but something other than expected node (CompoundBody here) is returned, which I have no idea how it could happen. If something is bad, I would expect Syntax/Parsing exception to be thrown, otherwise the sqlScriptText should be parsed into CompoundBody. So, I don't expect this case to ever be hit, but we copied the parsePlan behavior.

Of course, I might be missing something here...

/**
* Logical operator representing result of parsing a single SQL statement
* that is supposed to be executed against Spark.
* It can also be a Spark expression that is wrapped in a statement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing, this seems to be accidental paste from POC PR and it was a leftover from some previous version there I think... Will remove the last sentence from the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

* It can also be a Spark expression that is wrapped in a statement.
* @param parsedPlan Result of SQL statement parsing.
* @param sourceStart Index of the first char of the statement in the original SQL script text.
* @param sourceEnd Index of the last char of the statement in the original SQL script text.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird put these in the constructor. Other plan nodes keep the SQL string context as a thread local Origin, can't we do the same? Check out the withOrigin calls in AstBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't aware of this, thanks! Will change to using Origin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants