Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Introducing an AST factory #3770

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

[WIP] Introducing an AST factory #3770

wants to merge 3 commits into from

Conversation

jods4
Copy link
Contributor

@jods4 jods4 commented Sep 8, 2022

This is an attempt at improving the internals of linq2db, there is no change for users.
As linq2db is large and complex, the changes in the PR are just "getting started" and far from complete, although as I will discuss below it should probably be merged incrementally in multiple steps..

Main idea: add an AST factory

The core idea of this PR is to introduce a factory class to create our SQL abstract syntax tree.
So, instead of new'ing up nodes with new SqlPredicate.IsNull(...) and friends, we call into ast.IsNull(...).

What's the difference anyway?!

new T must return an instance of T, whereas a factory could return any ISqlExpression.
This opens up quite a few possibilities:

Local optimizations

The AST factory can apply local transformations and return an equivalent, optimized expression. I think I've seen this referred to as optimization by construction.

For example, if you call and(...) with a series of nodes that incl. a false constant, the factory would just return a false constant; Nested calls to coalesce(...) can be flattened in a single node.

Today, these transformations are generally done by SqlOptimizer. Doing them at construction time largely reduces the number of allocations, which is beneficial for performance.

To illustrate, consider the expression isnull(row(x, y)) in a provider that doesn't support row.
Today, compilation could go like this:

isnull( row(x, y) )         # Parsed AST -> 2 nodes (in addition to x, y)
and( isnull(x), isnull(y) ) # Optimizer lowers to non-row code -> 3 new nodes
and( isnull(x), false )     # Optimizer notices y cannot be null, simplifies -> 2 new nodes
false                       # Optimizers sees "and" is always false -> 0 new node (hopefully reuses false node)

With a builder, the same process happens but immediately at construction without visiting anything, and without constructing intermediate nodes:

ast.row(x, y)           # Temporary row(x, y) must still be built by expression parser -> 1 node.
ast.isnull( row(x, y) ) # This call notices the row and immediately calls SQL equivalent -> 0 node.
ast.and( 
  ast.isnull( x ),      # Allocates a temporary isnull(x) -> 1 node
  ast.isnull( y )       # Notices y is non-nullable and returns a cached "false" node -> 0 allocation
)                       # ast.and call notices one parameter is false and returns it -> 0 allocation

Summary: current design allocates 7 nodes (in addition to x, y); factory only allocates 2!

Note
This doesn't remove or replace SqlOptimizer. It is still useful for global optimizations, i.e. transforms that need more context than just the current node.

Interning

Common constants or expressions such as null, 0, 1, true, false are cached and re-used, reducing allocations.
If the codebase is 100% going through the AST factory, we can take advantage of the fact that values are interned and make reference equalities rather than pattern matching to check if something is a well-known value like 0.

Provider-specific translations

The factory could be a static class, but I've made it an instance, so that we can have provider-specific factories.
This enable provider-specific transformations, such as lowering unsupported expressions like SqlRow in the previous example.

Enforcing invariants

The factory can enforce some constraints on our AST nodes, for example we could guarantee that a some structures like or, coalesce are always flattened, which makes further processing more efficient.

Benefits of AST factory

Performance

As explained above, a factory can apply transformations at construction without allocating nodes.
This improves performance because it reduces GC pressure and requires less "visiting".

Separation of concerns

Some optimizations are actually made by the expression parser.
This complicates the parser with things that are not its responsibilities.
Furthermore, those optimizations are either not applied later when the tree is transformed, or duplicated.
With a dedicated class that consistently does these transforms, the calling code can be simplified.

I think that if you look at the code that is already modified in this PR, the result is IMHO generally much cleaner -- and I'm only getting started.

Future-proof

This is a first step, that will help future refactorings.
I think our SQL AST is not ideal and there are changes I would like to make for cleaner code and more efficient execution.
This abstraction layer, once complete, will make it easier to apply those changes project-wide.

Provider-independent remoting?

This is a long-term idea that is not on my map for now, but anyway:
Instead of serializing a provider specific AST on the client, we could call into a "generic" AST factory that does nothing but creates a neutral structure to serialize these calls.
On the server, we can rebuild the AST by making the same calls into a provider-specific factory.
This could make the client provider-agnostic, which is nice to have, IMHO.

Rollout

Obviously this is a very large-scale change, as there are calls to AST constructors all over the place.
I've started the work just so that you can get a better grasp of what the code would actually look like.

I think this is gonna be too annoying to merge to fully do in a single PR.
If we commit to this idea, I think we should regularly merge progress. Not all the code calls into the factory, but everything is still working and can be released.

There's a lot to do, in waves:

  • Replace all ctors usage;
  • Make AstFactory provider-specific, move provider transformations into it;
  • Move local optimizations in factory (from SqlOptimizer and other places);

@@ -2003,8 +1990,10 @@ public virtual ISqlExpression ConvertExpressionImpl(ISqlExpression expression, C
return expression;
}

protected virtual ISqlExpression ConvertFunction(SqlFunction func)
protected virtual ISqlExpression ConvertFunction(ISqlExpression expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would recommend this remains SqlFunction func, and AstFactory.Coalesce() returns a SqlFunction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the ConvertFunction business was annoying, I had to hack something to be able to proceed without making too many deep changes for now.

Long-term, I would like to see ConvertFunction be gone, seems to me its purpose is to switch function names so that correct SQL is generated. This could also happen at SQL generation time without modifying the AST at all, but I don't want to engage in too many refactorings at the same time.

Would recommend AstFactory.Coalesce() returns a SqlFunction.

Seems against the principle of AST factory to constraint what is returned too much.
For instance, one local optimization I have on the map for later is that Coalesce can drop all its arguments after the 1st one that cannot be null.
If this happens to be the 1st argument, Coalesce can just return the first argument. In that case there's no guarantee that Coalesce returns a SqlFunction, could be any expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Wonder if a lot of the remaining functionality in this method could be converted to Function attributes or expression mappings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, I haven't got my head too deep into that yet.

Gotta say one challenge when trying to improve the code is that you hit some design quirks and then it's a rabbit hole to work-around or fix them.

Other instances where I had to hack around:

  • I've changed Invert to return a ISqlPredicate instead of IQueryElement... turned out every caller had a cast already! 😱 .
  • I've changed NotExpr to take a ISqlPredicate instead of ISqlExpression. Makes sense and was already the case almost everywhere.

Source/LinqToDB/SqlQuery/AstFactory.cs Show resolved Hide resolved
Source/LinqToDB/SqlQuery/AstFactory.cs Show resolved Hide resolved
Source/LinqToDB/SqlQuery/AstFactory.cs Show resolved Hide resolved
Source/LinqToDB/Sql/Sql.Row.cs Show resolved Hide resolved
Comment on lines +173 to +176
(SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } ca,
SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } cb) => ToArray(ca.Parameters, cb.Parameters),
(SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } ca, _) => ToArray(ca.Parameters, b),
(_, SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } cb) => ToArray(a, cb.Parameters),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the commentary of the PR, these cases should be recognizing an existing SqlFunction object and updating it rather than preparing an array for a brand new SqlFunction object in l180. This probably means that SqlFunction.Parameters should use an IReadOnlyList<> or similar, where SqlFunction provides an internal void AddParameter(ISqlExpression a) that updates the underlying list of that property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm still trying to come up with better ways to reduce allocations and would love new ideas here.

I think our SQL AST nodes should be immutable. Maybe not in the strict sense, I'd maybe allow having flags set later during analysis / optimizations, but as far as what what the node is/represents.
Immutability enables re-using the node safely, for example when expanding an expression a == b into a == b or a is null and b is null without cloning the whole a and b subtree.
Another idea is that we could cache and reuse sub-expressions across various compilations.

I'm not set on what's the best way to reduce copying of immutable structures, there are options. I'd like to balance clean code and efficiency.
Some ideas:

  1. This specific coalesce example might be common and bad when converting a long ?? chain.
    To avoid this we could complement it by some custom code in the expression converter to immediately create the flattened node.
    I have done this for or/and predicates, the Expression converter immediately flattens them before calling the factory, although the factory would also flatten (by copying) later tree transforms that would put an or inside an or.

  2. Maybe nodes could be mutable until they're cloned? Could be indicated by a bit on the node.
    This makes any code that want to mutate the node less clean because it has to check the bit and it needs 2 code paths for the case where it can mutate and the case where it needs a copy :(

  3. For nodes that are nicer being mutated when they're first created (or/and immediately come to mind in current linq2db design), maybe we could have a Builder pattern, where a mutable struct is used until we're done and then an immutable node is produced.

  4. Maybe we can use more sophisticated structures inside ast nodes, e.g. a collection of nodes that can share and reuse a buffer when stuff is appended?
    BTW in this regard, I think we would benefit from using frugal collections, i.e. structs with optimized storage for 2-4 elements, instead of generic arrays/list, as many nodes commonly have very few children. That would cut down allocations some more... Lots of ideas but I don't want to touch the AST nodes until the factory is fully in place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should do some research on red-green trees from Roslyn: https://ericlippert.com/2012/06/08/red-green-trees/. It may help you find a better balance between immutability and memory pressure. I think the gist of it is similar to option 2, where you have internal mutable trees and public immutable trees once cloned.

Copy link
Contributor Author

@jods4 jods4 Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I change my mind every day but seeing some stuff that's done with the AST today, I'm having second-thoughts about immutability. I'm still unsure what the best design will be, but in all cases it'll be easier to tweak AST representation with the factory in place.

RE: red-green trees I don't think they would help with building the AST, as the green part is immutable and basically matches what our AST is and does today.
The design might be interesting if we want to maximize the sharing of immutable nodes (the green part), maybe even across queries, while allowing a mutable layer on top, e.g. for flags and other annotations (would be the red part).

My current thoughts:

  • The builder approach is most promising if we want to optimize allocations to the max during Expression conversion.
  • Another design that would work well during optimizations too: add a flag indicating that the node is shared (hence immutable -- but most nodes would not be) and mutating methods. Those methods would mutate and return the same node when they can, and return a fresh modified copy when they can't.

The second design pushed to the max would seriously reduce allocations during full tree optimizations by SqlOptimizer. Currently, when something is modified, every parent node of the tree is replaced by a new one -> in many cases that may not be required if you just swap child references.

@jods4
Copy link
Contributor Author

jods4 commented Sep 10, 2022

About SqlBuilders

SqlBuilder is the last step of the pipeline, it takes an optimized tree and prints a SQL string.

I've noticed some builders create a bunch of AST nodes, the main reason is to print those nodes as an alternative for "standard" nodes in the tree. These kind of transformations are often done by SqlOptimizer but not always...

I'm not gonna introduce the AST factory inside SqlBuilder classes. This is a last step past every optimizations and the provider-specific builders know exactly what they want to print, so let them be.

It is quite wasteful to create those temporary nodes. It'd be good to remove them with some more direct calls to SqlBuilder helpers that can print expressions without allocating new nodes.

A good example is the BuildFunction(string name, ISqlExpression[] parameters) that is currently private but could be made protected. It's much more efficient to call this directly rather than to print a new SqlFunction(name, parameters).

@viceroypenguin
Copy link
Contributor

About SqlBuilders

SqlBuilder is the last step of the pipeline, it takes an optimized tree and prints a SQL string.
...

I'll agree with this comment. I think ultimately there should be three distinct phases: building (when initial LINQ calls are made), optimizing (SqlOptimizer and co.), and rendering (SqlBuilder and co.). Currently, it sounds like optimizing and rendering are blended, with some optimization happening during the rendering phase and some pre-rendering happening during the optimization phase. It would definitely be a bit cleaner if the optimization was independent of rendering, with maybe some flags available from the data provider that affect certain features during the optimization phase.

@sdanyliv
Copy link
Member

sdanyliv commented Sep 18, 2022

@viceroypenguin, I've did that separation, and I've got performance degradation ;) So, SqlBuilder also optimises SQL on the fly. It is critical for queries which depends on parameter values.

Main disadvantage of doing optimisation on whole Statement is a lot of new allocations and AST rebuilding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants