Improve sequence building performance #3913

jods4 · 2023-01-15T16:38:54Z

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant. If your query uses one of the last builders -- or simply the expression has no builder, which is sometimes a normal case -- you have to go through 70+ checks!! They are virtual calls, many builders repeatedly perform the same casts or checks, and not all builders perform quick, cheap tests. Of course this is only gonna get worse as we add new builders.

The current design attempts at improving performance by dynamically sorting the list to put the most used builders at the front.
This is a good idea but it has two drawbacks.

It adds complexity for the book-keeping of how many times each builder is used. This is mutable concurrent code and requires locks. The amortized perf increases (probably) but the overall perf is worse.
It is not that effective. The list is re-sorted only when the first item would change. I assume this is done to avoid constantly reordering the list if 2 builders fight for a specific spot (although that can happen for 1st place!!).
So once the most used builder is found, nothing changes anymore, even if a builder that becomes commonly used is still way down the list at the moment.
As I said before, "not finding a builder" is sometimes an expected use-case and this always go through the whole list.

New strategy

Finding a builder now starts with a big switch statement on NodeType (an int). C# optimizes this depending on the number and distribution of items, it could be a hard-coded binary search (ifs) or a single indirect jump based on a table.

Then a CanBuild method is called, but notice:

At this stage, only 1 or 2 (rarely 3) candidates remain.
CanBuild is now a static method. Short ones will be inlined, which is very likely as most have been simplified thanks to the fact that we know what node to expect when CanBuild is called. Most of them are as simple as => true, => call.IsQueryable() or => call.IsAsyncExtension()

One notable case is ExpressionType.Call because we have many (most) builders for this node type.
This is handled by a second switch, this time on MethodCallExpression.Method.Name, which C# will compile into a static hashtable. This is then handled as other node types. In 90% of cases there's only a single candidate left.

There are other benefits, such as clarity. The book-keeping and re-sorting code is gone.
Before it was difficult to see which builder handles what, which is particularly important as they must all be mutually exclusive!
Now you just need to peek at the switch tables to find what is handled, by what.

Did you know that 3 different builders handle the Join method?
I found a builder that is actually unused! I deleted CountBuilder (methods Count and LongCount are actually handled by AggregationBuilder).

Source generators

I'm not crazy and I didn't write the big switches by hand, that would be too error-prone.
Instead, I added a C# Source Generator to the project, that looks for attributes like BuildsExpression or BuildsMethodCall on builders and implements FindBuilderImpl automatically.

Note
Tips for source generators:

Consuming a generator in VS generally works nicely. VS will go to definitions in generated sources and you can also see the files in Solution Explorer, under References > Analyzers.

If you modify a generator your changes apply to Build immediately but as of the current version of VS 2022, IDE will continue to use the one it has loaded. The only way to refresh IDE after modifying a generator is to close and reload VS.

To help when working on a generator, you can use <EmitCompilerGeneratedFiles> in csproj to save files generated during build on disk (inside obj). I left this property, commented, in the csproj PR.

As generated code is not committed to git, I uploaded the source generated by this PR here for your convenience:
https://gist.github.com/jods4/6b145327f3c7563836c357984b723caf

Code patterns

// BuildsExpression is picked up by the source generator.
// It takes the list of ExpressionType this builder may handle
[BuildsExpression(ExpressionType.Lambda, ExpressionType.Constant)]
class MyBuilder : ISequenceGenerator
{
  // This static method will be called from generated code.
  // `expr` is guaranteed to be of the types specified in attributes.
  // Keep this short and efficient!
  public static bool CanBuild(Expression expr, BuildInfo info, ExpressionBuilder builder)
    => true;
}

// BuildsMethodCall is picked up by the source generator.
// It takes the list of method names that this builder may handle
[BuildsMethodCall("Take", "Skip")]
class MethodBuilder : ISequenceGenerator // or more commonly : MethodCallBuilder
{
  // This static method will be called from generated code.
  // `call` is guaranteed to be for a method whose name is amongst BuildsMethodCall arguments.
  // Keep this short and efficient!
  public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
    => true;
}

Special cases

A few builders may handle any method call. You can do this simply by using [BuildsExpression(ExpressionType.Call)], they will be probed after any method-specific builder [BuildsMethodCall].
Today this inculdes MethodChainBuilder, QueryExtensionBuilder, TableBuilder. Those builders tend to look for attributes on methods, e.g. [Extension] or [Table].

Some builders take advantage of having different, simpler checks based on what method or expression type matched.
You can have multiple attributes on a single builder, and use the named property CanBuildName to direct the check to a different method for some cases. Many sync/async builders do that, for example ContainsBuilder:

[BuildsMethodCall("Contains")]
[BuildsMethodCall("ContainsAsync", CanBuildName = nameof(CanBuildAsyncMethod))]
sealed class ContainsBuilder : MethodCallBuilder
{
	public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsQueryable() && call.Arguments.Count == 2;

	public static bool CanBuildAsyncMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsAsyncExtension() && call.Arguments.Count == 3;
}

ContextRefBuilder is unique: it seems that it could potentially handle any expression if BuilderInfo.Parent is null. For this case I added [BuildsAny]. Builders with this attribute will be probed after everything else failed.
The signature of the static method is CanBuild(BuildInfo info, ExpressionBuilder builder).

Risks and breaking changes

Everything is internal: there's no public API change, so no breaking change.

A lot of files were touched because there is 70+ builders but 95% of changes are very mechanical. I think there's still room for improvements but I wanted to keep this PR straightforward, as it's fairly large.
If all tests pass, I'd say the risk of regression is low.

jods4 · 2023-01-15T19:39:16Z

Some wcf remoting examples don't compile because of http in System.Net when targetting .net 4.8.
@sdanyliv I suppose that's already the case on your branch, as it has nothing to do with my changes.

Source/CodeGenerators/BuildersGenerator.cs

viceroypenguin · 2023-01-15T22:25:24Z

Source/LinqToDB/Linq/Builder/DistinctBuilder.cs

@@ -6,14 +6,13 @@ namespace LinqToDB.Linq.Builder
 	using Reflection;
 	using LinqToDB.Expressions;

+	[BuildsMethodCall("Distinct", nameof(LinqExtensions.SelectDistinct))]


Not sure I like the inconsistency of strings and nameof. I'd say we (royal collective we) should pick one and stick with it, and this PR is a good chance to pick and update all to thath. I'd vote for nameof myself.

For this first draft I tried to stick to the style of existing code, but that's totally doable. 👍
At least for our own methods (e.g. LinqExtensions) I like the strong references.

It is worth noting:

Some references are to .net fx methods;

Sometimes a single name refers to multiple methods in different classes.

MaceWindu · 2023-01-17T11:47:36Z

I see PR is based on parser refactoring feature, so I assign it to v6 milestone

jods4 · 2023-01-17T12:02:15Z

@MaceWindu yes, this is on top of the parser refactoring.
I could have done it on top of v4, but I didn't want to have to merge with all @sdanyliv changes.

igor-tkachev · 2023-02-28T15:31:28Z

Do you have an example that shows the problem? I would like to look at in performance profiler.

jods4 · 2023-03-01T00:08:43Z

@igor-tkachev if you want to see the worst query compilations, you basically need to use the builders that are close to the end of the list.

A good example would be some complex MERGE statement, because it has many parts and they are all close to the end of the list. For bonus points be sure to tag the query, as that new feature is 4th from last.

That said, be careful after the first query compilation because linq2db attempts to improve performance by sorting the list of potential builders, which means your query will now be vastly more performant on the 2nd run.
Priority sorting is a good idea but it's far from perfect in practice. The list is only sorted if the 1st element would change, so if one type of query dominates from very early on (e.g. SelectBuilder), then the rest won't be sorted even if you suddenly use a significant count of MergeBuilder queries. Also, if you do many different kind of queries, you'll still pay the price for the list scanning as they can't all be at the front.

So if you're doing benchmarking, maybe execute a bunch of SELECT before benchmarking some MERGE, to ensure the required builders stay close to the end of the list during your tests.

In fact, another performance issue that is a bit unlikely would be when 2 different builders fight for the top spot. Then the list would be constantly updated, which is another perf hit.

Finally, the current code uses locks, so if you compile many queries in parallel, contention might impact perf. I think that's unlikely to show up in simple benchmarks.

igor-tkachev · 2023-03-02T01:24:23Z

It looks like we need some mechanism to collect metrics/statistics of our code and create a report at the end of the test run.. So, we will be able to compare different solutions.

viceroypenguin · 2023-10-30T14:45:25Z

Source/CodeGenerators/Directory.Build.props

Unnecessary file?

I don't quite remember (it's been a while...) but I think it was necessary, maybe to avoid picking up properties from a parent directory.

Then again... maybe I added it to override some props, then removed them and kept the file.
If it makes no difference to remove it, we can. But it needs a quick test.

viceroypenguin · 2023-10-30T14:46:35Z

Source/CodeGenerators/BuildersGenerator.cs

+		}
+	}
+
+	void GenerateCode(


Question for evaluation: would it be better to do this as written, or to use a templating system like Scriban? pros and cons to each, and no right answer for all situations, so worth evaluating here.

I did not want to add external dependencies, even for a private tool, as I know this is usually avoided in linq2db.

That said, I use Scriban myself on other projects and I agree it's a nice templating syntax to generate source code. 👍
We can make the change if the rest of the team feels the same way.
Obviously, there's a small learning curve to get familiar with the syntax and using a plugin for syntax highlighting (I use VS Code, not sure if avail. in VS) is highly recommended.

Public dependencies are usually avoided. Build dependencies I think are fine. The point is to minimize what dependencies we require consumers to assume.

No syntax coloring in VS, unfortunately. I've not found that to be a significant issue for me.

viceroypenguin · 2023-10-30T14:48:50Z

Source/LinqToDB/LinqToDB.csproj

+
+		<!-- <EmitCompilerGeneratedFiles>true</EmitCompilerGeneratedFiles> -->


Remove debugging code

Instead of removing, does it make sense to emit them to a place where they can be added to source control, so that changes can be reviewed as part of future PRs? This is what I do in SuperLinq to ensure that any changes to the SG or it's output are able to evaluated as part of the PR.

I left it on purpose, because it's a very useful property to know about when debugging the code generator and I doubt we all know its exact name.
But we can remove it if you prefer, obviously it's just a comment.

In other projects I also like to commit generated code somewhere (if not as the main compilation input), because as you said it makes it easy to diff changes when updating the generator.

Let's see what the team think about this.

Source/LinqToDB/Linq/Builder/ChangeTypeExpressionBuilder.cs

Source/LinqToDB/Linq/Builder/EnumerableBuilder.cs

Co-authored-by: Stuart Turner <stuart@turner-isageek.com>

jods4 · 2023-10-30T16:13:02Z

@viceroypenguin Thanks for reviewing, it's a big changeset. 🙏

igor-tkachev · 2024-01-21T20:04:45Z

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

jods4 · 2024-05-03T23:03:39Z

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.
This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

@igor-tkachev

The main loop that was locking and I got rid of is method BuildSequence, which accounts for 3.9% of the time according to your benchmark.
Of course it's never gonna be an overall dominator compared to ADO.NET execution or visitors... but 3.9% is far from nothing, esp. considering how little work this is supposed to perform! Methods that account for >4% are very few in your trace.
I'd love to see a high-load scenario with many concurrent requests. The previous code had locking in it, which does not scale. Executing sequential tests or benchmarks does not reflect that.
Can you time old BuildSequence vs new BuildSequence? There's not a shadow of a doubt that the new one is much faster. Even if this is not a dominator for overall execution, why not take improvements?
It's not only faster, but in my opinion much simpler to understand. Old code had concurrent mutable state, used locks... tricky stuff.
The code generator project might also just come handy on its own in other parts of linq2db codebase.

What's your objection to merging this?
I don't see how it isn't an overall improvement in all aspects.

Sequence builder refactoring

662a7ac

jods4 requested review from igor-tkachev, sdanyliv, MaceWindu, viceroypenguin and Shane32 as code owners January 15, 2023 16:38

Fix build of targets < netstandard 2.0

f72bd7b

viceroypenguin reviewed Jan 15, 2023

View reviewed changes

Source/CodeGenerators/BuildersGenerator.cs Show resolved Hide resolved

viceroypenguin reviewed Jan 15, 2023

View reviewed changes

viceroypenguin approved these changes Jan 15, 2023

View reviewed changes

viceroypenguin changed the title ~~Improve sequenece building performance~~ Improve sequence building performance Jan 16, 2023

Merge branch 'refactor/parsing' into refactor/seq-builder

28fed86

MaceWindu added this to the 6.0.0 milestone Jan 17, 2023

igor-tkachev mentioned this pull request Mar 2, 2023

Metrics #4005

Merged

viceroypenguin reviewed Oct 30, 2023

View reviewed changes

Source/LinqToDB/Linq/Builder/ChangeTypeExpressionBuilder.cs Outdated Show resolved Hide resolved

viceroypenguin reviewed Oct 30, 2023

View reviewed changes

Source/LinqToDB/Linq/Builder/EnumerableBuilder.cs Outdated Show resolved Hide resolved

jods4 and others added 2 commits October 30, 2023 17:01

Accept PR suggestion

c2c0749

Co-authored-by: Stuart Turner <stuart@turner-isageek.com>

Accept PR suggestion

c638566

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve sequence building performance #3913

Improve sequence building performance #3913

jods4 commented Jan 15, 2023 •

edited

jods4 commented Jan 15, 2023

viceroypenguin Jan 15, 2023

jods4 Jan 16, 2023

MaceWindu commented Jan 17, 2023

jods4 commented Jan 17, 2023

igor-tkachev commented Feb 28, 2023

jods4 commented Mar 1, 2023

igor-tkachev commented Mar 2, 2023

viceroypenguin Oct 30, 2023

jods4 Oct 30, 2023

viceroypenguin Oct 30, 2023

jods4 Oct 30, 2023

viceroypenguin Oct 30, 2023

viceroypenguin Oct 30, 2023

viceroypenguin Oct 30, 2023

jods4 Oct 30, 2023

jods4 Oct 30, 2023

jods4 commented Oct 30, 2023

igor-tkachev commented Jan 21, 2024

Limitations of current design

jods4 commented May 3, 2024

Limitations of current design


		<!-- <EmitCompilerGeneratedFiles>true</EmitCompilerGeneratedFiles> -->

Improve sequence building performance #3913

Are you sure you want to change the base?

Improve sequence building performance #3913

Conversation

jods4 commented Jan 15, 2023 • edited

Limitations of current design

New strategy

Source generators

Code patterns

Special cases

Risks and breaking changes

jods4 commented Jan 15, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MaceWindu commented Jan 17, 2023

jods4 commented Jan 17, 2023

igor-tkachev commented Feb 28, 2023

jods4 commented Mar 1, 2023

igor-tkachev commented Mar 2, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jods4 commented Oct 30, 2023

igor-tkachev commented Jan 21, 2024

Limitations of current design

jods4 commented May 3, 2024

Limitations of current design

jods4 commented Jan 15, 2023 •

edited