Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sequence building performance #3913

Open
wants to merge 5 commits into
base: refactor/parsing
Choose a base branch
from

Conversation

jods4
Copy link
Contributor

@jods4 jods4 commented Jan 15, 2023

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant. If your query uses one of the last builders -- or simply the expression has no builder, which is sometimes a normal case -- you have to go through 70+ checks!! They are virtual calls, many builders repeatedly perform the same casts or checks, and not all builders perform quick, cheap tests. Of course this is only gonna get worse as we add new builders.

The current design attempts at improving performance by dynamically sorting the list to put the most used builders at the front.
This is a good idea but it has two drawbacks.

  1. It adds complexity for the book-keeping of how many times each builder is used. This is mutable concurrent code and requires locks. The amortized perf increases (probably) but the overall perf is worse.
  2. It is not that effective. The list is re-sorted only when the first item would change. I assume this is done to avoid constantly reordering the list if 2 builders fight for a specific spot (although that can happen for 1st place!!).
    So once the most used builder is found, nothing changes anymore, even if a builder that becomes commonly used is still way down the list at the moment.
    As I said before, "not finding a builder" is sometimes an expected use-case and this always go through the whole list.

New strategy

Finding a builder now starts with a big switch statement on NodeType (an int). C# optimizes this depending on the number and distribution of items, it could be a hard-coded binary search (ifs) or a single indirect jump based on a table.

Then a CanBuild method is called, but notice:

  1. At this stage, only 1 or 2 (rarely 3) candidates remain.
  2. CanBuild is now a static method. Short ones will be inlined, which is very likely as most have been simplified thanks to the fact that we know what node to expect when CanBuild is called. Most of them are as simple as => true, => call.IsQueryable() or => call.IsAsyncExtension()

One notable case is ExpressionType.Call because we have many (most) builders for this node type.
This is handled by a second switch, this time on MethodCallExpression.Method.Name, which C# will compile into a static hashtable. This is then handled as other node types. In 90% of cases there's only a single candidate left.

There are other benefits, such as clarity. The book-keeping and re-sorting code is gone.
Before it was difficult to see which builder handles what, which is particularly important as they must all be mutually exclusive!
Now you just need to peek at the switch tables to find what is handled, by what.

  • Did you know that 3 different builders handle the Join method?
  • I found a builder that is actually unused! I deleted CountBuilder (methods Count and LongCount are actually handled by AggregationBuilder).

Source generators

I'm not crazy and I didn't write the big switches by hand, that would be too error-prone.
Instead, I added a C# Source Generator to the project, that looks for attributes like BuildsExpression or BuildsMethodCall on builders and implements FindBuilderImpl automatically.

Note
Tips for source generators:

  • Consuming a generator in VS generally works nicely. VS will go to definitions in generated sources and you can also see the files in Solution Explorer, under References > Analyzers.
  • If you modify a generator your changes apply to Build immediately but as of the current version of VS 2022, IDE will continue to use the one it has loaded. The only way to refresh IDE after modifying a generator is to close and reload VS.
  • To help when working on a generator, you can use <EmitCompilerGeneratedFiles> in csproj to save files generated during build on disk (inside obj). I left this property, commented, in the csproj PR.

As generated code is not committed to git, I uploaded the source generated by this PR here for your convenience:
https://gist.github.com/jods4/6b145327f3c7563836c357984b723caf

Code patterns

// BuildsExpression is picked up by the source generator.
// It takes the list of ExpressionType this builder may handle
[BuildsExpression(ExpressionType.Lambda, ExpressionType.Constant)]
class MyBuilder : ISequenceGenerator
{
  // This static method will be called from generated code.
  // `expr` is guaranteed to be of the types specified in attributes.
  // Keep this short and efficient!
  public static bool CanBuild(Expression expr, BuildInfo info, ExpressionBuilder builder)
    => true;
}

// BuildsMethodCall is picked up by the source generator.
// It takes the list of method names that this builder may handle
[BuildsMethodCall("Take", "Skip")]
class MethodBuilder : ISequenceGenerator // or more commonly : MethodCallBuilder
{
  // This static method will be called from generated code.
  // `call` is guaranteed to be for a method whose name is amongst BuildsMethodCall arguments.
  // Keep this short and efficient!
  public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
    => true;
}

Special cases

A few builders may handle any method call. You can do this simply by using [BuildsExpression(ExpressionType.Call)], they will be probed after any method-specific builder [BuildsMethodCall].
Today this inculdes MethodChainBuilder, QueryExtensionBuilder, TableBuilder. Those builders tend to look for attributes on methods, e.g. [Extension] or [Table].

Some builders take advantage of having different, simpler checks based on what method or expression type matched.
You can have multiple attributes on a single builder, and use the named property CanBuildName to direct the check to a different method for some cases. Many sync/async builders do that, for example ContainsBuilder:

[BuildsMethodCall("Contains")]
[BuildsMethodCall("ContainsAsync", CanBuildName = nameof(CanBuildAsyncMethod))]
sealed class ContainsBuilder : MethodCallBuilder
{
	public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsQueryable() && call.Arguments.Count == 2;

	public static bool CanBuildAsyncMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsAsyncExtension() && call.Arguments.Count == 3;
}

ContextRefBuilder is unique: it seems that it could potentially handle any expression if BuilderInfo.Parent is null. For this case I added [BuildsAny]. Builders with this attribute will be probed after everything else failed.
The signature of the static method is CanBuild(BuildInfo info, ExpressionBuilder builder).

Risks and breaking changes

Everything is internal: there's no public API change, so no breaking change.

A lot of files were touched because there is 70+ builders but 95% of changes are very mechanical. I think there's still room for improvements but I wanted to keep this PR straightforward, as it's fairly large.
If all tests pass, I'd say the risk of regression is low.

@jods4
Copy link
Contributor Author

jods4 commented Jan 15, 2023

Some wcf remoting examples don't compile because of http in System.Net when targetting .net 4.8.
@sdanyliv I suppose that's already the case on your branch, as it has nothing to do with my changes.

@@ -6,14 +6,13 @@ namespace LinqToDB.Linq.Builder
using Reflection;
using LinqToDB.Expressions;

[BuildsMethodCall("Distinct", nameof(LinqExtensions.SelectDistinct))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like the inconsistency of strings and nameof. I'd say we (royal collective we) should pick one and stick with it, and this PR is a good chance to pick and update all to thath. I'd vote for nameof myself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this first draft I tried to stick to the style of existing code, but that's totally doable. 👍
At least for our own methods (e.g. LinqExtensions) I like the strong references.

It is worth noting:

  • Some references are to .net fx methods;
  • Sometimes a single name refers to multiple methods in different classes.

@viceroypenguin viceroypenguin changed the title Improve sequenece building performance Improve sequence building performance Jan 16, 2023
@MaceWindu MaceWindu added this to the 6.0.0 milestone Jan 17, 2023
@MaceWindu
Copy link
Contributor

I see PR is based on parser refactoring feature, so I assign it to v6 milestone

@jods4
Copy link
Contributor Author

jods4 commented Jan 17, 2023

@MaceWindu yes, this is on top of the parser refactoring.
I could have done it on top of v4, but I didn't want to have to merge with all @sdanyliv changes.

@igor-tkachev
Copy link
Member

Do you have an example that shows the problem? I would like to look at in performance profiler.

@jods4
Copy link
Contributor Author

jods4 commented Mar 1, 2023

@igor-tkachev if you want to see the worst query compilations, you basically need to use the builders that are close to the end of the list.

A good example would be some complex MERGE statement, because it has many parts and they are all close to the end of the list. For bonus points be sure to tag the query, as that new feature is 4th from last.

That said, be careful after the first query compilation because linq2db attempts to improve performance by sorting the list of potential builders, which means your query will now be vastly more performant on the 2nd run.
Priority sorting is a good idea but it's far from perfect in practice. The list is only sorted if the 1st element would change, so if one type of query dominates from very early on (e.g. SelectBuilder), then the rest won't be sorted even if you suddenly use a significant count of MergeBuilder queries. Also, if you do many different kind of queries, you'll still pay the price for the list scanning as they can't all be at the front.

So if you're doing benchmarking, maybe execute a bunch of SELECT before benchmarking some MERGE, to ensure the required builders stay close to the end of the list during your tests.

In fact, another performance issue that is a bit unlikely would be when 2 different builders fight for the top spot. Then the list would be constantly updated, which is another perf hit.

Finally, the current code uses locks, so if you compile many queries in parallel, contention might impact perf. I think that's unlikely to show up in simple benchmarks.

@igor-tkachev
Copy link
Member

It looks like we need some mechanism to collect metrics/statistics of our code and create a report at the end of the test run.. So, we will be able to compare different solutions.

@igor-tkachev igor-tkachev mentioned this pull request Mar 2, 2023
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite remember (it's been a while...) but I think it was necessary, maybe to avoid picking up properties from a parent directory.

Then again... maybe I added it to override some props, then removed them and kept the file.
If it makes no difference to remove it, we can. But it needs a quick test.

}
}

void GenerateCode(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for evaluation: would it be better to do this as written, or to use a templating system like Scriban? pros and cons to each, and no right answer for all situations, so worth evaluating here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to add external dependencies, even for a private tool, as I know this is usually avoided in linq2db.

That said, I use Scriban myself on other projects and I agree it's a nice templating syntax to generate source code. 👍
We can make the change if the rest of the team feels the same way.
Obviously, there's a small learning curve to get familiar with the syntax and using a plugin for syntax highlighting (I use VS Code, not sure if avail. in VS) is highly recommended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public dependencies are usually avoided. Build dependencies I think are fine. The point is to minimize what dependencies we require consumers to assume.

No syntax coloring in VS, unfortunately. I've not found that to be a significant issue for me.

Comment on lines +9 to +10

<!-- <EmitCompilerGeneratedFiles>true</EmitCompilerGeneratedFiles> -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove debugging code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of removing, does it make sense to emit them to a place where they can be added to source control, so that changes can be reviewed as part of future PRs? This is what I do in SuperLinq to ensure that any changes to the SG or it's output are able to evaluated as part of the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it on purpose, because it's a very useful property to know about when debugging the code generator and I doubt we all know its exact name.
But we can remove it if you prefer, obviously it's just a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other projects I also like to commit generated code somewhere (if not as the main compilation input), because as you said it makes it easy to diff changes when updating the generator.

Let's see what the team think about this.

jods4 and others added 2 commits October 30, 2023 17:01
Co-authored-by: Stuart Turner <stuart@turner-isageek.com>
@jods4
Copy link
Contributor Author

jods4 commented Oct 30, 2023

@viceroypenguin Thanks for reviewing, it's a big changeset. 🙏

@igor-tkachev
Copy link
Member

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

@jods4
Copy link
Contributor Author

jods4 commented May 3, 2024

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.
This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

@igor-tkachev

  • The main loop that was locking and I got rid of is method BuildSequence, which accounts for 3.9% of the time according to your benchmark.
  • Of course it's never gonna be an overall dominator compared to ADO.NET execution or visitors... but 3.9% is far from nothing, esp. considering how little work this is supposed to perform! Methods that account for >4% are very few in your trace.
  • I'd love to see a high-load scenario with many concurrent requests. The previous code had locking in it, which does not scale. Executing sequential tests or benchmarks does not reflect that.
  • Can you time old BuildSequence vs new BuildSequence? There's not a shadow of a doubt that the new one is much faster. Even if this is not a dominator for overall execution, why not take improvements?
  • It's not only faster, but in my opinion much simpler to understand. Old code had concurrent mutable state, used locks... tricky stuff.
  • The code generator project might also just come handy on its own in other parts of linq2db codebase.

What's your objection to merging this?
I don't see how it isn't an overall improvement in all aspects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants