New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call benchmark method directly #2334
base: master
Are you sure you want to change the base?
Conversation
43367bf
to
b262565
Compare
With this PR, I am seeing what looks like more accurate results in the default toolchain, but the Looking at the logs, it appears that overhead is measured at more time. @ig-sinicyn Any ideas? Master:
PR:
|
This comment was marked as outdated.
This comment was marked as outdated.
I reverted the ClrMd disassembler back to v1 on my local so I could actually inspect the asm. The only difference I see that might affect the result is this. Default toolchain call qword ptr [BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead()] InProcessEmit call BenchmarkDotNet.Autogenerated.Runnable_0.__Overhead() The IL is exactly the same for those calls, so it seems the JIT treats IL emit slightly different. I don't have any asm knowledge to know what effect that difference makes, but it seems that |
The assembly issue with InProcessEmit is only in net7+. The overhead measurement is off by about 2-3 clock cycles, which isn't far off from the current measurement in all toolchains. I don't think it should block this from being merged. |
@AndreyAkinshin I would also like to get this in v0.14.0 if you don't mind (followed by #2336). These 2 PRs will likely change the results of long-term measurements for higher accuracy (like dotnet/performance). |
tests/BenchmarkDotNet.IntegrationTests.ManualRunning/ExpectedBenchmarkResultsTests.cs
Outdated
Show resolved
Hide resolved
1982b8a
to
6ba4993
Compare
@timcassell could you please rebase on master one more time? I introduced a bug in Perfolizer 0.3.16 that was fixed in 0.3.17. I just pushed Perfolizer 0.3.17 to BenchmarkDotNet master. |
It seems I found a problem. Let's consider the following environment: BenchmarkDotNet v0.13.13-develop (2024-03-11), Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI The original benchmark was extended to the following form: Source codepublic class Program
{
public static void Main() => BenchmarkRunner.Run<OverheadTests>();
}
[DisassemblyDiagnoser]
public class OverheadTests
{
private int _field;
[Benchmark]
public void Increment01()
{
_field++;
}
[Benchmark]
public void Increment02()
{
_field++;
_field++;
}
[Benchmark]
public void Increment03()
{
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment04()
{
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment05()
{
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment06()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment07()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment08()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment09()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment10()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
[Benchmark]
public void Increment20()
{
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
_field++;
}
} Here is the generated assembly: Assembly.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment01()
inc dword ptr [rdi+8]
ret
; Total bytes of code 4 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment02()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 14 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment03()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 19 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment04()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 24 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment05()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 29 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment06()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 34 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment07()
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
ret
; Total bytes of code 39 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment08()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 49 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment09()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 54 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment10()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 59 .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI; BenchmarkDotNet.Samples.OverheadTests.Increment20()
push rbp
mov rbp,rsp
mov eax,[rdi+8]
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
inc eax
mov [rdi+8],eax
pop rbp
ret
; Total bytes of code 109 An interesting observation: since push rbp
mov rbp,rsp
...
pop rbp Here are my results with the latest | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0282 ns | 0.0000 ns | 0.0000 ns | 4 B |
| Increment02 | 0.0329 ns | 0.0000 ns | 0.0000 ns | 14 B |
| Increment03 | 0.0339 ns | 0.0005 ns | 0.0004 ns | 19 B |
| Increment04 | 0.0000 ns | 0.0000 ns | 0.0000 ns | 24 B |
| Increment05 | 0.1551 ns | 0.0001 ns | 0.0001 ns | 29 B |
| Increment06 | 0.1588 ns | 0.0007 ns | 0.0006 ns | 34 B |
| Increment07 | 0.3427 ns | 0.0022 ns | 0.0021 ns | 39 B |
| Increment08 | 0.5363 ns | 0.0002 ns | 0.0002 ns | 49 B |
| Increment09 | 0.7391 ns | 0.0005 ns | 0.0005 ns | 54 B |
| Increment10 | 0.9274 ns | 0.0015 ns | 0.0013 ns | 59 B |
| Increment20 | 2.7696 ns | 0.0004 ns | 0.0004 ns | 109 B | The results are quite consistent, stable, and reproducible. For Now let's run the same set of benchmarks using BenchmarkDotNet from this PR: | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0113 ns | 0.0061 ns | 0.0057 ns | 4 B |
| Increment02 | 0.0073 ns | 0.0001 ns | 0.0000 ns | 14 B |
| Increment03 | 0.0030 ns | 0.0000 ns | 0.0000 ns | 19 B |
| Increment04 | 0.0145 ns | 0.0001 ns | 0.0001 ns | 24 B |
| Increment05 | 0.0175 ns | 0.0001 ns | 0.0001 ns | 29 B |
| Increment06 | 0.0251 ns | 0.0017 ns | 0.0016 ns | 34 B |
| Increment07 | 0.5500 ns | 0.0014 ns | 0.0013 ns | 39 B |
| Increment08 | 0.6812 ns | 0.0442 ns | 0.1007 ns | 49 B |
| Increment09 | 0.3456 ns | 0.0001 ns | 0.0001 ns | 54 B |
| Increment10 | 0.3746 ns | 0.0035 ns | 0.0033 ns | 59 B |
| Increment20 | 2.2034 ns | 0.0020 ns | 0.0018 ns | 109 B | Observations:
While the "correct" results are a controversial thing in this case, the non-monotonic Mean column definitely feels wrong, and it's a clear regression compared to the master. These results are also reproducible on my machine: I'm ready to collect any additional diagnostic info if needed. |
@AndreyAkinshin Unfortunately I don't have a Ryzen cpu to run benchmarks on, but I ran those benchmarks again on both of my machines and got results that look mostly good (the only outlier is the drop from inc3 to inc4 on Intel on both branches). It must be a cpu architectural reason for those results. Master
PR
No assembly for the Intel chip (no support for MacOS), but assembly for the old AMD chip: Assembly
.NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment01()
inc dword ptr [rcx+8]
ret
; Total bytes of code 4 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment02()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 14 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment03()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 19 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment04()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 24 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment05()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 29 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment06()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 34 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment07()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 39 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment08()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 44 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment09()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 49 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment10()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 54 .NET 8.0.2 (8.0.224.6711), X64 RyuJIT SSE3; ConsoleApp1.OverheadTests.Increment20()
mov eax,[rcx+8]
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
inc eax
mov [rcx+8],eax
ret
; Total bytes of code 104 I see no logical reason for there to be a flaw with wrapping the call in a |
@AndreyAkinshin I know it'll spoil the "instant" results, but I wonder what results you will get if you make the field |
@AndreyAkinshin Also, can you disassemble |
| Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0272 ns | 0.0005 ns | 0.0005 ns | 4 B |
| Increment02 | 0.0138 ns | 0.0151 ns | 0.0141 ns | 7 B |
| Increment03 | 0.0326 ns | 0.0000 ns | 0.0000 ns | 10 B |
| Increment04 | 0.0632 ns | 0.0018 ns | 0.0015 ns | 13 B |
| Increment05 | 0.1947 ns | 0.0001 ns | 0.0001 ns | 16 B |
| Increment06 | 0.4581 ns | 0.0017 ns | 0.0014 ns | 24 B |
| Increment07 | 0.5812 ns | 0.0037 ns | 0.0033 ns | 27 B |
| Increment08 | 0.8946 ns | 0.0183 ns | 0.0171 ns | 30 B |
| Increment09 | 0.9137 ns | 0.0008 ns | 0.0007 ns | 33 B |
| Increment10 | 1.0144 ns | 0.0023 ns | 0.0022 ns | 36 B |
| Increment20 | 3.1494 ns | 0.0025 ns | 0.0023 ns | 66 B | PR: | Method | Mean | Error | StdDev | Code Size |
|------------ |----------:|----------:|----------:|----------:|
| Increment01 | 0.0103 ns | 0.0018 ns | 0.0016 ns | 4 B |
| Increment02 | 0.0068 ns | 0.0022 ns | 0.0021 ns | 7 B |
| Increment03 | 0.0148 ns | 0.0001 ns | 0.0001 ns | 10 B |
| Increment04 | 0.0147 ns | 0.0000 ns | 0.0000 ns | 13 B |
| Increment05 | 0.0186 ns | 0.0015 ns | 0.0014 ns | 16 B |
| Increment06 | 0.0418 ns | 0.0003 ns | 0.0003 ns | 24 B |
| Increment07 | 0.2448 ns | 0.0010 ns | 0.0009 ns | 27 B |
| Increment08 | 0.5236 ns | 0.0049 ns | 0.0046 ns | 30 B |
| Increment09 | 0.5196 ns | 0.0183 ns | 0.0171 ns | 33 B |
| Increment10 | 0.6923 ns | 0.0062 ns | 0.0058 ns | 36 B |
| Increment20 | 2.7928 ns | 0.0044 ns | 0.0041 ns | 66 B | |
Could you please remind me what is the easiest way to do this on Linux nowadays? |
You can use config.AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(filters: ["*Workload*", "*Overhead*"]))) |
Well those results look more stable (almost a consistent increase after inc6). It looks like almost a constant time of 0.4ns was subtracted from your master results. If your cpu is at 5ghz, that's 2 clock cycles. That's almost exactly the same as what I see with the InProcessEmitToolchain on my older machine. Will be curious to see if the assembly shows it. It's also interesting that adding |
Other things to check out: Results with |
Status update: measurements are in progress (I want to collect a comprehensive set of summary tables and share them at once) |
Fixes #1133
This wraps the workload call with a
NoInlining | NoOptimization
method instead of a delegate.Mac Intel x64 results
Master
This PR
Windows AMD x64 results
Master
This PR