// Program.cs
/**
Meta-programming and policy-based design in C# using JIT dead code removal and inlining
=======================================================================================
Abstract
--------
This is a way to enable compile-time customization of classes/functions in the style of C++ template meta-programming, as in [Modern C++ Design](https://en.wikipedia.org/wiki/Modern_C%2B%2B_Design).
In particular, we are going to implement the [policy pattern](https://en.wikipedia.org/wiki/Modern_C%2B%2B_Design#Policy-based_design), which is a compile-time version of the [strategy pattern](https://en.wikipedia.org/wiki/Strategy_pattern).
What do we gain by constraining ourselves to compile-time customization instead of run-time customization? High performance. Blazingly high performance. You gain an abstraction without paying for it.
Too good to be true? Read on!
BTW1: none of this is new. It has been floating around in various forms. But I have never seen it explained fully and connected to the policy pattern.
BTW2: it is also related to the uber cool [Shape proposal for C#](https://github.com/dotnet/csharplang/issues/164), where it becomes an implementation detail.
Implementation
--------------
First, the usual plethora of namespaces ... BenchmarkDotNet is really heavy into separating abstractions in different namespaces ...
**/
using System.Runtime.CompilerServices;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Diagnostics.Windows.Configs;
using BenchmarkDotNet.Mathematics;
using BenchmarkDotNet.Order;
using BenchmarkDotNet.Running;
/**
Let's take a look at the strategy pattern as it is more traditionally implemented. You have an interface `IIncrementer` and two class implementations, one thread safe and one not.
Let's also add a couple of struct implementations, so that we can measure the performance difference between implementing by class and by struct.
**/
public interface IIncrementer { void Increment(ref int location); }
public sealed class CStandardIncrementer : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public sealed class CInterlockedIncrementer : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }
public readonly struct SStandardIncrementer : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => location += 1; }
public readonly struct SInterlockedIncrementer : IIncrementer {[MethodImpl(MethodImplOptions.AggressiveInlining)] public void Increment(ref int location) => Interlocked.Increment(ref location); }
/**
We then need a class that can be customized by an Incrementer. Think of it as: the policy of incrementing something is independent from the class in question.
Let's take a Counter and call it Dynamic as we want to be able to customize it at runtime. We need to keep things simple so that looking at ASM is doable.
Also we inline everything to try to squeeze the most performance out of our program.
**/
public class DynamicCounter<T> where T: IIncrementer
{
int _count;
T _incrementer;
public DynamicCounter(T incrementer) => _incrementer = incrementer;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Increment() => _incrementer.Increment(ref _count);
}
/**
Then we look at how to implement the strategy pattern at compile time (transforming it magically into the policy pattern).
(no one ever talks about the negative aspects of giving names to things, one of these days I'll blog about it ...)
There are many ways to go about it. They all rely on the fact that the JIT compiler generates a separate instantiation of a generic type for each value type used as `T` (reference types, instead, all share one instantiation).
For each of these instantiations, the JIT knows at compile time what `T` is, which enables certain optimizations.
The optimization exploited by `StaticCounterInterface` is that the call to `Increment` becomes a non-virtual call that can be inlined.
**/
public class StaticCounterInterface<T> where T : struct, IIncrementer
{
int _count;
T _incrementer = new T();
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Increment() => _incrementer.Increment(ref _count);
}
/**
For `StaticCounterMatch` the situation is even more obvious.
The Jitter doesn't generate code for any of the `if` statements. It just puts the right code for the type `T` directly into the body of the `Increment` method.
It is as if the `if` statements were executed at compile time, as with C++ templates. Also notice the `IncrementRaw` method, used for perf comparison.
**/
public class StaticCounterMatch<T>
{
int _count;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Increment() {
if (typeof(T) == typeof(SStandardIncrementer)) _count += 1;
if (typeof(T) == typeof(SInterlockedIncrementer)) Interlocked.Increment(ref _count);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void IncrementRaw() => _count += 1;
}
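/**
The `typeof` trick is not specific to counters; it works for any policy. Here is a minimal sketch (the `SWrapping`/`SSaturating` policy types are hypothetical, not part of this benchmark) of an adder whose overflow behavior is selected at compile time; the JIT keeps only the branch whose `typeof` test is true for each instantiation:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical policy types: empty structs used purely as compile-time tags.
public readonly struct SWrapping { }
public readonly struct SSaturating { }

// For each T, the JIT erases the untaken branches from the generated code.
public class Adder<T>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int Add(int a, int b)
    {
        if (typeof(T) == typeof(SSaturating))
        {
            // Widen to long so the overflow is observable, then clamp.
            long r = (long)a + b;
            if (r > int.MaxValue) return int.MaxValue;
            if (r < int.MinValue) return int.MinValue;
            return (int)r;
        }
        // Default / SWrapping: two's-complement wrap-around.
        return unchecked(a + b);
    }
}

public static class AdderDemo
{
    public static void Main()
    {
        Console.WriteLine(new Adder<SWrapping>().Add(int.MaxValue, 1));   // wraps to int.MinValue
        Console.WriteLine(new Adder<SSaturating>().Add(int.MaxValue, 1)); // clamps to int.MaxValue
    }
}
```

Each closed type (`Adder<SWrapping>`, `Adder<SSaturating>`) ends up with a straight-line `Add` body, just as `StaticCounterMatch` does above.
**/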
/**
We are now ready for the performance testing. We create one field for each combination and benchmark their `Increment` methods.
A few comments on the attributes:
1. `DryCoreJob` doesn't run the benchmark; it just runs the diagnosers, in this case producing assembly code.
2. `InProcessAttribute` makes everything run faster, but cannot be used when generating assembly.
3. `DisassemblyDiagnoser` creates files with the assembly code.
4. `RankColumn` generates a nicer table.
**/
//[DryCoreJob]
//[InProcessAttribute]
[DisassemblyDiagnoser(printAsm: true, printPrologAndEpilog: true, printIL: false, printSource: false, recursiveDepth: 3)]
[Orderer(SummaryOrderPolicy.FastestToSlowest)]
[RankColumn(NumeralSystem.Stars)]
public class MainClass
{
DynamicCounter<IIncrementer> dynInterface = new DynamicCounter<IIncrementer>(new CStandardIncrementer());
DynamicCounter<CStandardIncrementer> dynConcrete = new DynamicCounter<CStandardIncrementer>(new CStandardIncrementer());
DynamicCounter<SStandardIncrementer> dynStruct = new DynamicCounter<SStandardIncrementer>(new SStandardIncrementer());
StaticCounterMatch<SStandardIncrementer> staticCounterM = new StaticCounterMatch<SStandardIncrementer>();
StaticCounterInterface<SStandardIncrementer> staticCounterI = new StaticCounterInterface<SStandardIncrementer>();
[Benchmark] public void DynamicInterface() => dynInterface.Increment();
[Benchmark] public void DynamicConcrete() => dynConcrete.Increment();
[Benchmark] public void DynamicStruct() => dynStruct.Increment();
[Benchmark] public void StaticCounterInterface() => staticCounterI.Increment();
[Benchmark] public void StaticCounterMatch() => staticCounterM.Increment();
[Benchmark(Baseline = true)] public void IncrementRaw() => staticCounterM.IncrementRaw();
public static void Main() => BenchmarkRunner.Run<MainClass>();
}
/**
Results
-------
Please note that the results are strictly valid just for the tested configuration.
I have no reason to think they would differ much on other modern runtimes/OSes, as the optimizations involved are well known.
``` ini
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
[Host] : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
```
| Method | Mean | Error | StdDev | Median | P95 | Ratio | RatioSD | Rank |
|----------------------- |----------:|----------:|----------:|----------:|----------:|------:|--------:|-----:|
| StaticCounterInterface | 0.0266 ns | 0.0308 ns | 0.0649 ns | 0.0000 ns | 0.1710 ns | 0.02 | 0.04 | * |
| StaticCounterMatch | 0.0345 ns | 0.0473 ns | 0.1115 ns | 0.0000 ns | 0.3633 ns | 0.62 | 0.92 | * |
| DynamicStruct | 0.0466 ns | 0.0598 ns | 0.1433 ns | 0.0000 ns | 0.4790 ns | 1.02 | 1.07 | * |
| IncrementRaw | 0.2457 ns | 0.0160 ns | 0.0125 ns | 0.2404 ns | 0.2691 ns | 1.00 | 0.00 | ** |
| DynamicInterface | 1.6319 ns | 0.0741 ns | 0.1410 ns | 1.6309 ns | 1.8217 ns | 7.19 | 0.62 | *** |
| DynamicConcrete | 1.7630 ns | 0.0740 ns | 0.0656 ns | 1.7481 ns | 1.8628 ns | 7.20 | 0.49 | **** |
As expected, you gain an order of magnitude in performance by forgoing run-time customization, except when using a `struct`, as the optimizer manages to inline that one too.
Notice that these numbers are really low. In fact, the order of the first four rows might change when you run it. But they are always much faster than the rest.
But why? Let's look at the generated code.
IL and ASM
----------
First let's look at IL for a few of the methods
~~~
; MainClass.IncrementRaw()
IL_0000: ldarg.0
IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::IncrementRaw()
IL_000b: ret
; MainClass.StaticCounterInterface()
IL_0000: ldarg.0
IL_0001: ldfld StaticCounterInterface`1<SStandardIncrementer> MainClass::staticCounterI
IL_0006: callvirt System.Void StaticCounterInterface`1<SStandardIncrementer>::Increment()
IL_000b: ret
; MainClass.StaticCounterMatch()
IL_0000: ldarg.0
IL_0001: ldfld StaticCounterMatch`1<SStandardIncrementer> MainClass::staticCounterM
IL_0006: callvirt System.Void StaticCounterMatch`1<SStandardIncrementer>::Increment()
IL_000b: ret
~~~
Ok, nothing interesting there, apart from the use of `callvirt` where you would think a standard `call` would do (e.g. for `IncrementRaw`, which is neither virtual nor on an interface).
I vaguely remember from my C# compiler days that we do that to get the test for null for free, as `callvirt` performs it automatically.
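You can see that null check in action with a minimal sketch (the `Greeter` type is hypothetical, not part of the benchmark): the method is neither virtual nor on an interface, yet invoking it through a null reference still throws before the body runs, because the compiler emitted `callvirt`:

```csharp
using System;

public sealed class Greeter
{
    // Non-virtual instance method; the C# compiler still emits `callvirt` for it.
    public string Hello() => "hello";
}

public static class CallvirtDemo
{
    public static void Main()
    {
        Greeter g = null;
        try
        {
            g.Hello(); // `callvirt` checks the receiver for null and throws here
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("null check fired");
        }
    }
}
```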
The assembly code is more interesting. Let's start from the three fast-resolved-at-compile-time methods.
BTW: remember that looking at optimized ASM code is like peering into a muddy lake with foggy glasses. Let's do it.
~~~
; MainClass.IncrementRaw()
mov rax,qword ptr [rcx+20h]
inc dword ptr [rax+8]
ret
; MainClass.StaticCounterInterface()
mov rax,qword ptr [rcx+28h]
(*)mov edx,dword ptr [rax]
add rax,8
inc dword ptr [rax]
ret
; MainClass.StaticCounterMatch()
mov rax,qword ptr [rcx+20h]
(*)mov edx,dword ptr [rax]
inc dword ptr [rax+8]
ret
~~~
Yep, they are all the same (apart from the mysterious `*` instructions)! Get the memory location of the field, increment it.
The Jitter has completely inlined the code. It is as if you had written the incrementing code directly into the function, despite composing the type out of two independent abstractions.
I think that is pretty cool! You can abstract your code properly and not pay the price for it (well, apart from one assembly instruction).
For the sake of completeness, let's look at the assembly code for the dynamic dispatching cases.
~~~
; MainClass.DynamicConcrete()
mov rcx,qword ptr [rcx+10h]
cmp dword ptr [rcx],ecx
lea rdx,[rcx+10h]
mov rcx,qword ptr [rcx+8]
mov r11,7FF8C2B30470h
mov rax,qword ptr [r11]
cmp dword ptr [rcx],ecx
jmp rax
; MainClass.DynamicInterface()
mov rcx,qword ptr [rcx+8]
cmp dword ptr [rcx],ecx
lea rdx,[rcx+10h]
mov rcx,qword ptr [rcx+8]
mov r11,7FF8C2B10470h
mov rax,qword ptr [r11]
cmp dword ptr [rcx],ecx
jmp rax
~~~
The first thing to notice is that the two bodies are identical, despite the seemingly different declarations. The Jitter doesn't care: for reference type arguments it emits one shared instantiation, so both go down the same dispatch path.
Notice the machinations the Jitter performs, very likely related to dynamic dispatching, to compute the address to finally jump to. That's where our `Increment` method lives.
No wonder it is slower.
Summary
-------
If you can afford to use the policy pattern instead of the more general strategy pattern (i.e. compile-time instead of run-time dispatch) and/or you need bare-metal performance, consider the code above.
As for me, I plan to use it in the near future for a few low level abstractions (e.g. memory allocators).
**/