Try to reduce cost of Async #101605

Merged: MichalStrehovsky merged 2 commits into dotnet:main from the async branch on May 2, 2024

MichalStrehovsky
Member

Contributes to #79204.

I don't know if this will be considered mergeable, but I wanted to at least try something. For apps that use async a lot (like the Stage2 app we use for Goldilocks), the async infrastructure can cost 10% of the entire executable.

By shuffling a couple of things in GetStateMachineBox, I was able to get a 0.23% saving. It's minuscule. In general, async is death by a thousand papercuts, so I don't see a silver bullet.

Existing assembly:
    TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>:
00000001`40983370 4156               push    r14
00000001`40983372 57                 push    rdi
00000001`40983373 56                 push    rsi
00000001`40983374 55                 push    rbp
00000001`40983375 53                 push    rbx
00000001`40983376 4883ec20           sub     rsp, 20h
00000001`4098337a 488bf1             mov     rsi, stateMachine (rcx)
00000001`4098337d 488bda             mov     rbx, taskField (rdx)
00000001`40983380 e8eb68aaff         call    TodosApi!System.Threading.ExecutionContext__Capture (140429c70)
00000001`40983385 488be8             mov     rbp, rax
00000001`40983388 488b3b             mov     rdi, qword ptr [taskField (rbx)]
00000001`4098338b 4885ff             test    rdi, rdi
00000001`4098338e 742c               je      TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x4c (1409833bc)
00000001`40983390 488d0df93f3e00     lea     rcx, [TodosApi!??_7System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1_AsyncStateMachineBox`1<System.Threading.Tasks.VoidTaskResult__S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>@@6B@ (140d67390)]
00000001`40983397 48390f             cmp     qword ptr [rdi], rcx
00000001`4098339a 7520               jne     TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x4c (1409833bc)
00000001`4098339c 48396f10           cmp     qword ptr [stronglyTypedBox (rdi)+10h], currentContext (rbp)
00000001`409833a0 740c               je      TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x3e (1409833ae)
00000001`409833a2 488d4f10           lea     rcx, [stronglyTypedBox (rdi)+10h]
00000001`409833a6 488bd5             mov     rdx, rbp
00000001`409833a9 e8921a69ff         call    TodosApi!RhpAssignRefAVLocation (140014e40)
00000001`409833ae 488bc7             mov     rax, rdi
00000001`409833b1 4883c420           add     rsp, 20h
00000001`409833b5 5b                 pop     rbx
00000001`409833b6 5d                 pop     rbp
00000001`409833b7 5e                 pop     rsi
00000001`409833b8 5f                 pop     rdi
00000001`409833b9 415e               pop     r14
00000001`409833bb c3                 ret     
00000001`409833bc 4c8b33             mov     r14, qword ptr [taskField (rbx)]
00000001`409833bf 4d85f6             test    r14, r14
00000001`409833c2 7461               je      TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0xb5 (140983425)
00000001`409833c4 488d0d7d193e00     lea     rcx, [TodosApi!??_7System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1_AsyncStateMachineBox`1<System.Threading.Tasks.VoidTaskResult__System.Runtime.CompilerServices.IAsyncStateMachine>@@6B@ (140d64d48)]
00000001`409833cb 49390e             cmp     qword ptr [r14], rcx
00000001`409833ce 7555               jne     TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0xb5 (140983425)
00000001`409833d0 49837e4000         cmp     qword ptr [weaklyTypedBox (r14)+40h], 0
00000001`409833d5 7534               jne     TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x9b (14098340b)
00000001`409833d7 488d0d1a622a00     lea     rcx, [TodosApi!Boxed_S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100::`vftable' (140c295f8)]
00000001`409833de e84d1869ff         call    TodosApi!RhpNewFast (140014c30)
00000001`409833e3 488bd0             mov     rdx, rax
00000001`409833e6 488d7a08           lea     rdi, [rdx+8]
00000001`409833ea e8411c69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`409833ef e83c1c69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`409833f4 48a5               movs    qword ptr [rdi], qword ptr [rsi]
00000001`409833f6 e8351c69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`409833fb e8301c69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`40983400 48a5               movs    qword ptr [rdi], qword ptr [rsi]
00000001`40983402 498d4e40           lea     rcx, [weaklyTypedBox (r14)+40h]
00000001`40983406 e8351a69ff         call    TodosApi!RhpAssignRefAVLocation (140014e40)
00000001`4098340b 498d4e10           lea     rcx, [weaklyTypedBox (r14)+10h]
00000001`4098340f 488bd5             mov     rdx, rbp
00000001`40983412 e8291a69ff         call    TodosApi!RhpAssignRefAVLocation (140014e40)
00000001`40983417 498bc6             mov     rax, r14
00000001`4098341a 4883c420           add     rsp, 20h
00000001`4098341e 5b                 pop     rbx
00000001`4098341f 5d                 pop     rbp
00000001`40983420 5e                 pop     rsi
00000001`40983421 5f                 pop     rdi
00000001`40983422 415e               pop     r14
00000001`40983424 c3                 ret     
00000001`40983425 488d0d643f3e00     lea     rcx, [TodosApi!??_7System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1_AsyncStateMachineBox`1<System.Threading.Tasks.VoidTaskResult__S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>@@6B@ (140d67390)]
00000001`4098342c e8ff1769ff         call    TodosApi!RhpNewFast (140014c30)
00000001`40983431 4c8bf0             mov     r14, rax
00000001`40983434 41c7463400040002   mov     dword ptr [r14+34h], 2000400h
00000001`4098343c 41814e3400080000   or      dword ptr [r14+34h], 800h
00000001`40983444 488bcb             mov     rcx, rbx
00000001`40983447 498bd6             mov     rdx, r14
00000001`4098344a e8611a69ff         call    TodosApi!RhpCheckedAssignRefAVLocation (140014eb0)
00000001`4098344f 498d7e40           lea     rdi, [r14+40h]
00000001`40983453 e8d81b69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`40983458 e8d31b69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`4098345d 48a5               movs    qword ptr [rdi], qword ptr [rsi]
00000001`4098345f e8cc1b69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`40983464 e8c71b69ff         call    TodosApi!RhpByRefAssignRefAVLocation1 (140015030)
00000001`40983469 48a5               movs    qword ptr [rdi], qword ptr [rsi]
00000001`4098346b 498d4e10           lea     rcx, [r14+10h]
00000001`4098346f 488bd5             mov     rdx, rbp
00000001`40983472 e8c91969ff         call    TodosApi!RhpAssignRefAVLocation (140014e40)
00000001`40983477 488d0dea6f1f00     lea     rcx, [TodosApi!?__NONGCSTATICS@System.Threading.Tasks.TplEventSource@@ (140b7a468)]
00000001`4098347e 488379f800         cmp     qword ptr [rcx-8], 0
00000001`40983483 7559               jne     TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x16e (1409834de)
00000001`40983485 488b0d94b0c400     mov     rcx, qword ptr [TodosApi!?__GCSTATICS@System.Threading.Tasks.TplEventSource@@ (1415ce520)]
00000001`4098348c 488b5908           mov     rbx, qword ptr [rcx+8]
00000001`40983490 80bb9d00000000     cmp     byte ptr [rbx+9Dh], 0
00000001`40983497 7437               je      TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x160 (1409834d0)
00000001`40983499 498bce             mov     rcx, r14
00000001`4098349c e83f27abff         call    TodosApi!System.Threading.Tasks.Task__get_Id (140435be0)
00000001`409834a1 8bf0               mov     esi, eax
00000001`409834a3 488d0de68c1a00     lea     rcx, [TodosApi!__RuntimeType_S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100 (140b2c190)]
00000001`409834aa e85161a4ff         call    TodosApi!System.RuntimeType__get_Name (1403c9600)
00000001`409834af 488bd0             mov     rdx, rax
00000001`409834b2 488d0dff5c0e00     lea     rcx, [TodosApi!__Str_Async___4E51D638FE527E2B0672DE04B6E5F6B84BA81E64E12AA0136B0691A50E6F80FE (140a691b8)]
00000001`409834b9 e89285a4ff         call    TodosApi!String__Concat_6 (1403cba50)
00000001`409834be 4c8bc0             mov     r8, rax
00000001`409834c1 488bcb             mov     rcx, rbx
00000001`409834c4 8bd6               mov     edx, esi
00000001`409834c6 4533c9             xor     r9d, r9d
00000001`409834c9 3909               cmp     dword ptr [rcx], ecx
00000001`409834cb e8608cabff         call    TodosApi!System.Threading.Tasks.TplEventSource__TraceOperationBegin (14043c130)
00000001`409834d0 498bc6             mov     rax, r14
00000001`409834d3 4883c420           add     rsp, 20h
00000001`409834d7 5b                 pop     rbx
00000001`409834d8 5d                 pop     rbp
00000001`409834d9 5e                 pop     rsi
00000001`409834da 5f                 pop     rdi
00000001`409834db 415e               pop     r14
00000001`409834dd c3                 ret     
00000001`409834de e86a0f68ff         call    TodosApi!__GetGCStaticBase_System.Threading.Tasks.TplEventSource (14000444d)
00000001`409834e3 eba0               jmp     TodosApi!System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<System.Threading.Tasks.VoidTaskResult>__GetStateMachineBox<S_P_Xml_System_Xml_XmlUtf8RawTextWriter__WriteEntityRefAsync_d__100>+0x115 (140983485)

This addresses the following problems:

  • Multiple expensive epilogs
  • Copying the state machine is costly because they tend to be huge and are a mix of reference fields (need write barriers) and value fields. We need to copy twice - one is an assignment, the other is a box. See clusters of RhpByRefAssignRefAVLocation1 - that's the write barrier. I'm making the box take a less efficient path based on comments.
  • Logging is constructing a string that doesn't really need to be in generic code.

This is just a couple dozen bytes in savings per method, but ends up saving 50 kB for the whole app because that's how many specializations of the method we have.
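The third point above (moving string construction out of generic code) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `AsyncTracingSketch`, `TracingEnabled`, `LastMessage`, `LogOperationBegin`, and `DemoStateMachine` are hypothetical names, and the `"Async: "` prefix is an assumption. The idea is that the generic method keeps only a type handle load and a call, while the string concatenation lives in one non-generic, non-inlined helper that is compiled once no matter how many specializations exist.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical state machine type, used only to instantiate the generic method.
struct DemoStateMachine
{
    public int State;
}

static class AsyncTracingSketch
{
    // Hypothetical stand-ins for the event source check and its output.
    public static bool TracingEnabled;
    public static string LastMessage = "";

    // Generic: under AOT this body is compiled once per TStateMachine.
    public static object GetBox<TStateMachine>(ref TStateMachine stateMachine)
        where TStateMachine : struct
    {
        object box = stateMachine; // boxing copy of the struct

        if (TracingEnabled)
        {
            // Only a type handle and a call remain in the generic body.
            LogOperationBegin(typeof(TStateMachine));
        }

        return box;
    }

    // Non-generic and not inlined: the string work is compiled exactly once.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void LogOperationBegin(Type stateMachineType)
    {
        LastMessage = "Async: " + stateMachineType.Name;
    }
}
```

With this shape, each specialization pays for one call instead of the `get_Name` and `Concat` sequence visible near the end of the assembly listing.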

Cc @stephentoub @jkotas for thoughts.

@jkotas
Member

jkotas commented Apr 26, 2024

> Copying the state machine is costly because they tend to be huge and are a mix of reference fields (need write barriers) and value fields

That's because RyuJIT inlines the box helper for large structs. I think this should be fixed in the JIT.

Outlining of the event tracing helper looks good.

@@ -180,64 +181,67 @@ public struct AsyncTaskMethodBuilder<TResult>
 // result in a boxing allocation when storing the TStateMachine if it's a struct, but
 // this only happens in active debugging scenarios where such performance impact doesn't
 // matter.
-if (taskField is AsyncStateMachineBox<IAsyncStateMachine> weaklyTypedBox)
+else if (taskField is AsyncStateMachineBox<IAsyncStateMachine> weaklyTypedBox)
 {
 // If this is the first await, we won't yet have a state machine, so store it.
Member

If we're trying to squeeze water from a stone, this whole block should be extremely cold (only used in debugging scenarios), so if you can come up with a good way to outline it all to a separate helper, that'd be reasonable. Not sure if that'd help much, though, given that stateMachine is generic.

Member Author

Yeah, it would likely make things worse because it's still the same code, but with extra unwinding/GC tracking data structures because we have a new method.

@MichalStrehovsky
Member Author

> That's because RyuJIT inlines the box helper for large structs. I think this should be fixed in the JIT.

Do you have a heuristic in mind? It seems like what RyuJIT is doing now is strictly better from throughput perspective (we call the right allocation helper, assign all the fields, only do write barriers for GC pointers). Falling back to the regular boxing helper would be slower. Or are you thinking about an object RuntimeHelpers.Box<T>(ref T memory) that would be like a box, but guaranteed not to be expanded into alloc+set individual fields?

@EgorBo
Member

EgorBo commented Apr 26, 2024

> > That's because RyuJIT inlines the box helper for large structs. I think this should be fixed in the JIT.
>
> Do you have a heuristic in mind? It seems like what RyuJIT is doing now is strictly better from throughput perspective (we call the right allocation helper, assign all the fields, only do write barriers for GC pointers). Falling back to the regular boxing helper would be slower. Or are you thinking about an object RuntimeHelpers.Box<T>(ref T memory) that would be like a box, but guaranteed not to be expanded into alloc+set individual fields?

I presume Jan means that for a struct with many gc fields (JIT can check that) it's better for VM/Box helper to do a batch move (BulkMoveWithWriteBarrier) instead of individual write barriers

@jkotas
Member

jkotas commented Apr 26, 2024

> It seems like what RyuJIT is doing now is strictly better from throughput perspective

I expect that the bulk copy in the box helper starts to win in throughput above certain size. The JIT should probably give up inlining the copy way below this threshold for good code size / throughput tradeoff.

Also, the current implementation of the box is slow since the box helper is almost never used today. Jeremy hit that problem when switching a few places in reflection to use the box helper (WIP #101137).

@EgorBo
Member

EgorBo commented Apr 26, 2024

Quick test:

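using BenchmarkDotNet.Attributes;

public class BoxBenchmark
{
    // Instance field, so each call boxes a fresh copy of the struct.
    MyStruct ms;

    [Benchmark]
    public object Box() => ms;
}

// Six reference-type fields: copying each one needs a GC write barrier.
struct MyStruct
{
    string a1;
    string a2;
    string a3;
    string a4;
    string a5;
    string a6;
}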

Benchmark for current impl vs using CORINFO_HELP_BOX (which is lowered down to memmoveGCRefs basically):

| Method | Toolchain                   | Mean      |
|------- |---------------------------- |----------:|
| Box    | \Core_Root_PR\corerun.exe   |  5.832 ns |
| Box    | \Core_Root_base\corerun.exe | 10.236 ns |

@MichalStrehovsky
Member Author

> I presume Jan means that for a struct with many gc fields (JIT can check that) it's better for VM/Box helper to do a batch move (BulkMoveWithWriteBarrier) instead of individual write barriers

I wonder if this also means the JIT should use the batch move when assigning structs with many GC fields. Looks like there is a CORINFO_HELP_ASSIGN_STRUCT but I can't find any place in the JIT codebase where it would be used.

The method I'm changing here does two struct assignments with these large structs - one as part of the box, and the other as part of the non-boxing path. Looks like it might be beneficial to treat both cases the same in the JIT and if one is using CORINFO_HELP_BOX, the other should be using ASSIGN_STRUCT.

@EgorBo
Member

EgorBo commented Apr 27, 2024

> I wonder if this also means the JIT should use the batch move when assigning structs with many GC fields.

We discussed this in #99096. TL;DR: there is always a trade-off.

E.g. the code we currently generate is interruption-friendly and is more precise from GC's point of view (we update card table for each gc handle in a struct)

@EgorBo
Member

EgorBo commented Apr 27, 2024

Fun fact: no-optimization also improves perf! 🙂


@jkotas
Member

jkotas commented Apr 27, 2024

> is more precise from GC's point of view (we update card table for each gc handle in a struct)

For the box helper, this precision loss in card table updates should not be a concern. The object was just allocated, so it is very unlikely that there will be any card table updates necessary. (There was even a discussion at one point about what it would take to skip the write barrier calls completely in cases like this.)

@MichalStrehovsky
Member Author

I made a draft PR at #101669 to use the regular boxing helper above a certain number of GC fields. I hope we'll be able to tune the number with the bots/automation that exist in this repo. A struct with 6 fields, all of them GC fields, should obviously use the box helper path, but I'm less clear on the criteria for others (does mixing GC and non-GC fields make any difference, etc.?).
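To make the shape of such a heuristic concrete, here is a rough sketch in C# using reflection. It is not the JIT's logic and not what the draft PR does: `BoxHeuristicSketch`, `CountGcFields`, `PreferBoxHelper`, and the threshold of 6 are all hypothetical, with the threshold picked only so that the six-GC-field struct mentioned above lands on the helper path. It counts only fields that directly hold references and does not look inside nested structs.

```csharp
using System;
using System.Reflection;

static class BoxHeuristicSketch
{
    // Hypothetical cutoff; the real number would need tuning.
    public const int GcFieldThreshold = 6;

    // Counts instance fields that directly hold a reference (a GC pointer).
    // Nested structs are not descended into in this sketch.
    public static int CountGcFields(Type structType)
    {
        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;

        int count = 0;
        foreach (FieldInfo field in structType.GetFields(flags))
        {
            if (!field.FieldType.IsValueType)
            {
                count++;
            }
        }
        return count;
    }

    // True when the struct has enough GC fields that a bulk copy in the
    // box helper is assumed to beat per-field write barriers.
    public static bool PreferBoxHelper(Type structType)
        => CountGcFields(structType) >= GcFieldThreshold;
}

// Example shapes: all GC fields versus a mix of GC and non-GC fields.
struct SixStrings
{
    public string A1, A2, A3, A4, A5, A6;
}

struct Mixed
{
    public string S;
    public int I;
    public long L;
}
```

Under these assumptions, `SixStrings` would take the box helper path and `Mixed` would keep the inlined copy. Whether mixed layouts deserve a different rule is exactly the open question.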

// generating this extra code until a better solution is implemented.
var box = new AsyncStateMachineBox<TStateMachine>();
// DebugFinalizableAsyncStateMachineBox looks like a small type, but it actually is not because
// it will have a copy of all the slots from its parent. It will add another hundred(s) bytes
Contributor

It might be out of scope for this PR, but how about using an unspecialized DebugFinalizableAsyncStateMachineBox<IAsyncStateMachine> for NativeAOT?

Member Author

Don't know much about this - if it's only used by managed debuggers, it would not be useful for native AOT anyway. It's ifdeffed out so I'm not concerned about it.

@MichalStrehovsky MichalStrehovsky merged commit 507abdf into dotnet:main May 2, 2024
140 of 147 checks passed
@MichalStrehovsky MichalStrehovsky deleted the async branch May 2, 2024 09:23
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024