
JIT should use boxing helper to box structs with many GC pointers #101699

Closed
MichalStrehovsky opened this issue Apr 29, 2024 · 13 comments · Fixed by #101761

@MichalStrehovsky (Member):

Unoptimized code is faster than optimized code: #101605 (comment)


Prototype with some extra discussion at #101669.
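
For illustration, a minimal sketch of the problem shape (the type and member names below are invented; the real repro is in the linked issue): boxing a struct whose fields are all GC references. With full opts the JIT copies each reference through a write-barrier helper, while the boxing helper does one bulk GC-safe copy instead.

record struct ManyRefs(string A, string B, string C, string D, string E, string F);

static class BoxingExample
{
    // Implicit box: allocate an object and copy all GC fields into it.
    static object Box(ManyRefs value) => value;
}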

@MichalStrehovsky MichalStrehovsky added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 29, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo EgorBo added this to the Future milestone Apr 29, 2024
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Apr 29, 2024

EgorBo commented Apr 29, 2024

Marking as Future since we already have a lot of issues for 9.0, but I think I can try to land it this week.

@EgorBo EgorBo self-assigned this Apr 29, 2024

EgorBo commented Apr 30, 2024

Working prototype (CoreCLR): EgorBo@1c9f3a4


EgorBo commented Apr 30, 2024

@jkotas what would be the best way to handle possible GC starvation for huge structs? Can we handle that on the VM's side, inside the JIT_StructWriteBarrier call?


jkotas commented Apr 30, 2024

  • The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things.
  • Null reference exceptions for a null source/destination may need extra care.
  • The bulk copy is great when the target is a recently allocated object, since it is unlikely to introduce extra work for the GC. It can be a mixed bag for arbitrary assignments, where it may introduce extra work for the GC. I think we can give it a try and see whether it shows up anywhere. (cc @dotnet/gc)
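
A minimal sketch of the chunked-copy idea behind the first point, assuming nothing about the actual runtime code (the class name, chunk size, and use of Array.Copy are illustrative only): copy the GC references in bounded chunks so the thread can reach a GC safepoint between chunks instead of holding off the GC for the entire copy.

using System;

static class ChunkedGcCopy
{
    // Hypothetical chunk size; the real helper uses its own tuned constant.
    private const int ChunkElements = 0x1000;

    public static void Copy(object[] src, object[] dst, int length)
    {
        int copied = 0;
        while (copied < length)
        {
            int chunk = Math.Min(ChunkElements, length - copied);
            // Array.Copy over reference arrays performs a GC-aware, write-barriered copy.
            Array.Copy(src, copied, dst, copied, chunk);
            copied += chunk;
            // Between chunks the thread is back in ordinary managed code, so the GC
            // can suspend it here; this is the "polls for GC" part of the design.
        }
    }
}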

@huoyaoyuan (Member) commented:

I have some questions about further optimizing GC pointer copying:

Currently the actual copy is done in an FCall, so GC can only happen at chunk boundaries. Large copies are split into chunks, and an implicit GC poll happens on every transition between managed code and the FCall.

The actual GCSafeCopyHelpers are implemented in native code with only SSE enabled (_mm_loadu_ps/_mm_store_ps), since SSE is in the baseline instruction set. Can it achieve significantly higher throughput if other SIMD instruction sets are used (AVX2/AVX-512 and ARM intrinsics)? How can the instruction set be dynamically selected: by a special managed code block that runs in cooperative mode, or by choosing native stubs dynamically?


EgorBo commented Apr 30, 2024

> and ARM intrinsics

I'd expect NEON to kick in here implicitly as is.

> Can it achieve significantly higher throughput if other SIMD instruction sets are used

Probably? It depends on how often we see large inputs here. We probably do see them in various Array/Span CopyTo calls, but for the struct copy discussed here it is unlikely (a struct must be really big to see an impact from AVX).

> How can the instruction set be dynamically selected: by a special managed code block that runs in cooperative mode, or by choosing native stubs dynamically?

I'd guess the simplest option would be a dynamic ISA check in the existing code, just like we e.g. check for atomics on arm64 in C++ code. Obviously, it means one more branch of overhead for small inputs.

PS: one important thing we have to keep in mind is the atomicity guarantees; e.g. SIMD loads/stores do have them on x86-64 when the input is 16-byte aligned (and the alignment is not changed).
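
A hedged sketch of such a dynamic ISA check, written with C# hardware intrinsics rather than in the native GCSafeCopyHelpers (illustrative only: it copies raw bytes and provides neither the write barriers nor the pointer-sized atomicity the real helper must preserve):

using System.Runtime.Intrinsics.X86;

static unsafe class VectorCopySketch
{
    public static void Copy(byte* src, byte* dst, int len)
    {
        int i = 0;
        if (Avx.IsSupported)
        {
            // 32-byte copies when AVX is available; this branch is the extra
            // "one more condition" cost for small inputs mentioned above.
            for (; i + 32 <= len; i += 32)
                Avx.Store(dst + i, Avx.LoadVector256(src + i));
        }
        else if (Sse2.IsSupported)
        {
            // 16-byte copies on the SSE2 baseline.
            for (; i + 16 <= len; i += 16)
                Sse2.Store(dst + i, Sse2.LoadVector128(src + i));
        }
        for (; i < len; i++) // scalar tail
            dst[i] = src[i];
    }
}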


EgorBo commented May 1, 2024

Hm, I finished my prototype, but for some reason I am not seeing the expected perf improvements compared to the raw box helper (on Windows x64) 🤔

GcStruct9 s9;

[Benchmark] 
public object BoxS9() => s9;

record struct GcStruct9(
    string a1, string a2, string a3, 
    string a4, string a5, string a6,
    string a7, string a8, string a9);

Codegen in Main:

; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M000_IG01:
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 32
       mov      rsi, rcx

G_M000_IG02:
       mov      rcx, 0x7FFBDD35BE20
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       add      rsi, 40
       lea      rdi, bword ptr [rbx+0x08]
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      rax, rbx

G_M000_IG03:
       add      rsp, 32
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code: 92

Codegen in the prototype:

; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M5744_IG01:
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx

G_M5744_IG02:
       mov      rcx, 0x7FFB87E5B0C8      ; GcStruct9
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       lea      rcx, bword ptr [rsi+0x08]
       lea      rdx, bword ptr [rbx+0x28]
       mov      r8d, 72
       call     CORINFO_HELP_ASSIGN_STRUCT
       mov      rax, rsi

G_M5744_IG03:
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 56

Main: 13.5 ns
PR: 12.5 ns
BoxHelper: 6.5 ns

At first I thought the box helper was faster because it knows the pointers never overlap, but it seems to call the same routine as my prototype...


jkotas commented May 1, 2024

You have not done this part: "The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things."


huoyaoyuan commented May 1, 2024

> a struct must be really big to see an impact from AVX

How about an inline array?
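
(For concreteness, a hypothetical example of the inline-array case: an [InlineArray] struct of references can pack dozens of GC pointers into a single struct, which is where a wider vector copy would matter most.)

using System.Runtime.CompilerServices;

// Hypothetical type for illustration: 64 string references, i.e. 512 bytes of
// GC pointers on x64, in one struct.
[InlineArray(64)]
struct StringBuffer64
{
    private string _element0;
}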

I did an experiment using AVX in the copy, and another with SSE disabled.
Touching arraynative.inl requires a full rebuild for me, which wastes hours.

Benchmark code:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class ArrayCopyBenchmarks // enclosing type and usings added so the snippet compiles
{
    [Params(1024, 200000, 4000000)]
    public int ArraySize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        src_str = Enumerable.Range(1, ArraySize).Select(i => $"abcd{i}").ToArray();
        src_str_unordered = [.. src_str];
        new Random(20240430).Shuffle(src_str_unordered);
        dst_str = new string[ArraySize];
        src_nint = Enumerable.Range(1, ArraySize).Select(i => (nint)i).ToArray();
        dst_nint = new nint[ArraySize];
    }

    private string[] src_str;
    private string[] src_str_unordered;
    private string[] dst_str;
    private nint[] src_nint;
    private nint[] dst_nint;

    [Benchmark]
    public string[] StringArray()
    {
        src_str.CopyTo(dst_str, 0);
        return dst_str;
    }

    [Benchmark]
    public string[] StringArray_Unordered()
    {
        src_str_unordered.CopyTo(dst_str, 0);
        return dst_str;
    }

    [Benchmark]
    public nint[] NintArray()
    {
        src_nint.CopyTo(dst_nint, 0);
        return dst_nint;
    }
}

Results:

| Method | Job | Toolchain | ArraySize | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 46.51 ns | 0.221 ns | 0.206 ns | 0.60 | 0.00 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 108.67 ns | 0.414 ns | 0.387 ns | 1.40 | 0.01 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 1024 | 77.41 ns | 0.277 ns | 0.231 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 50.76 ns | 0.290 ns | 0.272 ns | 0.65 | 0.01 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 111.31 ns | 0.627 ns | 0.586 ns | 1.43 | 0.01 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 1024 | 77.82 ns | 0.534 ns | 0.474 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 44.01 ns | 0.164 ns | 0.153 ns | 1.01 | 0.00 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 43.83 ns | 0.141 ns | 0.125 ns | 1.00 | 0.01 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 1024 | 43.62 ns | 0.207 ns | 0.194 ns | 1.00 | 0.00 |
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 44,634.81 ns | 248.081 ns | 219.917 ns | 1.03 | 0.01 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 47,465.84 ns | 213.360 ns | 178.165 ns | 1.09 | 0.00 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 200000 | 43,425.92 ns | 207.534 ns | 194.127 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 46,459.96 ns | 353.088 ns | 330.278 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 46,701.84 ns | 301.490 ns | 282.014 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 200000 | 45,213.14 ns | 228.236 ns | 202.325 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 35,568.12 ns | 704.679 ns | 811.509 ns | 1.01 | 0.02 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 34,635.92 ns | 686.125 ns | 641.801 ns | 0.98 | 0.02 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 200000 | 35,296.74 ns | 684.155 ns | 702.577 ns | 1.00 | 0.00 |
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 2,067,140.31 ns | 41,252.291 ns | 67,778.698 ns | 1.00 | 0.05 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 2,309,220.21 ns | 44,077.554 ns | 50,759.777 ns | 1.11 | 0.05 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 4000000 | 2,078,519.09 ns | 37,302.752 ns | 62,324.518 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 2,108,473.91 ns | 41,784.542 ns | 69,812.582 ns | 1.01 | 0.04 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 2,324,801.37 ns | 45,282.636 ns | 64,943.004 ns | 1.11 | 0.02 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 4000000 | 2,109,031.42 ns | 41,459.311 ns | 44,360.998 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 1,317,257.37 ns | 26,272.227 ns | 70,125.893 ns | 1.00 | 0.07 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 1,314,460.98 ns | 26,219.591 ns | 76,483.778 ns | 0.99 | 0.06 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 4000000 | 1,324,209.74 ns | 26,438.074 ns | 71,476.872 ns | 1.00 | 0.00 |

It seems that using a longer vector size has more impact at relatively small sizes. Huge sizes are more limited by the CPU cache, I think?

@huoyaoyuan (Member) commented:

More numbers for smaller sizes:

| Method | Job | Toolchain | ArraySize | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 8 | 5.072 ns | 0.0567 ns | 0.0530 ns | 1.14 | 0.01 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 4.444 ns | 0.0219 ns | 0.0205 ns | 1.00 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 8 | 4.457 ns | 0.0307 ns | 0.0287 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 8 | 5.378 ns | 0.0456 ns | 0.0427 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 4.449 ns | 0.0298 ns | 0.0279 ns | 0.85 | 0.01 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 8 | 5.219 ns | 0.0477 ns | 0.0446 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 8 | 2.933 ns | 0.0197 ns | 0.0184 ns | 0.98 | 0.02 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 3.105 ns | 0.0295 ns | 0.0276 ns | 1.03 | 0.02 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 8 | 3.005 ns | 0.0774 ns | 0.0686 ns | 1.00 | 0.00 |
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 32 | 6.754 ns | 0.0543 ns | 0.0508 ns | 1.12 | 0.01 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 9.033 ns | 0.0266 ns | 0.0222 ns | 1.50 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 32 | 6.009 ns | 0.0274 ns | 0.0243 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 32 | 6.702 ns | 0.0752 ns | 0.0704 ns | 1.01 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 9.125 ns | 0.0790 ns | 0.0739 ns | 1.37 | 0.02 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 32 | 6.654 ns | 0.0462 ns | 0.0432 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 32 | 4.749 ns | 0.0260 ns | 0.0243 ns | 0.94 | 0.00 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 4.762 ns | 0.0240 ns | 0.0212 ns | 0.94 | 0.00 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 32 | 5.055 ns | 0.0199 ns | 0.0186 ns | 1.00 | 0.00 |
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 256 | 15.400 ns | 0.0852 ns | 0.0797 ns | 0.70 | 0.00 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 29.602 ns | 0.1023 ns | 0.0957 ns | 1.34 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 256 | 22.031 ns | 0.1651 ns | 0.1544 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 256 | 17.604 ns | 0.1319 ns | 0.1234 ns | 0.80 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 29.486 ns | 0.1414 ns | 0.1322 ns | 1.34 | 0.01 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 256 | 21.940 ns | 0.1629 ns | 0.1524 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 256 | 14.972 ns | 0.0466 ns | 0.0389 ns | 1.00 | 0.00 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 14.953 ns | 0.0796 ns | 0.0745 ns | 1.00 | 0.01 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 256 | 14.970 ns | 0.0444 ns | 0.0415 ns | 1.00 | 0.00 |

Conclusion: using AVX is more impactful for arrays (size 256 ~ 1024), not structs (size 8 ~ 32). It can be a separate topic.


EgorBo commented May 1, 2024

> You have not done this part: "The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things."

@jkotas hm, do you have an idea of how a GC poll fixes the perf here? I am a bit puzzled, as it does indeed seem to fix it 🤔

> Conclusion: using AVX is more impactful for arrays (size 256 ~ 1024), not structs (size 8 ~ 32). It can be a separate topic.

@huoyaoyuan I guess that if you can figure out how to use AVX there properly, without noticeable regressions, then it should be a good change?

@EgorBo EgorBo modified the milestones: Future, 9.0.0 May 1, 2024

jkotas commented May 1, 2024

> do you have an idea of how a GC poll fixes the perf here? I am a bit puzzled, as it does indeed seem to fix it

HELPER_METHOD_FRAME is expensive
