
JIT should use boxing helper to box structs with many GC pointers #101699

Closed
MichalStrehovsky opened this issue Apr 29, 2024 · 13 comments · Fixed by #101761

@MichalStrehovsky (Member):

Unoptimized code is faster than optimized code: #101605 (comment)


Prototype with some extra discussion at #101669.
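
For illustration, a minimal sketch of the problem shape (the type and member names below are invented; the real repro is in the linked issue): boxing a struct whose fields are all GC references. With full opts the JIT copies each reference through a write-barrier helper, while the boxing helper does one bulk GC-safe copy instead.

record struct ManyRefs(string A, string B, string C, string D, string E, string F);

static class BoxingExample
{
    // Implicit box: allocate an object and copy all GC fields into it.
    static object Box(ManyRefs value) => value;
}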

@MichalStrehovsky MichalStrehovsky added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 29, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo EgorBo added this to the Future milestone Apr 29, 2024
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Apr 29, 2024

EgorBo commented Apr 29, 2024

Marking as Future since we already have a lot of issues for 9.0, but I think I can try to land it this week.

@EgorBo EgorBo self-assigned this Apr 29, 2024

EgorBo commented Apr 30, 2024

Working prototype (CoreCLR): EgorBo@1c9f3a4


EgorBo commented Apr 30, 2024

@jkotas what would be the best way to handle possible GC starvation for huge structs? Can we handle that on the VM's side, inside the JIT_StructWriteBarrier call?


jkotas commented Apr 30, 2024

  • The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things.
  • Null reference exceptions for a null source/destination may need extra care.
  • The bulk copy is great when the target is a recently allocated object, since it is unlikely to introduce extra work for the GC. It can be a mixed bag for arbitrary assignments, where it may introduce extra work for the GC. I think we can give it a try and see whether it shows up anywhere. (cc @dotnet/gc)
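
A minimal sketch of the chunked-copy idea behind the first point, assuming nothing about the actual runtime code (the class name, chunk size, and use of Array.Copy are illustrative only): copy the GC references in bounded chunks so the thread can reach a GC safepoint between chunks instead of holding off the GC for the entire copy.

using System;

static class ChunkedGcCopy
{
    // Hypothetical chunk size; the real helper uses its own tuned constant.
    private const int ChunkElements = 0x1000;

    public static void Copy(object[] src, object[] dst, int length)
    {
        int copied = 0;
        while (copied < length)
        {
            int chunk = Math.Min(ChunkElements, length - copied);
            // Array.Copy over reference arrays performs a GC-aware, write-barriered copy.
            Array.Copy(src, copied, dst, copied, chunk);
            copied += chunk;
            // Between chunks the thread is back in ordinary managed code, so the GC
            // can suspend it here; this is the "polls for GC" part of the design.
        }
    }
}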

@huoyaoyuan (Member) commented:

I have some questions about further optimizing GC pointer copying:

Currently the actual copy is done in an FCall, so GC can only happen at chunk boundaries. Large copies are split into chunks, and an implicit GC poll happens on every transition between managed code and the FCall.

The actual GCSafeCopyHelpers are implemented in native code with only SSE enabled (_mm_loadu_ps/_mm_store_ps), since SSE is in the baseline instruction set. Can it achieve significantly higher throughput if other SIMD instruction sets are used (AVX2/AVX-512 and ARM intrinsics)? How can the instruction set be dynamically selected: by a special managed code block that runs in cooperative mode, or by choosing native stubs dynamically?


EgorBo commented Apr 30, 2024

> and ARM intrinsics

I'd expect NEON to kick in here implicitly as is.

> Can it achieve significantly higher throughput if other SIMD instruction sets are used

Probably? It depends on how often we see large inputs here. We probably do see them in various Array/Span CopyTo calls, but for the struct copy discussed here it is unlikely (a struct must be really big to see an impact from AVX).

> How can the instruction set be dynamically selected: by a special managed code block that runs in cooperative mode, or by choosing native stubs dynamically?

I'd guess the simplest option would be a dynamic ISA check in the existing code, just like we e.g. check for atomics on arm64 in C++ code. Obviously, it means one more branch of overhead for small inputs.

PS: one important thing we have to keep in mind is the atomicity guarantees; e.g. SIMD loads/stores do have them on x86-64 when the input is 16-byte aligned (and the alignment is not changed).
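
A hedged sketch of such a dynamic ISA check, written with C# hardware intrinsics rather than in the native GCSafeCopyHelpers (illustrative only: it copies raw bytes and provides neither the write barriers nor the pointer-sized atomicity the real helper must preserve):

using System.Runtime.Intrinsics.X86;

static unsafe class VectorCopySketch
{
    public static void Copy(byte* src, byte* dst, int len)
    {
        int i = 0;
        if (Avx.IsSupported)
        {
            // 32-byte copies when AVX is available; this branch is the extra
            // "one more condition" cost for small inputs mentioned above.
            for (; i + 32 <= len; i += 32)
                Avx.Store(dst + i, Avx.LoadVector256(src + i));
        }
        else if (Sse2.IsSupported)
        {
            // 16-byte copies on the SSE2 baseline.
            for (; i + 16 <= len; i += 16)
                Sse2.Store(dst + i, Sse2.LoadVector128(src + i));
        }
        for (; i < len; i++) // scalar tail
            dst[i] = src[i];
    }
}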


EgorBo commented May 1, 2024

Hm, I finished my prototype, but for some reason I am not seeing the expected perf improvements compared to the raw box helper (on Windows x64) 🤔

GcStruct9 s9;

[Benchmark] 
public object BoxS9() => s9;

record struct GcStruct9(
    string a1, string a2, string a3, 
    string a4, string a5, string a6,
    string a7, string a8, string a9);

Codegen in Main:

; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M000_IG01:
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 32
       mov      rsi, rcx

G_M000_IG02:
       mov      rcx, 0x7FFBDD35BE20
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       add      rsi, 40
       lea      rdi, bword ptr [rbx+0x08]
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       call     CORINFO_HELP_ASSIGN_BYREF
       mov      rax, rbx

G_M000_IG03:
       add      rsp, 32
       pop      rbx
       pop      rsi
       pop      rdi
       ret      
; Total bytes of code: 92

Codegen in the prototype:

; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M5744_IG01:
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx

G_M5744_IG02:
       mov      rcx, 0x7FFB87E5B0C8      ; GcStruct9
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       lea      rcx, bword ptr [rsi+0x08]
       lea      rdx, bword ptr [rbx+0x28]
       mov      r8d, 72
       call     CORINFO_HELP_ASSIGN_STRUCT
       mov      rax, rsi

G_M5744_IG03:
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 56

Main: 13.5 ns
PR: 12.5 ns
BoxHelper: 6.5 ns

At first I thought the box helper was faster because it knows the pointers never overlap, but it seems to call the same routine as my prototype...


jkotas commented May 1, 2024

You have not done this part: "The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things."


huoyaoyuan commented May 1, 2024

> a struct must be really big to see an impact from AVX

How about an inline array?
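
(For concreteness, a hypothetical example of the inline-array case: an [InlineArray] struct of references can pack dozens of GC pointers into a single struct, which is where a wider vector copy would matter most.)

using System.Runtime.CompilerServices;

// Hypothetical type for illustration: 64 string references, i.e. 512 bytes of
// GC pointers on x64, in one struct.
[InlineArray(64)]
struct StringBuffer64
{
    private string _element0;
}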

I did an experiment using AVX in the copy, and another with SSE disabled.
Touching arraynative.inl requires a full rebuild for me, which wastes hours.

Benchmark code:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class ArrayCopyBenchmarks // enclosing type and usings added so the snippet compiles
{
    [Params(1024, 200000, 4000000)]
    public int ArraySize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        src_str = Enumerable.Range(1, ArraySize).Select(i => $"abcd{i}").ToArray();
        src_str_unordered = [.. src_str];
        new Random(20240430).Shuffle(src_str_unordered);
        dst_str = new string[ArraySize];
        src_nint = Enumerable.Range(1, ArraySize).Select(i => (nint)i).ToArray();
        dst_nint = new nint[ArraySize];
    }

    private string[] src_str;
    private string[] src_str_unordered;
    private string[] dst_str;
    private nint[] src_nint;
    private nint[] dst_nint;

    [Benchmark]
    public string[] StringArray()
    {
        src_str.CopyTo(dst_str, 0);
        return dst_str;
    }

    [Benchmark]
    public string[] StringArray_Unordered()
    {
        src_str_unordered.CopyTo(dst_str, 0);
        return dst_str;
    }

    [Benchmark]
    public nint[] NintArray()
    {
        src_nint.CopyTo(dst_nint, 0);
        return dst_nint;
    }
}

Results:

| Method | Job | Toolchain | ArraySize | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 46.51 ns | 0.221 ns | 0.206 ns | 0.60 | 0.00 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 108.67 ns | 0.414 ns | 0.387 ns | 1.40 | 0.01 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 1024 | 77.41 ns | 0.277 ns | 0.231 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 50.76 ns | 0.290 ns | 0.272 ns | 0.65 | 0.01 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 111.31 ns | 0.627 ns | 0.586 ns | 1.43 | 0.01 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 1024 | 77.82 ns | 0.534 ns | 0.474 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 1024 | 44.01 ns | 0.164 ns | 0.153 ns | 1.01 | 0.00 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 1024 | 43.83 ns | 0.141 ns | 0.125 ns | 1.00 | 0.01 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 1024 | 43.62 ns | 0.207 ns | 0.194 ns | 1.00 | 0.00 |
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 44,634.81 ns | 248.081 ns | 219.917 ns | 1.03 | 0.01 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 47,465.84 ns | 213.360 ns | 178.165 ns | 1.09 | 0.00 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 200000 | 43,425.92 ns | 207.534 ns | 194.127 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 46,459.96 ns | 353.088 ns | 330.278 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 46,701.84 ns | 301.490 ns | 282.014 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 200000 | 45,213.14 ns | 228.236 ns | 202.325 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 200000 | 35,568.12 ns | 704.679 ns | 811.509 ns | 1.01 | 0.02 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 200000 | 34,635.92 ns | 686.125 ns | 641.801 ns | 0.98 | 0.02 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 200000 | 35,296.74 ns | 684.155 ns | 702.577 ns | 1.00 | 0.00 |
| StringArray | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 2,067,140.31 ns | 41,252.291 ns | 67,778.698 ns | 1.00 | 0.05 |
| StringArray | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 2,309,220.21 ns | 44,077.554 ns | 50,759.777 ns | 1.11 | 0.05 |
| StringArray | Job-IGDLAQ | \main\corerun.exe | 4000000 | 2,078,519.09 ns | 37,302.752 ns | 62,324.518 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 2,108,473.91 ns | 41,784.542 ns | 69,812.582 ns | 1.01 | 0.04 |
| StringArray_Unordered | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 2,324,801.37 ns | 45,282.636 ns | 64,943.004 ns | 1.11 | 0.02 |
| StringArray_Unordered | Job-IGDLAQ | \main\corerun.exe | 4000000 | 2,109,031.42 ns | 41,459.311 ns | 44,360.998 ns | 1.00 | 0.00 |
| NintArray | Job-NTZEWB | \AVX256\corerun.exe | 4000000 | 1,317,257.37 ns | 26,272.227 ns | 70,125.893 ns | 1.00 | 0.07 |
| NintArray | Job-GRYOMD | \NoSSE\corerun.exe | 4000000 | 1,314,460.98 ns | 26,219.591 ns | 76,483.778 ns | 0.99 | 0.06 |
| NintArray | Job-IGDLAQ | \main\corerun.exe | 4000000 | 1,324,209.74 ns | 26,438.074 ns | 71,476.872 ns | 1.00 | 0.00 |

It seems that using a longer vector size has more impact at relatively small sizes. Huge sizes are more limited by the CPU cache, I think?

@huoyaoyuan (Member) commented:

More numbers for smaller sizes:

| Method | Job | Toolchain | ArraySize | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 8 | 5.072 ns | 0.0567 ns | 0.0530 ns | 1.14 | 0.01 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 4.444 ns | 0.0219 ns | 0.0205 ns | 1.00 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 8 | 4.457 ns | 0.0307 ns | 0.0287 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 8 | 5.378 ns | 0.0456 ns | 0.0427 ns | 1.03 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 4.449 ns | 0.0298 ns | 0.0279 ns | 0.85 | 0.01 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 8 | 5.219 ns | 0.0477 ns | 0.0446 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 8 | 2.933 ns | 0.0197 ns | 0.0184 ns | 0.98 | 0.02 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 8 | 3.105 ns | 0.0295 ns | 0.0276 ns | 1.03 | 0.02 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 8 | 3.005 ns | 0.0774 ns | 0.0686 ns | 1.00 | 0.00 |
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 32 | 6.754 ns | 0.0543 ns | 0.0508 ns | 1.12 | 0.01 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 9.033 ns | 0.0266 ns | 0.0222 ns | 1.50 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 32 | 6.009 ns | 0.0274 ns | 0.0243 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 32 | 6.702 ns | 0.0752 ns | 0.0704 ns | 1.01 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 9.125 ns | 0.0790 ns | 0.0739 ns | 1.37 | 0.02 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 32 | 6.654 ns | 0.0462 ns | 0.0432 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 32 | 4.749 ns | 0.0260 ns | 0.0243 ns | 0.94 | 0.00 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 32 | 4.762 ns | 0.0240 ns | 0.0212 ns | 0.94 | 0.00 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 32 | 5.055 ns | 0.0199 ns | 0.0186 ns | 1.00 | 0.00 |
| StringArray | Job-DBGZIY | \AVX256\corerun.exe | 256 | 15.400 ns | 0.0852 ns | 0.0797 ns | 0.70 | 0.00 |
| StringArray | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 29.602 ns | 0.1023 ns | 0.0957 ns | 1.34 | 0.01 |
| StringArray | Job-SMRFMX | \main\corerun.exe | 256 | 22.031 ns | 0.1651 ns | 0.1544 ns | 1.00 | 0.00 |
| StringArray_Unordered | Job-DBGZIY | \AVX256\corerun.exe | 256 | 17.604 ns | 0.1319 ns | 0.1234 ns | 0.80 | 0.01 |
| StringArray_Unordered | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 29.486 ns | 0.1414 ns | 0.1322 ns | 1.34 | 0.01 |
| StringArray_Unordered | Job-SMRFMX | \main\corerun.exe | 256 | 21.940 ns | 0.1629 ns | 0.1524 ns | 1.00 | 0.00 |
| NintArray | Job-DBGZIY | \AVX256\corerun.exe | 256 | 14.972 ns | 0.0466 ns | 0.0389 ns | 1.00 | 0.00 |
| NintArray | Job-BVSPDW | \NoSSE\corerun.exe | 256 | 14.953 ns | 0.0796 ns | 0.0745 ns | 1.00 | 0.01 |
| NintArray | Job-SMRFMX | \main\corerun.exe | 256 | 14.970 ns | 0.0444 ns | 0.0415 ns | 1.00 | 0.00 |

Conclusion: using AVX is more impactful for arrays (size 256 ~ 1024), not structs (size 8 ~ 32). It can be a separate topic.


EgorBo commented May 1, 2024

> You have not done this part: "The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things."

@jkotas hm, do you have an idea of how a GC poll fixes the perf here? I am a bit puzzled, as it does indeed seem to fix it 🤔

> Conclusion: using AVX is more impactful for arrays (size 256 ~ 1024), not structs (size 8 ~ 32). It can be a separate topic.

@huoyaoyuan I guess that if you can figure out how to use AVX there properly, without noticeable regressions, then it should be a good change?

@EgorBo EgorBo modified the milestones: Future, 9.0.0 May 1, 2024

jkotas commented May 1, 2024

> do you have an idea of how a GC poll fixes the perf here? I am a bit puzzled, as it does indeed seem to fix it

HELPER_METHOD_FRAME is expensive
