
List Add is way slower (almost 3 times) in net9.0 preview 3 than with net8.0 #101437

Open
linkdotnet opened this issue Apr 23, 2024 · 50 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

@linkdotnet

With the latest preview 3 (net9) there seems to be a major performance regression with lists on macOS 14.4 (M2 Pro). Given that List<T> is one of the most used types, this deserves special attention.

Benchmark

[Benchmark]
public List<int> ListAdd10000()
{
    var list = new List<int>();
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

[Benchmark]
public List<int> ListAdd10000PreAlloc()
{
    var list = new List<int>(10_000);
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

Results:

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Pro, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD


| Method               | Runtime  | Mean      | Error     | StdDev    | Ratio |
|--------------------- |--------- |----------:|----------:|----------:|------:|
| ListAdd10000         | .NET 8.0 |  9.739 us | 0.0236 us | 0.0209 us |  1.00 |
| ListAdd10000         | .NET 9.0 | 25.646 us | 0.1531 us | 0.1432 us |  2.63 |
|                      |          |           |           |           |       |
| ListAdd10000PreAlloc | .NET 8.0 |  6.701 us | 0.0175 us | 0.0146 us |  1.00 |
| ListAdd10000PreAlloc | .NET 9.0 | 22.562 us | 0.1429 us | 0.1336 us |  3.37 |

Interestingly even the pre-allocated list is comparably slow.

dotnet --info

.NET SDK:
Version: 9.0.100-preview.3.24204.13
Commit: 81f61d8290
Workload version: 9.0.100-manifests.77bb7ba9
MSBuild version: 17.11.0-preview-24178-16+7ca3c98fa

Runtime Environment:
OS Name: Mac OS X
OS Version: 14.4
OS Platform: Darwin
RID: osx-arm64
Base Path: /usr/local/share/dotnet/sdk/9.0.100-preview.3.24204.13/

.NET workloads installed:
There are no installed workloads to display.

Host:
Version: 9.0.0-preview.3.24172.9
Architecture: arm64
Commit: 9e6ba1f

.NET SDKs installed:
6.0.417 [/usr/local/share/dotnet/sdk]
7.0.306 [/usr/local/share/dotnet/sdk]
7.0.404 [/usr/local/share/dotnet/sdk]
8.0.100 [/usr/local/share/dotnet/sdk]
8.0.101 [/usr/local/share/dotnet/sdk]
8.0.200 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24121.2 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24157.14 [/usr/local/share/dotnet/sdk]
9.0.100-preview.3.24204.13 [/usr/local/share/dotnet/sdk]

.NET runtimes installed:
Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24120.6 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24128.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24120.11 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24128.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
x64 [/usr/local/share/dotnet/x64]

Environment variables:
Not set

global.json file:
Not found

Learn more:
https://aka.ms/dotnet/info

Download .NET:
https://aka.ms/dotnet/download

@linkdotnet linkdotnet added the tenet-performance Performance related issue label Apr 23, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 23, 2024
@vcsjones vcsjones added area-System.Collections and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Apr 23, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Apr 23, 2024

Doesn't repro for me on x64-windows:


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method               | Job      | Runtime  | Mean     | Error     | StdDev    |
|--------------------- |--------- |--------- |---------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 | 7.339 us | 0.0598 us | 0.0559 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 5.002 us | 0.0988 us | 0.1508 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 7.472 us | 0.1401 us | 0.2597 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 4.992 us | 0.0987 us | 0.1730 us |

I don't have a macOS arm64 machine to test on.

@vcsjones
Member

@EgorBo I can repro on my M1.

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M1 Ultra, 1 CPU, 20 logical and 20 physical cores
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-OISOJI : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-YEXJFM : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Runtime  | Mean      | Error     | StdDev    | Ratio |
|--------------------- |--------- |----------:|----------:|----------:|------:|
| ListAdd10000         | .NET 8.0 | 10.488 us | 0.0102 us | 0.0080 us |  1.00 |
| ListAdd10000         | .NET 9.0 | 28.429 us | 0.0501 us | 0.0469 us |  2.71 |
| ListAdd10000PreAlloc | .NET 8.0 |  7.137 us | 0.0147 us | 0.0138 us |  1.00 |
| ListAdd10000PreAlloc | .NET 9.0 | 24.054 us | 0.0877 us | 0.0732 us |  3.37 |

@linkdotnet
Author

The same applies to 9.0.100-preview.2.24157.14 as well - so the regression was likely introduced in preview 1 or preview 2.

@EgorBo
Member

EgorBo commented Apr 23, 2024

Interesting! Reproduces for me on M2 Max as well (just found it 🙂):

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD


| Method               | Job      | Runtime  | Mean      | Error     | StdDev    |
|--------------------- |--------- |--------- |----------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 |  9.633 us | 0.0906 us | 0.0848 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 |  6.688 us | 0.0389 us | 0.0364 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 24.315 us | 0.1284 us | 0.1201 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 21.207 us | 0.0854 us | 0.0757 us |

@tannergooding
Member

Didn't we change Arm64 to start using the newer Arm64 barrier instructions for the GC if they existed around that point?

IIRC, @kunalspathak did some of that work.

@tannergooding
Member

I'm not finding the barrier PR I was remembering, but there was #97953 too.

@EgorBo
Member

EgorBo commented Apr 23, 2024

Doesn't repro on Linux-arm64

@EgorBo
Member

EgorBo commented Apr 23, 2024

Codegen diff: https://www.diffchecker.com/6tDgYQ8w/ - the only difference is an ldp on net9 vs. two ldr on net8. Hard to imagine that being the culprit 🤔

@tannergooding
Member

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

@EgorBo
Member

EgorBo commented Apr 23, 2024

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

I've just checked it locally - I built a runtime that exactly matches 9.0-preview3 and removed the Ldr->Ldp optimization - that fixed the perf 😐

@EgorBo
Member

EgorBo commented Apr 23, 2024

Here is the asm diff for ReplaceLdrStrWithPairInstr being there (left) and removed (right) on the same runtime: https://www.diffchecker.com/8C38Yoor/

@vcsjones vcsjones added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Collections labels Apr 23, 2024
@EgorBo
Member

EgorBo commented Apr 23, 2024

After some brainstorming in our Community Discord:

  1. The loop alignment is the same in both cases - we even tried to put a nop after ldp so the overall codegen (and loop) size matches -- no change.
  2. The issue was introduced in "JIT: Reorder indirs on arm64 to make them amenable to ldp optimization" #92768 in .NET 9.0. That PR allows the pre-existing ReplaceLdrStrWithPairInstr optimization to recognize more pairs of loads as ldp.
  3. Linux-arm64 Ampere does not seem to be affected (if I measured correctly), so for now it looks macOS arm64 specific. Perhaps someone can try it on some other non-Mac arm64 machine? The benchmark is:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[SimpleJob(runtimeMoniker: RuntimeMoniker.Net80)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net90)]
public class MyClass
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(MyClass).Assembly).Run(args);
    }

    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

(with <TargetFrameworks>net8.0;net9.0</TargetFrameworks> in the csproj)

@tannergooding
Member

It could potentially be related to behavioral changes that exist for LDP when LSE2 is supported.

In particular

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy
atomic when all of the following conditions are true:
• The overall memory access is aligned to 16 bytes.
• Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy
atomic when all of the following conditions are true:
• All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
• Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

The actual operation (or a snippet of it) is then described loosely as:

if HaveLSE2Ext() && !signed then
    bits(2*datasize) full_data;
    full_data = Mem[address, 2*dbytes, accdesc, TRUE];
    if BigEndian(accdesc.acctype) then
        data2 = full_data<(datasize-1):0>;
        data1 = full_data<(2*datasize-1):datasize>;
    else
        data1 = full_data<(datasize-1):0>;
        data2 = full_data<(2*datasize-1):datasize>;
else
    data1 = Mem[address, dbytes, accdesc];
    data2 = Mem[address+dbytes, dbytes, accdesc];

So it may be beneficial to test this on a Linux machine with LSE2 to see if that makes a difference or not.

@vcsjones
Member

vcsjones commented Apr 23, 2024

Graviton 3:

| Method               | Job      | Runtime  | Mean     | Error    | StdDev   |
|--------------------- |--------- |--------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 23.84 us | 0.422 us | 0.414 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 23.75 us | 0.270 us | 0.252 us |

To @tannergooding's point though, either the hardware or the kernel I am using does not support LSE2, based on dmesg output.

@EgorBo
Member

EgorBo commented Apr 23, 2024

(screenshot: diff of List<T>.Add with the _version increment moved to the end of the method)

This change fixes the perf too. The loads we're looking at are _size and _version inside List<T>, so we're likely dealing with some sort of stall here, probably caused by the LSE2 behavior Tanner mentioned.

@tannergooding
Member

My guess, given the above, is that since LSE2 makes ldp atomic for loads that fall within a 16-byte window (and we know these will, since they'll be 8-byte aligned), the write to _version is placing a stall against the read of _size, which wouldn't exist for two independent ldr instructions.
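For context, a simplified sketch of the List<T>.Add shape being discussed (paraphrased, not the actual BCL source; the field names match those discussed in this thread, the bodies are abbreviated, and ListSketch/the initial capacity are just for illustration). The load of _version for the increment and the load of _size target adjacent 4-byte fields, which is what makes them a candidate for the ldr/ldr -> ldp peephole, and the store back to _version is what the next iteration's pair load then has to wait on:

// Simplified sketch only - not the actual List<T> implementation.
class ListSketch<T>
{
    private T[] _items = new T[4];
    private int _size;      // number of elements in use
    private int _version;   // bumped on every mutation, checked by enumerators

    public void Add(T item)
    {
        _version++;                  // load + store of _version ...
        T[] array = _items;
        int size = _size;            // ... and a load of _size right next to it;
                                     // the two 4-byte loads can be fused into one ldp
        if ((uint)size < (uint)array.Length)
        {
            _size = size + 1;
            array[size] = item;
        }
        else
        {
            AddWithResize(item);     // grow path, abbreviated
        }
    }

    private void AddWithResize(T item)
    {
        System.Array.Resize(ref _items, _items.Length * 2);
        _items[_size++] = item;
    }
}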

@EgorBo
Member

EgorBo commented Apr 23, 2024

I like that guess. Although the fact that on Graviton 3 we're seeing 23 us for both cases (and I am seeing similar numbers on Ampere-linux-arm64), while x64 and Apple M1 show 7 us for the best case, might be hinting at something else.

@EgorBo
Member

EgorBo commented Apr 23, 2024

A dummy field between _size and _version in List<T> fixes the issue as well (since it breaks the ldp optimization too).

@EgorBo
Member

EgorBo commented Apr 24, 2024

Minimal repro:

using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MyClass
{
    static void Main()
    {
        var mc = new MyClass();
        Stopwatch sw = Stopwatch.StartNew();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10000; i++)
                mc.Test();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }

    int field1;
    int field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            field1++;
            field2++;
        }
    }
}

Codegen diff for Test(): https://www.diffchecker.com/72No1VJ6/
Left - Main, ~191 ms per iteration
Right - LDP optimization disabled, ~67 ms per iteration (~3x faster)

@EgorBo
Member

EgorBo commented Apr 24, 2024

cc @a74nh @TamarChristinaArm Perhaps you know why we see such a terrible perf hit from ldp on Apple arm64 CPUs.

@EgorBo
Member

EgorBo commented Apr 24, 2024

The fun part is that this benchmark is ~3x faster under x64 emulation (Rosetta), e.g.:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;

BenchmarkRunner.Run<MyClass>();

[InProcess]
public class MyClass
{
    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

Then:

dotnet publish --sc -f net9.0 -r osx-x64 
dotnet publish --sc -f net9.0 -r osx-arm64

Native (arm64):

Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Mean     | Error    | StdDev   |
|--------------------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | 22.07 us | 0.258 us | 0.229 us |

Rosetta (x64) on the same hardware:

Apple M2 Max 2.40GHz, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT SSE4.2

| Method               | Mean     | Error     | StdDev    |
|--------------------- |---------:|----------:|----------:|
| ListAdd10000PreAlloc | 7.724 us | 0.0862 us | 0.0764 us |

@linkdotnet
Author

linkdotnet commented Apr 24, 2024

On an arm64 RPi 4B there is no regression either:

BenchmarkDotNet v0.13.12, Debian GNU/Linux 12 (bookworm)
Unknown processor
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
.NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
.NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Job      | Runtime  | Mean     | Error    | StdDev   |
|--------------------- |--------- |--------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 74.73 us | 0.164 us | 0.128 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 64.74 us | 0.468 us | 0.415 us |

@jakobbotsch
Member

jakobbotsch commented Apr 24, 2024

Are you saying that LSE2 makes it impossible for the hardware to implement ldp as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2 capable ARM64 hardware?
It feels hard to believe. I'm hopeful this is some poorly handled case in Apple Silicon that we'll be able to understand better. Regardless it seems we'll need to figure out how exactly we want to disable this peephole.

@EgorBo
Member

EgorBo commented Apr 24, 2024

I've checked that my linux-arm64 box has lse2 (via hwCap & HWCAP_USCAT) and it doesn't have this problem.

I'm hopeful this is some poorly handled case in Apple Silicon

... or just a nicely handled ldr case there 🙂
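For anyone who wants to run the same check, a minimal sketch in C# (assuming linux-arm64, where FEAT_LSE2 is surfaced as the "uscat" feature flag in /proc/cpuinfo, i.e. the HWCAP_USCAT bit mentioned above; the class name is just for the sketch):

using System;
using System.IO;
using System.Linq;

class Lse2Check
{
    static void Main()
    {
        // FEAT_LSE2 is reported by the Linux kernel as the "uscat" feature flag,
        // the same capability exposed as HWCAP_USCAT.
        bool hasLse2 = File.ReadLines("/proc/cpuinfo")
            .Where(l => l.StartsWith("Features", StringComparison.Ordinal))
            .Any(l => l.Split(' ', ':', '\t').Contains("uscat"));

        Console.WriteLine(hasLse2 ? "FEAT_LSE2 (uscat) reported" : "FEAT_LSE2 not reported");
    }
}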

@tannergooding
Member

tannergooding commented Apr 24, 2024

Are you saying that LSE2 makes it impossible for the hardware to implement ldp as efficiently as before, leading to this 3.5x slowdown simply from updating to LSE2 capable ARM64 hardware?

Yes, for some scenarios. In most cases ldp will be strictly better/more efficient, but in others it can introduce a new data dependency due to the LSE2 requirement that it be single-copy atomic in some scenarios.


If you have (semi pseudo-code):

label: 
    ldr w0, [addr1]
    ldr w1, [addr1 + #4]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label

Then the code will iterate through the loop, and on the second iteration ldr w0, [addr1] will be scheduled but will depend on the str [addr1], w0 from the previous iteration completing first. However, ldr w1, [addr1 + #4] is a separate load and can therefore happen independently of the first load, so we can continue on to executing "do something with w1" even if the str [addr1], w0 hasn't completed yet.

This all follows the documented Arm64 memory model, which is a weakly ordered model. Where by default:

  • A read that is generated by a load instruction that loads a single general-purpose register and is aligned to the
    size of the read in the instruction is single-copy atomic.
  • A write that is generated by a store instruction that stores a single general-purpose register and is aligned to
    the size of the write in the instruction is single-copy atomic.
  • Reads that are generated by a Load Pair instruction that loads two general-purpose registers and are aligned
    to the size of the load to each register are treated as two single-copy atomic reads, one for each register being
    loaded.

and where the important part of the general semantics under B2.3 Definition of the Arm memory model can basically be summarized as: a read effect E2 will occur after a write effect E1 if they are to the same location and cannot be reordered with respect to each other (this is formally defined as the Local read successor).

Thus, given the above, the following program is strictly equivalent:

label: 
    ldp w0, w1, [addr1]
    ; do something with w1
    add w0, w0, #1
    str [addr1], w0
    br label

LSE2, however, changes the semantics for ldp such that

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy
atomic when all of the following conditions are true:

  • The overall memory access is aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy
atomic when all of the following conditions are true:

  • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
  • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

This change, making the load a single-copy atomic operation that encompasses both registers rather than 2 separate single-copy atomic operations (one for each register), really flips how ldp works (at least based on my understanding of the rules). It means that, due to the memory ordering rules relating read effects to write effects and to local read successors, the load of data into w1 can no longer happen independently of the str [addr1], w0.

@tannergooding
Member

Notably, beyond the LSE2 changes, there are a number of additional changes to the memory model specifically for Armv8.4+ (detailed in B2.2.1.1 Changes to single-copy atomicity in Armv8.4), including how coherence works for some types of overlapping reads/writes that are single-copy atomic.

@tamlin-mike

I think @EgorBo hit the nail on the head with moving the increment of _version last in the Add method.

Not only does it (seemingly) avoid the CPU stall, I'd argue it's also correct.
Having the increment at the top means it always increments version, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified.
Placing the increment last makes sure it's only incremented once everything else is completed.

Performance-wise, having that read-modify-write just before the method return could allow the CPU to decode/execute the return (even if it may only be a conceptual return instruction, due to the AggressiveInlining tag) in parallel with the RMW (assuming a superscalar CPU).

Provided that logic is valid, perhaps it would be worth running comparative perf tests for every supported ISA, with the RMW at the head vs. moved to the very end?

The core issue with the JIT obviously remains, but it would at least fix a logic error in the runtime.
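Continuing the ListSketch<T> from the earlier sketch, the reordering being proposed here would look roughly like this (again paraphrased, not the actual source; it is the same shape of change as the diff @EgorBo posted above):

    public void Add(T item)
    {
        T[] array = _items;
        int size = _size;            // _size is loaded without an adjacent _version load
        if ((uint)size < (uint)array.Length)
        {
            _size = size + 1;
            array[size] = item;
        }
        else
        {
            AddWithResize(item);
        }
        _version++;                  // only bumped once the element is actually in place
    }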

@tannergooding
Member

Having the increment at the top means it always increments version, even if something later goes wrong, such as terminating with an OOM exception. Now you have placed the list in a conceptually inconsistent state; the List is not modified, but _version claims it is modified.

If an exception is thrown, you generally have to assume the potential for torn/corrupted state. The same general issue exists for every single one of the field mutations.

You typically want to try and avoid such cases where possible, but this is one of the cases where multi-threaded access is already considered UB, and where the only normal error for the increment would be an OOM, which is already considered a "catastrophic failure", so incrementing the version up front is fine.

(Several comments from @Clockwork-Muse, @neon-sunset, @EgorBo, @tamlin-mike, @tannergooding, @mqudsi, and @jkotas were marked as off-topic.)

@EgorBo EgorBo self-assigned this Apr 27, 2024
@EgorBo EgorBo added this to the Future milestone Apr 27, 2024
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Apr 27, 2024
@EgorBo
Member

EgorBo commented Apr 27, 2024

So we have a good guess (discussed internally) at what's going on here, but it's not clear how to detect the bad patterns and disable the ldr->ldp optimization only for them. The PRs which added the ldr->ldp optimization only have a few improvement reports attached to them (except ldp for SIMD regs), so we can probably just conservatively disable it for Apple platforms (we've confirmed that this problem is Apple specific at this point).

I think the first thing we have to do, before we make any changes, is to add the macOS-arm64 platform to our dotnet/performance runs (they're currently Linux and Windows only).

@marcan

marcan commented Apr 28, 2024

Reminder that Linux on Apple Silicon is a thing, both virtualized and bare metal, and virtual Windows ARM64 on Apple Silicon is also a thing. Please do not gate this on macOS only.

Given the LSE2 theory, I think it makes sense to treat this as a performance issue potentially affecting more CPU platforms (especially future ones), given that there is a plausible explanation for the mechanism and why the code is, for at least some logical implementations, actually sub-optimal. If the gains of this peephole opt are otherwise minimal, and it's not easy to detect the problem code sequences (the ones where it triggers extra stalls) then I think the logical thing to do is just disable it across the board. Alternatively, if you think this is rare enough and the List.Add case is an outlier, then just work around it there (e.g. with the version change).

But please don't use OS platform as a gate for this, since that makes no sense given multiple OSes exist (you'd have to use CPU implementer instead, and I assume that's not practical given ahead-of-time compilation?).
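As an illustration of why implementer-based gating is awkward, here is a sketch of what such a check could look like on linux-arm64 (Apple's Arm implementer ID is 0x61; the class name is hypothetical, the information is per-core, and as noted above it is of little use for ahead-of-time compilation):

using System;
using System.IO;
using System.Linq;

class ImplementerCheck
{
    static void Main()
    {
        // Each core reports a "CPU implementer" line in /proc/cpuinfo; 0x61 is Apple's ID.
        var implementers = File.ReadLines("/proc/cpuinfo")
            .Where(l => l.StartsWith("CPU implementer", StringComparison.Ordinal))
            .Select(l => l.Split(':')[1].Trim())
            .Distinct()
            .ToArray();

        Console.WriteLine("CPU implementer(s): " + string.Join(", ", implementers));
        Console.WriteLine("Apple core present: " + implementers.Contains("0x61"));
    }
}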

@neon-sunset
Contributor

neon-sunset commented Apr 28, 2024

It looks like part of the discussion was marked off-topic.

Just wanted to reiterate that this particular regression is still solvable by removing _version altogether when users target Release, and it might be a good opportunity to finally explore this option as described in #81523.

@linkdotnet
Author

It looks like part of the discussion was marked off-topic.

Just wanted to reiterate that this particular regression is still solvable by removing _version altogether when users target Release, and it might be a good opportunity to finally explore this option as described in #81523.

Wouldn't that only solve the problem for List<T>, leaving room for other potential regressions now and in the future?

@linkdotnet
Author

Don't get me wrong - one can still have this discussion, but I feel it should be handled separately (and not only because of the performance characteristics).

@EgorBo
Member

EgorBo commented Apr 28, 2024

Minimal repro for the Apple issue using Clang and inline asm: https://gist.github.com/EgorBo/88196a218559ec93a197a7d1d5600548

ldp makes this "benchmark" ~4-5x slower. Interestingly, it's still slow with ldp and stp (we thought the problem only reproduced when mixing str with ldp, and that stp with ldp would be fine).

@mqudsi

mqudsi commented Apr 29, 2024

Wouldn't moving the _version write, as @EgorBo demonstrated, be the least controversial change/fix here? It fixes this particular performance regression, avoids arguing about the bigger questions involved with removing version checks altogether, and keeps version checks in place even for List.Add().

The performance regression side can be tackled by monitoring macOS benchmark results and figuring out where other notable regressions occur. (The point about Apple Silicon != macOS is well taken, but also somewhat beside the point: Linux (OS) performance is tracked separately, and monitoring macOS (OS + hw) performance would have caught this for both newer Macs and Linux-on-Apple-silicon; you do not need "idealized" performance monitoring where only one factor changes between each monitored configuration.)

You probably don't want to blanket-disable the ldr-to-ldp optimization: this was a pathological case with a false data dependency on two adjacent fields, but there are likely cases where the compiler can take advantage of the single-copy atomicity improvements to net performance gains.

(But I must confess I am confused why this problem would not manifest on non-Apple aarch64 silicon with FEAT_LSE2 present, as reported earlier in this thread, unless those CPUs weren't taking advantage of the available instruction-level parallelism before, so there is no regression now?)

@Developer-Ecosystem-Engineering

We are reviewing this issue, thanks for raising it!

@Developer-Ecosystem-Engineering

The use of LDP instructions is generally encouraged. In this specific case, the structure of the loop (which includes a memory dependence between stores and loads), coupled with the use of LDP instructions, prevents certain hardware optimizations from engaging. The small and tight nature of the loop also exacerbates the impact.

Please also see section 4.6.11 of the Apple Silicon CPU optimization guide.
