Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array.IndexOf at string[] slower that naive implementation #101595

Open
kuda1978 opened this issue Apr 26, 2024 · 9 comments
Open

Array.IndexOf at string[] slower that naive implementation #101595

kuda1978 opened this issue Apr 26, 2024 · 9 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@kuda1978
Copy link

kuda1978 commented Apr 26, 2024

Description

Array.IndexOf on string array slower in 2.5 times that naive iterraction and compare.

Configuration

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4291/22H2/2022Update)
Intel Core i5-9600K CPU 3.70GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.300-preview.24203.14
  [Host]   : .NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2
  .NET 6.0 : .NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.18 (7.0.1824.16914), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2

Data

Benchmark result

| Method          | Job      | Runtime  | Mean      | Error     | StdDev    |
|---------------- |--------- |--------- |----------:|----------:|----------:|
| IndexOfStr      | .NET 6.0 | .NET 6.0 | 31.391 us | 0.1531 us | 0.1357 us |
| IndexOfStrNaive | .NET 6.0 | .NET 6.0 | 10.701 us | 0.0878 us | 0.0778 us |
| IndexOfStr      | .NET 7.0 | .NET 7.0 | 26.413 us | 0.2391 us | 0.2236 us |
| IndexOfStrNaive | .NET 7.0 | .NET 7.0 |  8.789 us | 0.1657 us | 0.1842 us |
| IndexOfStr      | .NET 8.0 | .NET 8.0 | 21.671 us | 0.1674 us | 0.1566 us |
| IndexOfStrNaive | .NET 8.0 | .NET 8.0 |  8.931 us | 0.1344 us | 0.1192 us |

Benchmark source

[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
[SimpleJob(RuntimeMoniker.Net80)]
public class IndexOfBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Benchmark]
    public int IndexOfStr() => Array.IndexOf(_strArr, "10000");

    [Benchmark]
    public int IndexOfStrNaive()
    {
        for (int i = 0; i < _strArr.Length; i++)
            if (_strArr[i] == "10000")
                return i;

        return -1;
    }
}

Analysis

I think, that problem in JIT optimization of method System.Collections.Generic.GenericEqualityComparer<T>.IndexOf
In case of split this method to two methods for finding null values and not null values it runs faster.

[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
[SimpleJob(RuntimeMoniker.Net80)]
public class IndexOfInternalBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Benchmark]
    public int IndexOfOriginal() => IndexOfOriginal<string>(_strArr, "10000", 0, _strArr.Length);

    // Copy of: System.Collections.Generic.GenericEqualityComparer<T>.IndexOf
    private static int IndexOfOriginal<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        if (value == null)
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] == null) return i;
            }
        }
        else
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] != null && array[i]!.Equals(value)) return i;
            }
        }
        return -1;
    }

    [Benchmark]
    public int IndexOfSplitted() => IndexOfSplitted<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfSplitted<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (value != null)
        {
            return IndexOfSplittedNotNull(array, value, startIndex, count);
        }
        else
        {
            return IndexOfSplittedNull(array, value, startIndex, count);
        }
    }

    internal static int IndexOfSplittedNull<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] == null) return i;
        }
        return -1;
    }

    internal static int IndexOfSplittedNotNull<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        for (int i = 0; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }
}

Result:

| Method          | Job      | Runtime  | Mean     | Error    | StdDev   |
|---------------- |--------- |--------- |---------:|---------:|---------:|
| IndexOfOriginal | .NET 6.0 | .NET 6.0 | 26.40 us | 0.096 us | 0.080 us |
| IndexOfSplitted | .NET 6.0 | .NET 6.0 | 26.41 us | 0.123 us | 0.115 us |
| IndexOfOriginal | .NET 7.0 | .NET 7.0 | 31.21 us | 0.187 us | 0.166 us |
| IndexOfSplitted | .NET 7.0 | .NET 7.0 | 10.44 us | 0.157 us | 0.147 us |
| IndexOfOriginal | .NET 8.0 | .NET 8.0 | 24.16 us | 0.130 us | 0.122 us |
| IndexOfSplitted | .NET 8.0 | .NET 8.0 | 12.28 us | 0.165 us | 0.154 us |
@kuda1978 kuda1978 added the tenet-performance Performance related issue label Apr 26, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 26, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 26, 2024
@huoyaoyuan huoyaoyuan added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Apr 26, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

@huoyaoyuan
Copy link
Member

I wonder if inverting the if (value == null) condition would change the result. Not null value should be treated as more common case. Finding null value can be delegated to SIMD path.

@EgorBo EgorBo added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Runtime untriaged New issue has not been triaged by the area owner labels Apr 26, 2024
@EgorBo EgorBo added this to the Future milestone Apr 26, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Apr 26, 2024

A minimal repro for the underlying issue:

[MethodImpl(MethodImplOptions.NoInlining)]
public bool Test<T>(T[] array, T value) where T : IEquatable<T>
{
    return array[0].Equals(value);
}

We currently are never able to optimize this Equals if the method is not inlined into non-shared generic context. Dynamic PGO also gives up here because VM tells JIT the method should be invoked as indirect and we never optimize such calls with PGO currently. Codegen:

; Assembly listing for method Test[System.__Canon](System.__Canon[],System.__Canon):ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
G_M40771_IG01:  ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      qword ptr [rsp+0x20], rdx
       mov      rbx, r8
       mov      rsi, r9
						;; size=17 bbWeight=1 PerfScore 3.75
G_M40771_IG02:  ;; offset=0x0011
       mov      rcx, qword ptr [rdx+0x38]
       mov      r11, qword ptr [rcx+0x10]
       test     r11, r11
       je       SHORT G_M40771_IG04
						;; size=13 bbWeight=1 PerfScore 5.25
G_M40771_IG03:  ;; offset=0x001E
       jmp      SHORT G_M40771_IG05
						;; size=2 bbWeight=0.80 PerfScore 1.60
G_M40771_IG04:  ;; offset=0x0020
       mov      rcx, rdx
       mov      rdx, 0x7FFB713E76B0      ; global ptr
       call     CORINFO_HELP_RUNTIMEHANDLE_METHOD
       mov      r11, rax
						;; size=21 bbWeight=0.20 PerfScore 0.35
G_M40771_IG05:  ;; offset=0x0035
       cmp      dword ptr [rbx+0x08], 0
       jbe      SHORT G_M40771_IG07
       mov      rcx, gword ptr [rbx+0x10]
       mov      rdx, rsi
       call     [r11]
       nop      
						;; size=17 bbWeight=1 PerfScore 9.50
G_M40771_IG06:  ;; offset=0x0046
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret      
						;; size=7 bbWeight=1 PerfScore 2.25
G_M40771_IG07:  ;; offset=0x004D
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 83
***** BB01 [0000]
STMT00001 ( ??? ... ??? )
               [000009] DAC-G+-----                         *  STORE_LCL_VAR long   V06 tmp2         
               [000006] --C-G+-----                         \--*  CALL help long   CORINFO_HELP_RUNTIMEHANDLE_METHOD
               [000004] !----+----- arg0 in rcx                +--*  LCL_VAR   long   V01 TypeCtx      
               [000005] H----+-N--- arg1 in rdx                \--*  CNS_INT(h) long   0x7ffb7140bfe0 global ptr

***** BB01 [0000]
STMT00003 ( ??? ... ??? )
               [000015] --CXG+-----                         *  RETURN    int   
               [000011] --CXG+-----                         \--*  CALL ind stub int   
               [000010] -----+----- calli tgt                  \--*  LCL_VAR   long   V06 tmp2         
               [000027] ---XG+----- this in rcx                +--*  COMMA     ref   
               [000020] ---X-+-----                            |  +--*  BOUNDS_CHECK_Rng void  
               [000001] -----+-----                            |  |  +--*  CNS_INT   int    0
               [000019] ---X-+-----                            |  |  \--*  ARR_LENGTH int   
               [000000] -----+-----                            |  |     \--*  LCL_VAR   ref    V02 arg1         
               [000028] n---G+-----                            |  \--*  IND       ref   
               [000026] -----+-----                            |     \--*  ARR_ADDR  byref ref[]
               [000025] -----+-----                            |        \--*  ADD       byref 
               [000017] -----+-----                            |           +--*  LCL_VAR   ref    V02 arg1         
               [000024] -----+-----                            |           \--*  CNS_INT   long   16
               [000016] -----+----- vsd cell in r11            +--*  LCL_VAR   long   V06 tmp2         
               [000003] -----+----- arg2 in rdx                \--*  LCL_VAR   ref    V03 arg2       

cc @AndyAyersMS

@kuda1978
Copy link
Author

kuda1978 commented Apr 26, 2024

@EgorBo
I think that you have the mistake in minimal repro.

It seems strange to me that after splitting the method into three, it runs twice as fast. See the method IndexOfSplitted of second benchmark.

@huoyaoyuan
Copy link
Member

huoyaoyuan commented Apr 28, 2024

Simpler:

    [Benchmark]
    public int IndexOfNullCheck() => IndexOfNullCheck<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfNullCheck<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (value == null)
        {
            return -1;
        }

        int endIndex = startIndex + count;

        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }


    [Benchmark]
    public int IndexOfNoNullCheck() => IndexOfNoNullCheck<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfNoNullCheck<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;

        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }
Method Mean Error StdDev
IndexOfNullCheck 22.34 us 0.099 us 0.092 us
IndexOfNoNullCheck 11.17 us 0.108 us 0.090 us

Just a null check can trigger the pessimisation.
I can't reproduce it under non-generic context. It actually performs better than both generic versions.
@EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further.

@neon-sunset
Copy link
Contributor

neon-sunset commented Apr 28, 2024

@EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further

You can use [DisassemblyDiagnoser] with the desired call depth to see what's going on.

@huoyaoyuan
Copy link
Member

It may be an artifact of measurement then. The null check affects inline decision. Both benchmarks provides very similar result with same AggressiveInlining or NoInlining config.

@huoyaoyuan
Copy link
Member

Egor is probably correct that it's about inlining and devirtualization. Manually extract the called methods and apply AggressiveInlining:

public class IndexOfBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Params("10000")]
    public string Value { get; set; }

    [Benchmark]
    public int IndexOfStr() => Array.IndexOf(_strArr, Value);

    [Benchmark]
    public int IndexOfStrNaive()
    {
        for (int i = 0; i < _strArr.Length; i++)
            if (_strArr[i] == Value)
                return i;

        return -1;
    }

    [Benchmark]
    public int IndexOfAggressiveInlining() => ArrayIndexOf(_strArr, Value, 0, _strArr.Length);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe int ArrayIndexOf<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (array == null)
        {
            throw new Exception();
        }

        if ((uint)startIndex > (uint)array.Length)
        {
            throw new Exception();
        }

        if ((uint)count > (uint)(array.Length - startIndex))
        {
            throw new Exception();
        }

        return GenericEqualityComparer<T>(array, value, startIndex, count);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static int GenericEqualityComparer<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        if (value == null)
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] == null) return i;
            }
        }
        else
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] != null && array[i]!.Equals(value)) return i;
            }
        }
        return -1;
    }
}

It just performs much better:

Method Value Mean Error StdDev
IndexOfStr 10000 20.592 us 0.3979 us 0.4737 us
IndexOfStrNaive 10000 11.106 us 0.0736 us 0.0689 us
IndexOfAggressiveInlining 10000 7.085 us 0.0434 us 0.0385 us

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants