Slowdown on several vector variations #147

Open
TheTryton opened this issue Jul 8, 2020 · 2 comments


@TheTryton

[MSVC] (CPU: Xeon 1231v3)
I noticed a slowdown while experimenting with several combinations of matrix sizes and simdpp instruction sets.
I implemented matrix multiplication in three different ways: plain C, unmodified simdpp, and my own modification of simdpp. All implementations resemble the code below:

// Multiply two row-major 4x4 float matrices (m1 * m2) with SSE intrinsics.
matrix<float, 4, 4> result;

// Load the four rows of m2 once; they are reused for every row of m1.
__m128 m2rows[4] =
{
    _mm_load_ps(m2.row(0).data()),
    _mm_load_ps(m2.row(1).data()),
    _mm_load_ps(m2.row(2).data()),
    _mm_load_ps(m2.row(3).data()),
};

auto m1row = m1.begin();
auto resultrow = result.begin();
for (; m1row != m1.end(); m1row += 4, resultrow += 4)
{
    // Broadcast each element of the current m1 row into its own vector.
    __m128 m1row_v[4] =
    {
        _mm_load_ps1(m1row),
        _mm_load_ps1(m1row + 1),
        _mm_load_ps1(m1row + 2),
        _mm_load_ps1(m1row + 3)
    };

    // Scale each row of m2 by the corresponding element of the m1 row.
    __m128 temp_mul[4] =
    {
        _mm_mul_ps(m1row_v[0], m2rows[0]),
        _mm_mul_ps(m1row_v[1], m2rows[1]),
        _mm_mul_ps(m1row_v[2], m2rows[2]),
        _mm_mul_ps(m1row_v[3], m2rows[3]),
    };

    // Sum the four partial products to obtain one row of the result.
    __m128 resultrow_v = _mm_add_ps(_mm_add_ps(temp_mul[0], temp_mul[1]), _mm_add_ps(temp_mul[2], temp_mul[3]));
    _mm_store_ps(resultrow, resultrow_v);
}

return result;

(Every structure is properly aligned and sized so that the corresponding SIMD type can be loaded/stored.)
Below I explain the performance problems observed for each combination of matrix size and selected instruction set.
Performance was calculated as an average of 100000 iterations and 10 runs of every implementation.
Compiled with the /O2 and /Ob2 optimizations (Release).
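
For clarity, the measurement loop is roughly the following (a simplified sketch, not the exact benchmark code; multiply stands for whichever of the three implementations is being measured):

#include <chrono>
#include <cstdio>

// Simplified timing harness: 100000 multiplications per run, averaged over 10 runs.
template<typename MatrixT, typename MultiplyFn>
double average_ns_per_multiply(const MatrixT& m1, const MatrixT& m2, MultiplyFn multiply)
{
    constexpr int runs = 10;
    constexpr int iterations = 100000;
    double total_ns = 0.0;
    unsigned char sink = 0; // consumes a byte of every result so the optimizer cannot drop the work

    for (int r = 0; r < runs; ++r)
    {
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i)
        {
            auto result = multiply(m1, m2);
            sink ^= *reinterpret_cast<const unsigned char*>(&result);
        }
        const auto stop = std::chrono::steady_clock::now();
        total_ns += std::chrono::duration<double, std::nano>(stop - start).count();
    }

    std::printf("sink: %u\n", sink);                // side effect keeps sink (and the results) alive
    return total_ns / (double(runs) * iterations);  // average ns per multiplication
}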

  • matrix 4x4 (SSE2):
    plain C: ~9ns
    simdpp: ~9ns
    personal modification of simdpp: ~9ns
    Here everything works as expected.

  • matrix 8x8 (SSE2):
    plain C: ~63ns
    simdpp: ~377ns
    personal modification of simdpp: ~64ns
    In this combination we can see a performance loss on the simdpp side. After analysing the implementation, my solution was to change the line where the actual load takes place - the i_load(V& a, const char* p) template function. The MSVC compiler couldn't simplify the for loop written inside it; after changing it to use constexpr offsets added to the pointer (the std::index_sequence trick - a sketch follows this list), execution time decreased and matched plain C (this is the change made in my personal modification of simdpp).

  • matrix 16x16 (SSE2):
    plain C: ~272ns
    simdpp: ~6532ns
    personal modification of simdpp: ~270ns
    This combination exhibits the same problem as the previous one.

  • matrix 4x4 (AVX2):
    plain C (__m128 fallback): ~9ns
    simdpp (__m128 fallback): ~593ns
    personal modification of simdpp (__m128 fallback): ~9ns
    Even though nothing really changed (the code still uses __m128 internally, as in the SSE2 matrix 4x4 case), simdpp performance dropped significantly. After digging through the implementation and the Intel Intrinsics Guide I found a solution - replace _mm_broadcast_ss with _mm_load_ss plus permute4<0,0,0,0>(v), as if I hadn't enabled the AVX2 instruction set (see the second sketch after this list). The Intel Intrinsics Guide lists _mm_broadcast_ss with a latency of 6 cycles and a throughput of 0.5 CPI, which means that for the four load_splats I'm doing the CPU waits idle for about 22 cycles. With _mm_load_ss and permute4<0,0,0,0>(v) (which is almost always equivalent to _mm_load_ps1) the CPU doesn't waste these cycles: the change translates to the instructions movss and shufps, which both have a throughput of 1 CPI and a latency of 1 cycle.

  • matrix 8x8 (AVX2):
    plain C (__m256): ~33ns
    simdpp (__m256): ~32ns
    personal modification of simdpp (__m256): ~34ns
    Here simdpp performance matches expectations.

  • matrix 16x16 (AVX2):
    plain C (__m256): ~260ns
    simdpp (__m256): ~2915ns
    personal modification of simdpp (__m256): ~257ns
    Again simdpp incurs a performance hit from the for loop, as explained in the second case.
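
To illustrate the i_load change, here is a minimal sketch (the SimdVec4 wrapper and the function names are placeholders, not the actual simdpp internals): the runtime for loop is replaced by a compile-time std::index_sequence expansion so that MSVC emits straight-line loads with constexpr offsets.

#include <immintrin.h>
#include <cstddef>
#include <utility>

// Placeholder wrapper standing in for simdpp's internal multi-register vector type.
struct SimdVec4 { __m128 d[4]; };

// Original style: a runtime loop that MSVC fails to simplify.
inline void i_load_loop(SimdVec4& a, const char* p)
{
    for (unsigned i = 0; i < 4; ++i)
        a.d[i] = _mm_load_ps(reinterpret_cast<const float*>(p) + i * 4);
}

// Modified style (C++17): expand the loop at compile time so that every load
// gets a constexpr offset and MSVC emits four plain movaps instructions.
template<std::size_t... I>
inline void i_load_unrolled_impl(SimdVec4& a, const char* p, std::index_sequence<I...>)
{
    ((a.d[I] = _mm_load_ps(reinterpret_cast<const float*>(p) + I * 4)), ...);
}

inline void i_load_unrolled(SimdVec4& a, const char* p)
{
    i_load_unrolled_impl(a, p, std::make_index_sequence<4>{});
}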

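And a sketch of the broadcast change from the 4x4 AVX2 case (again with placeholder function names; _mm_shuffle_ps stands in for what permute4<0,0,0,0>(v) compiles down to):

#include <immintrin.h>

// AVX path before the change: one vbroadcastss per element
// (latency 6 on this CPU according to the Intel Intrinsics Guide).
inline __m128 load_splat_broadcast(const float* p)
{
    return _mm_broadcast_ss(p); // requires AVX to be enabled
}

// Replacement: movss + shufps, each latency 1 / throughput 1,
// which is essentially what _mm_load_ps1 expands to.
inline __m128 load_splat_shuffle(const float* p)
{
    __m128 v = _mm_load_ss(p);
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
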
The problems stated above also occur elsewhere in the simdpp implementation (e.g. in the add, mul and sub functions). I haven't tested the above code with G++ and Clang on my platform, but I suppose some of these problems can still occur there (e.g. the one with _mm_broadcast_ss). I also haven't measured these matrix multiplications with AVX512F because I don't have a CPU supporting that instruction set, but issues of a similar kind may be present there as well.

@abique
Contributor

abique commented Jul 16, 2020

It would be interesting to reproduce this benchmark with gcc and clang.

@TheTryton
Author

> It would be interesting to reproduce this benchmark with gcc and clang.

I've tested them now, and these slowdowns are not present with either gcc (9.3.0) or clang (9.0.1). From my investigation, the slowdowns are mainly due to MSVC's inferior optimization.
