Tried AVX512 pre-sieving: no speedup #140

Open
kimwalisch opened this issue Oct 30, 2023 · 0 comments
kimwalisch commented Oct 30, 2023

I tried AVX512 pre-sieving using the 2 algorithms below.

On AMD EPYC 4th gen CPUs (Genoa) I saw no speedup with either GCC or Clang (compared to the default SSE2 pre-sieving algorithm). On Intel CPUs I measured a 1% to 2% speedup with GCC (using ./primesieve 1e11 -t1) but no speedup with Clang. Overall I think the added complexity is not worth it. Supporting AVX512 pre-sieving would likely require using GCC's multi-arch feature, which makes the code significantly more complex.

The AVX512 pre-sieving code is available on the avx512_presieve branch (note that this code is for testing only; it is not production quality yet). It may be useful to retest this code in a few years, since the AVX512 code may perform better on future x64 CPUs.
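For readers unfamiliar with why multi-arch support adds complexity, here is a minimal sketch of manual runtime dispatch. It is not the primesieve implementation: it is simplified to two input buffers, the names andBuffers_default, andBuffers_avx512 and HAS_AVX512_DISPATCH are hypothetical, and it assumes GCC or Clang on x86-64 (on other platforms only the portable fallback is compiled).

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  #include <immintrin.h>
  #define HAS_AVX512_DISPATCH 1
#endif

// Portable fallback: AND two buffers byte by byte.
void andBuffers_default(const uint8_t* a, const uint8_t* b,
                        uint8_t* out, std::size_t bytes)
{
  for (std::size_t i = 0; i < bytes; i++)
    out[i] = a[i] & b[i];
}

#if defined(HAS_AVX512_DISPATCH)
// Compiled with AVX512 enabled for this function only, so the
// rest of the translation unit still runs on older CPUs.
__attribute__ ((target ("avx512f,avx512bw")))
void andBuffers_avx512(const uint8_t* a, const uint8_t* b,
                       uint8_t* out, std::size_t bytes)
{
  for (std::size_t i = 0; i < bytes; i += 64)
  {
    // Only the low `remaining` mask bits are set for the last partial block.
    std::size_t remaining = bytes - i;
    __mmask64 mask = (remaining >= 64) ? ~0ull : (~0ull >> (64 - remaining));
    __m512i va = _mm512_maskz_loadu_epi8(mask, a + i);
    __m512i vb = _mm512_maskz_loadu_epi8(mask, b + i);
    _mm512_mask_storeu_epi8(out + i, mask, _mm512_and_si512(va, vb));
  }
}
#endif

// Picks the implementation at runtime based on the CPU's capabilities.
void andBuffers(const uint8_t* a, const uint8_t* b,
                uint8_t* out, std::size_t bytes)
{
#if defined(HAS_AVX512_DISPATCH)
  if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw"))
  {
    andBuffers_avx512(a, b, out, bytes);
    return;
  }
#endif
  andBuffers_default(a, b, out, bytes);
}
```

Even in this reduced form, every vectorized routine needs a second code path, a preprocessor guard and a runtime check, which is the kind of overhead that has to be weighed against a 1% to 2% gain.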

Algorithm 1

__attribute__ ((target ("avx512f,avx512bw")))
void andBuffers(const uint8_t* __restrict buf0,
                const uint8_t* __restrict buf1,
                const uint8_t* __restrict buf2,
                const uint8_t* __restrict buf3,
                const uint8_t* __restrict buf4,
                const uint8_t* __restrict buf5,
                const uint8_t* __restrict buf6,
                const uint8_t* __restrict buf7,
                uint8_t* __restrict output,
                std::size_t bytes)
{
  for (std::size_t i = 0; i < bytes; i += sizeof(__m512i))
  {
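    // Full mask for complete 64-byte blocks; for the final block
    // only the low (bytes - i) bits are set.
    __mmask64 mask = (i + 64 < bytes) ? 0xffffffffffffffffull : 0xffffffffffffffffull >> (i + 64 - bytes);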

    _mm512_mask_storeu_epi8((__m512i*) &output[i], mask,
        _mm512_and_si512(
            _mm512_and_si512(
                _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf0[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf1[i])),
                _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf2[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf3[i]))),
            _mm512_and_si512(
                _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf4[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf5[i])),
                _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf6[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf7[i])))));
  }
}
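The mask expression in Algorithm 1 can be checked in isolation with plain integer arithmetic. The sketch below copies that expression into a hypothetical helper named tailMask (the name and the precondition assert are additions for illustration, not part of the branch).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns a 64-bit mask for the 64-byte block starting at offset i
// in a buffer of `bytes` bytes. Bit k set means byte i + k is valid.
// Precondition: i < bytes, so the shift count is at most 63.
uint64_t tailMask(std::size_t i, std::size_t bytes)
{
  assert(i < bytes);
  const uint64_t all = 0xffffffffffffffffull;
  return (i + 64 < bytes) ? all : all >> (i + 64 - bytes);
}
```

For a 200-byte buffer, tailMask(0, 200) and tailMask(128, 200) select all 64 bytes, while tailMask(192, 200) shifts by 56 and returns 0xff, selecting only the last 8 bytes.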

Algorithm 2

__attribute__ ((target ("avx512f,avx512bw")))
void andBuffers(const uint8_t* __restrict buf0,
                const uint8_t* __restrict buf1,
                const uint8_t* __restrict buf2,
                const uint8_t* __restrict buf3,
                const uint8_t* __restrict buf4,
                const uint8_t* __restrict buf5,
                const uint8_t* __restrict buf6,
                const uint8_t* __restrict buf7,
                uint8_t* __restrict output,
                std::size_t bytes)
{
  std::size_t i = 0;

  for (; i + 64 < bytes; i += sizeof(__m512i))
  {
    _mm512_storeu_epi8((__m512i*) &output[i],
        _mm512_and_si512(
            _mm512_and_si512(
                _mm512_and_si512(_mm512_loadu_epi8((const __m512i*) &buf0[i]), _mm512_loadu_epi8((const __m512i*) &buf1[i])),
                _mm512_and_si512(_mm512_loadu_epi8((const __m512i*) &buf2[i]), _mm512_loadu_epi8((const __m512i*) &buf3[i]))),
            _mm512_and_si512(
                _mm512_and_si512(_mm512_loadu_epi8((const __m512i*) &buf4[i]), _mm512_loadu_epi8((const __m512i*) &buf5[i])),
                _mm512_and_si512(_mm512_loadu_epi8((const __m512i*) &buf6[i]), _mm512_loadu_epi8((const __m512i*) &buf7[i])))));
  }

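  // Tail: the mask selects the remaining (bytes - i) bytes, 1 to 64.
  // Assumes bytes > 0, otherwise the shift count would be 64.
  __mmask64 mask = 0xffffffffffffffffull >> (i + 64 - bytes);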

  _mm512_mask_storeu_epi8((__m512i*) &output[i], mask,
    _mm512_and_si512(
      _mm512_and_si512(
        _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf0[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf1[i])),
        _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf2[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf3[i]))),
      _mm512_and_si512(
        _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf4[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf5[i])),
        _mm512_and_si512(_mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf6[i]), _mm512_maskz_loadu_epi8(mask, (const __m512i*) &buf7[i])))));
}
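When retesting either algorithm, it can help to compare its output against a portable reference. The following is a minimal scalar sketch of the same 8-way AND; the name andBuffersScalar and the array-of-pointers parameter are hypothetical choices for brevity and do not appear in the branch.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference: output[i] is the bitwise AND of byte i
// from each of the 8 input buffers.
void andBuffersScalar(const uint8_t* const bufs[8],
                      uint8_t* output,
                      std::size_t bytes)
{
  for (std::size_t i = 0; i < bytes; i++)
  {
    uint8_t x = 0xff;
    for (int j = 0; j < 8; j++)
      x &= bufs[j][i];
    output[i] = x;
  }
}
```

Running this and an AVX512 version on the same inputs with sizes that are not multiples of 64 exercises the masked tail handling.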
@kimwalisch kimwalisch self-assigned this Oct 30, 2023
@kimwalisch kimwalisch changed the title AVX512 pre-sieving: no speedup Tried AVX512 pre-sieving: no speedup Oct 30, 2023