I have a small question about the xxHash code. __GNUC__ is tested in a few places (quoted below) to bring the performance of GCC-compiled code closer to that of Clang. However, ICC also defines __GNUC__. Was the code benchmarked with ICC in mind, or should these tests also include an extra !defined(__INTEL_COMPILER)?
I am referring to the following two comments in your code:
XXH_FORCE_INLINE xxh_u64 XXH3_mix16B(const xxh_u8* XXH_RESTRICT input,
                                     const xxh_u8* XXH_RESTRICT secret, xxh_u64 seed64)
{
#if defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \
  && defined(__i386__) && defined(__SSE2__)  /* x86 + SSE2 */ \
  && !defined(XXH_ENABLE_AUTOVECTORIZE)      /* Define to disable like XXH32 hack */
    /*
     * UGLY HACK:
     * GCC for x86 tends to autovectorize the 128-bit multiply, resulting in
     * slower code.
     *
     * By forcing seed64 into a register, we disrupt the cost model and
     * cause it to scalarize. See `XXH32_round()`
     *
     * FIXME: Clang's output is still _much_ faster -- On an AMD Ryzen 3600,
     * XXH3_64bits @ len=240 runs at 4.6 GB/s with Clang 9, but 3.3 GB/s on
     * GCC 9.2, despite both emitting scalar code.
     *
     * GCC generates much better scalar code than Clang for the rest of XXH3,
     * which is why finding a more optimal codepath is an interest.
     */
    XXH_COMPILER_GUARD(seed64);

/*
 * UGLY HACK:
 * GCC usually generates the best code with -O3 for xxHash.
 *
 * However, when targeting AVX2, it is overzealous in its unrolling resulting
 * in code roughly 3/4 the speed of Clang.
 *
 * There are other issues, such as GCC splitting _mm256_loadu_si256 into
 * _mm_loadu_si128 + _mm256_inserti128_si256. This is an optimization which
 * only applies to Sandy and Ivy Bridge... which don't even support AVX2.
 *
 * That is why when compiling the AVX2 version, it is recommended to use either
 *   -O2 -mavx2 -march=haswell
 * or
 *   -O2 -mavx2 -mno-avx256-split-unaligned-load
 * for decent performance, or to use Clang instead.
 *
 * Fortunately, we can control the first one with a pragma that forces GCC into
 * -O2, but the other one we can't control without "failed to inline always
 * inline function due to target mismatch" warnings.
 */
I would be surprised if this performance hack was ever checked on ICC,
though with godbolt, that might actually be possible.
cc @easyaspi314, the author of this code.
Coming back to this topic,
it seems there are still a few places in the code base where the following preprocessor test is present: #if defined(__GNUC__) && !defined(__clang__)
I presume it's the kind of test where excluding ICC would be useful too.
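For illustration, here is a minimal sketch of what such a test could look like. It assumes classic ICC is identified by __INTEL_COMPILER, the macro suggested in the original question. EXAMPLE_GCC_ONLY, example_compiler_name() and example_report() are hypothetical names used only for this example; they are not part of xxHash, and this has not been benchmarked on ICC.

```c
#include <stdio.h>

/*
 * EXAMPLE_GCC_ONLY is a hypothetical helper macro, not part of xxHash.
 * It is 1 only for GCC itself: Clang and classic ICC both define
 * __GNUC__, so each is excluded through its own identifying macro.
 */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#  define EXAMPLE_GCC_ONLY 1
#else
#  define EXAMPLE_GCC_ONLY 0
#endif

/* Returns a label for the compiler family selected by the checks above. */
static const char *example_compiler_name(void)
{
#if defined(__INTEL_COMPILER)
    return "ICC (classic)";
#elif defined(__clang__)
    return "Clang";
#elif EXAMPLE_GCC_ONLY
    return "GCC";
#else
    return "other";
#endif
}

/* Prints whether a GCC-specific workaround would be compiled in. */
static void example_report(void)
{
    printf("compiler: %s, GCC-only workaround %s\n",
           example_compiler_name(),
           EXAMPLE_GCC_ONLY ? "enabled" : "disabled");
}
```

Note that the newer LLVM-based Intel compiler (icx) defines __clang__, so the existing !defined(__clang__) part of the test already excludes it; only classic ICC would need the extra check.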