02 May 09:58

mr-c

71fd833

v0.8.2 Latest


SIMDe 0.8.2

Summary

Start of RISCV64 optimized implementation using the RVV1.0 vector extension! Thank you @eric900115 @howjmay @zengdage
62 of the Arm NEON intrinsics added in SIMDe 0.8.0 (from the FCVTZS/FCVTMS/FCVTPS/FCVTNS families) had to be removed because they did not exactly match the specification and real hardware. This brings us down from 100% coverage of the NEON functions to 99.07%.
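
The 99.07% figure follows from the counts given in these notes: the 0.8.0 release listed 6670 NEON functions, and removing 62 leaves 6608. A minimal sketch of that arithmetic in C (the helper name `coverage_percent` is made up for illustration and is not part of SIMDe):

```c
/* Hypothetical helper, not part of SIMDe: percentage of functions implemented. */
static double coverage_percent(int implemented, int total) {
  return 100.0 * (double) implemented / (double) total;
}
```

Calling `coverage_percent(6670 - 62, 6670)` yields a value that rounds to 99.07.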

For the entire project: 126 files changed, 5522 insertions(+), 2772 deletions(-)

For just the simde folder: 89 files changed, 4330 insertions(+), 2199 deletions(-)

Details

Implementation of Arm intrinsics

NEON

arm neon: disable some FCVTZS/FCVTMS/FCVTPS/FCVTNS family intrinsics 339ffe4 @mr-c
arm neon sm3: check constant range 3d34fcd @mr-c
arm 32 bits: native def fixes; workarounds for gcc 22900e6 @Cuda-Chen
x86 implementations: allow _m128 access from SSE 114c3cd @mr-c

WASM intrinsics

wasm x86 impl: some were incorrectly marked SSE instead of SSE2 fee149a @mr-c

x86 intrinsics

SVML

SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c

XOP

fix some native functions 608200b @mr-c

Arch support

arm / arm64

arm platform: cleanup feature detection. 08c21f3 @mr-c
arm: enable more intrinsic functions for armv7 416091e @zengdage

RISCV64

Initial Support for the RISC-V Vector Extension (RVV1.0) in ARM NEON (#1130) b4e805a @eric900115
arm: fix some neon2rvv intrinsic function error 2a548e5 @zengdage
arm: Add neon2rvv support in vand series intrinsics dac67f3 @howjmay
arm: improve performance in vabd_xxx for risc-v b63ba04 @zengdage
arm: improve performance in vhadd_xxx for risc-v a68fa90 @zengdage

Compiler Specific

Clang

detect clang versions 18 & 19 ed4a5cd @mr-c
arm neon clang: skip vrnd native before clang v18 e647f10 @mr-c
apple clang arm64: ignore SHA2 be48ef8 @mr-c

Emscripten

use __builtin_roundeven{f,} from version 3.1.43 onwards 4379740 @mr-c

MSVC

x86 test msvc: really disable warning 4799,4730 487507d @mr-c
sse2 MSVC _mm_pause implementation for x86 8d95f83 @mr-c
SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c

Testing with Docker/Podman & CI

CI: don't run twice on dependabot branches 70748cd @mr-c

GitHub Actions

test Mac arm64 0080b28 @mr-c
macos: report log if there is a configuration failure. df3e930 @mr-c
build(deps): bump actions/checkout from 3 to 4 (#1149) 9605608 @dependabot[bot]
build(deps): bump codecov/codecov-action from 3 to 4 25382c1 @dependabot[bot]
codecov: use token 2c45dd4 @mr-c
Add gcc arm 32bit armv8-a test in CI 72bde75 @Cuda-Chen
build for AMD Bulldozer version 2 9746537 @mr-c

Packit CI

Drop i386 (i686) support. (#1155) cf68aaf @junaruga

Semaphore CI

stop testing on GCC 5 & 6, clang 3.9 & 4 due to forced upgrade to Ubuntu 20.04 9982f10 @mr-c

Misc

update list of fully implemented instruction sets (#1152) b568fcd @mr-c
typo fixes from codespell 8639fef @mr-c
README.md - move CLMUL to partial, list more of the CI.yml architectures 285b50d @Torinde
Update README.md - link to VPCLMULQDQ; mention MSA (#1157) 517da84 @Torinde
Update README.md (#1156) b88a66d @mr-c
README: two more related projects 7429dff @mr-c

New Contributors

@eric900115 made their first contribution in #1130
@Cuda-Chen made their first contribution in #1116
@Torinde made their first contribution in #1157
@zengdage made their first contribution in #1172
@howjmay made their first contribution in #1174

Full Changelog: v0.8.0...v0.8.2

Contributors

junaruga, mr-c, and 6 other contributors


30 Apr 16:39

mr-c

v0.8.2-rc1

71fd833

v0.8.2-rc1 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.8.0

Full Changelog: v0.8.0...v0.8.2-rc1


14 Mar 13:03

mr-c

v0.8.0

589c7d5

v0.8.0

SIMDe 0.8.0

Summary

A complete set of implementations for all NEON intrinsics has been finished, up from 56.46% in the previous release! (@yyctw @wewe5215)
SIMDe PRs are tested using Fedora Rawhide (@junaruga)

For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)

For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)

There are a total of 6876 SIMD functions on x86, 2930 (43.17%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).

Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.

Newly added function families

AES: 5 of 6 (83.33%)

Newly added AVX512 function families

castph: 1 of 9 (11.11%) implemented.
cvtus_storeu: 1 of 18 (5.56%) implemented.
fpclass: 3 of 24 (12.50%) implemented.
i32gather: 1 of 8 (12.50%) implemented.
i64gather: 8 of 8 💯
permutex: 3 of 12 (25.00%) implemented.
rcp14: 1 of 24 (4.17%) implemented.
reduce
reduce_max: 7 of 31 (22.58%) implemented.
reduce_min: 7 of 31 (22.58%) implemented.
shufflehi: 1 of 7 (14.29%) implemented.
shufflelo: 1 of 7 (14.29%) implemented.

Additions to existing families

AVX512BW: 7 additional, 337 total of 790 (42.66%)
AVX512DQ: 5 additional, 112 total of 376 (29.79%)
AVX512F: 48 additional, 1087 total of 2812 (38.66%)
AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)

Neon

SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!

Newly added families

abal
abal_high
abd
abdh
abdl_high
addhn_high
aes
bfdot
bfdot_lane
cadd_rot
cale
calt
cmla_lane
cmla_rot_lane
copy_lane
cvt_high
cvt_n
cvta
cvtn
cvtp
cvtx
cvtx_high
div
dupb_lane
duph_lane
eor3
fmlal
fms
fms_lane
fms_n
ld2_dup
ld2_lane
ld3_dup
ld3_lane
ld4_dup
maxnmv
minnmv
mla_lane
mla_high_lane
mls_lane
mlsl_high_lane
mmla
mull_high_lane
mull_high_n
mulx
mulx_lane
pmaxnm
pminnm
qdmlal
qdmlal_high
qdmlal_high_lane
qdmlal_high_n
qdmlal_lane
qdmlal_n
qdmlsl
qdmlsl_high
qdmlsl_high_lane
qdmlsl_high_n
qdmlsl_lane
qdmlsl_n
qdmlslh
qdmlslh_lane
qdmulhh
qdmulhh_lane
qdmull_high
qdmull_high_lane
qdmull_high_n
qdmull_lane
qdmull_n
qdmullh_lane
qmovun_high
qrdmlah
qrdmlah_lane
qrdmlahh
qrdmlahh_lane
qrdmlsh
qrdmlsh_lane
qrdmlshh
qrdmlshh_lane
qrdmulhh_lane
qrshl
qrshlh
qrshrn_high_n
qrshrnh_n
qrshrun_high_n
qrshrunh_n
qshl_n
qshlh_n
qshluh_n
qshrn_high_n
qshrnh_n
qshrun_high_n
qshrunh_n
raddhn
raddhn_high
rax
recp
rnd32x
rnd64z
rnda
rndx
rshrn_high_n
rsubhn
set_lane
sha1
sha1h
sha256
sha512
shll_high_n
shrn_high_n
sli_n
sm3
sm4
sqrt
st1_x2
st1_x3
st1_x4
st1q_x2
st1q_x3
st1q_x4
subhn_high
sudot_lane
usdot
usdot_lane

Finally complete families

cvtn
mla_lane

Details

simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
simde_float16: prefer __fp16 if available aba26f6 @mr-c

Implementation of Arm intrinsics

NEON

cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
Add AES instructions. 23adcd2 805ccd2 @yyctw
Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
implement all bf16-related intrinsics (#1110) c59db7c @yyctw
arm/neon abs: negating INT_MIN is undefined behavior in C/C++ c200c16 @mr-c
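
To illustrate the last entry: in C, evaluating `-x` when `x` is `INT32_MIN` overflows a signed integer, which is undefined behavior, whereas the non-saturating NEON `vabs` instructions simply wrap and return `INT32_MIN` for that input. The sketch below shows one way to get the wrapping result for a single 32-bit lane without undefined behavior. The function name is hypothetical and this is not claimed to be the code SIMDe uses.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-lane helper: absolute value with two's-complement
 * wraparound, so INT32_MIN maps to INT32_MIN instead of invoking UB. */
static int32_t sketch_abs_s32_wrapping(int32_t v) {
  uint32_t u = (uint32_t) v;  /* conversion to unsigned is well-defined */
  int32_t r;
  if (v < 0) {
    u = 0u - u;               /* unsigned negation wraps modulo 2^32 */
  }
  memcpy(&r, &u, sizeof r);   /* reinterpret the bits as a signed value */
  return r;
}
```

The negation is done on the unsigned representation because unsigned arithmetic is defined to wrap, and `memcpy` is used for the final step so the result does not rely on an implementation-defined conversion.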

SVE Intrinsics

Improve performance of simde_mm512_add_epi32 (#1126) 6cde31c @AymenQ

WASM intrinsics

simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
simd128: add missing unsigned functions ea5e283 @mr-c
simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
detect support for Relaxed SIMD mode 2e66dd4 @mr-c
simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
relaxed: updated names; reordered FMA operations 8cc8874 @mr-c

x86 intrinsics

sse{,2,4.1}, avx{,2} *_stream_{,load}: use __builtin_nontemporal_{load,store} 6ce6030 @mr-c

SSE*

sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-
sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
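
For readers unfamiliar with the two SSE4.1 test intrinsics fixed above, here is a scalar sketch of their documented semantics, modelling a 128-bit value as two 64-bit halves. `_mm_testz_si128(a, b)` returns 1 when `a & b` is all zeros, and `_mm_testnzc_si128(a, b)` returns 1 when both `a & b` and `~a & b` are nonzero. The function names and the two-half representation are illustrative assumptions, not SIMDe's implementation.

```c
#include <stdint.h>

/* Sketch of _mm_testz_si128: 1 if (a AND b) is zero across all 128 bits. */
static int sketch_testz_128(const uint64_t a[2], const uint64_t b[2]) {
  /* && short-circuits: the high half is examined only if the low half is zero. */
  return ((a[0] & b[0]) == 0) && ((a[1] & b[1]) == 0);
}

/* Sketch of _mm_testnzc_si128: 1 if (a AND b) and (NOT a AND b) are both nonzero. */
static int sketch_testnzc_128(const uint64_t a[2], const uint64_t b[2]) {
  /* Bitwise OR merges the two halves; the final combination is a logical AND. */
  uint64_t and_bits    = (a[0] & b[0])  | (a[1] & b[1]);
  uint64_t andnot_bits = (~a[0] & b[0]) | (~a[1] & b[1]);
  return (and_bits != 0) && (andnot_bits != 0);
}
```

Getting the direction of the short circuit or the kind of OR wrong in code like this changes the result only for some inputs, which is why such bugs can go unnoticed without targeted tests.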

AVX

run test from #926 ce9708c @mr-c
simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c

AVX2

correction of simde_mm256_sign_epi{8,16,32} (#1123) c376610 @Proudsalsa

AVX512

fpclass: naive implementation 353bf5f @mr-c
loadu: fix native detection 305f434 @mr-c
set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
gather: add MSVC native fallbacks 7b7e3f6 @mr-c
AVX512FP16 / m512h initial support e97691c @mr-c
fix many native aliases 75014b9 @mr-c

CLMUL

fix natives, some require VPCLMULQDQ f819c52 @mr-c

SVML

enable SIMDE_X86_SVML_NATIVE for MSVC 2019+ 593af95 @mr-c

AES

aes: initial implementation of most aes instructions (#1072) 8632391 @Vineg

MIPS MSA intrinsics

msa neon impl: float64x2_t is not avail in A32V7 ae4c4ab @mr-c

Arch support

x86(-64)

fix SIMDE_ARCH_X86_SSE4_2 define 5e4b308 @cbielow

arm64

x86 aes: add neon implementation using the crypto extension fb3554f @mr-

Altivec

neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c

Power

sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
wasm simd128: more powerpc fixes 7cb5691 @mr-c

Compiler Specific

GCC

GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
Add workaround for GCC bug 111609 fdafd8e @M-HT
arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd (#1118) 5405bbd @thomas-schlichter
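
Entries such as "fixed in GCC 9.5, 10.4, 11.4, 12+" describe a fix that was backported to several release branches, so a workaround has to be gated per branch rather than by a single minimum version. A rough sketch of that check follows. SIMDe itself expresses these conditions as preprocessor macros such as `SIMDE_BUG_GCC_95399`; the runtime function below, its name, and the choice to treat every release older than 9.5 as affected are assumptions made purely for illustration.

```c
/* Hypothetical helper: returns 1 if the given GCC version would still need
 * the workaround, given fixes in 9.5, 10.4, 11.4 and every release from 12 on. */
static int sketch_gcc_needs_95399_workaround(int major, int minor) {
  if (major >= 12) return 0;         /* fixed in all 12+ releases */
  if (major == 11) return minor < 4; /* fixed in 11.4 */
  if (major == 10) return minor < 4; /* fixed in 10.4 */
  if (major == 9)  return minor < 5; /* fixed in 9.5 */
  return 1;                          /* assumption: older branches are affected */
}
```

For example, GCC 10.3 would still need the workaround while GCC 9.5 would not, even though 9.5 has the lower version number.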

Clang

clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c
simde-detect-clang.h: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur

ClangCL

fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
svml: don't...

Contributors

junaruga, Coeur, and 13 other contributors


07 Mar 14:15

mr-c

v0.8.0-rc2

c200c16

v0.8.0-rc2 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.7.6

What's Changed since RC1

WASM Relaxed SIMD updates by @mr-c in #1112
emcc tot: set -Wno-switch-default by @mr-c in #1115
avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd by @thomas-schlichter in #1118
correction of simde_mm256_sign_epi16(). by @Proudsalsa in #1123
apply arm64 windows workaround only on older version msvc by @Changqing-JING in #1121
gh-actions: add clang-17 by @mr-c in #1127
Improve performance of simde_mm512_add_epi32 by @AymenQ in #1126
typo: XCode -> Xcode by @Coeur in #1129
Update simde-detect-clang.h for clang 13 detection by @Coeur in #1131
Update simde-detect-clang.h for clang 17 detection by @Coeur in #1132
build(deps): bump ad-m/github-push-action from 0.6.0 to 0.8.0 by @dependabot in #1134
build(deps): bump actions/setup-dotnet from 3 to 4 by @dependabot in #1135
build(deps): bump actions/setup-python from 4 to 5 by @dependabot in #1137
build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1138
GitHub Actions emscripten: use older release for now by @mr-c in #1133
build(deps): bump actions/checkout from 3 to 4 by @dependabot in #1139
docs: explain how to target a single test by @mr-c in #1140
arm/neon abs: negating INT_MIN is undefined behavior by @mr-c in #1141

New Contributors

@thomas-schlichter made their first contribution in #1118
@Proudsalsa made their first contribution in #1123
@Changqing-JING made their first contribution in #1121
@AymenQ made their first contribution in #1126
@Coeur made their first contribution in #1129
@dependabot made their first contribution in #1134

Full Changelog: v0.8.0-rc1...v0.8.0-rc2

Contributors

Coeur, mr-c, and 5 other contributors


20 Nov 17:41

mr-c

v0.8.0-rc1

e651ec3

v0.8.0-rc1 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes

New Contributors

@cbielow made their first contribution in #1055
@M-HT made their first contribution in #1060
@yyctw made their first contribution in #1071
@Vineg made their first contribution in #1072
@wewe5215 made their first contribution in #1077

Full Changelog: v0.7.6...v0.8.0-rc1

Contributors

Vineg, cbielow, and 3 other contributors


16 May 16:51

mr-c

v0.7.6

fefc785

v0.7.6

Summary

See, I knew we should release more often!

Details

Implementation of Arm intrinsics

NEON

neon/abd,ext,cmla{,_rot{180,270,90}}: additional wasm128 implementations 3a18dff @mr-c
neon/cvtn: basic implementation of a few functions fefc785 @mr-c
neon/mla_lane: initial implementation using mla+dup 554ab18 @ngzhian
neon/shl,rshl: fix avx include to unbreak amalgamated headers 3748a9f @mr-c
neon/shll_n: make vshll_n_u32 test operational 356db0c @mr-c
neon/qabs: restore SSE2 impl for vqabsq_s8 f614843 @mr-c

x86 intrinsics

mmx: loongson impl promotions over SIMDE_SHUFFLE_VECTOR_ 51bf6f2 @mr-c
x86/sse*,avx: add additional SIMD128 implementations e28a87e @mr-c

SSE*

sse{,2,3,4.1},avx: more WASM shuffle implementations 097dd12 @mr-c
sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
sse: allow native _mm_loadh_pi on MSVC x64 314452b @mr-c

AVX512

avx512: typo fix for typedef of __mmask64 e8390a3 4a9f01a @mr-c
avx512/madd: fix native alias arguments for _mm512_madd_epi16 bcf4adb @mr-c

Arch support

simde-arch: #include Hedley for setting F16C for MSVC 2022+ with AVX2 f9cf467 @mr-c

Testing with Docker/Podman & CI

tests: simde_assert_equal_{v,}f funcs were silently failing 395efd9 @mr-c
tests: Quiet another Clang < v5 warning that resurfaced d9d2b45 @mr-c
tests: audit use of HEDLEY_DIAGNOSTIC_PUSH and _POP 284c88a @mr-c
test: ignore -Wc99-extensions e264ff5 @mr-c
neon/aba: vaba_s32 test was not being run f86346a @mr-c
sve/and: the svand_n_s8_m test is incomplete, mark it as such b962f07 @mr-c
tests: combine declarations in test functions 76c7d37 @mr-c

Local testing with Docker/Podman

docker: add wasm64 target 29db539 @mr-c

Drone.io

remove Drone.io fd10911 @mr-c

GitHub Actions

gh-actions: confirm that all header files are installed 8d5e05a @mr-c
gh-actions: put wasm64 under CI 6702820 @mr-c

Netlify

netlify: disable for now caa0929 @mr-c

Misc

meson install: arm/neon/ld1 & x86/avx512.h 27836b1 @mr-c
Update clang version detection for 14..16 and add link 4957a9e @jan-wassenberg

Contributors

mr-c, ngzhian, and jan-wassenberg


05 May 05:35

mr-c

v0.7.4

0c26988

v0.7.4

SIMDe 0.7.4

Summary

Minimum meson version is now 0.54
40 new NEON families implemented, SVE API implementation started (14 families)
Initial support for x86 F16C API
Initial support for MIPS MSA API
Initial support for Arm Scalable Vector Extensions (SVE) API
Initial support for WASM SIMD128 API
Initial support for the E2K (Elbrus) architecture
MSVC has many fixes, now compiled in CI using /ARCH:AVX, /ARCH:AVX2, and /ARCH:AVX512

X86

There are a total of 7470 SIMD functions on x86, 2971 (39.77%) of which have been implemented in SIMDe so far.
Specifically for AVX-512, of the 5270 functions currently in AVX-512, SIMDe implements 1439 (27.31%)

Newly added function families

AVX512CD: 21 of 42 (50.00%)
AVX512VPOPCNTDQ: 18 of 18 💯
AVX512_4VNNIW: 6 of 6 (100.00%)
AVX512_BF16: 9 of 38 (23.68%)
AVX512_BITALG: 24 of 24 💯
AVX512_FP16: 2 of 1105 (0.18%)
AVX512_VBMI2: 3 of 150 (2.00%)
AVX512_VNNI: 36 of 36 💯
AVX_VNNI: 8 of 16 (50.00%)

Additions to existing families

AVX512F: 579 additional, 856 total of 2660 (31.80%)
AVX512BW: 178 additional, 335 total of 828 (40.46%)
AVX512DQ: 77 additional, 111 total of 399 (27.82%)
AVX512_VBMI: 9 additional, 30 total of 30 💯
KNCNI: 113 additional, 114 total of 595 (19.16%)
VPCLMULQDQ: 1 additional, 2 total of 2 💯

Neon

SIMDe currently implements 3745 out of 6670 (56.15%) NEON functions. If you don't count 16-bit floats and poly types, it's 3745 / 4969 (75.37%).

Newly added families

addhn
bcax
cage
cmla
cmla_rot90
cmla_rot180
cmla_rot270
fma
fma_lane
fma_n
ld2
ld4_lane
mlal_high_n
mlal_lane
mls_n
mlsl_high_n
mlsl_lane
mull_lane
qdmulh_lane
qdmulh_n
qrdmulh_lane
qrshrn_n
qrshrun_n
qshlu_n
qshrn_n
qshrun_n
recpe
recps
rshrn_n
rsqrte
rsqrts
shll_n
shrn_n
sqadd
sri_n
st2
st2_lane
st3_lane
st4_lane
subhn
subl_high
xar

MSA

Overall, SIMDe implements 40 of 533 (7.50%) functions from MSA.

Details

Implementation of Arm intrinsics

NEON

aarch64 + clang-1[345] fix for "implicit conversion changes signedness" a22c3cc @mr-c
neon: Implement f16 types 21496f6 @Glitch18
neon: port additional code to new style 1c744fd @nemequ
neon: replace some more abs/labs/llabs usage with simde_math_* versions c59853a @nemequ
neon: refactor to use different types on all targets c17957a @nemequ
neon: test for MMX/SSE instead of x86 when choosing implementation 0366dab @nemequ
neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian
neon/abs: add SSE2 integer abs implementations 6396dc8 @aqrit
neon/addhn: initial implementation e9ee066 @nemequ
neon/add: Implement f16 functions e69239c @Glitch18
neon/add{l,}v: SSE2/SSSE3 opts _vadd{lvq_s8, lvq_s16, lvq_u8, vq_u8} 8b4e375 dfffdde @mr-c
neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high b897331 @nemequ
neon/bcax: initial implementation 96ce481 0ed3dea @Glitch18
neon/bsl: Implement f16 functions edb75b5 @Glitch18
neon/cage: Initial f16 implementations 20df81d @Glitch18
neon/cagt: Implement f16 functions 452a6d3 @Glitch18
neon/ceq: Implement f16 functions f24ab3d @Glitch18
neon/ceqz: Implement f16 functions dd2ebf2 de301cd @Glitch18
neon/cge: Implement f16 functions a512986 f3ad0d4 647dc12 @Glitch18
neon/cgez: complete implementation of CGEZ family 6d86a20 @Glitch18
neon/cgt: Add implementation of remaining functions 9930c43 @Glitch18
neon/cgt, simd128: improve some unsigned comparisons on x86 ae6702a @nemequ
neon/cgtz: Add implementations of remaining functions 4d749b5 @Glitch18
neon/cle: add some x86 implementations 5906cc9 d81c7e7 @nemequ 7894c7d @Glitch18
neon/clez: Add implementations of scalar functions bc72880 @Glitch18
neon/clt: Add implementations of scalar functions & SSE/AVX512 fallbacks bc636e1 6a19637 @Glitch18
neon/cltz: Add scalar functions and natural vector fallbacks 2960ef0 @Glitch18
neon/cmla, neon/cmla_rot{90,180,270}: check compiler versions e98152f @nemequ
neon/cmla, neon/cmla_rot{90,180,270}: CMLA requires armv8.3+ 280faae @nemequ
neon/cmla, neon/cmla_rot{90,180,270}, neon/fma: initial implementation 2aff4f9 @Glitch18
neon/cnt: add x86 implementations of vcntq_s8 a558d6d @nemequ
neon/cvt: add __builtin_convertvector implementations d06ea5b @nemequ
neon/cvt: add out-of-range and NaN tests 7d0e2ac @nemequ
neon/cvt: add some faster x86 float->int/uint conversions ceaaf13 @nemequ
neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations 8398f73 @Glitch18
neon/cvt: cast result of float/double comparison dc215cd @ngzhian
neon/cvt: disable some code on 32-bit x86 which uses _mm_cvttsd_si64 48edfa9 @nemequ
neon/cvt: don't use vec_ctsl on POWER 8f9582a @nemequ
neon/cvt: fix a couple of s390x implementations' NaN handling a8bd33d @nemequ
neon/cvt: fix compilation with -ffast-math d1d070d @nemequ
neon/cvt: Implement f16 functions b6a9882 @Glitch18
neon/cvt, relaxed-simd: add work-around for GCC bug #101614 11aa006 @nemequ
neon/cvt, simd128: fix compiler errors on PPC 965e68e @nemequ
neon/cvt: clang bug 46844 was fixed in clang 12.0 71e03a6 @mr-c
neon/dot_lane: add remaining implementation 3f1c1fa 4a9ca8a @Glitch18
neon/dup_lane: Complete implementation of function family 12fb731 df320d1 @Glitch18 014ee00 9461557 @nemequ
neon/dup_lane: use dup_n 2b4a009 @ngzhian
neon/dup_n: Implement f16 functions 14fdf88 @Glitch18
neon/dup_n: replace remaining functions with dup_n implementations 27a13b0 @nemequ
neon/dupq_lane: native and portable 893db57 @ngzhian
neon/ext: add __builtin_shufflevector implementation de8fe89 @ngzhian
neon/ext: add _mm_alignr_{,e}pi8 implementations 6d28f04 @nemequ
neon/ext: clean up shuffle-based implementation f1de709 @nemequ
neon/ext: simde_*{to,from}_m64 reqs MMX_NATIVE 13ee902 @mr-c
neon/ext: unroll SIMDE_CONSTIFY for testing macro implemented functions 62834fa @mr-c
neon/fma: add a couple x86 and PPC implementations 7a2860b @nemequ
neon/fma: add more extensive feature checking e541dd1 @nemequ
neon/fma_lane: Implement fmaq_lane functions a77e6ad 555ef3e @Glitch18
neon/fma_n: initial implementation 06d5a62 @nemequ dab4342 @nemequ
neon/get_high: add __builtin_shufflevector optimizations 4003afa @ngzhian
neon/get_low: use __builtin_shufflevector if available ea3f75e @ngzhian
neon/hadd,hsub: optimization for Wasm ebe09d8 @ngzhian
neon/ld1: add Wasm SIMD implementation a79bc15 @ngzhian
neon/ld1_dup: native and portable (64-bit vectors), f64 debb3c8 @ngzhian 6c71aac @Glitch18
neon/ld1_dup: split from ld1, dup_n fallbacks, WASM implementations 4c586e0 @nemequ
neon/ld1: Implement f16 functions 6e89a9c f26f775 @Glitch18
neon/ld1_lane: Implement remaining functions de2de8d @Glitch18 9051a51 @ngzhian
neon/ld1q: u8_x2, u8_x3, u8_x4 341006c @ngzhian
neon/ld1[q]_*_x2: initial implementation cd14634 @dgazzoni
neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC e142a59 @nemequ
neon/ld{2,3,4}: silence false positive diagnostic on GCC 7 3f737a3 @nemequ
neon/ld2: Implement remaining functions e68f728 @Glitch18 3b3014f @ngzhian 078bb00 @nemequ 041b1bd @mr-c
neon/ld4_lane: native and portable implementations a973cab @ngzhian 179fb79 @Glitch18 0d1ab79 @nemequ
neon/ld4: use conformant array parameters 723a8a8 @nemequ
neon/ld4: work around spurious warning on clang < 10 64e9db0 @nemequ
neon/min: add SSE2 vminq_u32 & vqsubq_u32 implementation 2cf165e 117de35 @nemequ
neon/{min,max}nm: add some headers for -ffast-math ebe5c7d @nemequ
neon/{min,max}nm: use simde_math_* prefixed min/max functions c1607d2 @nemequ
neon/mlal_high_n: initial implementation d6f75fa @dgazzoni
neon/mlal_lane: initial implementation 82e36ed 2168ca0 @nemequ
neon/mls: add _mm_fnmadd_* implementations of vmls*_f* 70e0c20 @nemequ
neon/mlsl_high_n: initial implementation ca1a4c3 @dgazzoni
neon/mlsl_lane: initial implementation de78ae9 @nemequ
neon/mls_n: initial implementation 042c6eb @nemequ
neon/movl: improve WASM...