Support for WebAssembly SIMD #81

Open
omnisip opened this issue Dec 17, 2020 · 35 comments
Labels
enhancement New feature or request

Comments

@omnisip

omnisip commented Dec 17, 2020

Hi there,

Are you intending on adding support for WebAssembly SIMD?

@gquintin gquintin added the question Further information is requested label Dec 18, 2020
@gquintin
Contributor

Hi @omnisip,

We are not very familiar with web technologies. I searched the internet a bit and found "emscripten". As I understand it, C/C++ code can be compiled into LLVM IR, and emscripten then takes the latter to produce WebAssembly. So NSIMD should work out of the box for producing WebAssembly.

But I feel I am missing something. Can you elaborate?

Thanks.

@gquintin
Contributor

After discussing it with a colleague, I (think I) understand that you in fact want a new SIMD architecture for NSIMD, namely WebAssembly, which wraps the WebAssembly intrinsics. Is that correct?

@gquintin gquintin added enhancement New feature or request and removed question Further information is requested labels Dec 18, 2020
@omnisip
Author

omnisip commented Dec 18, 2020 via email

@gquintin
Contributor

@omnisip, is there a "simple" way to test an implementation? Is it necessary to have a browser to execute the resulting code? On my testing servers I do not have any GUI installed and would like to keep it that way. Can you give us any insight into this?

@omnisip
Author

omnisip commented Dec 23, 2020

Sure. There are at least a few ways to get going. First, if the functions you use are part of the current standardization effort (presumably the bulk of them), support has been backported to Node.js 14 and can be enabled with the flag --experimental-wasm-simd. See here: https://emscripten.org/docs/porting/simd.html

If you need the newer stuff, you can compile V8 from source and run d8, which is like a lightweight Node.js.

For your purposes, in case you weren't already planning on using it, I'd use emscripten.

I'm CC'ing @tlively here, who works on this as part of the standardization effort.

@gquintin
Contributor

@omnisip, thank you for your answer. I also assume you plan to support AVX-style widths and so on in the future, so I should take that into account in NSIMD and prepare for 256-bit WebAssembly. Is that correct?

@omnisip
Author

omnisip commented Dec 23, 2020

We expect to add 256-bit, but probably not until after the current standard is finalized.

I would start with 128 before messing with 256.

@gquintin
Contributor

Hi @omnisip, I have successfully built and installed emscripten, binaryen, and d8 (from V8), and tried a C++ Hello World:

$ cat test.cpp
#include <iostream>

int main() {
  std::cout << "Hello World!\n";
  return 0;
}
$ em++ test.cpp
$ d8 a.out.js
Hello World!

I will try some SIMD now.

@gquintin
Contributor

@omnisip, I used tools/dev/gm.py to compile d8 but did not find any installation-prefix parameter. Do you know how to do this?

@tlively

tlively commented Jan 28, 2021

You might find it easier to install nightly V8 from jsvu than to build it yourself.

@gquintin
Contributor

gquintin commented Jan 29, 2021

@tlively thanks for the tip. SIMD seems ok.

#include <iostream>
#include <wasm_simd128.h>

int main() {
  float a[128], b[128], c[128];
  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
    a[i] = (float)i;
    b[i] = (float)(sizeof(a) / sizeof(a[0]) - i);
    c[i] = 0.0;
  }

  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i += 4) {
    v128_t va = wasm_v128_load((void *)(a + i));
    v128_t vb = wasm_v128_load((void *)(b + i));
    v128_t vc = wasm_f32x4_add(va, vb);
    wasm_v128_store((void *)(c + i), vc);
  }

  for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
    std::cout << a[i] << " + " << b[i] << " = " << c[i] << "\n";
  }
  return 0;
}

It compiles fine and executes fine:

$ em++ -msimd128 test.cpp
$ d8 --experimental-wasm-simd a.out.js
0 + 128 = 128
1 + 127 = 128
2 + 126 = 128
[...]
126 + 2 = 128
127 + 1 = 128

Wrapping the intrinsics into NSIMD should not take long, but I have to find a contiguous block of 3-4 hours.

@tlively

tlively commented Jan 29, 2021

Note that the intrinsics are not quite stable yet because the proposal is still being finalized. I don't anticipate many changes to existing intrinsics, but there will be new ones added. Hopefully they will be stabilized in the next few weeks.

@gquintin
Contributor

I am looking at emcc defines and found:

#define __wasm 1
#define __wasm32 1
#define __wasm32__ 1
#define __wasm__ 1

Does this mean that WASM bytecode is "32-bit"? If so, is a "64-bit" bytecode planned?

@tlively

tlively commented Jan 29, 2021

Yes, you can track its progress at https://github.com/WebAssembly/memory64.

@gquintin
Contributor

Hi @tlively, I have compiled NSIMD (the emulation version) with emscripten. It worked fine, but there are warnings that I want to get rid of:

em++: warning: LLVM version appears incorrect (seeing "11.0", expected "12.0") [-Wversion-check]

I tried compiling with -Wno-version-check but it does not work; I still get the warning. Do you know how I can suppress it?

@tlively

tlively commented Jan 29, 2021

How did you install Emscripten? With the recommended method of installing via the emsdk, you shouldn't see that warning.

@gquintin
Contributor

I found out why: I had made a mistake in my .emscripten file while trying something out.

I am currently wrapping intrinsics. If I am not mistaken, the extract_lane's and the replace_lane's are not provided for all the base types (signed and unsigned 8-, 16-, 32- and 64-bit integers). I think the same goes for addition and subtraction. It does not change anything in terms of assembly (at least for x86 and Arm), but from a programmer's point of view it is a burden to write reinterprets all the time to use a given intrinsic, especially since writing correct reinterprets is not that easy:

  • GCC allows the use of union for that
  • C99 also allows union
  • C++ does not provide any mechanism for that (except C++20's std::bit_cast)
  • memcpy has to be used in many cases, which is awkward

As far as I know, only NVCC has builtins for this: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__CAST.html (the *_as_* builtins).

Most compilers recognize this pattern and optimize it away, but writing:

i32 reinterpret(u32 a) {
  i32 ret;
  memcpy((void *)&ret, (void *)&a, sizeof(i32));
  return ret;
}

int main() {
  v128_t vec;
  u32 val;
  vec = wasm_i32x4_replace_lane(vec, 0, reinterpret(val));
  vec = wasm_i32x4_replace_lane(vec, 0, (i32)val); // not correct (except maybe in C++20)
}

is inconvenient and awkward, and a lot of people cannot use C++20. Moreover, reading the code is not straightforward, as one has to determine which base type the code is manipulating (i32 or u32?). So my advice is to provide intrinsics for all types (where it makes sense, of course), even if several of them translate into the same assembly opcode.
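
For reference, a minimal C++20 sketch of the std::bit_cast alternative mentioned above (illustrative, not NSIMD code):

#include <bit>
#include <cstdint>

// std::bit_cast copies the bit pattern with no conversion semantics,
// and is fully defined behavior in C++20.
std::int32_t reinterpret_i32_u32(std::uint32_t a) {
  return std::bit_cast<std::int32_t>(a);
}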

@gquintin
Contributor

I did not find the equivalent of _mm_undefined_*; is there one? It would be great to have one to avoid warnings, as we sometimes need an uninitialized SIMD vector, particularly when emulating non-existent intrinsics:

v128_t div_i32x4(v128_t a, v128_t b) {
  v128_t ret;
  ret = wasm_i32x4_replace_lane(ret, 0,   // warning here, but it is intended
            wasm_i32x4_extract_lane(a, 0) / wasm_i32x4_extract_lane(b, 0));
  ret = wasm_i32x4_replace_lane(ret, 1,
            wasm_i32x4_extract_lane(a, 1) / wasm_i32x4_extract_lane(b, 1));
  ret = wasm_i32x4_replace_lane(ret, 2,
            wasm_i32x4_extract_lane(a, 2) / wasm_i32x4_extract_lane(b, 2));
  ret = wasm_i32x4_replace_lane(ret, 3,
            wasm_i32x4_extract_lane(a, 3) / wasm_i32x4_extract_lane(b, 3));
  return ret;
}

@gquintin
Contributor

Sorry to make requests like that...

@omnisip
Author

omnisip commented Jan 30, 2021

I think there are extract_lane and replace_lane intrinsics for all of the types. See here: https://github.com/WebAssembly/simd/blob/master/proposals/simd/ImplementationStatus.md

With respect to undefined, there are no intrinsics for that, but I'm certain there is a way to avoid the warnings. If x64 doesn't have it, the emulation layer for NEON does, with all of the vreinterpret* intrinsics.

@omnisip
Author

omnisip commented Jan 30, 2021

If you just set it to zero in that case, I'm pretty sure clang will optimize it away. Second, you can use the compiler's built-in vector support to do those ops too, even where we don't provide intrinsics.
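
For illustration, a minimal sketch of the built-in vector support mentioned above, assuming clang/GCC's vector_size extension (div_i32x4 here is an illustrative name):

#include <stdint.h>

typedef int32_t v4i32 __attribute__((vector_size(16)));

// Element-wise division without any intrinsic: the compiler lowers the
// vector operator to per-lane divisions (or better, where possible).
v4i32 div_i32x4(v4i32 a, v4i32 b) {
  return a / b;
}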

@gquintin
Contributor

I meant for all base types. Taking extract_lane, there are:

  • wasm_i8x16_extract_lane
  • wasm_u8x16_extract_lane
  • wasm_i16x8_extract_lane
  • wasm_u16x8_extract_lane
  • wasm_i32x4_extract_lane
  • wasm_i64x2_extract_lane
  • wasm_f32x4_extract_lane
  • wasm_f64x2_extract_lane

I think you should also provide:

  • wasm_u32x4_extract_lane
  • wasm_u64x2_extract_lane

For the undefined, I agree with you, but I want to make the compiler's life as easy as possible. Counting on the compiler to optimize code has often led to disappointing results, hence on large codebases I do not trust the compiler with __m128i v = _mm_setzero_si128(); instead of __m128i v = _mm_undefined_si128();.

@omnisip
Author

omnisip commented Jan 30, 2021

Okay. I'm pretty sure the unsigned and signed variants are synonyms for each other, with the exception of 8-bit. The underlying bit representations don't change. If you'd like, I can file a ticket or make a PR to add the synonyms for those two.

With respect to undefined, I've been thinking about this today and would suggest using the GCC vector intrinsic syntax for the cases where you need it. That way you can treat each input as part of an array, and when you want to return a vector you can use the same syntax or the corresponding make*.

Examples for both of these cases should exist in wasm_simd128.h.

@gquintin
Contributor

You are right that the signed and unsigned variants "are the same" from the chip's point of view. This also goes for add, sub, and mul, for example, which do not have unsigned variants. It won't change a thing for the compiler/chip, but it makes life easier for the programmer using the intrinsics. I can open an issue if you want me to. I do not know whether I will have enough time for a PR.

@tlively

tlively commented Jan 31, 2021

This is great feedback, thanks @gquintin! We can definitely add some sort of undefined intrinsic. Can you help me understand why using C-style casts is not sufficient for casting, e.g., the result of wasm_i64x2_extract_lane to uint64_t or similar? I would like to avoid having multiple versions of intrinsics that lower to the same WebAssembly instruction where there is a reasonable workaround for users.

@gquintin
Contributor

That's because the standard does not define this kind of cast in all cases. More precisely, when casting between signed/unsigned/float types, the standard basically says that if the value can be represented in the destination type then all goes well, but otherwise it is undefined behavior or implementation-defined (I do not remember which). But in our situation:

int main() {
  v128_t vec;
  u32 val;
  vec = wasm_i32x4_replace_lane(vec, 0, (i32)val);
}

if val has a value greater than INT_MAX, then the standard gives no guarantee that the bits of (i32)val will be the same as the bits of val, because a value greater than INT_MAX cannot be represented in a 32-bit signed integer. What I say is true for all C and C++ standards before C++20: C++20 imposes two's complement as the representation of integers, so (even though I have not checked the C++20 standard itself) it stands to reason that such casts are completely defined there (at least for integers) and behave as expected, that is, no change to the bit pattern, thanks to the properties of two's complement.

That said, I think that in 99% of cases the C cast will just work as we expect, but it is not standard C/C++. I know I am a pain and "tatillon" (French for persnickety), but in NSIMD we chose to write code as portable as possible, hence we do not use C casts for this kind of operation, as we want to support many compilers/OSes/architectures. As an example, here is our u64 to i64 reinterpret function:

NSIMD_INLINE i64 nsimd_scalar_reinterpret_i64_u64(u64 a0) {
#ifdef NSIMD_IS_GCC
  union { u64 from; i64 to; } buf;
  buf.from = a0;
  return buf.to;
#else
  i64 ret;
  memcpy((void *)&ret, (void *)&a0, sizeof(ret));
  return ret;
#endif
}

@gquintin
Contributor

I quote the C89 standard (sorry, I am kind of old-fashioned); I guess one can find the same in all the other standards:

3.2.1.2 Signed and unsigned integers
[...]
When [...] an unsigned integer is converted to its corresponding signed integer, if the value cannot be represented the result is implementation-defined.

@omnisip
Author

omnisip commented Jan 31, 2021

Does the WebAssembly spec sufficiently address your concerns? See here: https://webassembly.github.io/spec/core/exec/numerics.html

@gquintin
Contributor

gquintin commented Feb 1, 2021

If in some standards it is undefined behavior, then no, because that document specifies how the WebAssembly VM works. As on Intel or Arm chips, integers are represented using two's complement, but that does not change the C/C++ standards, and compilers can do what they want with (i32)val. One can imagine that if val > INT_MAX then (i32)val == INT_MAX; I don't think this would contradict the standards.

For C89 it is implementation-defined. Also, in the "Conversions" paragraph of numerics.html, there is "Else the result is undefined." for the "trunc" operators, which is the same as undefined behavior in the C/C++ standards.

Plus, I am not sure that relying on implementation-defined behavior is wise.

@tlively

tlively commented Feb 1, 2021

in the "Conversions" paragraph of numerics.html there are "Else the result is undefined." about the "trunc" operators which is the same as undefined behavior for the C/C++ standards.

This is a little confusing because of the structure of the WebAssembly spec, but the actual truncation instructions are defined to trap if the result cannot be represented. WebAssembly does not have any undefined behavior. (In this particular case, the LLVM WebAssembly backend inserts range checks to avoid the trap at runtime.)

But given that clang is currently the only C/C++ compiler that compiles to WebAssembly, you should be able to depend on its signed/unsigned integer casting behavior in WebAssembly SIMD code.

But what do other architectures do about this problem? Does x86 have separate signed and unsigned versions of its corresponding intrinsics?

@gquintin
Contributor

gquintin commented Feb 2, 2021

Intel does as you do: intrinsics are not provided for all types. You have _mm_add_epi32 for signed int but no _mm_add_epu32 for unsigned int. The same goes for subtraction, multiplication, and many, many others.

On Arm you have both intrinsics: vaddq_s32 for signed int and vaddq_u32 for unsigned int. Both intrinsics are compiled into the same instruction (see the sketch after this list):

  • opcode = 4ea18400
  • assembly = add v0.4s, v0.4s, v1.4s
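
A minimal NEON sketch illustrating this (both wrapper functions compile to that same add instruction; the function names are illustrative):

#include <arm_neon.h>

// Signed and unsigned 32-bit vector additions lower to the same opcode.
int32x4_t add_s32(int32x4_t a, int32x4_t b) { return vaddq_s32(a, b); }
uint32x4_t add_u32(uint32x4_t a, uint32x4_t b) { return vaddq_u32(a, b); }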

So I guess both approaches exist. I cannot speak for all programmers, but my colleagues and I prefer to have intrinsics for all types, knowing that they ultimately produce the same binary code. We see several advantages to that approach:

  1. Ease of writing code with intrinsics (well, that's relative, but anyway).
  2. Ease of reading code (and understanding the underlying type being manipulated).
  3. Ease of maintaining the code (a consequence of the above).

Take the example of the NSIMD gather_linear operator and compare its SSE 4.2 and AArch64 implementations (I have removed some noise):

__m128d nsimd_gather_linear_sse42_f64(f64 const* a0, int a1) {
  __m128d ret;
  ret = _mm_undefined_pd();
  ret = _mm_castsi128_pd(_mm_insert_epi64(
            _mm_castpd_si128(ret),
            nsimd_scalar_reinterpret_i64_f64(a0[0 * a1]), 0));
  ret = _mm_castsi128_pd(_mm_insert_epi64(
            _mm_castpd_si128(ret),
            nsimd_scalar_reinterpret_i64_f64(a0[1 * a1]), 1));
  return ret;
}

vs

float64x2_t nsimd_gather_linear_aarch64_f64(f64 const* a0, int a1) {
  float64x2_t ret;
  ret = vdupq_n_f64(a0[0]);
  ret = vsetq_lane_f64(a0[1 * a1], ret, 1);
  return ret;
}

There is no Intel intrinsic for inserting doubles (f64) into a SIMD register, which I guess would be called _mm_insert_pd. Intel only provides _mm_insert_epi64 (for signed 64-bit integers). As a consequence, one has to do many reinterprets (called casts in the Intel intrinsics) to get correct/portable code. Moreover, as I said in an earlier post, the implementation of these scalar reinterpret functions is not that simple and involves a memcpy that the compiler has to optimize away; for example nsimd_scalar_reinterpret_f64_i64:

f64 nsimd_scalar_reinterpret_f64_i64(i64 a0) {
#ifdef NSIMD_IS_GCC
  union { i64 from; f64 to; } buf;
  buf.from = a0;
  return buf.to;
#else
  f64 ret;
  memcpy((void *)&ret, (void *)&a0, sizeof(ret));
  return ret;
#endif
}

So I guess one could add to the advantages:

  4. Simpler code for the compiler, hence more opportunities for optimization.

For the "truncation" stuff I think I did not look at your document carefully but if indeed you spec says something like or the following can be implied from other statements "conversion from N-bits unsigned integer to N-bits signed integers leaves the bits as-is" then indeed one can use the C-style cast conversion. In NSIMD we avoid as much as possible undefined behaviors and implementation defined stuff.

@tlively

tlively commented Feb 2, 2021

Thanks for the detailed information! This is very useful feedback, and I will take it under consideration as we finalize the WebAssembly SIMD intrinsics over the next couple months 👍

@omnisip
Author

omnisip commented Feb 2, 2021

That's an interesting sample. Dup is the standard behavior on Arm for initializing a register, but isn't standard on x86/x64. When writing code for WASM SIMD, it's helpful to know some of these things, since you want the compilers (clang + V8, for instance) to generate optimal code for your target platform. The best solution for this case might be a bit different from what you expect.

For instance, if you're implementing gather, which always pulls from memory, you might be incentivized to use the load32_zero or load64_zero instructions, which initialize a vector from the first 4 or 8 bytes without reading past that boundary, and then insert from there. There's even an argument for using load64_zero twice, producing two normal memory loads into vectors, before shuffling the vectors together (see the sketch below). That would ensure you're not hit with a cross-boundary penalty while also minimizing the number of shuffle ops.
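
A minimal sketch of that two-loads-then-shuffle idea, assuming the wasm_v128_load64_zero and wasm_i64x2_shuffle intrinsics are available in your header version (gather_linear_f64 is an illustrative name):

#include <wasm_simd128.h>

// Gather two doubles with a linear stride: two load64_zero loads,
// then one shuffle taking lane 0 of each input.
v128_t gather_linear_f64(const double *p, int stride) {
  v128_t lo = wasm_v128_load64_zero(p);          // [p[0], 0]
  v128_t hi = wasm_v128_load64_zero(p + stride); // [p[stride], 0]
  return wasm_i64x2_shuffle(lo, hi, 0, 2);       // [p[0], p[stride]]
}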

For testing and optimization purposes, it would be good to compare this solution against what the WASM SIMD make ops produce, to see whether the compiler is smart enough to generate it on its own.

As a side note, such optimizations would probably be prudent on your existing x64 and NEON code if the compiler isn't making them already.

With respect to the casts, I understand both perspectives. If we came up with a solution with preprocessor macros that didn't increase code file size much and didn't add significant processing time for builds, we could probably add the synonyms with minimal complexity.
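
As one hedged sketch of that macro approach (these synonym names are illustrative, not the real header): since converting a signed integer to its unsigned counterpart is well-defined in C (reduction modulo 2^N), the unsigned extract_lane synonyms could simply cast the signed results:

#include <stdint.h>
#include <wasm_simd128.h>

// Hypothetical synonym macros; int -> unsigned conversion is well-defined,
// so the bit pattern is preserved.
#define wasm_u32x4_extract_lane(v, i) \
  ((uint32_t)wasm_i32x4_extract_lane((v), (i)))
#define wasm_u64x2_extract_lane(v, i) \
  ((uint64_t)wasm_i64x2_extract_lane((v), (i)))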

@gquintin
Contributor

Let me put here what I said to @omnisip by mail, as a status update for this issue:

I have created a "wasm" branch: https://github.com/agenium-scale/nsimd/tree/wasm

It is only a matter of filling in egg/platform_wasm.py. I am currently very busy (any contribution welcome), but I got up to the load[234][au] operators. Their purpose is to deinterleave on load; for example, load2 turns

RIRIRIRIRIRI... from memory into --> RRRR IIII (2 SIMD registers)

Same for load3 (RGBRGBRGBRGB ---> RRRR GGGG BBBB) and load4. The shuffles are a bit tricky to write (a sketch follows below), but once this is done it will be straightforward for the other operators.
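
As an illustration, a minimal sketch of load2 for f32 under these assumptions (load2_f32 is an illustrative name; the real NSIMD operator names differ):

#include <wasm_simd128.h>

// Deinterleave RIRIRIRI (two f32x4 loads) into RRRR and IIII
// using two shuffles.
void load2_f32(const float *p, v128_t *r, v128_t *i) {
  v128_t v0 = wasm_v128_load(p);     // R0 I0 R1 I1
  v128_t v1 = wasm_v128_load(p + 4); // R2 I2 R3 I3
  *r = wasm_i32x4_shuffle(v0, v1, 0, 2, 4, 6); // R0 R1 R2 R3
  *i = wasm_i32x4_shuffle(v0, v1, 1, 3, 5, 7); // I0 I1 I2 I3
}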

@gquintin
Contributor

gquintin commented Jun 1, 2021

Status update: a colleague is currently continuing the implementation of WASM support in NSIMD. I'll keep you posted.
