WebAssembly SIMD implementation for libjpeg-turbo #250
Comments
This will definitely not be supported with the 1.5.x branch. I'll look into it with the 2.0.x branch. |
/cc @dcommander please see my last EDIT above, thanks! |
libjpeg is a completely different product with a completely different build system. All indications are that emconfigure failed with libjpeg-turbo 1.5.x because of the SIMD extensions (i.e. because some of the source files have an extension of .asm.) The CMake failure appears to be due to Long and the short of it-- I'm going to have to reproduce this on my own machine and diagnose/debug it. Be patient. |
I can successfully build libjpeg-turbo 2.0 beta using
However, it complains that the SIMD objects are not valid LLVM bitcode, then gives warnings about unresolved SIMD symbols. At the moment, it's probably necessary to add |
@dcommander sorry for noob question but will building with |
It will still be faster than libjpeg, but only by about 30% or so. Definitely not the 2-6x speedup that you would get with the SIMD extensions. |
I am not proposing that as a long-term solution. We need to figure out how to enable the SIMD extensions with this. I just don't know how at the moment. |
Some further notes on building and testing libjpeg-turbo 2.0 using Emscripten: I just pushed a commit that preserves the value of
You can also embed test images by using the
Then you can do:
and connect to
Note also that the 32-bit native version is already slower than the 64-bit native version. It appears that getting SIMD acceleration to work will require porting the existing SSE2 NASM code to intrinsics, because Emscripten knows how to compile SSE2 intrinsics into bitcode. I have no idea how the resulting code will perform. It depends on how much overhead there is to the Emscripten intrinsics implementation. The good news is that I have funding to look into this, in the context of building a WASM version of the TurboVNC Viewer. |
thanks for the update and glad to hear getting SIMD acceleration is on your roadmap! |
It seems that SIMD is not yet supported in WebAssembly: https://webassembly.org/docs/future-features/ , emscripten-core/emscripten#6445 |
The third link you posted says: "Emscripten has full support for compiling code that utilizes the SSE1, SSE2, SSE3 and SSSE3 intrinsic function calls." Am I missing something? |
It's only when targeting pure javascript, not WebAssembly. |
Closing for now, since it doesn't appear that we'll have SIMD support in WASM anytime soon, and the build works fine otherwise -- as long as you pass |
I happened to come across this issue. Wasm SIMD has experimental support in Emscripten, as well as in Chrome behind a flag. You can try this out using the latest-upstream version of Emscripten, passing SIMD=1 at both compile time and link time, and by enabling SIMD in Chrome, either by turning on the WebAssemblySimd flag or by passing --js-flags="--experimental-wasm-simd" on the command line. This is not on by default, but it is ready to be experimented with as of Chrome 75+. |
How does WASM handle SIMD? Does it require specific instructions or intrinsics? I assume it can't just parse raw assembly code, which is what libjpeg-turbo currently uses. |
The WebAssembly SIMD proposal introduces a new set of SIMD types/operations. This is still in Phase 1, so not a part of the official standard yet, but we have an experimental implementation, and are gathering feedback in the form of early performance numbers. |
The WebAssembly SIMD proposal reached phase 4 last week, which means that SIMD will be shipped within the Wasm engines.
Porting the x86 SSE NASM code to an intrinsics implementation does indeed seem to be the best step forward. Emscripten provides several SSE compat headers whose intrinsics usually map directly to a native Wasm SIMD opcode. If such an opcode is missing, it is emulated with at most a few other Wasm SIMD instructions or, in the worst case, a scalarized fallback is taken. Alternatively, one could try to compile the recently added Arm Neon intrinsics to Wasm SIMD. However, this requires the implementation of some functions within the SIMDe project, see: |
Thanks for the info. Given how difficult it is to optimize x86 SIMD code (primarily because of the lack of available registers), I think the path of least resistance here is to develop a whole new SIMD implementation just for WASM. That would allow us to use the most optimal operations available under WASM without having to worry about how (or even whether) the SSE2 intrinsics are being implemented. It would also avoid messing with the existing SSE2 implementation. The NEON intrinsics overhaul in libjpeg-turbo 2.1 was painful, because GCC (as opposed to Clang) has an incomplete and/or suboptimal implementation of NEON intrinsics. Thus, I had to figure out which algorithms, when implemented using NEON intrinsics, regressed in performance under GCC and which ones didn't. Then I had to figure out how to auto-detect the likelihood of performance regression (based on whether a few key NEON intrinsics were implemented in the compiler) and fall back to the old GAS implementations of those algorithms if necessary. That pain was justifiable, because the old NEON GAS implementation required GCC or Clang (no Windows support except with MinGW), and older versions of Clang couldn't build the GAS code without gas-preprocessor. With x86, however, NASM is universally available and easy to install on all x86 operating systems, so there isn't any compelling reason to switch to intrinsics, and doing so would likely be a lot more painful than the NEON intrinsics overhaul. |
Hi, I work on the WebAssembly SIMD proposal and also the implementation in V8.
Emscripten has practically full support for
This has been slowly worked on, and I'm happy to say some projects are compiling fine such as Simd the image processing library, and libwebp. EDIT: I just found simd-everywhere/simde#646 (comment) so this is not a new result :) But I have some benchmark numbers based on I also ported libjpeg-turbo to Wasm SIMD via the NEON intrinsics. If you compare the change, it's fairly small:
Things compile and run fine, but the speedup is not amazing. This is a somewhat known problem: the NEON intrinsic support in SIMDe is not heavily optimized yet, and experiments like this can help to uncover areas for improvement. The other issue is that I am running this on an x86_64 build of d8, so there are multiple levels of translation here: NEON -> SIMDe (which could emit scalarized code or Wasm intrinsics) -> x86. I have some preliminary numbers to share, see https://docs.google.com/spreadsheets/d/1g8eh_DJKemTsyJGyE3I8LwPMDCr8tb_-9LpZvMxEHwU/edit#gid=0 (I can't make it public due to reasons; request access and I will grant it, sorry for the trouble). An interesting thing (which could be a separate bug) is that I'm not seeing performance differences between scalar and SIMD on native builds. I'm just using |
@ngzhian That all sounds promising. Do you think that an SSE2 intrinsics implementation of the libjpeg-turbo SIMD extensions would have better performance under WASM? If I understand your comment, it seems as if using x86 intrinsics would require fewer layers of translation. I’m not sure why you are not seeing a performance difference with |
A note about funding: This effort was initially funded by one of my clients that was interested in a WASM port of TurboVNC, but unfortunately they have decided to go in a different direction. Thus, there is no longer any direct funding to pursue a WASM port of libjpeg-turbo. If an organization reading this would like to fund this effort, please contact me. Otherwise my ability to work on this feature will be limited. |
Most likely yes. I did an earlier experiment for libwebp (see https://docs.google.com/document/d/1gm2GOZV6yUj31BAT9P83V_GoqJSyGi0NLVfXzjG2AuQ/edit?usp=sharing; request access and I will share), and the NEON headers were slower than the SSE headers when I ran the result on x64 V8. There are two main reasons. One is the multiple layers of translation, as mentioned (unfortunately I don't have an arm64 system readily available for testing). The bigger reason is probably the mismatch between commonly used NEON instructions and the available Wasm instructions, such as load/store multiple elements. Those don't map well to Wasm and end up being scalarized.
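To illustrate the "load multiple elements" mismatch: a NEON structure load such as `vld3q_u8` reads 48 interleaved bytes (for example, 16 RGB pixels) and de-interleaves them into three 16-byte vectors in a single instruction. Wasm SIMD has plain, splat, lane, and extending loads but no de-interleaving load, so a translation layer has to rebuild the result from shuffles or per-element code. Naming `vld3q_u8` is an assumption made for illustration; the comment above does not say which instructions were involved. The following is only a scalar Python model of the semantics, and `deinterleave3` is a hypothetical name, not a SIMDe or Emscripten function.

```python
def deinterleave3(buf: bytes) -> tuple[bytes, bytes, bytes]:
    """Scalar model of a 3-way de-interleaving load (the semantics of NEON vld3q_u8).

    Takes 48 interleaved bytes (x0 y0 z0 x1 y1 z1 ...) and returns three
    16-byte "vectors": all x values, all y values, all z values.
    """
    if len(buf) != 48:
        raise ValueError("expected exactly 48 bytes (3 x 16 lanes)")
    # Each output lane i comes from input byte 3*i + channel. Without a
    # dedicated instruction this is one element at a time (or several shuffles).
    return buf[0::3], buf[1::3], buf[2::3]


if __name__ == "__main__":
    pixels = bytes(range(48))  # stand-in for 16 interleaved RGB pixels
    r, g, b = deinterleave3(pixels)
    print(list(r[:4]), list(g[:4]), list(b[:4]))
```

Every output lane depends on a strided input position, which is why a fallback that lacks a matching opcode tends to degrade into per-element work.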
I bet I made a mistake somehow, let me try this again and report back, thanks! And thank you for your honest view on the funding situation, and being responsive on the issue tracker :) |
It turns out there were issues with how I was running the test. I am now seeing differences between native SIMD and no SIMD (from 23% to 211% speedup).
It looks like Wasm SIMD via the Neon intrinsics is not getting us much of a performance win. I probably need to look closer at what the issue is. |
A 23-211% speedup is still quite low for libjpeg-turbo. The x86-64 SIMD extensions in libjpeg-turbo should boost performance by roughly 200-700%. Perhaps your tests are measuring other operations, such as I/O or pixel format conversion, that limit the measured speedup of libjpeg-turbo because of Amdahl's Law. |
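As a rough illustration of the Amdahl's Law point: if only part of a benchmark's runtime is spent in SIMD-accelerated code, the measured overall speedup is capped no matter how fast that part becomes. The sketch below is a worked formula only. The 50% and 90% fractions are made-up inputs, not measurements from this thread, and reading "211% speedup" as 3.11x is an assumption about how the percentages above were computed.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's Law: overall speedup when only a fraction of runtime is accelerated.

    accelerated_fraction: share of the original runtime spent in the SIMD-able code (0..1).
    kernel_speedup: how many times faster that share runs with SIMD (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction            # time that SIMD does not touch
    accelerated = accelerated_fraction / kernel_speedup  # time left in the SIMD part
    return 1.0 / (remaining + accelerated)


def percent_faster(speedup_factor: float) -> float:
    """Convert a speedup factor (e.g. 3.11x) into 'percent faster' (e.g. 211%)."""
    return (speedup_factor - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs: a 4x faster kernel that covers 50% vs. 90% of the runtime.
    for fraction in (0.5, 0.9):
        s = overall_speedup(fraction, 4.0)
        print(f"fraction={fraction:.0%}: {s:.2f}x overall ({percent_faster(s):.0f}% faster)")
```

With these hypothetical numbers, a 4x kernel that covers half of the runtime yields only 1.6x overall, which is how I/O or pixel format conversion in the timed path could mask most of the SIMD gain.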
I am running the benchmarks like so: |
Ah, OK. That particular image is a bit of a corner case. I include it in order to get a sense of the library's performance with images that contain a lot of sharp lines, but some of the other images may be more representative of the library's overall performance. If I have to test just one image, I generally choose artificial.ppm. |
Okay! I will rerun this with artificial.ppm, any other images from https://github.com/libjpeg-turbo/libjpeg-turbo/tree/main/testimages you would suggest? I would like to run with at least 3 images. |
With artificial.ppm, I'm seeing Wasm SIMD not being any faster than scalar :( Very strange. Edit: Hm, it seems like I'm missing JSIMD_FORCENEON |
You should not use the images under the testimages/ directory in the source tree, as those are very small and are intended only for regression testing. Use artificial.ppm, nightshot_iso_100.ppm, and vgl_5674_0098.ppm available at https://libjpeg-turbo.org/About/Performance. |
Trying to force-enable NEON with JSIMD_FORCENEON doesn't seem to work well. The runs are even slower, and I get some corruption:
Which is really weird. I also tried using the changes here kleisauke/wasm-vips@acd4c81 instead of my own, in case I missed anything, and I get the same issues. (Anyway, no response necessary, I'm just keeping a log of things I tried and problems I faced.) |
I'm sorry, but I still cannot use CMake to configure a WASM build, even with SIMD disabled. When I run cmake I get this error: I see this question has been open for 3 years, but I didn't find any answer to it. Any help will be appreciated. P.S. Is there ready-to-use code for the fastest JPEG decoding using Emscripten SIMD? |
Can you show the exact commands you are using?
This builds all the js and wasm files inside of the folder The important thing is to use |
@Honya2000 This issue is a feature request, not a technical support question. It tracks the potential for a WebAssembly SIMD implementation in libjpeg-turbo. That implementation does not yet exist. WebAssembly is not officially supported by The libjpeg-turbo Project yet, and until it is, you are responsible for figuring out how to make it work. I do not guarantee that it works at all. The best I can do at the moment is to show you how I personally build libjpeg-turbo using Emscripten on Linux. I have not tried to do likewise on Windows.
|
I was able to compile with emcmake. |
Just a note to say I would like to see SIMD support for libjpeg-turbo in Emscripten. |
Some benchmarks, loading an 859 x 1186 pixel JPEG.
Benchmarking PNG decoding, I have found WASM to be about 2.5x slower than native. So I expect we would see some performance increases with SIMD support in Emscripten libjpeg-turbo. |
@Ono-Sendai Refer to the comments above. You can already use libjpeg-turbo's Neon SIMD intrinsics with WASM, but apparently the performance is not spectacular. (I have not tested this myself.) Better performance may be possible with x86-64 SIMD intrinsics (#732), but that needs funding. I used to have specific funding to look into libjpeg-turbo/WASM, but the company providing that funding no longer needs this feature. Thus, it hasn't been a high priority. Any new SIMD implementation in libjpeg-turbo will be an expensive proposition, and any other solution to this problem would require using one of our existing SIMD implementations. Thus, the solution is either mostly dependent on downstream work in Emscripten, or it is mostly dependent on upstream funding to extend libjpeg-turbo. |
I tried the aarch64 hack mentioned in #250 (comment)
I hope you get funding for this work (I fund you $5 a month which probably doesn't cover it :) ) Presumably there is some way of compiling your hand-written assembly code into WASM. |
I'm trying to compile to webassembly.
fails:
Also tried:
emconfigure ../configure --disable-shared ; same error
fails:
Any help would be appreciated!
EDIT
with libjpeg instead of libjpeg-turbo, this all works fine: