WebAssembly SIMD implementation for libjpeg-turbo #250

Open
timotheecour opened this issue Jun 25, 2018 · 38 comments

@timotheecour

timotheecour commented Jun 25, 2018

I'm trying to compile libjpeg-turbo to WebAssembly.

  • approach 1:
git checkout 1.5.3 #last git tag with ./configure
mkdir build && cd build
emconfigure ../configure

fails:

checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... ERROR:root:no input files
note that input files without a known suffix are ignored, make sure your input files end with one of: ('.c', '.C', '.i', '.cpp', '.cxx', '.cc', '.c++', '.CPP', '.CXX', '.CC', '.C++', '.ii', '.m', '.mi', '.mm', '.mii', '/dev/null', '.bc', '.o', '.obj', '.lo', '.dylib', '.so', '.a', '.ll', '.h', '.hxx', '.hpp', '.hh', '.H', '.HXX', '.HPP', '.HH')
darwin17.6.0 dyld

ERROR:root:Configure step failed with non-zero return code 1! Command line: ['../configure'] at /path_to/image_compression/libjpeg-turbo/build

Also tried: emconfigure ../configure --disable-shared ; same error

  • approach 2
mkdir build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=$git_clone_D/toolchains/generic/Emscripten-wasm.cmake -DEMSCRIPTEN_PREFIX=$git_clone_D/emsdk/emscripten/1.38.6

fails:

CMake Error at CMakeLists.txt:38 (math):
  math cannot parse the expression: " * 8": syntax error, unexpected
  exp_TIMES, expecting exp_PLUS or exp_MINUS or exp_OPENPARENT or exp_NUMBER
  (2)


CMake Error at CMakeLists.txt:39 (string):
  string no output variable specified

CMake Warning at simd/CMakeLists.txt:5 (message):
  SIMD extensions not available for this CPU ().  Performance will suffer.
Call Stack (most recent call first):
  simd/CMakeLists.txt:349 (simd_fail)

Any help would be appreciated!

EDIT

with libjpeg instead of libjpeg-turbo, this all works fine:

git clone https://github.com/LuaDist/libjpeg
cd libjpeg
mkdir build && cd build
emconfigure ../configure
# gives: `checking dynamic linker characteristics... ERROR:root:no input files` but still works fine
emmake make
@timotheecour changed the title from "webassembly output" to "trying to adapt build to produce webassembly output" on Jun 25, 2018
@dcommander
Member

This will definitely not be supported with the 1.5.x branch. I'll look into it with the 2.0.x branch.

@timotheecour
Author

/cc @dcommander please see my last EDIT above, thanks!

@dcommander
Member

libjpeg is a completely different product with a completely different build system. All indications are that emconfigure failed with libjpeg-turbo 1.5.x because of the SIMD extensions (i.e. because some of the source files have an extension of .asm.) The CMake failure appears to be due to CMAKE_SIZEOF_VOID_P being set incorrectly (or not at all) in a WASM environment, but getting past that problem may reveal a similar issue with the SIMD extensions.

The long and short of it: I'm going to have to reproduce this on my own machine and diagnose/debug it. Be patient.

@dcommander
Member

I can successfully build libjpeg-turbo 2.0 beta using

emcmake cmake -G"Unix Makefiles" -DCMAKE_BUILD_TYPE=Release {source_directory}

However, it complains that the SIMD objects are not valid LLVM bitcode, then gives warnings about unresolved SIMD symbols. At the moment, it's probably necessary to add -DWITH_SIMD=0 to the command line above in order to disable the SIMD extensions, until I can figure out how to properly convert those objects to bitcode. If anyone knows how, please comment below. I couldn't find any information online.
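
For reference, the scalar-only build described above could be scripted roughly as follows. This is a minimal sketch, not a supported recipe: the function name `build_wasm_scalar` and its two arguments are hypothetical, and it assumes the Emscripten SDK is activated so that `emcmake` and `emmake` are on the PATH.

```shell
# Sketch: configure and build libjpeg-turbo for WebAssembly without SIMD.
# build_wasm_scalar is a hypothetical helper name, not part of any project.
# Usage: build_wasm_scalar <absolute_source_directory> <build_directory>
build_wasm_scalar() {
    src_dir=$1
    build_dir=$2

    mkdir -p "$build_dir" || return 1
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$build_dir" || exit 1
        # -DWITH_SIMD=0 disables the SIMD extensions, whose objects are
        # rejected as "not valid LLVM bitcode" in the build described above.
        emcmake cmake -G"Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
            -DWITH_SIMD=0 "$src_dir" || exit 1
        emmake make
    )
}
```

The source directory should be an absolute path, because the function changes into the build directory before invoking CMake.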

@timotheecour
Author

@dcommander Sorry for the noob question, but will building with -DWITH_SIMD=0 result in the same speed as libjpeg, or will libjpeg-turbo still provide a benefit?

@dcommander
Member

It will still be faster than libjpeg, but only by about 30% or so. Definitely not the 2-6x speedup that you would get with the SIMD extensions.

@dcommander
Member

I am not proposing that as a long-term solution. We need to figure out how to enable the SIMD extensions with this. I just don't know how at the moment.

@dcommander
Member

Some further notes on building and testing libjpeg-turbo 2.0 using Emscripten:

I just pushed a commit that preserves the value of CMAKE_EXECUTABLE_SUFFIX from the command line. Thus, you can do the following in order to build standalone HTML files for the various test programs:

cd {build_directory}
export LDFLAGS="--emrun"
emcmake cmake -G"Unix Makefiles" -DCMAKE_EXECUTABLE_SUFFIX=.html \
	-DWITH_SIMD=0 -DENABLE_SHARED=0 \
	-DCMAKE_C_FLAGS="-Wall -s ALLOW_MEMORY_GROWTH=1" \
	{source_directory}

You can also embed test images by using the --no-heap-copy and --preload-file linker flags, e.g.:

export LDFLAGS="--emrun --no-heap-copy --preload-file $HOME/test_images/artificial.ppm@artificial.ppm"

Then you can do:

emrun --serve_after_exit --serve_after_close --hostname `hostname` --no_browser tjbench-static.html

and connect to http://{hostname}:6931/tjbench-static.html?artificial.ppm&95&-rgb&-quiet in order to benchmark the performance of libjpeg-turbo/WASM in your particular browser. What I observe on my Mac (3.0 GHz Core i7), compared to a 32-bit native build without SIMD extensions:

  • Firefox Quantum 61.0.1:
    WASM version is ~15% slower than native for compression, ~30% slower than native for decompression
  • Chrome 67.0.3396.99:
    WASM version is ~45-50% slower than native for both compression and decompression

Note also that the 32-bit native version is already slower than the 64-bit native version.
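
The benchmark URL above carries the tjbench arguments as `&`-separated query parameters. A small sketch of how such a URL could be assembled for other images or settings follows; `tjbench_url` is a hypothetical helper, the port 6931 is simply the one used in the URL above, and no URL-encoding is attempted, so it only suits simple arguments like the ones shown.

```shell
# Sketch: build the emrun benchmark URL from a host name and tjbench arguments.
# tjbench_url is a hypothetical helper name.
# Usage: tjbench_url <hostname> <arg> [<arg> ...]
tjbench_url() {
    host=$1
    shift
    query=
    for arg in "$@"; do
        # Join the arguments with '&'; the first one directly follows the '?'.
        if [ -z "$query" ]; then
            query=$arg
        else
            query="$query&$arg"
        fi
    done
    printf 'http://%s:6931/tjbench-static.html?%s\n' "$host" "$query"
}

# Reproduces the shape of the URL quoted above for a local host.
tjbench_url localhost artificial.ppm 95 -rgb -quiet
```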

It appears that getting SIMD acceleration to work will require porting the existing SSE2 NASM code to intrinsics, because Emscripten knows how to compile SSE2 intrinsics into bitcode. I have no idea how the resulting code will perform. It depends on how much overhead there is to the Emscripten intrinsics implementation. The good news is that I have funding to look into this, in the context of building a WASM version of the TurboVNC Viewer.

@timotheecour
Author

thanks for the update and glad to hear getting SIMD acceleration is on your roadmap!

@mayeut
Contributor

mayeut commented Jul 20, 2018

It seems that SIMD is not yet supported in WebAssembly (see https://webassembly.org/docs/future-features/ and emscripten-core/emscripten#6445).
It is, however, supported by Emscripten when targeting JavaScript (i.e. not WASM): https://kripken.github.io/emscripten-site/docs/porting/simd.html

@dcommander
Member

The third link you posted says:

"Emscripten has full support for compiling code that utilizes the SSE1, SSE2, SSE3 and SSSE3 intrinsic function calls."

Am I missing something?

@mayeut
Contributor

mayeut commented Jul 20, 2018

It's only when targeting pure JavaScript, not WebAssembly.

@dcommander
Member

Closing for now, since it doesn't appear that we'll have SIMD support in WASM anytime soon, and the build works fine otherwise -- as long as you pass -DWITH_SIMD=0. I'll continue to track this, but if someone else finds out before I do that SIMD support is available in WASM, please post an update here.

@dtig

dtig commented May 15, 2019

I happened to come across this issue. Wasm SIMD has experimental support in Emscripten, as well as in Chrome behind a flag. You can try it out using the latest-upstream version of Emscripten, passing SIMD=1 at both compile and link time, and by enabling SIMD in Chrome, either by turning on the WebAssemblySimd flag or by passing --js-flags="--experimental-wasm-simd" on the command line. This is not on by default, but it is ready to be experimented with as of Chrome 75+.

@dcommander
Member

How does WASM handle SIMD? Does it require specific instructions or intrinsics? I assume it can't just parse raw assembly code, which is what libjpeg-turbo currently uses.

@dtig

dtig commented May 15, 2019

The WebAssembly SIMD proposal introduces a new set of SIMD types/operations. This is still in Phase 1, so not a part of the official standard yet, but we have an experimental implementation, and are gathering feedback in the form of early performance numbers.

@dcommander reopened this on Feb 23, 2021
@dcommander changed the title from "trying to adapt build to produce webassembly output" to "WebAssembly SIMD implementation for libjpeg-turbo" on Feb 23, 2021
@kleisauke
Contributor

The WebAssembly SIMD proposal reached phase 4 last week, which means that SIMD will be shipped within the Wasm engines.

> It appears that getting SIMD acceleration to work will require porting the existing SSE2 NASM code to intrinsics, because Emscripten knows how to compile SSE2 intrinsics into bitcode. I have no idea how the resulting code will perform. It depends on how much overhead there is to the Emscripten intrinsics implementation.

Porting the x86 SSE NASM code to an intrinsics implementation does indeed seem to be the best step forward. Emscripten provides several SSE compat headers, whose functions are usually direct mappings to a native Wasm SIMD opcode. Where no such opcode exists, the function is emulated with at most a few other Wasm SIMD instructions or, in the worst case, with a scalarized fallback.
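
As a rough illustration of the SSE compat headers, the sketch below writes a tiny SSE2-intrinsics program and compiles it with Emscripten. The helper name `compile_sse2_demo` and the file names are hypothetical; the `-msimd128` and `-msse2` options are the ones Emscripten documents for enabling Wasm SIMD and its SSE2 header. It assumes `emcc` is on the PATH and says nothing about how libjpeg-turbo itself would be ported.

```shell
# Sketch: compile a tiny SSE2-intrinsics program to Wasm SIMD with Emscripten.
# compile_sse2_demo and the file names are hypothetical.
# Usage: compile_sse2_demo <work_directory>
compile_sse2_demo() {
    work_dir=$1
    mkdir -p "$work_dir" || return 1

    # A minimal C program that adds two vectors of eight 16-bit integers.
    cat > "$work_dir/sse2_demo.c" <<'EOF'
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
  short out[8];
  __m128i a = _mm_set1_epi16(100);
  __m128i b = _mm_set1_epi16(23);

  _mm_storeu_si128((__m128i *)out, _mm_add_epi16(a, b));
  printf("%d\n", out[0]);  /* 100 + 23 */
  return 0;
}
EOF

    # -msimd128 enables Wasm SIMD code generation; -msse2 enables
    # Emscripten's SSE2 compatibility header (emmintrin.h).
    emcc -O2 -msimd128 -msse2 "$work_dir/sse2_demo.c" -o "$work_dir/sse2_demo.js"
}
```

The resulting sse2_demo.js would need a JavaScript engine with Wasm SIMD enabled in order to run.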

Alternatively, one could try to compile the recently added Arm Neon intrinsics to Wasm SIMD. However, this requires the implementation of some functions within the SIMDe project, see:
simd-everywhere/simde#646
kleisauke/wasm-vips@28bdd59
(note that this is an experiment, I'm not sure how this will perform)

@dcommander
Member

Thanks for the info. Given how difficult it is to optimize x86 SIMD code (primarily because of the lack of available registers), I think the path of least resistance here is to develop a whole new SIMD implementation just for WASM. That would allow us to use the most optimal operations available under WASM without having to worry about how (or even whether) the SSE2 intrinsics are being implemented. It would also avoid messing with the existing SSE2 implementation.

The NEON intrinsics overhaul in libjpeg-turbo 2.1 was painful, because GCC (as opposed to Clang) has an incomplete and/or suboptimal implementation of NEON intrinsics. Thus, I had to figure out which algorithms, when implemented using NEON intrinsics, regressed in performance under GCC and which ones didn't. Then I had to figure out how to auto-detect the likelihood of performance regression (based on whether a few key NEON intrinsics were implemented in the compiler) and fall back to the old GAS implementations of those algorithms if necessary. That pain was justifiable, because the old NEON GAS implementation required GCC or Clang (no Windows support except with MinGW), and older versions of Clang couldn't build the GAS code without gas-preprocessor. With x86, however, NASM is universally available and easy to install on all x86 operating systems, so there isn't any compelling reason to switch to intrinsics, and doing so would likely be a lot more painful than the NEON intrinsics overhaul.

@ngzhian

ngzhian commented Oct 2, 2021

Hi, I work on the WebAssembly SIMD proposal and also the implementation in V8.

> Emscripten provides several SSE compat headers which are usually direct mappings to a native Wasm SIMD opcode.

Emscripten has practically full support for *mmintrin.h, so many projects that use intrinsics compile to Wasm SIMD.

> Alternatively, one could try to compile the recently added Arm Neon intrinsics to Wasm SIMD.

This has been slowly worked on, and I'm happy to say that some projects now compile fine, such as Simd (the image processing library) and libwebp.

EDIT: I just found simd-everywhere/simde#646 (comment) so this is not a new result :) But I have some benchmark numbers based on tjbench.

I also ported libjpeg-turbo to Wasm SIMD via the NEON intrinsics. If you compare the change, it's fairly small:

  1. Force CMAKE_SYSTEM_PROCESSOR to be aarch64 when compiling with Emscripten. This makes libjpeg-turbo think we are compiling for aarch64, so it performs the NEON intrinsics check.
  2. Copy the test images into the build directory so that tjbench can be run with them.
  3. Build with -DNEON_INTRINSICS=1 to force the use of intrinsics, since Emscripten has no support for native assembly. In effect, this overrides this check.
  4. Add a quick hack to work around inline assembly.
  5. Add a helper script that builds the various configurations and runs the benchmarks.

Things compile and run fine, but the speedup is not amazing. This is a somewhat known problem: the NEON intrinsics support in SIMDe is not heavily optimized yet, and experiments like this can help uncover areas for improvement. The other issue is that I am running this on an x86_64 build of d8, so there are multiple levels of translation here: NEON -> SIMDe (which could produce scalarized code or Wasm intrinsics) -> x86.

I have some preliminary numbers to share; see https://docs.google.com/spreadsheets/d/1g8eh_DJKemTsyJGyE3I8LwPMDCr8tb_-9LpZvMxEHwU/edit#gid=0 (I can't make it public for various reasons, so please request access and I will grant it. Sorry for the trouble).

An interesting thing (which could be a separate bug) is that I'm not seeing performance differences between scalar and SIMD on native builds. I'm just using -DWITH_SIMD to affect the builds, so maybe I'm doing something wrong there.

@dcommander
Member

dcommander commented Oct 2, 2021

@ngzhian That all sounds promising. Do you think that an SSE2 intrinsics implementation of the libjpeg-turbo SIMD extensions would have better performance under WASM? If I understand your comment, it seems as if using x86 intrinsics would require fewer layers of translation.

I’m not sure why you are not seeing a performance difference with WITH_SIMD=1 vs. WITH_SIMD=0. You might try setting REQUIRE_SIMD=1 when you set WITH_SIMD=1. That will cause CMake to fail if the libjpeg-turbo build system cannot enable the SIMD extensions for any reason (for instance, if it can’t find NASM.) Without REQUIRE_SIMD, the build system will silently fall back to WITH_SIMD=0 if it cannot enable the SIMD extensions.
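
A configure step along the lines described above might look like the following sketch for a native build. The helper name `configure_with_required_simd` is hypothetical; WITH_SIMD and REQUIRE_SIMD are the CMake variables named in this comment.

```shell
# Sketch: configure a native build so that unavailable SIMD is a hard error.
# configure_with_required_simd is a hypothetical helper name.
# Usage: configure_with_required_simd <absolute_source_directory> <build_directory>
configure_with_required_simd() {
    src_dir=$1
    build_dir=$2

    mkdir -p "$build_dir" || return 1
    (
        cd "$build_dir" || exit 1
        # With REQUIRE_SIMD=1, CMake fails instead of silently falling back
        # to WITH_SIMD=0 (for example, when NASM cannot be found).
        cmake -G"Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
            -DWITH_SIMD=1 -DREQUIRE_SIMD=1 "$src_dir"
    )
}
```

The function returns CMake's exit status, so a calling benchmark script can stop before it measures a build that has no SIMD extensions.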

@dcommander
Member

A note about funding:

This effort was initially funded by one of my clients that was interested in a WASM port of TurboVNC, but unfortunately they have decided to go in a different direction. Thus, there is no longer any direct funding to pursue a WASM port of libjpeg-turbo. If an organization reading this would like to fund this effort, please contact me. Otherwise my ability to work on this feature will be limited.

@ngzhian

ngzhian commented Oct 4, 2021

> Do you think that an SSE2 intrinsics implementation of the libjpeg-turbo SIMD extensions would have better performance under WASM?

Most likely yes. I did an earlier experiment with libwebp (see https://docs.google.com/document/d/1gm2GOZV6yUj31BAT9P83V_GoqJSyGi0NLVfXzjG2AuQ/edit?usp=sharing and request access, which I will grant), and the NEON headers were slower than the SSE headers when I ran the result on x64 V8. There are two main reasons. One is the multiple layers of translation mentioned above (unfortunately, I don't have an arm64 system readily available for testing). The bigger reason is probably the mismatch between commonly used NEON instructions and the available Wasm instructions, such as load/store of multiple elements. Those don't map well to Wasm and end up being scalarized.

> You might try setting REQUIRE_SIMD=1 when you set WITH_SIMD=1

I bet I made a mistake somehow, let me try this again and report back, thanks!

And thank you for your honest view on the funding situation, and being responsive on the issue tracker :)

@ngzhian

ngzhian commented Oct 4, 2021

It turns out there were issues with how I was running the test. I am now seeing differences between native SIMD and no SIMD (from 23% to 211% speedup).
To summarize results from the spreadsheet:

  • speedup of native SIMD over native no SIMD is in the range 23% to 211%
  • speedup/slowdown of Wasm SIMD over native SIMD is in the range -47% to -75%
  • speedup/slowdown of Wasm SIMD over Wasm no SIMD is in the range -4.53% to 18.03%
  • speedup/slowdown of Wasm no SIMD over native no SIMD is in the range -27% to -41%

It looks like Wasm SIMD via the Neon intrinsics is not giving us much of a performance win. I probably need to look more closely at what the issue is.

@dcommander
Member

A 23-211% speedup is still quite low for libjpeg-turbo. The x86-64 SIMD extensions in libjpeg-turbo should boost performance by roughly 200-700%. Perhaps your tests are measuring other operations, such as I/O or pixel format conversion, that limit the measured speedup of libjpeg-turbo because of Amdahl's Law.
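
To illustrate the Amdahl's Law point with a worked example: suppose a fraction p of the benchmark's run time is spent in code that the SIMD extensions accelerate by a factor s. Both values below are illustrative assumptions, not measurements from this thread.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\qquad\text{e.g. } p = 0.8,\ s = 6:\quad
S_{\text{overall}} = \frac{1}{0.2 + \dfrac{0.8}{6}} = \frac{1}{0.3333\ldots} = 3
```

Under those assumptions, a 6x speedup of the accelerated code yields only a 3x overall speedup, because the remaining 20% of the run time is unchanged.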

@ngzhian

ngzhian commented Oct 4, 2021

I am running the benchmarks like so: ./tjbench ../testimages/vgl_5674_0098.bmp 95 -rgb -qq -nowrite -warmup 10 and the data in the spreadsheet is just output from this command.

@dcommander
Member

Ah, OK. That particular image is a bit of a corner case. I include it in order to get a sense of the library's performance with images that contain a lot of sharp lines, but some of the other images may be more representative of the library's overall performance. If I have to test just one image, I generally choose artificial.ppm.

@ngzhian

ngzhian commented Oct 4, 2021

Okay! I will rerun this with artificial.ppm, any other images from https://github.com/libjpeg-turbo/libjpeg-turbo/tree/main/testimages you would suggest? I would like to run with at least 3 images.

@ngzhian

ngzhian commented Oct 4, 2021

With artificial.ppm, I'm seeing Wasm SIMD not being any faster than the scalar Wasm build :( Very strange.

Edit: Hm, it seems like I'm missing JSIMD_FORCENEON.

@dcommander
Member

You should not use the images under the testimages/ directory in the source tree, as those are very small and are intended only for regression testing. Use artificial.ppm, nightshot_iso_100.ppm, and vgl_5674_0098.ppm available at https://libjpeg-turbo.org/About/Performance.
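
A benchmark run over those three images could be scripted roughly as follows. This is only a sketch: `run_tjbench_set` and the `TJBENCH` variable are hypothetical names, the tjbench arguments are the ones quoted earlier in this thread, and the images are assumed to have been downloaded into a single directory already.

```shell
# Sketch: run tjbench over the three recommended images with identical settings.
# run_tjbench_set and the TJBENCH variable are hypothetical names.
# Usage: run_tjbench_set <image_directory>
run_tjbench_set() {
    image_dir=$1
    # Allow the tjbench binary to be overridden; default to the build directory.
    tjbench=${TJBENCH:-./tjbench}

    for image in artificial.ppm nightshot_iso_100.ppm vgl_5674_0098.ppm; do
        if [ ! -f "$image_dir/$image" ]; then
            echo "missing image: $image_dir/$image" >&2
            return 1
        fi
        "$tjbench" "$image_dir/$image" 95 -rgb -qq -nowrite -warmup 10 || return 1
    done
}
```

Using the same arguments for every image and every build keeps the native and Wasm numbers comparable.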

@ngzhian

ngzhian commented Oct 5, 2021

Trying to force-enable NEON with JSIMD_FORCENEON doesn't seem to work well: the runs are even slower and I get some corruption:

d8 --prof tjbench.js -- testimages/artificial.ppm 95 -rgb -qq -nowrite -subsamp G

12.14
24.13
WARNING in line 216 while executing tjDecompress2():
Corrupt JPEG data: premature end of data segment
74.91

This is really weird. I also tried using the changes in kleisauke/wasm-vips@acd4c81 instead of my own, in case I missed anything, and I get the same issues.

(Anyway, no response necessary; I'm just keeping a log of things I tried and problems I faced.)

@Honya2000

I'm sorry, but I still cannot use CMake to configure a WASM build, even with SIMD disabled.
I'm using the Ninja build system, because it is the only way I know to build WASM on a Windows host.

When I run cmake, I get this error:
CMake Error at CMakeLists.txt:62 (math):
math cannot parse the expression: " * 8": syntax error, unexpected
exp_TIMES (2).

I see that this question has been open for 3 years, but I didn't find any answer to it.

Any help will be appreciated.

P.S. Is there ready-to-use code for the fastest JPEG decoding using Emscripten SIMD?

@ngzhian

ngzhian commented Oct 15, 2021

> I'm sorry, but I still cannot use CMake to configure a WASM build, even with SIMD disabled. I'm using the Ninja build system, because it is the only way I know to build WASM on a Windows host.
>
> When I run cmake, I get this error: CMake Error at CMakeLists.txt:62 (math): math cannot parse the expression: " * 8": syntax error, unexpected exp_TIMES (2).
>
> I see that this question has been open for 3 years, but I didn't find any answer to it.
>
> Any help will be appreciated.
>
> P.S. Is there ready-to-use code for the fastest JPEG decoding using Emscripten SIMD?

Can you show the exact commands you are using?
I have had success using this:

LDFLAGS='-sENVIRONMENT=shell -sINITIAL_MEMORY=100MB' emcmake cmake -S . -B build-wasm-scalar -G Ninja -DWITH_SIMD=0

This builds all the js and wasm files inside of the folder build-wasm-scalar. Note that I am also using Ninja, and this disables SIMD.

The important thing is to use emcmake. The LDFLAGS part is optional; those flags only matter when running tests using V8.

@dcommander
Member

@Honya2000 This issue is a feature request, not a technical support question. It tracks the potential for a WebAssembly SIMD implementation in libjpeg-turbo. That implementation does not yet exist. WebAssembly is not officially supported by The libjpeg-turbo Project yet, and until it is, you are responsible for figuring out how to make it work. I do not guarantee that it works at all. The best I can do at the moment is to show you how I personally build libjpeg-turbo using Emscripten on Linux. I have not tried to do likewise on Windows.

export LDFLAGS="--emrun"

emcmake cmake -G"Unix Makefiles" \
        -DCMAKE_EXECUTABLE_SUFFIX=.html \
        -DWITH_SIMD=0 -DENABLE_SHARED=0 \
        -DCMAKE_C_FLAGS="-Wall -s ALLOW_MEMORY_GROWTH=1" \
        {source_directory} ${1+"$@"}

@Honya2000

I was able to compile with emcmake.
Unfortunately, it didn't provide any performance boost, I think because of the missing SIMD support.
I will try to decode in JS code asynchronously.

@Ono-Sendai

Just a note to say I would like to see SIMD support for libjpeg-turbo in Emscripten.
Note that WASM does support SIMD instructions.
If libjpeg-turbo has a code path with intrinsics it should already 'just work'. Or perhaps it already does? (Haven't looked at the WASM disassembly).

@Ono-Sendai

Ono-Sendai commented Apr 8, 2024

Some benchmarks, loading an 859 x 1186 pixel JPEG.

native
------
Time to load anewSquares_JPG_17115529236124618104.JPG: 3.538 ms.
Time to load anewSquares_JPG_17115529236124618104.JPG: 3.529 ms.
Time to load anewSquares_JPG_17115529236124618104.JPG: 3.537 ms.

web (Emscripten / WASM), built with -DWITH_SIMD=0
---
Time to load anewSquares_JPG_17115529236124618104.JPG: 15.00 ms.
Time to load anewSquares_JPG_17115529236124618104.JPG: 15.07 ms.
Time to load anewSquares_JPG_17115529236124618104.JPG: 14.97 ms.


(4.24x slower)

Benchmarking PNG decoding, I have found WASM to be about 2.5x slower than native. So I expect we would see some performance increases with SIMD support in Emscripten libjpeg-turbo.

@dcommander
Member

@Ono-Sendai Refer to the comments above. You can already use libjpeg-turbo's Neon SIMD intrinsics with WASM, but apparently the performance is not spectacular. (I have not tested this myself.) Better performance may be possible with x86-64 SIMD intrinsics (#732), but that needs funding. I used to have specific funding to look into libjpeg-turbo/WASM, but the company providing that funding no longer needs this feature. Thus, it hasn't been a high priority. Any new SIMD implementation in libjpeg-turbo will be an expensive proposition, and any other solution to this problem would require using one of our existing SIMD implementations. Thus, the solution is either mostly dependent on downstream work in Emscripten, or it is mostly dependent on upstream funding to extend libjpeg-turbo.

@Ono-Sendai

I tried the aarch64 hack mentioned in #250 (comment)
It seems to reduce performance if anything:

Time to load anewSquares_JPG_17115529236124618104.JPG: 15.60 ms

I hope you get funding for this work (I fund you $5 a month which probably doesn't cover it :) )

Presumably there is some way of compiling your hand-written assembly code into WASM.
