
Add AVX512 support #640

Draft
feliwir wants to merge 5 commits into main from avx512-support

Conversation

@feliwir commented Jan 12, 2023

This is a draft PR to track my progress of AVX512 support and get early opinions on the changes. What follows is a list of algorithms supported by AVX512 so far:

  • jsimd_rgb_ycc_convert_avx512
  • jsimd_rgb_gray_convert_avx512
  • jsimd_h2v1_downsample_avx512
  • jsimd_h2v2_downsample_avx512
  • jsimd_ycc_rgb_convert_avx512
  • jsimd_h2v1_merged_upsample_avx512
  • jsimd_h2v2_merged_upsample_avx512
  • jsimd_h2v1_fancy_upsample_avx512
  • jsimd_h2v2_fancy_upsample_avx512
  • jsimd_fdct_islow_avx512
  • jsimd_idct_islow_avx512
  • jsimd_quantize_avx512
  • jsimd_convsamp_avx512

The other thing I'm still missing is AVX512 detection in jpeg_simd_cpu_support.

@dcommander (Member)

Unfortunately, the nature of this project and its funding model is such that it doesn't allow for much fine-grained collaboration. libjpeg-turbo has a general fund that pays for a day or two of my labor per month to do general maintenance and bug fixing, but that doesn't give me much flexibility to look at major new features unless I can secure specific funding from some benevolent corporation or other organization. (Referring to https://libjpeg-turbo.org/About/Sponsors, most of the major features that have been added to libjpeg-turbo, and all of them that have been added in the last four years, have been funded in that way.) That funded development model, which capitalizes on my unique experience with the open source projects I founded and maintain, is the only way I can survive as an independent open source developer.

It has been my experience that, even if contributed SIMD extensions are reasonably well-written, it takes at least 30 hours for me to test, benchmark, clean up, and integrate a full set of them. The most efficient way for me to do that is all at once, so I don't generally review WIPs. It won't take 30 hours in this case, since you aren't implementing a full set of SIMD extensions, but I would still prefer to wait and do the review once all are implemented.

My suggestions/comments/general notes regarding how to approach this:

  1. Don't bother with i386. At this point, it's a legacy architecture. You seem to have modified the i386 SIMD stub to call your new jsimd_convsamp_avx512() and jsimd_quantize_avx512() functions, but you didn't provide an i386 implementation of those functions. That's why the Appveyor build is failing.
  2. Add your copyright attribution to the header of any file that is meaningfully modified (modified functionally or informationally, as opposed to purely cosmetically.)
  3. Isolate each SIMD extension (each individual algorithm, such as sample conversion, quantization, color conversion, etc.) that you are adding, and use the methodology described here to compare the overall performance with and without the new SIMD extension. Here is an OpenOffice template that I use for tracking the results:
    libjpegturbo-fullsimd-template.ods. I generally use a standalone Linux machine for benchmarking, and I kill all unnecessary processes, disable any automatic CPU speed features, and disable all but one CPU in order to make the performance as consistent as possible. If I am benchmarking a compression algorithm, then I expect the decompression speedup (which should ideally be 0%) to fall within +/- 1%. That gives me confidence that any compression speedup is legitimate and has as little noise as possible. Similarly, when benchmarking a decompression algorithm, I expect the compression speedup to fall within +/- 1%. I generally do a full benchmark run first and then re-run the individual tests, if any, that fall outside of the expected range. (The -subsamp argument to tjbench is your friend.)
  4. Each SIMD extension (algorithm) should be implemented in a separate commit.
  5. You might prove me wrong, but I suspect that it won't be possible to meaningfully speed up the DCT/IDCT algorithms. Those algorithms deal with 8x8 blocks of 16-bit coefficients, so they can only exploit data parallelism within each 128-bit row. I was able to speed up the "accurate"/"islow" ("islow" is a misnomer these days) DCT/IDCT algorithms using AVX2 instructions only because those algorithms use 32-bit intermediate values. (The "ifast" algorithm, which doesn't use 32-bit intermediates, couldn't be accelerated with AVX2 relative to SSE2.) Since there are interdependencies among the row computations, I don't see any obvious way to exploit AVX512. I would save this algorithm for last, since it will be the most difficult to implement and will have the least likelihood of success.

Other than that, you have identified the algorithms that are most likely to benefit from AVX512. Just make sure that each individual SIMD extension can justify itself in terms of a meaningful overall speedup. If the contribution that any one SIMD extension makes to the overall performance is "in the noise", then it isn't worth increasing the complexity of the code. Also, the overall speedup of the entire AVX512 SIMD implementation, relative to the AVX2 implementation, needs to be > 5% and preferably > 10%, or it's not worth including.

@feliwir (Author) commented Jan 12, 2023

Hey, thanks a lot for your very in-depth answer and information!
I totally understand that it's not viable for you to review every single commit in this PR. As long as it is in a "draft" state, I don't expect you to review it at all. It's neither tested nor complete yet.

Don't bother with i386. At this point, it's a legacy architecture. You seem to have modified the i386 SIMD stub to call your new jsimd_convsamp_avx512() and jsimd_quantize_avx512() functions, but you didn't provide an i386 implementation of those functions. That's why the Appveyor build is failing.

Alright, I removed all i386-specific bits 👍

Each SIMD extension (algorithm) should be implemented in a separate commit.

Yes, I'll rewrite history to be meaningful. Having each algorithm in a separate commit is a good idea, thanks.

Overall I'm still very new to assembly (usually I use intrinsics when I need them), so I might need some time to complete this. However, I'll carefully benchmark and validate all ported algorithms, so you don't have to spend time testing/debugging broken code.
As you pointed out, most of the other algorithms are hard to optimize with AVX512. However, I might find some opportunities, since AVX512 also added new XMM/YMM instructions that were missing before. So we could maybe just take some of the AVX2 functions and improve them with AVX512 - I need to study the algorithms in more depth.

Since I'm on Zen4 it might also be hard to benchmark, since it uses a "double pumped" approach for AVX512 (each instruction needs at least 2 cycles). With intrinsics in other projects I was able to measure a performance gain though, so I'm cautiously optimistic.

@feliwir feliwir force-pushed the avx512-support branch 2 times, most recently from 506b8df to a0db679 Compare January 12, 2023 14:03
@dcommander (Member)

Sounds good. I am hip-deep in the 2.2 release and other projects anyhow, so I won't have time to review this for at least a few months. Take your time.

@feliwir (Author) commented Jan 12, 2023

@dcommander AppVeyor fails since it uses NASM 2.10.1, which doesn't know anything about AVX512. Can I upgrade this as part of this PR, or is AppVeyor obsolete due to GitHub Actions anyway?

@dcommander (Member)

@dcommander AppVeyor fails since it uses NASM 2.10.1, which doesn't know anything about AVX512. Can I upgrade this as part of this PR, or is AppVeyor obsolete due to GitHub Actions anyway?

AppVeyor is still used for Windows builds. Please update appveyor.yml so that it uses NASM 2.13.03.
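A hedged sketch of that appveyor.yml change (the download URL, archive layout, and surrounding file structure are assumptions about the CI setup, not taken from the actual file):

```yaml
# Hypothetical fragment -- adjust to the real appveyor.yml layout.
install:
  - curl -fSL -o nasm.zip https://www.nasm.us/pub/nasm/releasebuilds/2.13.03/win64/nasm-2.13.03-win64.zip
  - 7z x nasm.zip
  - set PATH=%CD%\nasm-2.13.03;%PATH%
  - nasm -v
```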

Unfortunately it appears as if Yasm doesn't yet support AVX-512 instructions, so the version of libjpeg-turbo in which this feature lands will need to disclaim support for Yasm and require NASM 2.14 or later, since 2.14 is needed for macOS.

@feliwir feliwir force-pushed the avx512-support branch 5 times, most recently from 4b04b54 to 1016c81 Compare January 16, 2023 20:03
@feliwir (Author) commented Jan 17, 2023

Hey @dcommander,

I've implemented most of the forward DCT by now, and it's pretty much the same speed as AVX2. I did some benchmarking and profiling (with uarch simulators and AMD uProf) and also got some input from friends of mine.
It seems like the functions (e.g. quantize) do so little work (just 2 ZMM registers used) that AVX512 can't really perform well.
The loop for the forward DCT looks like this:

  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
    /* Load data into workspace, applying unsigned->signed conversion */
    (*do_convsamp) (sample_data, start_col, workspace);

    /* Perform the DCT */
    (*do_dct) (workspace);

    /* Quantize/descale the coefficients, and store into coef_blocks[] */
    (*do_quantize) (coef_blocks[bi], divisors, workspace);
  }

I think it would be beneficial to process multiple blocks at once inside the SIMD part - what's your opinion on this?
All the little function calls throttle the overall performance.
Since this is more than a little change, I wanted to ask for your take on the problem.

Another question that occurred to me after digging deeper into the code:
We already make use of function dispatching - couldn't we just dispatch to the right implementation right away? Then this block becomes obsolete:

GLOBAL(void)
jsimd_convsamp(JSAMPARRAY sample_data, JDIMENSION start_col,
               DCTELEM *workspace)
{
  if (simd_support & JSIMD_AVX512)
    jsimd_convsamp_avx512(sample_data, start_col, workspace);
  else if (simd_support & JSIMD_AVX2)
    jsimd_convsamp_avx2(sample_data, start_col, workspace);
  else
    jsimd_convsamp_sse2(sample_data, start_col, workspace);
}

I'm not too good with ABI & API compatibility, so I'm not sure which changes are possible without breaking backwards compatibility.

@dcommander (Member)

I've implemented most of the forward DCT by now, and it's pretty much the same speed as AVX2. I did some benchmarking and profiling (with uarch simulators and AMD uProf) and also got some input from friends of mine. It seems like the functions (e.g. quantize) do so little work (just 2 ZMM registers used) that AVX512 can't really perform well. The loop for the forward DCT looks like this:

  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
    /* Load data into workspace, applying unsigned->signed conversion */
    (*do_convsamp) (sample_data, start_col, workspace);

    /* Perform the DCT */
    (*do_dct) (workspace);

    /* Quantize/descale the coefficients, and store into coef_blocks[] */
    (*do_quantize) (coef_blocks[bi], divisors, workspace);
  }

I think it would be beneficial to process multiple blocks at once inside the SIMD part - what's your opinion on this? All the little function calls throttle the overall performance. Since this is more than a little change, I wanted to ask for your take on the problem.

The contractors whom Intel hired to do the initial AVX2 implementation for libjpeg-turbo tried that approach for exactly the same reason you did. However, it was too disruptive to the libjpeg architecture, so I had to reject that portion of their patch.

In general, the libjpeg API is outdated, brittle, and needlessly complicated, because it had to run on late 1980s/early 1990s machines. Those machines were often limited to 640k of directly accessible memory and often took minutes to compress/decompress a 1-megapixel image, so the libjpeg architecture is very fine-grained. It was designed so that application programmers could process JPEG images scanline-by-scanline, thus preventing the JPEG compression/decompression task from monopolizing the CPU and adversely affecting the responsiveness of the system. (Bear in mind that Windows 3.x used cooperative multitasking, so an application had to yield control-- usually implicitly by calling GetMessage()-- before another application could get any CPU time, and of course MS-DOS had no multitasking at all.) If we were starting from scratch in 2023 and designing an API for modern machines, it would not look anything like the libjpeg API. It would be a lot more coarse-grained and thus conducive to block-level data parallelism. (See #313.)

The TurboJPEG 3 API overhaul that I'm currently working on (#517) is a step toward #313, but the ugly truth is that most applications still use the libjpeg API and will continue to do so for the foreseeable future. Because of its legacy, the libjpeg API is a de facto standard, and nothing will supplant it until another API can provide most of the same low-level control without the drawbacks of the libjpeg API. As long as we are supporting the libjpeg API, we have to support the libjpeg architecture, and that architecture is unfortunately limited to single-block processing in many cases.
The DCT methods can process multiple blocks at a time, but the higher-level logic that controls them (the DCT coefficient controller) will never feed multiple blocks into the DCT methods unless chrominance subsampling is used. (Even then, only the luminance blocks will be processed in that manner.) Thus, it would be necessary to make deep modifications to the DCT coefficient controller logic as well as the lower-level methods, and those deep modifications would make it tricky (at best) to support both the new multi-block SIMD extension architecture as well as the current single-block SIMD extension architecture.

Unfortunately, many years ago I predicted that these architectural limitations would limit the speedup from AVX2 (which they did) and possibly prevent AVX-512 from being worthwhile at all except for accelerating 12-bit precision (#199.)

Another question that occurred to me after digging deeper into the code: We already make use of function dispatching - couldn't we just dispatch to the right implementation right away? Then this block becomes obsolete:

GLOBAL(void)
jsimd_convsamp(JSAMPARRAY sample_data, JDIMENSION start_col,
               DCTELEM *workspace)
{
  if (simd_support & JSIMD_AVX512)
    jsimd_convsamp_avx512(sample_data, start_col, workspace);
  else if (simd_support & JSIMD_AVX2)
    jsimd_convsamp_avx2(sample_data, start_col, workspace);
  else
    jsimd_convsamp_sse2(sample_data, start_col, workspace);
}

I'm not too good with ABI & API compatibility, so I'm not sure which changes are possible without breaking backwards compatibility.

The main reason why that isn't done is the fact that it's difficult to access structure fields from raw assembly code, so the SIMD dispatcher dumbs down the arguments and feeds only the relevant integer values and pointers into the assembly routines. Changing that interface would require changing all existing SIMD extensions, and I have insufficient funding and even less desire to pursue that kind of overhaul. I doubt it would provide any noticeable speedup anyhow.

I think that, unfortunately, the AVX-512 SIMD extensions will have to justify themselves based solely on their ability to accelerate libjpeg-turbo without modifying the underlying libjpeg or SIMD interfaces. Even under ideal circumstances, deep modifications to the libjpeg architecture are challenging. There are so many touch points that it is impossible to test all of them, and the last time we attempted to make such deep changes (to implement partial image decompression), it took years to eliminate the bugs. In order to tackle such a project, I would need to work very closely with a paying customer.
