New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AVX512 support #640
base: main
Are you sure you want to change the base?
Add AVX512 support #640
Conversation
Unfortunately the nature of this project and its funding model is such that it doesn't allow for much fine-grained collaboration. libjpeg-turbo has a general fund that pays for a day or two of my labor per month to do general maintenance and bug fixing, but that doesn't give me much flexibility to look at major new features unless I can secure specific funding from some benevolent corporation or other organization. (Referring to https://libjpeg-turbo.org/About/Sponsors, most of the major features that have been added to libjpeg-turbo, and all of them that have been added in the last four years, have been funded in that way.) That funded development model, which capitalizes on my unique experience with the open source projects I founded and maintain, is the only way I can survive as an independent open source developer. It has been my experience that, even if contributed SIMD extensions are reasonably well-written, it takes at least 30 hours for me to test, benchmark, clean up, and integrate a full set of them. The most efficient way for me to do that is all at once, so I don't generally review WIPs. It won't take 30 hours in this case, since you aren't implementing a full set of SIMD extensions, but I would still prefer to wait and do the review once all are implemented. My suggestions/comments/general notes regarding how to approach this:
Other than that, you have identified the algorithms that are most likely to benefit from AVX512. Just make sure that each individual SIMD extension can justify itself in terms of a meaningful overall speedup. If the contribution that any one SIMD extension makes to the overall performance is "in the noise", then it isn't worth increasing the complexity of the code. Also, the overall speedup of the entire AVX512 SIMD implementation, relative to the AVX2 implementation, needs to be > 5% and preferably > 10%, or it's not worth including. |
Hey, thanks a lot for your very in depth answer and informations!
Alright, i removed any i386 specific bits 👍
Yes, I'll rewrite history to be meaningful. Having each algorithm in a separate commit is a good idea, thanks. Overall I'm still very new to assembly (usually i use intrinsics when i need them), so i might need some time to complete this. However I'll carefully benchmark & validate all ported algorithms, so you don't have to spend time testing/ debugging broken code. Since I'm on Zen4 it might also be hard to benchmark, since it uses a "double pumped" approach for AVX512 (each instruction needing atleast 2 cycles). With intrinsics in other projects i was able to measure a performance-gain though, so I'm carefully optimistic. |
506b8df
to
a0db679
Compare
Sounds good. I am hip-deep in the 2.2 release and other projects anyhow, so I won't have time to review this for at least a few months. Take your time. |
@dcommander AppVeyor fails since it uses NASM 2.10.1 which doesn't know anything about AVX512. Can i upgrade this as part of this PR or is AppVeyor obsolete due to GitHub Actions anyways? |
AppVeyor is still used for Windows builds. Please update appveyor.yml so that it uses NASM 2.13.03. Unfortunately it appears as if Yasm doesn't yet support AVX-512 instructions, so the version of libjpeg-turbo in which this feature lands will need to disclaim support for Yasm and require NASM 2.14 or later, since 2.14 is needed for macOS. |
4b04b54
to
1016c81
Compare
Hey @dcommander, I've implemented the most part of the forward DCT by now and i'm pretty much same speed as AVX2. I did some benchmarking and profiling (with UART simulators and AMD uProf) and also got some input from friends of mine. for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
/* Load data into workspace, applying unsigned->signed conversion */
(*do_convsamp) (sample_data, start_col, workspace);
/* Perform the DCT */
(*do_dct) (workspace);
/* Quantize/descale the coefficients, and store into coef_blocks[] */
(*do_quantize) (coef_blocks[bi], divisors, workspace);
} I think it would be benefical to process multiple blocks at once inside the SIMD part - what's your opinion about this? Another question that occured to me after digging deeper into the code: GLOBAL(void)
jsimd_convsamp(JSAMPARRAY sample_data, JDIMENSION start_col,
DCTELEM *workspace)
{
if (simd_support & JSIMD_AVX512)
jsimd_convsamp_avx512(sample_data, start_col, workspace);
else if (simd_support & JSIMD_AVX2)
jsimd_convsamp_avx2(sample_data, start_col, workspace);
else
jsimd_convsamp_sse2(sample_data, start_col, workspace);
} I'm not too god with ABI & API compatibility, so i'm not sure which changes are possible without breaking backwards compatibility. |
1016c81
to
42537c7
Compare
The contractors whom Intel hired to do the initial AVX2 implementation for libjpeg-turbo tried that approach for exactly the same reason you did. However, it was too disruptive to the libjpeg architecture, so I had to reject that portion of their patch. In general, the libjpeg API is outdated, brittle, and needlessly complicated, because it had to run on late 1980s/early 1990s machines. Those machines were often limited to 640k of directly accessible memory and often took minutes to compress/decompress a 1-megapixel image, so the libjpeg architecture is very fine-grained. It was designed so that application programmers could process JPEG images scanline-by-scanline, thus preventing the JPEG compression/decompression task from monopolizing the CPU and adversely affecting the responsiveness of the system. (Bear in mind that Windows 3.x used cooperative multitasking, so an application had to yield control-- usually implicitly by calling Unfortunately, many years ago I predicted that these architectural limitations would limit the speedup from AVX2 (which they did) and possibly prevent AVX-512 from being worthwhile at all except for accelerating 12-bit precision (#199.)
The main reason why that isn't done is the fact that it's difficult to access structure fields from raw assembly code, so the SIMD dispatcher dumbs down the arguments and feeds only the relevant integer values and pointers into the assembly routines. Changing that interface would require changing all existing SIMD extensions, and I have insufficient funding and even less desire to pursue that kind of overhaul. I doubt it would provide any noticeable speedup anyhow. I think that, unfortunately, the AVX-512 SIMD extensions will have to justify themselves based solely on their ability to accelerate libjpeg-turbo without modifying the underlying libjpeg or SIMD interfaces. Even under ideal circumstances, deep modifications to the libjpeg architecture are challenging. There are so many touch points that it is impossible to test all of them, and the last time we attempted to make such deep changes (to implement partial image decompression), it took years to eliminate the bugs. In order to tackle such a project, I would need to work very closely with a paying customer. |
42537c7
to
56755e5
Compare
Co-authored-by: Robert Clausecker <fuzxxl@gmail.com>
56755e5
to
c221126
Compare
This is a draft PR to track my progress of AVX512 support and get early opinions on the changes. What follows is a list of algorithms supported by AVX512 so far:
The other thing I'm missing is AVX512 detection in
jpeg_simd_cpu_support