Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements (SIMD extensions) for 12-bit JPEGs #199

Open
dcommander opened this issue Dec 23, 2017 · 28 comments
Open

Performance improvements (SIMD extensions) for 12-bit JPEGs #199

dcommander opened this issue Dec 23, 2017 · 28 comments

Comments

@dcommander
Copy link
Member

dcommander commented Dec 23, 2017

The following are possible projects that would improve libjpeg-turbo's support for JPEG images with 12-bit coefficients:

  1. Provide a 12-bit-enabled build of libjpeg-turbo in the official libjpeg-turbo packages. The easiest way to do this would be to simply install the 12-bit SDK in a different directory (such as /opt/libjpeg-turbo/lib*/12bit/ and /opt/libjpeg-turbo/include/12bit.) However, as is currently the case, programs would not be able to easily include support for both 8-bit and 12-bit JPEGs. Solving that problem would require renaming the API functions (e.g. jpeg12_set_defaults() instead of jpeg_set_defaults()), libraries (e.g. libjpeg12.a instead of libjpeg.a), and headers (e.g. jpeglib12.h instead of jpeglib.h).
  2. Modify the TurboJPEG API so that it supports 12-bit coefficients.
  3. Modify the arithmetic codec so that it supports 12-bit coefficients.
  4. Add SIMD extensions for 12-bit coefficients. I would not even attempt this without requiring at least AVX2 instructions. That would at least allow for, conceptually, expanding the existing 128-bit SSE2 algorithms into 256-bit registers, replacing bytes with words and words with double words (although, due to the dual-lane nature of AVX2, it unfortunately isn't as simple as that.)
@feliwir
Copy link

feliwir commented Apr 12, 2018

Any updates on this?

@dcommander dcommander added this to the libjpeg-turbo 2.1 milestone Apr 12, 2018
@dcommander
Copy link
Member Author

No. It's marked "funding needed" because it needs funding. :) I do not earn a salary for my work on libjpeg-turbo, and this project has no "general fund" that can be used to pay for my time spent maintaining it. Thus, I end up spending hundreds of hours a year in unpaid labor just to keep the project alive. I absolutely cannot spend more hundreds of unpaid hours to implement new features-- particularly features that are of limited benefit to me personally.

In general, new libjpeg-turbo features can only be implemented when a company or organization that needs the feature steps forward to pay for my labor. Patch submissions/pull requests are all well and good, but for a feature this large, even the most well-written patch would require many many hours of my time to test it, clean it up, and integrate it. And most patches are unfortunately not very well-written. I mean, even as late as last month, I was still fixing bugs in code that had been integrated from a patch submission in 2011. Companies will often try to either hire contractors or use their own internal developers to modify libjpeg-turbo, but because those developers don't understand the code like I do, it ends up costing more and taking more time to do things that way than to just pay me to do the work. Even worse, some of them will submit patches that require 50 hours of my labor to test, clean up, and integrate, and yet they expect me to do that work for free.

I was only able to release libjpeg-turbo 2.0 beta because of a grant from Mozilla. That's how this project moves forward, in general. If no funding is received, then it stays in break/fix mode unless I can justify the implementation of a new feature based on its benefit to my other projects/products. I don't know of any other way to do this without going broke or compromising the quality of libjpeg-turbo. Honestly, if the right organization stepped forward with the right offer to buy out control of this project, all of my investment in it, and the opportunity cost of potential future revenue from funded development, I would strongly consider that deal if I felt that they could continue to maintain the project in the same CPU- and O/S-agnostic manner and with the same high quality standards. Otherwise, all of the companies who benefit financially from libjpeg-turbo need to understand that they have a vested interest in keeping me from starving. I can always make a good business case for continuing to maintain the project, fix bugs, etc., because libjpeg-turbo feeds into my other projects, but in general, major developments in this project arise from funded development/sponsorship opportunities. Moving forward, I will not even consider any major new features unless some form of funding is secured or I can make a business case for the feature in the context of how it benefits my other projects/products.

@feliwir
Copy link

feliwir commented Apr 12, 2018

Thanks a lot for your detailed response! I can totally understand that you don’t have the resources to do such a huge feature without any benefit :) We could need 12 bit jpeg support at my company but we don’t do any opensource funding. I‘d like to donate as a private person, but i think a small donation won’t help you :)

Maybe you could add a Paypal donation badge?

@dcommander
Copy link
Member Author

There is a donation badge on http://libjpeg-turbo.org (I even accept cryptocurrency), and every donation helps, but yes, this feature would require thousands of dollars in funding to make happen.

@dcommander
Copy link
Member Author

dcommander commented Jul 12, 2018

Probably not. Historically, every time someone has tried to implement a major feature like this for libjpeg-turbo, the implementation has required a lot of labor on my part to clean it up, test it, and integrate it. I simply can't afford to do that labor for free, and with this feature, unless you really really know what you're doing (the fact that you are asking how and where to start suggests otherwise), it would probably take less effort for me to write the code myself than it would take for me to fix someone else's code. Thus, I would strongly prefer to find a corporate sponsor willing to pay for my labor to implement this feature from scratch.

As an example, Intel contributed the initial implementation of the AVX2 SIMD extensions for libjpeg-turbo 2.0, but it still took about 200 hours of my time to fix that code, integrate it, and revisit some of their design decisions that were too disruptive to the libjpeg architecture. (A popular mantra we had back when I worked for Fortune 500 companies was: "Development is easy, productization is hard.") Fortunately, Intel and Mozilla paid for that labor, because otherwise, I would have had to reject the feature or go broke trying to integrate it. I would like to avoid either scenario. I don't want anyone to waste their time writing code that I can't accept, but I also can't accept code that doesn't meet the quality standards of libjpeg-turbo. I have just never seen a major feature submission that didn't require a ton of labor to fix and integrate. Sometimes that labor is required months or years later, as the feature is being stabilized for release, and even if someone paid for the initial labor to integrate and test the feature, that contract is long expired. Sometimes the organization that paid for the labor can't even be reached. That happened with the partial image decompression feature that Google submitted for libjpeg-turbo 1.5. They paid for my initial labor to integrate it, but some bugs were discovered a year later, and the unpaid labor I had to donate in order to fix the bugs ended up being about as much as the paid labor to integrate the feature. I just can't do that anymore. I'm broke.

My other open source projects-- the ones that pay most of my bills-- are less prone to these sorts of logistical problems, because they have a general fund. Companies pay for a fixed amount of my labor each year (200 hours for TurboVNC, 100 hours for VirtualGL) that I can use to integrate and test strategic feature contributions, as well as to do other strategic things, like build system overhauls, that are important but which no one is likely to pay for. Historically, with libjpeg-turbo, I've donated all of that labor-- probably amounting to more than a year's salary spread over the 8 years that we've been a project-- toward things such as:

  • implementing, fixing, and optimizing ARM SIMD extensions
  • integrating and fixing features that were needed in order to get libjpeg-turbo into Android
  • integrating MIPS SIMD extensions
  • overhauling the build and packaging system multiple times (all of the labor to implement our universal CMake-based build system in libjpeg-turbo 2.0 was unpaid)
  • migrating the repository from Subversion to Git and moving the code to GitHub
  • fixing old Subversion merge commits so that 'git blame' works properly
  • answering numerous newbie questions
  • investigating and fixing security issues and other bugs (often, I don't know whether an issue is newbie error or a legitimate bug until I spend a couple of hours with it)
  • fixing bugs on platforms that I don't personally use
  • implementing new TurboJPEG functions to make it easier for new users to get up and running with libjpeg-turbo
  • integrating, testing, and assisting in the optimization of SIMD extensions for Huffman coding
  • writing treatises to explain why we're not violating the IJG's license, as the current IJG maintainer has claimed, and to explain why we're not adopting his unofficial extensions to the JFIF format
  • working with ISO/ITU-T to make libjpeg-turbo an acceptable candidate for a JPEG "reference implementation"
  • and probably about 100 other things that I can't remember at the moment.

If libjpeg-turbo had a 200-hour yearly general fund, it would have paid for all of those things. Absent that general fund, I just can't donate that kind of labor anymore. I am hoping that some of the corporations who have profited from my pro bono labor will understand the value of what I do for this project and the value of keeping it independent and vendor-agnostic. Without my work, the project would likely either split, with different implementations being maintained by different chip vendors, or it would become a mess of poorly-maintained contributions with varying degrees of quality.

Sorry for the long response. I try to describe this problem as often and as thoroughly as I can, in hopes that an organization with the means of solving it will read my descriptions, recognize the danger that our project is in, and step forward to help out. But unfortunately, at the moment, helping out primarily means contributing money rather than code.

@dcommander
Copy link
Member Author

Our support for 12-bit JPEG is already better than that of libjpeg, since you can easily compile a 12-bit-capable build of libjpeg-turbo by setting a CMake variable (-DWITH_12BIT=1), and our continuous integration system tests this mode of operation for regressions. In contrast, libjpeg requires editing one of the headers, and they don't include any 12-bit-specific unit tests.

@dcommander
Copy link
Member Author

To answer your question regarding code organization: it will probably take some hours of tinkering before I have a good idea of how to implement this. After spending a lot of hours integrating, refactoring, and optimizing the AVX2 SIMD extensions for 8-bit JPEGs, it's clear that AVX2 isn't a "slam dunk." If those instructions aren't very carefully optimized, then they can actually be slower than SSE2 instructions, and the dual-lane nature of AVX2 makes it such that I can't simply expand the SSE2 instructions into 256-bit-wide equivalents. If I could do that, then 12-bit acceleration would be straightforward, and most anyone with NASM and SIMD experience could probably help me port the existing algorithms to 256-bit. At the moment, however, I'm not even sure whether it's feasible to do this using AVX2. I will need to try it first with a couple of key algorithms and get a general feel for the types of algorithmic modifications that will be necessary. It may be the case that AVX-512 is required in order to fully accelerate 12-bit JPEGs, since AVX-512 has two 256-bit lanes and thus supports an EVEX-encoded 256-bit instruction mode that should allow for a similar algorithmic structure to SSE2. Since AVX-512 has 32 registers, however, I would strongly prefer that any AVX-512 code be written using intrinsics rather than NASM, so that's going to require a completely different implementation. I'm not sure exactly where to put those files just yet-- probably under simd/x86_64-12bit and simd/i386-12bit. I would also need to obtain hardware that supports AVX-512 (currently I only have one machine that even supports AVX2, and the others are SSE3 and older), and I have no money to obtain said hardware.

Item (1) in the issue description above might be something that people could help with, if we could agree on an appropriate strategy. Most of that is just going to be tedious work, not really requiring a deep knowledge of the library. (2), (3), and (4) will require a deep knowledge.

@dcommander
Copy link
Member Author

My experience is that intrinsics work fine as long as register exhaustion isn't a concern-- hence the desire to use them only with instruction sets that have 32 registers or above.

@dcommander
Copy link
Member Author

Intrinsics should have a 1:1 correspondence with underlying SIMD instructions, but yes, I understand that C constructs such as loops will be optimized differently based on which compiler is being used. That effect is, in my experience, negligible. It is a moot point, though, if AVX2 proves capable of handling the 12-bit algorithms. In that case, I will use NASM for those algorithms rather than intrinsics. I don't share your confidence, because there are a lot of places in the code where the data would have to cross lanes. I ran into similar problems with the 8-bit AVX2 SIMD extensions. It was frequently the case that those problems couldn't be solved without using vperm2i128, which kills performance.

@dcommander
Copy link
Member Author

Yes, the code does a lot of loop unrolling already, but that's not what I'm talking about. I don't have time to explain further, but my basic point is that:

  • The most expedient way of getting SIMD support for 12-bit JPEGs is to expand the existing SSE2 algorithms to 256-bit equivalents without making any deep modifications to those algorithms.
  • AVX2's dual-lane nature makes it improbable that such a simple expansion of the algorithms can be accomplished. Thus, a performant AVX2 implementation is likely to be difficult if not impossible. When I say "impossible", I mean that it might not be possible with AVX2 to achieve the speedup necessary to justify the feature. It may be necessary to use EVEX256 to achieve the desired speedup.

@mayeut
Copy link
Contributor

mayeut commented Jul 20, 2018

As another side-note about that:

If you are planning to use intrinsics though, you need to be aware that, at least for Intel compiler, there is no guarantee that it will emit the exact instruction whose intrinsic you used in the code -- the instruction mix can differ between compiler versions and micro-architectural optimization switches used which could make it hard to have consistent and reproducible results with intrinsics.

Intrinsics should have a 1:1 correspondence with underlying SIMD instructions, ...

Although Intrinsics mostly have a 1:1 logical correspondence with underlying SIMD instructions, compiler still have an internal representation & an optimizer. I've also seen clang generate slower code than pure scalar code. Looking at disassembly, the 1:1 correspondence was not respected at all. To be fair, I've also seen it optimize (the right way this time) something I though already optimized (I then back ported the generated assembly to intrinsics).

@dcommander
Copy link
Member Author

I feel as if the loss of productivity due to tracking down those sorts of issues is still probably going to be less than the loss of productivity due to dealing with raw NASM code.

@feliwir
Copy link

feliwir commented Jan 27, 2019

I was wondering if it would be possible to contribute the functon renaming in some way, so it's possible to link both 8-bit and 12-bit libraries into the same binary. I guess instead of macros we'd want to have seperate headers & functions as you've already mentioned in your first post.
This could maybe be seperate from the performance improvements?

@dcommander
Copy link
Member Author

dcommander commented Feb 10, 2019

Yes, that could be separate, but I'm not sure if it makes sense to farm it out. It would probably take me as long to review and test your code as it would to implement the feature myself, since I already have a good idea of how the build system should be modified to build and test both 12-bit and 8-bit JPEG. I'm leaning strongly toward just suffixing everything with "12" (jpeglib12.h, jpeg12_*(), libjpeg12), since that would allow both 8-bit and 12-bit usage within the same program, but I predict that I'll run into problems in the context of trying to make cjpeg and djpeg support both 8-bit and 12-bit, and those problems may necessitate a different strategy.

@feliwir
Copy link

feliwir commented Feb 14, 2019

We went for a similar approach with libjpeg at our company to support both 12-bit and 8-bit at the same time. I can try to set something up, but it will probably indeed end up faster if you do it yourself at the end.
Using both 8-bit and 12-bit at the same time in the program is a key requirement for me though.

@dcommander
Copy link
Member Author

I definitely want to have a combined 8-bit/12-bit package in 2.1, and I think the workload for just that part of it will probably be doable without funded development, but I need to see how the general fund plays out over the next couple of months. I already used up most of it for March in order to get the 2.0.2 release out the door, so I'm back into critical bug fix mode until the funding catches up. The 12-bit SIMD extensions will likely never happen without major funded development, but the additional visibility for the 12-bit feature might goad some organization into ponying up the money to accelerate it.

@feliwir
Copy link

feliwir commented May 17, 2019

If it's of any help there is gdcmjpeg, which supports multiple bit depths in one library aswell: https://github.com/malaterre/GDCM/tree/master/Utilities/gdcmjpeg . Basically just a modification from standard libjpeg

@dcommander
Copy link
Member Author

Interesting. They use libjpeg12 as the 12-bit library name, which I also want to do, but they store the headers under different directories (8/, 12/, etc.) Their approach would not allow 8-bit and 12-bit JPEGs to be processed by the same program, since they aren't doing anything to avoid conflicts between the 8-bit and 12-bit libjpeg API function names. I am leaning very strongly toward treating 12-bit as a completely separate API, with different function prefixes (jpeg12_ instead of jpeg_), header names (jpeglib12.h, etc.), and library names. I think that makes the most sense for backward compatibility. Truth is, there isn't really a standard around this, so we have the opportunity to set the trend here. Knowing what I know about how libjpeg-turbo is distributed, distributors will probably want to package the 12-bit build separately, so it would make sense to make it completely independent of the 8-bit build (i.e. you can enable/disable 8-bit and 12-bit separately with CMake options.)

@feliwir
Copy link

feliwir commented May 19, 2019

So you don't want to support both bit depth in parallel in a single library, if i understoof you correctly? Or would the cmake option also allow to pack both into the same library.

@dcommander
Copy link
Member Author

I haven't dug into the issue yet, but if possible, I think it would be best to keep 8-bit and 12-bit in separate libraries but allow those libraries to be used within the same program if desired. That has the following advantages:

  1. It would make it easier for mobile developers to control the size of their applications without using a custom build of libjpeg-turbo. Some mobile developers even go to the trouble of building libjpeg-turbo with only compression or decompression functions. They will almost certainly complain if I add 12-bit functions to the existing library. I'm thinking mainly of iOS developers here, since Android developers have to build libjpeg-turbo from source anyhow.
  2. It would make it easier for distributors to package 12-bit libjpeg-turbo separately, so it doesn't have to be installed in order to use 8-bit libjpeg-turbo.
  3. The libjpeg API undergoes a great deal of scrutiny (even to the point that its backward compatibility is tracked automatically: https://abi-laboratory.pro/?view=timeline&l=libjpeg-turbo), so putting 12-bit libjpeg-turbo into a separate library will guard against regressions in the very popular 8-bit libjpeg API.

Again, though, I cannot come to a final decision on this until I dig into it. All we're doing is spitballing ideas at the moment. There are likely to be issues that will be encountered once I start down a particular path, and those issues may dictate another path.

@dcommander
Copy link
Member Author

I had hoped that I could manage to at least do the build consolidation prior to libjpeg-turbo 2.1, but unfortunately, that's going to have to be deferred to a later release due to lack of project resources.

@dcommander dcommander removed this from the libjpeg-turbo 2.1 milestone Oct 23, 2020
dcommander added a commit that referenced this issue Mar 11, 2022
Partially implements #199

This commit also implements a request from #178 (the ability to compile
the libjpeg example as a standalone program.)
@dcommander
Copy link
Member Author

The dev/2.2 evolving pre-release build now contains support for both 8-bit and 12-bit JPEGs. This is accomplished by using two separate libraries, libjpeg.* (jpeg.lib/jpeg*.dll for Windows/MSVC) and libjpeg12.* (jpeg12.lib/jpeg12-*.dll for Windows/MSVC), and two separate header files, jpeglib.h and jpeg12lib.h. The 12-bit library uses namespace mangling to avoid namespace conflicts with the 8-bit library, so any libjpeg API function, data type, structure, and macro that is affected by the sample size (most of them) will be named jpeg12_* or *J12* or j12_* instead of jpeg_* or *J* or j_*.

I also repurposed the libjpeg example code so that it is now compiled into a standalone application by the build system, which also invokes that application as part of the test suite. This ensure that 8-bit and 12-bit JPEGs can be used within the same application, regardless of whether the application uses static or dynamic linking.

Please give it a try and give me feedback. We have plenty of time to change the implementation between now and the 2.2 release (which isn't even on the schedule yet.)

12-bit support can be added to the TurboJPEG API as part of the TurboJPEG v3 overhaul, for which I'm currently trying to obtain funding. The new API structure will make it a lot easier to support 12-bit JPEGs, since it would only require a separate set of buffer set/get functions for 12-bit samples rather than a completely separate API.

SIMD support for 12-bit JPEGs should also now be straightforward to integrate. The aforementioned namespace mangling uses jsimd12_ as a prefix for the no-op SIMD function stubs in the 12-bit library, so those stubs can eventually be replaced with actual SIMD functions.

@dcommander
Copy link
Member Author

Since the remaining functional improvements for 12-bit JPEGs involve the TurboJPEG API, those have been subsumed by #517. I am repurposing this issue to discuss 12-bit performance improvements (i.e. SIMD extensions.)

@dcommander dcommander changed the title Performance and functional improvements for 12-bit JPEGs Performance improvements (SIMD extensions) for 12-bit JPEGs Mar 15, 2022
@dcommander
Copy link
Member Author

With all apologies to those who may have already started integrating the libjpeg12 API introduced in 7fec507, I came to realize that that implementation was an unreadable and unmaintainable mess. (It also became clear, in the course of researching #402, that the approach in 7fec507 would eventually necessitate a whole new API and library just for 16-bit lossless JPEG images, which is untenable.) Thus, I used a significant amount of uncompensated labor to completely revisit the problem. Refer to e8b40f3 for more details, but in a nutshell, 8-bit and 12-bit data precision now share the same exposed libjpeg API structures (with caveats regarding the need to type-cast a couple of rarely-used jpeg_decompress_struct fields and a couple of jpeg_memory_mgr methods when using 12-bit samples.) This made it possible to support 12-bit data precision using only a handful of new libjpeg API functions rather than a whole new API, which also made it possible to support 12-bit data precision in the existing cjpeg, djpeg, and jpegtran applications. (In cjpeg, this is enabled by passing -precision 12 to the application. djpeg and jpegtran handle 12-bit images automatically.)

@dcommander
Copy link
Member Author

dcommander commented Jan 31, 2023

In libjpeg-turbo 3.0.x, 12-bit data precision is now fully supported in the libjpeg and TurboJPEG APIs and associated programs, without the need to re-compile libjpeg-turbo from source. Thus, all of the original goals for this issue have been met except for SIMD acceleration of the 12-bit-per-sample lossy algorithms.

(Note that, because of the existing Huffman encoding SIMD extensions, the compression of 12-bit lossy baseline JPEG images is already sped up by about 7-10%, and the compression of 12-bit lossy progressive JPEG images is sped up by about 25%.)

@oliver-batchelor
Copy link

oliver-batchelor commented Mar 30, 2023

I made a python binding for the 12 bit encoding (https://github.com/uc-vision/turbojpeg-python). I am curious what kind of performance is expected here?

My benchmarks give a throughput of approximately 1/10th of the 8 bit API.
The image in question here is a 4000x3000 colour image (scaled by 16 for the 12 bit case).

python scripts/bench-encoding.py images/1.jpg -j 24 -q 90 -n 1000                                                                                                          

turbojpeg threaded j=24: 208.9 images/s   (pyturbojpeg published library)
turbojpegv3 threaded j=24: 271.9 images/s   (new binding 8 bit)
turbojpegv3 (12bit) threaded j=24:  34.9 images/s  (new binding 12 bit)

@dcommander
Copy link
Member Author

@oliver-batchelor On my Mac Mini, which has a 3 GHz dual-core Intel Core i7, I observe (with TJBench) that compressing the 8-bit artificial.ppm image from http://imagecompression.info/test_images into an 8-bit JPEG image is about 5-7x faster than compressing the 16-bit artificial.ppm image from http://imagecompression.info/test_images into a 12-bit JPEG image. Decompressing the 8-bit JPEG image is about 2-3x faster than decompressing the 12-bit JPEG image.

I'm not sure what you mean by "scaled by 16", but if you are enabling one of the scaled IDCT algorithms in libjpeg-turbo (which scales the image in the frequency domain) or you are performing scaling in the spatial domain, then that will create additional overhead.

@oliver-batchelor
Copy link

oliver-batchelor commented Mar 31, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants