
Faster MCU_Decode #657

Open

wants to merge 2 commits into base: main

Conversation

@icodywu commented Feb 28, 2023

In libjpeg-turbo, decompression is dominated by MCU and Huffman decoding,
which account for about 73% of decompression time in our observation of decompressing a large pool of images.
We have redesigned MCU and Huffman decoding to achieve substantially higher speed.
With the new algorithmic design, we observed an overall 40% speedup in JPEG decompression.

Our new design consists of four main aspects.

  1. A faster Huffman decoding method using two lookup tables, which decodes each symbol
     in at most two lookups. It also includes an effective preprocessing step that
     dynamically skips the second lookup (see the sketch after this list).
  2. Different decoder structures for DC and AC, to account for their vastly different
     numbers of states. The structural difference lies in the lookup table sizes, which
     allows for better caching.
  3. A new Huffman state structure that eliminates existing arithmetic operations and, in
     particular, removes a critical branch inside the MCU decoding loop.
  4. A bit_buffer filling strategy that keeps the buffer always adequate for the proposed
     Huffman lookahead table decoding, allowing the cumbersome function jpeg_huff_decode()
     to be deprecated.
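
For illustration, the following is a minimal sketch of the two-table lookahead idea in items 1 and 4. The struct layouts, field names, table sizes, and the bit-reader helpers (l1_entry, l2_entry, bitreader, peek_bits, drop_bits, LOOKAHEAD) are assumptions made for this sketch, not the actual structures in the patch:

/* Sketch only: two-level Huffman lookahead decoding.  The first table is
 * indexed by LOOKAHEAD peeked bits and resolves short codes in one lookup;
 * longer codes fall through to a second, smaller sub-table.  The bit buffer
 * is assumed to hold at least 16 valid bits (cf. item 4 above). */
#include <stdint.h>

#define LOOKAHEAD 8                  /* bits peeked for the first lookup */

typedef struct {
  uint8_t  nbits;                    /* code length; 0 means "use level 2" */
  uint8_t  symbol;                   /* decoded symbol when nbits != 0 */
  uint16_t sub_base;                 /* start of this prefix's sub-table */
} l1_entry;

typedef struct {
  uint8_t nbits;                     /* full code length (> LOOKAHEAD) */
  uint8_t symbol;
} l2_entry;

typedef struct {
  uint64_t bits;                     /* MSB-aligned bit buffer */
  int bits_left;
} bitreader;

static unsigned peek_bits(const bitreader *br, int n)
{
  return (unsigned)(br->bits >> (64 - n));
}

static void drop_bits(bitreader *br, int n)
{
  br->bits <<= n;
  br->bits_left -= n;
}

/* Decode one symbol using at most two table lookups. */
static int decode_symbol(bitreader *br, const l1_entry *l1, const l2_entry *l2)
{
  l1_entry e1 = l1[peek_bits(br, LOOKAHEAD)];

  if (e1.nbits) {                    /* common case: one lookup suffices */
    drop_bits(br, e1.nbits);
    return e1.symbol;
  }
  /* Rare case: the code is longer than LOOKAHEAD bits; the bits that follow
   * the first LOOKAHEAD bits select an entry in the sub-table. */
  unsigned extra = peek_bits(br, 16) & ((1u << (16 - LOOKAHEAD)) - 1);
  l2_entry e2 = l2[e1.sub_base + extra];
  drop_bits(br, e2.nbits);
  return e2.symbol;
}

In this sketch, the guarantee from item 4 (the bit buffer always holds at least one full 16-bit code) is what makes the unconditional peek in the slow path safe.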

We maintained the original function inputs/outputs so that the new files simply replace
their existing counterparts in libjpeg-turbo and are ready to run.

Functional modifications

    1. Modified the structure huff_entropy_decoder.
    2. Modified the macro FILL_BIT_BUFFER_FAST.
    3. Modified the following seven functions while maintaining their original functional
       inputs/outputs (see the sketch after this list for the modified
       jpeg_make_d_derived_tbl() signature in use):
  • METHODDEF(void) start_pass_huff_decoder(j_decompress_ptr cinfo);
  • GLOBAL(void) jpeg_make_d_derived_tbl(j_decompress_ptr cinfo, boolean isDC,
                                         int tblno, void **pdtbl);
  • GLOBAL(boolean) jpeg_fill_bit_buffer(bitread_working_state *state,
                    register bit_buf_type get_buffer, register int bits_left,
                    int nbits);
  • LOCAL(boolean) decode_mcu_slow(j_decompress_ptr cinfo, JBLOCKROW *const MCU_data);
  • LOCAL(boolean) decode_mcu_fast(j_decompress_ptr cinfo, JBLOCKROW *const MCU_data);
  • METHODDEF(boolean) decode_mcu(j_decompress_ptr cinfo, JBLOCKROW *MCU_data);
  • GLOBAL(void) jinit_huff_decoder(j_decompress_ptr cinfo);
    4. Deprecated the function GLOBAL(void) jpeg_huff_decode().
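
As a usage illustration of the modified jpeg_make_d_derived_tbl() above, this hedged sketch shows how a start-of-pass routine might request separately structured DC and AC derived tables through the void ** output parameter; the helper function, its name, and the assumption that jdhuff.h carries the new prototype are mine for this sketch, not code from the patch:

/* Sketch only: build per-component DC and AC derived tables with the
 * modified signature listed above.  isDC selects the table layout, and the
 * void ** output lets DC and AC receive differently sized structures
 * (design aspect 2). */
#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jdhuff.h"                 /* assumed to declare the new prototype */

LOCAL(void)
make_component_tables(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                      void **dc_tbl, void **ac_tbl)
{
  jpeg_make_d_derived_tbl(cinfo, TRUE, compptr->dc_tbl_no, dc_tbl);   /* DC layout */
  jpeg_make_d_derived_tbl(cinfo, FALSE, compptr->ac_tbl_no, ac_tbl);  /* AC layout */
}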

@dcommander (Member)

Definitely not 40%. Using the benchmark procedures and test images described here, I observe:

2.8 GHz Intel Xeon W3530

  • 0.2% - 6.3% (+/- 1%) [average 3.1%] overall decompression speedup with 64-bit code

3.6 GHz Intel Xeon W2123

  • 3.3% - 11.1% (+/- 2%) [average 7.0%] overall decompression speedup with 64-bit code

Other issues:

  • Decompression does not seem to complete with 32-bit code (infinite loop?)
  • Your code should be based on the main branch. As it is, I had to disable C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in order to build it.
  • Your code should not introduce trailing whitespace.
  • Your code should be formatted and spaced consistently with the rest of the libjpeg-turbo code base.
  • Your name should be added to the copyright headers, as required by the IJG license, to indicate which files you modified.
  • Code comments should not be personal. That is, do not refer to yourself, and when describing modifications, mention the baseline against which the modifications were made.
  • Commented debugging code should be removed.
  • Existing code comments should not be removed unless there is a good reason to do so.

Please bear in mind that libjpeg-turbo is used indirectly by billions of people every day (through Firefox, Chrome, Linux, Android, etc.) and is an ISO/ITU-T reference implementation, so extreme care must be taken when modifying it. The Huffman decoder is an attack surface for security exploits, so even more care must be taken when modifying that particular module. The issues above do not instill me with confidence that such care was taken.

@icodywu (Author) commented Mar 14, 2023

I did observe a 40% speedup for large-image decompression in my image library. A key idea behind it is to spend more time building a more sophisticated Huffman decoding table in exchange for more efficient Huffman decoding. Indeed, the performance gain depends on the image size: for tiny images, the advantage of faster Huffman decoding is offset by the slower construction of the decoding table.
I shall address your technical comments soon.

@dcommander (Member)

The images I tested were 1 to 10 megapixels in size, so not "tiny" by any means. How are you measuring performance?

@icodywu (Author) commented Mar 31, 2023

Sorry for my delayed responses. See my answers below.

Decompression does not seem to complete with 32-bit code (infinite loop?)

I have modified the code to accommodate 32-bit CPUs. Could you re-test it?

Your code should be based on the main branch. As it is, I had to disable C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in order to build it.

It is. I did not find C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in any of the three modified files.

Your code should not introduce trailing whitespace.

Removed.

Your code should be formatted and spaced consistently with the rest of the libjpeg-turbo code base.

Done.

Your name should be added to the copyright headers, as required by the IJG license, to indicate which files you modified.

Done.

Code comments should not be personal. That is, do not refer to yourself, and when describing modifications, mention the baseline against which the modifications were made.

Followed.

Commented debugging code should be removed.

Done.

Existing code comments should not be removed unless there is a good reason to do so.

Restored.

@icodywu changed the title from "Faster MCU_Decode to boost overall decompression throughput by 40%." to "Faster MCU_Decode" on Mar 31, 2023
@icodywu (Author) commented Mar 31, 2023

The performance comparison was actually measured by a coworker who recently quit. I have been scrambling to pick up his workload, including replicating his testing environment. One key checkpoint is that MCU_Decode() accounts for 73% of the CPU time in the existing version. I also know the pool includes some huge images of ~400 MB, and that it contains no progressive images.

In addition, the performance is system dependent and is tuned by the parameter pair

#define DC_LOOKAHEAD 8    /* # of bits of lookahead */
DC_derived_tbl.look2up[128]
wherein the alternatives are 6: 512; 7: 256; 8: 128; 9: 64

and

#define AC_LOOKAHEAD 10   /* # of bits of lookahead */
AC_derived_tbl.look2up[280]
wherein the alternatives are 8: 472; 9: 344; 10: 280
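
For clarity, here is a hedged sketch of how these two knobs pair with the quoted second-level table sizes; only the #define values, the look2up field name, and the size pairings come from the comment above, while the struct wrappers and the element type are assumptions. A larger lookahead enlarges the first-level table but shrinks the second-level table and reduces second lookups, at the cost of cache footprint and table-build time:

#define DC_LOOKAHEAD 8        /* alternatives: 6, 7, 8, or 9 bits of lookahead */
#define AC_LOOKAHEAD 10       /* alternatives: 8, 9, or 10 bits of lookahead */

typedef struct {
  /* ... first-level entries indexed by DC_LOOKAHEAD peeked bits ... */
  short look2up[128];         /* pairing: 6 -> 512, 7 -> 256, 8 -> 128, 9 -> 64 */
} DC_derived_tbl;             /* element type is an assumption for this sketch */

typedef struct {
  /* ... first-level entries indexed by AC_LOOKAHEAD peeked bits ... */
  short look2up[280];         /* pairing: 8 -> 472, 9 -> 344, 10 -> 280 */
} AC_derived_tbl;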

@dcommander (Member)

Given that the libjpeg-turbo General Fund is exhausted for the next 13 months, it is unlikely that I will be able to look at this any time soon. I have no reason to believe that your fixes magically fixed the performance. I tested with fairly large (up to 7.4-megapixel) images. You have never indicated the type of CPU on which you ran your benchmarks, but 73% CPU time spent on Huffman decoding is definitely not normal.

@icodywu (Author) commented Apr 9, 2023

AMD EPYC 7742 64-core, base clock 2.25GHz
