
Faster MCU_Decode #657

Open

wants to merge 2 commits into base: main

Conversation

@icodywu commented Feb 28, 2023

In libjpeg-turbo, decompression is dominated by MCU and Huffman decoding,
which account for about 73% of decompression time in our observation of decompressing a large pool of images.
We have redesigned MCU and Huffman decoding to achieve substantially higher speed.
With the new algorithmic design, we observed an overall 40% speedup in JPEG decompression.

Our new design consists of four main aspects.

  1. A faster Huffman decoding method using two lookup tables, which decodes each symbol
     in at most two lookups. It also includes an effective preprocessing step that
     dynamically skips the second lookup (see the sketch after this list).
  2. Different decoder structures for DC and AC, to account for their vastly different
     numbers of states. The structural difference lies in the lookup table sizes, which
     allows for better caching.
  3. A new Huffman state structure that eliminates existing arithmetic operations and, in
     particular, removes a critical branch inside the MCU decoding loop.
  4. A bit_buffer filling strategy that keeps the buffer always adequate for the proposed
     Huffman lookahead table decoding, allowing the cumbersome function jpeg_huff_decode()
     to be deprecated.
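
For illustration, the following is a minimal sketch of the two-table lookahead idea in items 1 and 4. The struct layouts, field names, table sizes, and the bit-reader helpers (l1_entry, l2_entry, bitreader, peek_bits, drop_bits, LOOKAHEAD) are assumptions made for this sketch, not the actual structures in the patch:

/* Sketch only: two-level Huffman lookahead decoding.  The first table is
 * indexed by LOOKAHEAD peeked bits and resolves short codes in one lookup;
 * longer codes fall through to a second, smaller sub-table.  The bit buffer
 * is assumed to hold at least 16 valid bits (cf. item 4 above). */
#include <stdint.h>

#define LOOKAHEAD 8                  /* bits peeked for the first lookup */

typedef struct {
  uint8_t  nbits;                    /* code length; 0 means "use level 2" */
  uint8_t  symbol;                   /* decoded symbol when nbits != 0 */
  uint16_t sub_base;                 /* start of this prefix's sub-table */
} l1_entry;

typedef struct {
  uint8_t nbits;                     /* full code length (> LOOKAHEAD) */
  uint8_t symbol;
} l2_entry;

typedef struct {
  uint64_t bits;                     /* MSB-aligned bit buffer */
  int bits_left;
} bitreader;

static unsigned peek_bits(const bitreader *br, int n)
{
  return (unsigned)(br->bits >> (64 - n));
}

static void drop_bits(bitreader *br, int n)
{
  br->bits <<= n;
  br->bits_left -= n;
}

/* Decode one symbol using at most two table lookups. */
static int decode_symbol(bitreader *br, const l1_entry *l1, const l2_entry *l2)
{
  l1_entry e1 = l1[peek_bits(br, LOOKAHEAD)];

  if (e1.nbits) {                    /* common case: one lookup suffices */
    drop_bits(br, e1.nbits);
    return e1.symbol;
  }
  /* Rare case: the code is longer than LOOKAHEAD bits; the bits that follow
   * the first LOOKAHEAD bits select an entry in the sub-table. */
  unsigned extra = peek_bits(br, 16) & ((1u << (16 - LOOKAHEAD)) - 1);
  l2_entry e2 = l2[e1.sub_base + extra];
  drop_bits(br, e2.nbits);
  return e2.symbol;
}

In this sketch, the guarantee from item 4 (the bit buffer always holds at least one full 16-bit code) is what makes the unconditional peek in the slow path safe.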

We maintained the original function inputs/outputs so that the new files simply replace
their existing counterparts in libjpeg-turbo and are ready to run.

Functional modifications

    1. Modified the structure huff_entropy_decoder.
    2. Modified the macro FILL_BIT_BUFFER_FAST.
    3. Modified the following seven functions while maintaining their original functional
       inputs/outputs (see the sketch after this list for the modified
       jpeg_make_d_derived_tbl() signature in use):
  • METHODDEF(void) start_pass_huff_decoder(j_decompress_ptr cinfo);
  • GLOBAL(void) jpeg_make_d_derived_tbl(j_decompress_ptr cinfo, boolean isDC,
                                         int tblno, void **pdtbl);
  • GLOBAL(boolean) jpeg_fill_bit_buffer(bitread_working_state *state,
                    register bit_buf_type get_buffer, register int bits_left,
                    int nbits);
  • LOCAL(boolean) decode_mcu_slow(j_decompress_ptr cinfo, JBLOCKROW *const MCU_data);
  • LOCAL(boolean) decode_mcu_fast(j_decompress_ptr cinfo, JBLOCKROW *const MCU_data);
  • METHODDEF(boolean) decode_mcu(j_decompress_ptr cinfo, JBLOCKROW *MCU_data);
  • GLOBAL(void) jinit_huff_decoder(j_decompress_ptr cinfo);
    4. Deprecated the function GLOBAL(void) jpeg_huff_decode().
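
As a usage illustration of the modified jpeg_make_d_derived_tbl() above, this hedged sketch shows how a start-of-pass routine might request separately structured DC and AC derived tables through the void ** output parameter; the helper function, its name, and the assumption that jdhuff.h carries the new prototype are mine for this sketch, not code from the patch:

/* Sketch only: build per-component DC and AC derived tables with the
 * modified signature listed above.  isDC selects the table layout, and the
 * void ** output lets DC and AC receive differently sized structures
 * (design aspect 2). */
#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jdhuff.h"                 /* assumed to declare the new prototype */

LOCAL(void)
make_component_tables(j_decompress_ptr cinfo, jpeg_component_info *compptr,
                      void **dc_tbl, void **ac_tbl)
{
  jpeg_make_d_derived_tbl(cinfo, TRUE, compptr->dc_tbl_no, dc_tbl);   /* DC layout */
  jpeg_make_d_derived_tbl(cinfo, FALSE, compptr->ac_tbl_no, ac_tbl);  /* AC layout */
}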

@dcommander (Member)

Definitely not 40%. Using the benchmark procedures and test images described here, I observe:

2.8 GHz Intel Xeon W3530

  • 0.2% - 6.3% (+/- 1%) [average 3.1%] overall decompression speedup with 64-bit code

3.6 GHz Intel Xeon W2123

  • 3.3% - 11.1% (+/- 2%) [average 7.0%] overall decompression speedup with 64-bit code

Other issues:

  • Decompression does not seem to complete with 32-bit code (infinite loop?)
  • Your code should be based on the main branch. As it is, I had to disable C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in order to build it.
  • Your code should not introduce trailing whitespace.
  • Your code should be formatted and spaced consistently with the rest of the libjpeg-turbo code base.
  • Your name should be added to the copyright headers, as required by the IJG license, to indicate which files you modified.
  • Code comments should not be personal. That is, do not refer to yourself, and when describing modifications, mention the baseline against which the modifications were made.
  • Commented debugging code should be removed.
  • Existing code comments should not be removed unless there is a good reason to do so.

Please bear in mind that libjpeg-turbo is used indirectly by billions of people every day (through Firefox, Chrome, Linux, Android, etc.) and is an ISO/ITU-T reference implementation, so extreme care must be taken when modifying it. The Huffman decoder is an attack surface for security exploits, so even more care must be taken when modifying that particular module. The issues above do not instill me with confidence that such care was taken.

@icodywu (Author) commented Mar 14, 2023

I did observe a 40% speedup for large-image decompression in my image library. A key idea behind it is to spend more time building a more sophisticated Huffman decoding table in exchange for more efficient Huffman decoding. Indeed, the performance gain depends on the image size: for tiny images, the advantage of faster Huffman decoding is offset by the slower construction of the decoding table.
I shall address your technical comments soon.

@dcommander (Member)

The images I tested were 1 to 10 megapixels in size, so not "tiny" by any means. How are you measuring performance?

@icodywu (Author) commented Mar 31, 2023

Sorry for my delayed responses. See my answers below.

Decompression does not seem to complete with 32-bit code (infinite loop?)

I have modified the code to accommodate 32-bit CPUs. Could you re-test it?

Your code should be based on the main branch. As it is, I had to disable C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in order to build it.

It is. I did not find C_LOSSLESS_SUPPORTED and D_LOSSLESS_SUPPORTED in any of the three modified files.

Your code should not introduce trailing whitespace.

Removed.

Your code should be formatted and spaced consistently with the rest of the libjpeg-turbo code base.

Done.

Your name should be added to the copyright headers, as required by the IJG license, to indicate which files you modified.

Done.

Code comments should not be personal. That is, do not refer to yourself, and when describing modifications, mention the baseline against which the modifications were made.

Followed.

Commented debugging code should be removed.

Done.

Existing code comments should not be removed unless there is a good reason to do so.

Restored.

@icodywu changed the title from "Faster MCU_Decode to boost overall decompression throughput by 40%." to "Faster MCU_Decode" on Mar 31, 2023
@icodywu (Author) commented Mar 31, 2023

The performance comparison was actually measured by a coworker who recently quit. I have been scrambling to pick up his workload, including replicating his testing environment. One key checkpoint is that MCU_Decode() accounts for 73% of the CPU time in the existing version. I also know the pool includes some huge images of ~400 MB, and that it contains no progressive images.

In addition, the performance is system dependent and is tuned by the parameter pair

#define DC_LOOKAHEAD 8    /* # of bits of lookahead */
DC_derived_tbl.look2up[128]
wherein the alternatives are 6: 512; 7: 256; 8: 128; 9: 64

and

#define AC_LOOKAHEAD 10   /* # of bits of lookahead */
AC_derived_tbl.look2up[280]
wherein the alternatives are 8: 472; 9: 344; 10: 280
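
For clarity, here is a hedged sketch of how these two knobs pair with the quoted second-level table sizes; only the #define values, the look2up field name, and the size pairings come from the comment above, while the struct wrappers and the element type are assumptions. A larger lookahead enlarges the first-level table but shrinks the second-level table and reduces second lookups, at the cost of cache footprint and table-build time:

#define DC_LOOKAHEAD 8        /* alternatives: 6, 7, 8, or 9 bits of lookahead */
#define AC_LOOKAHEAD 10       /* alternatives: 8, 9, or 10 bits of lookahead */

typedef struct {
  /* ... first-level entries indexed by DC_LOOKAHEAD peeked bits ... */
  short look2up[128];         /* pairing: 6 -> 512, 7 -> 256, 8 -> 128, 9 -> 64 */
} DC_derived_tbl;             /* element type is an assumption for this sketch */

typedef struct {
  /* ... first-level entries indexed by AC_LOOKAHEAD peeked bits ... */
  short look2up[280];         /* pairing: 8 -> 472, 9 -> 344, 10 -> 280 */
} AC_derived_tbl;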

@dcommander (Member)

Given that the libjpeg-turbo General Fund is exhausted for the next 13 months, it is unlikely that I will be able to look at this any time soon. I have no reason to believe that your fixes magically fixed the performance. I tested with fairly large (up to 7.4-megapixel) images. You have never indicated the type of CPU on which you ran your benchmarks, but 73% CPU time spent on Huffman decoding is definitely not normal.

@icodywu (Author) commented Apr 9, 2023

AMD EPYC 7742 64-core, base clock 2.25GHz
