Question: overriding jpeg_get_large #308

flx42 · 2018-11-28T19:50:46Z

Request

I'm trying to swap the implementation of jpeg_get_large to not use the default malloc call. So far I see the following options:

a. Patching jmemnobs.c and building a customized libjpeg-turbo for my project.
b. Use LD_PRELOAD at runtime. Not very convenient to use.
c. Provide a custom implementation of jpeg_memory_mgr. I believe this would mean re-implementing all the pool logic. This seems unnecessary complicated just to change jpeg_get_large.

Is there another way? Should there be another way?

Background

For our workloads, we have determined it's better to run with transparent_hugepage set to madvise instead of always. You can find multiple summaries on this topic through Google, and many projects recommend disabling it:
https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/

For libjpeg-turbo, this has a significant impact on performance for (very) large images, since a regular malloc is used:

$ sudo mkdir /tmp/tmpfs
$ sudo mount -t tmpfs tmpfs /tmp/tmpfs/
$ cd /tmp/tmpfs
$ wget https://upload.wikimedia.org/wikipedia/commons/7/7e/In_the_Conservatory.jpg

$ echo 'always' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ sudo perf stat /opt/libjpeg-turbo/bin/jpegtran -outfile out.jpg In_the_Conservatory.jpg

 Performance counter stats for '/opt/libjpeg-turbo/bin/jpegtran -outfile out.jpg In_the_Conservatory.jpg':

       6887.508134      task-clock (msec)         #    1.000 CPUs utilized          
                 9      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             4,777      page-faults               #    0.694 K/sec                  
    25,305,340,354      cycles                    #    3.674 GHz                    
    43,888,294,171      instructions              #    1.73  insn per cycle         
     6,464,733,135      branches                  #  938.617 M/sec                  
       291,011,151      branch-misses             #    4.50% of all branches        

       6.887748021 seconds time elapsed

$ echo 'madvise' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ sudo perf stat /opt/libjpeg-turbo/bin/jpegtran -outfile out.jpg In_the_Conservatory.jpg

 Performance counter stats for '/opt/libjpeg-turbo/bin/jpegtran -outfile out.jpg In_the_Conservatory.jpg':

       7765.573369      task-clock (msec)         #    1.000 CPUs utilized          
                11      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
         1,008,381      page-faults               #    0.130 M/sec                  
    28,466,581,426      cycles                    #    3.666 GHz                    
    46,853,774,484      instructions              #    1.65  insn per cycle         
     6,953,442,264      branches                  #  895.419 M/sec                  
       294,329,516      branch-misses             #    4.23% of all branches        

       7.765872825 seconds time elapsed

As you can see, there is 200x more page faults without transparent huge pages, yielding an execution time slower by one second.

If I recompile libjpeg-turbo to use madvise, I get similar performance as with THP set to always.

GLOBAL(void *) 
jpeg_get_large(j_common_ptr cinfo, size_t sizeofobject)
{ 
  void *p; 

  if (posix_memalign(&p, 2097152, sizeofobject) != 0)  
    return NULL;  

  madvise(p, sizeofobject, MADV_HUGEPAGE);  
  return p;
}

Solutions

@dcommander if you agree we should try to fix this besides the solutions a) b) and c) above, I see two options:

jpeg_get_large could be modified to use either posix_memalign+madvise or mmap+madvise. Possibly behind a ifdef tied to a cmake configuration flag.
jpeg_get_large/jpeg_get_small could be made weak symbols so that users can override them with their own implementation.

The text was updated successfully, but these errors were encountered:

dcommander · 2018-11-28T21:02:36Z

You don't need to re-implement all of jpeg_memory_mgr. You just need to copy the code from alloc_large() into a new function (my_alloc_large() or whatnot), call your own custom version of jpeg_get_large() (my_jpeg_get_large() or whatnot) from my_alloc_large(), and set cinfo->mem->alloc_large = my_alloc_large at run time.
https://groups.google.com/d/msg/libjpeg-turbo-users/4bj_pnWMIRc/PTRCqiSACAAJ

I don't think it would be appropriate to enable huge pages in libjpeg-turbo by default. Presumably this would waste memory for smaller images (not good, given libjpeg-turbo's usage in popular web browsers), and given that MADV_HUGEPAGE is not even documented in some Linux distributions, I have concerns about the compatibility and behavior of that feature with various Un*x platforms and Linux distributions and image sizes. At minimum, it may need to be enabled only for image size ranges where it can be shown to be beneficial, and determining that would require a lot of benchmarking and testing on a variety of platforms. Also, I strongly suspect that different operating systems, and perhaps even different Linux kernel versions, might need a different page alignment.

Also, given that others in the community have encountered a need to override jpeg_get_*() and jpeg_free_*() for completely different reasons, it doesn't make much sense to me to officially adopt your specific implementation of jpeg_get_large() (even with #ifdef), given that your implementation may or may not have the same utility for other users. It seems like the best general solution would be to make it easier to override the jpeg_get_*() and jpeg_free_*() functions. For security reasons, however, I am opposed to allowing this via weak linking (yes, users could still override malloc() with LD_PRELOAD, but application developers can-- and do-- also prevent this by compiling with -static, so I do not want to undermine that use case.) It really needs to be something that an application explicitly enables, perhaps via an API extension of some sort-- a callback mechanism, for instance, that can be explicitly enabled via a new API function. I'm happy to implement such a feature as a funded development project, but I can tell you that the odds of it happening without funding are approximately 0%, since I have no need for such a feature myself.

Long story short: your quickest and only free-as-in-beer solution is to clone the alloc_large() function, even though that does mean duplicating code.

flx42 · 2018-11-28T21:21:23Z

You just need to copy the code from alloc_large() into a new function (my_alloc_large() or whatnot), call your own custom version of jpeg_get_large() (my_jpeg_get_large() or whatnot) from my_alloc_large(), and set cinfo->mem->alloc_large = my_alloc_large at run time.

I did think of that, but this code uses structures from jmemmgr.c that are not exposed in the API, such as my_memory_mgr or large_pool_struct. Copying the internal structure definitions to my project doesn't seem like a safe solution. That's why I said it seems better to replace the whole implementation; unless I'm missing something.

for completely different reasons, it doesn't make much sense to me to officially adopt your specific implementation of jpeg_get_large() (even with #ifdef), given that your implementation may or may not have the same utility for other users. It seems like the best general solution would be to make it easier to override the jpeg_get_() and jpeg_free_()

Sure, my code snippet was a simple example, I didn't claim it was the code to use (otherwise I would have open a PR directly). I agree with all your points around small images and benchmarking.

I'm happy to implement such a feature as a funded development project, but I can tell you that the odds of it happening without funding are approximately 0%, since I have no need for such a feature myself.

ACK, thanks for the transparency.

dcommander · 2018-11-28T22:41:06Z

I did think of that, but this code uses structures from jmemmgr.c that are not exposed in the API, such as my_memory_mgr or large_pool_struct. Copying the internal structure definitions to my project doesn't seem like a safe solution. That's why I said it seems better to replace the whole implementation; unless I'm missing something.

Good point. I suppose you could include the entirely of jmemmgr.c in your code and change just what you need, but I admit that that's yucky.

ACK, thanks for the transparency.

Yeah, sorry about that. :( Unfortunately the only way I get paid for my work on libjpeg-turbo is through funded development. I've spent thousands of hours of pro bono work getting the project to where it is now, but as a result, I've created a financial hardship for myself, so I can't do any pro bono work on it anymore. I'm really hoping some benevolent corporation or other organization will recognize the precarious nature of this project and step forward to fully fund it, which is why I keep mentioning the funding situation every chance I get.

dcommander added enhancement funding needed labels Nov 28, 2018

dcommander mentioned this issue Jan 25, 2019

Provide memory manager overriding api #325

Closed

dcommander added API and removed funding needed labels Dec 21, 2019

dcommander added the funding needed label Oct 23, 2020

dcommander mentioned this issue Feb 24, 2022

Custom Memory Allocator Methods #584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: overriding jpeg_get_large #308

Question: overriding jpeg_get_large #308

flx42 commented Nov 28, 2018

dcommander commented Nov 28, 2018

flx42 commented Nov 28, 2018 •

edited

dcommander commented Nov 28, 2018 •

edited

Question: overriding jpeg_get_large #308

Question: overriding jpeg_get_large #308

Comments

flx42 commented Nov 28, 2018

Request

Background

Solutions

dcommander commented Nov 28, 2018

flx42 commented Nov 28, 2018 • edited

dcommander commented Nov 28, 2018 • edited

flx42 commented Nov 28, 2018 •

edited

dcommander commented Nov 28, 2018 •

edited