
GSoC2012


Improve openjpeg2000 encoding/decoding time

JPEG 2000 provides superior compression and advanced features such as optionally lossless compression, region-of-interest coding, stream decoding, etc. As a result of this complex feature set, JPEG 2000 encoding and decoding are computationally expensive. It has already been demonstrated that many image processing applications achieve a significant speed-up on the massively parallel GPU architecture, and previous literature reports GPU speed-ups for various components of JPEG 2000 such as the DWT and EBCOT.

As part of this project we plan to develop a parallel implementation of JPEG 2000 encoding/decoding using the CUDA programming platform available for Nvidia GPUs. Decompression is more challenging than compression, so we plan to focus on lossy decoding. At the end of this project, we hope to have a parallel implementation of tier-1/tier-2 decoding, the inverse DWT and the inverse DC level shift, which together comprise the decoding pipeline.

Code Repository

The code for this project is pushed to the openjpeg optimization branch.

Progress

Compilation

As part of the commit, the appropriate changes were made to the CMake files to enable compilation of CUDA code with the openjpeg library. The file gpu.cu contains all the CUDA kernels as well as the kernel wrappers. The kernel wrappers can be invoked from the openjpeg library files, and these wrappers then launch the appropriate kernel. Kernel wrapper functions are prefixed with gpu and kernel functions are prefixed with kernel.
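As a minimal sketch of this wrapper/kernel split, the following illustrates the convention; the function names, arguments and launch configuration here are hypothetical, not the actual gpu.cu API:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel following the "kernel" prefix convention:
// one thread updates one element.
__global__ void kernel_example_add(float *data, float value, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += value;
}

// Illustrative wrapper following the "gpu" prefix convention: callable from the
// C library files, it owns the device memory handling and the kernel launch.
extern "C" void gpu_example_add(float *host_data, float value, int n)
{
    float *dev_data;
    cudaMalloc(&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    kernel_example_add<<<blocks, threads>>>(dev_data, value, n);

    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}
```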

Inverse DC Level Shift

In the commit, the function for the inverse DC level shift was implemented on the GPU (gpu_dc_level_shift_decode). The data is copied component by component to the GPU. The number of threads is equal to the image size and each thread adds the DC level shift value to the corresponding pixel.
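A minimal sketch of the per-pixel kernel described above (the kernel name, signature and integer sample type are illustrative; the actual implementation is gpu_dc_level_shift_decode in gpu.cu):

```cuda
// Sketch: each thread adds the DC level shift value to one pixel of a component.
__global__ void kernel_dc_level_shift(int *data, int dc_shift, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += dc_shift;   // one thread handles one pixel
}
```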

Once the entire pipeline has been implemented, we can remove the memory transfer overhead for this stage. Ideally, the image data is transferred to the GPU before the first stage. It then resides there and is modified in place as the decoding stages (t1/t2, inverse DWT and inverse DC level shift) are performed. Finally, after all stages, the output image is ready in GPU memory and is transferred back to the CPU output array.

Inverse Discrete Wavelet Transform

The commit contains version 1 of the complete implementation of the inverse DWT stage. The CUDA implementation is as follows:

  1. Similar to the CPU code, four values are processed together using the float4 data type. Although a single CUDA core does not have vector computation capability (like MMX instructions on a CPU), there is still a benefit in using float4 because GPUs provide high FLOP throughput and memory access is faster, i.e. loading one float4 is quicker than individually loading the 4 floats.

  2. For processing an rh x rw image in, say, the decode_h stage, the number of blocks is equal to rh/4 and each block has rw threads. A simpler way to understand it: the jth thread of the ith block processes four values: (4*i,j); (4*i+1,j); (4*i+2,j); (4*i+3,j). If rw is less than a threshold (currently 512), then the entire wavelet array of size rw can be stored in shared memory. Thus, provided that the size of the current resolution is below the threshold, we use the kernel with the shared-memory optimization.

  3. If the size exceeds the threshold, then we can no longer use shared memory and a global memory array is used for the wavelet. The kernels that handle this case of shared memory overflow have _global_ in their function names.

  4. Note that processing the entire wavelet array of size rw in a single block gives us the chance to use the block synchronization primitive __syncthreads. Thus we can combine v4dwt_interleave_(h/v), v4dwt_decode_step1 and v4dwt_decode_step2 into a single kernel. Such kernel fusion, performed wherever possible, results in optimal performance of the code (see the sketch after this list).
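The following is a minimal, hypothetical sketch of the fused shared-memory kernel structure described in points 2 and 4. It only illustrates the block/thread layout and the use of __syncthreads between fused phases; the lifting arithmetic shown is a placeholder, not the actual v4dwt math in gpu.cu:

```cuda
#define MAX_RW 512   // shared-memory threshold mentioned above

// One block per group of 4 rows, rw threads per block,
// the wavelet row held in shared memory for the whole block.
__global__ void kernel_v4dwt_decode_h_shared(float4 *tile, int rw)
{
    __shared__ float4 wavelet[MAX_RW];

    int i = blockIdx.x;     // which group of 4 rows
    int j = threadIdx.x;    // column index, 0 <= j < rw

    // phase 1: interleave (stand-in for v4dwt_interleave_h)
    wavelet[j] = tile[i * rw + j];
    __syncthreads();

    // phase 2: first lifting step (stand-in for v4dwt_decode_step1)
    wavelet[j].x *= 1.0f;   // placeholder arithmetic
    __syncthreads();

    // phase 3: second lifting step and write-back (stand-in for v4dwt_decode_step2)
    tile[i * rw + j] = wavelet[j];
}
```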

Performance/Results

Inverse DWT

These are performance results for the commit 84f53565.

The timing measurements are performed using clock_gettime(CLOCK_MONOTONIC, ...); this is a monotonically increasing timer without drift adjustments and is a standard way of measuring execution time in the presence of asynchronous events such as CUDA memory transfers or kernel calls.
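As a sketch of how such a measurement is taken, the device must be synchronized before reading the clock because kernel launches are asynchronous; the kernel and launch parameters below are placeholders:

```cuda
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

// Wall-clock seconds from the monotonic clock, as used for the measurements above.
static double now_monotonic(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

__global__ void kernel_dummy(float *data, int n)   // placeholder kernel
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));

    double t0 = now_monotonic();
    kernel_dummy<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();          // wait for the asynchronous kernel to finish
    double t1 = now_monotonic();

    printf("compute time: %f s\n", t1 - t0);
    cudaFree(dev);
    return 0;
}
```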

The platform is Nvidia Geforce GTX 580 GPU and Intel Core i7 920 (2.67GHz) CPU.

The table below gives a breakdown of the timings for the various phases of the GPU code. Each cell lists the per-component times (three components per image) and their sum.

| Phase \ Test image | sintel_2k.j2k | oldtowncross.j2k | crowdrun.j2k | duckstakeoff.j2k |
| --- | --- | --- | --- | --- |
| Memory Setup Time (secs) | 0.054404 + 0.001611 + 0.001551 = 0.057566 | 0.055298 + 0.002517 + 0.001976 = 0.059791 | 0.054851 + 0.001928 + 0.001972 = 0.058751 | 0.055063 + 0.001997 + 0.001954 = 0.059014 |
| Computation Time (secs) | 0.002360 + 0.002307 + 0.002309 = 0.006976 | 0.003153 + 0.003737 + 0.003089 = 0.009979 | 0.003108 + 0.003283 + 0.003057 = 0.009448 | 0.003099 + 0.004512 + 0.003135 = 0.010746 |
| Output Memory Transfer Time (secs) | 0.002965 + 0.002913 + 0.002833 = 0.008711 | 0.003318 + 0.003231 + 0.003295 = 0.009844 | 0.003315 + 0.003684 + 0.003278 = 0.010277 | 0.003285 + 0.003222 + 0.003304 = 0.009811 |

As per the readings above, the overall execution time is on average about 0.05 secs for the memory setup phase, 0.01 secs for the compute phase and again 0.01 secs for the output memory transfer phase. Thus the overall inverse DWT time is about 0.07 secs per image.

The timings above are per component, and note that in all cases the memory setup time for the first component is much higher than for the other components. It is a known fact that the first CUDA call performs some bus initialization and has roughly a 50 ms = 0.05 secs overhead (refer to the "Slow Cuda Setup" discussions).

Unfortunately, a previous CPU benchmark by @nicolas on the mailing list reports 0.05 s for the inverse DWT phase.

But we should note that this 0.05 s is a one-time overhead for CUDA, and it is dominant only because we are looking at the inverse DWT phase in isolation.

If we implement the complete pipeline as follows (see the host-side sketch after the list):

  1. Transfer image data to GPU.

  2. Apply all compute steps (t1/t2, inverse dwt, inverse dc level shift) through GPU kernels.

  3. Transfer output image data back to CPU.
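In host code, such a pipeline could look roughly like the sketch below. The wrapper names in step 2 are placeholders for the eventual gpu.cu entry points (only gpu_dc_level_shift_decode is mentioned above), and the buffer layout is simplified:

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side pipeline: image data stays resident on the GPU
// between stages, so the ~0.05 s initialization/transfer cost is paid once.
void decode_tile_on_gpu(float *host_image, int n /* pixels * components */)
{
    float *dev_image;
    cudaMalloc(&dev_image, n * sizeof(float));
    cudaMemcpy(dev_image, host_image, n * sizeof(float),
               cudaMemcpyHostToDevice);              // step 1: one-time transfer

    // step 2: all compute stages operate on the resident device buffer
    // gpu_t1_t2_decode(dev_image, ...);             // placeholder wrapper
    // gpu_dwt_decode(dev_image, ...);               // placeholder wrapper
    // gpu_dc_level_shift_decode(dev_image, ...);    // wrapper named above

    cudaMemcpy(host_image, dev_image, n * sizeof(float),
               cudaMemcpyDeviceToHost);              // step 3: final transfer
    cudaFree(dev_image);
}
```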


then the 0.05 s overhead is incurred only once, in step (1) above, and the time for the inverse DWT is only the compute time of 0.01 s, which is 5x faster than the CPU time.

Also, the time for the tier-2 decode phase is very high, and if we are able to achieve a GPU speed-up for this phase, then the resulting overall speed-up, even with the 0.05 s initial overhead, will still be significant.

Inverse t1 decoding

Procedure

t1_decode_cblks_v2 performs decoding across all code blocks of every component. Decoding of each code block is done by calling t1_decode_cblk_v2, which internally calls the three decoding passes t1_dec_sigpass, t1_dec_refpass, etc. The data filling for each decoded code block is done as follows:

```
for j = 0 to cblk_h
    for i = 0 to cblk_w
        tmp = datap[(j * cblk_w) + i]
        tiledp[(j * tile_w) + i] = tmp / 2
    end
end
```

Unless otherwise specified, cblk_w and cblk_h have a maximum initialization of 64 x 64 in openjpeg.c.

Coarse Parallelization

All code blocks (of all compno/precno) are decoded in parallel. The actual decoding passes within each code block are performed sequentially.

One CUDA block is used per code block, based on the following rationale:

  1. There are 64x64 = 4096 values per code block, and usual CUDA configurations allow 512 threads per CUDA block. Thus, if we develop a fully parallel approach at a later stage, each thread will compute 8 values, which is reasonable.

  2. For an extension to a fully parallel approach for EBCOT we will require synchronization, which is available within a CUDA block (__syncthreads()).

Thread 0 of each code block does the heavy computation, i.e. it calls t1_decode_cblk_v2 and the subsequent functions t1_dec_(sig/ref)pass, etc.
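A minimal sketch of this coarse parallelization follows; the struct layout and device function are placeholders standing in for the refactored t1_decode_cblk_v2 path, not the actual openjpeg types:

```cuda
// Placeholder for the flattened per-code-block data on the device.
struct cblk_dev {
    unsigned char *data;
    int            len;
};

// Stand-in for the refactored t1_decode_cblk_v2 / t1_dec_(sig/ref)pass path.
__device__ void device_t1_decode_cblk(struct cblk_dev *cblk)
{
    // sequential decoding passes for one code block go here
}

// One CUDA block per code block; only thread 0 of each block decodes,
// which is the coarse parallelization described above.
__global__ void kernel_t1_decode_cblks(struct cblk_dev *cblks, int num_cblks)
{
    int cblkno = blockIdx.x;
    if (cblkno < num_cblks && threadIdx.x == 0)
        device_t1_decode_cblk(&cblks[cblkno]);
}
```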

The code for t1_decode_cblk_v2 and some of the mqc functions, apart from those called within the for (passno = 0; passno < seg->real_num_passes; ++passno) loop, has been refactored into device functions callable from GPU code. Refer to the commit.

The major challenge in implementing t1 decoding on the GPU is handling the memory transfer of the dynamic arrays that are part of the OPJ structs.

e.g. opj_tcd_cblk_dec_v2 has an OPJ_BYTE data array and an opj_tcd_seg_t seg array, and opj_tcd_seg_t internally has an OPJ_BYTE data array.

The method devised in the commit was to copy the structs to a device structure and additionally copy the dynamic array parts separately to flat arrays on the device.
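A minimal sketch of this struct-plus-flat-array copying scheme is shown below; the struct definitions and field names are simplified placeholders, not the actual opj_tcd_* types:

```cuda
#include <cuda_runtime.h>

// Simplified stand-ins for a host struct owning a dynamic array
// (mimicking the OPJ_BYTE data member of opj_tcd_cblk_dec_v2).
struct cblk_host { unsigned char *data; int len; };
struct cblk_dev  { unsigned char *data; int len; };

// Copy the dynamic array to a separate flat device buffer, patch the
// device-side struct to point at it, then copy the struct itself over.
static struct cblk_dev *copy_cblk_to_device(const struct cblk_host *h)
{
    unsigned char *dev_data;
    cudaMalloc(&dev_data, h->len);
    cudaMemcpy(dev_data, h->data, h->len, cudaMemcpyHostToDevice);

    struct cblk_dev tmp;
    tmp.data = dev_data;      // device struct points at the flat device array
    tmp.len  = h->len;

    struct cblk_dev *dev_cblk;
    cudaMalloc(&dev_cblk, sizeof(tmp));
    cudaMemcpy(dev_cblk, &tmp, sizeof(tmp), cudaMemcpyHostToDevice);
    return dev_cblk;
}
```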