
High GIL Overhead in PML Update Kernel #429

Open

esp0r commented Mar 18, 2024

Code Version

Latest development branch commit: 4d5afa8

Observation

CPU usage of gprMax drops below 80% when the grid size is less than 1000x1000x1000 and the number of OpenMP threads exceeds 16. The following image illustrates the effective CPU usage on a 64-core system with a grid size of 200x200x200.

[Figure: effective CPU usage on a 64-core system, 200x200x200 grid]

Profiling

During the PML update phase, numerous "bubbles" (marked in white) are observed, indicating that the CPU is neither occupied with OpenMP synchronization (indicated in red) nor engaged in calculations (shown in green).

[Figure: profiling trace of the PML update phase]

Upon closer inspection, each PML layer update kernel can be divided into three segments, separated by OpenMP barriers. In the initial segment, the functions PyGILState_Ensure and PyEval_SaveThread are executed (the white portion). The execution time of these functions varies significantly between threads. Threads that complete these operations early enter a spinning state (the green portion) until all threads reach the barrier, after which the actual computation commences. Subsequently, PyEval_SaveThread and drop_gil are executed.

[Figure: per-thread timeline of a PML layer update kernel]
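
In simplified terms, each thread executes roughly the following prologue and epilogue around the loop (a sketch of the relevant CPython C-API calls, not the exact generated code). Since PyGILState_Ensure must acquire the GIL, threads pass through it one at a time, which would explain why the duration of the white portion varies so much between threads:

PyGILState_STATE gilstate = PyGILState_Ensure(); /* acquire the GIL: threads serialize here */
PyThreadState *tstate = PyEval_SaveThread();     /* immediately release it again */

/* ... OpenMP barrier, then the compute loop ... */

PyEval_RestoreThread(tstate);                    /* re-acquire the GIL */
PyGILState_Release(gilstate);                    /* release it (drop_gil) */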

The Cause

The Cython code for a PML kernel is as follows:

for i in prange(0, nx, nogil=True, schedule='static', num_threads=nthreads):
    ii = xf - (i + 1)
    RA01 = RA[0, i] - 1
    RB0 = RB[0, i]
    RE0 = RE[0, i]
    RF0 = RF[0, i]
    for j in range(0, ny):
        jj = j + ys
        for k in range(0, nz):
            kk = k + zs
            # Hy
            materialHy = ID[4, ii, jj, kk]
            dEz = (Ez[ii + 1, jj, kk] - Ez[ii, jj, kk]) / dx
            Hy[ii, jj, kk] = (Hy[ii, jj, kk] + updatecoeffsH[materialHy, 4] *
                              (RA01 * dEz + RB0 * Phi1[0, i, j, k]))
            Phi1[0, i, j, k] = RE0 * Phi1[0, i, j, k] - RF0 * dEz
            # Hz
            materialHz = ID[5, ii, jj, kk]
            dEy = (Ey[ii + 1, jj, kk] - Ey[ii, jj, kk]) / dx
            Hz[ii, jj, kk] = (Hz[ii, jj, kk] - updatecoeffsH[materialHz, 4] *
                              (RA01 * dEy + RB0 * Phi2[0, i, j, k]))
            Phi2[0, i, j, k] = RE0 * Phi2[0, i, j, k] - RF0 * dEy

Below is a simplified excerpt of the generated C code:

#pragma omp parallel num_threads(nthreads)
{
  PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
  Py_BEGIN_ALLOW_THREADS
  #pragma omp for
  for (int i = 0; i < nx; i++) {
    // Main loop
  }
  Py_END_ALLOW_THREADS
  __Pyx_PyGILState_Release(__pyx_gilstate_save);
}

Each thread acquires the GIL and saves/restores its Python thread state around the parallel loop. This appears unnecessary, since the kernel requires no Python interaction.

Benchmark

After manually removing all the GIL-related code from the generated C file and changing the OpenMP schedule from static to dynamic, the "bubbles" disappeared and CPU usage increased. Consequently, gprMax achieved improved scalability.
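
The hand-edited parallel region then reduces to roughly the following (a sketch of the manual edit, assuming the loop body makes no CPython API calls):

#pragma omp parallel num_threads(nthreads)
{
  #pragma omp for schedule(dynamic)
  for (int i = 0; i < nx; i++) {
    // Main loop (pure C; no GIL handling needed)
  }
}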

[Figure: CPU usage after removing the GIL handling]

[Figure: scalability results]

Solutions Attempted

The following methods have been tried without success, as Cython continues to generate the same C code:

1. Wrapping the loop in a with nogil: block:

with nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):

2. Declaring the kernel noexcept nogil:

cpdef void order1_xminus(
    ...
) noexcept nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
craig-warren (Member) commented

@esp0r this is a really interesting analysis - thank you for this work! Hopefully we can explore it further under GSoC 2024. It would be useful to know if there are any Cython docs on this behaviour or any advice from Cython users.
