
High GIL Overhead in PML Update Kernel #429

Open

esp0r commented Mar 18, 2024

Code Version

Latest development branch commit: 4d5afa8

Observation

CPU usage of gprMax drops below 80% when the grid size is less than 1000x1000x1000 and the number of OpenMP threads exceeds 16. The following image illustrates the effective CPU usage on a 64-core system with a grid size of 200x200x200.

[Figure: effective CPU usage on a 64-core system, 200x200x200 grid]

Profiling

During the PML update phase, numerous "bubbles" (marked in white) are observed, indicating that the CPU is neither occupied with OpenMP synchronization (indicated in red) nor engaged in calculations (shown in green).

[Figure: profiling trace of the PML update phase]

Upon closer inspection, each PML layer update kernel can be divided into three segments, separated by OpenMP barriers. In the initial segment, the functions PyGILState_Ensure and PyEval_SaveThread are executed (the white portion). The execution time of these functions varies significantly between threads. Threads that complete these operations early enter a spinning state (the green portion) until all threads reach the barrier, after which the actual computation commences. Subsequently, PyEval_SaveThread and drop_gil are executed.

[Figure: per-thread timeline of a PML layer update kernel]
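
In simplified terms, each thread executes roughly the following prologue and epilogue around the loop (a sketch of the relevant CPython C-API calls, not the exact generated code). Since PyGILState_Ensure must acquire the GIL, threads pass through it one at a time, which would explain why the duration of the white portion varies so much between threads:

PyGILState_STATE gilstate = PyGILState_Ensure(); /* acquire the GIL: threads serialize here */
PyThreadState *tstate = PyEval_SaveThread();     /* immediately release it again */

/* ... OpenMP barrier, then the compute loop ... */

PyEval_RestoreThread(tstate);                    /* re-acquire the GIL */
PyGILState_Release(gilstate);                    /* release it (drop_gil) */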

The Cause

The Cython code for a PML kernel is as follows:

for i in prange(0, nx, nogil=True, schedule='static', num_threads=nthreads):
    ii = xf - (i + 1)
    RA01 = RA[0, i] - 1
    RB0 = RB[0, i]
    RE0 = RE[0, i]
    RF0 = RF[0, i]
    for j in range(0, ny):
        jj = j + ys
        for k in range(0, nz):
            kk = k + zs
            # Hy
            materialHy = ID[4, ii, jj, kk]
            dEz = (Ez[ii + 1, jj, kk] - Ez[ii, jj, kk]) / dx
            Hy[ii, jj, kk] = (Hy[ii, jj, kk] + updatecoeffsH[materialHy, 4] *
                              (RA01 * dEz + RB0 * Phi1[0, i, j, k]))
            Phi1[0, i, j, k] = RE0 * Phi1[0, i, j, k] - RF0 * dEz
            # Hz
            materialHz = ID[5, ii, jj, kk]
            dEy = (Ey[ii + 1, jj, kk] - Ey[ii, jj, kk]) / dx
            Hz[ii, jj, kk] = (Hz[ii, jj, kk] - updatecoeffsH[materialHz, 4] *
                              (RA01 * dEy + RB0 * Phi2[0, i, j, k]))
            Phi2[0, i, j, k] = RE0 * Phi2[0, i, j, k] - RF0 * dEy

Below is a simplified excerpt of the generated C code:

#pragma omp parallel num_threads(nthreads)
{
  PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
  Py_BEGIN_ALLOW_THREADS
  #pragma omp for
  for (int i = 0; i < nx; i++) {
    // Main loop
  }
  Py_END_ALLOW_THREADS
  __Pyx_PyGILState_Release(__pyx_gilstate_save);
}

Each thread acquires the GIL and saves/restores its Python thread state around the parallel loop. This appears unnecessary, since the kernel requires no Python interaction.

Benchmark

After manually removing all the GIL-related code from the generated C file and changing the OpenMP schedule from static to dynamic, the "bubbles" disappeared and CPU usage increased. Consequently, gprMax achieved improved scalability.
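
The hand-edited parallel region then reduces to roughly the following (a sketch of the manual edit, assuming the loop body makes no CPython API calls):

#pragma omp parallel num_threads(nthreads)
{
  #pragma omp for schedule(dynamic)
  for (int i = 0; i < nx; i++) {
    // Main loop (pure C; no GIL handling needed)
  }
}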

[Figure: CPU usage after removing the GIL handling]

[Figure: scalability results]

Solutions Attempted

The following methods have been tried without success, as Cython continues to generate the same C code:

1. Wrapping the loop in a with nogil: block:

with nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):

2. Declaring the kernel noexcept nogil:

cpdef void order1_xminus(
    ...
) noexcept nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
craig-warren (Member) commented

@esp0r this is a really interesting analysis - thank you for this work! Hopefully we can explore it further under GSoC 2024. It would be useful to know if there are any Cython docs on this behaviour or any advice from Cython users.
