Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segmentation fault in the proxy profiler #960

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dfyz
Copy link

@dfyz dfyz commented Aug 14, 2023

This PR fixes issue #950.

The only commit in this PR explains why the segfault happens and how it is fixed.

Tested by running nccl-tests with -DPROFILE_PROXY -DENABLE_TIMER and NCCL_PROXY_PROFILE=.... I'm not sure how to trigger a proxy operation with multiple subs for testing, though.

@liralon
Copy link

liralon commented Aug 15, 2023

Please provide a commit message body for your commit. Explain explicitly what was the issue that was fixed and what is the commit's change that fixes it.

@dfyz dfyz force-pushed the profiler-segfault-fix branch 2 times, most recently from b0ee28f to 20c0fe6 Compare September 15, 2023 20:51
@dfyz
Copy link
Author

dfyz commented Sep 15, 2023

@liralon @sjeaugey I added a commit body with an explanation of the issue and a description of how I tried to fix this. See also this comment for some additional details.

NCCL can be built with `-DPROFILE_PROXY -DENABLE_TIMER` and run
with `NCCL_PROXY_PROFILE=trace.json` environment variable to dump
the timeline of all proxy operations to `trace.json` using the
so-called Trace Event Format.

However, currently NCCL segfaults even when trying to profile very
simple operations. Just running any binary from `nccl-tests` is
enough to trigger the issue, for example:

$ cat run.sh
NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1 \
NCCL_PROXY_PROFILE="proxy_${OMPI_COMM_WORLD_RANK}.json" \
./build/all_reduce_perf -g 2
$ mpirun -n 2 ./run.sh
...
./run.sh: line 4: 372891 Segmentation fault [...]
...
$ gdb build/all_reduce_perf core
...
Program terminated with signal SIGSEGV, Segmentation fault.
57        event->timestamp[state%8] = gettime()-profilingStart;

Note that the only point of `NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1`
is to trigger network communication (and hence proxying) between the
two GPUs even when they are on the same host. If the ranks are on
different hosts, you don't need these variables to trigger the issue.

The root cause of the problem is that the proxy profiler tries to track
only `NCCL_STEPS` (i.e., 8) steps per sub at a time, since there can
never be more than `NCCL_STEPS` in flight at any given moment.
So, for any tracked step `e`, the profiler will:
  a) fill `e->timestamps[ncclProxyProfileBegin]` at the very beginning
     of the sub, allocating a slot in `args->subs[sub].profilingEvents[step%NCCL_STEPS]`
  b) fill `e->timestamps[x]` for `x < ncclProxyProfileEnd` (e.g.,
     x = ncclProxyProfileSendGPUWait, x = ncclProxyProfileSendWait, etc.)
     when `x` happens for the step
  c) fill `e->timestamps[ncclProxyProfileEnd]` when the step is finished
  d) set `args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL` to clear this
     slot corresponding to `e`, so that the next steps can be safely
     written to `slot`

The problem happens in a): the profiler tries to set `e->timestamps[ncclProxyProfileBegin]`
for *ALL* steps at once, ignoring the fact that we can only track 8
steps at at time. As as result, we essentially only allocate slots for the first 8 steps.
When b) first happens for e.g. step NVIDIA#8 (take a look at the gdb stack
trace above), the corresponding slot is still set to `NULL`, since we cleared
the slot after step #0 finished, but never re-allocated it.

This commit fixes this by first storing the timestamp when profiling began in the proxy args.
When b) first happens, we allocate a slot for the step, and copy the stored timestamp
to `e->timestamp[ncclProxyProfileBegin]`. This way, the generated
timeline is exactly the same as before, but we never try processing
more than `NCCL_STEPS` steps at once.
@dfyz
Copy link
Author

dfyz commented Nov 24, 2023

I rebased my fix over the latest master. There were no meaningful changes, I just had to resolve a trivial merge conflict, since some new fields were added to ncclProxySubArgs recently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants