ProofOfConcept: NCCL Profiler with Start/Stop hooks and sampling mode combined. #1208

Draft. Wants to merge 2 commits into base: master

@sanrise commented Mar 4, 2024

Do not review. This is a concept patch for anyone who wants to try out how sampling and the start/stop API work together. Please review the individual feature PRs linked below.

Commit 1: (PR on #1210)

aka Trace Mode:
To better debug application code and understand performance variations at the collective level, we introduce calls to start, stop, and report on the NCCL profiler. This allows application profilers to start and stop the NCCL profiler at runtime for selective introspection. The current NCCL profiler collects all events of a job and their corresponding collective names, and can dump a Chrome trace file.
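As a rough illustration of the start/stop/report idea, here is a minimal C++ sketch. The names `ToyProfiler`, `ToyEvent`, `record`, and `report` are hypothetical and are not the API added by this PR. It assumes events are only collected while profiling is active and that the report is a Chrome trace JSON array of complete (`"ph":"X"`) events.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical event record; field names are illustrative, not NCCL's.
struct ToyEvent {
  std::string name;  // e.g. the collective name
  int64_t ts_us;     // start timestamp in microseconds
  int64_t dur_us;    // duration in microseconds
};

// Hypothetical profiler with runtime start/stop/report control.
class ToyProfiler {
 public:
  void start() { active_ = true; }
  void stop() { active_ = false; }

  // Called from instrumented code; a no-op unless profiling is active.
  void record(const std::string& name, int64_t ts_us, int64_t dur_us) {
    if (active_) events_.push_back({name, ts_us, dur_us});
  }

  // Serialize collected events as a Chrome trace JSON array of
  // complete ("X") events, then clear the buffer.
  // Note: names are not JSON-escaped in this sketch.
  std::string report() {
    std::ostringstream out;
    out << "[";
    for (std::size_t i = 0; i < events_.size(); ++i) {
      const ToyEvent& e = events_[i];
      if (i > 0) out << ",";
      out << "{\"name\":\"" << e.name << "\",\"ph\":\"X\",\"ts\":" << e.ts_us
          << ",\"dur\":" << e.dur_us << ",\"pid\":0,\"tid\":0}";
    }
    out << "]";
    events_.clear();
    return out.str();
  }

  std::size_t size() const { return events_.size(); }

 private:
  bool active_ = false;
  std::vector<ToyEvent> events_;
};
```

An application profiler would call `start()` before the region of interest, `stop()` after it, and `report()` to obtain a trace that can be loaded in a Chrome trace viewer.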

Commit 2: (PR on #1209)

aka Sampling Mode:
To understand how NCCL interacts with the RDMA transport and identify persistent issues that may slow down all jobs (e.g., slow senders, low-priority network paths), we would like to collect (a representative sample of) all profiling events across all jobs running in an AI zone. While the current NCCL profiler does collect all events of a job, it is limited to a fixed maximum number of events, which is easily reached even by a single-iteration NCCL test.

(These changes have been authored and iterated on by @briancoutinho, @cristianlume and various engineers at Meta over a period of time)

NCCL provides a profiling/tracing capability that records various operations during collectives, including setting up buffers and sending data to and from the GPU. This change enables us to control NCCL profiling from the application layer through a start/stop interface.

Limitations of the current profiler
* It uses a compile-time flag and traces the whole application, so it does not support a start/stop API.
* It does not annotate the start and stop of the overall collective or provide the collective name.
* It is missing chunk/data size measurement.

Enhancements
* Add NCCL API markers.
* Improve cleanup of profiler and collective event buffers.
* Make trace dumping independent of collective markings.
* Future enhancements will include sampling to enable always-on collection.

Along with the on-demand runtime hooks provided by the previous commit, we would also like to improve continuous collection by introducing sampling mechanisms.

In this approach, proxy events are sampled and written to a BPF map according to a user-defined sampling weight.
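One way to picture weight-based sampling is the sketch below. It assumes the weight means "keep one event out of every `weight` events", which the description does not spell out. `WeightedSampler` and `ProxyEvent` are hypothetical names, and a `std::vector` stands in for the BPF map.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical proxy event; fields are illustrative only.
struct ProxyEvent {
  uint64_t id;
  uint64_t bytes;
};

// Keeps one event out of every `weight` events offered (assumed semantics).
// The std::vector is a stand-in for the BPF map mentioned in the text.
class WeightedSampler {
 public:
  // A weight of 0 is treated as 1, i.e. keep every event.
  explicit WeightedSampler(uint32_t weight)
      : weight_(weight == 0 ? 1 : weight) {}

  // Returns true if the event was sampled and stored.
  bool offer(const ProxyEvent& ev) {
    const bool keep = (seen_ % weight_) == 0;
    ++seen_;
    if (keep) sink_.push_back(ev);
    return keep;
  }

  const std::vector<ProxyEvent>& sampled() const { return sink_; }

 private:
  uint32_t weight_;
  uint64_t seen_ = 0;
  std::vector<ProxyEvent> sink_;
};
```

A deterministic counter keeps the per-event cost to one increment and one modulo. A randomized variant would avoid aliasing with periodic traffic patterns, at the cost of a random number per event.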

Sampling mode is designed to be always on and should not affect full (unsampled) traces triggered by the start/stop API.
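The independence of the two modes could look roughly like the following sketch, in which each event is routed through two separate paths. `DualModeRecorder` and its methods are hypothetical names, and the 1-in-N interpretation of the sampling weight is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical recorder that feeds two independent sinks.
class DualModeRecorder {
 public:
  // A weight of 0 is treated as 1, i.e. sample every event.
  explicit DualModeRecorder(uint32_t weight)
      : weight_(weight == 0 ? 1 : weight) {}

  void startTrace() { tracing_ = true; }
  void stopTrace() { tracing_ = false; }

  void onEvent(uint64_t id) {
    // Sampling path: always on, independent of the trace flag.
    if (seen_ % weight_ == 0) sampled_.push_back(id);
    ++seen_;
    // Trace path: unsampled, active only between startTrace/stopTrace.
    if (tracing_) full_.push_back(id);
  }

  const std::vector<uint64_t>& sampled() const { return sampled_; }
  const std::vector<uint64_t>& full() const { return full_; }

 private:
  uint32_t weight_;
  uint64_t seen_ = 0;
  bool tracing_ = false;
  std::vector<uint64_t> sampled_;
  std::vector<uint64_t> full_;
};
```

Because the two paths share no state other than the incoming event, starting or stopping a full trace leaves the sampled stream unchanged, and the sampling weight has no effect on the full trace.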
@sanrise marked this pull request as draft March 4, 2024 23:23
@sanrise changed the title from "Modify NCCL profiler invocation, collection, dump-trace" to "ProofOfConcept: NCCL Profiler with Start/Stop hooks and sampling mode combined." Mar 5, 2024