Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add timestamp to NCCL logging messages #887

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vtlam
Copy link

@vtlam vtlam commented Jun 14, 2023

In our experience, debugging NCCL issues is easier with timestamp information in NCCL log messages.

This pull request is an RFC prototype, largely inspired by GLOG implementation -- feel free to change the details if needed (e.g., timestamp formatting, timestamp location in the log messages etc).

Timestamp format (GLOG-like): mmdd hh:mm:ss.uuuuuu
where
mm: The month (zero padded; ie May is '05')
dd: The day (zero padded)
hh:mm:ss.uuuuuu: time in hours, minutes and fractional seconds

Example:

  • Before: hostname:822689:823163 [0] NCCL INFO Using network IB
  • After: 0614 11:26:37.758169 hostname:822689:823163 [0] NCCL INFO Using network IB

@cliffwoolley
Copy link
Collaborator

Thanks for the feedback, @vtlam ! A few thoughts just off the top of my head:

  • I believe GLOG format would list yyyymmdd instead of mmdd -- leaving out the year seems strange IMHO.
  • This does tend to make the log lines that much wider, and some of them are already pretty wide; maybe there should be an env var opt-in (or opt-out)? Not to mention that some log parsers would need to be adapted to any log formatting changes.
  • It's perhaps debatable whether the human-readable timestamp like here or a more compact machine-readable timestamp (compare e.g. dmesg to dmesg -T) is more helpful. But then again there's a zillion different ways someone might want a timestamp to be presented if we tried to add further configuration options, so maybe "keep it simple" is better? Not sure.

@vtlam
Copy link
Author

vtlam commented Jun 14, 2023

  • I believe GLOG format would list yyyymmdd instead of mmdd -- leaving out the year seems strange IMHO.

Older GLOG omits the year, perhaps to be more compact since the year is usually not ambiguous (rare to ever read logs that are > 1 year). But I am OK with putting the year to make the timestamps more readable.

  • This does tend to make the log lines that much wider, and some of them are already pretty wide; maybe there should be an env var opt-in (or opt-out)? Not to mention that some log parsers would need to be adapted to any log formatting changes.

This adds about 20 characters per line, so perhaps we can increase buffer size slightly if that's a concern. To avoid complexity (e.g. log parser like you mentioned) and less one trivial knob to worry about, my feeling is that having env var opt-in/opt-out seems a bit over-engineered, but I am OK with that if others have a strong opinion.

  • It's perhaps debatable whether the human-readable timestamp like here or a more compact machine-readable timestamp (compare e.g. dmesg to dmesg -T) is more helpful. But then again there's a zillion different ways someone might want a timestamp to be presented if we tried to add further configuration options, so maybe "keep it simple" is better? Not sure.

Besides being parsed by automated scripts, NCCL logs are also often read directly by human engineers during debugging/troubleshooting. For distributed jobs, they usually need to cross-reference lots of logs from different ranks & hosts. So in practice, non-human-readable timestamps can be a nuisance in these situations. IMO, there is already different log levels (WARNING, INFO etc) to handle logging verbosity.

@sjeaugey
Copy link
Member

What bothers me a bit here is that it may give users a false impression that those timestamps are synchronized and therefore if one rank prints something with a timestamp older than another line from another rank, then it happened after. Which would not necessarily be the case.

For that reason, I would prefer timestamps to be added by the process manager which gathers stdout of each rank and can add a prefix like the rank which printed the log, the timestamp in whatever format is preferred and which could also have some rank timesync feature.

That said, if that timestamp is an option and not enabled by default, I would be ok with adding it in NCCL. I can see why that could be useful for admins of production clusters, in particular to correlate rare network events with logs on the fabric management side.

@vtlam
Copy link
Author

vtlam commented Jun 20, 2023

What bothers me a bit here is that it may give users a false impression that those timestamps are synchronized and therefore if one rank prints something with a timestamp older than another line from another rank, then it happened after. Which would not necessarily be the case.

Generally logs are separated by physical hosts and operators are well aware of time drifting issue when matching events across different hosts. In practice, when they do have a need to cross-reference highly accurately across hosts, they would proactively enable a distributed time sync mechanism such as NTP, PTP.

For that reason, I would prefer timestamps to be added by the process manager which gathers stdout of each rank and can add a prefix like the rank which printed the log, the timestamp in whatever format is preferred and which could also have some rank timesync feature.

Depending on the implementation and runtime workload, one potential problem is that there can be a lag in the process manager ie. its timestamp can be different from the actual NCCL event.

That said, if that timestamp is an option and not enabled by default, I would be ok with adding it in NCCL. I can see why that could be useful for admins of production clusters, in particular to correlate rare network events with logs on the fabric management side.

SGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants