-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Add timestamp to NCCL logging messages #887
base: master
Are you sure you want to change the base?
Conversation
Thanks for the feedback, @vtlam ! A few thoughts just off the top of my head:
|
Older GLOG omits the year, perhaps to be more compact since the year is usually not ambiguous (rare to ever read logs that are > 1 year). But I am OK with putting the year to make the timestamps more readable.
This adds about 20 characters per line, so perhaps we can increase buffer size slightly if that's a concern. To avoid complexity (e.g. log parser like you mentioned) and less one trivial knob to worry about, my feeling is that having env var opt-in/opt-out seems a bit over-engineered, but I am OK with that if others have a strong opinion.
Besides being parsed by automated scripts, NCCL logs are also often read directly by human engineers during debugging/troubleshooting. For distributed jobs, they usually need to cross-reference lots of logs from different ranks & hosts. So in practice, non-human-readable timestamps can be a nuisance in these situations. IMO, there is already different log levels (WARNING, INFO etc) to handle logging verbosity. |
What bothers me a bit here is that it may give users a false impression that those timestamps are synchronized and therefore if one rank prints something with a timestamp older than another line from another rank, then it happened after. Which would not necessarily be the case. For that reason, I would prefer timestamps to be added by the process manager which gathers stdout of each rank and can add a prefix like the rank which printed the log, the timestamp in whatever format is preferred and which could also have some rank timesync feature. That said, if that timestamp is an option and not enabled by default, I would be ok with adding it in NCCL. I can see why that could be useful for admins of production clusters, in particular to correlate rare network events with logs on the fabric management side. |
Generally logs are separated by physical hosts and operators are well aware of time drifting issue when matching events across different hosts. In practice, when they do have a need to cross-reference highly accurately across hosts, they would proactively enable a distributed time sync mechanism such as NTP, PTP.
Depending on the implementation and runtime workload, one potential problem is that there can be a lag in the process manager ie. its timestamp can be different from the actual NCCL event.
SGTM! |
In our experience, debugging NCCL issues is easier with timestamp information in NCCL log messages.
This pull request is an RFC prototype, largely inspired by GLOG implementation -- feel free to change the details if needed (e.g., timestamp formatting, timestamp location in the log messages etc).
Timestamp format (GLOG-like):
mmdd hh:mm:ss.uuuuuu
where
mm
: The month (zero padded; ie May is '05')dd
: The day (zero padded)hh:mm:ss.uuuuuu
: time in hours, minutes and fractional secondsExample:
hostname:822689:823163 [0] NCCL INFO Using network IB
0614 11:26:37.758169 hostname:822689:823163 [0] NCCL INFO Using network IB