Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix:Avoid deadlock issues caused by logging in a loop #1265

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wangfakang
Copy link

@wangfakang wangfakang commented Apr 25, 2024

To prevent deadlock issues caused by logging in a loop, so cache the value before the log operation.
Here are the related function call paths:

ncclDebugLog -> ncclLoadParam(lock) -> ncclDebugLog -> ncclLoadParam(deadlock)

nccl/src/misc/param.cc

Lines 62 to 76 in ab2b89c

void ncclLoadParam(char const* env, int64_t deftVal, int64_t uninitialized, int64_t* cache) {
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&mutex);
if (__atomic_load_n(cache, __ATOMIC_RELAXED) == uninitialized) {
const char* str = ncclGetEnv(env);
int64_t value = deftVal;
if (str && strlen(str) > 0) {
errno = 0;
value = strtoll(str, nullptr, 0);
if (errno) {
value = deftVal;
INFO(NCCL_ALL,"Invalid value %s for %s, using default %lld.", str, env, (long long)deftVal);
} else {
INFO(NCCL_ENV,"%s set by environment to %lld.", env, (long long)value);
}

The following stack is happend local development.
image

@wangfakang
Copy link
Author

friendly ping @AddyLaddy @sjeaugey .

@wangfakang wangfakang changed the title Bugfix:Avoid deadlock issues caused by logging in a loop. Bugfix:Avoid deadlock issues caused by logging in a loop Apr 25, 2024
@wangfakang wangfakang marked this pull request as draft April 25, 2024 11:29
Signed-off-by: wangfakang <fakangwang@gmail.com>
@wangfakang wangfakang marked this pull request as ready for review April 25, 2024 12:19
Copy link
Member

@kiskra-nvidia kiskra-nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your submission!

I left one technical comment for you, but overall, I'm skeptical whether this is the best approach to solving your problem, although it's difficult to tell for sure since you didn't include the actual changes that motivate this PR (which I'm guessing is adding timestamps to the log lines?).

The problem is that adding a dependency in the debug code on the auto-generated ncclParamEnableLogTime function has created a circular lock dependency, leading to a deadlock. Your PR avoids the deadlock by moving the updating of the cache ahead of the debug output. It works, but it's fragile.

Instead, I recommend that you add the check for the presence of NCCL_ENABLE_LOG_TIME to the ncclDebugInit function, alongside the existing code over there that checks for NCCL_DEBUG and NCCL_DEBUG_SUBSYS. That should be more robust and thus more maintainable.

Also, I'm wondering whether NCCL_DEBUG_TIMESTAMPS wouldn't be a better name for this new config, but that's just bikeshedding...

Comment on lines 64 to 67
if (__builtin_expect(__atomic_load_n(cache, __ATOMIC_RELAXED) != uninitialized, true)) {
return;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary. You are basically duplicating what's already done by the caller of this function, which is:

nccl/src/include/param.h

Lines 24 to 26 in ab2b89c

if (__builtin_expect(__atomic_load_n(&cache, __ATOMIC_RELAXED) == uninitialized, false)) { \
ncclLoadParam("NCCL_" env, deftVal, uninitialized, &cache); \
} \

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary. You are basically duplicating what's already done by the caller of this function, which is:

nccl/src/include/param.h

Lines 24 to 26 in ab2b89c

if (__builtin_expect(__atomic_load_n(&cache, __ATOMIC_RELAXED) == uninitialized, false)) { \
ncclLoadParam("NCCL_" env, deftVal, uninitialized, &cache); \
} \

It seems so. However, as an basically API, wouldn't it be better to have this logic inside the API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainability-wise, I agree with you that it would make sense to hide it inside the function. However, NCCL has these calls sprinkled all over the codebase, including in performance-critical places. Looking at the code history, I can see that the code was deliberately structured this way to minimize the runtime impact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainability-wise, I agree with you that it would make sense to hide it inside the function. However, NCCL has these calls sprinkled all over the codebase, including in performance-critical places. Looking at the code history, I can see that the code was deliberately structured this way to minimize the runtime impact.

Yes, I see what you mean, and addressed your comment.

@wangfakang
Copy link
Author

wangfakang commented Apr 26, 2024

Thank you for your submission!

I left one technical comment for you, but overall, I'm skeptical whether this is the best approach to solving your problem, although it's difficult to tell for sure since you didn't include the actual changes that motivate this PR (which I'm guessing is adding timestamps to the log lines?).

Thank you very much for your prompt reply. I added a switch through xcclParamEnableLogTime in ncclDebugLog to determine whether to enable the log timestamp function. This change triggered the ncclLoadParam deadlock issue.

The problem is that adding a dependency in the debug code on the auto-generated ncclParamEnableLogTime function has created a circular lock dependency, leading to a deadlock. Your PR avoids the deadlock by moving the updating of the cache ahead of the debug output. It works, but it's fragile.

Yes, I am in ncclDebugLog adds a timestamp of switch, similar to the usage of ncclParamWarnSetDebugInfo. So the fix mentioned by PR is to prevent the deadlock issue by caching the values before logging them.

nccl/src/debug.cc

Lines 178 to 183 in ab2b89c

if (level == NCCL_LOG_WARN) {
len = snprintf(buffer, sizeof(buffer), "\n%s:%d:%d [%d] %s:%d NCCL WARN ",
hostname, pid, tid, cudaDev, filefunc, line);
if (ncclParamWarnSetDebugInfo()) ncclDebugLevel = NCCL_LOG_INFO;
} else if (level == NCCL_LOG_INFO) {
len = snprintf(buffer, sizeof(buffer), "%s:%d:%d [%d] NCCL INFO ", hostname, pid, tid, cudaDev);

BTW, if we moving the location of ncclParamWarnSetDebugInfo to inside the NCCL_LOG_INFO branch or printing WARN logs inside ncclLoadParam would also trigger this deadlock issue.

Also, I'm wondering whether NCCL_DEBUG_TIMESTAMPS wouldn't be a better name for this new config, but that's just bikeshedding...

that's a good tips.

@kiskra-nvidia
Copy link
Member

I added a switch through xcclParamEnableLogTime in ncclDebugLog to determine whether to enable the log timestamp function. This change triggered the ncclLoadParam deadlock issue.

That's what I guessed. Are you planning to submit that functionality in a future PR?

BTW, if we moving the location of ncclParamWarnSetDebugInfo to inside the NCCL_LOG_INFO branch or printing WARN logs inside ncclLoadParam would also trigger this deadlock issue.

Yes, I see what you mean. It currently works, but it's fragile. I will propose internally a change to make it more robust, i.e., implementing custom querying code in ncclDebugInit. The same approach remains my recommendation for you as well (it should then be included as part of a future PR, should you plan to submit it).

Signed-off-by: wangfakang <fakangwang@gmail.com>
@wangfakang
Copy link
Author

I added a switch through xcclParamEnableLogTime in ncclDebugLog to determine whether to enable the log timestamp function. This change triggered the ncclLoadParam deadlock issue.

That's what I guessed. Are you planning to submit that functionality in a future PR?

Ok, I can submit this feature separately to another PR.

BTW, if we moving the location of ncclParamWarnSetDebugInfo to inside the NCCL_LOG_INFO branch or printing WARN logs inside ncclLoadParam would also trigger this deadlock issue.

Yes, I see what you mean. It currently works, but it's fragile. I will propose internally a change to make it more robust, i.e., implementing custom querying code in ncclDebugInit. The same approach remains my recommendation for you as well (it should then be included as part of a future PR, should you plan to submit it).

Yeah. IMO, this lock is to prevent only one thread from reading the environment variable when missing cache, which aims to improve performance rather than concurrent safety. In the scenario where there is already a cache, the overhead of the lock itself is questionable in terms of performance improvement. Therefore, I think the lock at this point can actually be removed.
Of course, I may not have considered the whole, please correct me, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants