Overlap gradient computation and NCCL AllReduce #361

Open · wants to merge 1 commit into master from add_nvcc_parallel

Conversation

PeterZhizhin (Contributor) commented:

On my setup, I get the following:

Before:

step    2/37: train loss 4.720275 (acc 4.688650) (224.046844 ms, 36563.773438 tok/s)
step    3/37: train loss 3.802741 (acc 3.943135) (224.151611 ms, 36555.007812 tok/s)
step    4/37: train loss 3.698719 (acc 3.800745) (227.287033 ms, 36375.347656 tok/s)
step    5/37: train loss 3.444999 (acc 3.528596) (227.886978 ms, 36260.062500 tok/s)

After:

step    2/37: train loss 4.715888 (acc 4.686493) (199.011169 ms, 41163.503906 tok/s)
step    3/37: train loss 3.798963 (acc 3.942383) (193.084412 ms, 41811.468750 tok/s)
step    4/37: train loss 3.697987 (acc 3.800879) (193.079300 ms, 42027.660156 tok/s)
step    5/37: train loss 3.444056 (acc 3.526504) (193.470459 ms, 42112.496094 tok/s)

So, a 12% speedup.

Nsight Systems profiles (screenshots):

Before: backward kernels and NCCL AllReduce run sequentially on the same stream.

After: backward kernels and NCCL AllReduce are overlapped on separate streams.
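
The idea is to issue each gradient AllReduce on a separate communication stream as soon as the corresponding backward kernels have been enqueued, instead of reducing everything after the full backward pass. A minimal sketch of that pattern (stream, event, and helper names here are illustrative, not the actual identifiers from this diff):

```c
// Sketch of backward/AllReduce overlap (illustrative; not the exact code in this PR).
// Assumes: ncclComm_t comm; per-layer gradient buffers grads[l] with grad_counts[l]
// elements; a layer_backward(l, stream) helper; cudaCheck/ncclCheck error macros.
cudaStream_t compute_stream, comm_stream;
cudaEvent_t layer_grads_ready;
cudaCheck(cudaStreamCreate(&compute_stream));
cudaCheck(cudaStreamCreate(&comm_stream));
cudaCheck(cudaEventCreateWithFlags(&layer_grads_ready, cudaEventDisableTiming));

for (int l = num_layers - 1; l >= 0; l--) {
    // enqueue this layer's backward kernels on the compute stream
    layer_backward(l, compute_stream);

    // mark the point at which this layer's gradients are complete
    cudaCheck(cudaEventRecord(layer_grads_ready, compute_stream));
    // the communication stream waits only for that event, not for the whole backward pass
    cudaCheck(cudaStreamWaitEvent(comm_stream, layer_grads_ready, 0));

    // reduce this layer's gradients while earlier layers are still computing
    ncclCheck(ncclAllReduce(grads[l], grads[l], grad_counts[l],
                            ncclFloat, ncclAvg, comm, comm_stream));
}

// the optimizer step must not start until all reductions have finished
cudaCheck(cudaStreamSynchronize(comm_stream));
```

This is what the After profile shows: the NCCL kernels fill the gaps during the backward pass instead of serializing after it.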

@PeterZhizhin force-pushed the add_nvcc_parallel branch 2 times, most recently from 0ffa9a0 to 61a1f15 on May 5, 2024 at 15:51
```diff
@@ -2348,7 +2410,7 @@ void common_free(GPT2 &model) {
     cudaCheck(cudaFree(cublaslt_workspace));
     cublasCheck(cublasDestroy(cublas_handle));
     cublasCheck(cublasLtDestroy(cublaslt_handle));
-    create_cudnn();
+    destroy_cudnn();
```
Contributor:

@karpathy @PeterZhizhin cherry-pick; this should be merged immediately

```diff
         printf0("step %4d/%d: train loss %f (acc %f) (%f ms, %0f tok/s)\n",
                 step + 1, train_num_batches, model.mean_loss, accumulated_loss,
                 time_elapsed_ms, bias_corrected_ema_tokens_per_second);
         logger_log_train(&logger, step, model.mean_loss);

         // disable the profiler after 3 steps of optimization
-        if (step == 3) { cudaProfilerStop(); }
+        if (step == 3) { cudaCheck(cudaProfilerStop()); }
```
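
For context: cudaProfilerStop() returns a cudaError_t, so it can be routed through the same error-checking wrapper as every other CUDA call. A sketch of such a wrapper (llm.c's actual cudaCheck may differ in detail):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Sketch of a CUDA error-checking helper; llm.c's cudaCheck is similar in spirit.
static void cuda_check_impl(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        fprintf(stderr, "[CUDA ERROR] at %s:%d: %s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define cudaCheck(err) (cuda_check_impl((err), __FILE__, __LINE__))

// cudaProfilerStop() returns cudaError_t, so it composes with the macro:
// if (step == 3) { cudaCheck(cudaProfilerStop()); }
```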
ngc92 (Contributor) commented on May 17, 2024:

this is an independent fix too

```diff
+    // Aggregate grads.lnfw and grads.lnfb in a background stream
+    floatX* layernorm_backward_pointers[] = {grads.lnfw, grads.lnfb};
+    size_t layernorm_backward_sizes[] = {C, C};
+    multi_gpu_async_all_reduce_pointers_group(2, layernorm_backward_pointers, layernorm_backward_sizes, multi_gpu_config, main_stream);
```
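
For readers without the full diff in front of them: a helper like this typically batches the per-tensor reductions inside an NCCL group so they can be fused and pipelined. The sketch below is illustrative only; the actual multi_gpu_async_all_reduce_pointers_group added by this PR may differ, in particular in how it derives the communicator and data type from multi_gpu_config:

```c
// Illustrative sketch only; not the real helper from this PR.
// Assumes floatX from llm.c and an ncclCheck error-checking macro.
void all_reduce_pointers_group_sketch(int n, floatX** ptrs, size_t* counts,
                                      ncclComm_t comm, cudaStream_t stream) {
    ncclCheck(ncclGroupStart());   // batch the calls so NCCL can fuse/pipeline them
    for (int i = 0; i < n; i++) {
        ncclCheck(ncclAllReduce(ptrs[i], ptrs[i], counts[i],
                                ncclFloat /* or the dtype matching floatX */,
                                ncclAvg, comm, stream));
    }
    ncclCheck(ncclGroupEnd());     // all reductions are enqueued on `stream` here
}
```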
Contributor:

comment says background stream, but call uses main_stream?

Contributor:

oh wait, in the version this code was based on, main_stream was the background stream?
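
Whichever stream ends up being called the "background" one, the ordering constraints are the same in both directions: the reducing stream has to wait until the gradients have been produced, and the stream that runs the optimizer has to wait until the reduction is done. A sketch with events (names illustrative, optimizer call hypothetical):

```c
// Illustrative two-way ordering between a compute stream and a reduction stream.
// Assumes cudaCheck/ncclCheck macros, a gradient buffer `grad` of `count` elements,
// an ncclComm_t `comm`, and existing compute_stream / reduce_stream handles.
cudaEvent_t grads_ready, reduce_done;
cudaCheck(cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming));
cudaCheck(cudaEventCreateWithFlags(&reduce_done, cudaEventDisableTiming));

// gradients finished on compute_stream -> reduction stream may start
cudaCheck(cudaEventRecord(grads_ready, compute_stream));
cudaCheck(cudaStreamWaitEvent(reduce_stream, grads_ready, 0));
ncclCheck(ncclAllReduce(grad, grad, count, ncclFloat, ncclAvg, comm, reduce_stream));

// reduction finished -> optimizer on compute_stream may consume the gradients
cudaCheck(cudaEventRecord(reduce_done, reduce_stream));
cudaCheck(cudaStreamWaitEvent(compute_stream, reduce_done, 0));
optimizer_step(&model, compute_stream);  // hypothetical; stands in for the real update
```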
