Overlap gradient computation and NCCL AllReduce #361

Open · wants to merge 1 commit into master from add_nvcc_parallel

Conversation

PeterZhizhin (Contributor) commented:

On my setup, I get the following:

Before:

step    2/37: train loss 4.720275 (acc 4.688650) (224.046844 ms, 36563.773438 tok/s)
step    3/37: train loss 3.802741 (acc 3.943135) (224.151611 ms, 36555.007812 tok/s)
step    4/37: train loss 3.698719 (acc 3.800745) (227.287033 ms, 36375.347656 tok/s)
step    5/37: train loss 3.444999 (acc 3.528596) (227.886978 ms, 36260.062500 tok/s)

After:

step    2/37: train loss 4.715888 (acc 4.686493) (199.011169 ms, 41163.503906 tok/s)
step    3/37: train loss 3.798963 (acc 3.942383) (193.084412 ms, 41811.468750 tok/s)
step    4/37: train loss 3.697987 (acc 3.800879) (193.079300 ms, 42027.660156 tok/s)
step    5/37: train loss 3.444056 (acc 3.526504) (193.470459 ms, 42112.496094 tok/s)

So, a 12% speedup.

Nsight Systems profiles (screenshots):

Before: backward kernels and NCCL AllReduce run sequentially on the same stream.

After: backward kernels and NCCL AllReduce are overlapped on separate streams.
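
The idea is to issue each gradient AllReduce on a separate communication stream as soon as the corresponding backward kernels have been enqueued, instead of reducing everything after the full backward pass. A minimal sketch of that pattern (stream, event, and helper names here are illustrative, not the actual identifiers from this diff):

```c
// Sketch of backward/AllReduce overlap (illustrative; not the exact code in this PR).
// Assumes: ncclComm_t comm; per-layer gradient buffers grads[l] with grad_counts[l]
// elements; a layer_backward(l, stream) helper; cudaCheck/ncclCheck error macros.
cudaStream_t compute_stream, comm_stream;
cudaEvent_t layer_grads_ready;
cudaCheck(cudaStreamCreate(&compute_stream));
cudaCheck(cudaStreamCreate(&comm_stream));
cudaCheck(cudaEventCreateWithFlags(&layer_grads_ready, cudaEventDisableTiming));

for (int l = num_layers - 1; l >= 0; l--) {
    // enqueue this layer's backward kernels on the compute stream
    layer_backward(l, compute_stream);

    // mark the point at which this layer's gradients are complete
    cudaCheck(cudaEventRecord(layer_grads_ready, compute_stream));
    // the communication stream waits only for that event, not for the whole backward pass
    cudaCheck(cudaStreamWaitEvent(comm_stream, layer_grads_ready, 0));

    // reduce this layer's gradients while earlier layers are still computing
    ncclCheck(ncclAllReduce(grads[l], grads[l], grad_counts[l],
                            ncclFloat, ncclAvg, comm, comm_stream));
}

// the optimizer step must not start until all reductions have finished
cudaCheck(cudaStreamSynchronize(comm_stream));
```

This is what the After profile shows: the NCCL kernels fill the gaps during the backward pass instead of serializing after it.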

@PeterZhizhin force-pushed the add_nvcc_parallel branch 2 times, most recently from 0ffa9a0 to 61a1f15 on May 5, 2024 at 15:51
```diff
@@ -2348,7 +2410,7 @@ void common_free(GPT2 &model) {
     cudaCheck(cudaFree(cublaslt_workspace));
     cublasCheck(cublasDestroy(cublas_handle));
     cublasCheck(cublasLtDestroy(cublaslt_handle));
-    create_cudnn();
+    destroy_cudnn();
```
Contributor:

@karpathy @PeterZhizhin cherry-pick; this should be merged immediately

```diff
         printf0("step %4d/%d: train loss %f (acc %f) (%f ms, %0f tok/s)\n",
                 step + 1, train_num_batches, model.mean_loss, accumulated_loss,
                 time_elapsed_ms, bias_corrected_ema_tokens_per_second);
         logger_log_train(&logger, step, model.mean_loss);

         // disable the profiler after 3 steps of optimization
-        if (step == 3) { cudaProfilerStop(); }
+        if (step == 3) { cudaCheck(cudaProfilerStop()); }
```
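
For context: cudaProfilerStop() returns a cudaError_t, so it can be routed through the same error-checking wrapper as every other CUDA call. A sketch of such a wrapper (llm.c's actual cudaCheck may differ in detail):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Sketch of a CUDA error-checking helper; llm.c's cudaCheck is similar in spirit.
static void cuda_check_impl(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        fprintf(stderr, "[CUDA ERROR] at %s:%d: %s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define cudaCheck(err) (cuda_check_impl((err), __FILE__, __LINE__))

// cudaProfilerStop() returns cudaError_t, so it composes with the macro:
// if (step == 3) { cudaCheck(cudaProfilerStop()); }
```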
ngc92 (Contributor) commented on May 17, 2024:

this is an independent fix too

```diff
+    // Aggregate grads.lnfw and grads.lnfb in a background stream
+    floatX* layernorm_backward_pointers[] = {grads.lnfw, grads.lnfb};
+    size_t layernorm_backward_sizes[] = {C, C};
+    multi_gpu_async_all_reduce_pointers_group(2, layernorm_backward_pointers, layernorm_backward_sizes, multi_gpu_config, main_stream);
```
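
For readers without the full diff in front of them: a helper like this typically batches the per-tensor reductions inside an NCCL group so they can be fused and pipelined. The sketch below is illustrative only; the actual multi_gpu_async_all_reduce_pointers_group added by this PR may differ, in particular in how it derives the communicator and data type from multi_gpu_config:

```c
// Illustrative sketch only; not the real helper from this PR.
// Assumes floatX from llm.c and an ncclCheck error-checking macro.
void all_reduce_pointers_group_sketch(int n, floatX** ptrs, size_t* counts,
                                      ncclComm_t comm, cudaStream_t stream) {
    ncclCheck(ncclGroupStart());   // batch the calls so NCCL can fuse/pipeline them
    for (int i = 0; i < n; i++) {
        ncclCheck(ncclAllReduce(ptrs[i], ptrs[i], counts[i],
                                ncclFloat /* or the dtype matching floatX */,
                                ncclAvg, comm, stream));
    }
    ncclCheck(ncclGroupEnd());     // all reductions are enqueued on `stream` here
}
```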
Contributor:

comment says background stream, but call uses main_stream?

Contributor:

oh wait, in the version this code was based on, main_stream was the background stream?
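
Whichever stream ends up being called the "background" one, the ordering constraints are the same in both directions: the reducing stream has to wait until the gradients have been produced, and the stream that runs the optimizer has to wait until the reduction is done. A sketch with events (names illustrative, optimizer call hypothetical):

```c
// Illustrative two-way ordering between a compute stream and a reduction stream.
// Assumes cudaCheck/ncclCheck macros, a gradient buffer `grad` of `count` elements,
// an ncclComm_t `comm`, and existing compute_stream / reduce_stream handles.
cudaEvent_t grads_ready, reduce_done;
cudaCheck(cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming));
cudaCheck(cudaEventCreateWithFlags(&reduce_done, cudaEventDisableTiming));

// gradients finished on compute_stream -> reduction stream may start
cudaCheck(cudaEventRecord(grads_ready, compute_stream));
cudaCheck(cudaStreamWaitEvent(reduce_stream, grads_ready, 0));
ncclCheck(ncclAllReduce(grad, grad, count, ncclFloat, ncclAvg, comm, reduce_stream));

// reduction finished -> optimizer on compute_stream may consume the gradients
cudaCheck(cudaEventRecord(reduce_done, reduce_stream));
cudaCheck(cudaStreamWaitEvent(compute_stream, reduce_done, 0));
optimizer_step(&model, compute_stream);  // hypothetical; stands in for the real update
```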
