Add CUDA Graph and AOT Autograd support #1271
base: main
Conversation
I'm still working on extra benchmarks and accuracy tests for the new options at the moment.
FYI, I intend to review (I can't set myself as a reviewer).
Seems I can't add you as a formal reviewer either; it might require the reviewer to be added as a collaborator. Hmm, I thought only read-only access was needed...
if not args.distributed:
    losses_m.update(loss.item(), input.size(0))

torch.cuda.synchronize()
In order to amortize CPU overhead by allowing the CPU to run ahead during the previous step, I'd suggest not syncing on every step, and instead recording the time at the beginning and end of the run and dividing by the batch size.
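As a minimal sketch of that timing scheme (the `run_step`, `num_steps`, and `sync_fn` names are illustrative, not from the PR): sync once before and once after the whole run, rather than on every step.

```python
import time

def benchmark(run_step, num_steps, batch_size, sync_fn=None):
    """Time the whole run and divide by total samples, instead of
    syncing every step (lets the CPU queue work ahead of the GPU)."""
    if sync_fn is not None:
        sync_fn()  # e.g. torch.cuda.synchronize() once, before timing starts
    t_start = time.perf_counter()
    for _ in range(num_steps):
        run_step()
    if sync_fn is not None:
        sync_fn()  # one final sync so all queued GPU work is counted
    elapsed = time.perf_counter() - t_start
    return num_steps * batch_size / elapsed  # samples per second
```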
Agreed, this is typically a better approach and more representative of the user experience.
Looks pretty good, some comments. Not 100% sure if we want the complexity of cuda graphs in training, as it won't necessarily give benefits unless running inference-sized batch sizes per GPU (<16 is probably where you'll start seeing benefit).
Have you run some correctness testing with AOTAutograd with and without CUDA Graphs?
parser.add_argument('--cuda-graph', default=False, action='store_true',
                    help="Enable CUDA Graph support")
parser.add_argument('--aot-autograd', default=False, action='store_true',
                    help="Enable AOT Autograd support. (It's recommended to use this option with `--fuser nvfuser` but without `--torchscript`)")
Didn't they change it so AOTAutograd defaults to nvFuser now? Can you add an assert in the script to make sure torchscript and aot-autograd are not on at the same time, just to give a cleaner error message stating not to use the options together?
I think torchdynamo with aot_autograd_speedup_strategy has nvFuser enabled by default now, but not memory_efficient_fusion? Not 100% sure.
Will add a mutually exclusive check on torchscript and aot_autograd.
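A minimal sketch of such a mutual-exclusion check (flag names follow the PR's options; the `validate` helper is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--torchscript', default=False, action='store_true')
parser.add_argument('--aot-autograd', default=False, action='store_true')

def validate(args):
    # Fail early with a clear message instead of a confusing downstream error.
    assert not (args.torchscript and args.aot_autograd), \
        "Cannot use --torchscript and --aot-autograd at the same time."
    return args
```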
Okay, no worries; checking they're mutually exclusive would be great. Thanks.
@@ -265,12 +279,28 @@ def _step():
    for _ in range(self.num_warm_iter):
        _step()

    if self.cuda_graph:
Can you add some comments simply mentioning that we need to sync the stream, do warm-up iterations, then capture the graph?
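For reference, the sync / warm-up / capture sequence being asked about follows the pattern from the PyTorch CUDA Graphs notes. A hedged sketch (`capture_forward` and its arguments are illustrative, not the PR's code, and replaying requires a CUDA device):

```python
import torch

def capture_forward(model, static_input, num_warmup=3):
    """Warm up on a side stream, sync, then capture one forward pass."""
    # 1. Warm-up iterations on a side stream.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(num_warmup):
            model(static_input)
    # 2. Sync the default stream with the warm-up stream before capture.
    torch.cuda.current_stream().wait_stream(s)
    # 3. Capture the graph; later calls to g.replay() rerun this work
    #    on whatever data is in static_input at replay time.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)
    return g, static_output
```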
@@ -367,7 +397,21 @@ def _step(detail=False):
    for _ in range(self.num_warm_iter):
        _step()

    t_run_start = self.time_fn()
    if self.cuda_graph:
CUDA Graphs can be helpful in training, but since larger batches are often used, there typically isn't a perf gain from using them. @rwightman do you think we should just leave cuda graphs for inference but not training? I don't suspect you'd see much tangible benefit above something like batch size 8.
@@ -17,7 +17,7 @@
class ApexScaler:
@rwightman do you mind if we remove the Apex variant of AMP and just use native?
delta_fwd = _step()
total_step += delta_fwd
if self.cuda_graph:
    g.replay()
This isn't 100% representative of inference with CUDA Graphs, since we're not including copying the new inputs into the input buffer. I'm wondering if we want to confer with @mcarilli to see if we can come up with a reasonable async mechanism to copy the data in. One option would be to keep two CUDA Graphs going at the same time, so we can asynchronously copy the next input into one graph's buffer while running the other graph, ping-ponging back and forth.
This would be important to do if we're actually doing inference and want the highest perf possible. For benchmarking, we could just keep it simple by running the same inputs over and over.
We can add input copying, like static_inputs.copy_(self.example_inputs), before every replay.
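The copy-before-replay pattern being suggested might be sketched like this (the `replay_with_inputs` helper and its names are illustrative, not the PR's code; the graph reads whatever is in the buffer it was captured with):

```python
import torch

def replay_with_inputs(g, static_input, inputs):
    """For each batch: copy it into the graph's captured input buffer
    in place, then replay the graph on that data."""
    for new_input in inputs:
        static_input.copy_(new_input)  # overwrite the captured buffer in place
        g.replay()                     # graph reruns on the fresh inputs
```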
That'll really slow down the benchmarks, as it will be 100% serialized. I'm hoping @mcarilli can suggest a mechanism (though maybe somewhat complex) to overlap it. If we really want the overlap, having two CUDA Graphs around to ping-pong between would probably work fine if we put them on different streams. Not an absolute must, but I wanted to make sure we mentioned this added complexity.
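Purely as a sketch of the idea being floated here (hypothetical names, and not validated for actual copy/compute overlap), the ping-pong scheme might look something like:

```python
import contextlib
import torch

def ping_pong_replay(graphs, buffers, batches, copy_stream=None):
    """Alternate between two captured graphs: stage each batch into one
    graph's input buffer (optionally on a separate copy stream) while the
    other graph can still be replaying, ping-ponging back and forth."""
    for i, batch in enumerate(batches):
        k = i % 2
        copy_ctx = (torch.cuda.stream(copy_stream)
                    if copy_stream is not None else contextlib.nullcontext())
        with copy_ctx:
            buffers[k].copy_(batch, non_blocking=True)  # stage new inputs
        if copy_stream is not None:
            # Make sure the replay stream sees the freshly copied inputs.
            torch.cuda.current_stream().wait_stream(copy_stream)
        graphs[k].replay()
```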
@@ -682,7 +745,9 @@ def main():
 def train_one_epoch(
         epoch, model, loader, optimizer, loss_fn, args,
         lr_scheduler=None, saver=None, output_dir=None, amp_autocast=suppress,
-        loss_scaler=None, model_ema=None, mixup_fn=None):
+        loss_scaler=None, model_ema=None, mixup_fn=None,
+        cuda_graph=None, cg_stage=None,
Can you add some comments about the newly added arguments, explicitly what they are and why they're important? You can reference the CUDA Graphs docs if that makes it easier.
    loss = _step(input, target)
    torch.cuda.current_stream().wait_stream(s)
    return (input, target, loss)
elif cg_stage == 'capture':
Again, some comments in here would be helpful.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(11):
Any particular reason for 11? In benchmarking you're doing 3 warmups.
@mcarilli mentioned in the CUDA Graphs doc that cuda graph warmup with DDP needs 11 iterations instead of 3: https://pytorch.org/docs/stable/notes/cuda.html#id5 🤔
The benchmark script runs a single process without DDP, but the training script may have DDP enabled.
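Under that assumption (11 warm-up iterations with DDP per the PyTorch CUDA Graphs notes, 3 otherwise, as used in this PR), a trivial helper could make the choice explicit; the `warmup_iters` name is hypothetical:

```python
def warmup_iters(distributed: bool) -> int:
    """Warm-up iterations before CUDA graph capture: the PyTorch CUDA
    Graphs notes call for 11 with DDP, while this PR uses 3 otherwise."""
    return 11 if distributed else 3
```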
Hah, okay no problem. If Carilli says so, Carilli says so ;-)
I have some ResNet50 AMP + channels-last training results with cuda graph and nvfuser that verify the training accuracy (loss, val acc) here: https://gist.github.com/xwang233/f3b5b4818762b08d716f969899b6d263 (after 10 epochs; V100x8 with BS = 128, and A100x8 with BS = 32).
TL;DR: the training accuracies are the same for eager mode, cuda graph, and cuda graph + nvfuser. Cuda graph keeps training throughput the same at large batch sizes, but can give out-of-the-box improvements at small batch sizes. For example, in the results above, ResNet50 on A100x8 with batch size 32 got a training throughput improvement from 5600 to 6600 img/s. I'm also checking training accuracy with aot_autograd.
ResNet50 FP32 training results with eager, cuda graph, TS+nvfuser, and AOT_autograd+nvfuser: https://gist.github.com/xwang233/d5136facb3361af54693081da346fd33 (after 10 epochs; A100x8 with BS = 128, A100x8 with BS = 32, and V100x8 with BS = 64).
@csarofeen @kevinstephano @xwang233 Putting a few comments down here that relate to the whole PR. It's nice to see it together, but I'm not sure it's worth it just yet; it really needs to be pushed into a model / task wrapper. I had a plan to work it into the
I have to think if there's a way to have the graph code in this train script that better separates the extra code (even if it adds redundancy)...
Add CUDA Graph support with --cuda-graph and AOT Autograd support with --aot-autograd to benchmark.py and train.py. The workflow for cuda graph in train.py might be a bit overcomplicated.
Related: #1244