
[BUG] _validate() in the Vertical Federated Splitlearning CIFAR10 example with a ResNet50 causes torch.cuda.OutOfMemoryError #2443

Open
eshatkeinensinn opened this issue Mar 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@eshatkeinensinn

Describe the bug
I adapted the Vertical Federated Splitlearning CIFAR10 example to a ResNet50 and aimed to split the network as close to the middle as possible. My code works when the second client only has the final flatten layer and the linear layer. The code also works when I place the split closer to the middle and only run training. However, when _validate() is called during training, I get a torch.cuda.OutOfMemoryError: "CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB, of which 21.81 MiB is free. Process 87262 has 1.17 GiB memory in use. Process 87261 has 5.39 GiB memory in use. Including non-PyTorch memory, this process has 13.91 GiB memory in use. Process 4034216 has 2.70 GiB memory in use. Of the allocated memory, 12.27 GiB is allocated by PyTorch, and 505.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF."

To Reproduce
Code: https://github.com/eshatkeinensinn/NVFlare/tree/main/examples/advanced/vertical_federated_learning/cifar10-split-res
Run the regular README steps and launch Jupyter Lab.

Additional context
My thought is that because the tensors passed between the clients are larger than when I place the split later in the network, they are handled/stored differently in _validate() than in train(), which leads to the out-of-memory error.
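To make the size difference concrete, here is a rough sketch (it assumes torchvision's stock resnet50 and 224x224 inputs, which may differ from the example's preprocessing) that prints the per-sample activation size that would be exchanged at different split points:

```python
import torch
from torchvision.models import resnet50

# Illustration only: the earlier the split, the larger the activation
# tensor the first client has to send to the second client.
model = resnet50()
x = torch.randn(1, 3, 224, 224)  # assumed input size; the example may differ

def activation_after(split_layer: str) -> torch.Tensor:
    """Run the forward pass up to and including `split_layer`."""
    out = x
    for name, module in model.named_children():
        out = module(out)
        if name == split_layer:
            break
    return out

for split in ["layer2", "layer4"]:
    act = activation_after(split)
    print(split, tuple(act.shape), f"{act.numel() * 4 / 1e6:.2f} MB per sample")
```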

@eshatkeinensinn eshatkeinensinn added the bug Something isn't working label Mar 25, 2024
@YuanTingHsieh
Collaborator

@eshatkeinensinn thanks for sharing.

I think this issue is because the GPU does not have enough memory to run both training and validation at the same time.
So one option is to not run them at the same time.

If you still want to run both at the same time, you can try:

  1. Reducing the network/model size
  2. Reducing the batch size of the data (see the sketch below)
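For the second point, a minimal sketch (the dataset here is a stand-in, not the example's actual CIFAR-10 pipeline): a smaller batch_size shrinks the activation tensors that have to fit on the GPU at once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in the example this would be the CIFAR-10 split datasets.
dummy = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# e.g. batch_size=16 instead of 64 roughly quarters the activation memory per step
loader = DataLoader(dummy, batch_size=16, shuffle=True)

for images, labels in loader:
    print(images.shape)  # torch.Size([16, 3, 32, 32])
    break
```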

@ZiyueXu77 @holgerroth feel free to comment more on preventing OOM, thanks

@eshatkeinensinn
Author

With torch.no_grad() during validation, I could fix the OOM.
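For reference, a minimal sketch of that change (the function and loader names are placeholders, not the example's actual _validate() code): running the validation forward passes under torch.no_grad() stops autograd from retaining intermediate activations, which is what was exhausting GPU memory here.

```python
import torch

def _validate(model: torch.nn.Module, val_loader) -> float:
    """Compute validation accuracy without building an autograd graph."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no graph is recorded -> far less GPU memory
        for inputs, labels in val_loader:
            outputs = model(inputs)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    model.train()
    return correct / total
```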

@YuanTingHsieh
Collaborator

@eshatkeinensinn thanks for sharing your tips!

@holgerroth
Collaborator

@YuanTingHsieh , I think we should add torch.no_grad() to our example. Let's leave this open.

@holgerroth holgerroth reopened this Apr 15, 2024