
[BUG] _validate() in the Vertical Federated Splitlearning CIFAR10 example with a ResNet50 causes torch.cuda.OutOfMemoryError #2443

Open
eshatkeinensinn opened this issue Mar 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@eshatkeinensinn

Describe the bug
I adapted the Vertical Federated Splitlearning CIFAR10 example to a ResNet50 and aimed to split the network as close to the middle as possible. My code works when the second client only has the final flatten layer and the linear layer. The code also works when I place the split closer to the middle and only run training. However, when _validate() is called during training, I get a torch.cuda.OutOfMemoryError: "CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB, of which 21.81 MiB is free. Process 87262 has 1.17 GiB memory in use. Process 87261 has 5.39 GiB memory in use. Including non-PyTorch memory, this process has 13.91 GiB memory in use. Process 4034216 has 2.70 GiB memory in use. Of the allocated memory, 12.27 GiB is allocated by PyTorch, and 505.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF."

To Reproduce
Code: https://github.com/eshatkeinensinn/NVFlare/tree/main/examples/advanced/vertical_federated_learning/cifar10-split-res
Run the regular README steps and launch Jupyter Lab.

Additional context
My thought is that because the tensors passed between the clients are larger than when I place the split later in the network, they are handled/stored differently in _validate() than in train(), which leads to the out-of-memory error.
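To make the size difference concrete, here is a rough sketch (it assumes torchvision's stock resnet50 and 224x224 inputs, which may differ from the example's preprocessing) that prints the per-sample activation size that would be exchanged at different split points:

```python
import torch
from torchvision.models import resnet50

# Illustration only: the earlier the split, the larger the activation
# tensor the first client has to send to the second client.
model = resnet50()
x = torch.randn(1, 3, 224, 224)  # assumed input size; the example may differ

def activation_after(split_layer: str) -> torch.Tensor:
    """Run the forward pass up to and including `split_layer`."""
    out = x
    for name, module in model.named_children():
        out = module(out)
        if name == split_layer:
            break
    return out

for split in ["layer2", "layer4"]:
    act = activation_after(split)
    print(split, tuple(act.shape), f"{act.numel() * 4 / 1e6:.2f} MB per sample")
```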

@eshatkeinensinn eshatkeinensinn added the bug Something isn't working label Mar 25, 2024
@YuanTingHsieh
Collaborator

@eshatkeinensinn thanks for sharing.

I think this issue is because the GPU does not have enough memory to run both training and validation at the same time.
So one option is to not run them at the same time.

If you still want to run both at the same time, you can try:

  1. Reducing the network/model size
  2. Reducing the batch size of the data (see the sketch below)
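For the second point, a minimal sketch (the dataset here is a stand-in, not the example's actual CIFAR-10 pipeline): a smaller batch_size shrinks the activation tensors that have to fit on the GPU at once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in the example this would be the CIFAR-10 split datasets.
dummy = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# e.g. batch_size=16 instead of 64 roughly quarters the activation memory per step
loader = DataLoader(dummy, batch_size=16, shuffle=True)

for images, labels in loader:
    print(images.shape)  # torch.Size([16, 3, 32, 32])
    break
```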

@ZiyueXu77 @holgerroth feel free to comment more on preventing OOM, thanks

@eshatkeinensinn
Author

With torch.no_grad() during validation, I could fix the OOM.
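For reference, a minimal sketch of that change (the function and loader names are placeholders, not the example's actual _validate() code): running the validation forward passes under torch.no_grad() stops autograd from retaining intermediate activations, which is what was exhausting GPU memory here.

```python
import torch

def _validate(model: torch.nn.Module, val_loader) -> float:
    """Compute validation accuracy without building an autograd graph."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no graph is recorded -> far less GPU memory
        for inputs, labels in val_loader:
            outputs = model(inputs)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    model.train()
    return correct / total
```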

@YuanTingHsieh
Collaborator

@eshatkeinensinn thanks for sharing your tips!

@holgerroth
Collaborator

@YuanTingHsieh , I think we should add torch.no_grad() to our example. Let's leave this open.

@holgerroth holgerroth reopened this Apr 15, 2024