# Efficient GPU Utilization
- [1. CUDA out of memory solutions](#1-cuda-out-of-memory-solutions)
  - [1.1. Use a smaller batch size](#11-use-a-smaller-batch-size)
  - [1.2. Check if there is any accumulated history across your training loop](#12-check-if-there-is-any-accumulated-history-across-your-training-loop)
  - [1.3. Delete intermediate variables you don't need](#13-delete-intermediate-variables-you-dont-need)
  - [1.4. Check if your GPU memory is freed properly](#14-check-if-your-gpu-memory-is-freed-properly)
  - [1.5. Turn off gradient calculation during validation](#15-turn-off-gradient-calculation-during-validation)
  - [1.6. COM in Google Colab](#16-com-in-google-colab)
- [2. Multiple GPUs](#2-multiple-gpus)

## 1. CUDA out of memory solutions
<div align=center>
  <img src='images/COM.JPG' width=360 height=240>
</div>

- Anyone engaged in deep learning has probably run into the problem of CUDA running out of memory. It is really frustrating when you have spent a week writing and debugging your code to make sure everything is correct, and then, just as training starts, the program throws a `CUDA out of memory` error. Here are some practical ways to help you solve this annoying problem.
### 1.1. Use a smaller batch size
- The most frequent cause of this problem is a batch size that is set too large. Try a smaller one.
- In some scenarios a smaller batch size can hurt your network's performance, so a good way to balance the two is gradient accumulation. Here is an example:
  ```python
  accumulation_steps = 10                              # Number of mini-batches to accumulate over
  model.zero_grad()                                    # Reset gradients tensors
  for i, (inputs, labels) in enumerate(training_set):
      predictions = model(inputs)                      # Forward pass
      loss = loss_function(predictions, labels)        # Compute loss function
      loss = loss / accumulation_steps                 # Normalize our loss (if averaged)
      loss.backward()                                  # Backward pass
      if (i + 1) % accumulation_steps == 0:            # Wait for several backward steps
          optimizer.step()                             # Now we can do an optimizer step
          model.zero_grad()                            # Reset gradients tensors
          if (i + 1) % evaluation_steps == 0:          # Evaluate the model when we...
              evaluate_model()                         # ...have no gradients accumulated
  ```
- As you can see from the code, `model.zero_grad()` is executed only after the forward count reaches `accumulation_steps`, i.e. the gradients are accumulated over 10 mini-batches before the parameters are updated. This lets you train with an effectively large batch size while reducing the memory footprint.
- This can also cause some minor problems, e.g. the BatchNorm layers still see only the small per-step batches, so their statistics may be slightly less accurate.

### 1.2. Check if there is any accumulated history across your training loop
- By default, computations involving variables that require gradients will keep history. This means that you should avoid using such variables in computations which will live beyond your training loops, e.g., when tracking statistics. Instead, you should detach the variable or access its underlying data.
- Here is a bad example:
  ```python
  total_loss = 0
  for i in range(10000):
      optimizer.zero_grad()
      output = model(input)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()
      total_loss += loss                 # keeps the whole computation graph alive, not just the value
  ```
- `total_loss` is defined outside the loop and keeps accumulating `loss` in each iteration. Since `loss` still carries its computation graph, the graphs of all iterations are kept alive, causing unnecessary memory usage. You can solve it in two ways: use `total_loss += loss.detach()` or `total_loss += loss.item()` instead, as in the corrected sketch below.
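- A minimal corrected version of the loop above (same placeholder names as the bad example, only the accumulation line changes):
  ```python
  total_loss = 0.0
  for i in range(10000):
      optimizer.zero_grad()
      output = model(input)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()
      total_loss += loss.item()          # a plain Python float, so no graph is retained
  ```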
### 1.3. Delete intermediate variables you don't need
- If you assign a Tensor or Variable to a local, Python will not deallocate it until the local goes out of scope. You can free this reference with `del x`. Similarly, if you assign a Tensor or Variable to a member variable of an object, it will not be deallocated until the object goes out of scope. You will get the best memory usage if you don't hold onto temporaries you don't need.
```python
for i in range(5):
    intermediate = f(input[i])
    result += g(intermediate)
output = h(result)
return output
```
- Here, `intermediate` remains live even while `h` is executing, because its scope extends past the end of the loop. To free it earlier, you should `del intermediate` when you are done with it, as in the sketch below.
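- A minimal sketch of the same fragment with the temporary released early (`f`, `g`, `h` and `input` are the placeholders from the snippet above):
  ```python
  result = 0
  for i in range(5):
      intermediate = f(input[i])
      result += g(intermediate)
      del intermediate          # drop the reference so the tensor can be freed before h runs
  output = h(result)
  ```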
### 1.4. Check if your GPU memory is freed properly
- Sometimes, even after your code has stopped running, the GPU memory may still be occupied by it. The best way to deal with this is to find the process holding the memory and kill it.
- Find the PID of the Python process with:
  ```bash
  nvidia-smi
  ```
- Copy the PID and kill the process with:
  ```bash
  sudo kill -9 pid
  ```
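- If you only want to list the processes that are using the GPU, `nvidia-smi` can also print a compact listing of them (flag support may vary with your driver version):
  ```bash
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
  ```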
### 1.5. Turn off gradient calculation during validation
- You don't need gradients during validation, since there is no backward pass, so run the forward pass under `torch.no_grad()`. A minimal sketch, assuming `model` is a regular `nn.Module` and `loader` yields validation batches:
  ```python
  model.eval()                  # put layers such as dropout and batchnorm into eval mode
  with torch.no_grad():         # disable gradient tracking for everything in this block
      for batch in loader:
          outputs = model(batch)
  ```
### 1.6. COM in Google Colab
- If you are getting this error in Google Colab, try releasing the cached memory first:
  ```python
  import torch
  torch.cuda.empty_cache()      # release cached, unoccupied blocks back to the driver
  ```
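- To see whether this actually helped, you can compare how much memory PyTorch has allocated versus reserved before and after the call (these are standard `torch.cuda` helpers in recent PyTorch versions):
  ```python
  import torch

  print(torch.cuda.memory_allocated())   # bytes currently occupied by live tensors
  print(torch.cuda.memory_reserved())    # bytes held by the caching allocator
  torch.cuda.empty_cache()               # return unused cached blocks to the driver
  print(torch.cuda.memory_reserved())    # usually drops after emptying the cache
  ```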
## 2. Multiple GPUs