Running Code with Multiple GPUs #35

Open
faezeamin opened this issue Nov 22, 2023 · 4 comments

faezeamin commented Nov 22, 2023

Thank you for providing the code!

I'd like to run it using multiple GPUs with my own dataset, but I encountered the following error:

---

DATA CONFIG:
lab: han
expt: jenelia-exp
animal: HH09
session: S07_20210611
n_input_channels: 2
y_pixels: 304
x_pixels: 288
use_output_mask: False
frame_rate: 30.0
neural_type: ca
neural_bin_size: 0.03333333333333333
approx_batch_size: 200

COMPUTE CONFIG:
device: cuda
n_parallel_gpus: 4
gpus_viz: 0;1;2;3
tt_n_gpu_trials: 128
tt_n_cpu_trials: 1000
tt_n_cpu_workers: 5
mem_limit_gb: 7

TRAINING CONFIG:
export_train_plots: True
export_latents: True
pretrained_weights_path: None
val_check_interval: 1
learning_rate: 0.0001
max_n_epochs: 1000
min_n_epochs: 10
enable_early_stop: False
early_stop_history: 10
rng_seed_train: None
as_numpy: False
batch_load: True
rng_seed_data: 0
train_frac: 1.0
trial_splits: 8;1;1;0

MODEL CONFIG:
experiment_name: dim_search
model_type: conv
n_ae_latents: 16
l2_reg: 0.0
rng_seed_model: 0
fit_sess_io_layers: False
ae_arch_json: None
model_class: ae
conditional_encoder: False
msp.alpha: None
vae.beta: 1
vae.beta_anneal_epochs: 100
beta_tcvae.beta: 1
beta_tcvae.beta_anneal_epochs: 100
ps_vae.alpha: 1
ps_vae.beta: 1
ps_vae.gamma: 1
ps_vae.delta: 1
ps_vae.anneal_epochs: 100
n_background: 3
n_sessions_per_batch: 1

using data from following sessions:
/root/capsule/scratch/results/han/jenelia-exp/HH09/S07_20210611
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
han_jenelia-exp_HH09_S07_20210611
signals: ['images']
transforms: OrderedDict([('images', None)])
paths: OrderedDict([('images', '/root/capsule/data/base-data-dir/han/jenelia-exp/HH09/S07_20210611/data.hdf5')])

constructing model...Initializing with random weights
done
CustomDataParallel(
  (module): AE(
    (encoding): ConvAEEncoder(
      (encoder): ModuleList(
        (zero_pad0): ZeroPad2d((1, 2, 1, 2))
        (conv0): Conv2d(2, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu0): LeakyReLU(negative_slope=0.05)
        (zero_pad1): ZeroPad2d((1, 2, 1, 2))
        (conv1): Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (zero_pad2): ZeroPad2d((1, 2, 1, 2))
        (conv2): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (zero_pad3): ZeroPad2d((1, 2, 1, 2))
        (conv3): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (zero_pad4): ZeroPad2d((1, 1, 0, 1))
        (conv4): Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
        (relu4): LeakyReLU(negative_slope=0.05)
      )
      (FF): Linear(in_features=8192, out_features=16, bias=True)
    )
    (decoding): ConvAEDecoder(
      (FF): Linear(in_features=16, out_features=8192, bias=True)
      (decoder): ModuleList(
        (convtranspose0): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
        (relu0): LeakyReLU(negative_slope=0.05)
        (convtranspose1): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (convtranspose2): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (convtranspose3): ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (convtranspose4): ConvTranspose2d(32, 2, kernel_size=(5, 5), stride=(2, 2))
        (sigmoid4): Sigmoid()
      )
    )
  )
)
epoch 0000/1000
0%| | 0/256 [00:09<?, ?it/s]
Caught exception in worker thread CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/test_tube/argparse_hopt.py", line 39, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "/behavenet/behavenet/fitting/ae_grid_search.py", line 112, in main
    fit(hparams, model, data_generator, exp, method='ae')
  File "/root/capsule/behavenet/behavenet/fitting/training.py", line 347, in fit
    loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
  File "/root/capsule/behavenet/behavenet/models/aes.py", line 766, in loss
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

---
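
As a side note: the end of the error message suggests setting max_split_size_mb, which can be done through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation happens. A minimal sketch (the 128 MiB value is an arbitrary example, not a documented recommendation):

    # Set the allocator hint from the OOM message; it must be set
    # before torch makes its first CUDA allocation.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # arbitrary example value

    import torch  # import torch only after setting the variable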

It seems the code is not recognizing all four GPUs and cannot make use of their combined memory. In my troubleshooting, I've explored the following steps:

  • Configurations are set according to the user guide documentation:
    “Training an AE can be slow: you can speed up the training by parallelizing over multiple gpus. To do this, just specify n_parallel_gpus to be the number of gpus you wish to use per model. The code will split up the gpus specified in gpus_viz into groups of size n_parallel_gpus (or less if there are leftover gpus) and run the models accordingly.”

  • The model is fitted on cloud compute (Code Ocean) using a four-GPU machine with the following properties:

GPU 0: Tesla M60, 7.982743552GB
GPU 1: Tesla M60, 7.982743552GB
GPU 2: Tesla M60, 7.982743552GB
GPU 3: Tesla M60, 7.982743552GB

  • PyTorch and CUDA versions are as follows:

PyTorch Version: 1.12.1+cu116
CUDA Version: 11.6

  • nvidia-smi output:

Wed Nov 22 11:47:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1B.0 Off | 0 |
| N/A 29C P8 16W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:00:1C.0 Off | 0 |
| N/A 27C P0 38W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
| N/A 32C P0 38W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 25C P0 39W / 150W | 0MiB / 7680MiB | 75% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

  • Previously, I fitted this same model architecture (the one used in the paper) on the same dataset, downsampled to 128 x 128, on a single-GPU machine (14 GB). For that run I used the same platform (Code Ocean), and the code worked successfully.

In the current run, I add the second camera view and keep the frame size of the original data (304 x 288). It seems the code either cannot identify the other GPUs or does not make use of their memory (see the visibility check at the end of this list).

  • I tried running the integration test, and here is the final result:

================== Integration Test Results ==================

ae: passed
arhmm: passed
neural-ae: passed
neural-ae-me: passed
neural-labels: passed
neural-arhmm: passed
ae-multisession: passed
vae: passed
beta-tcvae: passed
cond-ae-msp: passed
cond-vae: passed
ps-vae: passed
msps-vae-multisession: passed
labels-images: passed

total time to perform integration test: 195.396645 sec


  • The code works properly in CPU mode on this data.

  • I tried "mem_limit_gb" values of 5, 6, 7, 8, and 24.0, and reduced "tt_n_gpu_trials" to 128. None of these helped.

  • The dataset consists of trials of varying length, with mean 1772 and std 604 frames per trial.
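
  • As a further sanity check for device visibility (plain PyTorch, not a BehaveNet API), a snippet like the following can confirm that the training process actually sees all four devices:

    # Confirm PyTorch enumerates all four M60s and their memory.
    import torch

    print(torch.cuda.device_count())  # expected: 4
    for i in range(torch.cuda.device_count()):
        p = torch.cuda.get_device_properties(i)
        print(i, p.name, round(p.total_memory / 1024**3, 2), "GiB")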

Despite these efforts, the issue persists. I would greatly appreciate any insights or suggestions you may have.
Thank you!


themattinthehatt (Owner) commented Nov 22, 2023

Hi @faezeamin ,
I have not tried the multi-GPU training in several years; I can test this out on my end after the Thanksgiving break and get back to you.
In the meantime, would it be possible for you to request a GPU with more memory from Code Ocean?

faezeamin (Author) commented

Thank you for your prompt response!
Yes, the model runs on a single GPU with 15.65 GB of memory. But I'm interested in exploring faster run-times with multiple GPUs, if feasible.
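
For what it's worth, my understanding of plain torch.nn.DataParallel (which I assume CustomDataParallel builds on; I have not verified this in the BehaveNet source) is that each batch is split across the visible devices, but outputs are gathered and gradients reduced on the first device, so GPU 0 tends to run out of memory before the others. A minimal sketch of that pattern, with a stand-in module:

    import torch
    import torch.nn as nn

    # Stand-in module; DataParallel wraps any nn.Module the same way.
    net = nn.Linear(128, 16)
    model = nn.DataParallel(net, device_ids=[0, 1, 2, 3]).cuda()

    x = torch.randn(64, 128).cuda()  # a batch of 64 is split into 4 chunks of 16
    out = model(x)                   # forward pass replicated across the 4 GPUs
    out.mean().backward()            # gradients are reduced onto device_ids[0]

If that is what is happening here, extra GPUs would speed up the forward/backward passes but not pool memory, which would match GPU 0 hitting OOM while the others sit idle.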

themattinthehatt (Owner) commented

@faezeamin sorry for not getting to this yet, haven't forgotten about it though

faezeamin (Author) commented

Hi @themattinthehatt - Just following up on this issue. Have you had a chance to look into the multiple-GPU analysis? Thanks, -Faeze
