Running Code with Multiple GPUs #35

Open
faezeamin opened this issue Nov 22, 2023 · 4 comments

faezeamin commented Nov 22, 2023

Thank you for providing the code!

I'd like to run it using multiple GPUs with my own dataset, but I encountered the following error:

---

DATA CONFIG:
lab: han
expt: jenelia-exp
animal: HH09
session: S07_20210611
n_input_channels: 2
y_pixels: 304
x_pixels: 288
use_output_mask: False
frame_rate: 30.0
neural_type: ca
neural_bin_size: 0.03333333333333333
approx_batch_size: 200

COMPUTE CONFIG:
device: cuda
n_parallel_gpus: 4
gpus_viz: 0;1;2;3
tt_n_gpu_trials: 128
tt_n_cpu_trials: 1000
tt_n_cpu_workers: 5
mem_limit_gb: 7

TRAINING CONFIG:
export_train_plots: True
export_latents: True
pretrained_weights_path: None
val_check_interval: 1
learning_rate: 0.0001
max_n_epochs: 1000
min_n_epochs: 10
enable_early_stop: False
early_stop_history: 10
rng_seed_train: None
as_numpy: False
batch_load: True
rng_seed_data: 0
train_frac: 1.0
trial_splits: 8;1;1;0

MODEL CONFIG:
experiment_name: dim_search
model_type: conv
n_ae_latents: 16
l2_reg: 0.0
rng_seed_model: 0
fit_sess_io_layers: False
ae_arch_json: None
model_class: ae
conditional_encoder: False
msp.alpha: None
vae.beta: 1
vae.beta_anneal_epochs: 100
beta_tcvae.beta: 1
beta_tcvae.beta_anneal_epochs: 100
ps_vae.alpha: 1
ps_vae.beta: 1
ps_vae.gamma: 1
ps_vae.delta: 1
ps_vae.anneal_epochs: 100
n_background: 3
n_sessions_per_batch: 1

using data from following sessions:
/root/capsule/scratch/results/han/jenelia-exp/HH09/S07_20210611
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
han_jenelia-exp_HH09_S07_20210611
signals: ['images']
transforms: OrderedDict([('images', None)])
paths: OrderedDict([('images', '/root/capsule/data/base-data-dir/han/jenelia-exp/HH09/S07_20210611/data.hdf5')])

constructing model...Initializing with random weights
done
CustomDataParallel(
  (module): AE(
    (encoding): ConvAEEncoder(
      (encoder): ModuleList(
        (zero_pad0): ZeroPad2d((1, 2, 1, 2))
        (conv0): Conv2d(2, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu0): LeakyReLU(negative_slope=0.05)
        (zero_pad1): ZeroPad2d((1, 2, 1, 2))
        (conv1): Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (zero_pad2): ZeroPad2d((1, 2, 1, 2))
        (conv2): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (zero_pad3): ZeroPad2d((1, 2, 1, 2))
        (conv3): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (zero_pad4): ZeroPad2d((1, 1, 0, 1))
        (conv4): Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
        (relu4): LeakyReLU(negative_slope=0.05)
      )
      (FF): Linear(in_features=8192, out_features=16, bias=True)
    )
    (decoding): ConvAEDecoder(
      (FF): Linear(in_features=16, out_features=8192, bias=True)
      (decoder): ModuleList(
        (convtranspose0): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
        (relu0): LeakyReLU(negative_slope=0.05)
        (convtranspose1): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (convtranspose2): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (convtranspose3): ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (convtranspose4): ConvTranspose2d(32, 2, kernel_size=(5, 5), stride=(2, 2))
        (sigmoid4): Sigmoid()
      )
    )
  )
)
epoch 0000/1000
0%| | 0/256 [00:09<?, ?it/s]
Caught exception in worker thread CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/test_tube/argparse_hopt.py", line 39, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "/behavenet/behavenet/fitting/ae_grid_search.py", line 112, in main
    fit(hparams, model, data_generator, exp, method='ae')
  File "/root/capsule/behavenet/behavenet/fitting/training.py", line 347, in fit
    loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
  File "/root/capsule/behavenet/behavenet/models/aes.py", line 766, in loss
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

---
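
As a side note: the end of the error message suggests setting max_split_size_mb, which can be done through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation happens. A minimal sketch (the 128 MiB value is an arbitrary example, not a documented recommendation):

    # Set the allocator hint from the OOM message; it must be set
    # before torch makes its first CUDA allocation.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # arbitrary example value

    import torch  # import torch only after setting the variable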

It seems the code is not recognizing all four GPUs and cannot make use of their combined memory. In my troubleshooting, I've explored the following steps:

  • Configurations are set according to the user guide documentation:
    “Training an AE can be slow: you can speed up the training by parallelizing over multiple gpus. To do this, just specify n_parallel_gpus to be the number of gpus you wish to use per model. The code will split up the gpus specified in gpus_viz into groups of size n_parallel_gpus (or less if there are leftover gpus) and run the models accordingly.”

  • The model is fitted on cloud compute (Code Ocean) using a four-GPU machine with the following properties:

GPU 0: Tesla M60, 7.982743552GB
GPU 1: Tesla M60, 7.982743552GB
GPU 2: Tesla M60, 7.982743552GB
GPU 3: Tesla M60, 7.982743552GB

  • PyTorch and CUDA versions are as follows:

PyTorch Version: 1.12.1+cu116
CUDA Version: 11.6

  • nvidia-smi output:

Wed Nov 22 11:47:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1B.0 Off | 0 |
| N/A 29C P8 16W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:00:1C.0 Off | 0 |
| N/A 27C P0 38W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | 00000000:00:1D.0 Off | 0 |
| N/A 32C P0 38W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 25C P0 39W / 150W | 0MiB / 7680MiB | 75% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

  • Previously, I fitted this same model architecture (the one used in the paper) on the same dataset, downsampled to 128 x 128, on a single-GPU machine (14 GB). For that run I used the same platform (Code Ocean), and the code worked successfully.

In the current run, I add the second camera view and keep the frame size of the original data (304 x 288). It seems the code either cannot identify the other GPUs or does not make use of their memory (see the visibility check at the end of this list).

  • I tried running the integration test, and here is the final result:

================== Integration Test Results ==================

ae: passed
arhmm: passed
neural-ae: passed
neural-ae-me: passed
neural-labels: passed
neural-arhmm: passed
ae-multisession: passed
vae: passed
beta-tcvae: passed
cond-ae-msp: passed
cond-vae: passed
ps-vae: passed
msps-vae-multisession: passed
labels-images: passed

total time to perform integration test: 195.396645 sec


  • The code works properly in CPU mode on this data.

  • I tried "mem_limit_gb" values of 5, 6, 7, 8, and 24.0, and reduced "tt_n_gpu_trials" to 128. None of these helped.

  • The dataset consists of trials of varying length, with mean 1772 and std 604 frames per trial.
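
  • As a further sanity check for device visibility (plain PyTorch, not a BehaveNet API), a snippet like the following can confirm that the training process actually sees all four devices:

    # Confirm PyTorch enumerates all four M60s and their memory.
    import torch

    print(torch.cuda.device_count())  # expected: 4
    for i in range(torch.cuda.device_count()):
        p = torch.cuda.get_device_properties(i)
        print(i, p.name, round(p.total_memory / 1024**3, 2), "GiB")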

Despite these efforts, the issue persists. I would greatly appreciate any insights or suggestions you may have.
Thank you!


themattinthehatt (Owner) commented Nov 22, 2023

Hi @faezeamin ,
I have not tried the multi-GPU training in several years; I can test this out on my end after the Thanksgiving break and get back to you.
In the meantime, would it be possible for you to request a GPU with more memory from Code Ocean?

faezeamin (Author) commented

Thank you for your prompt response!
Yes, the model runs on a single GPU with 15.65 GB of memory. But I'm interested in exploring faster run-times with multiple GPUs, if feasible.
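
For what it's worth, my understanding of plain torch.nn.DataParallel (which I assume CustomDataParallel builds on; I have not verified this in the BehaveNet source) is that each batch is split across the visible devices, but outputs are gathered and gradients reduced on the first device, so GPU 0 tends to run out of memory before the others. A minimal sketch of that pattern, with a stand-in module:

    import torch
    import torch.nn as nn

    # Stand-in module; DataParallel wraps any nn.Module the same way.
    net = nn.Linear(128, 16)
    model = nn.DataParallel(net, device_ids=[0, 1, 2, 3]).cuda()

    x = torch.randn(64, 128).cuda()  # a batch of 64 is split into 4 chunks of 16
    out = model(x)                   # forward pass replicated across the 4 GPUs
    out.mean().backward()            # gradients are reduced onto device_ids[0]

If that is what is happening here, extra GPUs would speed up the forward/backward passes but not pool memory, which would match GPU 0 hitting OOM while the others sit idle.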

themattinthehatt (Owner) commented

@faezeamin sorry for not getting to this yet, haven't forgotten about it though

faezeamin (Author) commented

Hi @themattinthehatt - Just following up on this issue. Have you had a chance to look into the multiple-GPU analysis? Thanks, -Faeze
