Bug description

When using 2 GPUs on a single node, or multiple GPUs across multiple nodes, training does not start even though the job keeps running. I use an Apptainer container to deploy the environment and SLURM to schedule the job. Is there a specific cluster/SLURM configuration required to make this work?

What version are you seeing the problem on?

v2.2

How to reproduce the bug
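script.py

The script itself was not attached to this report. As a stand-in, the following is a minimal sketch of a Lightning CIFAR-10 training script consistent with the log output below (it prints SLURM_NNODES / SLURM_NTASKS_PER_NODE and downloads CIFAR-10 on each rank). The class name, model, and hyperparameters are illustrative guesses, not the reporter's actual code.

# script.py -- minimal sketch, NOT the reporter's actual code
import os

import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

import lightning as L


class LitClassifier(L.LightningModule):
    """Illustrative CIFAR-10 classifier (ResNet-18)."""

    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18(num_classes=10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.05, momentum=0.9)


if __name__ == "__main__":
    # These two prints match the .out log below.
    print("SLURM_NNODES:", os.environ.get("SLURM_NNODES"), flush=True)
    print("SLURM_NTASKS_PER_NODE:", os.environ.get("SLURM_NTASKS_PER_NODE"), flush=True)

    train_set = torchvision.datasets.CIFAR10(
        "data", train=True, download=True, transform=T.ToTensor()
    )
    train_loader = DataLoader(train_set, batch_size=256, num_workers=4)

    # devices / num_nodes must line up with --ntasks-per-node / --nodes
    # in the sbatch script below.
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,
        num_nodes=1,
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(LitClassifier(), train_loader)

With strategy="ddp" under SLURM, devices and num_nodes have to match --ntasks-per-node and --nodes in the submission script; a mismatch between the two is a common reason for Lightning jobs hanging at startup under SLURM.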
sbatch submission script:

#!/bin/sh
#SBATCH --job-name=cifar-lit-2GPU
...
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:2
#SBATCH --mem=4GB
#SBATCH --output=slurm-%x-%j.out
#SBATCH --error=slurm-%x-%j.err

# Start measuring execution time
start_time=$(date +%s)

export APPTAINER_HOME=/tudelft.net/staff-umbrella/reit/apptainer
export APPTAINER_NAME=pytorch2.2.1-cuda12.1.sif

# Check that the container file exists
if [ ! -f "$APPTAINER_HOME/$APPTAINER_NAME" ]; then
    ls "$APPTAINER_HOME/$APPTAINER_NAME"
    exit 1
fi

# Load a CUDA module compatible with the container libraries
module use /opt/insy/modulefiles
module load cuda/12.1

# Run script
srun apptainer exec \
    --nv \
    --env-file ~/.env \
    -B /home/:/home/ \
    -B /tudelft.net/:/tudelft.net/ \
    $APPTAINER_HOME/$APPTAINER_NAME \
    python script.py

# End measuring execution time
end_time=$(date +%s)
elapsed_time=$((end_time - start_time))
echo "Elapsed time: $elapsed_time seconds"
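One thing worth verifying with this submission script: Lightning only engages its SLURM integration when the scheduler's environment variables are visible inside the container. A quick check, run through the same apptainer exec line (this snippet is a suggestion, not part of the original report):

import os
from lightning.pytorch.plugins.environments import SLURMEnvironment

# True only if SLURM_NTASKS etc. survived into the container environment.
# If this prints False, Lightning falls back to launching its own worker
# subprocesses, which can conflict with the two tasks srun already started.
print("SLURM detected:", SLURMEnvironment.detect())
print("SLURM_NTASKS:", os.environ.get("SLURM_NTASKS"))
print("SLURM_PROCID:", os.environ.get("SLURM_PROCID"))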
Error messages and logs
==> slurm-cifar-lit-2GPU-10065210.err <==
HPU available: False, using: 0 HPUs
/opt/conda/envs/__apptainer__/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A40') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
==> slurm-cifar-lit-2GPU-10065210.out <==
SLURM_NNODES: 1
SLURM_NTASKS_PER_NODE: 2
SLURM_NNODES: 1
SLURM_NTASKS_PER_NODE: 2
Files already downloaded and verified
Files already downloaded and verified
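The .err log stops right after "All distributed processes registered", i.e. rendezvous completed but no training step is ever reached, which usually means the first NCCL collective or the first batch never completes. A standalone check can separate the two cases; debug_dist.py below is a hypothetical helper, launched with the same srun apptainer exec line as script.py:

# debug_dist.py -- hypothetical standalone NCCL check; launch it with the
# exact same `srun apptainer exec ... python debug_dist.py` line as above.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

print(f"rank={rank} local_rank={local_rank} world_size={world_size} "
      f"node={os.environ.get('SLURMD_NODENAME')}", flush=True)

# Single-node only: every task runs on the same host, so its own node name
# is a valid rendezvous address. For multi-node, use the first entry of
# `scontrol show hostnames "$SLURM_JOB_NODELIST"` instead.
os.environ.setdefault("MASTER_ADDR", os.environ["SLURMD_NODENAME"])
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# If this all_reduce returns, rendezvous and NCCL traffic both work and the
# hang is higher up (e.g. in the DataLoader); if it blocks, the problem is
# NCCL or the interconnect. Exporting NCCL_DEBUG=INFO makes NCCL log every
# step of the collective setup.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank={rank} all_reduce ok, sum={t.item()}", flush=True)
dist.destroy_process_group()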
Environment