KeyError: 'LOCAL_RANK' #24

Open
shoang22 opened this issue Sep 18, 2023 · 0 comments

shoang22 commented Sep 18, 2023

Hello,

When attempting to fine-tune in a multi-GPU SLURM environment, I get the following error:

Mon Sep 18 09:20:06 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:1A:00.0 Off |                    0 |
| N/A   31C    P0              44W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:1C:00.0 Off |                    0 |
| N/A   27C    P0              41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   28C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:1E:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
MASTER_ADDR=mg089
GpuFreq=control_disabled
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
N GPUS: 4
N NODES: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
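
Looking at the traceback, DeepSpeed reads LOCAL_RANK straight from the environment (deepspeed/runtime/engine.py), but srun only exports SLURM_* variables to each task, so nothing sets LOCAL_RANK. As a quick check (a diagnostic sketch only, assuming the standard per-task SLURM exports are visible inside the container), each task's environment can be inspected with:

# Hypothetical diagnostic; run inside the same job allocation.
srun bash -c 'echo "host=$(hostname) SLURM_PROCID=$SLURM_PROCID SLURM_LOCALID=$SLURM_LOCALID LOCAL_RANK=${LOCAL_RANK:-unset}"'

On this setup I would expect LOCAL_RANK to come back unset for every task, which is consistent with the KeyError above.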

My SLURM job script is as follows:

#!/bin/bash
#SBATCH --job-name=chemformer
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH --mem=64gb
#SBATCH --time=0:15:00
#SBATCH --output=slurm_out/output.%A_%a.txt

module load miniconda3
module load cuda/11.8.0
nvidia-smi

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

CONTAINER="singularity exec --nv $PROJ_DIR/singularity_images/chemformer.sif"

CMD="python -m molbart.fine_tune \
  --dataset uspto_50 \
  --data_path data/seq-to-seq_datasets/uspto_50.pickle \
  --model_path models/pre-trained/combined/step=1000000.ckpt \
  --task backward_prediction \
  --epochs 100 \
  --lr 0.001 \
  --schedule cycle \
  --batch_size 128 \
  --acc_batches 4 \
  --augment all \
  --aug_prob 0.5 \
  --gpus $SLURM_NTASKS_PER_NODE
  "

srun ${CONTAINER} ${CMD}

I am using the library versions specified in the documentation. Is this behavior expected?
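
In case it helps with triage, one possible workaround would be to derive the missing variables from the per-task values srun already exports, e.g. via a small wrapper script (a sketch only; wrapper.sh is hypothetical and assumes SLURM_LOCALID and SLURM_PROCID correspond to the local and global ranks DeepSpeed expects):

#!/bin/bash
# wrapper.sh -- hypothetical helper, not part of this repository.
# Map the per-task SLURM variables onto the names DeepSpeed reads from os.environ.
export LOCAL_RANK=$SLURM_LOCALID
export RANK=$SLURM_PROCID
exec "$@"

launched as srun ${CONTAINER} ./wrapper.sh ${CMD}. I have not confirmed whether this is the intended way to launch fine-tuning, so any guidance on the supported SLURM setup would be appreciated.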
