
[Training Scripts] Distributed Training Script Python Argument Incorrect. #1608

tjtanaa opened this issue Apr 22, 2024 · 1 comment
tjtanaa commented Apr 22, 2024

When running the command sh scripts/dist_train.sh 4 --cfg_file ..., I get the following error:

further instructions

  warnings.warn(
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] 
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=0
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=3
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=2
train.py: error: unrecognized arguments: --local-rank=1
[2024-04-22 13:23:08,052] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 

It seems the error comes from a mismatch between the argument name that torch.distributed passes to train.py and the one train.py defines: newer PyTorch launchers (2.0+) pass --local-rank (with a hyphen), while train.py only accepts --local_rank (with an underscore). The argument --local_rank should be renamed to --local-rank.

Suggested fix:
train.py line 36

    parser.add_argument('--local-rank', type=int, default=0, help='local rank for distributed training')
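
For reference, a more defensive variant (not taken from the repository, just a minimal sketch of how the argparse call could look) is to register both spellings for the same destination, so the script keeps working with launchers that pass either form:

    import argparse

    parser = argparse.ArgumentParser()
    # Register both option strings for the same destination, so the script
    # accepts --local_rank=N (older torch.distributed.launch) as well as
    # --local-rank=N (torchrun / torch.distributed.run in PyTorch 2.0+).
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int, default=0,
                        help='local rank for distributed training')

    args = parser.parse_args(['--local-rank=2'])   # the hyphenated form now parses
    print(args.local_rank)                         # -> 2

Alternatively, torchrun also exports the rank in the LOCAL_RANK environment variable, so train.py could read int(os.environ.get('LOCAL_RANK', 0)) instead of depending on the command-line flag; which option fits best depends on how the rest of the script uses args.local_rank.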

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label May 23, 2024