
[Training Scripts] Distributed Training Script Python Argument Incorrect. #1608

tjtanaa opened this issue Apr 22, 2024 · 1 comment
tjtanaa commented Apr 22, 2024

When running the command sh scripts/dist_train.sh 4 --cfg_file ..., I get the following error:

further instructions

  warnings.warn(
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] 
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=0
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=3
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                [--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=2
train.py: error: unrecognized arguments: --local-rank=1
[2024-04-22 13:23:08,052] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 

It seems the error comes from a mismatch between the argument name that torch.distributed passes to train.py and the one train.py defines: newer PyTorch launchers (2.0+) pass --local-rank (with a hyphen), while train.py only accepts --local_rank (with an underscore). The argument --local_rank should be renamed to --local-rank.

Suggested fix:
train.py line 36

    parser.add_argument('--local-rank', type=int, default=0, help='local rank for distributed training')
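
For reference, a more defensive variant (not taken from the repository, just a minimal sketch of how the argparse call could look) is to register both spellings for the same destination, so the script keeps working with launchers that pass either form:

    import argparse

    parser = argparse.ArgumentParser()
    # Register both option strings for the same destination, so the script
    # accepts --local_rank=N (older torch.distributed.launch) as well as
    # --local-rank=N (torchrun / torch.distributed.run in PyTorch 2.0+).
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int, default=0,
                        help='local rank for distributed training')

    args = parser.parse_args(['--local-rank=2'])   # the hyphenated form now parses
    print(args.local_rank)                         # -> 2

Alternatively, torchrun also exports the rank in the LOCAL_RANK environment variable, so train.py could read int(os.environ.get('LOCAL_RANK', 0)) instead of depending on the command-line flag; which option fits best depends on how the rest of the script uses args.local_rank.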

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label May 23, 2024