Resume training causes output folder File exists error #18

Open
skymanaditya1 opened this issue Jun 1, 2022 · 1 comment

Comments

@skymanaditya1

Hi, I am trying to resume training a model from an existing experiment using the following command: python src/infra/launch.py hydra.run.dir=. exp_suffix=my_experiment_name env=local dataset=ffs dataset.resolution=256 num_gpus=4 training.resume=latest

The launcher recognizes that it needs to resume training and prints: "We are going to resume the training and the experiment already exists. That's why the provided config/training_cmd are discarded and the project dir is not created". However, it still attempts to create the output folder, where all the intermediate checkpoints, inferred images, and videos are stored, and fails because that folder already exists. If I delete the output directory, it creates a new one but starts training from scratch, which is not what I expect when resuming.

Can you help me understand what I am doing wrong? Below is the full output, including the stack trace.

<=== TRAINING COMMAND START ===>
TORCH_EXTENSIONS_DIR=/tmp/torch_extensions cd /ssd_scratch/cvit/aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_50_how2sign-7870d59 && /ssd_scratch/cvit/aditya1/stylegan-v/env/bin/python src/train.py hydra.run.dir=. hydra.output_subdir=null hydra/job_logging=disabled hydra/hydra_logging=disabled
<=== TRAINING COMMAND END ===>
We are going to resume the training and the experiment already exists. That's why the provided config/training_cmd are discarded and the project dir is not created.

Training config is located in `experiment_config.yaml`

Output directory:   /ssd_scratch/cvit/aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_50_how2sign-7870d59/output
Training data:      data/how2sign_faces_styleganv_resized
Training duration:  25000 kimg
Number of GPUs:     2
Number of videos:   10000
Image resolution:   256
Conditional model:  False
Dataset x-flips:    True

Creating output directory...
Traceback (most recent call last):
  File "/ssd_scratch/cvit/aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_50_how2sign-7870d59/src/train.py", line 451, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/ssd_scratch/cvit/aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_50_how2sign-7870d59/src/train.py", line 437, in main
    os.makedirs(args.run_dir, exist_ok=args.resume_whole_state)
  File "/ssd_scratch/cvit/aditya1/stylegan-v/env/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/ssd_scratch/cvit/aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_50_how2sign-7870d59/output'
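
The failing call at line 437 can be reproduced in isolation. Below is a minimal sketch, assuming only what the traceback shows: `os.makedirs(path, exist_ok=flag)` raises `FileExistsError` when the directory already exists and the flag is false. The helper name `create_run_dir` and the temporary paths are hypothetical and not part of the repository.

```python
import os
import tempfile


def create_run_dir(run_dir: str, resume_whole_state: bool) -> bool:
    """Mimic the call from the traceback.

    Returns True if the directory could be created (or reused),
    False if os.makedirs raised FileExistsError.
    """
    try:
        os.makedirs(run_dir, exist_ok=resume_whole_state)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for the experiment's existing "output" folder.
    run_dir = os.path.join(root, "output")
    os.makedirs(run_dir)

    # Resume flag evaluated as False: the existing folder triggers the error.
    fails_without_resume = not create_run_dir(run_dir, resume_whole_state=False)

    # Resume flag evaluated as True: the existing folder is reused.
    succeeds_with_resume = create_run_dir(run_dir, resume_whole_state=True)
```

So the error itself suggests that the value passed as `exist_ok` was false at that point, even though resuming was requested.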

skymanaditya1 commented Jun 1, 2022

I did some quick debugging, and the problem seems to come from the value of the parameter c.resume in the train.py file. Passing training.resume=latest should set c.resume to "latest", but that is not what happens: even with the flag set, the value of c.resume is False. As a temporary workaround, I explicitly set c.resume to True, since I did not want to dig too deep into the configs myself.

I suspect this happens because, when the file experiment_config.yaml is read, the value of c.resume remains None. This is what the output of cfg.training looks like for me:
{'outdir': '${project_release_dir}', 'data': '${dataset.path}', 'gpus': '${num_gpus}', 'cfg': 'auto', 'snap': 50, 'kimg': 25000, 'metrics': ['fvd2048_16f', 'fvd2048_128f', 'fvd2048_128f_subsample8f', 'fid50k_full'], 'aug': 'ada', 'mirror': True, 'batch_size': 8, 'resume': None, 'seed': 0, 'dry_run': False, 'cond': False, 'subset': None, 'p': None, 'target': 0.6, 'augpipe': 'bgc', 'freezed': 0, 'fp32': False, 'nhwc': False, 'nobench': False, 'allow_tf32': False, 'num_workers': 3}
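
The suspected mechanism can be illustrated with plain dictionaries. This is only a sketch under assumptions: the saved config contains `'resume': None` as in the dump above, and the command-line overrides are dropped on resume, as the "provided config/training_cmd are discarded" message suggests (the log also reports 2 GPUs although the command asked for num_gpus=4, which would be consistent with that). The function `resolve_resume` is hypothetical, and the actual logic in train.py may differ.

```python
from typing import Optional


def resolve_resume(saved_cfg: dict, cli_overrides: Optional[dict] = None) -> Optional[str]:
    """Return the effective 'resume' value after applying optional overrides.

    Hypothetical helper: overrides, when present, take precedence over
    the values stored in the saved experiment config.
    """
    merged = dict(saved_cfg)
    merged.update(cli_overrides or {})
    return merged.get("resume")


# Subset of the saved training config shown in the dump above.
saved_training_cfg = {"resume": None, "kimg": 25000, "snap": 50}

# Case 1: overrides are discarded, so only the saved config is used.
discarded = resolve_resume(saved_training_cfg)

# Case 2: the command-line override is applied on top of the saved config.
applied = resolve_resume(saved_training_cfg, {"resume": "latest"})

# A resume check of this kind stays False unless the override survives.
resume_flag_discarded = discarded == "latest"
resume_flag_applied = applied == "latest"
```

If this is what happens, the saved experiment_config.yaml would need to carry the resume value itself, or the override would have to be applied after the saved config is loaded.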
