
Adafactor fails to run on a custom (rfs) resnet12 (with MAML) #405

Open
brando90 opened this issue Dec 3, 2021 · 3 comments

Comments


brando90 commented Dec 3, 2021

I was trying Adafactor, but I get the following error:

args.scheduler=None
--------------------- META-TRAIN ------------------------
Starting training!
Traceback (most recent call last):
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 441, in <module>
    main_resume_from_checkpoint(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 403, in main_resume_from_checkpoint
    run_training(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 413, in run_training
    meta_train_fixed_iterations(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 233, in meta_train_fixed_iterations
    args.outer_opt.step()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 191, in step
    self._approx_sq_grad(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 116, in _approx_sq_grad
    (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1))
RuntimeError: The size of tensor a (3) must match the size of tensor b (64) at non-singleton dimension 1

Training runs with PyTorch's default Adam, so why does this optimizer fail?

related:
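A minimal sketch that should reproduce the failure outside of MAML, assuming torch_optimizer's Adafactor and a single Conv2d layer whose weight shape [64, 3, 3, 3] matches the sizes in the error message (this is a reconstruction from the traceback, not code from the original run):

import torch
import torch.nn as nn
import torch_optimizer

# Hypothetical reproduction: a Conv2d weight has 4 dimensions, so Adafactor
# takes its factored second-moment path, where the row/column statistics
# no longer broadcast against each other.
model = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)  # weight: [64, 3, 3, 3]
opt = torch_optimizer.Adafactor(model.parameters())

x = torch.randn(2, 3, 32, 32)
model(x).sum().backward()
opt.step()  # expected: RuntimeError about mismatched tensor sizes in _approx_sq_grad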

@ionutmodo

Are there any updates on this? The issue is still present.


ionutmodo commented Jul 10, 2023

I had a look at this error, which I also faced when training a ResNet-50 model. I got a similar error to @brando90's, except that the dimensions of my tensors were different. Here is how I managed to fix it.

First of all, the exception is raised from here, where the tensor exp_avg_sq_row is divided by its mean over the last dimension. In my case, exp_avg_sq_row has size [64, 3, 7]. The mean over the last dimension, exp_avg_sq_row.mean(dim=-1), has size [64, 3], and since the two shapes cannot be broadcast together, the division raises the RuntimeError.

The solution is to unsqueeze the mean tensor: instead of (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1)), compute (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1).unsqueeze(-1)).
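As a standalone illustration of the shapes involved (a sketch using the [64, 3, 7] example above, not the library code itself):

import torch

exp_avg_sq_row = torch.rand(64, 3, 7)  # factored row statistics for a [64, 3, 7, 7] weight

# mean(dim=-1) has shape [64, 3]; dividing [64, 3, 7] by it fails because the
# trailing dimensions (7 vs. 3) cannot be broadcast together.
# exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1)  # RuntimeError

# Unsqueezing restores a trailing singleton dimension, [64, 3, 1], which broadcasts.
fixed = exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1).unsqueeze(-1)
print(fixed.shape)  # torch.Size([64, 3, 7])

An equivalent fix is exp_avg_sq_row.mean(dim=-1, keepdim=True), which keeps the reduced dimension as size 1.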

@Xynonners

This still happens; could someone open a pull request?
