
Cannot explain Shape Mismatch #120

Open
MaxH1996 opened this issue Oct 4, 2021 · 7 comments
Labels
question Further information is requested

Comments

MaxH1996 commented Oct 4, 2021

Hi, I am currently working with the torchdyn package and I am getting an error that I cannot really explain:

File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/sensitivity.py", line 152, in backward
    t_adj_sol, A = odeint(adjoint_dynamics, A, t_span[i - 1:i + 1].flip(0), solver, atol=atol, rtol=rtol)
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/odeint.py", line 87, in odeint
    dt = init_step(f, k1, x, t, solver.order, atol, rtol)
  File "/home/maxh/miniconda3/envs/deepqmc/lib/python3.8/site-packages/torchdyn/numerics/utils.py", line 39, in init_step
    d0, d1 = hairer_norm(x0 / scale), hairer_norm(f0 / scale)
RuntimeError: The size of tensor a (1203142) must match the size of tensor b (1206497) at non-singleton dimension 0

I know this error is specific to my particular code and usage of torchdyn, but mainly I am interested in why this mismatch occurs. The shapes of x0 and f0 that I input are both [8000, 3], so I do not understand how I can get a tensor of size (1203142) or (1206497). It appears to happen in the backpropagation step, because a plain forward pass runs without any errors.

Do you have any idea why this might occur?

MaxH1996 added the question label Oct 4, 2021
Zymrael (Member) commented Oct 4, 2021

This error is happening while solving the adjoint dynamics for your net. The key lines are 47 onwards in torchdyn/numerics/sensitivity.py:

xT, λT, μT = sol[-1], grad_output[-1][-1], torch.zeros_like(vf_params)

which are then concatenated and flattened into the adjoint state. Does the resulting size match (1203142) or (1206497) for your specific network architecture? The error itself surfaces at your init step (line 39 of utils.py in the traceback).
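As a rough sanity check, here is a sketch under the assumption that the adjoint state is the flattened concatenation above (not torchdyn's exact internals):

# Hypothetical helper: x and λ each contribute x.numel() elements,
# μ contributes one element per vector-field parameter.
def expected_adjoint_numel(x0, vf_params):
    return 2 * x0.numel() + vf_params.numel()

# For x0 of shape [8000, 3]: 2 * 24000 + n_params = 48000 + n_params.
# Checking which of (1203142) and (1206497) this reproduces tells you
# which side is missing elements.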

Could you share (at a high level) what your f is?

MaxH1996 (Author) commented Oct 4, 2021

Thanks for your quick response! It is actually quite hard to share my f because there is a whole lot going on, but here is the class that I call:


import torch.nn as nn

class Func(nn.Module):
    def __init__(self, nuc, up, down, neural_net=Net):
        super().__init__()
        # use the neural_net argument rather than hard-coding Net
        self.net = neural_net(nuc, up, down)

    def forward(self, t, x, rn, batch_dim, n_elec):
        # unflatten the ODE state into per-electron coordinates for the network,
        # then flatten back so the output shape matches the input state
        x = x.reshape(batch_dim, n_elec, 3)
        _, _, x = self.net(x, rn)
        return x.reshape(batch_dim * n_elec, 3)

Not sure if that helps at all. I then construct the NeuralODE, and Func is called via functools.partial to bind the extra arguments. What I did see is that the mismatch is in f0, while x0 has the correct shape at line 39 in init_step; correct in the sense that it matches the variable scale.
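The wiring looks roughly like this (a simplified sketch; the exact keyword for registering optimizable parameters may differ across torchdyn versions):

from functools import partial
from torchdyn.core import NeuralODE

func = Func(nuc, up, down)
# bind the extra arguments so the callable exposes the (t, x) signature odeint expects
vf = partial(func.forward, rn=rn, batch_dim=batch_dim, n_elec=n_elec)
# the partial hides Func's nn.Parameters, so they are handed over explicitly
model = NeuralODE(vf, sensitivity='adjoint', optimizable_params=func.parameters())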

I'd have to check exactly whether the flattening and concatenation match for my architecture, but I think those numbers would make sense.

Btw, if I use the normal odeint without the adjoint I do not get this problem.

Zymrael (Member) commented Oct 9, 2021

Identifying what the difference 1206497 - 1203142 = 3355 represents in terms of elements is key here. The size 1206497 is determined during initialization of the adjoint as a concatenation of

xT, λT, μT = sol[-1], grad_output[-1][-1], torch.zeros_like(vf_params)

whereas 1203142 is produced as the output of f_ in sensitivity.py. My guess is that the difference comes from a set of parameters that is registered with the vector field (and is thus counted at initialization) but is not included in self.vf_params:

  • What parameters do you pass to optimizable_parameters?
  • Are there parameters that get registered even if you pass your partial without optimizable_parameters?
  • If so, does it help to include those registered but non-optimizable parameters in self.vf_params at init? (A quick way to compare the counts is sketched below.)
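A quick diagnostic along these lines (illustrative only; func stands for the module whose parameters you pass in):

n_passed = sum(p.numel() for p in func.parameters())   # what goes to optimizable_parameters
n_buffers = sum(b.numel() for b in func.buffers())     # registered but non-trainable state
print(n_passed, n_buffers)
# With x0 of shape [8000, 3], the adjoint state should hold 48000 + n_params elements;
# checking which of 1203142 and 1206497 that reproduces localizes the missing 3355.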

MaxH1996 (Author) commented

This is basically the issue I have been trying to work out too (referring to the 3355 difference in parameters). To your points:

  • The optimizable parameters that I pass are from Func, so Func.parameters().
  • If I do not register optimizable parameters in NeuralODE, I only get the message "Your vector field does not have nn.Parameters to optimize."
  • I am not really sure what you mean by your last point.

Another thing I wanted to ask: my neural net uses second derivatives; specifically, self.net computes Laplacians. Does this pose a problem for the adjoint method?

MaxH1996 (Author) commented

Hey, I was wondering if you had any more thoughts on this issue. I didn't have time in the last couple of weeks to work on it, but I am coming back to it now and am still experiencing this mismatch in shapes. I checked the places where you suggested the difference might come from, but they are the same at those two locations.

Zymrael (Member) commented Dec 13, 2021

I'd be happy to take a look at the model if you can share it in private. To determine where the issue lies, I would only need access to the nn.Module that defines your input -> output map.

data-hound commented Mar 27, 2024

Hi @Zymrael

I am encountering the same issue. Here is my network, along with the input shape, and how I am creating the NeuralODE:

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 10, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.maxpool(self.relu(self.conv1(x)))
        x = self.maxpool(self.relu(self.conv2(x)))
        x = self.relu(self.conv3(x))
        print('here')
        print(x.shape)
        return x

model = NeuralODE(SimpleCNN())
# warns: Your vector field callable (nn.Module) should have both time `t` and state `x` as arguments, we've wrapped it for you.

t_span = torch.linspace(0, 1, 100)
t_eval, trajectory = model(next(iter(train_loader))[0], t_span)
trajectory = trajectory.detach()
next(iter(train_loader))[0].shape
# torch.Size([64, 1, 32, 32])

The error message:

RuntimeError: The size of tensor a (8) must match the size of tensor b (32) at non-singleton dimension 3

Semi-complete stack trace:

here
torch.Size([64, 10, 8, 8])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-41-5705b2264547> in <cell line: 2>()
      1 t_span = torch.linspace(0,1,100)
----> 2 t_eval, trajectory = model(next(iter(train_loader))[0], t_span)
      3 trajectory = trajectory.detach()

6 frames
/usr/local/lib/python3.10/dist-packages/torchdyn/numerics/utils.py in init_step(f, f0, x0, t0, order, atol, rtol)
     37 def init_step(f, f0, x0, t0, order, atol, rtol):
     38     scale = atol + torch.abs(x0) * rtol
---> 39     d0, d1 = hairer_norm(x0 / scale), hairer_norm(f0 / scale)
     40 
     41     if d0 < 1e-5 or d1 < 1e-5:

RuntimeError: The size of tensor a (8) must match the size of tensor b (32) at non-singleton dimension 3
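Note: in this trace, x0 is the input batch of shape [64, 1, 32, 32] while f0 is the network's output of shape [64, 10, 8, 8], so the vector field is not shape-preserving; an ODE's f(t, x) must return dx/dt with the same shape as x. A minimal shape-preserving variant of the network above, purely as an illustration:

import torch
import torch.nn as nn
from torchdyn.core import NeuralODE

class ShapePreservingCNN(nn.Module):
    # vector field whose output shape equals its input shape: no pooling,
    # and the channel count returns to the input's single channel
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 1, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.conv2(self.relu(self.conv1(x)))

model = NeuralODE(ShapePreservingCNN())
t_span = torch.linspace(0, 1, 100)
t_eval, trajectory = model(torch.randn(64, 1, 32, 32), t_span)  # f0 now matches x0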
