
multiVI fails with invalid tensor #2581

Open
bio-la opened this issue Mar 1, 2024 · 9 comments · May be fixed by #2632

@bio-la

bio-la commented Mar 1, 2024

Hi scvi team, thanks for the excellent tools! I'm reaching out with an issue running MultiVI with scvi-tools versions newer than 0.20.3.
The error is the same as reported here and here.

I set up two actions to reproduce the error. I'm running the same script on the same toy dataset, changing only the scvi-tools version in the environment (you can see all the dependencies in the conda list step). The action runs on Ubuntu, but I have the same issue on macOS with an Intel chip.

This action, with scvi-tools=1.1.1, fails with the same error as described before:
https://github.com/DendrouLab/panpipes/actions/runs/8114022415/job/22178710942?pr=201

This action, with scvi-tools=0.20.3, runs successfully:
https://github.com/DendrouLab/panpipes/actions/runs/8114022411/job/22178710284?pr=201

thanks for your help!

@bio-la bio-la added the bug label Mar 1, 2024
@canergen
Contributor

canergen commented Mar 1, 2024

Hi, the errors you are referring to are about MPS on Apple hardware. We have seen the issue you are facing with MultiVI in newer scvi-tools versions. The main reason, in our hands, is that we changed the default seed in newer scvi-tools versions. Do you fix the seed and still face the problem? Can you share the toy dataset (how large is it, in number of cells and file size)? It would be interesting to explore which updates in the TrainingPlan remove these errors.
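
For reference, fixing the seed in scvi-tools is a one-liner; a minimal sketch (the value 0 is arbitrary) would be:

```python
import scvi

# Pin the global seed used by scvi-tools (python / numpy / torch RNGs)
# before setting up and training the model; 0 is an arbitrary choice.
scvi.settings.seed = 0
```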

@bio-la
Author

bio-la commented Mar 1, 2024

Thanks for your answer! Here is the toy dataset (approx. 2000 cells by 4000 features).

I didn't include a seed, but I will try that and let you know.

@bio-la
Author

bio-la commented Mar 1, 2024

Unfortunately, no success with an explicit scvi.settings.seed:
https://github.com/DendrouLab/panpipes/actions/runs/8114593332/job/22180588924?pr=201

@martinkim0
Contributor

Hi @bio-la, sorry you're running into this issue. It looks like the issue might be slightly different from the Discourse threads you linked, since the CI is running on Ubuntu, not macOS, so I'm guessing this is unrelated to a PyTorch MPS build.

Could you try passing in a lower learning rate (maybe lr=1e-5 or lr=1e-6) and see if that helps? Also, is this error occurring in the first epoch of training or later on? I wasn't able to find that info in the logs.
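
For readers following along: MULTIVI.train() exposes the learning rate directly (it also appears under training_args in the config further down), so a lower value can be passed roughly like this. This is a sketch only; `mvi` is an assumed, already constructed scvi.model.MULTIVI instance.

```python
# Sketch: `mvi` is assumed to be an scvi.model.MULTIVI model built on an
# AnnData object that has already been registered via setup_anndata().
mvi.train(max_epochs=500, lr=1e-5)  # try 1e-5 or 1e-6 instead of the default 1e-4
```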

@bio-la
Author

bio-la commented Mar 4, 2024

Thanks for your suggestions. Here are my comments:

  • I get the same error message on Ubuntu and on macOS with an Intel chip. I don't know why the effect would be the same as on M3 chips if, as you speculate, the cause is not the same.
  • The error appears at the first epoch of training.
  • Changing lr from 1e-3 to 1e-5 doesn't solve the issue; scvi-tools 0.20.3 still works with the lower lr.
  • I noticed that the conda install is now pulling scvi-tools 1.1.2, which still fails here.

I'm using this dataset with the following parameters for MultiVI:

MultiVI:
    batch_covariate: dataset
    model_args:
      n_hidden : None
      n_latent : None
      # (bool, default: True)
      region_factors : True
      # {'normal', 'ln'} (default: 'normal')
      latent_distribution : 'normal'
      # (bool, default: False)
      deeply_inject_covariates : False
      # (bool, default: False)
      fully_paired : False
    training_args:
      # (default: 500)
      max_epochs : 500
      # float (default: 0.0001)
      lr : 1.0e-05
      # leave blank for default, str | int | bool | None (default: None)
      use_gpu :
      # float (default: 0.9)
      train_size : 0.9
      # leave blank for default, float | None (default: None)
      validation_size :
      # int (default: 128)
      batch_size : 128
      # float (default: 0.001)
      weight_decay : 0.001
      # float (default: 1.0e-08)
      eps : 1.0e-08
      # bool (default: True)
      early_stopping : True
      # bool (default: True)
      save_best : True
      # leave blank for default, int | None (default: None)
      check_val_every_n_epoch :
      # leave blank for default, int | None (default: None)
      n_steps_kl_warmup :
      # int | None (default: 50)
      n_epochs_kl_warmup : 50
      # bool (default: True)
      adversarial_mixing : True
    # leave blank for default, dict | None (default: None)
    training_plan : None
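
For readers following along, here is a rough, hedged sketch of how this configuration maps onto plain scvi-tools calls. The AnnData object (`adata`), its "modality" column, and the gene/region counts are illustrative assumptions, not part of the issue; blank entries fall back to the scvi-tools defaults.

```python
import scvi

# Sketch only: `adata` is assumed to be a multiome AnnData whose .var has a
# "modality" column marking expression features vs. accessibility peaks; the
# column name and labels below are illustrative.
scvi.model.MULTIVI.setup_anndata(adata, batch_key="dataset")  # batch_covariate: dataset

n_genes = int((adata.var["modality"] == "Gene Expression").sum())
n_regions = int((adata.var["modality"] == "Peaks").sum())

mvi = scvi.model.MULTIVI(
    adata,
    n_genes=n_genes,
    n_regions=n_regions,
    # model_args from the config; n_hidden / n_latent left at their defaults
    region_factors=True,
    latent_distribution="normal",
    deeply_inject_covariates=False,
    fully_paired=False,
)

mvi.train(
    # training_args from the config; blank entries use scvi-tools defaults
    max_epochs=500,
    lr=1e-5,
    train_size=0.9,
    batch_size=128,
    weight_decay=0.001,
    eps=1e-8,
    early_stopping=True,
    save_best=True,
    n_epochs_kl_warmup=50,
    adversarial_mixing=True,
)
```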

@bio-la
Author

bio-la commented Mar 11, 2024

Hi @martinkim0, any news on this? Thanks!

@canergen
Contributor

We are on it. I have one solution (gradient clipping) that solves similar problems in totalVI (the AdversarialTrainingPlan is not stable). We first need to make sure that it doesn't reduce quality in downstream tasks.
I would assume adversarial_mixing=False will solve it for now in your tests, but it will reduce integration.
Last question: can you tell me the AnnData version used to save the object above? I had issues opening the file in my testing environment (likely an outdated AnnData on my end).

@canergen
Contributor

Hi @bio-la, currently we enable an adversarial classifier even if only a single batch is present in the dataset. This is a bug, and we will have a fix soonish. To work around the error for now, you can pass mvi.train(adversarial_mixing=False). Please let us know if you are still facing the issue.
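
Concretely, the suggested workaround is just a flag on the train call; a minimal sketch, assuming `mvi` is a MULTIVI model set up as in the configuration sketch above:

```python
# Workaround sketch: disable the adversarial classifier, which is currently
# enabled even for single-batch data. `mvi` is assumed to be an existing
# scvi.model.MULTIVI instance; other training arguments stay as before.
mvi.train(adversarial_mixing=False)
```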

@canergen canergen added this to the scvi-tools 1.1.x milestone Mar 26, 2024
@canergen canergen linked a pull request Mar 26, 2024 that will close this issue
@bio-la
Author

bio-la commented Mar 26, 2024

Awesome! I will wait for the changes to be merged and let you know. Thank you!
