[v2 BUG]: LightningModule has parameters that were not used #853

Open · clue09 opened this issue May 2, 2024 · 9 comments · May be fixed by #883
Labels: bug (Something isn't working)

clue09 commented May 2, 2024

Describe the bug
When trying to train the example Chemprop model with:

chemprop train --data-path tests/data/regression.csv \
    --task-type regression \
    --output-dir train_example

I get the error message:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
Epoch 0: 50%|█████   | 1/2 [00:00<00:00, 2.58it/s, v_num=0, train_loss=0.827] 
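For reference, the two workarounds named in the message correspond roughly to the Trainer construction below (a sketch in plain Lightning, not the chemprop CLI; it only suppresses the symptom rather than removing the unused parameters):

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

# Let DDP tolerate parameters that receive no gradient in training_step
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,  # any multi-GPU run takes the DDP code path
    strategy=DDPStrategy(find_unused_parameters=True),
    # or equivalently: strategy="ddp_find_unused_parameters_true"
)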

Expected behavior
The trained model should be saved to train_example/model_0/best.pt. Instead, because training fails, that path only contains train_example/model_0/trainer_logs/version_0/hparams.yml.

Environment
Chemprop 2 installed from source using pip, Python 3.11.9, OS: Linux

Error Stack Trace

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/bin/chemprop", line 8, in <module>
[rank3]:     sys.exit(main())
[rank3]:              ^^^^^^
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/main.py", line 80, in main
[rank3]:     func(args)
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 72, in func
[rank3]:     main(args)
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 988, in main
[rank3]:     train_model(
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 864, in train_model
[rank3]:     trainer.fit(model, train_loader, val_loader)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
[rank3]:     call._call_and_handle_interrupt(
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank3]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]:     return function(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
[rank3]:     self._run(model, ckpt_path=ckpt_path)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
[rank3]:     results = self._run_stage()
[rank3]:               ^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
[rank3]:     self.fit_loop.run()
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[rank3]:     self.advance()
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[rank3]:     self.epoch_loop.run(self._data_fetcher)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[rank3]:     self.advance(data_fetcher)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
[rank3]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
[rank3]:     self._optimizer_step(batch_idx, closure)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank3]:     call._call_lightning_module_hook(
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
[rank3]:     output = fn(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/core/module.py", line 1303, in optimizer_step
[rank3]:     optimizer.step(closure=optimizer_closure)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/core/optimizer.py", line 152, in step
[rank3]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank3]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
[rank3]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
[rank3]:     return optimizer.step(closure=closure, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank3]:     return wrapped(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank3]:     out = func(*args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank3]:     ret = func(self, *args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/adam.py", line 148, in step
[rank3]:     loss = closure()
[rank3]:            ^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
[rank3]:     closure_result = closure()
[rank3]:                      ^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
[rank3]:     self._result = self.closure(*args, **kwargs)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
[rank3]:     step_output = self._step_fn()
[rank3]:                   ^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 318, in _training_step
[rank3]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank3]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
[rank3]:     output = fn(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
[rank3]:     return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 642, in __call__
[rank3]:     wrapper_output = wrapper_module(*args, **kwargs)
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1589, in forward
[rank3]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1480, in _pre_forward
[rank3]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank3]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@clue09 added the bug label May 2, 2024
@kevingreenman added this to the v2.0.1 milestone May 2, 2024
@kevingreenman (Member)

I can reproduce this issue on two Linux machines, using Python 3.11/3.12, torch 2.2/2.3, and lightning 2.2.4.

@davidegraff (Contributor)

Are these on machines with multiple GPUs? If so, what happens when you use --devices=0?

@kevingreenman (Member)

Yes, the machines have multiple GPUs. But when I try that, I get this error:

Traceback (most recent call last):
  File "/home/kpg/miniconda3/envs/chemprop-v2/bin/chemprop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/kpg/chemprop/chemprop/cli/main.py", line 80, in main
    func(args)
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 72, in func
    main(args)
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 988, in main
    train_model(
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 854, in train_model
    trainer = pl.Trainer(
              ^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 149, in __init__
    self._check_device_config_and_set_final_flags(devices=devices, num_nodes=num_nodes)
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 336, in _check_device_config_and_set_final_flags
    raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: `Trainer(devices='0')` value is not a valid input using cuda accelerator.

Perhaps we need to change how the devices argument is processed so it's passed as an int instead of a str?
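For reference, Lightning's devices conventions on a GPU machine look roughly like the illustrative Trainer calls below (not Chemprop's code): an int is a count of devices, a list selects specific indices, and the string "0" reads as a count of zero GPUs, which is why it is rejected.

import lightning.pytorch as pl

trainer_two_gpus = pl.Trainer(accelerator="gpu", devices=2)    # int  -> a count of devices
trainer_gpu_zero = pl.Trainer(accelerator="gpu", devices=[0])  # list -> specific device indices
# pl.Trainer(accelerator="gpu", devices="0")  # "0" -> zero devices -> MisconfigurationException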

@JacksonBurns (Member)

related: Lightning-AI/pytorch-lightning#17212

@shihchengli (Contributor)

After adding the following code block to the MPNN class, I found that there are two parameters (message_passing.W_d.weight and message_passing.W_d.bias) that are not used during training. The matrix $\mathbf{W}_d$ is only used when additional atomic descriptors are provided, based on the description here. Initializing $\mathbf{W}_d$ only when additional atomic descriptors are provided should fix this issue.

def on_after_backward(self):
    # After each backward pass, report parameters that received no gradient,
    # i.e. parameters that were not used to produce the loss.
    for name, param in self.named_parameters():
        if param.grad is None:
            print(name)

@davidegraff (Contributor)

I wonder if changing W_d = nn.Linear(...) to W_d = nn.LazyLinear(...) would fix this issue.

@shihchengli (Contributor) commented May 11, 2024

Using nn.LazyLinear causes another error:

RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

How about changing `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None` to `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None`? I have tested this change, and afterwards the model can be trained using DDP.
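Roughly, the proposed change inside the message-passing constructor (a sketch; the class and argument names below are illustrative stand-ins, with d_vd defaulting to 0 when no extra atom descriptors are given):

import torch.nn as nn

class MessagePassingSketch(nn.Module):  # illustrative stand-in, not the real Chemprop class
    def __init__(self, d_h: int, d_vd: int = 0):
        super().__init__()
        # before: `d_vd is not None` is True even for d_vd == 0, so W_d is created
        # but never used in the forward pass, and DDP complains.
        # self.W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None

        # after: only build W_d when extra atom descriptors are actually provided
        self.W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None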

However, another issue arises when running trainer.predict(): it cannot locate one of the saved checkpoints. I understand that DDP creates multiple independent processes, but does this result in different model weights by the end? Additionally, only one checkpoint is saved, which may also be related to the checkpointing settings we use. I think the fix for the unused parameters can be combined into a PR that addresses the issues with running on multiple GPUs. I can open another issue to describe this if it would be better.

FileNotFoundError: Checkpoint file not found: /home/gridsan/sli/packages/chemprop/train_example/model_0/checkpoints/best-epoch=33-val_loss=0.05.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Loaded model weights from the checkpoint at /home/gridsan/sli/packages/chemprop/train_example/model_0/checkpoints/best-epoch=33-val_loss=0.10.ckpt
[rank: 1] Child process with PID 3920163 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
Killed

@JacksonBurns (Member)

How about changing `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None` to `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None`?

@shihchengli I don't understand why the latter works but the former doesn't - if d_vd isn't None when there are no additional features, what is it? Some implicitly false thing, like an empty array?

I can open another issue to describe this if it would be better.

Yes please do break that out into another issue.

@shihchengli (Contributor)

@JacksonBurns It would be an integer 0.
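i.e., a tiny illustration of the difference between the two guards when no extra atom descriptors are passed:

d_vd = 0                 # no additional atom descriptors
print(d_vd is not None)  # True  -> the old guard still builds W_d
print(bool(d_vd))        # False -> `if d_vd` skips it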
