[v2 BUG]: LightningModule has parameters that were not used #853

Open · clue09 opened this issue May 2, 2024 · 9 comments · May be fixed by #883
Labels: bug (Something isn't working)

clue09 commented May 2, 2024

Describe the bug
When trying to train the example Chemprop model with:

chemprop train --data-path tests/data/regression.csv \
    --task-type regression \
    --output-dir train_example

I get the error message:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
Epoch 0: 50%|█████   | 1/2 [00:00<00:00, 2.58it/s, v_num=0, train_loss=0.827] 
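For reference, the two workarounds named in the message correspond roughly to the Trainer construction below (a sketch in plain Lightning, not the chemprop CLI; it only suppresses the symptom rather than removing the unused parameters):

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

# Let DDP tolerate parameters that receive no gradient in training_step
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,  # any multi-GPU run takes the DDP code path
    strategy=DDPStrategy(find_unused_parameters=True),
    # or equivalently: strategy="ddp_find_unused_parameters_true"
)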

Expected behavior
The trained model should be saved to train_example/model_0/best.pt. Instead, because training fails, that path only contains train_example/model_0/trainer_logs/version_0/hparams.yml.

Environment
Chemprop 2 installed from source using pip, Python 3.11.9, OS: Linux

Error Stack Trace

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/bin/chemprop", line 8, in <module>
[rank3]:     sys.exit(main())
[rank3]:              ^^^^^^
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/main.py", line 80, in main
[rank3]:     func(args)
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 72, in func
[rank3]:     main(args)
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 988, in main
[rank3]:     train_model(
[rank3]:   File "/home/askuhn/chemprop/chemprop/cli/train.py", line 864, in train_model
[rank3]:     trainer.fit(model, train_loader, val_loader)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
[rank3]:     call._call_and_handle_interrupt(
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank3]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]:     return function(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
[rank3]:     self._run(model, ckpt_path=ckpt_path)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
[rank3]:     results = self._run_stage()
[rank3]:               ^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
[rank3]:     self.fit_loop.run()
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[rank3]:     self.advance()
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[rank3]:     self.epoch_loop.run(self._data_fetcher)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[rank3]:     self.advance(data_fetcher)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
[rank3]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
[rank3]:     self._optimizer_step(batch_idx, closure)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank3]:     call._call_lightning_module_hook(
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
[rank3]:     output = fn(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/core/module.py", line 1303, in optimizer_step
[rank3]:     optimizer.step(closure=optimizer_closure)
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/core/optimizer.py", line 152, in step
[rank3]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank3]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
[rank3]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
[rank3]:     return optimizer.step(closure=closure, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank3]:     return wrapped(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank3]:     out = func(*args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank3]:     ret = func(self, *args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/optim/adam.py", line 148, in step
[rank3]:     loss = closure()
[rank3]:            ^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
[rank3]:     closure_result = closure()
[rank3]:                      ^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
[rank3]:     self._result = self.closure(*args, **kwargs)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
[rank3]:     step_output = self._step_fn()
[rank3]:                   ^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 318, in _training_step
[rank3]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank3]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
[rank3]:     output = fn(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
[rank3]:     return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 642, in __call__
[rank3]:     wrapper_output = wrapper_module(*args, **kwargs)
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1589, in forward
[rank3]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/askuhn/miniforge3/envs/chemprop/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1480, in _pre_forward
[rank3]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank3]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@clue09 added the bug label May 2, 2024
@kevingreenman added this to the v2.0.1 milestone May 2, 2024
@kevingreenman (Member)

I can reproduce this issue on two Linux machines, using Python 3.11/3.12, torch 2.2/2.3, and lightning 2.2.4.

@davidegraff (Contributor)

Are these on machines with multiple GPUs? If so, what happens when you use --devices=0?

@kevingreenman (Member)

Yes, the machines have multiple GPUs. But when I try that, I get this error:

Traceback (most recent call last):
  File "/home/kpg/miniconda3/envs/chemprop-v2/bin/chemprop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/kpg/chemprop/chemprop/cli/main.py", line 80, in main
    func(args)
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 72, in func
    main(args)
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 988, in main
    train_model(
  File "/home/kpg/chemprop/chemprop/cli/train.py", line 854, in train_model
    trainer = pl.Trainer(
              ^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 149, in __init__
    self._check_device_config_and_set_final_flags(devices=devices, num_nodes=num_nodes)
  File "/home/kpg/miniconda3/envs/chemprop-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 336, in _check_device_config_and_set_final_flags
    raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: `Trainer(devices='0')` value is not a valid input using cuda accelerator.

Perhaps we need to change how the devices argument is processed so it's passed as an int instead of a str?
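For reference, Lightning's devices conventions on a GPU machine look roughly like the illustrative Trainer calls below (not Chemprop's code): an int is a count of devices, a list selects specific indices, and the string "0" reads as a count of zero GPUs, which is why it is rejected.

import lightning.pytorch as pl

trainer_two_gpus = pl.Trainer(accelerator="gpu", devices=2)    # int  -> a count of devices
trainer_gpu_zero = pl.Trainer(accelerator="gpu", devices=[0])  # list -> specific device indices
# pl.Trainer(accelerator="gpu", devices="0")  # "0" -> zero devices -> MisconfigurationException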

@JacksonBurns (Member)

related: Lightning-AI/pytorch-lightning#17212

@shihchengli (Contributor)

After adding the following code block to the MPNN class, I found that there are two parameters (message_passing.W_d.weight and message_passing.W_d.bias) that are not used during training. The matrix $\mathbf{W}_d$ is only used when additional atomic descriptors are provided, based on the description here. Initializing $\mathbf{W}_d$ only when additional atomic descriptors are provided should fix this issue.

def on_after_backward(self):
    # After each backward pass, report parameters that received no gradient,
    # i.e. parameters that were not used to produce the loss.
    for name, param in self.named_parameters():
        if param.grad is None:
            print(name)

@davidegraff (Contributor)

I wonder if changing W_d = nn.Linear(...) to W_d = nn.LazyLinear(...) would fix this issue.

@shihchengli (Contributor) commented May 11, 2024

Using nn.LazyLinear causes another error:

RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

How about changing `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None` to `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None`? I have tested this change, and afterwards the model can be trained using DDP.
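Roughly, the proposed change inside the message-passing constructor (a sketch; the class and argument names below are illustrative stand-ins, with d_vd defaulting to 0 when no extra atom descriptors are given):

import torch.nn as nn

class MessagePassingSketch(nn.Module):  # illustrative stand-in, not the real Chemprop class
    def __init__(self, d_h: int, d_vd: int = 0):
        super().__init__()
        # before: `d_vd is not None` is True even for d_vd == 0, so W_d is created
        # but never used in the forward pass, and DDP complains.
        # self.W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None

        # after: only build W_d when extra atom descriptors are actually provided
        self.W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None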

However, another issue arises when running trainer.predict(): it cannot locate one of the saved checkpoints. I understand that DDP creates multiple independent processes, but does this result in different model weights by the end? Additionally, only one checkpoint is saved, which may also be related to the checkpointing settings we use. I think the fix for the unused parameters can be combined into a PR that addresses the issues with running on multiple GPUs. I can open another issue to describe this if it would be better.

FileNotFoundError: Checkpoint file not found: /home/gridsan/sli/packages/chemprop/train_example/model_0/checkpoints/best-epoch=33-val_loss=0.05.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Loaded model weights from the checkpoint at /home/gridsan/sli/packages/chemprop/train_example/model_0/checkpoints/best-epoch=33-val_loss=0.10.ckpt
[rank: 1] Child process with PID 3920163 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
Killed

@JacksonBurns (Member)

How about changing `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd is not None else None` to `W_d = nn.Linear(d_h + d_vd, d_h + d_vd) if d_vd else None`?

@shihchengli I don't understand why the latter works but the former doesn't - if d_vd isn't None when there are no additional features, what is it? Some implicitly false thing, like an empty array?

I can open another issue to describe this if it would be better.

Yes please do break that out into another issue.

@shihchengli (Contributor)

@JacksonBurns It would be an integer 0.
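i.e., a tiny illustration of the difference between the two guards when no extra atom descriptors are passed:

d_vd = 0                 # no additional atom descriptors
print(d_vd is not None)  # True  -> the old guard still builds W_d
print(bool(d_vd))        # False -> `if d_vd` skips it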
