
tape-embed fails under PyTorch 1.5.0 #48

Open
wbogud opened this issue Apr 22, 2020 · 2 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

Comments

wbogud commented Apr 22, 2020

I was trying to run tape-embed but received the following error message (everything went fine when I ran it with the --no_cuda flag):

(protein) wbogud@cuda:~/projects/protein$ time tape-embed transformer ../data/test.fasta embeddings.npz models/tape/bert-base/
20/04/22 16:12:11 - INFO - tape.training -   device: cuda n_gpu: 4
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading configuration file models/tape/bert-base/config.json
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "base_model": "transformer",
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "input_size": 768,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 8192,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": -1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_size": 768,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "vocab_size": 30
}

20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading weights file models/tape/bert-base/pytorch_model.bin
  0%|                                                                                                               | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/main.py", line 234, in run_embed
    training.run_embed(**embed_args)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 642, in run_embed
    outputs = runner.forward(batch, no_loss=True)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 86, in forward
    outputs = self.model(**batch)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/models/modeling_bert.py", line 443, in forward
    dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

Downgrading to PyTorch 1.4.0 solved the issue.

Could the error be related to a known issue in PyTorch 1.5.0, described in the release notes at https://github.com/pytorch/pytorch/releases/tag/v1.5.0? (torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode)
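
For reference, here is a minimal sketch of the failure mode (illustrative only, not TAPE's actual code). Under PyTorch 1.5.0, nn.DataParallel replicas no longer expose their parameters through self.parameters(), so the next(self.parameters()).dtype call shown in the traceback raises StopIteration inside each replica. One possible workaround, assuming the replicas still hold their tensors as plain attributes on submodules, is to fall back to scanning for any tensor:

import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Stand-in for a model that needs its own dtype inside forward()."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def _dtype(self) -> torch.dtype:
        try:
            # Works on the original module; on a PyTorch 1.5.0 DataParallel
            # replica, parameters() is empty and next() raises StopIteration.
            return next(self.parameters()).dtype
        except StopIteration:
            # Fallback (assumption: 1.5.0 replicas keep their tensors as
            # plain attributes on submodules): scan for any tensor attribute
            # and use its dtype instead.
            for module in self.modules():
                for value in module.__dict__.values():
                    if torch.is_tensor(value):
                        return value.dtype
            raise

    def forward(self, x):
        # Mirrors the fp16-compatibility pattern from the traceback.
        mask = torch.ones(x.shape[0], dtype=self._dtype(), device=x.device)
        return self.linear(x) * mask.unsqueeze(-1)


if __name__ == "__main__" and torch.cuda.device_count() > 1:
    model = nn.DataParallel(ToyModel().cuda())
    out = model(torch.randn(16, 8).cuda())
    print(out.shape)  # without the fallback, raises StopIteration on 1.5.0

On PyTorch 1.4.0 (or with --no_cuda, which avoids DataParallel entirely), the try branch succeeds and the fallback is never reached.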

rmrao (Collaborator) commented Apr 22, 2020

That seems plausible. We've moved to PyTorch Lightning in an internal version of this code, which sidesteps some of the version issues. We are looking into cleaning that up and making it public.
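
For illustration, a module in that style might look like the hypothetical sketch below (the internal code is not public, so every name here is an assumption). Lightning handles device placement and multi-GPU wrapping through Trainer flags, so model code never wraps itself in nn.DataParallel:

# Hypothetical sketch only; not the internal TAPE code.
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProteinEmbedder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 768)  # placeholder encoder

    def forward(self, tokens):
        return self.encoder(tokens)

    def training_step(self, batch, batch_idx):
        tokens, targets = batch
        return F.mse_loss(self(tokens), targets)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)


# Multi-GPU is configured on the Trainer rather than by wrapping the model
# yourself; exact flags vary across Lightning versions.
# trainer = pl.Trainer(gpus=4)
# trainer.fit(ProteinEmbedder(), train_dataloader)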

rmrao added the bug label on Apr 22, 2020
rmrao (Collaborator) commented May 1, 2020

For now, the new version of TAPE has torch>=1.0,<1.5 added to the requirements.
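
For context, a constraint like that is usually expressed in the package metadata; a hypothetical setup.py excerpt is below (the actual TAPE packaging may be laid out differently):

# Hypothetical setup.py excerpt illustrating the pin described above.
from setuptools import setup, find_packages

setup(
    name="tape-proteins",
    packages=find_packages(),
    install_requires=[
        "torch>=1.0,<1.5",  # avoid the DataParallel StopIteration in 1.5.0
    ],
)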

rmrao added the wontfix label on Sep 27, 2020