
tape-embed fails under PyTorch 1.5.0 #48

Open
wbogud opened this issue Apr 22, 2020 · 2 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

Comments

wbogud commented Apr 22, 2020

I was trying to run tape-embed but received the following error message (everything went fine when I ran it with the --no_cuda flag):

(protein) wbogud@cuda:~/projects/protein$ time tape-embed transformer ../data/test.fasta embeddings.npz models/tape/bert-base/
20/04/22 16:12:11 - INFO - tape.training -   device: cuda n_gpu: 4
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading configuration file models/tape/bert-base/config.json
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "base_model": "transformer",
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "input_size": 768,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 8192,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": -1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_size": 768,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "vocab_size": 30
}

20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading weights file models/tape/bert-base/pytorch_model.bin
  0%|                                                                                                               | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/main.py", line 234, in run_embed
    training.run_embed(**embed_args)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 642, in run_embed
    outputs = runner.forward(batch, no_loss=True)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 86, in forward
    outputs = self.model(**batch)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/models/modeling_bert.py", line 443, in forward
    dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

Downgrading to PyTorch 1.4.0 solved the issue.

Could the error be related to a known issue in PyTorch 1.5.0, described in the release notes at https://github.com/pytorch/pytorch/releases/tag/v1.5.0? (torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode)
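
For reference, here is a minimal sketch of the failure mode (illustrative only, not TAPE's actual code). Under PyTorch 1.5.0, nn.DataParallel replicas no longer expose their parameters through self.parameters(), so the next(self.parameters()).dtype call shown in the traceback raises StopIteration inside each replica. One possible workaround, assuming the replicas still hold their tensors as plain attributes on submodules, is to fall back to scanning for any tensor:

import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Stand-in for a model that needs its own dtype inside forward()."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def _dtype(self) -> torch.dtype:
        try:
            # Works on the original module; on a PyTorch 1.5.0 DataParallel
            # replica, parameters() is empty and next() raises StopIteration.
            return next(self.parameters()).dtype
        except StopIteration:
            # Fallback (assumption: 1.5.0 replicas keep their tensors as
            # plain attributes on submodules): scan for any tensor attribute
            # and use its dtype instead.
            for module in self.modules():
                for value in module.__dict__.values():
                    if torch.is_tensor(value):
                        return value.dtype
            raise

    def forward(self, x):
        # Mirrors the fp16-compatibility pattern from the traceback.
        mask = torch.ones(x.shape[0], dtype=self._dtype(), device=x.device)
        return self.linear(x) * mask.unsqueeze(-1)


if __name__ == "__main__" and torch.cuda.device_count() > 1:
    model = nn.DataParallel(ToyModel().cuda())
    out = model(torch.randn(16, 8).cuda())
    print(out.shape)  # without the fallback, raises StopIteration on 1.5.0

On PyTorch 1.4.0 (or with --no_cuda, which avoids DataParallel entirely), the try branch succeeds and the fallback is never reached.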

rmrao (Collaborator) commented Apr 22, 2020

That seems plausible. We've moved to PyTorch Lightning in an internal version of this code, which sidesteps some of the version issues. We are looking into cleaning that up and making it public.
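
For illustration, a module in that style might look like the hypothetical sketch below (the internal code is not public, so every name here is an assumption). Lightning handles device placement and multi-GPU wrapping through Trainer flags, so model code never wraps itself in nn.DataParallel:

# Hypothetical sketch only; not the internal TAPE code.
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProteinEmbedder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 768)  # placeholder encoder

    def forward(self, tokens):
        return self.encoder(tokens)

    def training_step(self, batch, batch_idx):
        tokens, targets = batch
        return F.mse_loss(self(tokens), targets)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)


# Multi-GPU is configured on the Trainer rather than by wrapping the model
# yourself; exact flags vary across Lightning versions.
# trainer = pl.Trainer(gpus=4)
# trainer.fit(ProteinEmbedder(), train_dataloader)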

rmrao added the bug label on Apr 22, 2020
rmrao (Collaborator) commented May 1, 2020

For now, the new version of TAPE has torch>=1.0,<1.5 added to the requirements.
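
For context, a constraint like that is usually expressed in the package metadata; a hypothetical setup.py excerpt is below (the actual TAPE packaging may be laid out differently):

# Hypothetical setup.py excerpt illustrating the pin described above.
from setuptools import setup, find_packages

setup(
    name="tape-proteins",
    packages=find_packages(),
    install_requires=[
        "torch>=1.0,<1.5",  # avoid the DataParallel StopIteration in 1.5.0
    ],
)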

rmrao added the wontfix label on Sep 27, 2020