cuDNN error: CUDNN_STATUS_INTERNAL_ERROR error #6

unwritten · 2022-04-15T08:41:07Z

The code segment below raises the error in the title during multi-GPU training:

    # rotary embeddings
    positions = self.get_rotary_embedding(n, device)
    q, k = map(lambda t: apply_rotary_pos_emb(positions, t), (q, k))

lucidrains · 2022-04-19T20:47:53Z

hmm, are you sure you aren't OOM?
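One way to check for memory pressure is to print free and peak memory per GPU right before the failing step. This is a minimal sketch, not code from the repository: `gpu_memory_report` is a hypothetical helper name, and it assumes a PyTorch version that provides `torch.cuda.mem_get_info`.

```python
import torch


def gpu_memory_report():
    """Return one summary line per visible GPU (hypothetical helper)."""
    lines = []
    if not torch.cuda.is_available():
        # No CUDA device visible: nothing to report.
        return lines
    mib = 1024 ** 2
    for i in range(torch.cuda.device_count()):
        # Free and total device memory in bytes, as reported by the driver.
        free, total = torch.cuda.mem_get_info(i)
        # Peak memory allocated by PyTorch tensors on this device so far.
        peak = torch.cuda.max_memory_allocated(i)
        lines.append(
            f"cuda:{i} free={free // mib} MiB "
            f"total={total // mib} MiB peak_allocated={peak // mib} MiB"
        )
    return lines


if __name__ == "__main__":
    for line in gpu_memory_report():
        print(line)
```

If the free value is close to zero on any device just before the crash, an out-of-memory condition is a plausible cause.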

conceptofmind · 2022-04-21T15:00:49Z

> code segment below will report error as titled, under multi gpu training
>
>     # rotary embeddings
>     positions = self.get_rotary_embedding(n, device)
>     q, k = map(lambda t: apply_rotary_pos_emb(positions, t), (q, k))

Are you using a specific library for parallel computing, such as Horovod, PyTorch Lightning, FairScale, DeepSpeed, or PyTorch distributed with model = nn.DataParallel(model)? So far I have tested multi-GPU use with both DeepSpeed and model = nn.DataParallel(model). cuDNN errors can be quite difficult to debug. Have you tried running on CPU, or using .detach()?
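To try the CPU suggestion in isolation, something like the following could be used. It is only a sketch under assumptions: `rotary_positions`, `rotate_half`, and `apply_rotary` are hypothetical stand-ins that follow a common rotary embedding formulation, not the repository's `get_rotary_embedding` and `apply_rotary_pos_emb`, and the tensor shapes are made up for illustration.

```python
import torch


def rotary_positions(n, dim, device):
    # Hypothetical stand-in for self.get_rotary_embedding(n, device).
    exponents = torch.arange(0, dim, 2, device=device).float() / dim
    inv_freq = 1.0 / (10000 ** exponents)
    seq = torch.arange(n, device=device).float()
    freqs = torch.outer(seq, inv_freq)        # shape (n, dim // 2)
    return torch.cat((freqs, freqs), dim=-1)  # shape (n, dim)


def rotate_half(x):
    # Split the last dimension in two halves and rotate them by 90 degrees.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary(positions, t):
    # Hypothetical stand-in for apply_rotary_pos_emb(positions, t).
    return t * positions.cos() + rotate_half(t) * positions.sin()


def check_on_cpu(batch=2, heads=4, n=16, dim=32):
    # Run the same two lines as in the report, but entirely on CPU,
    # where shape or indexing problems raise readable Python errors.
    device = torch.device("cpu")
    q = torch.randn(batch, heads, n, dim, device=device)
    k = torch.randn(batch, heads, n, dim, device=device)
    positions = rotary_positions(n, dim, device)
    q, k = map(lambda t: apply_rotary(positions, t), (q, k))
    return q, k


if __name__ == "__main__":
    q, k = check_on_cpu()
    print(q.shape, k.shape)
```

If this runs cleanly on CPU with your real shapes, the rotary math itself is unlikely to be the problem, and the device placement of `positions` relative to `q` and `k` on each GPU replica would be the next thing to check.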
