
XlaRuntimeError: INTERNAL: RET_CHECK failure #55

Open
JakobThumm opened this issue May 2, 2023 · 1 comment
Hi,
I am trying to run the code in Docker.
Unfortunately, I get a JAX-related error:

UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure 
(external/xla/xla/service/gpu/gemm_algorithm_picker.cc:380) 
stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms)

Steps to reproduce:

1. Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
2. Change line 33 of the Dockerfile to:

   COPY dreamerv3/embodied/scripts scripts

3. Build the Docker image and run the container:

docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER . && \
 docker run -it --rm --gpus all -v ~/logdir:/logdir dreamer-v3:$USER \
   sh ../scripts/xvfb_run.sh python3 dreamerv3/train.py \
   --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
   --configs atari small --task atari_pong

My local nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    10W / 125W |    995MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The output of the provided NVIDIA Docker test:

docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 11.4.2

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Tue May  2 12:15:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    11W / 125W |    995MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Do you think I have to change my local CUDA version to get Dreamer-v3 running correctly inside the container?

@danijar
Copy link
Owner

danijar commented May 4, 2023

Hi, you can also use a Docker base image with a newer CUDA version. The algorithm supports the newest JAX/CUDA versions. Hope that helps!
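
Following that suggestion, here is a minimal sketch of the base-image swap, assuming the repository's Dockerfile starts from an `nvidia/cuda` base image (the exact tags below are assumptions, not the repo's actual values; check Docker Hub for currently supported `nvidia/cuda` tags and match the CUDA version to the jaxlib build you install):

```Dockerfile
# Before (example of an older, now-deprecated base image;
# the actual tag in the repo's Dockerfile may differ):
# FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04

# After: a newer CUDA image that matches the jaxlib CUDA build,
# e.g. a CUDA 11.8 image for jaxlib wheels built against CUDA 11.x.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
```

After rebuilding, you can verify the in-container CUDA runtime with `docker run --rm --gpus all <image> nvidia-smi` before launching training.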
