
XlaRuntimeError: INTERNAL: RET_CHECK failure #55

Open
JakobThumm opened this issue May 2, 2023 · 1 comment
Hi,
I am trying to run the code in Docker.
Unfortunately, I get a JAX-related error:

UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure 
(external/xla/xla/service/gpu/gemm_algorithm_picker.cc:380) 
stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms)

Steps to reproduce:

1. Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
2. Change line 33 of the Dockerfile to:

   COPY dreamerv3/embodied/scripts scripts

3. Build the Docker image and run the container:

docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER . && \
 docker run -it --rm --gpus all -v ~/logdir:/logdir dreamer-v3:$USER \
   sh ../scripts/xvfb_run.sh python3 dreamerv3/train.py \
   --logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
   --configs atari small --task atari_pong

My local nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    10W / 125W |    995MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The output of the provided NVIDIA Docker test:

docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 11.4.2

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Tue May  2 12:15:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:01:00.0  On |                  N/A |
| 30%   30C    P8    11W / 125W |    995MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Do you think I have to change my local CUDA version to get Dreamer-v3 running correctly inside the container?

@danijar
Copy link
Owner

danijar commented May 4, 2023

Hi, you can also use a Docker base image with a newer CUDA version. The algorithm supports the newest JAX/CUDA versions. Hope that helps!
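
Following that suggestion, here is a minimal sketch of the base-image swap, assuming the repository's Dockerfile starts from an `nvidia/cuda` base image (the exact tags below are assumptions, not the repo's actual values; check Docker Hub for currently supported `nvidia/cuda` tags and match the CUDA version to the jaxlib build you install):

```Dockerfile
# Before (example of an older, now-deprecated base image;
# the actual tag in the repo's Dockerfile may differ):
# FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04

# After: a newer CUDA image that matches the jaxlib CUDA build,
# e.g. a CUDA 11.8 image for jaxlib wheels built against CUDA 11.x.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
```

After rebuilding, you can verify the in-container CUDA runtime with `docker run --rm --gpus all <image> nvidia-smi` before launching training.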
