Tensorflow ML notebook error on GPU #511

Open
racheetmatai opened this issue Jan 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

racheetmatai commented Jan 25, 2024

Describe the bug
The TensorFlow ML notebook fails to run on GPU. The suggestions in the thread don't work.

To Reproduce
Steps to reproduce the behavior:

  1. In the JupyterHub instance terminal, run: mamba install -c nvidia cuda-nvcc
  2. Shut down the notebook kernels if they were active.
  3. Run the example in the thread, or any example that uses TF.
  4. You will see either a "libdevice not found" error or a StatefulPartitionedCall_2 error.

Expected behavior
Images from before April (when the thread was posted) give the StatefulPartitionedCall_2 error, whereas the latest images give the "libdevice not found" error.
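
One way to check whether a libdevice file is present in the environment at all is to search the conda prefix for it. The following is a minimal sketch, not part of the original report: the prefix /srv/conda/envs/notebook is the one mentioned later in this thread, and the helper name find_libdevice is hypothetical. XLA is generally expected to look under nvvm/libdevice inside its CUDA data directory, so a file found elsewhere may still require pointing XLA at the right directory.

```python
from pathlib import Path


def find_libdevice(prefix):
    """Return all libdevice*.bc files found anywhere under `prefix`."""
    root = Path(prefix)
    if not root.is_dir():
        # Nothing to search, e.g. when run outside the notebook image.
        return []
    return sorted(root.rglob("libdevice*.bc"))


# Assumption: the notebook environment lives at this prefix.
matches = find_libdevice("/srv/conda/envs/notebook")
if matches:
    for path in matches:
        print(path)
else:
    print("no libdevice file found under the given prefix")
```

If nothing is printed for the prefix, installing cuda-nvcc into that environment (step 1 above) is the usual way the file gets there.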

Docker Image Version
The following images were tried:

libdevice not found error:
2023.11.14
2023.10.24

StatefulPartitionedCall_2 error:
2023.05.18
2023.04.15
2023.01.04

Infrastructure (Where you are running this image):

Additional context
The same notebook runs fine on the CPU TensorFlow ML notebooks.

weiji14 added the bug label Jan 25, 2024
weiji14 (Member) commented Jan 25, 2024

Hi @racheetmatai, thanks for opening this bug report. It seems like you've tested docker images up to tag 2023.11.14. I'm wondering if any of the newer ones, e.g. 2024.01.03, which includes the CUDA 11.2 to 11.8 update (#505), might help with this issue?

There are a few things we can try. There are some major changes on conda-forge related to CUDA 12, and I'm wondering if the cuda-nvcc issue could be handled differently now if we update from CUDA 11.8 to 12, cc @ngam. There are also some tensorflow updates we need to do related to flax (#489), but I'm not sure if they would help here.

racheetmatai (Author) commented Jan 25, 2024

Hi @weiji14, I should have mentioned: I tried the example today (the default TF ML notebook on pangeo) before submitting the issue and got the StatefulPartitionedCall_2 error. I know that the CUDA version, driver version and cuDNN version have to match exactly (or at least that's how it used to be), and this is particularly painful on Ubuntu. Just updating one of these used to sometimes leave the build broken.
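
To see which CUDA and cuDNN versions an installed TensorFlow was built against, the build info can be printed from Python. This is only a sketch for comparing versions by hand: summarize_build is a hypothetical helper, and the driver version still has to be checked separately (for example with nvidia-smi).

```python
import importlib.util


def summarize_build(info):
    """Format the CUDA-related entries of a TensorFlow build-info mapping."""
    keys = ("is_cuda_build", "cuda_version", "cudnn_version")
    return ", ".join(f"{key}={info.get(key, 'unknown')}" for key in keys)


# Only import TensorFlow if it is actually available in this environment.
if importlib.util.find_spec("tensorflow") is not None:
    import tensorflow as tf

    print(summarize_build(tf.sysconfig.get_build_info()))
else:
    print("TensorFlow is not installed in this environment")
```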

racheetmatai (Author) commented

@weiji14 Running

    pip install 'flax==0.7.2' 'jax<=0.4.13' 'ml_dtypes==0.2.0'
    mamba install cuda-nvcc==11.6.* -c nvidia

and adding

    os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/srv/conda/envs/notebook'

at the beginning of my notebook works. Thanks :)
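
The XLA_FLAGS part of this workaround can be sketched as a small helper that keeps any flags already set instead of overwriting them. The function name set_xla_cuda_data_dir is hypothetical, and the prefix is the one quoted above. To be safe, the variable should be set before TensorFlow is imported, since XLA reads it from the environment.

```python
import os

# Assumption: the notebook conda environment lives at this prefix.
CUDA_DATA_DIR = "/srv/conda/envs/notebook"


def set_xla_cuda_data_dir(cuda_dir, env=os.environ):
    """Set --xla_gpu_cuda_data_dir in XLA_FLAGS, preserving other flags."""
    prefix = "--xla_gpu_cuda_data_dir="
    existing = env.get("XLA_FLAGS", "").split()
    # Drop any earlier value of the same flag so the new one is the only one.
    kept = [flag for flag in existing if not flag.startswith(prefix)]
    env["XLA_FLAGS"] = " ".join(kept + [prefix + cuda_dir])
    return env["XLA_FLAGS"]


print(set_xla_cuda_data_dir(CUDA_DATA_DIR))

# Import TensorFlow only after this point, e.g.:
# import tensorflow as tf
```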
