Kernel crash when using TensorFlow/PyTorch? #8

Closed
benz0li opened this issue Apr 25, 2024 · 8 comments

benz0li commented Apr 25, 2024

@mthiboust Could you test with my/b-data's images?


glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.11.9

  • based on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

Install TensorFlow: pip install tensorflow==2.14.1

Install PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118


glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3

  • based on nvidia/cuda:12.4.1-devel-ubuntu22.04
    • including a custom cuDNN 8 installation

Install TensorFlow: pip install tensorflow

Install PyTorch: pip install torch torchvision torchaudio
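
As a rough sketch, the two setups above can be kept in a small lookup table so the matching install commands are easy to retrieve. The names `IMAGES` and `install_commands` are hypothetical and exist only for illustration; the image tags, base images and pip commands are copied from the comment above.

```python
# Hypothetical lookup table. Tags, base images and pip commands are copied
# from the comment above; nothing else is assumed about the images.
IMAGES = {
    "3.11.9": {
        "image": "glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.11.9",
        "base": "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
        "tensorflow": "pip install tensorflow==2.14.1",
        "pytorch": (
            "pip install torch torchvision torchaudio "
            "--index-url https://download.pytorch.org/whl/cu118"
        ),
    },
    "3.12.3": {
        "image": "glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3",
        "base": "nvidia/cuda:12.4.1-devel-ubuntu22.04",
        "tensorflow": "pip install tensorflow",
        "pytorch": "pip install torch torchvision torchaudio",
    },
}


def install_commands(python_version: str) -> list[str]:
    """Return the TensorFlow and PyTorch install commands for an image tag."""
    entry = IMAGES[python_version]
    return [entry["tensorflow"], entry["pytorch"]]


if __name__ == "__main__":
    # Print the image name followed by its two install commands.
    for tag, entry in IMAGES.items():
        print(entry["image"])
        for command in install_commands(tag):
            print("  " + command)
```

Keeping the CUDA 11.8 entry pinned to `tensorflow==2.14.1` and the `cu118` wheel index mirrors the comment; other combinations are not covered here.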


Cross reference:

benz0li commented Apr 25, 2024

@mthiboust See the CUDA Version Matrix for PyTorch/TensorFlow compatibility.

@mthiboust

Thanks @benz0li for your help. Your images haven't solved my problem. In fact, the root cause may be on the Keras side, because I have the same problem with the official TensorFlow image (cf. keras-team/keras#19601).

benz0li commented May 7, 2024

@mthiboust I cannot reproduce with image glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3 (Container: CUDA 12.4.1 + Python 3.12.3) on Debian 12 (bookworm) using NVIDIA driver version 550.54.15 and Docker 26.1.0:

docker run --rm -ti glcr.b-data.ch/jupyterlab/cuda/python/scipy:3.12.3 bash

==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

=============
== JUPYTER ==
=============

Entered start.sh with args: bash
Running hooks in: /usr/local/bin/start-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/start-notebook.d/10-populate.sh
Done running hooks in: /usr/local/bin/start-notebook.d
Running hooks in: /usr/local/bin/before-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/before-notebook.d/10-env.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/11-home.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/30-code-server.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/90-limits.sh
Done running hooks in: /usr/local/bin/before-notebook.d
Executing the command: bash

In the Container:

pip install keras pandas tensorflow
Defaulting to user installation because normal site-packages is not writeable
Collecting keras
  Downloading keras-3.3.3-py3-none-any.whl.metadata (5.7 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/site-packages (2.2.2)
Collecting tensorflow
  Downloading tensorflow-2.16.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Collecting absl-py (from keras)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/site-packages (from keras) (1.26.4)
Collecting rich (from keras)
  Downloading rich-13.7.1-py3-none-any.whl.metadata (18 kB)
Collecting namex (from keras)
  Downloading namex-0.0.8-py3-none-any.whl.metadata (246 bytes)
Requirement already satisfied: h5py in /usr/local/lib/python3.12/site-packages (from keras) (3.11.0)
Collecting optree (from keras)
  Downloading optree-0.11.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.4/45.4 kB 2.4 MB/s eta 0:00:00
Collecting ml-dtypes (from keras)
  Downloading ml_dtypes-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/site-packages (from pandas) (2024.1)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Downloading flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting libclang>=13.0.0 (from tensorflow)
  Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
Collecting ml-dtypes (from keras)
  Downloading ml_dtypes-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting opt-einsum>=2.3.2 (from tensorflow)
  Downloading opt_einsum-3.3.0-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/site-packages (from tensorflow) (24.0)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow)
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.12/site-packages (from tensorflow) (2.31.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from tensorflow) (69.5.1)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.12/site-packages (from tensorflow) (1.16.0)
Collecting termcolor>=1.1.0 (from tensorflow)
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.12/site-packages (from tensorflow) (4.11.0)
Collecting wrapt>=1.11.0 (from tensorflow)
  Downloading wrapt-1.16.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting grpcio<2.0,>=1.24.3 (from tensorflow)
  Downloading grpcio-1.63.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting tensorboard<2.17,>=2.16 (from tensorflow)
  Downloading tensorboard-2.16.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.12/site-packages (from astunparse>=1.6.0->tensorflow) (0.43.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (2024.2.2)
Collecting markdown>=2.6.8 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading Markdown-3.6-py3-none-any.whl.metadata (7.0 kB)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl.metadata (1.1 kB)
Collecting werkzeug>=1.0.1 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading werkzeug-3.0.3-py3-none-any.whl.metadata (3.7 kB)
Collecting markdown-it-py>=2.2.0 (from rich->keras)
  Downloading markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich->keras) (2.17.2)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich->keras)
  Downloading mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.12/site-packages (from werkzeug>=1.0.1->tensorboard<2.17,>=2.16->tensorflow) (2.1.5)
Downloading keras-3.3.3-py3-none-any.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 9.8 MB/s eta 0:00:00
Downloading tensorflow-2.16.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (589.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 589.9/589.9 MB 2.6 MB/s eta 0:00:00
Downloading absl_py-2.1.0-py3-none-any.whl (133 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.7/133.7 kB 8.0 MB/s eta 0:00:00
Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Downloading flatbuffers-24.3.25-py2.py3-none-any.whl (26 kB)
Downloading gast-0.5.4-py3-none-any.whl (19 kB)
Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 5.3 MB/s eta 0:00:00
Downloading grpcio-1.63.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 56.5 MB/s eta 0:00:00
Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl (24.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.5/24.5 MB 37.1 MB/s eta 0:00:00
Downloading ml_dtypes-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 26.2 MB/s eta 0:00:00
Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.5/65.5 kB 6.4 MB/s eta 0:00:00
Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.6/294.6 kB 13.0 MB/s eta 0:00:00
Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 44.4 MB/s eta 0:00:00
Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Downloading wrapt-1.16.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.3/87.3 kB 12.4 MB/s eta 0:00:00
Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)
Downloading optree-0.11.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (308 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.3/308.3 kB 7.9 MB/s eta 0:00:00
Downloading rich-13.7.1-py3-none-any.whl (240 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 240.7/240.7 kB 4.3 MB/s eta 0:00:00
Downloading Markdown-3.6-py3-none-any.whl (105 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 105.4/105.4 kB 6.1 MB/s eta 0:00:00
Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 kB 10.2 MB/s eta 0:00:00
Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl (6.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 56.9 MB/s eta 0:00:00
Downloading werkzeug-3.0.3-py3-none-any.whl (227 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.3/227.3 kB 4.5 MB/s eta 0:00:00
Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: namex, libclang, flatbuffers, wrapt, werkzeug, termcolor, tensorboard-data-server, protobuf, optree, opt-einsum, ml-dtypes, mdurl, markdown, grpcio, google-pasta, gast, astunparse, absl-py, tensorboard, markdown-it-py, rich, keras, tensorflow
Successfully installed absl-py-2.1.0 astunparse-1.6.3 flatbuffers-24.3.25 gast-0.5.4 google-pasta-0.2.0 grpcio-1.63.0 keras-3.3.3 libclang-18.1.1 markdown-3.6 markdown-it-py-3.0.0 mdurl-0.1.2 ml-dtypes-0.3.2 namex-0.0.8 opt-einsum-3.3.0 optree-0.11.0 protobuf-4.25.3 rich-13.7.1 tensorboard-2.16.2 tensorboard-data-server-0.7.2 tensorflow-2.16.1 termcolor-2.4.0 werkzeug-3.0.3 wrapt-1.16.0
nano test.py

👉 Python code from keras-team/keras#19601 (comment)

python test.py
2024-05-07 15:42:24.309257: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-07 15:42:24.369721: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-07 15:42:26.612762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6819 MB memory:  -> device: 0, name: Quadro RTX 4000, pci bus id: 0000:af:00.0, compute capability: 7.5
Epoch 1/50
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1715096547.486490     255 service.cc:145] XLA service 0x7f3f1c006770 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1715096547.486601     255 service.cc:153]   StreamExecutor device (0): Quadro RTX 4000, Compute Capability 7.5
2024-05-07 15:42:27.514121: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-05-07 15:42:27.594475: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1715096548.025991     255 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.2421
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1632 
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1439 
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1400 
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1355 
Epoch 6/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1336 
Epoch 7/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1316 
Epoch 8/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1303 
Epoch 9/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1294 
Epoch 10/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1284 
Epoch 11/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1284 
Epoch 12/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1278 
Epoch 13/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1273 
Epoch 14/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1271 
Epoch 15/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1273 
Epoch 16/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1273 
Epoch 17/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1273 
Epoch 18/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1270 
Epoch 19/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1271 
Epoch 20/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1267 
Epoch 21/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1268 
Epoch 22/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1265 
Epoch 23/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1270 
Epoch 24/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1267 
Epoch 25/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1265 
Epoch 26/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1266 
Epoch 27/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1267 
Epoch 28/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1266 
Epoch 29/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1267 
Epoch 30/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1265 
Epoch 31/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1265 
Epoch 32/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1268 
Epoch 33/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1265 
Epoch 34/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1262 
Epoch 35/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1263 
Epoch 36/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1264 
Epoch 37/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1264 
Epoch 38/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1264 
Epoch 39/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1264 
Epoch 40/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1263 
Epoch 41/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1263 
Epoch 42/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1262 
Epoch 43/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1260 
Epoch 44/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1258 
Epoch 45/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1258 
Epoch 46/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1262 
Epoch 47/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1265 
Epoch 48/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1264 
Epoch 49/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1266 
Epoch 50/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1261

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / GPU (Quadro RTX 4000, Compute Capability 7.5) / Ubuntu 22.04 (Container) with Keras 3.3.3, NumPy 1.26.4 and TensorFlow 2.16.1.
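
When comparing environments in a report like this, it can help to print the package versions from inside the container. The following is a minimal sketch using only the standard library; the helper name `package_versions` is hypothetical, and the package list simply mirrors the versions quoted above.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Return {name: installed version, or None if the package is missing}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    print("Python", platform.python_version())
    for name, version in package_versions(["keras", "numpy", "tensorflow"]).items():
        print(name, version or "not installed")
```

Because it reads installed metadata instead of importing the packages, the script does not trigger TensorFlow's start-up logging.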

@mthiboust

Thanks for testing it! Like you, I do not have the issue when using the GPU. The bug only happens on CPU. I am curious to know whether you run into the same bug on your CPU without CUDA.
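
One way to exercise the CPU path on a machine that has a GPU is to hide the GPU from the process before TensorFlow starts. The sketch below is an assumption about how one might do this, not something tried in this thread: setting `CUDA_VISIBLE_DEVICES` to `-1` is a commonly used way to make no CUDA device visible, and `TF_ENABLE_ONEDNN_OPTS=0` is the switch mentioned in TensorFlow's own start-up log. The helper names `cpu_only_env` and `run_on_cpu` are hypothetical, and `test.py` refers to the script from keras-team/keras#19601.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None, disable_onednn=False):
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: an invalid device index hides all GPUs from CUDA applications.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    if disable_onednn:
        # Optional: turn off oneDNN custom ops, as suggested by the TensorFlow log.
        env["TF_ENABLE_ONEDNN_OPTS"] = "0"
    return env


def run_on_cpu(script="test.py", disable_onednn=False):
    """Run a script in a child process with GPUs hidden; return its exit code."""
    completed = subprocess.run(
        [sys.executable, script],
        env=cpu_only_env(disable_onednn=disable_onednn),
    )
    return completed.returncode


# Example usage (not executed here):
#   exit_code = run_on_cpu("test.py", disable_onednn=True)
```

The variables must be set before TensorFlow is imported, which is why the script is started in a fresh child process.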

benz0li commented May 7, 2024

Checking right now...

benz0li commented May 7, 2024

@mthiboust No. On CPU without CUDA, I cannot reproduce the bug either, using image glcr.b-data.ch/jupyterlab/python/scipy:3.12.3 (Container: Python 3.12.3) on Debian 12 (bookworm) with Docker 26.1.0:

docker run --rm -ti glcr.b-data.ch/jupyterlab/python/scipy:3.12.3 bash
Entered start.sh with args: bash
Running hooks in: /usr/local/bin/start-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/start-notebook.d/10-populate.sh
Done running hooks in: /usr/local/bin/start-notebook.d
Running hooks in: /usr/local/bin/before-notebook.d as uid: 1000 gid: 100
Sourcing shell script: /usr/local/bin/before-notebook.d/10-env.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/11-home.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/30-code-server.sh
Sourcing shell script: /usr/local/bin/before-notebook.d/90-limits.sh
Done running hooks in: /usr/local/bin/before-notebook.d
Executing the command: bash

In the Container:

pip install keras pandas tensorflow
Defaulting to user installation because normal site-packages is not writeable
Collecting keras
  Downloading keras-3.3.3-py3-none-any.whl.metadata (5.7 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/site-packages (2.2.2)
Collecting tensorflow
  Downloading tensorflow-2.16.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Collecting absl-py (from keras)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/site-packages (from keras) (1.26.4)
Collecting rich (from keras)
  Downloading rich-13.7.1-py3-none-any.whl.metadata (18 kB)
Collecting namex (from keras)
  Downloading namex-0.0.8-py3-none-any.whl.metadata (246 bytes)
Requirement already satisfied: h5py in /usr/local/lib/python3.12/site-packages (from keras) (3.11.0)
Collecting optree (from keras)
  Downloading optree-0.11.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.4/45.4 kB 2.9 MB/s eta 0:00:00
Collecting ml-dtypes (from keras)
  Downloading ml_dtypes-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/site-packages (from pandas) (2024.1)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Downloading flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting libclang>=13.0.0 (from tensorflow)
  Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
Collecting ml-dtypes (from keras)
  Downloading ml_dtypes-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting opt-einsum>=2.3.2 (from tensorflow)
  Downloading opt_einsum-3.3.0-py3-none-any.whl.metadata (6.5 kB)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/site-packages (from tensorflow) (24.0)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow)
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.12/site-packages (from tensorflow) (2.31.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from tensorflow) (69.5.1)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.12/site-packages (from tensorflow) (1.16.0)
Collecting termcolor>=1.1.0 (from tensorflow)
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.12/site-packages (from tensorflow) (4.11.0)
Collecting wrapt>=1.11.0 (from tensorflow)
  Downloading wrapt-1.16.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting grpcio<2.0,>=1.24.3 (from tensorflow)
  Downloading grpcio-1.63.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting tensorboard<2.17,>=2.16 (from tensorflow)
  Downloading tensorboard-2.16.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.12/site-packages (from astunparse>=1.6.0->tensorflow) (0.43.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/site-packages (from requests<3,>=2.21.0->tensorflow) (2024.2.2)
Collecting markdown>=2.6.8 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading Markdown-3.6-py3-none-any.whl.metadata (7.0 kB)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl.metadata (1.1 kB)
Collecting werkzeug>=1.0.1 (from tensorboard<2.17,>=2.16->tensorflow)
  Downloading werkzeug-3.0.3-py3-none-any.whl.metadata (3.7 kB)
Collecting markdown-it-py>=2.2.0 (from rich->keras)
  Downloading markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich->keras) (2.17.2)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich->keras)
  Downloading mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.12/site-packages (from werkzeug>=1.0.1->tensorboard<2.17,>=2.16->tensorflow) (2.1.5)
Downloading keras-3.3.3-py3-none-any.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 10.9 MB/s eta 0:00:00
Downloading tensorflow-2.16.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (589.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 589.9/589.9 MB 2.9 MB/s eta 0:00:00
Downloading absl_py-2.1.0-py3-none-any.whl (133 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.7/133.7 kB 11.8 MB/s eta 0:00:00
Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Downloading flatbuffers-24.3.25-py2.py3-none-any.whl (26 kB)
Downloading gast-0.5.4-py3-none-any.whl (19 kB)
Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 7.6 MB/s eta 0:00:00
Downloading grpcio-1.63.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 21.7 MB/s eta 0:00:00
Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl (24.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.5/24.5 MB 5.9 MB/s eta 0:00:00
Downloading ml_dtypes-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 31.9 MB/s eta 0:00:00
Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.5/65.5 kB 7.4 MB/s eta 0:00:00
Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.6/294.6 kB 7.7 MB/s eta 0:00:00
Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 38.9 MB/s eta 0:00:00
Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Downloading wrapt-1.16.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.3/87.3 kB 2.5 MB/s eta 0:00:00
Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)
Downloading optree-0.11.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (308 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.3/308.3 kB 7.9 MB/s eta 0:00:00
Downloading rich-13.7.1-py3-none-any.whl (240 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 240.7/240.7 kB 20.8 MB/s eta 0:00:00
Downloading Markdown-3.6-py3-none-any.whl (105 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 105.4/105.4 kB 2.3 MB/s eta 0:00:00
Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 kB 7.2 MB/s eta 0:00:00
Downloading tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl (6.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 47.8 MB/s eta 0:00:00
Downloading werkzeug-3.0.3-py3-none-any.whl (227 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.3/227.3 kB 11.5 MB/s eta 0:00:00
Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: namex, libclang, flatbuffers, wrapt, werkzeug, termcolor, tensorboard-data-server, protobuf, optree, opt-einsum, ml-dtypes, mdurl, markdown, grpcio, google-pasta, gast, astunparse, absl-py, tensorboard, markdown-it-py, rich, keras, tensorflow
Successfully installed absl-py-2.1.0 astunparse-1.6.3 flatbuffers-24.3.25 gast-0.5.4 google-pasta-0.2.0 grpcio-1.63.0 keras-3.3.3 libclang-18.1.1 markdown-3.6 markdown-it-py-3.0.0 mdurl-0.1.2 ml-dtypes-0.3.2 namex-0.0.8 opt-einsum-3.3.0 optree-0.11.0 protobuf-4.25.3 rich-13.7.1 tensorboard-2.16.2 tensorboard-data-server-0.7.2 tensorflow-2.16.1 termcolor-2.4.0 werkzeug-3.0.3 wrapt-1.16.0
nano test.py

👉 Python code from keras-team/keras#19601 (comment)

python test.py
2024-05-07 19:48:22.210161: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-07 19:48:22.219200: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-07 19:48:22.225813: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-07 19:48:22.279096: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-07 19:48:23.508076: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Epoch 1/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 6s 492ms/step - loss: 0.1715
Epoch 2/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 5s 479ms/step - loss: 0.1474
Epoch 3/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 424ms/step - loss: 0.1392
Epoch 4/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 326ms/step - loss: 0.1341
Epoch 5/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 328ms/step - loss: 0.1307
Epoch 6/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 323ms/step - loss: 0.1289
Epoch 7/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 325ms/step - loss: 0.1276
Epoch 8/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1272
Epoch 9/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 356ms/step - loss: 0.1272
Epoch 10/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 359ms/step - loss: 0.1271
Epoch 11/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 355ms/step - loss: 0.1272
Epoch 12/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 356ms/step - loss: 0.1267
Epoch 13/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 353ms/step - loss: 0.1271
Epoch 14/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 356ms/step - loss: 0.1262
Epoch 15/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 352ms/step - loss: 0.1265
Epoch 16/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 346ms/step - loss: 0.1265
Epoch 17/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 349ms/step - loss: 0.1263
Epoch 18/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 353ms/step - loss: 0.1261
Epoch 19/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 352ms/step - loss: 0.1262
Epoch 20/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 351ms/step - loss: 0.1263
Epoch 21/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 347ms/step - loss: 0.1260
Epoch 22/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 341ms/step - loss: 0.1262
Epoch 23/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 344ms/step - loss: 0.1259
Epoch 24/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 343ms/step - loss: 0.1260
Epoch 25/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 340ms/step - loss: 0.1259
Epoch 26/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1256
Epoch 27/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1259
Epoch 28/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 341ms/step - loss: 0.1256
Epoch 29/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1257
Epoch 30/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1262
Epoch 31/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 339ms/step - loss: 0.1262
Epoch 32/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1257
Epoch 33/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1261
Epoch 34/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 335ms/step - loss: 0.1259
Epoch 35/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1258
Epoch 36/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1258
Epoch 37/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1255
Epoch 38/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1255
Epoch 39/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1255
Epoch 40/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1256
Epoch 41/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1258
Epoch 42/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 334ms/step - loss: 0.1256
Epoch 43/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1257
Epoch 44/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 4s 361ms/step - loss: 0.1254
Epoch 45/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 339ms/step - loss: 0.1256
Epoch 46/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1255
Epoch 47/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 335ms/step - loss: 0.1253
Epoch 48/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 338ms/step - loss: 0.1255
Epoch 49/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1255
Epoch 50/50
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1251

Code run on CPU (Intel(R) Xeon(R) Silver 4210R) / Ubuntu 22.04 (Container) with Keras 3.3.3, NumPy 1.26.4 and TensorFlow 2.16.1.
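
To compare the GPU and CPU runs without reading every epoch line, the Keras progress output can be summarised with a short script. This is only a sketch: the regular expression assumes the `NNms/step - loss: X` format seen in the logs above, the `summarize` helper is a hypothetical name, and the sample strings contain just the last two epochs of each run.

```python
import re
from statistics import mean

# Matches the per-epoch summary, e.g. "3s 336ms/step - loss: 0.1251".
STEP_RE = re.compile(r"(\d+)ms/step - loss: (\d+\.\d+)")


def summarize(log_text):
    """Return epoch count, mean ms/step and final loss, or None if no match."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    step_ms = [int(ms) for ms, _ in matches]
    return {
        "epochs": len(matches),
        "mean_ms_per_step": mean(step_ms),
        "final_loss": float(matches[-1][1]),
    }


# Last two epochs of each run, copied from the logs above.
gpu_log = """10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1266
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.1261"""
cpu_log = """10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 337ms/step - loss: 0.1255
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 336ms/step - loss: 0.1251"""

if __name__ == "__main__":
    print("GPU:", summarize(gpu_log))
    print("CPU:", summarize(cpu_log))
```

Both runs finish all 50 epochs with a similar final loss, which matches the conclusion that the crash could not be reproduced on either device.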

@mthiboust

Thanks again @benz0li! I'll have a new look at it next week. Hopefully I'll have new ideas by then!

benz0li commented May 7, 2024

Closing because I cannot reproduce [on my machine].

benz0li closed this as completed May 7, 2024