
S3 filesystem pure virtual method called; terminate called without an active exception #1912

Open
rivershah opened this issue Jan 1, 2024 · 8 comments


rivershah commented Jan 1, 2024

I am getting a core dump during interpreter teardown when using the S3 filesystem. Can I please be given guidance on how to handle this issue? Please see the script to reproduce inside Docker:

FROM tensorflow/tensorflow:2.14.0-gpu

The following environment variables are set:

"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,
import os

import tensorflow as tf
import tensorflow_io as tfio

def illustrate_core_dump():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")
    filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename, "GZIP")

    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    illustrate_core_dump()
    print("reaches here successfully")
    print("something broken during destruction and tf")

    # during interpreter teardown if s3 filesystem used we will get
    # pure virtual method called
    # terminate called without an active exception
    # Aborted (core dumped)

    # gs:// and file:// do not exhibit this issue which don't rely on tfio
TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py 
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
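(Editorial sketch, not a fix from this thread.) Aborts of this kind happen after the script's own work has finished, during interpreter teardown, when static destructors in a native extension run in a bad order. One common way to sidestep that class of crash is to skip teardown entirely once the work is done; the helper name below is hypothetical and this only hides the destructor bug rather than fixing it:

```python
import os
import sys

def run_then_skip_teardown(main) -> None:
    """Run main(), flush output, then exit without interpreter teardown.

    os._exit() bypasses atexit handlers and the C++ static destructors
    of loaded native extensions, which is where 'pure virtual method
    called' aborts at shutdown typically originate.
    """
    main()
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)

if __name__ == "__main__":
    # In the repro above, main would be illustrate_core_dump.
    run_then_skip_teardown(lambda: print("work finished"))
    print("never reached")  # teardown (and this line) is skipped
```

Note that `os._exit()` also skips legitimate cleanup (buffered file writes, atexit hooks), so flush anything important before calling it.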
rivershah (Author) commented:

tensorflow-io==0.34.0 # works
tensorflow-io==0.35.0 # crashing

Can we please verify why the latest release is exhibiting this issue? Thank you.

jpambrun commented:

I had the same issue and it was driving me insane. I have some unrelated custom C++ ops and wasted a day digging into those. I am using S3, and going back to 0.34.0 fixed it.

saimidu commented Feb 6, 2024

I am facing the same issue, but with tensorflow==2.13 and tensorflow-io==0.34.0 (and also with tensorflow-io==0.35.0). There is no straightforward root cause, and reverting to tensorflow-io==0.33.0 fixes it.

I've also faced the same error with tensorflow==2.14 and tensorflow-io==0.35.0, which is the only version that supports TF 2.14 per the compatibility chart in the README.md. Reverting to tensorflow-io==0.33.0 seems to fix it there as well.
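The combinations reported so far can be collected into a small startup guard (editorial sketch; the sets below contain only the versions mentioned in this thread and are not exhaustive):

```python
# (tensorflow, tensorflow-io) pairs reported in this thread to abort
# at interpreter teardown, plus the tfio version reported to work.
REPORTED_BAD = {
    ("2.13", "0.34.0"),
    ("2.13", "0.35.0"),
    ("2.14", "0.35.0"),
}
REPORTED_GOOD_TFIO = "0.33.0"

def combo_status(tf_version: str, tfio_version: str) -> str:
    """Classify a tensorflow/tensorflow-io pair against the reports above."""
    tf_minor = ".".join(tf_version.split(".")[:2])  # "2.14.1" -> "2.14"
    if (tf_minor, tfio_version) in REPORTED_BAD:
        return f"reported bad; try tensorflow-io=={REPORTED_GOOD_TFIO}"
    return "not reported bad in this thread"

print(combo_status("2.14.0", "0.35.0"))
# reported bad; try tensorflow-io==0.33.0
```

In practice this would run right after the imports, comparing `tf.__version__` and `tfio.__version__`.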

saimidu commented Feb 12, 2024

As an update, I followed the build instructions for tensorflow-io (Ubuntu 22.04 and then Python Wheels), and discovered that this particular "pure virtual method called" error does not occur when I use a locally built wheel for tensorflow-io.

Note: the link in the Docker build instructions (https://github.com/tensorflow/io/blob/master/docs/development.md#docker) is broken, and the latest image in tfsigio/tfio is about 2 years old.

rivershah (Author) commented:

@saimidu Is there any chance you can please post the steps you took to build? I tried to build but was thwarted by the issues you mentioned.

saimidu commented Feb 13, 2024

@rivershah I pulled the ubuntu:22.04 image from Docker Hub:

docker run --name tfio_builder -itd ubuntu:22.04 bash
docker exec -it tfio_builder bash

and installed all the packages and Bazel as instructed in https://github.com/tensorflow/io/blob/master/docs/development.md#ubuntu-2204 (without the sudo):

apt-get -y -qq update
apt-get -y -qq install gcc g++ git unzip curl python3-pip python-is-python3 libntirpc-dev
curl -sSOL https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 /usr/local/bin/bazel
chmod +x /usr/local/bin/bazel

python3 --version  # made sure I had python version>=3.9
python3 -m pip install -U pip
git clone https://github.com/tensorflow/io
cd io/
git checkout v0.35.0
pip install "tensorflow==2.14.1"
./configure.sh
export TF_PYTHON_VERSION=3.10
bazel build -s --verbose_failures --copt="-Wno-error=array-parameter=" --copt="-I/usr/include/tirpc" //tensorflow_io/... //tensorflow_io_gcs_filesystem/...

I then followed the instructions at https://github.com/tensorflow/io/blob/master/docs/development.md#python-wheels:

python3 setup.py bdist_wheel --data bazel-bin

Then, within the same container, I was able to validate tf-io's S3 filesystem functionality by trying to checkpoint a model to S3.

I'll need to do some additional work to reproduce the failure I got when copying the generated tf-io wheel out into a different container, since I've terminated all of that setup now.
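(Editorial sketch.) When validating a rebuilt wheel, this regression can be caught automatically by checking the child process's exit status: a teardown abort surfaces as SIGABRT (return code -6 on Linux) even though the script's own prints succeeded. The child script below is a stand-in; the real one would import tensorflow_io and iterate an s3:// dataset as in the repro at the top of this issue:

```python
import subprocess
import sys

# Stand-in for the repro script; the real child would import
# tensorflow_io, read an s3:// TFRecord, then fall off the end
# so that interpreter teardown runs.
CHILD = 'print("reaches here successfully")'

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print("stdout:", result.stdout.strip())
# 0 = clean exit; -6 would mean the child died with SIGABRT at
# teardown, i.e. the 'pure virtual method called' crash, even
# though its stdout looks successful.
print("exit status:", result.returncode)
```

This is useful in CI because the crash happens after the last user-visible print, so grepping stdout alone would report a false pass.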

rivershah (Author) commented:

Bumping this issue. It needs looking at to ensure the build process is handled correctly.

rivershah (Author) commented:

This problem persists in tensorflow-io==0.37.0. Please fix; this is rendering S3-based I/O unusable without resorting to old versions.
