
S3 filesystem pure virtual method called; terminate called without an active exception #1912

Open
rivershah opened this issue Jan 1, 2024 · 8 comments


rivershah commented Jan 1, 2024

I am getting a core dump during interpreter teardown when using the S3 filesystem. Can I please be given guidance on how to handle this issue? Please see the script to reproduce inside Docker:

FROM tensorflow/tensorflow:2.14.0-gpu

The following environment variables are set:

"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,
import os

import tensorflow as tf
import tensorflow_io as tfio

def illustrate_core_dump():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")
    filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename, "GZIP")

    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    illustrate_core_dump()
    print("reaches here successfully")
    print("something broken during destruction and tf")

    # during interpreter teardown if s3 filesystem used we will get
    # pure virtual method called
    # terminate called without an active exception
    # Aborted (core dumped)

    # gs:// and file:// do not exhibit this issue which don't rely on tfio
TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py 
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
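(Editorial sketch, not a fix from this thread.) Aborts of this kind happen after the script's own work has finished, during interpreter teardown, when static destructors in a native extension run in a bad order. One common way to sidestep that class of crash is to skip teardown entirely once the work is done; the helper name below is hypothetical and this only hides the destructor bug rather than fixing it:

```python
import os
import sys

def run_then_skip_teardown(main) -> None:
    """Run main(), flush output, then exit without interpreter teardown.

    os._exit() bypasses atexit handlers and the C++ static destructors
    of loaded native extensions, which is where 'pure virtual method
    called' aborts at shutdown typically originate.
    """
    main()
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)

if __name__ == "__main__":
    # In the repro above, main would be illustrate_core_dump.
    run_then_skip_teardown(lambda: print("work finished"))
    print("never reached")  # teardown (and this line) is skipped
```

Note that `os._exit()` also skips legitimate cleanup (buffered file writes, atexit hooks), so flush anything important before calling it.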
rivershah (Author) commented:

tensorflow-io==0.34.0 # works
tensorflow-io==0.35.0 # crashing

Can we please verify why the latest release is exhibiting this issue? Thank you.

jpambrun commented:

I had the same issue and it was driving me insane. I have some unrelated custom C++ ops and wasted a day digging into those. I am using S3, and going back to 0.34.0 fixed it.

saimidu commented Feb 6, 2024

I am facing the same issue, but with tensorflow==2.13 and tensorflow-io==0.34.0 (and also with tensorflow-io==0.35.0). There is no straightforward root cause, and reverting to tensorflow-io==0.33.0 fixes it.

I've also faced the same error with tensorflow==2.14 and tensorflow-io==0.35.0, which is the only version that supports TF 2.14 per the compatibility chart in the README.md. Reverting to tensorflow-io==0.33.0 seems to fix it there as well.
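The combinations reported so far can be collected into a small startup guard (editorial sketch; the sets below contain only the versions mentioned in this thread and are not exhaustive):

```python
# (tensorflow, tensorflow-io) pairs reported in this thread to abort
# at interpreter teardown, plus the tfio version reported to work.
REPORTED_BAD = {
    ("2.13", "0.34.0"),
    ("2.13", "0.35.0"),
    ("2.14", "0.35.0"),
}
REPORTED_GOOD_TFIO = "0.33.0"

def combo_status(tf_version: str, tfio_version: str) -> str:
    """Classify a tensorflow/tensorflow-io pair against the reports above."""
    tf_minor = ".".join(tf_version.split(".")[:2])  # "2.14.1" -> "2.14"
    if (tf_minor, tfio_version) in REPORTED_BAD:
        return f"reported bad; try tensorflow-io=={REPORTED_GOOD_TFIO}"
    return "not reported bad in this thread"

print(combo_status("2.14.0", "0.35.0"))
# reported bad; try tensorflow-io==0.33.0
```

In practice this would run right after the imports, comparing `tf.__version__` and `tfio.__version__`.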

saimidu commented Feb 12, 2024

As an update, I followed the build instructions for tensorflow-io (Ubuntu 22.04 and then Python Wheels), and discovered that this particular "pure virtual method called" error does not occur when I use a locally built wheel for tensorflow-io.

Note: the link in the Docker build instructions (https://github.com/tensorflow/io/blob/master/docs/development.md#docker) is broken, and the latest image in tfsigio/tfio is about 2 years old.

rivershah (Author) commented:

@saimidu Is there any chance you can please post the steps you took to build? I tried to build but was thwarted by the issues you mentioned.

saimidu commented Feb 13, 2024

@rivershah I pulled the ubuntu:22.04 image from Docker Hub:

docker run --name tfio_builder -itd ubuntu:22.04 bash
docker exec -it tfio_builder bash

and installed all the packages and Bazel as instructed in https://github.com/tensorflow/io/blob/master/docs/development.md#ubuntu-2204 (without the sudo):

apt-get -y -qq update
apt-get -y -qq install gcc g++ git unzip curl python3-pip python-is-python3 libntirpc-dev
curl -sSOL https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 /usr/local/bin/bazel
chmod +x /usr/local/bin/bazel

python3 --version  # made sure I had python version>=3.9
python3 -m pip install -U pip
git clone https://github.com/tensorflow/io
cd io/
git checkout v0.35.0
pip install "tensorflow==2.14.1"
./configure.sh
export TF_PYTHON_VERSION=3.10
bazel build -s --verbose_failures --copt="-Wno-error=array-parameter=" --copt="-I/usr/include/tirpc" //tensorflow_io/... //tensorflow_io_gcs_filesystem/...

I then followed the instructions at https://github.com/tensorflow/io/blob/master/docs/development.md#python-wheels:

python3 setup.py bdist_wheel --data bazel-bin

Then, within the same container, I was able to validate tf-io's S3 filesystem functionality by trying to checkpoint a model to S3.

I'll need to do some additional work to reproduce the failure I got when copying the generated tf-io wheel out into a different container, since I've terminated all of that setup now.
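(Editorial sketch.) When validating a rebuilt wheel, this regression can be caught automatically by checking the child process's exit status: a teardown abort surfaces as SIGABRT (return code -6 on Linux) even though the script's own prints succeeded. The child script below is a stand-in; the real one would import tensorflow_io and iterate an s3:// dataset as in the repro at the top of this issue:

```python
import subprocess
import sys

# Stand-in for the repro script; the real child would import
# tensorflow_io, read an s3:// TFRecord, then fall off the end
# so that interpreter teardown runs.
CHILD = 'print("reaches here successfully")'

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print("stdout:", result.stdout.strip())
# 0 = clean exit; -6 would mean the child died with SIGABRT at
# teardown, i.e. the 'pure virtual method called' crash, even
# though its stdout looks successful.
print("exit status:", result.returncode)
```

This is useful in CI because the crash happens after the last user-visible print, so grepping stdout alone would report a false pass.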

rivershah (Author) commented:

Bumping this issue. It needs looking at to ensure the build process is handled correctly.

rivershah (Author) commented:

This problem persists in tensorflow-io==0.37.0. Please fix; this is rendering S3-based I/O unusable without resorting to old versions.
