[question] Problems when running batchensemble on TPU #487

Open
pyun-ram opened this issue Aug 26, 2021 · 1 comment

Hi,

Thanks for sharing the awesome codebase!
I am trying to run batchensemble on a TPU in Colab, but have not been able to get it working.
When I run the following command in Colab,

! cd uncertainty-baselines/ && python baselines/cifar/batchensemble.py \
    --data_dir=gs://uncertainty-baselines/tensorflow_datasets \
    --output_dir=gs://uncertainty-baselines/model \
    --download_data=True

the error message is as follows:

2021-08-26 08:38:19.132914: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-08-26 08:38:19.132988: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (898749e7e475): /proc/driver/nvidia/version does not exist
I0826 08:38:20.279230 139626063222656 batchensemble.py:46] Saving checkpoints at gs://uncertainty-baselines/model
I0826 08:38:20.279786 139626063222656 batchensemble.py:58] Use TPU at local
2021-08-26 08:38:20.280855: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-26 08:38:20.287134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.84.65.66:8470}
2021-08-26 08:38:20.287199: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33975}
2021-08-26 08:38:20.304265: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.84.65.66:8470}
2021-08-26 08:38:20.304330: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33975}
2021-08-26 08:38:20.304983: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:33975
I0826 08:38:20.305609 139626063222656 remote.py:237] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
I0826 08:38:20.306017 139626063222656 tpu_strategy_util.py:61] Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.84.65.66:8470
I0826 08:38:20.478643 139626063222656 tpu_strategy_util.py:85] Initializing the TPU system: grpc://10.84.65.66:8470
INFO:tensorflow:Finished initializing TPU system.
I0826 08:38:34.033338 139626063222656 tpu_strategy_util.py:143] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I0826 08:38:34.035219 139626063222656 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0826 08:38:34.035438 139626063222656 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0826 08:38:34.035542 139626063222656 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0826 08:38:34.035627 139626063222656 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 08:38:34.035708 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 08:38:34.036015 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0826 08:38:34.036103 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0826 08:38:34.036188 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0826 08:38:34.036268 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0826 08:38:34.036354 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0826 08:38:34.036433 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0826 08:38:34.036512 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0826 08:38:34.036591 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0826 08:38:34.036669 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0826 08:38:34.036748 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0826 08:38:34.036829 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
W0826 08:38:34.390309 139626063222656 datasets.py:59] Skipped due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/datasets.py", line 54, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
W0826 08:38:34.392557 139626063222656 __init__.py:70] Skipped dataset due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/__init__.py", line 64, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
I0826 08:38:34.393076 139626063222656 datasets.py:134] Building dataset cifar10 with additional kwargs:
{
  "data_dir": "gs://uncertainty-baselines/tensorflow_datasets",
  "download_data": true,
  "validation_percent": 0.0
}
I0826 08:38:34.765078 139626063222656 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: cifar10/3.0.2
I0826 08:38:35.081398 139626063222656 dataset_info.py:358] Load dataset info from /tmp/tmpavayi48ztfds
I0826 08:38:35.084364 139626063222656 dataset_info.py:413] Field info.citation from disk and from code do not match. Keeping the one from code.
I0826 08:38:35.087480 139626063222656 dataset_info.py:413] Field info.splits from disk and from code do not match. Keeping the one from code.
I0826 08:38:35.087698 139626063222656 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
Traceback (most recent call last):
  File "baselines/cifar/batchensemble.py", line 369, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "baselines/cifar/batchensemble.py", line 70, in main
    train_dataset = train_builder.load(batch_size=batch_size)
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 400, in load
    return self._load(preprocess_fn=preprocess_fn, batch_size=batch_size)
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 272, in _load
    self._seed, num=2)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 524, in __iter__
    shape = self._shape_tuple()
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2531, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 681, in sync_executors
2021-08-26 08:38:35.096108: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unsupported algorithm id: 3
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
2021-08-26 08:38:35.461105: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1629967115.460834602","description":"Error received from peer ipv4:10.84.65.66:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0","grpc_status":3}

I am new to TPUs and Colab, so I suspect there may be something wrong with my setup steps.
The following are all the cells of my Colab notebook:

# cell 1
# set up the uncertainty-baselines environment
! git clone https://github.com/google/uncertainty-baselines
! cd uncertainty-baselines && pip install -e .[models,datasets,jax,tests]
# Upgrade tensorflow_datasets to 4.4.0; otherwise it raises an error
# complaining that the try_gcs argument does not exist.
! pip install tensorflow_datasets --upgrade
# cell 2
from google.colab import auth
auth.authenticate_user()
# cell 3
# I hide my project-id here, since I am not sure whether exposing it poses any risk. XD
!gcloud config set project <project-id>
!gsutil mb -p <project-id> -c standard -l us-central1 -b on gs://uncertainty-baselines
# cell 4
! cd uncertainty-baselines/ && python baselines/cifar/batchensemble.py \
    --data_dir=gs://uncertainty-baselines/tensorflow_datasets \
    --output_dir=gs://uncertainty-baselines/model \
    --download_data=True
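
(For reference, a minimal sanity-check cell like the sketch below could confirm which TensorFlow version the Colab client sees and whether the TPU resolves at all; using COLAB_TPU_ADDR as the TPU address is an assumption of this sketch, not something taken from the baselines code.)

# Sanity-check sketch: print the client TF version and list the TPU devices.
# Assumes the Colab TPU runtime exposes the COLAB_TPU_ADDR environment variable.
import os
import tensorflow as tf

print("Client TensorFlow version:", tf.__version__)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))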

I have two guesses:

  • there might be something wrong with my setup steps;
  • there might be something wrong with the TensorFlow version.

The TensorFlow version is 2.7.0-dev20210824.
The TPU is a TPU v2.
The uncertainty-baselines commit is 865d49d.
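
If the second guess is right, one thing that might be worth trying is pinning the client TensorFlow to a stable release that matches the TPU runtime (for example 2.6.0) instead of the 2.7 nightly; this is only a sketch of the idea, not something I have verified:

! pip install tensorflow==2.6.0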

pyun-ram (Author) commented Aug 26, 2021

Since the README mentions that batchensemble should be run on a TPU v3-8,
I also tried running it on a Compute Engine VM with a v3-8 TPU node, but it failed with the same problem.
These are my steps:

In Cloud Shell:

# I hide my project-id here,
# since I am not sure whether exposing it poses any risk. XD
export PROJECT_ID=project-id
gcloud config set project $PROJECT_ID
gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 -b on gs://uncertainty-baselines
gcloud compute tpus execution-groups create \
 --name=uncertainty-baselines \
 --zone=us-central1-b \
 --tf-version=2.6.0 \
 --machine-type=n1-standard-1 \
 --accelerator-type=v3-8

In the VM:

git clone https://github.com/google/uncertainty-baselines
cd uncertainty-baselines && pip install -e .[models,datasets,jax,tests]
export BUCKET=gs://uncertainty-baselines
export TPU_NAME=uncertainty-baselines
export DATA_DIR=$BUCKET/tensorflow_datasets
export OUTPUT_DIR=$BUCKET/model
python3 baselines/cifar/batchensemble.py \
    --tpu=$TPU_NAME \
    --data_dir=$DATA_DIR \
    --output_dir=$OUTPUT_DIR \
    --download_data=True

The error seems to be the same as in my Colab run:

2021-08-26 09:30:53.745303: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-26 09:30:53.745528: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-08-26 09:30:58.652979: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-26 09:30:58.653248: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-26 09:30:58.653342: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (uncertainty-baselines): /proc/driver/nvidia/version does not exist
I0826 09:30:58.794567 139878914160448 batchensemble.py:46] Saving checkpoints at gs://uncertainty-baselines/model
I0826 09:30:58.796198 139878914160448 batchensemble.py:58] Use TPU at uncertainty-baselines
I0826 09:30:58.807567 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:58.847581 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:58.847998 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:58.910928 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:58.942784 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:58.943188 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:59.007248 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:59.042939 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:59.043304 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:59.090867 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:59.119607 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:59.119935 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
2021-08-26 09:30:59.184921: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-26 09:30:59.188754: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.100.189.242:8470}
2021-08-26 09:30:59.190415: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:35077}
2021-08-26 09:30:59.208210: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.100.189.242:8470}
2021-08-26 09:30:59.208396: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:35077}
2021-08-26 09:30:59.208920: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:35077
I0826 09:30:59.209608 139878914160448 remote.py:237] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
I0826 09:30:59.210030 139878914160448 tpu_strategy_util.py:61] Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: uncertainty-baselines
I0826 09:30:59.381586 139878914160448 tpu_strategy_util.py:85] Initializing the TPU system: uncertainty-baselines
INFO:tensorflow:Finished initializing TPU system.
I0826 09:31:05.071858 139878914160448 tpu_strategy_util.py:143] Finished initializing TPU system.
I0826 09:31:05.074776 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:31:05.107393 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:31:05.107806 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:31:05.174555 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:31:05.219382 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:31:05.219723 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
INFO:tensorflow:Found TPU system:
I0826 09:31:05.276664 139878914160448 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0826 09:31:05.277065 139878914160448 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0826 09:31:05.277360 139878914160448 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0826 09:31:05.277621 139878914160448 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 09:31:05.277831 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 09:31:05.278252 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0826 09:31:05.278487 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0826 09:31:05.278706 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0826 09:31:05.278957 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0826 09:31:05.279162 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0826 09:31:05.279375 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0826 09:31:05.279607 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0826 09:31:05.279962 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0826 09:31:05.280166 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0826 09:31:05.280377 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0826 09:31:05.280594 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
W0826 09:31:05.713980 139878914160448 datasets.py:59] Skipped due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/datasets.py", line 54, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
W0826 09:31:05.721032 139878914160448 __init__.py:70] Skipped dataset due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/__init__.py", line 64, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
I0826 09:31:05.722037 139878914160448 datasets.py:134] Building dataset cifar10 with additional kwargs:
{
  "data_dir": "gs://uncertainty-baselines/tensorflow_datasets",
  "download_data": true,
  "validation_percent": 0.0
}
I0826 09:31:06.054608 139878914160448 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: cifar10/3.0.2
I0826 09:31:06.387699 139878914160448 dataset_info.py:358] Load dataset info from /tmp/tmp9iinh0mftfds
I0826 09:31:06.389907 139878914160448 dataset_info.py:413] Field info.citation from disk and from code do not match. Keeping the one from code.
I0826 09:31:06.390269 139878914160448 dataset_info.py:413] Field info.splits from disk and from code do not match. Keeping the one from code.
I0826 09:31:06.390483 139878914160448 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
Traceback (most recent call last):
  File "baselines/cifar/batchensemble.py", line 369, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "baselines/cifar/batchensemble.py", line 70, in main
    train_dataset = train_builder.load(batch_size=batch_size)
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 400, in load
    return self._load(preprocess_fn=preprocess_fn, batch_size=batch_size)
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 272, in _load
    self._seed, num=2)
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 524, in __iter__
2021-08-26 09:31:06.397275: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unsupported algorithm id: 3
    shape = self._shape_tuple()
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2531, in async_wait
    context().sync_executors()
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 681, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
2021-08-26 09:31:06.698395: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1629970266.698308853","description":"Error received from peer ipv4:10.100.189.242:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0","grpc_status":3}
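
(For completeness, the runtime version the TPU node is actually running could be checked with something like the following; the exact output field name is my assumption and may differ between gcloud releases.)

gcloud compute tpus describe uncertainty-baselines \
    --zone=us-central1-b \
    --format="value(tensorflowVersion)"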

pyun-ram changed the title from "Problems when running batchensemble on TPU" to "[question] Problems when running batchensemble on TPU" on Aug 26, 2021