[question] Problems when running batchensemble on TPU #487

Open
pyun-ram opened this issue Aug 26, 2021 · 1 comment

Hi,

Thanks for sharing the awesome codebase!
I am trying to run batchensemble on a TPU in Colab, but have not been able to get it working.
When I run the following command in Colab,

! cd uncertainty-baselines/ && python baselines/cifar/batchensemble.py \
    --data_dir=gs://uncertainty-baselines/tensorflow_datasets \
    --output_dir=gs://uncertainty-baselines/model \
    --download_data=True

the error message is as follows:

2021-08-26 08:38:19.132914: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-08-26 08:38:19.132988: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (898749e7e475): /proc/driver/nvidia/version does not exist
I0826 08:38:20.279230 139626063222656 batchensemble.py:46] Saving checkpoints at gs://uncertainty-baselines/model
I0826 08:38:20.279786 139626063222656 batchensemble.py:58] Use TPU at local
2021-08-26 08:38:20.280855: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-26 08:38:20.287134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.84.65.66:8470}
2021-08-26 08:38:20.287199: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33975}
2021-08-26 08:38:20.304265: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.84.65.66:8470}
2021-08-26 08:38:20.304330: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33975}
2021-08-26 08:38:20.304983: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:33975
I0826 08:38:20.305609 139626063222656 remote.py:237] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
I0826 08:38:20.306017 139626063222656 tpu_strategy_util.py:61] Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.84.65.66:8470
I0826 08:38:20.478643 139626063222656 tpu_strategy_util.py:85] Initializing the TPU system: grpc://10.84.65.66:8470
INFO:tensorflow:Finished initializing TPU system.
I0826 08:38:34.033338 139626063222656 tpu_strategy_util.py:143] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I0826 08:38:34.035219 139626063222656 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0826 08:38:34.035438 139626063222656 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0826 08:38:34.035542 139626063222656 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0826 08:38:34.035627 139626063222656 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 08:38:34.035708 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 08:38:34.036015 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0826 08:38:34.036103 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0826 08:38:34.036188 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0826 08:38:34.036268 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0826 08:38:34.036354 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0826 08:38:34.036433 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0826 08:38:34.036512 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0826 08:38:34.036591 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0826 08:38:34.036669 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0826 08:38:34.036748 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0826 08:38:34.036829 139626063222656 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
W0826 08:38:34.390309 139626063222656 datasets.py:59] Skipped due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/datasets.py", line 54, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
W0826 08:38:34.392557 139626063222656 __init__.py:70] Skipped dataset due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/__init__.py", line 64, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
I0826 08:38:34.393076 139626063222656 datasets.py:134] Building dataset cifar10 with additional kwargs:
{
  "data_dir": "gs://uncertainty-baselines/tensorflow_datasets",
  "download_data": true,
  "validation_percent": 0.0
}
I0826 08:38:34.765078 139626063222656 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: cifar10/3.0.2
I0826 08:38:35.081398 139626063222656 dataset_info.py:358] Load dataset info from /tmp/tmpavayi48ztfds
I0826 08:38:35.084364 139626063222656 dataset_info.py:413] Field info.citation from disk and from code do not match. Keeping the one from code.
I0826 08:38:35.087480 139626063222656 dataset_info.py:413] Field info.splits from disk and from code do not match. Keeping the one from code.
I0826 08:38:35.087698 139626063222656 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
Traceback (most recent call last):
  File "baselines/cifar/batchensemble.py", line 369, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "baselines/cifar/batchensemble.py", line 70, in main
    train_dataset = train_builder.load(batch_size=batch_size)
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 400, in load
    return self._load(preprocess_fn=preprocess_fn, batch_size=batch_size)
  File "/content/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 272, in _load
    self._seed, num=2)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 524, in __iter__
    shape = self._shape_tuple()
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 2531, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/context.py", line 681, in sync_executors
2021-08-26 08:38:35.096108: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unsupported algorithm id: 3
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
2021-08-26 08:38:35.461105: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1629967115.460834602","description":"Error received from peer ipv4:10.84.65.66:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0","grpc_status":3}

I am new to TPUs and Colab, so I suspect there may be something wrong with my setup steps.
The following are all the cells of my Colab notebook:

# cell 1
# set up the uncertainty-baselines environment
! git clone https://github.com/google/uncertainty-baselines
! cd uncertainty-baselines && pip install -e .[models,datasets,jax,tests]
# Upgrade tensorflow_datasets to 4.4.0; otherwise it raises an error
# complaining that the try_gcs argument does not exist.
! pip install tensorflow_datasets --upgrade
# cell 2
from google.colab import auth
auth.authenticate_user()
# cell 3
# I hide my project-id here, since I am not sure whether exposing it poses any risk. XD
!gcloud config set project <project-id>
!gsutil mb -p <project-id> -c standard -l us-central1 -b on gs://uncertainty-baselines
# cell 4
! cd uncertainty-baselines/ && python baselines/cifar/batchensemble.py \
    --data_dir=gs://uncertainty-baselines/tensorflow_datasets \
    --output_dir=gs://uncertainty-baselines/model \
    --download_data=True
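
(For reference, a minimal sanity-check cell like the sketch below could confirm which TensorFlow version the Colab client sees and whether the TPU resolves at all; using COLAB_TPU_ADDR as the TPU address is an assumption of this sketch, not something taken from the baselines code.)

# Sanity-check sketch: print the client TF version and list the TPU devices.
# Assumes the Colab TPU runtime exposes the COLAB_TPU_ADDR environment variable.
import os
import tensorflow as tf

print("Client TensorFlow version:", tf.__version__)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))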

I have two guesses:

  • there might be something wrong with my setup steps;
  • there might be something wrong with the TensorFlow version.

The TensorFlow version is 2.7.0-dev20210824.
The TPU is a TPU v2.
The uncertainty-baselines commit is 865d49d.
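
If the second guess is right, one thing that might be worth trying is pinning the client TensorFlow to a stable release that matches the TPU runtime (for example 2.6.0) instead of the 2.7 nightly; this is only a sketch of the idea, not something I have verified:

! pip install tensorflow==2.6.0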

pyun-ram (Author) commented Aug 26, 2021

Since the README mentions that batchensemble should be run on a TPU v3-8,
I also tried running it on a Compute Engine VM with a v3-8 TPU node, but it failed with the same problem.
These are my steps:

In Cloud Shell:

# I hide my project-id here,
# since I am not sure whether exposing it poses any risk. XD
export PROJECT_ID=project-id
gcloud config set project $PROJECT_ID
gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 -b on gs://uncertainty-baselines
gcloud compute tpus execution-groups create \
 --name=uncertainty-baselines \
 --zone=us-central1-b \
 --tf-version=2.6.0 \
 --machine-type=n1-standard-1 \
 --accelerator-type=v3-8

In the VM:

git clone https://github.com/google/uncertainty-baselines
cd uncertainty-baselines && pip install -e .[models,datasets,jax,tests]
export BUCKET=gs://uncertainty-baselines
export TPU_NAME=uncertainty-baselines
export DATA_DIR=$BUCKET/tensorflow_datasets
export OUTPUT_DIR=$BUCKET/model
python3 baselines/cifar/batchensemble.py \
    --tpu=$TPU_NAME \
    --data_dir=$DATA_DIR \
    --output_dir=$OUTPUT_DIR \
    --download_data=True

The error seems to be the same as in my Colab run:

2021-08-26 09:30:53.745303: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-26 09:30:53.745528: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-08-26 09:30:58.652979: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-26 09:30:58.653248: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-26 09:30:58.653342: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (uncertainty-baselines): /proc/driver/nvidia/version does not exist
I0826 09:30:58.794567 139878914160448 batchensemble.py:46] Saving checkpoints at gs://uncertainty-baselines/model
I0826 09:30:58.796198 139878914160448 batchensemble.py:58] Use TPU at uncertainty-baselines
I0826 09:30:58.807567 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:58.847581 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:58.847998 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:58.910928 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:58.942784 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:58.943188 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:59.007248 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:59.042939 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:59.043304 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:30:59.090867 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:30:59.119607 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:30:59.119935 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
2021-08-26 09:30:59.184921: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-26 09:30:59.188754: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.100.189.242:8470}
2021-08-26 09:30:59.190415: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:35077}
2021-08-26 09:30:59.208210: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.100.189.242:8470}
2021-08-26 09:30:59.208396: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:35077}
2021-08-26 09:30:59.208920: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:35077
I0826 09:30:59.209608 139878914160448 remote.py:237] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
I0826 09:30:59.210030 139878914160448 tpu_strategy_util.py:61] Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: uncertainty-baselines
I0826 09:30:59.381586 139878914160448 tpu_strategy_util.py:85] Initializing the TPU system: uncertainty-baselines
INFO:tensorflow:Finished initializing TPU system.
I0826 09:31:05.071858 139878914160448 tpu_strategy_util.py:143] Finished initializing TPU system.
I0826 09:31:05.074776 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:31:05.107393 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:31:05.107806 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
I0826 09:31:05.174555 139878914160448 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0826 09:31:05.219382 139878914160448 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/nomadic-archway-323412/locations/us-central1-b/nodes/uncertainty-baselines?alt=json
I0826 09:31:05.219723 139878914160448 transport.py:157] Attempting refresh to obtain initial access_token
INFO:tensorflow:Found TPU system:
I0826 09:31:05.276664 139878914160448 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0826 09:31:05.277065 139878914160448 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0826 09:31:05.277360 139878914160448 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0826 09:31:05.277621 139878914160448 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 09:31:05.277831 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0826 09:31:05.278252 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0826 09:31:05.278487 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0826 09:31:05.278706 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0826 09:31:05.278957 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0826 09:31:05.279162 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0826 09:31:05.279375 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0826 09:31:05.279607 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0826 09:31:05.279962 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0826 09:31:05.280166 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0826 09:31:05.280377 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0826 09:31:05.280594 139878914160448 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
W0826 09:31:05.713980 139878914160448 datasets.py:59] Skipped due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/datasets.py", line 54, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
W0826 09:31:05.721032 139878914160448 __init__.py:70] Skipped dataset due to ImportError. Try installing uncertainty baselines with the `datasets` extras.
Traceback (most recent call last):
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/__init__.py", line 64, in <module>
    from uncertainty_baselines.datasets.smcalflow import MultiWoZDataset  # pylint: disable=g-import-not-at-top
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/smcalflow.py", line 40, in <module>
    import seqio
ModuleNotFoundError: No module named 'seqio'
I0826 09:31:05.722037 139878914160448 datasets.py:134] Building dataset cifar10 with additional kwargs:
{
  "data_dir": "gs://uncertainty-baselines/tensorflow_datasets",
  "download_data": true,
  "validation_percent": 0.0
}
I0826 09:31:06.054608 139878914160448 dataset_info.py:443] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: cifar10/3.0.2
I0826 09:31:06.387699 139878914160448 dataset_info.py:358] Load dataset info from /tmp/tmp9iinh0mftfds
I0826 09:31:06.389907 139878914160448 dataset_info.py:413] Field info.citation from disk and from code do not match. Keeping the one from code.
I0826 09:31:06.390269 139878914160448 dataset_info.py:413] Field info.splits from disk and from code do not match. Keeping the one from code.
I0826 09:31:06.390483 139878914160448 dataset_info.py:413] Field info.module_name from disk and from code do not match. Keeping the one from code.
Traceback (most recent call last):
  File "baselines/cifar/batchensemble.py", line 369, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "baselines/cifar/batchensemble.py", line 70, in main
    train_dataset = train_builder.load(batch_size=batch_size)
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 400, in load
    return self._load(preprocess_fn=preprocess_fn, batch_size=batch_size)
  File "/home/XXXXXX/uncertainty-baselines/uncertainty_baselines/datasets/base.py", line 272, in _load
    self._seed, num=2)
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 524, in __iter__
2021-08-26 09:31:06.397275: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unsupported algorithm id: 3
    shape = self._shape_tuple()
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2531, in async_wait
    context().sync_executors()
  File "/home/XXXXXX/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 681, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsupported algorithm id: 3
2021-08-26 09:31:06.698395: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1629970266.698308853","description":"Error received from peer ipv4:10.100.189.242:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 10, Output num: 0","grpc_status":3}
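
(For completeness, the runtime version the TPU node is actually running could be checked with something like the following; the exact output field name is my assumption and may differ between gcloud releases.)

gcloud compute tpus describe uncertainty-baselines \
    --zone=us-central1-b \
    --format="value(tensorflowVersion)"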

pyun-ram changed the title from "Problems when running batchensemble on TPU" to "[question] Problems when running batchensemble on TPU" on Aug 26, 2021