Need help building LVIS locally #5113

Open

JKelle opened this issue Oct 19, 2023 · 11 comments

JKelle commented Oct 19, 2023

What I need help with / What I was wondering
I need help downloading the LVIS dataset to my EC2 instance.

What I've tried so far
First, I copied the changes from #5094.
Then I tried using the SDK to download_and_prepare the dataset as follows:

import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
builder.download_and_prepare()

I also tried passing additional options to the DirectRunner:

import apache_beam as beam
import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)

After around 10 minutes I can see 4 CPUs at near 100% utilization, so I think the builder is working. It runs for a while (30 minutes to a couple of hours, depending on how many workers I specify), then either hits an error or runs out of memory and gets killed. If I remember correctly, this dataset is about 25 GB. My machine has 64 GB of RAM.
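
To see how close the build gets to the 64 GB limit before it is killed, a small watchdog thread can log the resident memory of the builder process and its Beam workers. A minimal sketch, assuming the third-party psutil package is installed:

import threading
import time

import psutil

def log_memory(interval_s=30):
    # Print the RSS of this process plus all Beam worker child processes.
    proc = psutil.Process()
    while True:
        total = proc.memory_info().rss
        for child in proc.children(recursive=True):
            try:
                total += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and inspection
        print(f"RSS incl. workers: {total / 2**30:.1f} GiB")
        time.sleep(interval_s)

# Start before download_and_prepare(); the daemon thread exits with the build.
threading.Thread(target=log_memory, daemon=True).start()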

It would be nice if...
It would be most convenient for me if I could just download an already built version of the dataset so I could avoid needing to build it myself. I don't really understand what goes on during the build. I just need this dataset locally in TFDS format so I can train a model that's been written to consume this dataset in this format. I'd rather not have to learn about Apache Beam and set up Google Cloud infrastructure just to get a 25 GB dataset.

If that's not possible, then it would be nice if I could build the LVIS dataset locally more easily.
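
One cheap thing to try first: tfds.load accepts a try_gcs flag that reads an already-prepared copy from the public TFDS bucket when one exists. Whether lvis is actually mirrored there is an assumption to verify; if it is not, the call simply falls back to a local build.

import tensorflow_datasets as tfds

# try_gcs=True looks for a pre-built copy in the public TFDS GCS bucket
# before building locally. It is an assumption that lvis is mirrored there.
ds = tfds.load("lvis", try_gcs=True)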

Environment information
(if applicable)

  • Operating System: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1103-aws aarch64)
  • Python version: 3.10.13
  • tensorflow version: 2.14.0
  • tensorflow-cpu-aws version: 2.14.0
  • tensorflow-datasets version: 4.9.3
  • tensorflow-io-gcs-filesystem version: 0.34.0
  • apache-beam version: 2.51.0
  • EC2 instance type: r6g.2xlarge
marcenacp (Collaborator) commented Oct 24, 2023

Thanks for reaching out. Unfortunately, we cannot host prepared datasets ourselves, because some datasets have specific licensing terms.

As far as your issue is concerned, Beam usually makes it possible to build much bigger datasets, so the problem seems specific to this one dataset and the code in its builder. We'd like to understand why it runs out of memory. Would it be possible for you to run a heap inspection [1] before it crashes and report your findings?

[1] Python has a few built-in modules and third-party libraries for this, e.g. tracemalloc (standard library), guppy3, or memory-profiler.
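
For instance, a minimal tracemalloc sketch that dumps the largest allocation sites when the build stops. Note that tracemalloc only sees allocations in this process, so with multi_processing workers the interesting allocations may be in the children, and a SIGKILL from the OOM killer skips the finally block entirely (periodic snapshots from a background thread are the more robust variant):

import tracemalloc

import tensorflow_datasets as tfds

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

builder = tfds.builder("lvis")
try:
    builder.download_and_prepare()
finally:
    # Print the 20 largest allocation sites seen so far.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:20]:
        print(stat)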

@Rahulraj0308

@marcenacp, is the dataset hosted on a cloud service like AWS or Google Cloud Storage? If it is, we could download it directly from there. Since you're running into memory limitations, you might also consider an EC2 instance with more memory, such as an r6g.4xlarge or larger, to accommodate the dataset processing.

@phamnhuvu-dev

I have the same problem.
My machine: WSL2, 64 GB RAM, RTX 4090 GPU.

(owl_vit) phamnhuvu@PhamNhuVu:~$ python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_datasets as tfds
>>> ds = tfds.load('lvis')
2024-02-28 19:21:30.443798: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 19:21:30.464700: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-28 19:21:30.464747: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-28 19:21:30.465379: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-28 19:21:30.468686: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-28 19:21:30.832186: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-28 19:21:31.503426: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 25.35 GiB (download: 25.35 GiB, generated: 23.04 GiB, total: 48.39 GiB) to /home/phamnhuvu/tensorflow_datasets/lvis/1.3.0...
Extraction completed...: 0 file [00:00, ? file/s]
Dl Size...: 100% 27215681797/27215681797 [00:00<00:00]
Dl Completed...: 100% 7/7 [00:00<00:00, 1285.92 url/s]
Generating splits...: 0% 0/4 [00:00<?, ? splits/s]
WARNING:absl:**************************** WARNING *********************************
Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset
**********************************************************************
2024-02-28 19:29:01.039584: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[the previous two lines repeat several times]
2024-02-28 19:29:01.354169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-02-28 19:29:01.354253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21458 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
Killed

@rishabh-akridata

Hello, has anyone found a solution for this? I am facing the same issue: the process gets killed after being stuck in the Apache Beam stage for some time.

@phamnhuvu-dev

I don't have any problems with the COCO dataset:

2024-02-28 18:20:46.938048: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 18:20:46.960985: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
2024-02-28 18:20:48.154348: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 19.57 GiB (download: 19.57 GiB, generated: Unknown size, total: 19.57 GiB) to /root/tensorflow_datasets/coco/2017_panoptic/1.1.0...
Extraction completed...: 100% 4/4 [06:41<00:00, 100.48s/ file]
Dl Size...: 100% 20152447948/20152447948 [06:41<00:00, 50141956.99 MiB/s]
Dl Completed...: 100% 3/3 [06:41<00:00, 133.97s/ url]
Extraction completed...: 100% 118287/118287 [00:50<00:00, 2351.85 file/s]
Extraction completed...: 100% 5000/5000 [00:02<00:00, 2260.43 file/s]
Dataset coco downloaded and prepared to /root/tensorflow_datasets/coco/2017_panoptic/1.1.0. Subsequent calls will reuse this data.                                                           
2024-02-28 18:31:43.882797: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:999] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[the previous two lines repeat several times]
2024-02-28 18:31:44.073487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1725] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-02-28 18:31:44.073544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1638] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21194 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9

@rishabh-akridata

@phamnhuvu-dev Okay, but I want to replicate results on the LVIS dataset specifically. Is there any other workaround to get past this issue?

@phamnhuvu-dev

@rishabh-akridata I use the COCO dataset instead of the LVIS dataset.
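
For reference, COCO's builder does not use Apache Beam, so a plain load prepares it in-process on a single machine. A sketch (the 2017 detection config is an assumption about which variant is needed):

import tensorflow_datasets as tfds

# COCO is generated without Beam, so this avoids the DirectRunner
# memory issues described above.
ds = tfds.load("coco/2017")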

rohit901 commented Mar 6, 2024

It would be super helpful if we could download a pre-built LVIS val dataset in TFDS format. Does anyone have a link for it?

rohit901 commented Mar 6, 2024

I tried increasing the number of workers on my machine, which has around 256 cores and 200 GB of memory, but I am still not able to build the TFDS dataset for the LVIS val split.

Can you please guide me? I need this TFDS dataset of the LVIS val split.

2024-03-06 11:24:53.330928: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".
Downloading and preparing dataset 25.35 GiB (download: 25.35 GiB, generated: 23.04 GiB, total: 48.39 GiB) to /l/users/rohit.bharadwaj/RNCDL_extras/owl_vit/data/lvis/1.3.0...
Extraction completed...: 0 file [00:00, ? file/s]
Dl Size...: 100% 27215681797/27215681797 [00:00<00:00]
Dl Completed...: 100% 7/7 [00:00<00:00, 112.66 url/s]
WARNING:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1709710635   nanos: 954519271 } message: "No semi_persistent_directory found: Functions defined in __main__ (interactive session) may fail." log_location: "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py:361" thread: "MainThread"
WARNING:apache_beam.runners.portability.local_job_service:Worker: severity: WARN timestamp {   seconds: 1709710635   nanos: 956524848 } message: "Discarding unparseable args: ['--direct_runner_use_stacked_bundle']" log_location: "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/options/pipeline_options.py:372" thread: "MainThread"

[traceback fragments from multiple Beam worker threads, interleaved; the same error repeats many times]

  File "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/grpc/_channel.py", line 968, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:34189 {created_time:"2024-03-06T11:45:33.860667744+04:00", grpc_status:14, grpc_message:"Socket closed"}"
>
  File "/home/rohit.bharadwaj/.conda/envs/scenic/lib/python3.11/site-packages/apache_beam/runners/worker/data_plane.py", line 669, in _read_inputs
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:37189 {created_time:"2024-03-06T11:45:33.877735885+04:00", grpc_status:14, grpc_message:"Socket closed"}"
>
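
The gRPC "Socket closed" errors above usually mean that Beam worker processes died, not that the network failed. One quick way to confirm an OOM kill is to scan the kernel log. A Linux-only sketch (dmesg may require elevated privileges on some systems):

import subprocess

# Look for OOM-killer activity around the time of the crash; each worker
# the kernel killed leaves an "Out of memory" / "oom-kill" entry.
log = subprocess.run(["dmesg", "--ctime"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Out of memory" in line or "oom-kill" in line.lower():
        print(line)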

@phamnhuvu-dev

Screen.Recording.2024-03-17.at.13.20.37.mov

@marcenacp There is a problem at the extraction step.
