Cannot load huggingface:imagenet-1k dataset due to parse error #5105

Open · Labels: bug

christian-steinmeyer opened this issue on Oct 6, 2023 · 5 comments

@christian-steinmeyer

Short description
When following the instructions here, I cannot download the imagenet-1k dataset from huggingface.

Environment information

  • Operating System: macOS Sonoma 14.0 (23A344)

  • Python version: 3.10.10

  • tensorflow-datasets/tfds-nightly version: 4.9.3.dev202310060044

  • tensorflow/tf-nightly version: 2.13.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions

import tensorflow_datasets as tfds

if __name__ == '__main__':
    ds = tfds.load('huggingface:imagenet-1k', split='train')

Stacktrace

Traceback (most recent call last):
  File "<python_file>.py", line 4, in <module>
    ds = tfds.load('huggingface: imagenet-1k', split='train')
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
    return function(*args, **kwargs)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 633, in load
    dbuilder = _fetch_builder(
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 488, in _fetch_builder
    return builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "<...>/.pyenv/versions/3.10.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
    return function(*args, **kwargs)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 171, in builder
    name, builder_kwargs = naming.parse_builder_name_kwargs(
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/naming.py", line 141, in parse_builder_name_kwargs
    name, parsed_builder_kwargs = _dataset_name_and_kwargs_from_name_str(name)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/naming.py", line 173, in _dataset_name_and_kwargs_from_name_str
    raise ValueError(err_msg)
ValueError: Parsing builder name string huggingface: imagenet-1k failed.
The builder name string must be of the following format:
  dataset_name[/config_name][:version][/kwargs]

  Where:

    * dataset_name and config_name are string following python variable naming.
    * version is of the form x.y.z where {x,y,z} can be any digit or *.
    * kwargs is a comma list separated of arguments and values to pass to
      builder.

  Examples:
    my_dataset
    my_dataset:1.2.*
    my_dataset/config1
    my_dataset/config1:1.*.*
    my_dataset/config1/arg1=val1,arg2=val2
    my_dataset/config1:1.2.3/right=True,foo=bar,rate=1.2

Expected behavior
The parser properly parses the string given in the documentation and downloading the dataset succeeds.

Additional context

christian-steinmeyer added the bug label on Oct 6, 2023
@ccl-core (Collaborator)

Hello @christian-steinmeyer!

As HF and TFDS have different naming rules, you will have to adapt the dataset name to follow TFDS naming: in this case, the correct name is huggingface:imagenet_1k.

As a pointer, you can refer to the from_hf_to_tfds function under:

We will update our documentation so that this is clearer for users!
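
For reference, the reproduction snippet from above would then become the following (only the dataset name changes; everything else is exactly as in the original report):

import tensorflow_datasets as tfds

# Same call as in the reproduction instructions, with the dataset name
# adapted to TFDS naming rules ("-" becomes "_"):
ds = tfds.load('huggingface:imagenet_1k', split='train')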

@christian-steinmeyer (Author)

That worked, thanks! And yes, an update in the documentation would be very helpful!

@christian-steinmeyer (Author)

@ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.

@christian-steinmeyer (Author)

Hi again! I found the tfds_num_proc argument of the huggingface dataset builder. However, it doesn't seem to be what I'm looking for: whether I set it to my CPU count or to half or a quarter of that, no progress is printed during the "Generating training examples..." step; my RAM just fills up until the process eventually crashes.

# IMAGE_DIR and N_JOBS are defined elsewhere in my script.
tfds.load(
    'huggingface:imagenet_1k',
    data_dir=IMAGE_DIR,
    shuffle_files=True,
    builder_kwargs={"tfds_num_proc": N_JOBS},
)

In the meantime, my original attempt (without builder_kwargs) ran through. However, when I use the resulting dataset in a training run, I get tons of warnings like W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect profile or profile 'ICC Profile': 'RGB ': RGB color space not permitted on grayscale PNG. Both of these seem to me like a misconfiguration of the dataset somehow. Or is this expected?
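
(Side note for anyone hitting the same slowness: the sketch below is how I would sanity-check the generation step on a small subset before committing to the full multi-hour run. It relies on the generic download_and_prepare_kwargs / DownloadConfig(max_examples_per_split=...) plumbing; I haven't verified how well the Hugging Face builder honours it.)

import tensorflow_datasets as tfds

# Cap generation at a small number of examples per split so the
# "Generating training examples..." step finishes quickly; useful for
# checking that the pipeline works at all before the full run.
download_config = tfds.download.DownloadConfig(max_examples_per_split=1000)

ds = tfds.load(
    'huggingface:imagenet_1k',
    split='train',
    data_dir=IMAGE_DIR,  # placeholder, as in the snippet above
    download_and_prepare_kwargs={"download_config": download_config},
)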

@maziarzamani

> @ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.

Same problem here. It runs at ~20 examples/s and eventually crashes after a day or so.
