Cannot load huggingface:imagenet-1k dataset due to parse error #5105

Open · Labels: bug

christian-steinmeyer opened this issue on Oct 6, 2023 · 5 comments

@christian-steinmeyer

Short description
When following the instructions here, I cannot download the imagenet-1k dataset from huggingface.

Environment information

  • Operating System: macOS Sonoma 14.0 (23A344)

  • Python version: 3.10.10

  • tensorflow-datasets/tfds-nightly version: 4.9.3.dev202310060044

  • tensorflow/tf-nightly version: 2.13.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions

import tensorflow_datasets as tfds

if __name__ == '__main__':
    ds = tfds.load('huggingface:imagenet-1k', split='train')

Stacktrace

Traceback (most recent call last):
  File "<python_file>.py", line 4, in <module>
    ds = tfds.load('huggingface: imagenet-1k', split='train')
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
    return function(*args, **kwargs)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 633, in load
    dbuilder = _fetch_builder(
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 488, in _fetch_builder
    return builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "<...>/.pyenv/versions/3.10.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
    return function(*args, **kwargs)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 171, in builder
    name, builder_kwargs = naming.parse_builder_name_kwargs(
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/naming.py", line 141, in parse_builder_name_kwargs
    name, parsed_builder_kwargs = _dataset_name_and_kwargs_from_name_str(name)
  File "<venv>/lib/python3.10/site-packages/tensorflow_datasets/core/naming.py", line 173, in _dataset_name_and_kwargs_from_name_str
    raise ValueError(err_msg)
ValueError: Parsing builder name string huggingface: imagenet-1k failed.
The builder name string must be of the following format:
  dataset_name[/config_name][:version][/kwargs]

  Where:

    * dataset_name and config_name are string following python variable naming.
    * version is of the form x.y.z where {x,y,z} can be any digit or *.
    * kwargs is a comma list separated of arguments and values to pass to
      builder.

  Examples:
    my_dataset
    my_dataset:1.2.*
    my_dataset/config1
    my_dataset/config1:1.*.*
    my_dataset/config1/arg1=val1,arg2=val2
    my_dataset/config1:1.2.3/right=True,foo=bar,rate=1.2

Expected behavior
The parser properly parses the string given in the documentation and downloading the dataset succeeds.

Additional context

christian-steinmeyer added the bug label on Oct 6, 2023
@ccl-core (Collaborator)

Hello @christian-steinmeyer!

As HF and TFDS have different naming rules, you will have to adapt the dataset name to follow TFDS naming: in this case, the correct name is huggingface:imagenet_1k.

As a pointer, you can refer to the from_hf_to_tfds function under:

We will update our documentation so that this is clearer for users!
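
For reference, the reproduction snippet from above would then become the following (only the dataset name changes; everything else is exactly as in the original report):

import tensorflow_datasets as tfds

# Same call as in the reproduction instructions, with the dataset name
# adapted to TFDS naming rules ("-" becomes "_"):
ds = tfds.load('huggingface:imagenet_1k', split='train')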

@christian-steinmeyer (Author)

That worked, thanks! And yes, an update in the documentation would be very helpful!

@christian-steinmeyer (Author)

@ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.

@christian-steinmeyer (Author)

Hi again! I found the tfds_num_proc argument of the huggingface dataset builder. However, it doesn't seem to be what I'm looking for: whether I set it to my CPU count or to half or a quarter of that, no progress is printed during the "Generating training examples..." step; my RAM just fills up until the process eventually crashes.

# IMAGE_DIR and N_JOBS are defined elsewhere in my script.
tfds.load(
    'huggingface:imagenet_1k',
    data_dir=IMAGE_DIR,
    shuffle_files=True,
    builder_kwargs={"tfds_num_proc": N_JOBS},
)

In the meantime, my original attempt (without builder_kwargs) ran through. However, when I use the resulting dataset in a training run, I get tons of warnings like W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect profile or profile 'ICC Profile': 'RGB ': RGB color space not permitted on grayscale PNG. Both of these seem to me like a misconfiguration of the dataset somehow. Or is this expected?
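
(Side note for anyone hitting the same slowness: the sketch below is how I would sanity-check the generation step on a small subset before committing to the full multi-hour run. It relies on the generic download_and_prepare_kwargs / DownloadConfig(max_examples_per_split=...) plumbing; I haven't verified how well the Hugging Face builder honours it.)

import tensorflow_datasets as tfds

# Cap generation at a small number of examples per split so the
# "Generating training examples..." step finishes quickly; useful for
# checking that the pipeline works at all before the full run.
download_config = tfds.download.DownloadConfig(max_examples_per_split=1000)

ds = tfds.load(
    'huggingface:imagenet_1k',
    split='train',
    data_dir=IMAGE_DIR,  # placeholder, as in the snippet above
    download_and_prepare_kwargs={"download_config": download_config},
)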

@maziarzamani

> @ccl-core Quick follow-up question: Downloading the dataset worked - however, after generating splits, the load function also includes the step of generating tfrecords (Output "Generating training examples..."), which is pretty slow for me (~20 examples/s). Is there any way to speed this up? I couldn't find anything in the builder config or the download and prepare config. The number of available CPUs doesn't seem to be a factor. For Imagenet-1k, this is taking many hours.

Same problem here. It runs at ~20 examples/s and eventually crashes after a day or so.
