Releases: tensorflow/datasets

v4.8.0

21 Dec 11:09

Added

  • [API] DatasetBuilder's description and citations can be specified in
    dedicated README.md and CITATIONS.bib files, within the dataset package
    (see https://www.tensorflow.org/datasets/add_dataset).
  • Tags can be associated with datasets via a TAGS.txt file. For now, they
    are only used in the generated documentation.
  • [API][Experimental] New ViewBuilder to define datasets as transformations
    of existing datasets. Also adds tfds.transform with functionality to apply
    transformations.
  • Loggers are now also called on tfds.as_numpy(...); the base Logger class
    has a new corresponding method.
  • tfds.core.DatasetBuilder can have a default limit for the number of
    simultaneous downloads. tfds.download.DownloadConfig can override it.
  • tfds.features.Audio supports storing raw audio data for lazy decoding.
  • The number of shards can be overridden when preparing a dataset:
    builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42)).
    Alternatively, you can configure the minimum and maximum shard size if you
    want TFDS to compute the number of shards for you but still keep control
    over the shard sizes (see the sketch below).

v4.7.0

05 Oct 10:23

Added

  • [API] Added TfDataBuilder, which makes it easy to store experimental ad hoc TFDS datasets in notebook-like environments so that they can be versioned, described, and easily shared with teammates (see the sketch after this list).
  • [API] Added options to create format-specific dataset builders, including a number of NLP-specific builders.
  • [API] Added tfds.beam.inc_counter to reduce beam.metrics.Metrics.counter boilerplate
  • [API] Added options to group together existing TFDS datasets into dataset collections and to perform simple operations over them.
  • [Documentation] updates, including:
    • New guide on format-specific dataset builders;
    • New guide on adding new dataset collections to TFDS;
    • Updated TFDS CLI documentation.
  • [TFDS CLI] Supports custom configs through JSON (e.g. tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}')
  • New datasets.
  • Updated datasets:
    • C4 was updated to version 3.1.
    • common_voice was updated to a more recent snapshot.
    • wikipedia was updated with the 20220620 snapshot.
  • New dataset collections, such as xtreme and LongT5
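
A hedged sketch of the TfDataBuilder workflow mentioned above, following the
format-specific builders guide (the dataset name, config, and data_dir are
placeholders, and the exact keyword names should be checked against
tfds.dataset_builders.TfDataBuilder):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    train_ds = tf.data.Dataset.from_tensor_slices({'number': [1, 2, 3]})
    test_ds = tf.data.Dataset.from_tensor_slices({'number': [4, 5]})

    builder = tfds.dataset_builders.TfDataBuilder(
        name='my_experiment',          # placeholder name
        config='single_number',        # placeholder config
        version='1.0.0',
        data_dir='/path/to/data_dir',  # placeholder directory
        split_datasets={'train': train_ds, 'test': test_ds},
        features=tfds.features.FeaturesDict({
            'number': tfds.features.Scalar(dtype=tf.int64),
        }),
        description='Experimental dataset with a single number per example.',
    )
    builder.download_and_prepare()  # writes the versioned dataset to data_dir

    # Teammates with access to data_dir can then load it like any other dataset.
    ds = tfds.load('my_experiment/single_number', data_dir='/path/to/data_dir')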

Changed

  • The base Logger class expects more information to be passed to the as_dataset method. This should only be relevant to people who have implemented and registered custom Logger class(es).
  • You can set DEFAULT_BUILDER_CONFIG_NAME in a DatasetBuilder to change the default config if it shouldn't be the first builder config defined in BUILDER_CONFIGS.

Fixed

  • Various datasets
  • On Linux, when loading a dataset from a directory other than your home (~) directory, a spurious ~ directory is no longer created in the current working directory (fixes #4117).

v4.6.0

02 Jun 09:21

Added

  • Support for community datasets on GCS.
  • [API] tfds.builder_from_directory and tfds.builder_from_directories; see
    https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder
    and the sketch after this list.
  • [API] Dash ("-") support in split names.
  • [API] file_format argument to the download_and_prepare method, allowing
    users to specify an alternative file format in which to store the prepared
    data (e.g. "riegeli").
  • [API] file_format added to the DatasetInfo string representation.
  • [API] Expose the return value of Beam pipelines, allowing users to read
    the Beam metrics.
  • [API] Expose the Feature tf_example_spec publicly.
  • [API] doc kwarg on Features, to describe a feature.
  • [Documentation] Features description is shown on TFDS Catalog.
  • [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
  • [Performance] Parallel load of metadata files.
  • [Testing] TFDS tests are now run using GitHub Actions, with miscellaneous
    improvements such as caching and sharding.
  • [Testing] Improvements to MockFs.
  • New datasets.
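
For example, a prepared dataset can now be loaded straight from its data
directory (the paths below are placeholders):

    import tensorflow_datasets as tfds

    # Load a dataset that was already prepared under a versioned directory.
    builder = tfds.builder_from_directory('/path/to/my_dataset/1.0.0')
    ds = builder.as_dataset(split='train')

    # Or merge several prepared directories into a single builder.
    multi_builder = tfds.builder_from_directories([
        '/path/to/my_dataset/1.0.0',
        '/path/to/more_data/1.0.0',
    ])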

Changed

  • [API] num_shards is now optional in the shard name.

Fixed

  • Various datasets.
  • Dataset builders that are defined ad hoc (e.g. in Colab).
  • Better DatasetNotFoundError messages.
  • Don't set deterministic at the global level but locally in interleave, so
    it only applies to interleave and not to all transformations.
  • Google Drive downloader.

As always, thank you to all contributors!

v4.5.2

31 Jan 15:45

Release notes:

  • Fix import bug on Windows (#3709)
  • Updated documentation

v4.5.1

31 Jan 12:10

Release notes:

  • Fix import bug on Windows (#3709)
  • Add split=tfds.split_for_jax_process('train') (alias of tfds.even_splits('train', n=jax.process_count())[jax.process_index()])
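
For example, in a multi-host JAX setup each process can read its own disjoint
slice of the data ('mnist' is just an example dataset):

    import tensorflow_datasets as tfds

    # Each JAX process gets an equally sized, non-overlapping slice of 'train'.
    ds = tfds.load('mnist', split=tfds.split_for_jax_process('train'))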

v4.5.0

26 Jan 09:44

This is the last version of TFDS supporting Python 3.6. Future versions will require Python 3.7.

  • Better split API:

    • Splits can be selected using shards: split='train[3shard]'
    • Underscore supported in numbers for better readability: split='train[:500_000]'
    • Select the union of all splits with split='all'
    • tfds.even_splits is more precise and flexible (see the sketch after this list):
      • Returns splits of exactly the same size when called with tfds.even_splits('train', n=3, drop_remainder=True)
      • Works on sub-splits, e.g. tfds.even_splits('train[:75%]', n=3), and can even be nested
      • Can be composed with other splits: tfds.even_splits('train', n=3)[0] + 'test'
  • FeatureConnectors:

    • Faster dataset generation (using tfrecords)
    • Features now have serialize_example / deserialize_example methods to encode/decode an example to proto: example_bytes = features.serialize_example(example_data)
    • Audio now supports encoding='zlib' for better compression
    • Feature specs are exposed as protos for better compatibility with other languages
  • Better testing:

    • Mock dataset now supports nested datasets
    • The number of sub-examples can be customized
  • Documentation updates

  • RLDS:

    • Nested dataset features are supported
    • New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes
  • Misc:

    • Create beam pipeline using TFDS as input with tfds.beam.ReadFromTFDS
    • Support setting the file format via tfds build --file_format=tfrecord
    • Typing annotations exposed in tfds.typing
    • tfds.ReadConfig has a new assert_cardinality=False option to disable the cardinality assertion
    • Add tfds.display_progress_bar(True) for functional control of the progress bar
    • Support for a huge number of shards (>99999)
    • DatasetInfo exposes .release_notes
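
A short sketch of the improved tfds.even_splits mentioned above ('mnist' is
just an example dataset):

    import tensorflow_datasets as tfds

    # Three pieces of exactly the same size (remainder examples are dropped).
    splits = tfds.even_splits('train', n=3, drop_remainder=True)
    ds = tfds.load('mnist', split=splits[0])

    # Also works on sub-splits, and composes with other splits.
    subsplits = tfds.even_splits('train[:75%]', n=3)
    combined = tfds.even_splits('train', n=3)[0] + 'test'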

And of course, new datasets, bug fixes,...

Thank you to all our contributors for improving TFDS!

v4.4.0

28 Jul 12:29

API:

  • Add PartialDecoding support, to decode only a subset of the features (for performance); see the sketch after this list
  • The catalog now exposes links to Know Your Data visualisations
  • tfds.as_numpy supports datasets containing None
  • Datasets generated with disable_shuffling=True are now read in generation order.
  • Loading datasets from files now supports custom tfds.features.FeatureConnector
  • tfds.testing.mock_data now supports
    • non-scalar tensors with dtype tf.string
    • builder_from_files and path-based community datasets
  • File format automatically restored (for datasets generated with tfds.builder(..., file_format=)).
  • Many new reinforcement learning datasets
  • Various bug fixes and internal improvements, such as:
    • Dynamically set the number of worker threads during extraction
    • Update the progress bar during download even if downloads are cached
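
A hedged sketch of PartialDecoding, which decodes only the requested features
('coco' and the feature names are illustrative; check tfds.decode.PartialDecoding
for the exact spec format):

    import tensorflow_datasets as tfds

    # Only 'image' and 'objects/label' are decoded; other features are skipped,
    # which can make reading noticeably faster.
    ds = tfds.load(
        'coco',  # illustrative dataset name
        split='train',
        decoders=tfds.decode.PartialDecoding({
            'image': True,
            'objects': {'label': True},
        }),
    )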

Dataset creation:

  • Add tfds.features.LabeledImage for semantic segmentation (like image but with additional info.features['image_label'].name label metadata)
  • Add float32 support for tfds.features.Image (e.g. for depth map)
  • All FeatureConnectors can now have a None dimension anywhere (previously restricted to the first position).
  • tfds.features.Tensor() can have an arbitrary number of dynamic dimensions (Tensor(..., shape=(None, None, 3, None)))
  • tfds.features.Tensor can now be serialised as bytes instead of float/int values, to allow better compression: Tensor(..., encoding='zlib') (see the sketch after this list)
  • Add a script to add TFDS metadata files to existing TFRecord files (see the documentation).
  • New guide on common implementation gotchas
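
For example, a feature spec for a hypothetical dataset combining the new
dynamic dimensions, bytes encoding, and float32 images (the exact constraints
on float32 images should be checked against tfds.features.Image):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    features = tfds.features.FeaturesDict({
        # Dynamic (None) dimensions are now allowed at any position.
        'embeddings': tfds.features.Tensor(
            shape=(None, None, 3, None),
            dtype=tf.float32,
            encoding='zlib',  # serialise as compressed bytes for smaller records
        ),
        # float32 images, e.g. a depth map.
        'depth': tfds.features.Image(shape=(None, None, 1), dtype=tf.float32),
    })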

Thank you all for your support and contribution!

v4.3.0

07 May 13:09

API:

  • Add dataset.info.splits['train'].num_shards to expose the number of shards to the user
  • Add tfds.features.Dataset to have a field containing sub-datasets (e.g. used in RL datasets)
  • Add dtype and tf.uint16 support for tfds.features.Video
  • Add DatasetInfo.license field to record redistribution information
  • Better tfds.benchmark(ds) (compatible with any iterator, not just tf.data; better Colab representation)
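
For instance ('mnist' is just an example dataset):

    import tensorflow_datasets as tfds

    builder = tfds.builder('mnist')
    print(builder.info.splits['train'].num_shards)  # newly exposed shard count

    ds = tfds.load('mnist', split='train').batch(32).prefetch(1)
    tfds.benchmark(ds, batch_size=32)  # first run includes warm-up
    tfds.benchmark(ds, batch_size=32)  # second run benefits from auto-caching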

Other:

  • Faster tfds.as_numpy() (avoids an extra tf.Tensor <-> np.array copy)
  • Better tfds.as_dataframe visualisation (Sequence, ragged tensors, semantic masks with use_colormap)
  • (experimental) Community dataset support, to allow dynamically importing datasets defined outside the TFDS repository.
  • (experimental) Add a Hugging Face compatibility wrapper to use Hugging Face datasets directly in TFDS.
  • (experimental) Riegeli format support
  • (experimental) Add DatasetInfo.disable_shuffling to force examples to be read in generation order.
  • Add .copy and .format methods to GPath objects
  • Many bug fixes

Testing:

  • Supports custom BuilderConfig in DatasetBuilderTest
  • DatasetBuilderTest now has a dummy_data class property which can be used in setUpClass
  • Add add_tfds_id and cardinality support to tfds.testing.mock_data

And of course, many new datasets and datasets updates.

We would like to thank all the TFDS contributors!

v4.2.0

06 Jan 15:41

API:

  • Add tfds build to the CLI. See the documentation.
  • DownloadManager now returns pathlib-like objects
  • Datasets returned by tfds.as_numpy are compatible with len(ds)
  • New tfds.features.Dataset to represent nested datasets
  • Add tfds.ReadConfig(add_tfds_id=True) to add a unique id to each example, ex['tfds_id'] (e.g. b'train.tfrecord-00012-of-01024__123'); see the sketch after this list
  • Add num_parallel_calls option to tfds.ReadConfig to overwrite the default AUTOTUNE option
  • tfds.ImageFolder now supports tfds.decode.SkipDecoding
  • Add multichannel audio support to tfds.features.Audio
  • Better tfds.as_dataframe visualization (ffmpeg video if installed, bounding boxes,...)
  • Add try_gcs to tfds.builder(..., try_gcs=True)
  • Simpler BuilderConfig definition: class VERSION and RELEASE_NOTES are applied to all BuilderConfig. Config description is now optional.
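
A quick sketch of the new read options ('mnist' is just an example dataset):

    import tensorflow_datasets as tfds

    # Attach a unique identifier to every example, e.g. for debugging.
    read_config = tfds.ReadConfig(add_tfds_id=True)
    ds = tfds.load('mnist', split='train', read_config=read_config, try_gcs=True)

    for ex in ds.take(1):
        print(ex['tfds_id'])  # e.g. b'train.tfrecord-00012-of-01024__123'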

Breaking compatibility changes:

  • Removed configs for all text datasets. Only the plain-text version is kept. For example: multi_nli/plain_text -> multi_nli.
  • To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys, which are non-deterministic, and to restrict keys to str, bytes, and int). New errors likely indicate an issue in the dataset implementation.
  • tfds.core.benchmark now returns a pd.DataFrame (instead of a dict)
  • tfds.units is no longer visible in the public API

Bug fixes:

  • Support 0-length sequences with images of dynamic shape (fixes #2616)
  • Progress bar is now correctly updated when copying files.
  • Many bug fixes (GPath consistency with pathlib, S3 compatibility, TQDM visual artifacts, GCS crash on Windows, re-download when checksums are updated, ...)
  • Better debugging and error messages (e.g. human-readable sizes, ...)
  • Allow max_examples_per_splits=0 in tfds build --max_examples_per_splits=0 to test _split_generators only (without _generate_examples).

And of course, many new datasets and datasets updates.

Thank you to the community for the many valuable contributions and for supporting us in this project!

v4.1.0

04 Nov 12:02

  • When generating a dataset, if the download fails for any reason, it is now possible to download the data manually. See the documentation.

  • Simplification of the dataset creation API.

    • We've made it easier to create datasets outside the TFDS repository (see our updated dataset creation guide).
    • _split_generators should now return {'split_name': self._generate_examples(), ...} (existing datasets remain backward compatible); see the sketch at the end of this list.
    • All datasets inherit from tfds.core.GeneratorBasedBuilder. Converting a dataset to Beam now only requires changing _generate_examples (see example and doc).
    • tfds.core.SplitGenerator and tfds.core.BeamBasedBuilder are deprecated and will be removed in a future version.
  • Better pathlib.Path, os.PathLike compatibility:

    • dl_manager.manual_dir now returns a pathlib-like object. Example:
    text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
    • Note: Other dl_manager.download, .extract,... will return pathlib-like objects in future versions
    • FeatureConnector,... and most functions should accept PathLike objects. Let us know if some functions you need are missing.
    • Add a tfds.core.as_path to create pathlib.Path-like objects compatible with GCS (e.g. tfds.core.as_path('gs://my-bucket/labels.csv').read_text()).
  • Other bug fixes and improvement. E.g.

    • Add verify_ssl= option to tfds.download.DownloadConfig to disable SSL certificate validation during download.
    • BuilderConfigs are now compatible with Beam datasets (#2348)
    • --record_checksums now assumes the new dataset-as-folder model
    • tfds.features.Image can accept encoded image bytes directly (useful when used with img_name, img_bytes = dl_manager.iter_archive('images.zip')).
    • The API documentation now shows deprecated methods, and abstract methods to override are now documented.
    • You can generate imagenet2012 with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
  • And of course new datasets
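
A minimal sketch of the simplified creation API described above (the class
name, URL, and file layout are illustrative):

    import os

    import tensorflow as tf
    import tensorflow_datasets as tfds

    class MyDataset(tfds.core.GeneratorBasedBuilder):  # illustrative dataset
      VERSION = tfds.core.Version('1.0.0')

      def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description='Illustrative text dataset.',
            features=tfds.features.FeaturesDict({'text': tfds.features.Text()}),
        )

      def _split_generators(self, dl_manager):
        path = dl_manager.download_and_extract('https://example.org/data.zip')
        # New style: return a dict of split name -> example generator
        # (previously a list of tfds.core.SplitGenerator).
        return {
            'train': self._generate_examples(os.path.join(path, 'train')),
            'test': self._generate_examples(os.path.join(path, 'test')),
        }

      def _generate_examples(self, path):
        for fname in tf.io.gfile.listdir(path):
          with tf.io.gfile.GFile(os.path.join(path, fname)) as f:
            yield fname, {'text': f.read()}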

Thank you to all our contributors for improving TFDS!