Releases: tensorflow/datasets

v4.0.1

09 Oct 17:45
  • Fix tfds.load when generation code isn't present
  • Improve GCS compatibility.

Thanks @carlthome for reporting and fixing the issue.

v4.0.0

06 Oct 19:15

API changes, new features:

  • Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
  • tfds.load can now load a dataset without using the generation class. So tfds.load('my_dataset:1.0.0') can work even if MyDataset.VERSION == '2.0.0' (See #2493).
  • Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
  • tfds.testing.mock_data does not require metadata files anymore!
  • Add tfds.as_dataframe(ds, ds_info) with custom visualisation (example)
  • Add tfds.even_splits to generate subsplits (e.g. tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...])
  • Add new DatasetBuilder.RELEASE_NOTES property
  • tfds.features.Image now supports PNG with 4-channels
  • tfds.ImageFolder now supports custom shape, dtype
  • Downloaded URLs are available through MyDataset.url_infos
  • Add skip_prefetch option to tfds.ReadConfig
  • as_supervised=True support for tfds.show_examples, tfds.as_dataframe
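
The tfds.even_splits example in the list above can be reproduced with a few lines of plain Python. The sketch below is not the TFDS implementation: the helper name even_split_strings and its rounding rule (nearest whole percent) are assumptions chosen only to match the boundaries quoted in the release note.

```python
def even_split_strings(split: str, n: int) -> list:
    """Build n percent-sliced sub-split strings covering `split` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Boundaries in whole percent, rounded to nearest (assumption): n=3 -> 0, 33, 67, 100.
    bounds = [round(100 * i / n) for i in range(n + 1)]
    # Pair consecutive boundaries into 'split[lo%:hi%]' strings.
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]


subsplits = even_split_strings("train", n=3)
print(subsplits)
```

Each returned string uses the same slicing syntax that tfds.load accepts for its split argument, so the sub-splits are contiguous and together cover the whole split.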

Breaking compatibility changes:

  • tfds.as_numpy() now returns an iterable which can be iterated multiple times. To migrate, replace next(ds) with next(iter(ds))
  • Rename tfds.features.text.Xyz -> tfds.deprecated.text.Xyz
  • Remove DatasetBuilder.IN_DEVELOPMENT property
  • Remove tfds.core.disallow_positional_args (use the Python 3 keyword-only * marker instead)
  • tfds.features can now be saved/loaded. You may have to override FeatureConnector.from_json_content and FeatureConnector.to_json_content to support this feature.
  • Stop testing against TF 1.15. Requires Python 3.6.8+.
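
The tfds.as_numpy() change listed above is the usual Python distinction between an iterator and a re-iterable object. The sketch below shows it without TFDS: ReiterableExamples is a hypothetical stand-in class, not part of the library, and the example dictionaries are made up.

```python
class ReiterableExamples:
    """Hypothetical stand-in for an iterable that can be traversed repeatedly."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # Each call returns a fresh iterator, so every loop starts from the beginning.
        return iter(self._examples)


ds = ReiterableExamples([{"label": 0}, {"label": 1}])

# An iterable has no __next__, so the old idiom next(ds) raises TypeError.
try:
    next(ds)
    next_on_iterable_failed = False
except TypeError:
    next_on_iterable_failed = True

first = next(iter(ds))   # migration: wrap the iterable in iter() first
first_pass = list(ds)
second_pass = list(ds)   # a second traversal yields the same examples again
```

Code that called next(ds) directly only needs the extra iter() call; ordinary for loops are unaffected.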

Other bug fixes:

  • Better archive extension detection for dl_manager.download_and_extract
  • Fix tfds.__version__ in TFDS nightly to be PEP440 compliant
  • Fix crash when GCS not available
  • Add a script to detect dead URLs
  • Improved open-source workflow, contributor guide, documentation
  • Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...

And of course, new datasets and dataset updates.

A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@ for being a major contributor.

v3.2.1

12 Aug 10:05
  • Fix an issue with GCS on Windows.

v3.2.0

10 Jul 21:39

Future breaking change:

  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.

New features

API:

  • Add tfds.ImageFolder and tfds.TranslateFolder to easily create custom datasets from your own data.
  • Add tfds.ReadConfig(input_context=) to shard datasets, for better multi-worker compatibility (#1426).
  • The default data_dir can be controlled by the TFDS_DATA_DIR environment variable.
  • Better usability when developing datasets outside TFDS
    • Downloads are always cached
    • Checksums are optional
  • Add tfds.show_statistics(ds_info) to display Facets Overview. Note: This requires the dataset to have been generated with the statistics.
  • Open source various scripts to help deployment/documentation (Generate catalog documentation, export all metadata files,...)
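
As an illustration of the TFDS_DATA_DIR item above, the sketch below shows one way an environment variable can override a default data directory. It is not the TFDS code: resolve_data_dir is a hypothetical helper, the ~/tensorflow_datasets fallback is assumed to be the library default, and the rule that an explicitly passed directory wins over the environment variable is also an assumption.

```python
import os


def resolve_data_dir(explicit_data_dir=None, environ=None):
    """Pick a data directory: explicit argument, then TFDS_DATA_DIR, then a default.

    Hypothetical helper; the precedence order is an assumption, not the TFDS source.
    """
    environ = os.environ if environ is None else environ
    if explicit_data_dir:
        return os.path.expanduser(explicit_data_dir)
    # Assumed fallback location when the variable is not set.
    default = os.path.join("~", "tensorflow_datasets")
    return os.path.expanduser(environ.get("TFDS_DATA_DIR", default))


# Passing a dict instead of os.environ keeps the example free of side effects.
print(resolve_data_dir(environ={"TFDS_DATA_DIR": "/tmp/tfds"}))
```

In practice the variable would be exported in the shell or set in os.environ before the first dataset is loaded.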

Documentation:

  • Catalog displays images (example)
  • Catalog shows which datasets have been recently added and are only available in tfds-nightly

Breaking compatibility change:

  • Fix deterministic example order on Windows when path was used as key (this only impacts a few datasets). Now example order should be the same on all platforms.
  • Remove tfds.load('image_label_folder') in favor of the more user-friendly tfds.ImageFolder

Other:

  • Various performance improvements for both generation and reading (e.g. use __slots__, fix parallelisation bug in tf.data.TFRecordReader,...)
  • Various fixes (typos, type annotations, better error messages, fixing dead links, better Windows compatibility,...)

Thanks to all our contributors who help improve the state of datasets for the entire research community!

v3.1.0

30 Apr 00:18

Breaking compatibility changes:

  • Rename tfds.core.NamedSplit, tfds.core.SplitBase -> tfds.Split. Now tfds.Split.TRAIN,... are instances of tfds.Split
  • Remove deprecated num_shards argument from tfds.core.SplitGenerator. This argument was ignored as shards are automatically computed.

Future breaking compatibility changes:

  • Rename interleave_parallel_reads -> interleave_cycle_length for tfds.ReadConfig.
  • Invert the ds, ds_info argument order for tfds.show_examples
  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.

Other changes:

  • Testing: Add support for custom decoders in tfds.testing.mock_data
  • Documentation: shows which datasets are only present in tfds-nightly
  • Documentation: display images for supported datasets
  • API: Add tfds.builder_cls(name) to access a DatasetBuilder class by name
  • API: Add info.splits['train'].filenames for access to the tf-record files.
  • API: Add tfds.core.add_data_dir to register an additional data dir
  • Remove most ds.with_options which were applied by TFDS. TFDS now uses the tf.data defaults.
  • Other bug fixes and improvements (better error messages, Windows compatibility,...)

Thank you all for your contributions, and helping us make TFDS better for everyone!

v3.0.0

16 Apr 03:03

Breaking changes:

  • Legacy mode tfds.experiment.S3 has been removed
  • New image_classification section. Some datasets have been moved there from images.
  • in_memory argument has been removed from as_dataset/tfds.load (small datasets are now auto-cached).
  • DownloadConfig does not append the dataset name anymore (manual data should be in <manual_dir>/ instead of <manual_dir>/<dataset_name>/)
  • Tests now check that all dl_manager.download urls have registered checksums. To opt-out, add SKIP_CHECKSUMS = True to your DatasetBuilderTestCase.
  • tfds.load now always returns tf.compat.v2.Dataset. If you're still using tf.compat.v1:
    • Use tf.compat.v1.data.make_one_shot_iterator(ds) rather than ds.make_one_shot_iterator()
    • Use isinstance(ds, tf.compat.v2.Dataset) instead of isinstance(ds, tf.data.Dataset)
  • tfds.Split.ALL has been removed from the API.

Future breaking change:

  • The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.
  • num_shards argument of tfds.core.SplitGenerator is currently ignored and will be removed in the next version.

Features:

  • DownloadManager is now picklable (can be used inside Beam pipelines)
  • tfds.features.Audio:
    • Support float as returned value
    • Expose sample_rate through info.features['audio'].sample_rate
    • Support for encoding audio features from file objects
  • Various bug fixes, better error messages, documentation improvements
  • More datasets

Thank you to all our contributors for helping us make TFDS better for everyone!

v2.1.0

25 Feb 21:51

New features:

  • Datasets expose info.dataset_size and info.download_size. Datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can be read with 2.1.0 however).
  • Auto-caching small datasets. in_memory argument is deprecated and will be removed in a future version.
  • Datasets expose their cardinality num_examples = tf.data.experimental.cardinality(ds) (Requires tf-nightly or TF >= 2.2.0)
  • Get the number of examples in a sub-split with: info.splits['train[70%:]'].num_examples
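
To make the sub-split count above concrete, the sketch below estimates how many examples a percent slice selects. It is plain Python, not TFDS code: subsplit_num_examples is a hypothetical helper, it only understands percent slices such as 'train[70%:]', and it assumes boundaries are rounded to the nearest example. The authoritative number is the one reported by info.splits[...].num_examples.

```python
import re


def subsplit_num_examples(spec: str, total: int) -> int:
    """Estimate the size of a percent slice like 'train[70%:]' (hypothetical helper)."""
    match = re.fullmatch(r"\w+\[(?:(\d+)%)?:(?:(\d+)%)?\]", spec)
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    # A missing bound means the start (0%) or the end (100%) of the split.
    start_pct = int(match.group(1) or 0)
    stop_pct = int(match.group(2) or 100)
    # Assumption: percent boundaries are rounded to the nearest example index.
    start = round(start_pct * total / 100)
    stop = round(stop_pct * total / 100)
    return max(stop - start, 0)


# With a made-up split of 50,000 examples, the last 30% holds 15,000 of them.
print(subsplit_num_examples("train[70%:]", total=50_000))
```

Because both slices share the same rounded boundary, 'train[:70%]' and 'train[70%:]' never overlap and always add up to the full split under this assumption.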

v2.0.0

24 Jan 20:02
  • This is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
  • The default versions of all datasets are now using the S3 slicing API. See the guide for details.
  • The previous split API is still available, but is deprecated. If you wrote DatasetBuilders outside the TFDS repository, please make sure they do not use experiments={tfds.core.Experiment.S3: False}. This will be removed in the next version, as well as the num_shards kwarg from SplitGenerator.
  • Several new datasets. Thanks to all the contributors!
  • API changes and new features:
    • shuffle_files defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new read_config parameter in tfds.load.
    • The urls kwarg is renamed to homepage in DatasetInfo
    • Support for nested tfds.features.Sequence and tf.RaggedTensor
    • Custom FeatureConnectors can override the decode_batch_example method for efficient decoding when wrapped inside a tfds.features.Sequence(my_connector)
    • Declaring a dataset in Colab won't register it, which allows re-running the cell without having to change the name
    • Beam datasets can use a tfds.core.BeamMetadataDict to store additional metadata computed as part of the Beam pipeline.
    • Beam datasets' _split_generators accepts an additional pipeline kwarg to define a pipeline shared between all splits.
  • Various other bug fixes and performance improvements. Thank you for all the reports and fixes!

v1.3.0

24 Oct 16:12

Bug fixes and performance improvements.

v1.2.0

20 Aug 08:26

Features

  • Add a shuffle_files argument to tfds.load. The semantics are the same as in builder.as_dataset, which for now means that files are shuffled by default for the TRAIN split and not for other splits. The default will change to always be False in the next release.
  • Most datasets now support the new S3 API (documentation)
  • Support for uint16 PNG images

Misc

  • Fix crash while shuffling on Windows
  • Various documentation improvements

New datasets

  • AFLW2000-3D
  • Amazon_US_Reviews
  • binarized_mnist
  • BinaryAlphaDigits
  • Caltech Birds 2010
  • Coil100
  • DeepWeeds
  • Food101
  • MIT Scene Parse 150
  • RockYou leaked password
  • Stanford Dogs
  • Stanford Online Products
  • Visual Domain Decathlon