Releases: huggingface/datasets

2.19.0

19 Apr 08:46
0d3c746

Dataset Features

  • Add Polars compatibility by @psmyth94 in #6531
    • Convert to a Polars dataframe using .to_polars():
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
  • Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096
    • Save on HF in any file format:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
  • Add mode parameter to Image feature by @mariosasko in #6735
    • Set images to be read in a certain mode, such as "RGB":
      from datasets import Image
      dataset = dataset.cast_column("image", Image(mode="RGB"))
  • Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
    • Run this command to open a PR on a script-based dataset that converts it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
  • Add Dataset.take and Dataset.skip by @lhoestq in #6813
    • Same as IterableDataset.take and IterableDataset.skip:
      ds = ds.take(10)  # take only the first 10 examples

General improvements and bug fixes

New Contributors

Full Changelog: 2.18.0...2.19.0

2.18.0

01 Mar 21:00
ca8409a

Dataset features

  • Make JSON builder support an array of strings by @albertvillanova in #6696
  • Base parquet batch_size on parquet row group size by @lhoestq in #6701
    • Faster cold start for streaming
  • Change default compression argument for JsonDatasetWriter by @Rexhaif in #6659
  • Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in #6660
  • fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in #6687
    • Support latest fsspec up to 2024.2.0

General improvements and bug fixes

New Contributors

Full Changelog: 2.17.1...2.18.0

2.17.1

19 Feb 09:58
5d22682

Bug Fixes

Full Changelog: 2.17.0...2.17.1

2.17.0

09 Feb 10:09
7063357

Dataset Features

General improvements and bug fixes

New Contributors

Full Changelog: 2.16.1...2.17.0

2.16.1

30 Dec 16:46
7b2bcd7

Bug fixes

  • Fix dl_manager.extract returning FileNotFoundError by @lhoestq in #6543
    • Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
  • Fix custom configs from script by @lhoestq in #6544
    • Fix bug where loading a dataset with a loading script using custom arguments would fail
    • e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")

Full Changelog: 2.16.0...2.16.1

2.16.0

22 Dec 14:21
a85fb52

Security features

  • Add trust_remote_code argument by @lhoestq in #6429
    • Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.
    • Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
    • You can already disable custom code by default, without waiting for the next release of datasets, by setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0.
  • Use parquet export if possible by @lhoestq in #6448
    • This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
    • You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
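
The points above can be sketched as follows. This is a hedged illustration, not the library's own code: the helper names repo_url, parquet_export_url, and load_with_explicit_trust are hypothetical, and the example assumes a datasets version that still accepts the trust_remote_code argument. The environment variable is set before datasets is imported, on the assumption that the library reads its configuration at import time.

```python
import os
from urllib.parse import quote

# Disable custom dataset code by default (assumption: this must be set
# before `datasets` is imported, since configuration is read at import time).
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"

HF_DATASETS_BASE = "https://hf.co/datasets"  # base URL as written in the notes above


def repo_url(repo_id: str) -> str:
    """Hypothetical helper: URL where the repository content can be inspected."""
    return f"{HF_DATASETS_BASE}/{repo_id}"


def parquet_export_url(repo_id: str) -> str:
    """Hypothetical helper: URL of the dataset's Parquet export branch."""
    # The ref name contains slashes, so it is percent-encoded in the URL.
    ref = quote("refs/convert/parquet", safe="")
    return f"{HF_DATASETS_BASE}/{repo_id}/tree/{ref}"


def load_with_explicit_trust(repo_id: str, trust: bool = False):
    """Hypothetical wrapper: load a dataset, opting in to custom code explicitly."""
    # Imported lazily so the URL helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, trust_remote_code=trust)


print(repo_url("username/my_dataset"))
print(parquet_export_url("username/my_dataset"))
```

Keeping trust=False as the default in the wrapper means custom code only runs when a caller opts in after inspecting the repository.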

Features

  • Webdataset dataset builder by @lhoestq in #6391
  • Implement get dataset default config name by @albertvillanova in #6511
  • Lazy data files resolution and offline cache reload by @lhoestq in #6493
    • This speeds up the load_dataset step that lists the data files of big repositories (up to 100x), but requires huggingface_hub 0.20 or newer
    • Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
    • Reload a dataset from your cache even if you don't have an internet connection
    • New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
    • Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
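
To make the new cache directory scheme concrete, here is a small sketch that assembles a path of that shape. The function no_script_cache_dir is a hypothetical helper written for illustration, not the library's internal implementation, and the argument values in the usage line are placeholders.

```python
from pathlib import Path
from typing import Optional


def no_script_cache_dir(
    repo_id: str,
    config_name: str,
    version: str,
    commit_sha: str,
    cache_root: Optional[str] = None,
) -> Path:
    """Build a path following the scheme
    <cache_root>/username___dataset_name/config_name/version/commit_sha.
    """
    # Default root mirrors the one shown in the notes above.
    root = (
        Path(cache_root)
        if cache_root is not None
        else Path.home() / ".cache" / "huggingface" / "datasets"
    )
    # The "/" between the username and dataset name becomes "___".
    return root / repo_id.replace("/", "___") / config_name / version / commit_sha


# Placeholder values, for illustration only.
print(no_script_cache_dir("username/dataset_name", "default", "0.0.0", "abc123"))
```

Because the commit SHA is part of the path, a dataset updated on Hugging Face resolves to a different directory than the stale cached copy.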

General improvements and bug fixes

New Contributors

Full Changelog: 2.15.0...2.16.0

2.15.0

16 Nov 08:06
0caf912

What's Changed

New Contributors

Full Changelog: 2.14.7...2.15.0

2.14.7

15 Nov 08:19
bf02cff

Bug Fixes

New Contributors

Full Changelog: 2.14.6...2.14.7

2.14.6

24 Oct 08:15
06c3ffb

What's Changed

New Contributors

Full Changelog: 2.14.5...2.14.6

2.14.5

24 Oct 08:15
1a598a0

Bug fixes

Other improvements

New Contributors

Full Changelog: 2.14.4...2.14.5