Releases · huggingface/datasets

19 Apr 08:46

albertvillanova

2.19.0

0d3c746

2.19.0 Latest

Dataset Features

Add Polars compatibility by @psmyth94 in #6531

Convert to a Polars DataFrame using .to_polars():
```
import polars as pl
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
# group_by replaces the deprecated groupby spelling in recent Polars releases
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)
```

Use Polars formatting to return Polars objects when accessing a dataset:
```
ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```

Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096

Save to the Hugging Face Hub in any of these file formats:
```
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```

Add mode parameter to Image feature by @mariosasko in #6735
- Set images to be read in a certain mode like "RGB"
```
from datasets import Image
dataset = dataset.cast_column("image", Image(mode="RGB"))
```
Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
- Run this command to open a PR on a script-based dataset that converts it to Parquet:
```
datasets-cli convert_to_parquet <dataset_id>
```
Add Dataset.take and Dataset.skip by @lhoestq in #6813
- These behave the same as IterableDataset.take and IterableDataset.skip
```
ds = ds.take(10)  # take only the first 10 examples
```
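
To make the semantics concrete, here is a library-free sketch of what taking and skipping mean on an ordered collection. It assumes that take(n) keeps the first n examples and skip(n) drops them and keeps the rest, as the IterableDataset methods do. The standalone `take` and `skip` helpers below are illustrative stand-ins, not the datasets implementation.

```python
from itertools import islice


def take(examples, n):
    """Return the first n examples (hypothetical stand-in for Dataset.take)."""
    return list(islice(examples, n))


def skip(examples, n):
    """Return everything after the first n examples (stand-in for Dataset.skip)."""
    return list(islice(examples, n, None))


examples = [{"id": i} for i in range(5)]
head = take(examples, 2)  # the first two examples
tail = skip(examples, 2)  # the remaining three examples
```

Under these assumptions, the two results together cover every example exactly once, which is what makes the pair convenient for carving out a small held-out slice.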

General improvements and bug fixes

Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in #6713
fix CastError pickling by @lhoestq in #6712
Expand no-code dataset info with datasets-server info by @mariosasko in #6714
Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in #6715
Fix concurrent script loading with force_redownload by @lhoestq in #6718
get_dataset_default_config_name docstring by @lhoestq in #6723
Deprecate Beam API and download from HF GCS bucket by @mariosasko in #6474
Deprecate Pandas builder by @mariosasko in #6730
Using a registry instead of calling globals for fetching feature types by @psmyth94 in #6727
Update torch_formatter.py by @VarunNSrivastava in #6402
Improve default patterns resolution by @mariosasko in #6704
Transpose images with EXIF Orientation tag by @mariosasko in #6739
Fix missing download_config in get_data_patterns by @lhoestq in #6742
Allow null values in dict columns by @mariosasko in #6743
Fix fsspec tqdm callback by @lhoestq in #6749
chore(deps): bump fsspec by @shcheklein in #6747
Fix offline mode with single config by @lhoestq in #6741
Remove deprecated code by @Wauplin in #6761
fixing the issue 6755(small typo) by @JINO-ROHIT in #6767
remove_columns/rename_columns doc fixes by @mariosasko in #6772
Fix CI by @mariosasko in #6780
rename datasets-server to dataset-viewer by @severo in #6785
Install dependencies with uv in CI by @mariosasko in #6779
Fix cache conflict in _check_legacy_cache2 by @lhoestq in #6792
Fix typo in docs (upload CLI) by @Wauplin in #6802
fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in #6799
#6791 Improve type checking around FAISS by @Dref360 in #6803
Fix --repo-type order in cli upload docs by @lhoestq in #6804
Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in #6806
Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in #6754
Multithreaded downloads by @lhoestq in #6794
Remove os.path.relpath in resolve_patterns by @mariosasko in #6815
Extract data on the fly in packaged builders by @mariosasko in #6784
add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in #6811
Support indexable objects in Dataset.__getitem__ by @mariosasko in #6817
Make convert_to_parquet CLI command create script branch by @albertvillanova in #6809
Fix parquet export infos by @lhoestq in #6822

New Contributors

@VarunNSrivastava made their first contribution in #6402
@shcheklein made their first contribution in #6747
@JINO-ROHIT made their first contribution in #6767
@JonasLoos made their first contribution in #6799
@izhx made their first contribution in #6754
@Modexus made their first contribution in #6811

Full Changelog: 2.18.0...2.19.0

Contributors

severo, shcheklein, and 12 other contributors

01 Mar 21:00

lhoestq

2.18.0

ca8409a

2.18.0

Dataset features

Make JSON builder support an array of strings by @albertvillanova in #6696
Base parquet batch_size on parquet row group size by @lhoestq in #6701
- Faster cold start for streaming
Change default compression argument for JsonDatasetWriter by @Rexhaif in #6659
Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in #6660
fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in #6687
- Support latest fsspec up to 2024.2.0

General improvements and bug fixes

Fix for Incorrect ex_iterable used with multi num_worker by @kq-chen in #6582
- Previously, using PyTorch DDP together with num_workers could assign shards to workers incorrectly and cause errors
Fix imagefolder dataset url by @mariosasko in #6683
Improve error message for gated datasets on load by @lewtun in #6684
Updated Quickstart Notebook link by @Codeblockz in #6685
Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in #6693
Faster xlistdir by @mariosasko in #6698
Update GitHub Actions to Node 20 by @albertvillanova in #6682
Update release instructions by @albertvillanova in #6681
Pass through information about location of cache directory. by @stridge-cruxml in #6677
Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in #6665
Update ruff by @lhoestq in #6706
Silence ruff deprecation messages by @mariosasko in #6707
fix: show correct package name to install biopython by @BioGeek in #6662
Fix data_files when passing data_dir by @lhoestq in #6705
Release: 2.18.0 by @lhoestq in #6708

New Contributors

@Codeblockz made their first contribution in #6685
@gzbfgjf2 made their first contribution in #6693
@stridge-cruxml made their first contribution in #6677
@pmrowla made their first contribution in #6687
@BioGeek made their first contribution in #6662
@Rexhaif made their first contribution in #6659
@mohalisad made their first contribution in #6660
@kq-chen made their first contribution in #6582

Full Changelog: 2.17.1...2.18.0

Contributors

BioGeek, pmrowla, and 10 other contributors

19 Feb 09:58

albertvillanova

2.17.1

5d22682

2.17.1

Bug Fixes

Revert the changes in arrow_writer.py from #6636 by @bryant1410 in #6664
Remove deprecated verbose parameter from CSV builder by @albertvillanova in #6672

Full Changelog: 2.17.0...2.17.1

Contributors

bryant1410 and albertvillanova

09 Feb 10:09

albertvillanova

2.17.0

7063357

2.17.0

Dataset Features

[WebDataset] Audio support and bug fixes by @lhoestq in #6573
Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in #6464
Support data_dir parameter in push_to_hub by @albertvillanova in #6634
Support push_to_hub without org/user to default to logged-in user by @albertvillanova in #6629
Allow concatenation of datasets with mixed structs by @Dref360 in #6587

General improvements and bug fixes

Fix parallel downloads for datasets without scripts by @lhoestq in #6551
Fix imagefolder with one image by @lhoestq in #6556
Fix tests based on datasets that used to have scripts by @lhoestq in #6574
remove eli5 test by @lhoestq in #6583
[IterableDataset] Fix drop_last_batch in map after shuffling or sharding by @lhoestq in #6575
Support standalone yaml by @lhoestq in #6557
Drop redundant None guard. by @xkszltl in #6596
fix os.listdir return name is empty string by @d710055071 in #6581
Fix CI: pyarrow 15, pandas 2.2 and sqlachemy by @lhoestq in #6617
Dedicated RNG object for fingerprinting by @mariosasko in #6606
Migrate from setup.cfg to pyproject.toml by @mariosasko in #6619
keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in #6586
Read GeoParquet files using parquet reader by @weiji14 in #6508
Use schema metadata only if it matches features by @lhoestq in #6616
Raise error on bad split name by @lhoestq in #6626
Disable tqdm bars in non-interactive environments by @mariosasko in #6627
Add with_rank param to Dataset.filter by @mariosasko in #6608
Bump max range of dill to 0.3.8 by @ringohoffman in #6630
Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in #6631
Faster webdataset streaming by @lhoestq in #6578
Multi gpu docs by @lhoestq in #6550
dataset viewer requires no-script by @severo in #6633
Make split slicing consistent with list slicing by @mariosasko in #5891
Do not use Parquet exports if revision is passed by @albertvillanova in #6555
Make CLI test support multi-processing by @albertvillanova in #6628
Fix reload cache with data dir by @lhoestq in #6632
Fix array cast/embed with null values by @mariosasko in #6283
Faster column validation and reordering by @psmyth94 in #6636
Better multi-gpu example by @lhoestq in #6646
Fix missing info when loading some datasets from Parquet export by @lhoestq in #6635
Minor multi gpu doc improvement by @lhoestq in #6649
Document usage of hfh cli instead of git by @lhoestq in #6648

New Contributors

@xkszltl made their first contribution in #6596
@kkoutini made their first contribution in #6464
@JochenSiegWork made their first contribution in #6586
@weiji14 made their first contribution in #6508
@ringohoffman made their first contribution in #6630
@psmyth94 made their first contribution in #6636

Full Changelog: 2.16.1...2.17.0

Contributors

severo, xkszltl, and 10 other contributors

30 Dec 16:46

lhoestq

2.16.1

7b2bcd7

2.16.1

Bug fixes

Fix dl_manager.extract returning FileNotFoundError by @lhoestq in #6543
- Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
Fix custom configs from script by @lhoestq in #6544
- Fix bug where loading a script-based dataset with custom arguments would fail
- e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")

Full Changelog: 2.16.0...2.16.1

Contributors

lhoestq

22 Dec 14:21

lhoestq

2.16.0

a85fb52

2.16.0

Security features

Add trust_remote_code argument by @lhoestq in #6429
- Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.
- Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
- Using the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 you can already disable custom code by default without waiting for the next release of datasets
Use parquet export if possible by @lhoestq in #6448
- This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
- You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
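
As a rough sketch of the points above, the snippet below sets the opt-out environment variable and builds the two URLs mentioned in these notes for a given repository. The helper names `dataset_page_url` and `parquet_export_url` and the example repo id are hypothetical. The snippet also assumes the variable has to be set before datasets is imported for it to take effect.

```python
import os
from urllib.parse import quote

# Disable custom code by default. Assumption: this must run before
# `datasets` is imported so that the library picks the value up.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def dataset_page_url(repo_id: str) -> str:
    """URL where a dataset's repository content can be inspected."""
    return f"https://hf.co/datasets/{repo_id}"


def parquet_export_url(repo_id: str) -> str:
    """URL of the dataset's Parquet export branch (refs/convert/parquet)."""
    ref = quote("refs/convert/parquet", safe="")  # percent-encode the slashes
    return f"{dataset_page_url(repo_id)}/tree/{ref}"


# "username/my_dataset" is a placeholder repo id, not a real dataset.
export_url = parquet_export_url("username/my_dataset")
print(export_url)
```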

Features

Webdataset dataset builder by @lhoestq in #6391
Implement get dataset default config name by @albertvillanova in #6511
Lazy data files resolution and offline cache reload by @lhoestq in #6493
- This speeds up the load_dataset step that lists the data files of big repositories (up to 100x) but requires huggingface_hub 0.20 or newer
- Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
- Reload a dataset from your cache even if you don't have an internet connection
- New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
- Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
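
The new directory scheme can be sketched with pathlib. The `cache_path` helper and all example values below are hypothetical. Only the path layout and the triple-underscore separator between user name and dataset name come from the scheme described above.

```python
from pathlib import PurePosixPath

# Default cache root as written in the release notes ("~" is left unexpanded here).
CACHE_ROOT = PurePosixPath("~/.cache/huggingface/datasets")


def cache_path(repo_id, config_name, version, commit_sha, root=CACHE_ROOT):
    """Build <root>/username___dataset_name/config_name/version/commit_sha."""
    folder = repo_id.replace("/", "___")  # "username/dataset_name" -> "username___dataset_name"
    return root / folder / config_name / version / commit_sha


# All four argument values are placeholders chosen for illustration.
path = cache_path("username/dataset_name", "default", "0.0.0", "abc123")
print(path)
```

Because the commit SHA is the last path component in this layout, each revision of a dataset gets its own directory.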

General improvements and bug fixes

Remove unused argument in _get_data_files_patterns by @lhoestq in #6343
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #6414
Use ruff for formatting by @mariosasko in #6434
Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in #6431
Fix multi gpu map example by @lhoestq in #6415
Better tqdm wrapper by @mariosasko in #6433
Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in #6444
Use filelock package for file locking by @mariosasko in #6445
Fix metadata file resolution when inferred pattern is ** by @mariosasko in #6449
Update hub-docs reference by @mishig25 in #6453
Refactor dill logic by @mariosasko in #6454
Don't require trust_remote_code in inspect_dataset by @lhoestq in #6456
[docs] troubleshooting guide by @MKhalusova in #6424
Missing DatasetNotFoundError by @lhoestq in #6462
Disable benchmarks in PRs by @lhoestq in #6463
More robust temporary directory deletion by @mariosasko in #6426
Fix shard retry mechanism in push_to_hub by @mariosasko in #6461
Use auth to get parquet export by @lhoestq in #6468
Remove delete doc CI by @lhoestq in #6471
Fix CI quality by @albertvillanova in #6473
Fix PermissionError on Windows CI by @albertvillanova in #6477
More robust preupload retry mechanism by @mariosasko in #6479
Add IterableDataset __repr__ by @lhoestq in #6480
Fix max lock length on unix by @lhoestq in #6482
Fix ArrayXD YAML conversion by @mariosasko in #6168
Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in #6486
Fix deprecation warning when building conda package by @albertvillanova in #6425
Make push_to_hub return CommitInfo by @albertvillanova in #6492
docs: add reference Git over SSH by @severo in #6499
Fallback on dataset script if user wants to load default config by @lhoestq in #6498
Don't expand_info in HF glob by @lhoestq in #6469
Fix streaming xnli by @lhoestq in #6503
Pickle support for torch.Generator objects by @mariosasko in #6502
Enable setting config as default when push_to_hub by @albertvillanova in #6500
Better cast error when generating dataset by @lhoestq in #6509
Replace list_files_info with list_repo_tree in push_to_hub by @mariosasko in #6510
Remove deprecated HfFolder by @lhoestq in #6512
Support huggingface-hub pre-releases by @albertvillanova in #6516
Support push_to_hub canonical datasets by @albertvillanova in #6519
Support commit_description parameter in push_to_hub by @albertvillanova in #6520
fix get_metadata_patterns function args error by @d710055071 in #6518
Fix metrics dead link by @qgallouedec in #6491
fix tests by @lhoestq in #6523
Cache backward compatibility with 2.15.0 by @lhoestq in #6514
Preserve order of configs and splits when using Parquet exports by @albertvillanova in #6526

New Contributors

@LZHgrla made their first contribution in #6444
@d710055071 made their first contribution in #6518

Full Changelog: 2.15.0...2.16.0

Contributors

MKhalusova, severo, and 8 other contributors

16 Nov 08:06

albertvillanova

2.15.0

0caf912

2.15.0

What's Changed

Fix typo in Audio dataset documentation by @prassanna-ravishankar in #6222
Add push_to_hub with multiple configs docs by @lhoestq in #6226
Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in #6228
Update README.md by @NinoRisteski in #6233
Don't skip hidden files in dl_manager.iter_files when they are given as input by @mariosasko in #6230
Update README.md by @NinoRisteski in #6223
Remove unused global variables in audio.py by @mariosasko in #6241
Improve error message for missing function parameters by @suavemint in #6232
Fix cast from fixed size list to variable size list by @mariosasko in #6243
Update create_dataset.mdx by @EswarDivi in #6247
[DOCS] Fix typo: Elasticsearch by @leemthompo in #6258
Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in #6251
Temporarily pin tensorflow < 2.14.0 by @albertvillanova in #6264
Fix CI 404 errors by @albertvillanova in #6262
Remove apache_beam import in BeamBasedBuilder._save_info by @mariosasko in #6265
Improve documentation of dataset.from_generator by @hartmans in #6281
Fix parquet columns argument in streaming mode by @lhoestq in #6295
Doc readme improvements by @mariosasko in #6298
Unpin tensorflow maximum version by @mariosasko in #6301
Unpin jax maximum version by @mariosasko in #6300
Fix ArrayXD cast by @mariosasko in #6297
Reduce the number of commits in push_to_hub by @mariosasko in #6269
Fix typo in code example in docs by @bryant1410 in #6307
Update README.md by @smty2018 in #6304
Deterministic set hash by @lhoestq in #6318
docs: resolving namespace conflict, refactored variable by @smty2018 in #6312
Fix typos by @python273 in #6321
Fix commit message formatting in multi-commit uploads by @qgallouedec in #6313
Temporarily pin fsspec < 2023.10.0 by @albertvillanova in #6331
Unpin fsspec by @lhoestq in #6336
Fix use_dataset.mdx by @angel-luis in #6351
Add fsspec version to the datasets-cli env command output by @mariosasko in #6356
Expanduser in save_to_disk() by @Unknown3141592 in #6098
Fix time measuring snippet in docs by @mariosasko in #6367
Temporarily pin pyarrow < 14.0.0 by @albertvillanova in #6375
Fix typo in Dataset.map docstring by @bryant1410 in #6373
Avoid redundant warning when encoding NumPy array as Image by @mariosasko in #6379
Replace deprecated license_file in setup.cfg by @albertvillanova in #6332
Minor release step improvement by @lhoestq in #6339
Fix dependency conflict within CI build documentation by @albertvillanova in #6411
Remove redundant condition in builders by @albertvillanova in #6398
Handle future deprecation argument by @winglian in #6390
Remove token value from warnings by @mariosasko in #6418
Rename audio_classificiation.py to audio_classification.py by @carlthome in #6416
Add pyarrow-hotfix to release docs by @albertvillanova in #6421
Simplify filesystem logic by @mariosasko in #6362
Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in #6423

New Contributors

@prassanna-ravishankar made their first contribution in #6222
@NinoRisteski made their first contribution in #6233
@suavemint made their first contribution in #6232
@EswarDivi made their first contribution in #6247
@leemthompo made their first contribution in #6258
@hartmans made their first contribution in #6281
@smty2018 made their first contribution in #6304
@python273 made their first contribution in #6321
@angel-luis made their first contribution in #6351
@Unknown3141592 made their first contribution in #6098
@winglian made their first contribution in #6390
@carlthome made their first contribution in #6416

Full Changelog: 2.14.7...2.15.0

Contributors

hartmans, winglian, and 15 other contributors

15 Nov 08:19

albertvillanova

2.14.7

bf02cff

2.14.7

Bug Fixes

Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
Fix python formatting for complex types in format_table by @mariosasko in #6368
Support pyarrow 14.0.0 by @albertvillanova in #6378
Do not try to download from HF GCS for generator by @yundai424 in #6372
Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404

New Contributors

@cwallenwein made their first contribution in #6346
@yundai424 made their first contribution in #6372

Full Changelog: 2.14.6...2.14.7

Contributors

albertvillanova, cwallenwein, and 2 other contributors

24 Oct 08:15

lhoestq

2.14.6

06c3ffb

2.14.6

What's Changed

Ignore dataset_info.json in data files resolution by @mariosasko in #6224
Check builder cls default config name in inspect by @lhoestq in #6253
Add support for fsspec>=2023.9.0 by @mariosasko in #6244
Create DefunctDatasetError by @albertvillanova in #6286
Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
Pin upper version of fsspec by @albertvillanova in #6337
Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322

New Contributors

@ap-- made their first contribution in #6334
@ZachNagengast made their first contribution in #6322

Full Changelog: 2.14.5...2.14.6

Contributors

ap--, ZachNagengast, and 3 other contributors

24 Oct 08:15

albertvillanova

2.14.5

1a598a0

2.14.5

Bug fixes

Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
Minor fix in iter_files for hidden files by @mariosasko in #6092
Use yaml instead of get data patterns when possible by @lhoestq in #6154
Fix Parquet loading with columns by @mariosasko in #6160
Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in #6164
PyArrow 13 CI fixes by @mariosasko in #6175
Don't alter input in Features.from_dict by @lhoestq in #6189
Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
Preserve split order in DataFilesDict by @albertvillanova in #6198
Add missing revision argument by @qgallouedec in #6191
Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Fix empty splitinfo json by @lhoestq in #6211
Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
Fix checking patterns to infer packaged builder by @polinaeterna in #6215
Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218

Other improvements

Deprecate Dataset.export by @mariosasko in #6081
Deprecate download_custom by @mariosasko in #6093
Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
Remove unused allowed_extensions param by @albertvillanova in #6135
Export to_iterable_dataset to document by @npuichigo in #6145
[Docs] Add description of select_columns to guide by @unifyh in #6119
Ignore parallel warning in map_nested by @lhoestq in #6148
[docs] Complete to_iterable_dataset by @stevhliu in #6158
Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
Fix import in image_load doc by @mariosasko in #6181
Use object detection images from huggingface/documentation-images by @mariosasko in #6177
Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in #6180

New Contributors

@npuichigo made their first contribution in #6145
@unifyh made their first contribution in #6119

Full Changelog: 2.14.4...2.14.5

Contributors

albertvillanova, npuichigo, and 8 other contributors
