mimic-iv-demo on HuggingFace raises DatasetGenerationCastError #2190

Open

tompollard opened this issue Feb 9, 2024 · 3 comments

tompollard (Member) commented Feb 9, 2024

We have talked a little about trying to integrate the HuggingFace platform with PhysioNet (in particular, making it easier for the HuggingFace community to work with PhysioNet datasets).

A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo. I thought I'd have a quick play around with this.

When attempting to load the dataset using HuggingFace's load_dataset(), I receive a DatasetGenerationCastError:

# Running in Google Colab
!pip install datasets

from datasets import load_dataset
mimic = load_dataset('physionet/mimic-iv-demo')

Traceback:

---------------------------------------------------------------------------
CastError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1988                     try:
-> 1989                         writer.write_table(table)
   1990                     except CastError as cast_error:

8 frames
/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in write_table(self, pa_table, writer_batch_size)
    589         pa_table = pa_table.combine_chunks()
--> 590         pa_table = table_cast(pa_table, self._schema)
    591         if self.embed_local_files:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in table_cast(table, schema)
   2239     if table.schema != schema:
-> 2240         return cast_table_to_schema(table, schema)
   2241     elif table.schema.metadata != schema.metadata:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2193     if sorted(table.column_names) != sorted(features):
-> 2194         raise CastError(
   2195             f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match",

CastError: Couldn't cast
subject_id: int64
hadm_id: int64
admittime: string
dischtime: string
deathtime: string
admission_type: string
admit_provider_id: string
admission_location: string
discharge_location: string
insurance: string
language: string
marital_status: string
race: string
edregtime: string
edouttime: string
hospital_expire_flag: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2220
to
{'subject_id': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

DatasetGenerationCastError                Traceback (most recent call last)
<ipython-input-15-0345be2aa2fc> in <cell line: 1>()
----> 1 mimic = load_dataset('physionet/mimic-iv-demo')

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2572 
   2573     # Download and prepare data
-> 2574     builder_instance.download_and_prepare(
   2575         download_config=download_config,
   2576         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1003                         if num_proc is not None:
   1004                             prepare_split_kwargs["num_proc"] = num_proc
-> 1005                         self._download_and_prepare(
   1006                             dl_manager=dl_manager,
   1007                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1098             try:
   1099                 # Prepare split will record examples associated to the split
-> 1100                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1101             except OSError as e:
   1102                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1858             job_id = 0
   1859             with pbar:
-> 1860                 for job_id, done, content in self._prepare_split_single(
   1861                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1862                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1989                         writer.write_table(table)
   1990                     except CastError as cast_error:
-> 1991                         raise DatasetGenerationCastError.from_cast_error(
   1992                             cast_error=cast_error,
   1993                             builder_name=self.info.builder_name,

DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 15 new columns ({'admission_location', 'race', 'admittime', 'dischtime', 'hadm_id', 'language', 'discharge_location', 'admission_type', 'edregtime', 'edouttime', 'admit_provider_id', 'marital_status', 'insurance', 'hospital_expire_flag', 'deathtime'})

This happened while the csv dataset builder was generating data using

/root/.cache/huggingface/datasets/downloads/5a3898fd1af7dd22d0359508d82978ba6c36a780c8aba0b1b15a9437a90adedc

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
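
For anyone hitting the same error: the csv builder tries to cast every CSV in the repo to a single schema, so one workaround is to point load_dataset at one table at a time. A minimal sketch, assuming the repo mirrors the standard MIMIC-IV directory layout (the hosp/admissions.csv.gz path is an assumption; adjust to the actual file names in the repo):

```python
from datasets import load_dataset

# Load a single table rather than the whole repo, so the csv builder
# only sees one schema. NOTE: the file path below is an assumption
# based on the usual MIMIC-IV layout; check the repo for actual paths.
admissions = load_dataset(
    "physionet/mimic-iv-demo",
    data_files="hosp/admissions.csv.gz",
    split="train",
)
print(admissions.column_names)
```
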
bemoody (Collaborator) commented Feb 29, 2024

> A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo

Please, please, please don't use unversioned URLs :(
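
Independent of the cast error above, datasets does let you pin a repo revision, which would at least make loads reproducible. A sketch (the "v2.2" tag is hypothetical; a real tag or commit SHA on the HF repo would be needed):

```python
from datasets import load_dataset

# Pin the dataset to a fixed revision (tag or commit hash) instead of
# whatever "main" currently points at.
# NOTE: "v2.2" is a hypothetical tag name, used only for illustration.
mimic = load_dataset(
    "physionet/mimic-iv-demo",
    revision="v2.2",
)
```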

tompollard (Member, Author) commented

> Please, please, please don't use unversioned URLs :(

Yeah, that's a good point. I think this was really intended as a trial run. Clearly lots more thought is needed about how to integrate the PhysioNet and HuggingFace platforms in a sensible way.

Without spending too much time thinking about this, as a start I like the idea of:

  1. Adding data/model loader scripts to a new PhysioNet Python package (a rough sketch follows below).
  2. Providing guidance on how to use these scripts with HuggingFace (or, even better, incorporating them into HuggingFace tools).
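
To make (1) concrete, here is a rough sketch of what such a loader could look like; the package, function name, and signature are all hypothetical, not an existing API:

```python
from datasets import load_dataset

def load_physionet_dataset(slug: str, version: str, data_files=None, **kwargs):
    """Load a PhysioNet dataset mirrored on HuggingFace, pinned to a version.

    Hypothetical helper: requiring `version` up front avoids the
    unversioned-URL problem raised above, and `data_files` lets callers
    select a single table to sidestep the schema-cast error.
    """
    return load_dataset(
        f"physionet/{slug}",
        revision=version,       # pin to a tag/commit on the HF repo
        data_files=data_files,  # e.g. one table, to avoid mixed schemas
        **kwargs,
    )

# Example usage (paths and tag are illustrative):
# admissions = load_physionet_dataset(
#     "mimic-iv-demo", version="v2.2", data_files="hosp/admissions.csv.gz"
# )
```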

tompollard (Member, Author) commented

Side note, but I have just switched the https://huggingface.co/datasets/physionet/mimic-iv-demo dataset to "Private", which I think means that anyone who isn't part of the project will get a 404. @bemoody if you have an account on HuggingFace then let me know and I'll add you to the PhysioNet project.
