mimic-iv-demo on HuggingFace raises DatasetGenerationCastError #2190

Open

tompollard opened this issue Feb 9, 2024 · 3 comments

tompollard (Member) commented Feb 9, 2024

We have talked a little about trying to integrate the HuggingFace platform with PhysioNet (in particular, making it easier for the HuggingFace community to work with PhysioNet datasets).

A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo. I thought I'd have a quick play around with this.

When attempting to load the dataset using HuggingFace's load_dataset(), I receive a DatasetGenerationCastError:

# Running in Google Colab
!pip install datasets

from datasets import load_dataset
mimic = load_dataset('physionet/mimic-iv-demo')

Traceback:

---------------------------------------------------------------------------
CastError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1988                     try:
-> 1989                         writer.write_table(table)
   1990                     except CastError as cast_error:

8 frames
/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in write_table(self, pa_table, writer_batch_size)
    589         pa_table = pa_table.combine_chunks()
--> 590         pa_table = table_cast(pa_table, self._schema)
    591         if self.embed_local_files:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in table_cast(table, schema)
   2239     if table.schema != schema:
-> 2240         return cast_table_to_schema(table, schema)
   2241     elif table.schema.metadata != schema.metadata:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2193     if sorted(table.column_names) != sorted(features):
-> 2194         raise CastError(
   2195             f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match",

CastError: Couldn't cast
subject_id: int64
hadm_id: int64
admittime: string
dischtime: string
deathtime: string
admission_type: string
admit_provider_id: string
admission_location: string
discharge_location: string
insurance: string
language: string
marital_status: string
race: string
edregtime: string
edouttime: string
hospital_expire_flag: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2220
to
{'subject_id': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

DatasetGenerationCastError                Traceback (most recent call last)
<ipython-input-15-0345be2aa2fc> in <cell line: 1>()
----> 1 mimic = load_dataset('physionet/mimic-iv-demo')

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2572 
   2573     # Download and prepare data
-> 2574     builder_instance.download_and_prepare(
   2575         download_config=download_config,
   2576         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1003                         if num_proc is not None:
   1004                             prepare_split_kwargs["num_proc"] = num_proc
-> 1005                         self._download_and_prepare(
   1006                             dl_manager=dl_manager,
   1007                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1098             try:
   1099                 # Prepare split will record examples associated to the split
-> 1100                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1101             except OSError as e:
   1102                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1858             job_id = 0
   1859             with pbar:
-> 1860                 for job_id, done, content in self._prepare_split_single(
   1861                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1862                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1989                         writer.write_table(table)
   1990                     except CastError as cast_error:
-> 1991                         raise DatasetGenerationCastError.from_cast_error(
   1992                             cast_error=cast_error,
   1993                             builder_name=self.info.builder_name,

DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 15 new columns ({'admission_location', 'race', 'admittime', 'dischtime', 'hadm_id', 'language', 'discharge_location', 'admission_type', 'edregtime', 'edouttime', 'admit_provider_id', 'marital_status', 'insurance', 'hospital_expire_flag', 'deathtime'})

This happened while the csv dataset builder was generating data using

/root/.cache/huggingface/datasets/downloads/5a3898fd1af7dd22d0359508d82978ba6c36a780c8aba0b1b15a9437a90adedc

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
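
For anyone hitting the same error: the csv builder tries to cast every CSV in the repo to a single schema, so one workaround is to point load_dataset at one table at a time. A minimal sketch, assuming the repo mirrors the standard MIMIC-IV directory layout (the hosp/admissions.csv.gz path is an assumption; adjust to the actual file names in the repo):

```python
from datasets import load_dataset

# Load a single table rather than the whole repo, so the csv builder
# only sees one schema. NOTE: the file path below is an assumption
# based on the usual MIMIC-IV layout; check the repo for actual paths.
admissions = load_dataset(
    "physionet/mimic-iv-demo",
    data_files="hosp/admissions.csv.gz",
    split="train",
)
print(admissions.column_names)
```
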
bemoody (Collaborator) commented Feb 29, 2024

> A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo

Please, please, please don't use unversioned URLs :(
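
Independent of the cast error above, datasets does let you pin a repo revision, which would at least make loads reproducible. A sketch (the "v2.2" tag is hypothetical; a real tag or commit SHA on the HF repo would be needed):

```python
from datasets import load_dataset

# Pin the dataset to a fixed revision (tag or commit hash) instead of
# whatever "main" currently points at.
# NOTE: "v2.2" is a hypothetical tag name, used only for illustration.
mimic = load_dataset(
    "physionet/mimic-iv-demo",
    revision="v2.2",
)
```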

tompollard (Member, Author) commented

> Please, please, please don't use unversioned URLs :(

Yeah, that's a good point. I think this was really intended as a trial run. Clearly lots more thought is needed about how to integrate the PhysioNet and HuggingFace platforms in a sensible way.

Without spending too much time thinking about this, as a start I like the idea of:

  1. Adding data/model loader scripts to a new PhysioNet Python package (a rough sketch follows below).
  2. Providing guidance on how to use these scripts with HuggingFace (or, even better, incorporating them into HuggingFace tools).
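
To make (1) concrete, here is a rough sketch of what such a loader could look like; the package, function name, and signature are all hypothetical, not an existing API:

```python
from datasets import load_dataset

def load_physionet_dataset(slug: str, version: str, data_files=None, **kwargs):
    """Load a PhysioNet dataset mirrored on HuggingFace, pinned to a version.

    Hypothetical helper: requiring `version` up front avoids the
    unversioned-URL problem raised above, and `data_files` lets callers
    select a single table to sidestep the schema-cast error.
    """
    return load_dataset(
        f"physionet/{slug}",
        revision=version,       # pin to a tag/commit on the HF repo
        data_files=data_files,  # e.g. one table, to avoid mixed schemas
        **kwargs,
    )

# Example usage (paths and tag are illustrative):
# admissions = load_physionet_dataset(
#     "mimic-iv-demo", version="v2.2", data_files="hosp/admissions.csv.gz"
# )
```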

tompollard (Member, Author) commented

Side note, but I have just switched the https://huggingface.co/datasets/physionet/mimic-iv-demo dataset to "Private", which I think means that anyone who isn't part of the project will get a 404. @bemoody if you have an account on HuggingFace then let me know and I'll add you to the PhysioNet project.
