
Closes #261 #518

Open · wants to merge 4 commits into main

Conversation

@MFreidank (Contributor) commented Apr 25, 2022

This PR implements a dataloader for BioASQ Task A (the text task) in an attempt to close issue #261. I've tried to stay close to examples/bioasq_task_b.py wherever possible. Please let me know if any changes are required.

Tagging @jason-fries, as we discussed this previously on the issue thread; I also noticed he made some recent changes to bioasq_task_b that I have tried to match in this PR.

I have been able to confirm that my dataloader works across years and have gotten unit-test runs for individual years to pass (I tried 2022 and 2013). However, it's hard for me to do a single clean unit-test run across all configurations, as the dataset sizes are very large (>>10 GB for some of the files) and individual tests take a very long time to run on the machine I have access to.
Could someone help with testing?

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function (a rough usage sketch follows this checklist).
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
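
To make the load_dataset check concrete, here is a rough usage sketch. The script path, config name, and data_dir below are illustrative assumptions, not necessarily the exact values defined in this PR:

```python
# Illustrative sketch only: the script path, config name, and data_dir are assumptions.
import datasets

dataset = datasets.load_dataset(
    "biodatasets/bioasq_task_a/bioasq_task_a.py",  # assumed location of the script added by this PR
    name="bioasq_task_a_source",                   # assumed source-schema config name
    data_dir="/path/to/local/bioasq_files",        # assumed local path to the BioASQ files
)
print(dataset)
```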

@hakunanatasha (Collaborator)
@MFreidank happy to help you run it; do you know if the full dataset is an aggregate of the individual years? If the individual years pass via --subset_id in the unit tests, then this is fine.

I noticed this also requires ijson, a package that is not in our defaults; is this a hard requirement? If so, I'll need to update the requirements file too.

@MFreidank (Contributor, Author)
Hi @hakunanatasha

Thank you for offering your help.

Yes, essentially the full dataset would be the aggregate over all individual years.
My challenge is that, due to the dataset size, running the tests for any single --subset_id takes a long time on my machine (>>8 h for the subsets I tried). This makes a "full" iteration over all implemented subsets difficult and slow, so I was only able to unit-test individual subsets (years 2013 and 2022) and verify that dataset loading works for all subsets.

Regarding ijson: I believe it at least makes our code a lot simpler.
My reason for using it is that loading the data via json.load unfortunately attempts to read 20+ GB into memory in one shot for most of the subsets. This is unlikely to work well on most end-user machines (my own included).
ijson parses the JSON file one object at a time and therefore never holds the whole file in memory at any given point.
This allows users on (nearly) any machine to load the data.
The alternative would be to implement an on-the-fly, object-by-object JSON parser inline (more or less replicating ijson's functionality within our repository), since the Python standard library currently has no direct means of doing this.
I felt that would add enough complexity that having the dependency is warranted.
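
To illustrate, here is a minimal sketch of the streaming approach. The top-level key, field access, and filename are placeholders rather than the exact BioASQ layout:

```python
# Minimal sketch of streaming a large JSON file with ijson.
# Assumes a top-level structure like {"articles": [ {...}, {...}, ... ]};
# the key name and filename are placeholders, not necessarily the real BioASQ schema.
import ijson

def iter_articles(filepath):
    with open(filepath, "rb") as f:
        # "articles.item" yields one element of the "articles" array at a time,
        # so the multi-GB file is never held in memory all at once.
        yield from ijson.items(f, "articles.item")

for article in iter_articles("allMeSH_2022.json"):
    ...  # emit one example per article, e.g. inside _generate_examples()
```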

Please let me know if any of the above is unclear.

@hakunanatasha (Collaborator)
@MFreidank good answers - 8h is a bit tough. Let me see if my machine can handle it.

An on-the-fly JSON parser is overkill; your rationale is more than enough to warrant a new package in the requirements!

@MFreidank (Contributor, Author)
Hi @hakunanatasha, any updates?
I read some things on Discord about major changes being underway.
Please let me know when I should update this PR to ensure it stays current.
