Closes #67 - Add Monero #516

napsternxg · 2022-04-25T02:00:42Z

Fixes #67 - Add Monero

If the following information is NOT present in the issue, please populate:

Name: MoNERo
Description: MoNERo: a Biomedical Gold Standard Corpus for the Romanian Language for part of speech tagging and named entity recognition.
Paper: https://www.racai.ro/en/tools/text/
Data: https://github.com/bigscience-workshop/biomedical/files/8550757/MoNERo.tar.gz

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tested via data loading.

hakunanatasha · 2022-04-27T05:06:20Z

@napsternxg this passes all the unit tests and loads fine, but I noticed something when I run the following:

from datasets import load_dataset
x = load_dataset("biodatasets/monero/monero.py", name="monero_bigbio_kb")["train"]["entities"][-1]

The entities come back empty. Is this intended?

hakunanatasha · 2022-04-27T05:07:31Z

@napsternxg also, I made a small change at the end of the file (near the main call).

napsternxg · 2022-04-30T04:18:14Z

Hi @hakunanatasha thanks. Let me have a look at this. I will address this by early next week.

napsternxg · 2022-05-05T05:57:30Z

Hi @hakunanatasha I checked the entities. They are present. When no entity is present in a doc we see an empty list.
This is a better way to check:

from datasets import load_dataset
data = load_dataset("biodatasets/monero/monero.py", name="monero_bigbio_kb")
data["train"]["entities"][-5:]

This will output:

[[],
 [],
 [],
 [{'id': 'docid-4982-E0',
   'type': 'DISO',
   'text': ['hemipareză spastică'],
   'offsets': [[109, 128]],
   'normalized': []}],
 []]

This means that, of the last 5 docs, only the second-to-last one has any entity.
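To check this across a whole split rather than by eye, something like the sketch below could help. It is only an illustration: `docs_with_entities` is a hypothetical helper, not part of the dataloader, and `last_five` is a stand-in copied from the output above so the snippet runs without the data files.

```python
from typing import Dict, List


def docs_with_entities(entity_lists: List[List[Dict]]) -> List[int]:
    """Return the positions of documents whose entity list is non-empty."""
    return [i for i, entities in enumerate(entity_lists) if entities]


# Stand-in for data["train"]["entities"][-5:], copied from the output above.
last_five = [
    [],
    [],
    [],
    [
        {
            "id": "docid-4982-E0",
            "type": "DISO",
            "text": ["hemiparez\u0103 spastic\u0103"],
            "offsets": [[109, 128]],
            "normalized": [],
        }
    ],
    [],
]

# Position 3 is the second-to-last of the five documents.
print(docs_with_entities(last_five))
```

On the real data, the same helper could be called on `data["train"]["entities"]` to see how many docs carry at least one entity.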

I also added a fix for the entity offsets.
I think this PR is ready to merge.
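For reviewers who want to sanity-check offsets, here is a rough sketch. It assumes each pair in `offsets` is a `[start, end)` character span into the document text, which matches the example entity above (19 characters, 109 to 128). `mismatched_spans` is a hypothetical helper and `doc_text` is made-up filler, not real MoNERo text.

```python
from typing import Dict, List


def mismatched_spans(doc_text: str, entity: Dict) -> List[str]:
    """Return the entity texts that their offsets do NOT slice back to."""
    bad = []
    for span_text, (start, end) in zip(entity["text"], entity["offsets"]):
        if doc_text[start:end] != span_text:
            bad.append(span_text)
    return bad


# Mention from the thread, written with escapes so the length is unambiguous.
mention = "hemiparez\u0103 spastic\u0103"

# Hypothetical document: 109 filler characters, then the mention.
doc_text = " " * 109 + mention

entity = {
    "id": "docid-4982-E0",
    "type": "DISO",
    "text": [mention],
    "offsets": [[109, 128]],
    "normalized": [],
}

# An empty list means every span matches its offsets.
print(mismatched_spans(doc_text, entity))
```

If the offsets were shifted by even one character, the helper would return the affected mention instead of an empty list.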

napsternxg and others added 2 commits April 11, 2022 16:25

Fixes bigscience-workshop#67 - Add monero

ee234c5

Working setup for Monero.

26f7986

Tested via data loading.

napsternxg requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners April 25, 2022 02:00

napsternxg mentioned this pull request Apr 25, 2022

Create a dataset loader for MoNERo #67

Open

hakunanatasha self-assigned this Apr 27, 2022

fix: remove main call

9cee97b

Fixed entity offset

4591ef0

sg-wbi changed the title ~~Fixes #67 - Add Monero~~ Closes #67 - Add Monero May 9, 2022
