Closes #24 #591

Open
wants to merge 6 commits into
base: main
from

Conversation

clancyoftheoverflow
Member

@clancyoftheoverflow clancyoftheoverflow commented May 18, 2022

Hi. I am sorry for the very long delay. This took me much longer than I had planned. I hope it can still be useful.

If the following information is NOT present in the issue, please populate:

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py. The tests returned 2 failures: 1) IDs globally unique: coreferences (tlinks) in the original dataset use two formats for IDs. 2) Check passage offsets: some offsets in the original XML files appear to be incorrect. The tests also returned one error, in "Check multi-label type", which I am not sure how to interpret.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
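For the "IDs globally unique" failure mentioned in the checklist, one possible workaround is to namespace each ID before yielding it. The following is only a minimal sketch, assuming the collision comes from the same local ID being reused by different annotation layers; the helper names and the layer names ("sectime", "tlink") are hypothetical and not taken from the dataloader.

```python
from typing import List


def make_global_id(doc_id: str, kind: str, local_id: str) -> str:
    """Qualify a local ID with its document ID and annotation kind (hypothetical helper)."""
    return f"{doc_id}_{kind}_{local_id}"


def find_duplicate_ids(ids: List[str]) -> List[str]:
    """Return every ID that occurs more than once, in order of first repetition."""
    seen = set()
    duplicates = []
    for identifier in ids:
        if identifier in seen and identifier not in duplicates:
            duplicates.append(identifier)
        seen.add(identifier)
    return duplicates


# Toy data (invented): one document in which two annotation layers reuse "Sectime0".
annotations = [
    ("1", "event", "E0"),
    ("1", "sectime", "Sectime0"),
    ("1", "tlink", "Sectime0"),
]

raw_ids = [local_id for _, _, local_id in annotations]
global_ids = [make_global_id(doc, kind, local_id) for doc, kind, local_id in annotations]

print(find_duplicate_ids(raw_ids))
print(find_duplicate_ids(global_ids))
```

Including the document ID in the prefix also guards against the same local ID appearing in two different files, since the test collects IDs across a whole split.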

======================================================================
ERROR: runTest (__main__.TestDataLoader) [Check multi-label type]
Run all tests that check:

Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 145, in runTest
self.test_multilabel_type(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 636, in test_multilabel_type
match = re.search(_CONNECTORS, feature_type)
File "C:\Users\franc\miniconda3\envs\BigScience\lib\re.py", line 201, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
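The traceback above shows re.search being called with a feature_type that is not a string. The following is a minimal sketch of how this class of error arises and how the offending values could be located, assuming some "type" field holds None or a list instead of a string. The pattern below is a placeholder (the real _CONNECTORS value is not shown in this PR) and find_non_string_types is a hypothetical helper.

```python
import re

# Placeholder pattern: NOT the real _CONNECTORS from tests/test_bigbio.py.
PLACEHOLDER_CONNECTORS = r"[+,|;]"


def search_raises_type_error(value) -> bool:
    """Return True if re.search rejects the value with a TypeError."""
    try:
        re.search(PLACEHOLDER_CONNECTORS, value)
    except TypeError:
        return True
    return False


def find_non_string_types(type_values):
    """Return (index, type name) for every value that is not a str (hypothetical helper)."""
    return [
        (index, type(value).__name__)
        for index, value in enumerate(type_values)
        if not isinstance(value, str)
    ]


# Toy data (invented): a string type, a missing type, and a list-valued type.
type_values = ["EVENT", None, ["OCCURRENCE", "EVIDENTIAL"]]

print([search_raises_type_error(v) for v in type_values])
print(find_non_string_types(type_values))
```

Running a check like this over the "type" fields yielded by _generate_examples() for the bigbio schema may show which field is not a plain string.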

======================================================================
FAIL: runTest (__main__.TestDataLoader) [IDs globally unique]
Run all tests that check:

Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 117, in runTest
self.test_are_ids_globally_unique(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 277, in test_are_ids_globally_unique
self._assert_ids_globally_unique(example, ids_seen=ids_seen)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 258, in _assert_ids_globally_unique
self._assert_ids_globally_unique(elem, ids_seen)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 262, in _assert_ids_globally_unique
self.assertNotIn(v, ids_seen)
AssertionError: 'Sectime0' unexpectedly found in {'TL74', 'T2', 'TL57', 'E60', 'E21', 'TL93', 'TL55', 'E54', 'E25', 'Sectime28', 'T14', 'T11', 'E7', 'E26', 'TL9', 'TL73', 'E57', 'T3', 'Sectime12', 'Sectime18', 'TL39', 'Sectime0', 'T13', 'Sectime25', 'TL17', 'TL5', 'TL11', 'E43', 'TL1', 'Sectime8', 'TL69', 'TL40', 'E15', 'E8', 'TL42', 'E24', 'TL58', 'TL88', 'S0', 'E59', 'TL33', 'TL36', 'Sectime11', 'Sectime22', 'E68', 'TL67', 'T12', 'E71', 'E76', 'TL51', 'S1', 'E73', 'E17', 'E34', '1', 'TL64', 'TL7', 'T1', 'Sectime1', 'Sectime3', 'E40', 'E39', 'TL24', 'E3', 'Sectime23', 'Sectime26', 'TL63', 'E62', 'TL78', 'E30', 'E41', 'Sectime2', 'E28', 'TL75', 'Sectime15', 'Sectime29', 'E58', 'E11', '1-full-passage', 'Sectime4', 'Sectime20', 'TL62', 'TL3', 'TL90', 'E36', 'TL53', 'TL31', 'Sectime7', 'T7', 'E2', 'T6', 'E29', 'TL70', 'Sectime14', 'TL29', 'TL23', 'TL14', 'E48', 'TL56', 'TL10', 'TL68', 'E49', 'TL34', 'TL43', 'Sectime6', 'E14', 'Sectime13', 'E45', 'T0', 'E0', 'E53', 'TL71', 'TL91', 'Sectime16', 'TL2', 'TL44', 'E75', 'TL18', 'TL60', 'Sectime17', 'E23', 'E67', 'E55', 'E31', 'TL22', 'TL95', 'TL13', 'Sectime21', 'TL46', 'TL12', 'E69', 'TL27', 'E51', 'E32', 'TL48', 'E44', 'TL35', 'TL89', 'T9', 'T8', 'TL79', 'E72', 'E66', 'TL47', 'TL59', 'E16', 'E22', 'TL0', 'TL15', 'E4', 'E12', 'TL50', 'Sectime19', 'E37', 'E74', 'TL41', 'TL30', 'Sectime27', 'E27', 'TL20', 'TL49', 'TL83', 'E13', 'TL52', 'T4', 'TL86', 'TL37', 'Sectime5', 'E5', 'E20', 'TL19', 'E46', 'TL45', 'TL8', 'TL54', 'E33', 'TL66', 'TL77', 'E52', 'TL72', 'E61', 'TL85', 'TL82', 'E63', 'TL21', 'E35', 'TL4', 'TL38', 'E6', 'TL76', 'TL92', 'E42', 'Sectime24', 'T5', 'E9', 'E47', 'E56', 'Sectime9', 'T10', 'E1', 'TL26', 'E65', 'E38', 'E18', 'TL28', 'E64', 'TL6', 'TL61', 'TL80', 'TL87', 'Sectime10', 'E19', 'TL84', 'E70', 'E10', 'E50'}

======================================================================
FAIL: runTest (__main__.TestDataLoader) [Check passage offsets]
Run all tests that check:

Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 136, in runTest
self.test_passages_offsets(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 382, in test_passages_offsets
self.assertEqual(example_text[start:end], text[idx], msg)
AssertionError: '9/29/1993\n' != '09/29/1993'

- 9/29/1993
? -
+ 09/29/1993? +
 : Split:train - Example:1 - text:9/29/1993 != text_by_offset:09/29/1993

Ran 1 test in 55.097s

FAILED (failures=2, errors=1)
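The "Check passage offsets" failure compares the document text sliced by each passage's offsets with the passage text itself. The following is a minimal sketch of that comparison, useful for listing every mismatch in a file rather than stopping at the first one. The document text and offsets are invented toy data that reproduce the '9/29/1993\n' versus '09/29/1993' case, and find_offset_mismatches is a hypothetical helper, not part of the test suite.

```python
from typing import List, Tuple

Passage = Tuple[str, Tuple[int, int]]  # (passage text, (start, end))


def find_offset_mismatches(document_text: str, passages: List[Passage]):
    """Return (index, passage text, text found at offsets) for each mismatch."""
    mismatches = []
    for index, (text, (start, end)) in enumerate(passages):
        found = document_text[start:end]
        if found != text:
            mismatches.append((index, text, found))
    return mismatches


# Toy data (invented): the annotation says "09/29/1993" but the raw text
# has "9/29/1993", so the 10-character span also swallows the newline.
document_text = "Admission Date :\n9/29/1993\n"
passages = [
    ("Admission Date :", (0, 16)),
    ("09/29/1993", (17, 27)),
]

for index, expected, found in find_offset_mismatches(document_text, passages):
    print(f"passage {index}: expected {expected!r}, found {found!r}")
```

If the mismatches all follow this pattern (annotation text normalized differently from the raw text), one option might be to take the passage text from the document slice instead of from the XML attribute, but that is a design decision for the maintainers.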
