GUM version #11

Open
amir-zeldes opened this issue Jul 13, 2020 · 4 comments

Comments

@amir-zeldes

I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but the copy there seems to be a rather old version. We now have much more data, including four more genres, bringing the total to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?

@juand-r
Owner

juand-r commented Jul 13, 2020

Thanks for the suggestion.

I can add the newest version of GUM to the table in the README, as well as a copy in the data folder (but in a different format than the older GUM, since I used the CoNLL 2003 format before).

Also, I was thinking I could leave the old version up... in case people need to compare results with work done using that version.

@amir-zeldes
Author

OK - I had a quick look at the data to see the format you're using, and I noticed a few issues with the data that might cause problems:

  • The CoNLL 2003 format has just one level of 'flat' BIO encoding, but GUM has nested (N)NER, so the nested entities are lost in the conversion. For example, 'video gamers' should be labeled as person within 'teams of video gamers' (which is an organization), but the current data only has the outer entity:
teams	B-organization
of	I-organization
video	I-organization
gamers	I-organization
  • GUM's native formats do encode the nesting, so you could just use the original files, but if you want to represent this using BIO encoding and just one set of tags (i.e. no B-lv1-organization, B-lv2-...), you could consider using the format used in LitBank, with multiple BIO columns: https://github.com/dbamman/litbank/blob/master/entities/tsv/105_persuasion_brat.tsv
  • A separate problem is the splits and sentence orders:
    • Sentences seem to be shuffled, so systems can't use information from the previous/next sentence, even though that context may be desirable (e.g. for document-level BERT models). This is especially important since GUM includes entity types for pronouns too, which often can't be resolved with just the current sentence.
    • Sentences from the same documents are in both train and test - this means that a model can appear to work really well because it knows "Vava'u" is a place in test. But this relatively rare place name is only recognized correctly because train happens to contain "Vava'u" too, so the score is probably unrealistically good compared to performance on unseen data.
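As a rough illustration of the multi-column BIO idea mentioned in the list above, here is a minimal Python sketch that turns nested spans into one BIO column per nesting level. The function name `spans_to_bio_columns` and the (start, end, label) span representation are assumptions made for this example, not GUM's or LitBank's actual tooling, and the column order used by LitBank is not verified here; the sketch simply puts longer (outer) spans in earlier columns.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end_exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bio_columns(n_tokens: int, spans: List[Span]) -> List[List[str]]:
    """Return one row per token, with one BIO tag per nesting level."""
    # Longest spans first, so outer entities end up in the first column.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    columns: List[List[str]] = []
    for start, end, label in ordered:
        # Reuse the first column where this span does not collide
        # with an already placed entity; otherwise open a new column.
        for col in columns:
            if all(tag == "O" for tag in col[start:end]):
                break
        else:
            col = ["O"] * n_tokens
            columns.append(col)
        col[start] = f"B-{label}"
        for i in range(start + 1, end):
            col[i] = f"I-{label}"
    # Transpose from per-column lists to per-token rows.
    return [[col[i] for col in columns] for i in range(n_tokens)]


tokens = ["teams", "of", "video", "gamers"]
spans = [(0, 4, "organization"), (2, 4, "person")]
for token, tags in zip(tokens, spans_to_bio_columns(len(tokens), spans)):
    print("\t".join([token] + tags))
```

With the example from this comment, 'video gamers' keeps its person tags in a second column while the organization tags stay in the first.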

GUM has established file splits, which you can find here: https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists

These are the same splits used in the CoNLL shared task on UD parsing, so I'd recommend using the same splits for NER too.
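To make the document-level split concrete, here is a small Python sketch that assigns whole documents to partitions and keeps sentences in their original order. The function `split_by_document`, the document ids, and the `split_docs` mapping are all hypothetical; in practice the mapping would be read from the file lists linked above, whose exact layout is not assumed here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input: (document id, sentence text), in original corpus order.
Sentence = Tuple[str, str]


def split_by_document(
    sentences: List[Sentence], split_docs: Dict[str, Set[str]]
) -> Dict[str, List[Sentence]]:
    """Assign every sentence to the partition of its source document."""
    doc_to_split: Dict[str, str] = {}
    for split, docs in split_docs.items():
        for doc in docs:
            if doc in doc_to_split:
                # A document in two partitions would leak test data into train.
                raise ValueError(f"{doc} is listed in more than one split")
            doc_to_split[doc] = split
    partitions: Dict[str, List[Sentence]] = {split: [] for split in split_docs}
    for doc_id, text in sentences:  # no shuffling: order is preserved
        split = doc_to_split.get(doc_id)
        if split is not None:
            partitions[split].append((doc_id, text))
    return partitions


# Made-up document ids and sentences, for illustration only.
sentences = [
    ("doc_a", "Vava'u is an island group."),
    ("doc_a", "It is popular with sailors."),
    ("doc_b", "Teams of video gamers compete."),
]
split_docs = {"train": {"doc_b"}, "dev": set(), "test": {"doc_a"}}
partitions = split_by_document(sentences, split_docs)
print({name: len(items) for name, items in partitions.items()})
```

Because each document id maps to exactly one partition, a rare name such as "Vava'u" cannot appear in both train and test through sentences of the same document.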

@juand-r
Owner

juand-r commented Jul 21, 2020

Thanks for the LitBank reference. I agree on the benefits of both nested NER annotation and using the surrounding sentence context (I was only training at the individual sentence level and using BIO annotations when I started this, but am glad people are moving beyond that).

I was not aware that GUM had train/test/dev splits -- thanks for pointing that out. I'll use the established file splits.

I was also thinking of structuring this a bit better, indicating which datasets have nested entity encoding, as well as other relevant details.

@amir-zeldes
Author

OK, thanks - let me know if you need any input or help figuring out the GUM documentation. The Coptic dataset also has nested (N)NER, in the same CoNLL-U tabs+brackets format.
