GUM version #11
Comments
Thanks for the suggestion. I can add the newest version of GUM to the table in the README, as well as a copy in the data folder (but in a different format than the older GUM, since I used the CoNLL 2003 format before). Also, I was thinking I could leave the old version up, in case people need to compare results with work done using that version.
OK - I had a quick look at the data to see the format you're using, and I noticed a few issues with the data that might cause problems:
GUM has established file splits, which you can find here: https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists
These are the same splits used in the CoNLL shared task on UD parsing, so I'd recommend using the same splits for NER too.
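As a rough illustration of reusing established splits rather than re-splitting at random, here is a minimal sketch. It assumes each file list has already been read into a list of document names; the function name `assign_splits` and the document names in the usage note are hypothetical and not taken from the GUM repository, whose actual file-list format should be checked first.

```python
from typing import Dict, Iterable, List


def assign_splits(doc_ids: Iterable[str],
                  split_lists: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each document ID to the split whose file list names it.

    Assumption: split_lists maps a split name (e.g. "train") to the
    document names read from that split's file list.
    """
    doc_to_split: Dict[str, str] = {}
    for split, names in split_lists.items():
        for raw in names:
            name = raw.strip()
            if not name:
                continue  # skip blank lines from the file list
            if name in doc_to_split:
                raise ValueError(f"{name!r} is listed in more than one split")
            doc_to_split[name] = split

    ids = list(doc_ids)  # materialize so the input can be any iterable
    missing = [d for d in ids if d not in doc_to_split]
    if missing:
        raise ValueError(f"documents not in any file list: {missing}")
    return {d: doc_to_split[d] for d in ids}
```

For example, with placeholder names, `assign_splits(["doc_c", "doc_a"], {"train": ["doc_a", "doc_b"], "dev": ["doc_c"], "test": ["doc_d"]})` pairs each document with the split that lists it. Raising an error on unlisted or doubly listed documents is a design choice in this sketch, so that a mismatch with the published splits is caught early.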
Thanks for the LitBank reference. I agree on the benefits of both nested NER annotation and using the surrounding context of sentences (I was only training at the individual sentence level and using BIO annotations when I started this, but am glad people are moving beyond that). I was not aware that GUM had train/test/dev splits, so thanks for pointing that out. I'll use the established file splits. I was also thinking of structuring this a bit better, indicating which datasets have nested entity encoding, as well as other relevant details.
OK, thanks - let me know if you need any input or help figuring out the GUM documentation. The Coptic dataset also has nested (N)NER, in the same conllu tabs+brackets format.
I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but it seems to be a rather old version. We now have much more data, including four more genres and bringing up the total word count to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?