Named Entity Recognition and Classification for languages other than EN/FR #136

Open
dddpt opened this issue Oct 28, 2021 · 4 comments

dddpt commented Oct 28, 2021

I am using entity-fishing on a corpus of ~35k documents that exists in French, Italian and German versions.

In the entity-fishing documentation, there is this paragraph:

> The tool currently supports English, German, French, Spanish and Italian languages (more to come!). For English and French, a Name Entity Recognition based on CRF grobid-ner is used in combination with the disambiguation. For each recognized entity in one language, it is possible to complement the result with crosslingual information in the other languages. A nbest mode is available. Domain information are produced for a large amount of entities in the technical and scientific fields, together with Wikipedia categories and confidence scores.

What does this mean for non-English/French texts?
Is another named entity recognition system used?
Should I expect worse entity recognition results for German and Italian?

The German Wikipedia has the best coverage of the topics in my corpus, so I was thinking of focusing on the German version of the corpus. Now I'm wondering whether I should instead focus on the French version, hoping for better recognition performance. Any hints?

Thanks for this great tool! :-)

Owner

kermitt2 commented Feb 8, 2022

Hello @dddpt !

Sorry for the slow response :(

For non-English/French texts, no NER is used, which means "terms" are selected only via the Wikipedia anchors of that language.
So if the German Wikipedia has the best coverage for a given domain, the missing NER is not a problem: the anchors will be very rich and disambiguation will happen more often.
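
To make "selected via Wikipedia anchors" more concrete, here is a minimal sketch (not entity-fishing's actual code) of how anchor texts could be harvested from wikitext links of the form `[[Target|anchor text]]` into a term dictionary. The function name `collect_anchors`, the simplified regular expression and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]] and [[Target|anchor text]]. Deliberately simplified:
# nested markup, files, categories and other special cases are ignored.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")

def collect_anchors(wikitext):
    """Map each anchor text to a Counter of the article titles it links to."""
    anchors = defaultdict(Counter)
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the anchor text is the target title itself.
        anchor = (match.group(2) or target).strip()
        anchors[anchor][target] += 1
    return anchors

# Hypothetical snippet of German wikitext.
sample = (
    "Der [[Rhein]] fliesst durch [[Basel]]. "
    "Der [[Kanton Basel-Stadt|Kanton]] liegt am [[Rhein|Fluss]]."
)
anchors = collect_anchors(sample)
```

In this toy input, "Fluss" becomes a term pointing to the article "Rhein", which illustrates why a Wikipedia with rich linking in a domain yields a rich vocabulary for that language.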

NER is nice for general text (journalism, history, ...) because the named entity classes are very general. It does not help for more specialized domains, such as scientific ones: there, the Wikipedia vocabulary brings reliable terms, whereas NER is often noisy.

Author

dddpt commented Feb 8, 2022

Hi @kermitt2,

Thanks for the answer ;-)

A Wikipedia anchor is the text of a link from one Wikipedia article to another, right?

So it means that each time a term (or a sequence of terms?) corresponds to any anchor in the Wikipedia of the corresponding language, it is recognized as an entity?
Doesn't it label almost every word as an entity?

(and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)

Owner

kermitt2 commented Feb 8, 2022

> So it means that each time a term (or a sequence of terms?) corresponds to any anchor in the Wikipedia of the corresponding language, it is recognized as an entity?

It is recognized as an entity candidate; this is more or less how all entity linking tools work (although often not at full scale). In English, for instance, there are 206 million "terms" (anchors, plus article titles and synonyms, single-word or multi-word) considered by entity-fishing for every input. Each of these 206 million terms is associated with one or several Wikidata entities.
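
As a rough illustration of this candidate lookup (a sketch under assumptions, not how entity-fishing is implemented), every n-gram of the input can be checked against a term dictionary. The dictionary `TERMS`, the limit `MAX_NGRAM` and the function `find_candidates` are hypothetical, and the entity ids are placeholders rather than real Wikidata identifiers.

```python
# Toy term dictionary: surface form (lowercased) -> candidate entity ids.
# The ids are placeholders, not real Wikidata identifiers.
TERMS = {
    "paris": ["Q_PARIS_CITY", "Q_PARIS_MYTH"],
    "new york": ["Q_NEW_YORK_CITY", "Q_NEW_YORK_STATE"],
    "new york times": ["Q_NYT_NEWSPAPER"],
    "times": ["Q_TIMES_NEWSPAPER"],
}
MAX_NGRAM = 3  # longest term length, in tokens, that is looked up

def find_candidates(text):
    """Return (mention, candidate ids) for every n-gram found in TERMS."""
    tokens = text.lower().split()
    found = []
    for start in range(len(tokens)):
        for size in range(1, MAX_NGRAM + 1):
            if start + size > len(tokens):
                break
            mention = " ".join(tokens[start:start + size])
            if mention in TERMS:
                found.append((mention, TERMS[mention]))
    return found

candidates = find_candidates("The New York Times opened an office in Paris")
```

Overlapping mentions ("new york", "new york times", "times") are all kept at this stage; choosing among them is left to the later disambiguation step.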

> Doesn't it label almost every word as an entity?

Indeed, plenty of words and multi-word terms (what I call "mentions") might be considered, leading to a massive number of entity candidates. The challenge is to 1) select the most likely correct entity candidate and 2) decide whether that most likely candidate is acceptable (that is, reject some "linkings" because the term is used as a common word, not as a reference to a particular entity). Only a few candidates are finally selected as entity labels.

The "best" mentions and entities are selected by learning from the disambiguation made by Wikipedia contributors when they add anchors in Wikipedia.
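
The two steps above can be sketched with a common baseline from the entity linking literature: a prior probability per candidate and a link probability per term, both estimated from anchor statistics. This is only an assumption-laden illustration; the actual selection in entity-fishing is learned and is not reproduced here. All counts, ids, the threshold `MIN_LINK_PROBABILITY` and the function `link` are invented.

```python
# Invented counts, for illustration only.
# How often an anchor text links to each entity in Wikipedia:
ANCHOR_TARGETS = {
    "paris": {"Q_PARIS_CITY": 950, "Q_PARIS_MYTH": 50},
    "times": {"Q_TIMES_NEWSPAPER": 30},
}
# How often the term occurs in Wikipedia text at all, linked or not:
TERM_OCCURRENCES = {"paris": 1250, "times": 6000}

MIN_LINK_PROBABILITY = 0.1  # hypothetical rejection threshold

def link(mention):
    """Return (entity, prior) for a mention, or None if the link is rejected."""
    targets = ANCHOR_TARGETS.get(mention)
    if not targets:
        return None
    linked = sum(targets.values())
    # Step 2: reject terms that contributors rarely turn into links,
    # since they are mostly used as common words.
    if linked / TERM_OCCURRENCES[mention] < MIN_LINK_PROBABILITY:
        return None
    # Step 1: pick the entity that contributors linked to most often.
    best = max(targets, key=targets.get)
    return best, targets[best] / linked

paris = link("paris")
times = link("times")
```

With these made-up numbers, "paris" is linked to the city candidate, while "times" is rejected because it is rarely used as an anchor relative to how often it appears.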

> (and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)

There is this presentation at WikiDataCon: https://grobid.s3.amazonaws.com/presentations/29-10-2017.pdf

Author

dddpt commented Feb 8, 2022

Great, thanks for the detailed reply 👍
