Clarify NER-related background material in Analyze_Model_Outputs.ipynb #114

frreiss · 2020-09-08T22:58:10Z

In the notebook notebooks/Analyze_Model_Outputs.ipynb (see here), some of the terminology used may be unfamiliar to a newcomer to NLP. In particular, this paragraph could use a gentler introduction to the concepts of named entity recognition and token-level error rate:

IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Most tokens in a typical NER corpus will be tagged O, so any measure of error rate in terms of tokens will over-emphasize the tokens that are part of entities. Token-level error rate implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can result in marking an incorrect answer as correct. Consider a case where the correct sequence of labels is B, B, I but the model has output B, I, I; in this case, the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.
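The B, B, I versus B, I, I case can be checked with a few lines of plain Python. This is only a sketch for illustration, not code from the notebook: the helper `iob2_to_spans` is a hypothetical name, tags are untyped (`B`, `I`, `O` with no entity type suffix), and a stray `I` with no preceding `B` is leniently treated as starting a span.

```python
def iob2_to_spans(tags):
    """Convert a list of IOB2 tags to a set of (begin, end) token spans.

    `end` is exclusive. Hypothetical helper, written for this example only.
    """
    spans = set()
    begin = None
    for i, tag in enumerate(tags):
        # A B or O tag closes any span that is currently open.
        if tag in ("B", "O") and begin is not None:
            spans.add((begin, i))
            begin = None
        if tag == "B":
            begin = i
        elif tag == "I" and begin is None:
            # Assumption: a stray I with no preceding B starts a new span.
            begin = i
    # Close a span that runs to the end of the sequence.
    if begin is not None:
        spans.add((begin, len(tags)))
    return spans


gold = ["B", "B", "I"]
model = ["B", "I", "I"]

# Naive token-level comparison: position-by-position tag equality.
token_matches = [g == m for g, m in zip(gold, model)]

# Entity-level comparison: a span counts only if both boundaries match.
gold_spans = iob2_to_spans(gold)
model_spans = iob2_to_spans(model)
correct_spans = gold_spans & model_spans

print(token_matches)        # [True, False, True]
print(sorted(gold_spans))   # [(0, 1), (1, 3)]
print(sorted(model_spans))  # [(0, 3)]
print(len(correct_spans))   # 0
```

The token-level view credits the model with two of three tokens, including the last one, while the span-level view shows that the model found neither of the two gold entities.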

We should add more Markdown text to this notebook in two places:

1. At the beginning, there should be a more detailed explanation of named entity recognition models, ideally with a visual illustration of NER model outputs (perhaps drawn by some Python code using displaCy).
2. The above paragraph should be expanded out with a more detailed explanation of what happens when you use token classification (instead of entity extraction) as the basis for computing model quality.
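As one possible starting point for the second item, the effect of the dominant O tag could be shown with a toy sentence in which the model finds no entities at all. The sketch below is an assumption about what such an illustration might look like, not existing notebook code; the `spans` helper and the ten-token example are made up for this purpose, and tags are untyped.

```python
def spans(tags):
    """Return the set of (begin, end) spans in an untyped IOB2 tag list.

    Hypothetical helper for this example; stray I tags are ignored.
    """
    out, begin = set(), None
    # A sentinel O at the end closes a span that reaches the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag != "I" and begin is not None:
            out.add((begin, i))
            begin = None
        if tag == "B":
            begin = i
    return out


# Ten tokens, one two-token entity; the model predicts O everywhere.
gold = ["O", "O", "B", "I", "O", "O", "O", "O", "O", "O"]
model = ["O"] * 10

# Token classification view: fraction of tags that match.
token_accuracy = sum(g == m for g, m in zip(gold, model)) / len(gold)

# Entity extraction view: fraction of gold entities the model recovered.
gold_spans, model_spans = spans(gold), spans(model)
entity_recall = len(gold_spans & model_spans) / len(gold_spans)

print(token_accuracy)  # 0.8
print(entity_recall)   # 0.0
```

A model that never outputs an entity still scores 80% at the token level here, which is the kind of gap between the two measures that the expanded text would need to explain.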

frreiss added the documentation, help wanted, and good first issue labels on Sep 8, 2020
