Case-sensitive handling of stopwords and overrides #680

Open
bgyori opened this issue Apr 24, 2020 · 6 comments

Comments

@bgyori
Contributor

bgyori commented Apr 24, 2020

This is related to clulab/bioresources#32, and also clulab/bioresources#30. Currently, stopwords and overrides are used in a case-insensitive way with the following behavior:

  • If there is a grounding entry that matches a stopword in a case-insensitive way, the entity is extracted and the grounding is always produced. Example where this is a problem: M2 is a chemical name, whereas m2 ought to be a stopword (mostly used to mean meter squared). Another example is IMPACT which is a perfectly valid gene name if capitalized like this (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:20387) but impact (very common and never referring to this gene) ought to be a stopword.
  • If there is a grounding entry that matches an override in a case-insensitive way, the override is always applied. Example where this is a problem: pLK or rSK1, which we ought to apply case-sensitive overrides to, to correct for the fact that they get grounded as if they were PLK and RSK1, respectively.

(All this is true for grounding in Reach in general where the priority order between grounding files defines which grounding is chosen rather than which entry's case-sensitive match is closer to the string. But I am not necessarily sure that that should be changed.)
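To make the two bullets above concrete, here is a toy model of case-insensitive matching. It is only a sketch and not Reach's actual code: the dictionaries, the placeholder labels, and the function name `ground_case_insensitive` are all made up for illustration.

```python
# Toy model of the behavior described above; this is not Reach's actual code.
# All values are placeholder labels, not real database identifiers.
KB = {"m2": "KB entry for M2", "plk": "KB entry for PLK"}
STOPWORDS = {"m2"}                       # meant to block only lowercase "m2"
OVERRIDES = {"plk": "override for pLK"}  # meant to apply only to "pLK"


def ground_case_insensitive(text):
    """Look up a mention after lowercasing it, ignoring the original casing."""
    key = text.lower()
    if key in OVERRIDES:
        # The override is applied for every capitalization of the mention.
        return OVERRIDES[key]
    if key in KB:
        # A grounding entry wins even if the same key is also a stopword.
        return KB[key]
    if key in STOPWORDS:
        # Only reached when no grounding entry shares the lowercased key.
        return None
    return None


print(ground_case_insensitive("m2"))   # KB entry for M2 (stopword is ignored)
print(ground_case_insensitive("PLK"))  # override for pLK (applied to PLK too)
```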

So the question is, would it be straightforward to change this behavior?

@MihaiSurdeanu
Contributor

This would take some engineering to fix...
We made this decision early on, because people write gene/protein names inconsistently. Sometimes they are capitalized, sometimes they are not. So, in addition to the engineering work, I am afraid that this change may cause us to lose some valid entities...

@bgyori
Contributor Author

bgyori commented Apr 24, 2020

What I am suggesting is a bit more specific than making everything case sensitive. Namely, this would only apply in cases where there is ambiguity between two choices, i.e., a grounding entry and either a stopword or an explicit override with different capitalization. If there is no ambiguity with either of these entry types then case-insensitive matching would continue to apply.
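A minimal sketch of this idea, under stated assumptions: entries keep their original casing, candidates are still collected case-insensitively, and the exact-case match is only consulted when a grounding entry competes with a stopword or override that is capitalized differently. The names `build_index` and `ground`, the priority order, the placeholder labels, and the fallback for mentions that match no candidate exactly (for example `Plk`) are assumptions for illustration, not something decided in this thread.

```python
from collections import defaultdict

# Placeholder entries; only Q9P2X3 (for IMPACT) comes from this thread.
STOPWORDS = ["m2", "impact"]
OVERRIDES = {"pLK": "override for pLK", "rSK1": "override for rSK1"}
KB = {"M2": "KB entry for M2", "IMPACT": "Q9P2X3",
      "PLK": "KB entry for PLK", "RSK1": "KB entry for RSK1",
      "AKT1": "KB entry for AKT1"}

# Assumed priority among candidates of different kinds (lower wins).
PRIORITY = {"stopword": 0, "override": 1, "kb": 2}


def build_index(stopwords, overrides, kb):
    """Map each lowercased key to its (kind, original surface, target) candidates."""
    index = defaultdict(list)
    for word in stopwords:
        index[word.lower()].append(("stopword", word, None))
    for surface, target in overrides.items():
        index[surface.lower()].append(("override", surface, target))
    for surface, target in kb.items():
        index[surface.lower()].append(("kb", surface, target))
    return index


def ground(text, index):
    """Return a grounding target, or None if the mention should be dropped."""
    candidates = index.get(text.lower(), [])
    if not candidates:
        return None
    kb_surfaces = {s for kind, s, _ in candidates if kind == "kb"}
    other_surfaces = {s for kind, s, _ in candidates if kind != "kb"}
    # Ambiguity: a grounding entry and a stopword/override differ only in case.
    ambiguous = any(a != b for a in kb_surfaces for b in other_surfaces)
    if ambiguous:
        exact = [c for c in candidates if c[1] == text]
        # Assumption: with no exact-case match, fall back to the KB entry.
        pool = exact or [c for c in candidates if c[0] == "kb"]
    else:
        # No ambiguity: plain case-insensitive matching still applies.
        pool = candidates
    kind, _, target = min(pool, key=lambda c: PRIORITY[c[0]])
    return None if kind == "stopword" else target


index = build_index(STOPWORDS, OVERRIDES, KB)
for mention in ["m2", "M2", "pLK", "PLK", "akt1"]:
    print(mention, "->", ground(mention, index))
```

With these toy entries, `m2` is dropped while `M2` is grounded, `pLK` gets the override while `PLK` keeps its own entry, and `akt1` is still matched case-insensitively because nothing competes with it.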

@MihaiSurdeanu
Contributor

I see. Let me think about this.

@MihaiSurdeanu
Contributor

@bgyori: for the stop words matching, maybe we can improve the logic here?
https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85

Note that this already handles "impact" vs. "IMPACT" correctly. That is, the former is not extracted, while the latter is marked as a GGP and grounded to Q9P2X3.

@bgyori
Contributor Author

bgyori commented Apr 27, 2020

> @bgyori: for the stop words matching, maybe we can improve the logic here?
> https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/BioNERPostProcessor.scala#L85
>
> Note that this already handles "impact" vs. "IMPACT" correctly. That is, the former is not extracted, while the latter is marked as a GGP and grounded to Q9P2X3.

I see, you're right about impact, so that wasn't a valid example in my original comment. Maybe a better example for ignores is II, which should be ignored when written in all caps, whereas Ii is a valid protein synonym. As for overrides, there is no special logic like there is for stopwords, and the override is applied irrespective of capitalization, right?

@MihaiSurdeanu
Contributor

Let's discuss stop words first, since they may be simpler to address.
However, I am struggling to find a general solution for handling stop words. We found in the past that capitalization is a strong indicator that we are looking at valid protein names... I can think of maybe two solutions:

  1. A simple fix: we currently remove stop words that have an uppercase initial. We can refine this and remove them only if they follow punctuation. If they do not, they are probably valid names.
  2. We could come up with two stop word lists: (a) one that is case sensitive, and (b) one that is not.
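Both options above can be sketched roughly as follows. This is a hypothetical illustration, not the logic in `BioNERPostProcessor.scala`: the word lists, the function names, and the choice to treat a sentence-initial token like one that follows punctuation are all assumptions.

```python
import string

# Option 1: a stop word with an uppercase initial is removed only when it
# follows punctuation (or, as an assumption here, starts the sentence).
STOPWORDS_LOWER = {"impact", "fig"}


def is_stopword_mention(tokens, i):
    """Decide whether tokens[i] should be discarded as a stop word."""
    token = tokens[i]
    if token.lower() not in STOPWORDS_LOWER:
        return False
    if token.islower():
        return True
    if token[0].isupper() and token[1:].islower():
        # Capitalized form: likely just sentence casing after punctuation.
        return i == 0 or all(ch in string.punctuation for ch in tokens[i - 1])
    # Other casings (for example all caps) are kept as possible names.
    return False


# Option 2: two stop word lists, one matched exactly and one ignoring case.
CASE_SENSITIVE_STOPWORDS = {"II", "m2"}
CASE_INSENSITIVE_STOPWORDS = {"the", "fig"}


def is_stopword(token):
    """Check the exact-case list first, then the case-insensitive list."""
    return (token in CASE_SENSITIVE_STOPWORDS
            or token.lower() in CASE_INSENSITIVE_STOPWORDS)


sentence = ["Impact", "of", "IMPACT", "on", "translation", "."]
print([is_stopword_mention(sentence, i) for i in (0, 2)])
print([is_stopword(t) for t in ["II", "Ii", "m2", "M2", "The"]])
```

Option 2 keeps the current case-insensitive behavior for ordinary stop words and only requires listing the exceptions, such as II versus Ii, in the case-sensitive list.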
