Ensure more creative results are shown in our results #2157

Open
kvnthomas98 opened this issue Oct 5, 2023 · 12 comments
@kvnthomas98
Collaborator

Currently Lookup Results may dominate the results.
If we have a creative query, we need to ensure that creative results don't get filtered out.

Suggestions proposed by @dkoslicki:
i) Manually place creative results on top.
ii) Interleave creative results between lookup results.

@saramsey
Member

Slotting this on the agenda for tomorrow's AHM. Good topic for group discussion.

@saramsey
Member

saramsey commented Oct 24, 2023

I am intrigued by the "interleave" idea. While xDTD is awesome, I have concerns about a modification that would put creative mode results always (and only) at the top; I can expound on that tomorrow at the meeting.

@dkoslicki
Member

Thoughts from the AHM:

Different possible approaches include:

  1. (easy; but xDTD could be on later pages of the UI) Rank the lookups, take the xDTD results, take n from the former and m from the latter where n+m < 500 (or whatever the specified cutoff is)
  2. (harder, but proper) Get rid of the noise/cruft in the lookup results. There is no way there are >500 drugs that treat a given disease. I suspect this is due to SemMedDB. A naive approach would be to impose "we never return more than N lookups", where N is small (e.g., 25). A nuanced approach would be to see what's causing the explosion of lookup results and downrank them in the ranker.
  3. (not ideal, but fast) Just interleave the results and check with the UI team if this will make the UI show the creative results
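
For illustration only, approaches 1 and 3 might be sketched as below. This assumes each result list is already ranked best-first; the function names `take_n_m` and `interleave` and the `cutoff` parameter are hypothetical and are not part of the ARAX codebase.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that falsy results are not dropped


def take_n_m(lookup_ranked, creative_ranked, n, m):
    """Approach 1: top n lookup results followed by top m creative results."""
    return lookup_ranked[:n] + creative_ranked[:m]


def interleave(creative_ranked, lookup_ranked, cutoff=500):
    """Approach 3: alternate creative and lookup results, creative first.

    When one list runs out, the remainder of the other is appended in order.
    The merged list is truncated to `cutoff` entries.
    """
    merged = []
    for creative, lookup in zip_longest(creative_ranked, lookup_ranked,
                                        fillvalue=_MISSING):
        if creative is not _MISSING:
            merged.append(creative)
        if lookup is not _MISSING:
            merged.append(lookup)
    return merged[:cutoff]
```

With interleaving, a creative result at rank k lands at roughly position 2k, so it stays on the early pages as long as the cutoff is applied after merging rather than before.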

@dkoslicki
Member

Eg. of a bazillion lookups: https://arax.ncats.io/?r=174764
Any common(ish) disease will do

@saramsey
Member

saramsey commented Oct 30, 2023

@dkoslicki As a test (and since there was a TRAPI query for it in #2187) I ran "what drugs treat multiple sclerosis" through ARAX (the arax.ncats.io/beta endpoint) using knowledge_type="inferred". I got 500 results. The first 50 results look pretty reasonable, with a minority of experimental/investigational treatments in there (vitamin D, epigallocatechin, cannabidiol, estriol, ibudilast, melatonin, biotin, etc.). Below the first 50, we start to get some really broad categories like "Antibodies" or "Interferons" or "Vaccines", or "Vitamins" or "immunomodulators". We also start to get some puzzling results like "ethylene glycol" (which may reflect text-mining getting confused by text about PEGylation of some other therapeutic agent). Below the first 150 results, we do start to see increased frequency of crazy stuff like "caffeine", "fish oils", "ketamine", "nicotine", "tadalafil", and so forth.

I think there are four things driving such a large number of lookup results:

  1. Insufficient canonicalization. I see a lot of essentially the same results that are repeated with slightly different names. Like "glatiramer" and "glatiramer acetate", that kind of thing.
  2. We have general terms like "Vaccines", "Antibodies", "Cannabinoids", "Immunoglobulins", etc. that are cluttering up the results
  3. We have a lot of drugs that are on the list because they are being intensively studied for efficacy in MS. Theoretically, if we were to filter to get only the drugs that are marketed (i.e., indicated) for MS (and I'm not suggesting we do that in practice), we'd see the result list length drop by probably 8X to 10X.
  4. Drugs that are used to treat comorbidities of MS, but that do not really treat MS itself (e.g., tadalafil or what have you).

I think our scores are, overall, a bit too high for the drugs that are not indicated for MS (e.g., the investigational treatments). For the drugs that are indicated for MS, the scores are fine.

Our scores are way too high for the overly general stuff like "Vaccines" and stuff like that. Ideally, those should be either filtered out or have their score reduced due to the concepts' generality. I know we've talked about this a lot, I guess I'm just echoing the feeling here that it would be good if we weren't seeing "antibodies" and "vaccines" and "vitamins" in the results.
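
A rough sketch of the "reduce the score due to generality" option, purely as an illustration: it assumes results are `(name, score)` pairs and that a hand-maintained set of overly general terms exists. The set `GENERAL_TERMS`, the function `downrank_general`, and the `penalty` value are all hypothetical; this is not how the ARAX ranker works today.

```python
# Hypothetical list of overly general concepts, taken from the examples above.
GENERAL_TERMS = frozenset({
    "vaccines", "antibodies", "vitamins", "interferons",
    "immunomodulators", "cannabinoids", "immunoglobulins",
})


def downrank_general(results, penalty=0.5, general_terms=GENERAL_TERMS):
    """Multiply the score of overly general concepts by `penalty`, then re-sort.

    `results` is an iterable of (name, score) pairs. Names are matched
    case-insensitively against `general_terms`.
    """
    adjusted = [
        (name, score * penalty if name.lower() in general_terms else score)
        for name, score in results
    ]
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

Setting `penalty=0` would amount to pushing these terms to the bottom; dropping them outright would be a filter instead of a re-rank.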

@saramsey
Member

saramsey commented Oct 30, 2023

So in conclusion, I concur, there really aren't 500 different treatments for M.S. But there are probably at least 60-70 that are used to manage M.S. (remember it's a complex multi-faceted disease for which there is AFAIK no cure), plus another 100 to 150 being actively investigated.

@dkoslicki
Member

@saramsey do we have a KP or edge property that we can use explicitly for "indicated for"? IIRC, when we ask for treats edges, KPs don't distinguish between investigational and indicated uses. Perhaps there's something in KG2 we could use to cross-check?

@saramsey
Member

saramsey commented Oct 30, 2023

@dkoslicki I am not sure. It is a problem that the biolink:treats predicate is being used for investigational/experimental therapies like vitamin D. I think the Biolink people and the Predicates WG people are working on "refactoring" the biolink:treats predicate to allow more precise statements for such cases.

In the meantime, I like the idea of trying to pull in that information from somewhere. I am not sure about where we could get it, though. I guess if someone were to go through all 500 results and label them as "indicated", "investigational", and "neither" (this would take an afternoon though!), we could try to find which sources are contributing to the "indicated" vs. "investigational". I suspect there will be a bias towards certain sources.
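
If someone did produce those labels, the per-source tally could be as simple as the sketch below. It assumes each labeled result is a `(label, sources)` pair, where `sources` lists the knowledge sources that contributed edges to that result; the function name `sources_by_label` and this data shape are hypothetical.

```python
from collections import Counter, defaultdict


def sources_by_label(labeled_results):
    """Count how often each knowledge source contributes to each label.

    `labeled_results` is an iterable of (label, sources) pairs, e.g.
    ("indicated", ["source_a", "source_b"]). Returns {label: Counter}.
    """
    counts = defaultdict(Counter)
    for label, sources in labeled_results:
        counts[label].update(sources)
    return counts
```

Comparing `counts["indicated"]` with `counts["investigational"]` would then show whether some sources lean heavily towards one label.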

@dkoslicki
Member

https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files perhaps? It doesn't cover biologics and the like, though.

@amykglen
Member

while I'm not aware of something like indicated_for edges that capture which drugs are FDA approved to treat which conditions (that seems very useful), we do have the ability to constrain queries on FDA approval status (#1599, which makes use of KG2 data (#1497))... it wouldn't let us filter down the result set to drugs approved specifically for MS, but maybe it would at least get rid of general terms and drugs not yet approved for anything?

@dkoslicki
Member

Ah, I wasn't aware of that. Should be a good first pass, so @kvnthomas98 please do make note of Amy's comment once you start working on this.

@saramsey
Member

saramsey commented Nov 1, 2023

Thank you @amykglen, good suggestion


4 participants