Epoprostenol used to treat rats #2243

Open
edeutsch opened this issue Mar 1, 2024 · 13 comments

Comments

@edeutsch
Collaborator

edeutsch commented Mar 1, 2024

I was assigned this issue by TAQA:
NCATSTranslator/Feedback#707

Apparently xDTD was trained with KG2-SemMedDB that asserts that Epoprostenol is used to treat rats. And there are lots of papers describing treatment of rats with Epoprostenol. But apparently this is not an appreciated answer.

It is unclear to me whether we just want to remove such SemMedDB edges in KG2,

or whether the xDTD training data can be refined to exclude Drug-treats-X edges where X is a species,

or whether this problem goes away on its own with the upcoming KG2 "treats" refactor (where I assume we should make an effort to ensure that ideas like:
Drug X was used to attempt to treat disease Y in species Z
are NOT encoded as:
Drug X treats species Z).

Anyone have ideas on how to handle the TAQA issue?

@edeutsch
Collaborator Author

edeutsch commented Mar 1, 2024

Bill stated it more elegantly than I did. Do we/can we employ domain and range constraints to avoid this kind of thing:
NCATSTranslator/Feedback#707 (comment)
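
A domain/range check of the kind raised here could look roughly like the sketch below. It is only an illustration: the constraint table, the function name `violates_domain_range`, and the specific category sets are assumptions made for this example, not the actual KG2 or Biolink definitions.

```python
# Hypothetical constraint table: predicate -> (allowed subject categories, allowed object categories).
# The entries are illustrative assumptions, not the real KG2 or Biolink constraints.
CONSTRAINTS = {
    "biolink:treats": (
        {"biolink:Drug", "biolink:SmallMolecule"},
        {"biolink:Disease", "biolink:PhenotypicFeature"},
    ),
}


def violates_domain_range(subject_category, predicate, object_category):
    """Return True if the edge breaks a known constraint; unknown predicates pass."""
    if predicate not in CONSTRAINTS:
        return False
    domain, range_ = CONSTRAINTS[predicate]
    return subject_category not in domain or object_category not in range_


# "Drug treats species" is flagged, "Drug treats disease" is not.
drug_treats_species = violates_domain_range("biolink:Drug", "biolink:treats", "biolink:OrganismTaxon")
drug_treats_disease = violates_domain_range("biolink:Drug", "biolink:treats", "biolink:Disease")
print(drug_treats_species, drug_treats_disease)
```

Under these assumed constraints, an edge such as "Epoprostenol treats rats" would be flagged because its object is a taxon rather than a disease or phenotype.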

@amykglen
Member

amykglen commented Mar 4, 2024

The KG2 API does actually filter out edges that violate such domain/range specifications, but they're still in the underlying KG2c graph, which xDTD is trained on (I think). Maybe those edges should be excluded from the graph used for training? They're easily identifiable by the domain_range_exclusion property. (There are 3.8 million such edges in KG2c - about 8% of the total edges.)
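
As a rough sketch of what excluding those edges from a training graph might involve, the snippet below splits edge records on the `domain_range_exclusion` property. The function name `split_by_exclusion` and the toy edges are invented for illustration; this is not the xDTD training code, and the identifiers are placeholders rather than real KG2c records.

```python
def split_by_exclusion(edges):
    """Partition edge dicts into (kept, excluded) using domain_range_exclusion."""
    kept, excluded = [], []
    for edge in edges:
        flag = edge.get("domain_range_exclusion")
        # Assumption: the flag is either the string "True" or the boolean True when set.
        if flag in ("True", True):
            excluded.append(edge)
        else:
            kept.append(edge)
    return kept, excluded


# Toy edges with placeholder identifiers.
toy_edges = [
    {"subject": "drug:1", "predicate": "biolink:treats", "object": "taxon:1", "domain_range_exclusion": "True"},
    {"subject": "drug:1", "predicate": "biolink:treats", "object": "disease:1", "domain_range_exclusion": "False"},
    {"subject": "drug:2", "predicate": "biolink:treats", "object": "disease:2", "domain_range_exclusion": "False"},
    {"subject": "gene:1", "predicate": "biolink:related_to", "object": "disease:1", "domain_range_exclusion": "False"},
]

kept, excluded = split_by_exclusion(toy_edges)
share = len(excluded) / len(toy_edges)
print(f"{len(excluded)} of {len(toy_edges)} toy edges excluded ({share:.0%})")
```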

@saramsey
Member

Do we need a fix for this in the Lobster release? Hoping the answer is no, and that we can instead aim to fix this in the Octopus release?

@saramsey
Member

I'm not sure that I am informed enough to have an opinion about whether or not we should include edges with domain_range_exclusion set to True (i.e., excluded edges) in the graph used for training xDTD. But it seems like we should (somehow) ensure that ARAX isn't returning results for which the key edge basis is an excluded edge. I'm fine with the idea of adding a filter for this, if that is what people feel is best. @dkoslicki @chunyuma @amykglen what do you think?

@chunyuma
Collaborator

Hi @edeutsch and @saramsey, I think both solutions (1. use filtered KG to train xDTD; 2. add a filter to the xDTD outputs) work for this issue. However, I will say option 2 will be easier and more flexible considering the long training time of xDTD. For option 1, are we sure that the edges with domain_range_exclusion=True include all edges that we would like to be excluded for training? Or are they just a subset of them? If the domain_range_exclusion=True includes all, then we can exclude those edges in training.

@amykglen
Member

amykglen commented Mar 12, 2024

> @amykglen what do you think?

Adding a filter seems fine to me - and I take back my statement that those edges should be removed from the training dataset specifically, ha - I don't know enough about xDTD to know whether that would make sense. But I agree with Steve that at least the results that ARAXInfer returns shouldn't include domain_range_exclusion=True edges, however it makes sense to achieve that.

> For option 1, are we sure that the edges with domain_range_exclusion=True include all edges that we would like to be excluded for training? Or are they just a subset of them?

I think @saramsey or @sundareswarpullela or @acevedol know more about this than I do, but from what I can tell, only SemMedDB edges are marked as domain_range_exclusion=True (where appropriate). However, I'm guessing that SemMedDB is the main 'problem' source for edges with invalid domain/range anyway, so maybe that is sufficient?

@dkoslicki
Member

@chunyuma since it takes so long to re-train xDTD, what about the following path forward:

  1. Add the filter to the xDTD output
  2. As time permits, update the xDTD training code to exclude such edges. No need to do a full re-build until a new version of KG2 warrants it.
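
Step 1 above could be sketched as a post-processing filter on predictions. The record layout (`path_edge_ids`), the function name `filter_predictions`, and the sample values are all hypothetical; the real xDTD output format is not shown in this thread.

```python
def filter_predictions(predictions, excluded_edge_ids):
    """Drop any predicted result whose supporting path uses an excluded edge."""
    return [
        prediction
        for prediction in predictions
        if not any(edge_id in excluded_edge_ids for edge_id in prediction["path_edge_ids"])
    ]


# Hypothetical inputs: IDs of edges flagged with domain_range_exclusion, and
# predictions that each list the edge IDs of their explanation path.
excluded_edge_ids = {"e2"}
predictions = [
    {"drug": "drug:A", "target": "disease:X", "path_edge_ids": ["e1", "e3"]},
    {"drug": "drug:A", "target": "taxon:Y", "path_edge_ids": ["e2"]},
]

filtered = filter_predictions(predictions, excluded_edge_ids)
print([prediction["target"] for prediction in filtered])
```

Filtering at output time leaves the trained model untouched, which matches the idea of deferring any retraining until a new KG2 version warrants it.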

@chunyuma
Collaborator

Sure, I can add a filter to the xDTD output. Where can I find the edge attribute domain_range_exclusion? I can't find it in the edges_c.tsv file of KG v2.8.4.

@amykglen
Member

Huh, that's weird. I see it in my copy of KG2.8.4c:

ubuntu@ip-172-31-48-160:~/plater-plover$ cat edges_c_header.tsv 
subject	object	predicate	primary_knowledge_source	publications:string[]	publications_info	kg2_ids:string[]	qualified_predicate	qualified_object_aspect	qualified_object_direction	domain_range_exclusion	id	:TYPE	:START_ID	:END_ID

Also note that currently the values for domain_range_exclusion are strings ("True" or "False"), though eventually they will be switched to actual booleans (see #2185). So you might want to set up your code to handle either strings or booleans.
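
One way to tolerate both representations is sketched below. The helper names `is_excluded` and `read_kept_edges` are made up for this example, the sample header is shortened, and the rows are toy data. The sketch assumes, as with `edges_c_header.tsv` above, that column names live in a separate header line.

```python
import csv
import io


def is_excluded(value):
    """Interpret domain_range_exclusion whether it arrives as a bool or as a string."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def read_kept_edges(header_line, body_text):
    """Parse tab-separated edge rows and keep those not flagged for exclusion."""
    fieldnames = header_line.rstrip("\n").split("\t")
    reader = csv.DictReader(
        io.StringIO(body_text),
        fieldnames=fieldnames,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in values as plain text
    )
    return [row for row in reader if not is_excluded(row["domain_range_exclusion"])]


# Shortened header and toy rows, for illustration only.
header = "subject\tobject\tpredicate\tdomain_range_exclusion\tid"
body = (
    "drug:1\ttaxon:1\tbiolink:treats\tTrue\t1\n"
    "drug:1\tdisease:1\tbiolink:treats\tFalse\t2\n"
)

kept_rows = read_kept_edges(header, body)
print([row["id"] for row in kept_rows])
```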

@chunyuma
Collaborator

Thanks @amykglen! I will check it again.

@chunyuma
Collaborator

Hi team,

I have updated the xDTD database for KG2.8.4 to exclude all edges with domain_range_exclusion==True, which should resolve this issue. I ran test_ARAX_infer.py but hit the error reported in issue #2252.

@amykglen
Member

hey @chunyuma - I just responded in #2252 about the error you're seeing

@chunyuma
Collaborator

Thanks @amykglen. Now the updated xDTD database has passed the Infer tests. I think we can verify this solution for this issue after deployment.
