Add link to PharmKG8k about alternate version of data #1276

Open
skingi20 opened this issue May 26, 2023 · 1 comment
Labels
blocked:submitter (Waiting on clarification or input from the submitter), documentation (Improvements or additions to documentation)

Comments

@skingi20

Both the PharmKG8k dataset that you use and the one that the authors of the PharmKG8k paper point to have significant issues.

We think that we have solved them. The issues are that the original split contains data leakage and that it is inductive rather than transductive.
More details are available in this GitHub repository:
https://github.com/skingi20/improvement_of_PharmKG8k_split
The repository contains the new split that we propose, which has no data leakage and is transductive, as well as proof of the problems with the original PharmKG8k split.
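
For readers who want to check a split for these two properties themselves, here is a minimal sketch in plain Python. It is not the code from the linked repository and does not use the PyKEEN API. The helper names (`entities`, `unseen_entities`, `leaked_triples`) and the toy triples are hypothetical. "Leakage" is assumed here to mean a test triple that also appears in training, either verbatim or with head and tail swapped under the same relation; the repository may use a different definition.

```python
from typing import Iterable, Set, Tuple

# A triple is (head, relation, tail); labels are plain strings in this sketch.
Triple = Tuple[str, str, str]


def entities(triples: Iterable[Triple]) -> Set[str]:
    """Collect every entity that appears as a head or a tail."""
    found: Set[str] = set()
    for head, _relation, tail in triples:
        found.add(head)
        found.add(tail)
    return found


def unseen_entities(train: Iterable[Triple], test: Iterable[Triple]) -> Set[str]:
    """Entities in test that never occur in train.

    A non-empty result means the split is not transductive.
    """
    return entities(test) - entities(train)


def leaked_triples(train: Iterable[Triple], test: Iterable[Triple]) -> Set[Triple]:
    """Test triples found in train verbatim or with head and tail swapped.

    Assumption: the swapped form counts as leakage. This is only
    meaningful for relations that are symmetric in practice.
    """
    train_set = set(train)
    return {
        (h, r, t)
        for h, r, t in test
        if (h, r, t) in train_set or (t, r, h) in train_set
    }


# Toy data, invented for illustration only.
train = [("aspirin", "treats", "pain"), ("geneA", "interacts", "geneB")]
test = [
    ("aspirin", "treats", "pain"),     # exact duplicate of a training triple
    ("geneB", "interacts", "geneA"),   # reversed copy of a training triple
    ("geneC", "interacts", "geneA"),   # geneC never appears in training
]

print(sorted(unseen_entities(train, test)))  # ['geneC']
print(sorted(leaked_triples(train, test)))
```

A split that is transductive and leakage-free under these assumptions returns an empty set from both functions.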

@cthoyt
Member

cthoyt commented Jun 12, 2023

I'm a domain expert in biology and chemistry. I'm not very happy with any of the "benchmark" biomedical knowledge graphs we have added in PyKEEN. Despite that, they're included for the convenience of people who aren't so skilled with data acquisition.

I would suggest not investing any time in PharmKG or its variants, since it is one of the worse ones in terms of the content included, the questionable semantics, and the fact that its corresponding paper doesn't explain much about how it was made (plus there is no code available).

If you want to send a PR that adds a warning to the docs for PharmKG, saying that there are issues and that people should consider your alternative, then we could consider that. But making more variants of a questionable resource probably won't help PyKEEN users in the long run.

@cthoyt cthoyt added the documentation Improvements or additions to documentation label Jun 26, 2023
@cthoyt cthoyt changed the title from "The PharmKG8k dataset split contains data leakage and is an inductive split, we propose a better split." to "Add link to PharmKG8k about alternate version of data" Jun 26, 2023
@cthoyt cthoyt added the blocked:submitter Waiting on clarification or input from the submitter label Jun 26, 2023