Statistics and data splitting scheme on PPI datasets #234

Open
sduzxj opened this issue Nov 9, 2023 · 2 comments

sduzxj commented Nov 9, 2023

Using the following code, I get different statistics from those reported in the documentation [1, 2]. How can I obtain the preprocessed dataset, or the code for the dataset splitting scheme used in [2]?
Code:
import torch
import torchdrug
from torchdrug import core, data, datasets, layers, models, tasks, utils
from torchdrug import transforms as T
from torchdrug.datasets import Fold, HumanPPI, PPIAffinity, YeastPPI

# Load YeastPPI lazily, without any transform
dataset = YeastPPI("./dataset/PPI", lazy=True)
train_set, valid_set, test_set = dataset.split(["train", "valid", "test"])

# Number of samples in each split
print(len(train_set))
print(len(valid_set))
print(len(test_set))
Output statistics (train, valid, test): 2421, 203, 326

[1] https://torchdrug.ai/docs/api/datasets.html
[2] https://torchprotein.ai/benchmark#leaderboard-for-yeast-ppi-prediction

@mahdip72

I have the same issue with the subcellular localization dataset. The numbers of samples in the training, validation and test sets differ from those in the PEER paper. What is the problem? Is there an additional post-processing step we need to apply to the dataset?

@mahdip72

@sduzxj

I ran your code and got different numbers:
4945, 95, 394
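
For reference, here is a minimal sketch that puts the two sets of numbers reported in this thread side by side. `summarize_splits` is a hypothetical helper, not part of torchdrug, and it only uses the sizes printed above; it makes no claim about which run matches the benchmark.

```python
from typing import Dict, Sequence

SPLIT_NAMES = ("train", "valid", "test")


def summarize_splits(sizes: Sequence[int]) -> Dict[str, object]:
    """Return the total size and per-split fractions for (train, valid, test)."""
    if len(sizes) != len(SPLIT_NAMES):
        raise ValueError(f"expected {len(SPLIT_NAMES)} sizes, got {len(sizes)}")
    total = sum(sizes)
    if total == 0:
        raise ValueError("all splits are empty")
    fractions = {name: size / total for name, size in zip(SPLIT_NAMES, sizes)}
    return {"total": total, "fractions": fractions}


# YeastPPI split sizes as reported in this thread (train, valid, test)
reported = {
    "sduzxj": (2421, 203, 326),
    "mahdip72": (4945, 95, 394),
}

summaries = {who: summarize_splits(sizes) for who, sizes in reported.items()}

for who, summary in summaries.items():
    fracs = ", ".join(
        f"{name}={frac:.3f}" for name, frac in summary["fractions"].items()
    )
    print(f"{who}: total={summary['total']} ({fracs})")
```

The totals differ (2950 vs 5434), so the two runs did not load the same number of samples; the difference is not only in where the split boundaries fall.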
