SemiSupervisedDataSplitter not properly separating unlabelled cells into train/val/test #2729

Open
ethanweinberger opened this issue Apr 15, 2024 · 1 comment
@ethanweinberger
Contributor

ethanweinberger commented Apr 15, 2024

When splitting a dataset into train/validation/test sets, I would expect the union of the train/validation/test samples to equal the full dataset, and each data point to appear exactly once, in exactly one of the three splits. However, for splits produced by the SemiSupervisedDataSplitter class with shuffling turned on, this is not the case: some unlabelled points are excluded from all splits (i.e., they belong to none of train/val/test), while others are repeated multiple times.
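For anyone who wants to check their own splits, here is a minimal sketch of the invariant described above. `check_partition` is a hypothetical helper written for illustration, not part of scvi-tools, and it assumes the three splits are available as arrays of integer indices into a dataset with `n_obs` observations.

```python
import numpy as np


def check_partition(train_idx, val_idx, test_idx, n_obs):
    """Return (missing, duplicated) index arrays for a train/val/test split.

    Hypothetical helper, not part of scvi-tools.
    missing: indices in range(n_obs) that appear in none of the splits.
    duplicated: indices that appear more than once across all splits.
    """
    all_idx = np.concatenate(
        [np.asarray(train_idx), np.asarray(val_idx), np.asarray(test_idx)]
    ).astype(int)
    # Count how many times each observation index occurs across the splits.
    counts = np.bincount(all_idx, minlength=n_obs)
    missing = np.flatnonzero(counts == 0)
    duplicated = np.flatnonzero(counts > 1)
    return missing, duplicated


# A valid partition of 6 observations: nothing missing, nothing duplicated.
missing, duplicated = check_partition([0, 1, 2], [3, 4], [5], n_obs=6)

# A broken split: index 1 is repeated and index 2 is never used.
bad_missing, bad_duplicated = check_partition([0, 1, 1], [3, 4], [5], n_obs=6)
```

A correct splitter should give two empty arrays; anything else means the splits do not partition the dataset.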

I believe the issue is that sampling is performed with replacement here when shuffling the unlabelled points, so some unlabelled points are drawn more than once while others are never drawn. For labelled points, on the other hand, sampling is correctly done without replacement.
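To illustrate the difference in isolation, here is a small NumPy sketch. It is not the scvi-tools code; `unlabelled_idx` and the seed are made up for the example, and it only assumes the shuffle is done with something like `RandomState.choice`, whose `replace` argument defaults to `True`.

```python
import numpy as np

rs = np.random.RandomState(seed=0)
unlabelled_idx = np.arange(1000)  # hypothetical unlabelled cell indices

# choice() samples WITH replacement by default (replace=True), so this is
# not a shuffle: some indices repeat and others are never drawn.
with_replacement = rs.choice(unlabelled_idx, len(unlabelled_idx))

# Either of these gives a true shuffle in which every index appears once.
without_replacement = rs.choice(
    unlabelled_idx, len(unlabelled_idx), replace=False
)
permuted = rs.permutation(unlabelled_idx)

n_unique_with = np.unique(with_replacement).size
n_unique_without = np.unique(without_replacement).size
n_unique_permuted = np.unique(permuted).size
```

When drawing n items with replacement from n candidates, the expected fraction of distinct items is roughly 1 - 1/e (about 63%) for large n, so a sizable share of the unlabelled points would be left out of every split.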

I've included a colab notebook here reproducing the issue. Happy to take this on myself since the fix should be straightforward.

Versions:

scvi-tools 1.1.2

@martinkim0
Contributor

Hmm, thanks for pointing this out; this is an unfortunate issue. Refactoring the data splitter classes has been on our roadmap, so I can take a look at this and fix it.
