When splitting a dataset into train/validation/test sets, I would expect the union of the train/validation/test samples to equal the full dataset, and each data point to appear exactly once, in exactly one of the three splits. However, splits produced by the SemiSupervisedDataSplitter class with shuffling turned on violate both properties: some unlabeled points are excluded from all splits (i.e., they belong to none of train/val/test), while others are repeated multiple times.
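The two expectations above (full coverage, no repeats) can be checked mechanically. The following is a minimal sketch, not part of scvi-tools: `partition_report` is a hypothetical helper, and it assumes the train/validation/test indices have already been extracted from the splitter as integer arrays of positions into a dataset with `n_obs` observations.

```python
import numpy as np


def partition_report(n_obs, train_idx, val_idx, test_idx):
    """Report which of the n_obs indices are missing from or repeated across splits.

    Hypothetical helper for illustration; the index arrays are assumed to hold
    integer positions in the range [0, n_obs).
    """
    all_idx = np.concatenate([train_idx, val_idx, test_idx]).astype(int)
    # counts[i] is the number of times index i appears across all three splits.
    counts = np.bincount(all_idx, minlength=n_obs)
    return {
        "missing": np.flatnonzero(counts == 0),    # in no split at all
        "duplicated": np.flatnonzero(counts > 1),  # appears more than once
    }


# A valid partition of 6 points: nothing missing, nothing duplicated.
ok = partition_report(6, [0, 1, 2], [3], [4, 5])

# A broken split: index 1 appears twice and index 2 is never assigned.
broken = partition_report(6, [0, 1, 1], [3], [4, 5])
```

For a correct splitter both arrays in the report should be empty; the behavior described in this issue would show up as non-empty `missing` and `duplicated` entries.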
I believe the issue is that sampling is performed with replacement here when shuffling the unlabeled points, so some unlabeled points are drawn more than once while others are never drawn. For labeled points, on the other hand, sampling is correctly done without replacement.
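To illustrate the difference in isolation, here is a small NumPy sketch. It does not reproduce the scvi-tools code. It assumes the shuffle is done with something like `RandomState.choice`, whose `replace` argument defaults to `True`, and the function names and split sizes are made up for the example.

```python
import numpy as np


def shuffle_with_replacement(indices, seed=0):
    # choice() samples WITH replacement by default (replace=True), so the
    # result is generally not a permutation of the input.
    rng = np.random.RandomState(seed)
    return rng.choice(indices, len(indices))


def shuffle_without_replacement(indices, seed=0):
    # permutation() returns every input element exactly once, reordered.
    rng = np.random.RandomState(seed)
    return rng.permutation(indices)


def split(shuffled, n_train, n_val):
    # Slice a shuffled index array into train / validation / test parts.
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


n = 1000
idx = np.arange(n)

bad = shuffle_with_replacement(idx)
good = shuffle_without_replacement(idx)

n_unique_bad = len(np.unique(bad))    # fewer than n: some points are dropped
n_unique_good = len(np.unique(good))  # exactly n: every point kept once

train, val, test = split(good, n_train=800, n_val=100)
```

Slicing `bad` instead of `good` yields splits in which some indices repeat and others never appear, which matches the symptom described above. Passing `replace=False` to `choice` would be an equivalent fix to using `permutation`.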
I've included a colab notebook here reproducing the issue. Happy to take this on myself since the fix should be straightforward.
Versions:
scvi-tools 1.1.2
Hmm, thanks for pointing this out. This is an unfortunate issue. Refactoring the data splitter classes has been on our roadmap, so I can take a look at this and fix it.