What is the best way to split? #1107

Answered by noamzbr
set92 asked this question in Q&A

In most use cases with timestamps, the best way to do a train-test split is to ensure that every test sample has a timestamp later than the latest timestamp in the training samples. Similarly, when the dataset contains user IDs, the best way to split is to ensure that no user ID in the test data also appears in the training data. That way we can be sure that predictive performance on the test data really reflects the model's ability to generalize, rather than memorization of some user-specific or time-specific pattern in the training data. Further down the line, when we'll test our model's inference on some "real world" data, it won't…
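
To make the two rules above concrete, here is a minimal sketch in plain Python. It is not taken from the answer: the function names (`time_based_split`, `group_based_split`), the column names (`ts`, `user_id`), and the 80/20 ratio are all assumptions chosen for illustration.

```python
import random


def time_based_split(rows, time_key, test_fraction=0.2):
    """Put the earliest rows in train and strictly later rows in test."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    # Index of the first row we would like to send to the test set.
    cut = max(1, int(len(ordered) * (1 - test_fraction)))
    cutoff = ordered[cut - 1][time_key]  # latest timestamp allowed in train
    # Split on the cutoff value (not the index) so that rows sharing the
    # cutoff timestamp never end up on both sides.
    train = [r for r in ordered if r[time_key] <= cutoff]
    test = [r for r in ordered if r[time_key] > cutoff]
    return train, test


def group_based_split(rows, group_key, test_fraction=0.2, seed=0):
    """Hold out whole groups (e.g. users) so none appears in both sets."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)  # seeded for reproducibility
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Hypothetical toy data: 20 rows, 5 users, integer timestamps.
rows = [{"user_id": f"u{i % 5}", "ts": i, "label": i % 2} for i in range(20)]

time_train, time_test = time_based_split(rows, "ts")
user_train, user_test = group_based_split(rows, "user_id")
```

If scikit-learn is already in use, its `TimeSeriesSplit` and `GroupShuffleSplit` splitters cover similar ground and are probably a better choice than hand-rolled helpers like these.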

Answer selected by shir22