I'm working on a churn prediction project. I have data for all users from August 2021 to January 2022, and the same dataset from September 2021 to February 2022. I train a model on the first dataset (data up to January), and then the second dataset (data up to February) is the test set I'm trying to predict and get close to. Is this thinking right? Or should I use the typical train_test_split function on the data up to January? But that way I can't check for data drift, right? Also, doing it my way means that some users (the churned ones) appear in both datasets, and the full suite tells me that's wrong. I suppose it depends on the case, and in mine it isn't wrong, right? Or should I remove those users from the test dataset?
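To make it concrete, here is roughly what I'm doing (a minimal sketch; the file names, the `churned` column, and the model are placeholders, not my real code):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder file and column names -- a sketch of my setup.
train = pd.read_csv("users_2021-08_to_2022-01.csv")  # first snapshot
test = pd.read_csv("users_2021-09_to_2022-02.csv")   # second snapshot

# Train on the earlier snapshot, evaluate on the later one.
model = RandomForestClassifier().fit(
    train.drop(columns="churned"), train["churned"]
)
preds = model.predict(test.drop(columns="churned"))
```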
Replies: 1 comment
In most use cases with timestamps, the best way to do a train-test split is to ensure that all test samples have timestamps later than the latest timestamp in the training samples. Similarly, when the dataset contains user IDs, the best split ensures that no user ID from the test data also appears in the training data. That way we can be sure that predictive performance on the test data really represents the model's ability to generalize, rather than memorization of some user-specific or time-specific pattern in the training data.

Further down the line, when the model runs inference on real-world data, it won't be able to use information from the future, so when testing it now we want to make sure that such leakage does not exist. For that reason, our checks (IndexTrainTestLeakage and DateTrainTestLeakageOverlap) return a failed condition in such cases. We recommend fixing these issues, but if in your particular use case this is not a problem, you may remove the checks or their conditions from your suite.
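For example, a minimal sketch of such a split with pandas (`timestamp` and `user_id` are placeholder column names; adapt them to your schema):

```python
import pandas as pd

# Placeholder file and column names -- adjust to your data.
df = pd.read_csv("users.csv", parse_dates=["timestamp"])

# Time-based split: every test row is strictly later than every train row.
cutoff = pd.Timestamp("2022-01-31")
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]

# User-based split: drop test rows whose user_id also appears in training,
# so performance isn't inflated by users the model has already seen.
test = test[~test["user_id"].isin(train["user_id"])]
```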
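And if you decide the overlap is acceptable in your case, you can drop the check, or just its condition, from the suite. A sketch, assuming a recent deepchecks version; the indices shown are illustrative, so print the suite first to find the real ones:

```python
from deepchecks.tabular.suites import full_suite

suite = full_suite()
print(suite)  # lists every check in the suite with its index

# Remove a whole check by its printed index (12 is illustrative):
suite.remove(12)

# Or keep the check but strip its failing condition:
suite[13].clean_conditions()
```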