contamination / information leakage between training and testing data #2

Open
makesourcenotcode opened this issue Dec 12, 2023 · 0 comments

Comments


makesourcenotcode commented Dec 12, 2023

I notice that both README.md and train.php make the same mistake:

ZScaleStandardizer is applied BEFORE the train/test split rather than AFTER. This leaks information from the test set into the training data right from the start and calls into question all the metrics reported at the end.

The correct approach would be to fit ZScaleStandardizer on the training set only, capturing its parameters there, and then apply the already-fitted transformer to the testing set before using the trained model to make predictions.
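
For illustration, here is a minimal sketch of how that could look with Rubix ML. The toy dataset is hypothetical (standing in for whatever train.php actually builds), and it assumes the documented behavior of `Dataset::apply()`, which fits a Stateful transformer only if it has not been fitted yet and otherwise just transforms:

```php
<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\ZScaleStandardizer;

// Hypothetical toy dataset standing in for the one built in train.php.
$dataset = new Labeled(
    [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.1], [6.0, 2.8], [5.0, 3.4]],
    ['a', 'a', 'b', 'b', 'b', 'a'],
);

// Split FIRST, so the test set never influences the scaling parameters.
[$training, $testing] = $dataset->randomize()->split(0.5);

// Fit the standardizer on the training set only. apply() fits a Stateful
// transformer the first time it sees a dataset, then transforms in place.
$standardizer = new ZScaleStandardizer();
$training->apply($standardizer);

// The transformer is already fitted, so this reuses the training means and
// standard deviations to transform the test samples without leaking
// test-set statistics.
$testing->apply($standardizer);
```

Alternatively, wrapping the transformer and estimator in a Pipeline meta-estimator should achieve the same effect, since Pipeline fits its transformers during training and reapplies them at prediction time.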

This can mislead newer users studying machine learning into bad habits that will later need to be unlearned.
