contamination / information leakage between training and testing data #2

Open
makesourcenotcode opened this issue Dec 12, 2023 · 0 comments

Comments


makesourcenotcode commented Dec 12, 2023

I notice that both README.md and train.php make the same mistake:

ZScaleStandardizer is applied BEFORE the train/test split rather than AFTER. This leaks information from the test set into the training data right from the start and calls into question all the metrics reported at the end.

The correct approach would be to fit ZScaleStandardizer on the training set only, capturing its parameters there, and then apply the already-fitted transformer to the testing set before using the trained model to make predictions.
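
For illustration, here is a minimal sketch of how that could look with Rubix ML. The toy dataset is hypothetical (standing in for whatever train.php actually builds), and it assumes the documented behavior of `Dataset::apply()`, which fits a Stateful transformer only if it has not been fitted yet and otherwise just transforms:

```php
<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\ZScaleStandardizer;

// Hypothetical toy dataset standing in for the one built in train.php.
$dataset = new Labeled(
    [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.1], [6.0, 2.8], [5.0, 3.4]],
    ['a', 'a', 'b', 'b', 'b', 'a'],
);

// Split FIRST, so the test set never influences the scaling parameters.
[$training, $testing] = $dataset->randomize()->split(0.5);

// Fit the standardizer on the training set only. apply() fits a Stateful
// transformer the first time it sees a dataset, then transforms in place.
$standardizer = new ZScaleStandardizer();
$training->apply($standardizer);

// The transformer is already fitted, so this reuses the training means and
// standard deviations to transform the test samples without leaking
// test-set statistics.
$testing->apply($standardizer);
```

Alternatively, wrapping the transformer and estimator in a Pipeline meta-estimator should achieve the same effect, since Pipeline fits its transformers during training and reapplies them at prediction time.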

This can mislead newer users studying machine learning into bad habits that will later need to be unlearned.
