[BUG] Chapter 2, CV splits are not random as opposed to what is written #105

Open
eranr opened this issue Nov 8, 2023 · 0 comments

eranr commented Nov 8, 2023

Notebook name: 02_end_to_end_machine_learning_project
Section 4.2 “Better Evaluation Using Cross-Validation”, cell 140
Book Chapter 2, subsection "Better Evaluation Using Cross-Validation".
According to the book, the following code randomly splits the training set:

from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

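According to the documentation of cross_val_score (version 1.3.2), passing an integer as the cv argument makes it use KFold internally (or StratifiedKFold when the estimator is a classifier and the target is binary or multiclass), instantiated with shuffle=False. The folds are therefore consecutive blocks of rows in their original order, not random splits. Perhaps stating the obvious: to get randomization, one could pass a CV splitter instance, e.g. as below: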

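from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# preprocessing, housing and housing_labels are defined earlier in the notebook
forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
# An explicit splitter with shuffle=True randomizes the fold assignment
cv = KFold(n_splits=10, shuffle=True, random_state=42)
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels,
                                scoring="neg_root_mean_squared_error", cv=cv)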
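
To make the difference visible without the housing data, here is a minimal sketch that prints the test indices each splitter produces. It assumes scikit-learn and NumPy are installed; the toy array X and the choice of 5 folds are made up for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples standing in for the rows of a training set (hypothetical data).
X = np.arange(10).reshape(-1, 1)

# KFold with its default shuffle=False, which is what an integer cv
# resolves to for a regressor according to the cross_val_score docs.
unshuffled = KFold(n_splits=5)
unshuffled_folds = [test_idx.tolist() for _, test_idx in unshuffled.split(X)]

# Explicit splitter with shuffling enabled and a fixed seed for repeatability.
shuffled = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_folds = [test_idx.tolist() for _, test_idx in shuffled.split(X)]

print("shuffle=False:", unshuffled_folds)  # consecutive blocks: [0, 1], [2, 3], ...
print("shuffle=True: ", shuffled_folds)    # rows drawn from across the dataset
```

Without shuffling, each test fold is a contiguous slice of the data, so any ordering in the rows (for example, sorted by region or date) ends up baked into the folds.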
@eranr eranr changed the title [BUG] [BUG] Chapter 2, CV splits are not random as opposed to what is written Nov 8, 2023