
[IDEA] Chapter 2, Add code demonstrating HalvingRandomSearchCV #106

Open
eranr opened this issue Nov 8, 2023 · 0 comments
eranr commented Nov 8, 2023

Notebook name: 02_end_to_end_machine_learning_project
Section 5.2 “Randomized Search”
Cell 152

This is the first cell in the section, and it contains only the HalvingRandomSearchCV import. It seems the cell is out of place: it should contain actual code that uses the HalvingRandomSearchCV class. How about adding the following two cells after the RandomizedSearchCV cells:

from sklearn.experimental import enable_halving_search_cv  # noqa: enables the experimental estimator
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint

param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),
                  'random_forest__max_features': randint(low=2, high=20)}

h_rnd_search = HalvingRandomSearchCV(
    full_pipeline, param_distributions=param_distribs, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

h_rnd_search.fit(housing, housing_labels)

# Same result-display code as in the RandomizedSearchCV cells, plus dropna()
# to discard candidates whose fit failed (their scores are NaN):
cv_res = pd.DataFrame(h_rnd_search.cv_results_).dropna()
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
cv_res.columns = ["n_clusters", "max_features"] + score_cols  # score_cols is defined earlier in the notebook
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

A couple of notes:

  • Running the first cell generates a lot of warnings, because successive halving reduces the training set size in the early iterations, and the reduced set may be too small for the candidate being tested. One example I ran into was inside the KMeans fit function, where the requested number of clusters exceeded the number of training samples (a possible mitigation is sketched after these notes).

  • The second cell is identical to the previous cells that display the search results, except for the added dropna(). Whenever fitting a candidate fails as described above, the associated scores in the results are NaN, which would make the attempt to round the numerical results fail.
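If the warnings are a concern, one way to reduce them is to raise the search's min_resources (a real HalvingRandomSearchCV parameter) so that even the first halving iteration trains on enough samples for the largest n_clusters candidate. The sketch below is only an illustration of that idea; the value 500 is an arbitrary assumption, not something from the notebook:

# A minimal sketch, assuming the same full_pipeline / param_distribs as above.
# min_resources sets the number of samples used in the first halving
# iteration; the default ('smallest') can be just a handful of samples,
# far fewer than the up-to-49 clusters sampled for KMeans.
h_rnd_search = HalvingRandomSearchCV(
    full_pipeline, param_distributions=param_distribs, cv=3,
    min_resources=500,  # hypothetical floor, large enough for n_clusters < 50
    scoring='neg_root_mean_squared_error', random_state=42)
h_rnd_search.fit(housing, housing_labels)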
