FIX Notebooks not updated by make notebooks (INRIA#743)
ArturoAmorQ committed Oct 27, 2023
1 parent 008cff4 commit 5cc989e
Showing 10 changed files with 2,191 additions and 808 deletions.
194 changes: 151 additions & 43 deletions notebooks/linear_models_ex_03.ipynb
@@ -6,25 +6,36 @@
"source": [
"# \ud83d\udcdd Exercise M4.03\n",
"\n",
"The parameter `penalty` can control the **type** of regularization to use,\n",
"whereas the regularization **strength** is set using the parameter `C`.\n",
"Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n",
"this exercise, we ask you to train a logistic regression classifier using the\n",
"`penalty=\"l2\"` regularization (which happens to be the default in\n",
"scikit-learn) to find by yourself the effect of the parameter `C`.\n",
"Now, we tackle a more realistic classification problem instead of making a\n",
"synthetic dataset. We start by loading the Adult Census dataset with the\n",
"following snippet. For the moment we retain only the **numerical features**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"We start by loading the dataset."
"adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
"target = adult_census[\"class\"]\n",
"data = adult_census.select_dtypes([\"integer\", \"floating\"])\n",
"data = data.drop(columns=[\"education-num\"])\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"admonition note alert alert-info\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
"<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
"Appendix - Datasets description section at the end of this MOOC.</p>\n",
"</div>"
"We confirm that all the selected features are numerical.\n",
"\n",
"Compute the generalization performance in terms of accuracy of a linear model\n",
"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
"cross-validation with `return_estimator=True` to be able to inspect the\n",
"trained estimators."
]
},
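The evaluation requested above can be sketched as follows — a minimal, self-contained example where a tiny synthetic frame stands in for the census numerical features (so the printed score is not the exercise's answer):

```python
# Sketch of the requested 10-fold evaluation with return_estimator=True.
# The synthetic frame below is a stand-in for the Adult Census features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data = pd.DataFrame({
    "age": rng.randint(18, 90, size=200),
    "capital-gain": rng.randint(0, 10_000, size=200),
    "hours-per-week": rng.randint(1, 80, size=200),
})
target = pd.Series(rng.choice(["<=50K", ">50K"], size=200), name="class")

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results = cross_validate(model, data, target, cv=10, return_estimator=True)
print(
    f"Accuracy: {cv_results['test_score'].mean():.3f} "
    f"± {cv_results['test_score'].std():.3f}"
)
```

With `return_estimator=True`, `cv_results["estimator"]` holds the 10 fitted pipelines, one per fold, ready for inspection.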
{
@@ -33,16 +44,17 @@
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
"# only keep the Adelie and Chinstrap classes\n",
"penguins = (\n",
" penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
")\n",
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the most important feature seen by the logistic regression?\n",
"\n",
"culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
"target_column = \"Species\""
"You can use a boxplot to compare the absolute values of the coefficients while\n",
"also visualizing the variability induced by the cross-validation resampling."
]
},
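One way to approach this question: stack the fitted coefficients from the estimators returned by `cross_validate`, then boxplot their absolute values. A hedged sketch on toy data (the feature names are stand-ins, and the last pipeline step is assumed to be the `LogisticRegression`):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["age", "capital-gain", "hours-per-week"]
)
# Toy target mostly driven by the first column.
target = (data["age"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results = cross_validate(model, data, target, cv=10, return_estimator=True)

# One row per CV fold, one column per feature; coef_ lives in the last step.
coefs = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results["estimator"]], columns=data.columns
)
coefs.abs().plot.box(vert=False)
plt.xlabel("absolute coefficient value")
plt.tight_layout()
```

The spread of each box shows how much the coefficient varies across the cross-validation resamplings.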
{
@@ -51,22 +63,15 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n",
"\n",
"data_train = penguins_train[culmen_columns]\n",
"data_test = penguins_test[culmen_columns]\n",
"\n",
"target_train = penguins_train[target_column]\n",
"target_test = penguins_test[target_column]"
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's create our predictive model."
"Let's now work with **both numerical and categorical features**. You can\n",
"reload the Adult Census dataset with the following snippet:"
]
},
{
@@ -75,23 +80,42 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
"target = adult_census[\"class\"]\n",
"data = adult_census.drop(columns=[\"class\", \"education-num\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a predictive model where:\n",
"- The numerical data must be scaled.\n",
"- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
" group categories concerning less than 1% of the total samples.\n",
"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
" of `max_iter`, which is 100 by default.\n",
"\n",
"logistic_regression = make_pipeline(\n",
" StandardScaler(), LogisticRegression(penalty=\"l2\")\n",
")"
"Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
"above to evaluate this complex pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the following candidates for the `C` parameter, find out the impact of\n",
"`C` on the classifier decision boundary. You can use\n",
"`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n",
"decision function boundary."
"By comparing the cross-validation test scores of both models fold-to-fold,\n",
"count the number of times the model using both numerical and categorical\n",
"features has a better test score than the model using only numerical features."
]
},
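The comparison works because both models are evaluated with the same CV splitter on the same data, so their test scores align fold by fold. A sketch with hypothetical placeholder scores:

```python
# Count the folds on which the full-feature model beats the numerical-only
# model. The two score arrays below are hypothetical placeholders.
import numpy as np

scores_numerical = np.array(
    [0.80, 0.79, 0.81, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81, 0.80]
)
scores_all_features = np.array(
    [0.85, 0.84, 0.80, 0.86, 0.83, 0.85, 0.84, 0.85, 0.86, 0.84]
)

n_better = (scores_all_features > scores_numerical).sum()
print(f"{n_better} of {len(scores_numerical)} folds favour the full-feature model")
```

In practice the arrays come from each model's `cv_results["test_score"]`.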
{
@@ -100,16 +124,100 @@
"metadata": {},
"outputs": [],
"source": [
"Cs = [0.01, 0.1, 1, 10]\n",
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the following questions, you can copy adn paste the following snippet to\n",
"get the feature names from the column transformer here named `preprocessor`.\n",
"\n",
"```python\n",
"preprocessor.fit(data)\n",
"feature_names = (\n",
" preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n",
" categorical_columns\n",
" )\n",
").tolist()\n",
"feature_names += numerical_columns\n",
"feature_names\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that there are as many feature names as coefficients in the last step\n",
"of your predictive pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which of the following pairs of features is most impacting the predictions of\n",
"the logistic regression classifier based on the absolute magnitude of its\n",
"coefficients?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the impact of the `C` hyperparameter on the magnitude of the weights."
"Now create a similar pipeline consisting of the same preprocessor as above,\n",
"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
"with the intercept of the subsequent logistic regression."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By comparing the cross-validation test scores of both models fold-to-fold,\n",
"count the number of times the model using multiplicative interactions and both\n",
"numerical and categorical features has a better test score than the model\n",
"without interactions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
