From 5cc989e98fb8251b2f3e7df08c783ff9f2e592e4 Mon Sep 17 00:00:00 2001 From: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com> Date: Fri, 27 Oct 2023 11:56:47 +0200 Subject: [PATCH] FIX Notebooks not updated by `make notebooks` (#743) --- notebooks/linear_models_ex_03.ipynb | 194 +++-- notebooks/linear_models_ex_04.ipynb | 244 +++++++ ...s_feature_engineering_classification.ipynb | 682 ++++++++++++++++++ notebooks/linear_models_sol_03.ipynb | 501 +++++++------ notebooks/linear_models_sol_04.ipynb | 395 ++++++++++ .../linear_regression_non_linear_link.ipynb | 312 ++++---- notebooks/logistic_regression.ipynb | 259 ++++++- .../logistic_regression_non_linear.ipynb | 327 --------- notebooks/trees_ex_01.ipynb | 48 +- python_scripts/trees_ex_01.py | 37 +- 10 files changed, 2191 insertions(+), 808 deletions(-) create mode 100644 notebooks/linear_models_ex_04.ipynb create mode 100644 notebooks/linear_models_feature_engineering_classification.ipynb create mode 100644 notebooks/linear_models_sol_04.ipynb delete mode 100644 notebooks/logistic_regression_non_linear.ipynb diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb index 36b516f3c..7ada01f07 100644 --- a/notebooks/linear_models_ex_03.ipynb +++ b/notebooks/linear_models_ex_03.ipynb @@ -6,25 +6,36 @@ "source": [ "# \ud83d\udcdd Exercise M4.03\n", "\n", - "The parameter `penalty` can control the **type** of regularization to use,\n", - "whereas the regularization **strength** is set using the parameter `C`.\n", - "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n", - "this exercise, we ask you to train a logistic regression classifier using the\n", - "`penalty=\"l2\"` regularization (which happens to be the default in\n", - "scikit-learn) to find by yourself the effect of the parameter `C`.\n", + "Now, we tackle a more realistic classification problem instead of making a\n", + "synthetic dataset. We start by loading the Adult Census dataset with the\n", + "following snippet. For the moment we retain only the **numerical features**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", "\n", - "We start by loading the dataset." + "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", + "target = adult_census[\"class\"]\n", + "data = adult_census.select_dtypes([\"integer\", \"floating\"])\n", + "data = data.drop(columns=[\"education-num\"])\n", + "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "
<div class=\"admonition note alert alert-info\">\n", - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", - "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.</p>\n", - "</div>
" + "We confirm that all the selected features are numerical.\n", + "\n", + "Compute the generalization performance in terms of accuracy of a linear model\n", + "composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n", + "cross-validation with `return_estimator=True` to be able to inspect the\n", + "trained estimators." ] }, { @@ -33,16 +44,17 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", - "\n", - "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "# only keep the Adelie and Chinstrap classes\n", - "penguins = (\n", - " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", - ")\n", + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the most important feature seen by the logistic regression?\n", "\n", - "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", - "target_column = \"Species\"" + "You can use a boxplot to compare the absolute values of the coefficients while\n", + "also visualizing the variability induced by the cross-validation resampling." ] }, { @@ -51,22 +63,15 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n", - "\n", - "data_train = penguins_train[culmen_columns]\n", - "data_test = penguins_test[culmen_columns]\n", - "\n", - "target_train = penguins_train[target_column]\n", - "target_test = penguins_test[target_column]" + "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's create our predictive model." + "Let's now work with **both numerical and categorical features**. You can\n", + "reload the Adult Census dataset with the following snippet:" ] }, { @@ -75,23 +80,42 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.linear_model import LogisticRegression\n", + "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", + "target = adult_census[\"class\"]\n", + "data = adult_census.drop(columns=[\"class\", \"education-num\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a predictive model where:\n", + "- The numerical data must be scaled.\n", + "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n", + " group categories concerning less than 1% of the total samples.\n", + "- The predictor is a `LogisticRegression`. You may need to increase the number\n", + " of `max_iter`, which is 100 by default.\n", "\n", - "logistic_regression = make_pipeline(\n", - " StandardScaler(), LogisticRegression(penalty=\"l2\")\n", - ")" + "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n", + "above to evaluate this complex pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Given the following candidates for the `C` parameter, find out the impact of\n", - "`C` on the classifier decision boundary. You can use\n", - "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n", - "decision function boundary." 
+ "By comparing the cross-validation test scores of both models fold-to-fold,\n", + "count the number of times the model using both numerical and categorical\n", + "features has a better test score than the model using only numerical features." ] }, { @@ -100,8 +124,60 @@ "metadata": {}, "outputs": [], "source": [ - "Cs = [0.01, 0.1, 1, 10]\n", + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the following questions, you can copy adn paste the following snippet to\n", + "get the feature names from the column transformer here named `preprocessor`.\n", "\n", + "```python\n", + "preprocessor.fit(data)\n", + "feature_names = (\n", + " preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n", + " categorical_columns\n", + " )\n", + ").tolist()\n", + "feature_names += numerical_columns\n", + "feature_names\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that there are as many feature names as coefficients in the last step\n", + "of your predictive pipeline." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Which of the following pairs of features is most impacting the predictions of\n", + "the logistic regression classifier based on the absolute magnitude of its\n", + "coefficients?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# Write your code here." ] }, @@ -109,7 +185,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Look at the impact of the `C` hyperparameter on the magnitude of the weights." + "Now create a similar pipeline consisting of the same preprocessor as above,\n", + "followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n", + "Set `degree=2` and `interaction_only=True` to the feature engineering step.\n", + "Remember not to include a \"bias\" feature to avoid introducing a redundancy\n", + "with the intercept of the subsequent logistic regression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By comparing the cross-validation test scores of both models fold-to-fold,\n", + "count the number of times the model using multiplicative interactions and both\n", + "numerical and categorical features has a better test score than the model\n", + "without interactions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
] }, { diff --git a/notebooks/linear_models_ex_04.ipynb b/notebooks/linear_models_ex_04.ipynb new file mode 100644 index 000000000..5d40693d7 --- /dev/null +++ b/notebooks/linear_models_ex_04.ipynb @@ -0,0 +1,244 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2 + }, + "source": [ + "# \ud83d\udcdd Exercise M4.04\n", + "\n", + "In the previous Module we tuned the hyperparameter `C` of the logistic\n", + "regression without mentioning that it controls the regularization strength.\n", + "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n", + "metioned that a small `C` provides a more regularized model, whereas a\n", + "non-regularized model is obtained with an infinitely large value of `C`.\n", + "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n", + "model.\n", + "\n", + "In this exercise, we ask you to train a logistic regression classifier using\n", + "different values of the parameter `C` to find its effects by yourself.\n", + "\n", + "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n", + "to keep the discussion simple." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
<div class=\"admonition note alert alert-info\">\n", + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", + "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.</p>\n", + "</div>
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", + "penguins = (\n", + " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", + ")\n", + "\n", + "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", + "target_column = \"Species\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "penguins_train, penguins_test = train_test_split(\n", + " penguins, random_state=0, test_size=0.4\n", + ")\n", + "\n", + "data_train = penguins_train[culmen_columns]\n", + "data_test = penguins_test[culmen_columns]\n", + "\n", + "target_train = penguins_train[target_column]\n", + "target_test = penguins_test[target_column]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We define a function to help us fit a given `model` and plot its decision\n", + "boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging\n", + "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n", + "to the white color. Equivalently, the darker the color, the closer the\n", + "predicted probability is to 0 or 1 and the more confident the classifier is in\n", + "its predictions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.inspection import DecisionBoundaryDisplay\n", + "\n", + "\n", + "def plot_decision_boundary(model):\n", + " model.fit(data_train, target_train)\n", + " accuracy = model.score(data_test, target_test)\n", + " C = model.get_params()[\"logisticregression__C\"]\n", + "\n", + " disp = DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"pcolormesh\",\n", + " cmap=\"RdBu_r\",\n", + " alpha=0.8,\n", + " vmin=0.0,\n", + " vmax=1.0,\n", + " )\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"contour\",\n", + " linestyles=\"--\",\n", + " linewidths=1,\n", + " alpha=0.8,\n", + " levels=[0.5],\n", + " ax=disp.ax_,\n", + " )\n", + " sns.scatterplot(\n", + " data=penguins_train,\n", + " x=culmen_columns[0],\n", + " y=culmen_columns[1],\n", + " hue=target_column,\n", + " palette=[\"tab:blue\", \"tab:red\"],\n", + " ax=disp.ax_,\n", + " )\n", + " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", + " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now create our predictive model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Influence of the parameter `C` on the decision boundary\n", + "\n", + "Given the following candidates for the `C` parameter and the\n", + "`plot_decision_boundary` function, find out the impact of `C` on the\n", + "classifier's decision boundary.\n", + "\n", + "- How does the value of `C` impact the confidence on the predictions?\n", + "- How does it impact the underfit/overfit trade-off?\n", + "- How does it impact the position and orientation of the decision boundary?\n", + "\n", + "Try to give an interpretation on the reason for such behavior." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n", + "\n", + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Impact of the regularization on the weights\n", + "\n", + "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n", + "**Hint**: You can [access pipeline\n", + "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", + "by name or position. Then you can query the attributes of that step such as\n", + "`coef_`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Impact of the regularization on with non-linear feature engineering\n", + "\n", + "Use the `plot_decision_boundary` function to repeat the experiment using a\n", + "non-linear feature engineering pipeline. For such purpose, insert\n", + "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n", + "`StandardScaler` and the `LogisticRegression` steps.\n", + "\n", + "- Does the value of `C` still impact the position of the decision boundary and\n", + " the confidence of the model?\n", + "- What can you say about the impact of `C` on the underfitting vs overfitting\n", + " trade-off?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "# Write your code here." 
+ ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/notebooks/linear_models_feature_engineering_classification.ipynb b/notebooks/linear_models_feature_engineering_classification.ipynb new file mode 100644 index 000000000..87544be19 --- /dev/null +++ b/notebooks/linear_models_feature_engineering_classification.ipynb @@ -0,0 +1,682 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# Non-linear feature engineering for Logistic Regression\n", + "\n", + "In the slides at the beginning of the module we mentioned that linear\n", + "classification models are not suited to non-linearly separable data.\n", + "Nevertheless, one can still use feature engineering as previously done for\n", + "regression models to overcome this issue. To do so, we use non-linear\n", + "transformations that typically map the original feature space into a higher\n", + "dimension space, where the linear model can separate the data more easily.\n", + "\n", + "Let us illustrate this on three synthetic datasets. Each dataset has two\n", + "original features and two classes to make it easy to visualize. The first\n", + "dataset is called the \"moons\" dataset as the data points from each class are\n", + "shaped as a crescent moon:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.datasets import make_moons\n", + "\n", + "feature_names = [\"Feature #0\", \"Feature #1\"]\n", + "target_name = \"class\"\n", + "\n", + "X, y = make_moons(n_samples=100, noise=0.13, random_state=42)\n", + "\n", + "# We store both the data and target in a dataframe to ease plotting\n", + "moons = pd.DataFrame(\n", + " np.concatenate([X, y[:, np.newaxis]], axis=1),\n", + " columns=feature_names + [target_name],\n", + ")\n", + "data_moons, target_moons = moons[feature_names], moons[target_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The second dataset is called the \"Gaussian quantiles\" dataset as all data\n", + "points are sampled from a 2D Gaussian distribution regardless of the class.\n", + "The points closest to the center are assigned to the class 1 while the points\n", + "in the outer edges are assigned to the class 0, resulting in concentric\n", + "circles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import make_gaussian_quantiles\n", + "\n", + "X, y = make_gaussian_quantiles(\n", + " n_samples=100, n_features=2, n_classes=2, random_state=42\n", + ")\n", + "gauss = pd.DataFrame(\n", + " np.concatenate([X, y[:, np.newaxis]], axis=1),\n", + " columns=feature_names + [target_name],\n", + ")\n", + "data_gauss, target_gauss = gauss[feature_names], gauss[target_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The third dataset is called the \"XOR\" dataset as the data points are sampled\n", + "from a uniform distribution in a 2D space and the class is defined by the\n", + "Exclusive OR (XOR) operation on the two features: the target class is 1 if\n", + "only one of the two features is greater than 0. The target class is 0\n", + "otherwise." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xor = pd.DataFrame(\n", + " np.random.RandomState(0).uniform(low=-1, high=1, size=(200, 2)),\n", + " columns=feature_names,\n", + ")\n", + "target_xor = np.logical_xor(xor[\"Feature #0\"] > 0, xor[\"Feature #1\"] > 0)\n", + "target_xor = target_xor.astype(np.int32)\n", + "xor[\"class\"] = target_xor\n", + "data_xor = xor[feature_names]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We use matplotlib to visualize all the datasets at a glance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 2 + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "from matplotlib.colors import ListedColormap\n", + "\n", + "\n", + "_, axs = plt.subplots(ncols=3, figsize=(14, 4), constrained_layout=True)\n", + "\n", + "common_scatter_plot_params = dict(\n", + " cmap=ListedColormap([\"tab:red\", \"tab:blue\"]),\n", + " edgecolor=\"white\",\n", + " linewidth=1,\n", + ")\n", + "\n", + "axs[0].scatter(\n", + " data_moons[feature_names[0]],\n", + " data_moons[feature_names[1]],\n", + " c=target_moons,\n", + " **common_scatter_plot_params,\n", + ")\n", + "axs[1].scatter(\n", + " data_gauss[feature_names[0]],\n", + " data_gauss[feature_names[1]],\n", + " c=target_gauss,\n", + " **common_scatter_plot_params,\n", + ")\n", + "axs[2].scatter(\n", + " data_xor[feature_names[0]],\n", + " data_xor[feature_names[1]],\n", + " c=target_xor,\n", + " **common_scatter_plot_params,\n", + ")\n", + "axs[0].set(\n", + " title=\"The moons dataset\",\n", + " xlabel=feature_names[0],\n", + " ylabel=feature_names[1],\n", + ")\n", + "axs[1].set(\n", + " title=\"The Gaussian quantiles dataset\",\n", + " xlabel=feature_names[0],\n", + ")\n", + "axs[2].set(\n", + " title=\"The XOR dataset\",\n", + " xlabel=feature_names[0],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We intuitively observe that there is no (single) straight line that can\n", + "separate the two classes in any of the datasets. 
We can confirm this by\n", + "fitting a linear model, such as a logistic regression, to each dataset and\n", + "plot the decision boundary of the model.\n", + "\n", + "Let's first define a function to help us fit a given model and plot its\n", + "decision boundary on the previous datasets at a glance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.inspection import DecisionBoundaryDisplay\n", + "\n", + "\n", + "def plot_decision_boundary(model, title=None):\n", + " datasets = [\n", + " (data_moons, target_moons),\n", + " (data_gauss, target_gauss),\n", + " (data_xor, target_xor),\n", + " ]\n", + " fig, axs = plt.subplots(\n", + " ncols=3,\n", + " figsize=(14, 4),\n", + " constrained_layout=True,\n", + " )\n", + "\n", + " for i, ax, (data, target) in zip(\n", + " range(len(datasets)),\n", + " axs,\n", + " datasets,\n", + " ):\n", + " model.fit(data, target)\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"pcolormesh\",\n", + " cmap=\"RdBu\",\n", + " alpha=0.8,\n", + " # Setting vmin and vmax to the extreme values of the probability to\n", + " # ensure that 0.5 is mapped to white (the middle) of the blue-red\n", + " # colormap.\n", + " vmin=0,\n", + " vmax=1,\n", + " ax=ax,\n", + " )\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"contour\",\n", + " alpha=0.8,\n", + " levels=[0.5], # 0.5 probability contour line\n", + " linestyles=\"--\",\n", + " linewidths=2,\n", + " ax=ax,\n", + " )\n", + " ax.scatter(\n", + " data[feature_names[0]],\n", + " data[feature_names[1]],\n", + " c=target,\n", + " **common_scatter_plot_params,\n", + " )\n", + " if i > 0:\n", + " ax.set_ylabel(None)\n", + " if title is not None:\n", + " fig.suptitle(title)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Now let's define our logistic regression model and plot its decision boundary\n", + "on the three datasets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\n", + "logistic_regression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(logistic_regression, title=\"Linear classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This confirms that it is not possible to separate the two classes with a\n", + "linear model. On each plot we see a **significant number of misclassified\n", + "samples on the training set**! The three plots show typical cases of\n", + "**underfitting** for linear models.\n", + "\n", + "Also, the last two plots show soft colors, meaning that the model is highly\n", + "unsure about which class to choose." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Engineering non-linear features\n", + "\n", + "As we did for the linear regression models, we now attempt to build a more\n", + "expressive machine learning pipeline by leveraging non-linear feature\n", + "engineering, with techniques such as binning, splines, polynomial features,\n", + "and kernel approximation.\n", + "\n", + "Let's start with the binning transformation of the features:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import KBinsDiscretizer\n", + "\n", + "classifier = make_pipeline(\n", + " KBinsDiscretizer(n_bins=5, encode=\"onehot\"), # already the default params\n", + " LogisticRegression(),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Binning classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can see that the resulting decision boundary is constrained to follow\n", + "**axis-aligned segments**, which is very similar to what a decision tree would\n", + "do as we will see in the next Module. Furthermore, as for decision trees, the\n", + "model makes piecewise constant predictions within each rectangular region.\n", + "\n", + "This axis-aligned decision boundary is not necessarily the natural decision\n", + "boundary a human would have intuitively drawn for the moons dataset and the\n", + "Gaussian quantiles datasets. It still makes it possible for the model to\n", + "successfully separate the data. However, binning alone does not help the\n", + "classifier separate the data for the XOR dataset. This is because **the\n", + "binning transformation is a feature-wise transformation** and thus **cannot\n", + "capture interactions** between features that are necessary to separate the\n", + "XOR dataset.\n", + "\n", + "Let's now consider a **spline** transformation of the original features. This\n", + "transformation can be considered a **smooth version of the binning\n", + "transformation**. You can find more details in the [scikit-learn user guide](\n", + "https://scikit-learn.org/stable/modules/preprocessing.html#spline-transformer)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import SplineTransformer\n", + "\n", + "classifier = make_pipeline(\n", + " SplineTransformer(degree=3, n_knots=5),\n", + " LogisticRegression(),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Spline classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can see that the decision boundary is now smooth, and while it favors\n", + "axis-aligned decision rules when extrapolating in low density regions, it can\n", + "adopt a more curvy decision boundary in the high density regions.\n", + "However, as for the binning transformation, the model still fails to separate\n", + "the data for the XOR dataset, irrespective of the number of knots, for the\n", + "same reasons: **the spline transformation is a feature-wise transformation**\n", + "and thus **cannot capture interactions** between features.\n", + "\n", + "Take into account that the number of knots is a hyperparameter that needs to be\n", + "tuned. If we use too few knots, the model would underfit the data, as shown on\n", + "the moons dataset. If we use too many knots, the model would overfit the data.\n", + "\n", + "
<div class=\"admonition note alert alert-info\">\n", + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", + "<p class=\"last\">Notice that KBinsDiscretizer(encode=\"onehot\") and SplineTransformer do not\n", + "require additional scaling. Indeed, they can replace the scaling step for\n", + "numerical features: they both create features with values in the [0, 1] range.</p>\n", + "</div>
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Modeling non-additive feature interactions\n", + "\n", + "We now consider feature engineering techniques that non-linearly combine the\n", + "original features in the hope of capturing interactions between them. We will\n", + "consider polynomial features and kernel approximation.\n", + "\n", + "Let's start with the polynomial features:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import PolynomialFeatures\n", + "\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " PolynomialFeatures(degree=3, include_bias=False),\n", + " LogisticRegression(C=10),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Polynomial classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can see that the decision boundary of this polynomial classifier is\n", + "**smooth** and can successfully separate the data on all three datasets\n", + "(depending on how we set the values of the `degree` and `C`\n", + "hyperparameters).\n", + "\n", + "It is interesting to observe that this models extrapolates very differently\n", + "from the previous models: its decision boundary can take a diagonal\n", + "direction. Furthermore, we can observe that predictions are very confident in\n", + "the low density regions of the feature space, even very close to the decision\n", + "boundary\n", + "\n", + "We can obtain very similar results by using a kernel approximation technique\n", + "such as the Nystr\u00f6m method with a polynomial kernel:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " Nystroem(kernel=\"poly\", degree=3, coef0=1, n_components=100),\n", + " LogisticRegression(C=10),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Polynomial Nystroem classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The polynomial kernel approach would be interesting in cases were the\n", + "original feature space is already of high dimension: in these cases,\n", + "**computing the complete polynomial expansion** with `PolynomialFeatures`\n", + "could be **intractable**, while Nystr\u00f6m method can control the output\n", + "dimensionality with the `n_components` parameter.\n", + "\n", + "Let's now explore the use of a radial basis function (RBF) kernel:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " Nystroem(kernel=\"rbf\", gamma=1, n_components=100),\n", + " LogisticRegression(C=5),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"RBF Nystroem classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, 
+ "source": [ + "\n", + "The resulting decision boundary is **smooth** and can successfully separate\n", + "the classes for all three datasets. Furthemore, the model extrapolates very\n", + "differently: in particular, it tends to be **much less confident in its\n", + "predictions in the low density regions** of the feature space.\n", + "\n", + "As for the previous polynomial pipelines, this pipeline **does not favor\n", + "axis-aligned decision rules**. It can be shown mathematically that the\n", + "[inductive bias](https://en.wikipedia.org/wiki/Inductive_bias) of our RBF\n", + "pipeline is actually rotationally invariant." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Multi-step feature engineering\n", + "\n", + "It is possible to combine several feature engineering transformers in a\n", + "single pipeline to blend their respective inductive biases. For instance, we\n", + "can combine the binning transformation with a kernel approximation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "lines_to_next_cell": 0 + }, + "outputs": [], + "source": [ + "classifier = make_pipeline(\n", + " KBinsDiscretizer(n_bins=5),\n", + " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100),\n", + " LogisticRegression(),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Binning + Nystroem classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "It is interesting to observe that this model is still piecewise constant with\n", + "axis-aligned decision boundaries everywhere, but it can now successfully deal\n", + "with the XOR problem thanks to the second step of the pipeline that can\n", + "model the interactions between the features transformed by the first step.\n", + "\n", + "We can also combine the spline transformation with a kernel approximation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "classifier = make_pipeline(\n", + " SplineTransformer(n_knots=5),\n", + " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100),\n", + " LogisticRegression(),\n", + ")\n", + "classifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(classifier, title=\"Spline + RBF Nystroem classifier\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The decision boundary of this pipeline is smooth, but with axis-aligned\n", + "extrapolation.\n", + "\n", + "Depending on the task, this can be considered an advantage or a drawback." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Summary and take-away messages\n", + "\n", + "- Linear models such as logistic regression can be used for classification on\n", + " non-linearly separable datasets by leveraging non-linear feature\n", + " engineering.\n", + "- Transformers such as `KBinsDiscretizer` and `SplineTransformer` can be used\n", + " to engineer non-linear features independently for each original feature.\n", + "- As a result, these transformers cannot capture interactions between the\n", + " orignal features (and then would fail on the XOR classification task).\n", + "- Despite this limitation they already augment the expressivity of the\n", + " pipeline, which can be sufficient for some datasets.\n", + "- They also favor axis-aligned decision boundaries, in particular in the low\n", + " density regions of the feature space (axis-aligned extrapolation).\n", + "- Transformers such as `PolynomialFeatures` and `Nystroem` can be used to\n", + " engineer non-linear features that capture interactions between the original\n", + " features.\n", + "- It can be useful to combine several feature engineering transformers in a\n", + " single pipeline to build a more expressive model, for instance to favor\n", + " axis-aligned extrapolation while also capturing interactions.\n", + "- In particular, if the original dataset has both numerical and categorical\n", + " features, it can be useful to apply binning or a spline transformation to the\n", + " numerical features and one-hot encoding to the categorical features. Then,\n", + " the resulting features can be combined with a kernel approximation to model\n", + " interactions between numerical and categorical features. This can be\n", + " achieved with the help of `ColumnTransformer`.\n", + "\n", + "In subsequent notebooks and exercises, we will further explore the interplay\n", + "between regularization, feature engineering, and the under-fitting /\n", + "overfitting trade-off.\n", + "\n", + "But first we will do an exercise to illustrate the relationship between the\n", + "Nystr\u00f6m kernel approximation and support vector machines." 
+ ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb index 178514087..20256e76b 100644 --- a/notebooks/linear_models_sol_03.ipynb +++ b/notebooks/linear_models_sol_03.ipynb @@ -2,36 +2,40 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 2 - }, + "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", - "In the previous Module we tuned the hyperparameter `C` of the logistic\n", - "regression without mentioning that it controls the regularization strength.\n", - "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n", - "metioned that a small `C` provides a more regularized model, whereas a\n", - "non-regularized model is obtained with an infinitely large value of `C`.\n", - "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n", - "model.\n", - "\n", - "In this exercise, we ask you to train a logistic regression classifier using\n", - "different values of the parameter `C` to find its effects by yourself.\n", + "Now, we tackle a more realistic classification problem instead of making a\n", + "synthetic dataset. We start by loading the Adult Census dataset with the\n", + "following snippet. For the moment we retain only the **numerical features**." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", "\n", - "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n", - "to keep the discussion simple." + "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", + "target = adult_census[\"class\"]\n", + "data = adult_census.select_dtypes([\"integer\", \"floating\"])\n", + "data = data.drop(columns=[\"education-num\"])\n", + "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "
<div class=\"admonition note alert alert-info\">\n", - "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", - "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.</p>\n", - "</div>
" + "We confirm that all the selected features are numerical.\n", + "\n", + "Compute the generalization performance in terms of accuracy of a linear model\n", + "composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n", + "cross-validation with `return_estimator=True` to be able to inspect the\n", + "trained estimators." ] }, { @@ -40,15 +44,28 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", + "# solution\n", + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import cross_validate\n", "\n", - "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", - "penguins = (\n", - " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", + "model = make_pipeline(StandardScaler(), LogisticRegression())\n", + "cv_results_lr = cross_validate(\n", + " model, data, target, cv=10, return_estimator=True\n", ")\n", + "test_score_lr = cv_results_lr[\"test_score\"]\n", + "test_score_lr" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the most important feature seen by the logistic regression?\n", "\n", - "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", - "target_column = \"Species\"" + "You can use a boxplot to compare the absolute values of the coefficients while\n", + "also visualizing the variability induced by the cross-validation resampling." ] }, { @@ -57,29 +74,41 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "penguins_train, penguins_test = train_test_split(\n", - " penguins, random_state=0, test_size=0.4\n", - ")\n", + "# solution\n", + "import matplotlib.pyplot as plt\n", "\n", - "data_train = penguins_train[culmen_columns]\n", - "data_test = penguins_test[culmen_columns]\n", + "coefs = [pipeline[-1].coef_[0] for pipeline in cv_results_lr[\"estimator\"]]\n", + "coefs = pd.DataFrame(coefs, columns=data.columns)\n", "\n", - "target_train = penguins_train[target_column]\n", - "target_test = penguins_test[target_column]" + "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", + "_, ax = plt.subplots()\n", + "_ = coefs.abs().plot.box(color=color, vert=False, ax=ax)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ + "Since we scaled the features, the coefficients of the linear model can be\n", + "meaningful compared directly. `\"capital-gain\"` is the most impacting feature.\n", + "Just be aware not to draw conclusions on the causal effect provided the impact\n", + "of a feature. Interested readers are refered to the [example on Common\n", + "pitfalls in the interpretation of coefficients of linear\n", + "models](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html)\n", + "or the [example on Failure of Machine Learning to infer causal\n", + "effects](https://scikit-learn.org/stable/auto_examples/inspection/plot_causal_interpretation.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We define a function to help us fit a given `model` and plot its decision\n", - "boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging\n", - "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n", - "to the white color. 
Equivalently, the darker the color, the closer the\n", - "predicted probability is to 0 or 1 and the more confident the classifier is in\n", - "its predictions." + "Let's now work with **both numerical and categorical features**. You can\n", + "reload the Adult Census dataset with the following snippet:" ] }, { @@ -88,53 +117,24 @@ "metadata": {}, "outputs": [], "source": [ - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.inspection import DecisionBoundaryDisplay\n", - "\n", - "\n", - "def plot_decision_boundary(model):\n", - " model.fit(data_train, target_train)\n", - " accuracy = model.score(data_test, target_test)\n", - "\n", - " disp = DecisionBoundaryDisplay.from_estimator(\n", - " model,\n", - " data_train,\n", - " response_method=\"predict_proba\",\n", - " plot_method=\"pcolormesh\",\n", - " cmap=\"RdBu_r\",\n", - " alpha=0.8,\n", - " vmin=0.0,\n", - " vmax=1.0,\n", - " )\n", - " DecisionBoundaryDisplay.from_estimator(\n", - " model,\n", - " data_train,\n", - " response_method=\"predict_proba\",\n", - " plot_method=\"contour\",\n", - " linestyles=\"--\",\n", - " linewidths=1,\n", - " alpha=0.8,\n", - " levels=[0.5],\n", - " ax=disp.ax_,\n", - " )\n", - " sns.scatterplot(\n", - " data=penguins_train,\n", - " x=culmen_columns[0],\n", - " y=culmen_columns[1],\n", - " hue=target_column,\n", - " palette=[\"tab:blue\", \"tab:red\"],\n", - " ax=disp.ax_,\n", - " )\n", - " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", - " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", + "target = adult_census[\"class\"]\n", + "data = adult_census.drop(columns=[\"class\", \"education-num\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's now create our predictive model." + "Create a predictive model where:\n", + "- The numerical data must be scaled.\n", + "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n", + " group categories concerning less than 1% of the total samples.\n", + "- The predictor is a `LogisticRegression`. You may need to increase the number\n", + " of `max_iter`, which is 100 by default.\n", + "\n", + "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n", + "above to evaluate this complex pipeline." 
] }, { @@ -143,28 +143,36 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.linear_model import LogisticRegression\n", - "\n", - "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())" + "# solution\n", + "from sklearn.compose import make_column_selector as selector\n", + "from sklearn.compose import make_column_transformer\n", + "from sklearn.preprocessing import OneHotEncoder\n", + "\n", + "categorical_columns = selector(dtype_include=object)(data)\n", + "numerical_columns = selector(dtype_exclude=object)(data)\n", + "\n", + "preprocessor = make_column_transformer(\n", + " (\n", + " OneHotEncoder(handle_unknown=\"ignore\", min_frequency=0.01),\n", + " categorical_columns,\n", + " ),\n", + " (StandardScaler(), numerical_columns),\n", + ")\n", + "model = make_pipeline(preprocessor, LogisticRegression(max_iter=5_000))\n", + "cv_results_complex_lr = cross_validate(\n", + " model, data, target, cv=10, return_estimator=True, n_jobs=2\n", + ")\n", + "test_score_complex_lr = cv_results_complex_lr[\"test_score\"]\n", + "test_score_complex_lr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Influence of the parameter `C` on the decision boundary\n", - "\n", - "Given the following candidates for the `C` parameter and the\n", - "`plot_decision_boundary` function, find out the impact of `C` on the\n", - "classifier's decision boundary.\n", - "\n", - "- How does the value of `C` impact the confidence on the predictions?\n", - "- How does it impact the underfit/overfit trade-off?\n", - "- How does it impact the position and orientation of the decision boundary?\n", - "\n", - "Try to give an interpretation on the reason for such behavior." + "By comparing the cross-validation test scores of both models fold-to-fold,\n", + "count the number of times the model using both numerical and categorical\n", + "features has a better test score than the model using only numerical features." 
] }, { @@ -173,75 +181,83 @@ "metadata": {}, "outputs": [], "source": [ - "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n", + "# solution\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", "\n", + "indices = np.arange(len(test_score_lr))\n", + "plt.scatter(\n", + " indices, test_score_lr, color=\"tab:blue\", label=\"numerical features only\"\n", + ")\n", + "plt.scatter(\n", + " indices,\n", + " test_score_complex_lr,\n", + " color=\"tab:red\",\n", + " label=\"all features\",\n", + ")\n", + "plt.ylim((0, 1))\n", + "plt.xlabel(\"Cross-validation iteration\")\n", + "plt.ylabel(\"Accuracy\")\n", + "_ = plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n", + "\n", + "print(\n", + " \"A model using both all features is better than a\"\n", + " \" model using only numerical features for\"\n", + " f\" {sum(test_score_complex_lr > test_score_lr)} CV iterations out of 10.\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the following questions, you can copy adn paste the following snippet to\n", + "get the feature names from the column transformer here named `preprocessor`.\n", + "\n", + "```python\n", + "preprocessor.fit(data)\n", + "feature_names = (\n", + " preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n", + " categorical_columns\n", + " )\n", + ").tolist()\n", + "feature_names += numerical_columns\n", + "feature_names\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# solution\n", - "for C in Cs:\n", - " logistic_regression.set_params(logisticregression__C=C)\n", - " plot_decision_boundary(logistic_regression)" + "preprocessor.fit(data)\n", + "feature_names = (\n", + " preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n", + " categorical_columns\n", + " )\n", + ").tolist()\n", + "feature_names += numerical_columns\n", + "feature_names" ] }, { "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, + "metadata": {}, "source": [ - "\n", - "On this series of plots we can observe several important points. Regarding the\n", - "confidence on the predictions:\n", - "\n", - "- For low values of `C` (strong regularization), the classifier is less\n", - " confident in its predictions. We are enforcing a **spread sigmoid**.\n", - "- For high values of `C` (weak regularization), the classifier is more\n", - " confident: the areas with dark blue (very confident in predicting \"Adelie\")\n", - " and dark red (very confident in predicting \"Chinstrap\") nearly cover the\n", - " entire feature space. We are enforcing a **steep sigmoid**.\n", - "\n", - "To answer the next question, think that misclassified data points are more\n", - "costly when the classifier is more confident on the decision. Decision rules\n", - "are mostly driven by avoiding such cost. From the previous observations we can\n", - "then deduce that:\n", - "\n", - "- The smaller the `C` (the stronger the regularization), the lower the cost\n", - " of a misclassification. As more data points lay in the low-confidence\n", - " zone, the more the decision rules are influenced almost uniformly by all\n", - " the data points. This leads to a less expressive model, which may underfit.\n", - "- The higher the value of `C` (the weaker the regularization), the more the\n", - " decision is influenced by a few training points very close to the boundary,\n", - " where decisions are costly. 
Remember that models may overfit if the number\n", - " of samples in the training set is too small, as at least a minimum of\n", - " samples is needed to average the noise out.\n", - "\n", - "The orientation is the result of two factors: minimizing the number of\n", - "misclassified training points with high confidence and their distance to the\n", - "decision boundary (notice how the contour line tries to align with the most\n", - "misclassified data points in the dark-colored zone). This is closely related\n", - "to the value of the weights of the model, which is explained in the next part\n", - "of the exercise.\n", - "\n", - "Finally, for small values of `C` the position of the decision boundary is\n", - "affected by the class imbalance: when `C` is near zero, the model predicts the\n", - "majority class (as seen in the training set) everywhere in the feature space.\n", - "In our case, there are approximately two times more \"Adelie\" than \"Chinstrap\"\n", - "penguins. This explains why the decision boundary is shifted to the right when\n", - "`C` gets smaller. Indeed, the most regularized model predicts light blue\n", - "almost everywhere in the feature space." + "Notice that there are as many feature names as coefficients in the last step\n", + "of your predictive pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Impact of the regularization on the weights\n", - "\n", - "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n", - "**Hint**: You can [access pipeline\n", - "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", - "by name or position. Then you can query the attributes of that step such as\n", - "`coef_`." + "Which of the following pairs of features is most impacting the predictions of\n", + "the logistic regression classifier based on the absolute magnitude of its\n", + "coefficients?" ] }, { @@ -251,67 +267,63 @@ "outputs": [], "source": [ "# solution\n", - "lr_weights = []\n", - "for C in Cs:\n", - " logistic_regression.set_params(logisticregression__C=C)\n", - " logistic_regression.fit(data_train, target_train)\n", - " coefs = logistic_regression[-1].coef_[0]\n", - " lr_weights.append(pd.Series(coefs, index=culmen_columns))" + "coefs = [\n", + " pipeline[-1].coef_[0] for pipeline in cv_results_complex_lr[\"estimator\"]\n", + "]\n", + "coefs = pd.DataFrame(coefs, columns=feature_names)\n", + "\n", + "_, ax = plt.subplots(figsize=(10, 35))\n", + "_ = coefs.abs().plot.box(color=color, vert=False, ax=ax)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, - "outputs": [], "source": [ - "lr_weights = pd.concat(lr_weights, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", - "lr_weights.plot.barh()\n", - "_ = plt.title(\"LogisticRegression weights depending of C\")" + "We can visually inspect the coefficients and observe that `\"capital-gain\"` and\n", + "`\"education_Doctorate\"` are impacting the predictions the most." ] }, { "cell_type": "markdown", - "metadata": { - "tags": [ - "solution" - ] - }, + "metadata": {}, "source": [ + "Now create a similar pipeline consisting of the same preprocessor as above,\n", + "followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n", + "Set `degree=2` and `interaction_only=True` to the feature engineering step.\n", + "Remember not to include a \"bias\" feature to avoid introducing a redundancy\n", + "with the intercept of the subsequent logistic regression." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# solution\n", + "from sklearn.preprocessing import PolynomialFeatures\n", "\n", - "As small `C` provides a more regularized model, it shrinks the weights values\n", - "toward zero, as in the `Ridge` model.\n", - "\n", - "In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature\n", - "named \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", - "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n", - "feature.\n", - "\n", - "For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both\n", - "features are almost zero. It explains why the decision separation in the plot\n", - "is almost constant in the feature space: the predicted probability is only\n", - "based on the intercept parameter of the model (which is never regularized)." + "model_with_interaction = make_pipeline(\n", + " preprocessor,\n", + " PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n", + " LogisticRegression(C=0.01, max_iter=5_000),\n", + ")\n", + "model_with_interaction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Impact of the regularization on with non-linear feature engineering\n", - "\n", - "Use the `plot_decision_boundary` function to repeat the experiment using a\n", - "non-linear feature engineering pipeline. For such purpose, insert\n", - "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n", - "`StandardScaler` and the `LogisticRegression` steps.\n", - "\n", - "- Does the value of `C` still impact the position of the decision boundary and\n", - " the confidence of the model?\n", - "- What can you say about the impact of `C` on the underfitting vs overfitting\n", - " trade-off?" + "By comparing the cross-validation test scores of both models fold-to-fold,\n", + "count the number of times the model using multiplicative interactions and both\n", + "numerical and categorical features has a better test score than the model\n", + "without interactions." 
] }, { @@ -320,18 +332,51 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.kernel_approximation import Nystroem\n", - "\n", "# solution\n", - "classifier = make_pipeline(\n", - " StandardScaler(),\n", - " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100, random_state=0),\n", - " LogisticRegression(penalty=\"l2\", max_iter=1000),\n", + "cv_results_interactions = cross_validate(\n", + " model_with_interaction,\n", + " data,\n", + " target,\n", + " cv=10,\n", + " return_estimator=True,\n", + " n_jobs=2,\n", ")\n", - "\n", - "for C in Cs:\n", - " classifier.set_params(logisticregression__C=C)\n", - " plot_decision_boundary(classifier)" + "test_score_interactions = cv_results_interactions[\"test_score\"]\n", + "test_score_interactions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# solution\n", + "plt.scatter(\n", + " indices, test_score_lr, color=\"tab:blue\", label=\"numerical features only\"\n", + ")\n", + "plt.scatter(\n", + " indices,\n", + " test_score_complex_lr,\n", + " color=\"tab:red\",\n", + " label=\"all features\",\n", + ")\n", + "plt.scatter(\n", + " indices,\n", + " test_score_interactions,\n", + " color=\"black\",\n", + " label=\"all features and interactions\",\n", + ")\n", + "plt.xlabel(\"Cross-validation iteration\")\n", + "plt.ylabel(\"Accuracy\")\n", + "_ = plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n", + "\n", + "print(\n", + " \"A model using all features and interactions is better than a model\"\n", + " \" without interactions for\"\n", + " f\" {sum(test_score_interactions > test_score_complex_lr)} CV iterations\"\n", + " \" out of 10.\"\n", + ")" ] }, { @@ -342,41 +387,19 @@ ] }, "source": [ - "\n", - "- For the lowest values of `C`, the overall pipeline underfits: it predicts\n", - " the majority class everywhere, as previously.\n", - "- When `C` increases, the models starts to predict some datapoints from the\n", - " \"Chinstrap\" class but the model is not very confident anywhere in the\n", - " feature space.\n", - "- The decision boundary is no longer a straight line: the linear model is now\n", - " classifying in the 100-dimensional feature space created by the `Nystroem`\n", - " transformer. As are result, the decision boundary induced by the overall\n", - " pipeline is now expressive enough to wrap around the minority class.\n", - "- For `C = 1` in particular, it finds a smooth red blob around most of the\n", - " \"Chinstrap\" data points. When moving away from the data points, the model is\n", - " less confident in its predictions and again tends to predict the majority\n", - " class according to the proportion in the training set.\n", - "- For higher values of `C`, the model starts to overfit: it is very confident\n", - " in its predictions almost everywhere, but it should not be trusted: the\n", - " model also makes a larger number of mistakes on the test set (not shown in\n", - " the plot) while adopting a very curvy decision boundary to attempt fitting\n", - " all the training points, including the noisy ones at the frontier between\n", - " the two classes. This makes the decision boundary very sensitive to the\n", - " sampling of the training set and as a result, it does not generalize well in\n", - " that region. This is confirmed by the (slightly) lower accuracy on the test\n", - " set.\n", - "\n", - "Finally, we can also note that the linear model on the raw features was as\n", - "good or better than the best model using non-linear feature engineering. 
So in\n", - "this case, we did not really need this extra complexity in our pipeline.\n", - "**Simpler is better!**\n", - "\n", - "So to conclude, when using non-linear feature engineering, it is often\n", - "possible to make the pipeline overfit, even if the original feature space is\n", - "low-dimensional. As a result, it is important to tune the regularization\n", - "parameter in conjunction with the parameters of the transformers (e.g. tuning\n", - "`gamma` would be important here). This has a direct impact on the certainty of\n", - "the predictions." + "When you multiply two one-hot encoded categorical features, the resulting\n", + "interaction feature is mostly 0, with a 1 only when both original features are\n", + "active, acting as a logical `AND`. In this case it could mean we are creating\n", + "new rules such as \"has a given education `AND` a given native country\", which\n", + "we expect to be predictive. This new rules map the original feature space into\n", + "a higher dimension space, where the linear model can separate the data more\n", + "easily.\n", + "\n", + "Keep into account that multiplying all pairs of one-hot encoded features may\n", + "lead to a rapid increase in the number of features, especially if the original\n", + "categorical variables have many levels. This can increase the computational\n", + "cost of your model and promote overfitting, as we will see in a future\n", + "notebook." ] } ], diff --git a/notebooks/linear_models_sol_04.ipynb b/notebooks/linear_models_sol_04.ipynb new file mode 100644 index 000000000..54b7a613e --- /dev/null +++ b/notebooks/linear_models_sol_04.ipynb @@ -0,0 +1,395 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "lines_to_next_cell": 2 + }, + "source": [ + "# \ud83d\udcc3 Solution for Exercise M4.04\n", + "\n", + "In the previous Module we tuned the hyperparameter `C` of the logistic\n", + "regression without mentioning that it controls the regularization strength.\n", + "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n", + "metioned that a small `C` provides a more regularized model, whereas a\n", + "non-regularized model is obtained with an infinitely large value of `C`.\n", + "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n", + "model.\n", + "\n", + "In this exercise, we ask you to train a logistic regression classifier using\n", + "different values of the parameter `C` to find its effects by yourself.\n", + "\n", + "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n", + "to keep the discussion simple." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "

Note

\n", + "

If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n", + "penguins = (\n", + " penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n", + ")\n", + "\n", + "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", + "target_column = \"Species\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "penguins_train, penguins_test = train_test_split(\n", + " penguins, random_state=0, test_size=0.4\n", + ")\n", + "\n", + "data_train = penguins_train[culmen_columns]\n", + "data_test = penguins_test[culmen_columns]\n", + "\n", + "target_train = penguins_train[target_column]\n", + "target_test = penguins_test[target_column]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We define a function to help us fit a given `model` and plot its decision\n", + "boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging\n", + "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n", + "to the white color. Equivalently, the darker the color, the closer the\n", + "predicted probability is to 0 or 1 and the more confident the classifier is in\n", + "its predictions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from sklearn.inspection import DecisionBoundaryDisplay\n", + "\n", + "\n", + "def plot_decision_boundary(model):\n", + " model.fit(data_train, target_train)\n", + " accuracy = model.score(data_test, target_test)\n", + " C = model.get_params()[\"logisticregression__C\"]\n", + "\n", + " disp = DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"pcolormesh\",\n", + " cmap=\"RdBu_r\",\n", + " alpha=0.8,\n", + " vmin=0.0,\n", + " vmax=1.0,\n", + " )\n", + " DecisionBoundaryDisplay.from_estimator(\n", + " model,\n", + " data_train,\n", + " response_method=\"predict_proba\",\n", + " plot_method=\"contour\",\n", + " linestyles=\"--\",\n", + " linewidths=1,\n", + " alpha=0.8,\n", + " levels=[0.5],\n", + " ax=disp.ax_,\n", + " )\n", + " sns.scatterplot(\n", + " data=penguins_train,\n", + " x=culmen_columns[0],\n", + " y=culmen_columns[1],\n", + " hue=target_column,\n", + " palette=[\"tab:blue\", \"tab:red\"],\n", + " ax=disp.ax_,\n", + " )\n", + " plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", + " plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now create our predictive model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Influence of the parameter `C` on the decision boundary\n", + "\n", + "Given the following candidates for the `C` parameter and the\n", + "`plot_decision_boundary` function, find out the impact of `C` on the\n", + "classifier's decision boundary.\n", + "\n", + "- How does the value of `C` impact the confidence on the predictions?\n", + "- How does it impact the underfit/overfit trade-off?\n", + "- How does it impact the position and orientation of the decision boundary?\n", + "\n", + "Try to give an interpretation on the reason for such behavior." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n", + "\n", + "# solution\n", + "for C in Cs:\n", + " logistic_regression.set_params(logisticregression__C=C)\n", + " plot_decision_boundary(logistic_regression)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ + "\n", + "On this series of plots we can observe several important points. Regarding the\n", + "confidence on the predictions:\n", + "\n", + "- For low values of `C` (strong regularization), the classifier is less\n", + " confident in its predictions. We are enforcing a **spread sigmoid**.\n", + "- For high values of `C` (weak regularization), the classifier is more\n", + " confident: the areas with dark blue (very confident in predicting \"Adelie\")\n", + " and dark red (very confident in predicting \"Chinstrap\") nearly cover the\n", + " entire feature space. We are enforcing a **steep sigmoid**.\n", + "\n", + "To answer the next question, think that misclassified data points are more\n", + "costly when the classifier is more confident on the decision. Decision rules\n", + "are mostly driven by avoiding such cost. From the previous observations we can\n", + "then deduce that:\n", + "\n", + "- The smaller the `C` (the stronger the regularization), the lower the cost\n", + " of a misclassification. As more data points lay in the low-confidence\n", + " zone, the more the decision rules are influenced almost uniformly by all\n", + " the data points. This leads to a less expressive model, which may underfit.\n", + "- The higher the value of `C` (the weaker the regularization), the more the\n", + " decision is influenced by a few training points very close to the boundary,\n", + " where decisions are costly. Remember that models may overfit if the number\n", + " of samples in the training set is too small, as at least a minimum of\n", + " samples is needed to average the noise out.\n", + "\n", + "The orientation is the result of two factors: minimizing the number of\n", + "misclassified training points with high confidence and their distance to the\n", + "decision boundary (notice how the contour line tries to align with the most\n", + "misclassified data points in the dark-colored zone). 
This is closely related\n", + "to the value of the weights of the model, which is explained in the next part\n", + "of the exercise.\n", + "\n", + "Finally, for small values of `C` the position of the decision boundary is\n", + "affected by the class imbalance: when `C` is near zero, the model predicts the\n", + "majority class (as seen in the training set) everywhere in the feature space.\n", + "In our case, there are approximately two times more \"Adelie\" than \"Chinstrap\"\n", + "penguins. This explains why the decision boundary is shifted to the right when\n", + "`C` gets smaller. Indeed, the most regularized model predicts light blue\n", + "almost everywhere in the feature space." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Impact of the regularization on the weights\n", + "\n", + "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n", + "**Hint**: You can [access pipeline\n", + "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", + "by name or position. Then you can query the attributes of that step such as\n", + "`coef_`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# solution\n", + "lr_weights = []\n", + "for C in Cs:\n", + " logistic_regression.set_params(logisticregression__C=C)\n", + " logistic_regression.fit(data_train, target_train)\n", + " coefs = logistic_regression[-1].coef_[0]\n", + " lr_weights.append(pd.Series(coefs, index=culmen_columns))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "solution" + ] + }, + "outputs": [], + "source": [ + "lr_weights = pd.concat(lr_weights, axis=1, keys=[f\"C: {C}\" for C in Cs])\n", + "lr_weights.plot.barh()\n", + "_ = plt.title(\"LogisticRegression weights depending of C\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ + "\n", + "As small `C` provides a more regularized model, it shrinks the weights values\n", + "toward zero, as in the `Ridge` model.\n", + "\n", + "In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature\n", + "named \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n", + "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n", + "feature.\n", + "\n", + "For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both\n", + "features are almost zero. It explains why the decision separation in the plot\n", + "is almost constant in the feature space: the predicted probability is only\n", + "based on the intercept parameter of the model (which is never regularized)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Impact of the regularization on with non-linear feature engineering\n", + "\n", + "Use the `plot_decision_boundary` function to repeat the experiment using a\n", + "non-linear feature engineering pipeline. For such purpose, insert\n", + "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n", + "`StandardScaler` and the `LogisticRegression` steps.\n", + "\n", + "- Does the value of `C` still impact the position of the decision boundary and\n", + " the confidence of the model?\n", + "- What can you say about the impact of `C` on the underfitting vs overfitting\n", + " trade-off?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.kernel_approximation import Nystroem\n", + "\n", + "# solution\n", + "classifier = make_pipeline(\n", + " StandardScaler(),\n", + " Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100, random_state=0),\n", + " LogisticRegression(max_iter=1000),\n", + ")\n", + "\n", + "for C in Cs:\n", + " classifier.set_params(logisticregression__C=C)\n", + " plot_decision_boundary(classifier)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "solution" + ] + }, + "source": [ + "\n", + "- For the lowest values of `C`, the overall pipeline underfits: it predicts\n", + " the majority class everywhere, as previously.\n", + "- When `C` increases, the models starts to predict some datapoints from the\n", + " \"Chinstrap\" class but the model is not very confident anywhere in the\n", + " feature space.\n", + "- The decision boundary is no longer a straight line: the linear model is now\n", + " classifying in the 100-dimensional feature space created by the `Nystroem`\n", + " transformer. As are result, the decision boundary induced by the overall\n", + " pipeline is now expressive enough to wrap around the minority class.\n", + "- For `C = 1` in particular, it finds a smooth red blob around most of the\n", + " \"Chinstrap\" data points. When moving away from the data points, the model is\n", + " less confident in its predictions and again tends to predict the majority\n", + " class according to the proportion in the training set.\n", + "- For higher values of `C`, the model starts to overfit: it is very confident\n", + " in its predictions almost everywhere, but it should not be trusted: the\n", + " model also makes a larger number of mistakes on the test set (not shown in\n", + " the plot) while adopting a very curvy decision boundary to attempt fitting\n", + " all the training points, including the noisy ones at the frontier between\n", + " the two classes. This makes the decision boundary very sensitive to the\n", + " sampling of the training set and as a result, it does not generalize well in\n", + " that region. This is confirmed by the (slightly) lower accuracy on the test\n", + " set.\n", + "\n", + "Finally, we can also note that the linear model on the raw features was as\n", + "good or better than the best model using non-linear feature engineering. So in\n", + "this case, we did not really need this extra complexity in our pipeline.\n", + "**Simpler is better!**\n", + "\n", + "So to conclude, when using non-linear feature engineering, it is often\n", + "possible to make the pipeline overfit, even if the original feature space is\n", + "low-dimensional. As a result, it is important to tune the regularization\n", + "parameter in conjunction with the parameters of the transformers (e.g. tuning\n", + "`gamma` would be important here). This has a direct impact on the certainty of\n", + "the predictions." 
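The concluding advice above (tune the regularization parameter together with the parameters of the transformers) can be made concrete with a small sketch. This is not part of the notebook's solution: it assumes the `classifier` pipeline defined just above and the `penguins` dataframe loaded earlier, and the grid values are arbitrary illustrative choices.

```python
# Hypothetical sketch: jointly tune the Nystroem `gamma` and the
# LogisticRegression `C` of the `classifier` pipeline defined above.
import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {
    "nystroem__gamma": np.logspace(-2, 2, 5),        # arbitrary illustrative grid
    "logisticregression__C": np.logspace(-3, 3, 7),  # arbitrary illustrative grid
}
search = GridSearchCV(classifier, param_grid=param_grid, cv=10, n_jobs=2)
search.fit(penguins[culmen_columns], penguins[target_column])
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The step names `nystroem` and `logisticregression` are the ones `make_pipeline` generates automatically, which is why they prefix the parameter names in `param_grid`.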
+ ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/notebooks/linear_regression_non_linear_link.ipynb b/notebooks/linear_regression_non_linear_link.ipynb index d56505e65..33f6936cc 100644 --- a/notebooks/linear_regression_non_linear_link.ipynb +++ b/notebooks/linear_regression_non_linear_link.ipynb @@ -2,9 +2,10 @@ "cells": [ { "cell_type": "markdown", + "id": "14eec485", "metadata": {}, "source": [ - "# Linear regression for a non-linear features-target relationship\n", + "# Non-linear feature engineering for Linear Regression\n", "\n", "In this notebook, we show that even if linear models are not natively adapted\n", "to express a `target` that is not a linear function of the `data`, it is still\n", @@ -15,16 +16,16 @@ "step followed by a linear regression step can therefore be considered a\n", "non-linear regression model as a whole.\n", "\n", - "
\n", - "

Tip

\n", - "

np.random.RandomState allows to create a random number generator which can\n", - "be later used to get deterministic results.

\n", - "
" + "In this occasion we are not loading a dataset, but creating our own custom\n", + "data consisting of a single feature. The target is built as a cubic polynomial\n", + "on said feature. To make things a bit more challenging, we add some random\n", + "fluctuations to the target." ] }, { "cell_type": "code", "execution_count": null, + "id": "8f516165", "metadata": {}, "outputs": [], "source": [ @@ -43,18 +44,22 @@ }, { "cell_type": "markdown", + "id": "00fd3b4f", "metadata": {}, "source": [ - "
\n", - "

Note

\n", - "

To ease the plotting, we create a pandas dataframe containing the data and\n", - "target:

\n", - "
" + "```{tip}\n", + "`np.random.RandomState` allows to create a random number generator which can\n", + "be later used to get deterministic results.\n", + "```\n", + "\n", + "To ease the plotting, we create a pandas dataframe containing the data and\n", + "target:" ] }, { "cell_type": "code", "execution_count": null, + "id": "5459a97b", "metadata": {}, "outputs": [], "source": [ @@ -66,6 +71,7 @@ { "cell_type": "code", "execution_count": null, + "id": "8b1b2257", "metadata": {}, "outputs": [], "source": [ @@ -78,23 +84,22 @@ }, { "cell_type": "markdown", + "id": "be69fae1", "metadata": {}, "source": [ - "We now observe the limitations of fitting a linear regression model.\n", - "\n", - "
\n", - "

Warning

\n", - "

In scikit-learn, by convention data (also called X in the scikit-learn\n", - "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", - "If data is a 1D vector, you need to reshape it into a matrix with a\n", + "```{warning}\n", + "In scikit-learn, by convention `data` (also called `X` in the scikit-learn\n", + "documentation) should be a 2D matrix of shape `(n_samples, n_features)`.\n", + "If `data` is a 1D vector, you need to reshape it into a matrix with a\n", "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.

\n", - "
" + "vector represents a sample.\n", + "```" ] }, { "cell_type": "code", "execution_count": null, + "id": "46804be9", "metadata": {}, "outputs": [], "source": [ @@ -103,47 +108,75 @@ "data.shape" ] }, + { + "cell_type": "markdown", + "id": "a4209f00", + "metadata": { + "lines_to_next_cell": 2 + }, + "source": [ + "To avoid writing the same code in multiple places we define a helper function\n", + "that fits, scores and plots the different regression models." + ] + }, { "cell_type": "code", "execution_count": null, + "id": "a1bd392b", "metadata": {}, "outputs": [], "source": [ - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.metrics import mean_squared_error\n", - "\n", - "linear_regression = LinearRegression()\n", - "linear_regression.fit(data, target)\n", - "target_predicted = linear_regression.predict(data)" + "def fit_score_plot_regression(model, title=None):\n", + " model.fit(data, target)\n", + " target_predicted = model.predict(data)\n", + " mse = mean_squared_error(target, target_predicted)\n", + " ax = sns.scatterplot(\n", + " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", + " )\n", + " ax.plot(data, target_predicted)\n", + " if title is not None:\n", + " _ = ax.set_title(title + f\" (MSE = {mse:.2f})\")\n", + " else:\n", + " _ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "id": "7bfcbeb8", + "metadata": {}, + "source": [ + "We now observe the limitations of fitting a linear regression model." ] }, { "cell_type": "code", "execution_count": null, + "id": "1545fec5", "metadata": {}, "outputs": [], "source": [ - "mse = mean_squared_error(target, target_predicted)" + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.metrics import mean_squared_error\n", + "\n", + "linear_regression = LinearRegression()\n", + "linear_regression" ] }, { "cell_type": "code", "execution_count": null, + "id": "e8c79631", "metadata": {}, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "fit_score_plot_regression(linear_regression, title=\"Simple linear regression\")" ] }, { "cell_type": "markdown", + "id": "545fc1f3", "metadata": {}, "source": [ - "\n", "Here the coefficient and intercept learnt by `LinearRegression` define the\n", "best \"straight line\" that fits the data. We can inspect the coefficients using\n", "the attributes of the model learnt as follows:" @@ -152,6 +185,7 @@ { "cell_type": "code", "execution_count": null, + "id": "0f95ceef", "metadata": {}, "outputs": [], "source": [ @@ -163,12 +197,11 @@ }, { "cell_type": "markdown", + "id": "1a34a48c", "metadata": {}, "source": [ - "It is important to note that the learnt model is not able to handle the\n", - "non-linear relationship between `data` and `target` since linear models assume\n", - "the relationship between `data` and `target` to be linear.\n", - "\n", + "Notice that the learnt model cannot handle the non-linear relationship between\n", + "`data` and `target` because linear models assume a linear relationship.\n", "Indeed, there are 3 possibilities to solve this issue:\n", "\n", "1. 
choose a model that can natively deal with non-linearity,\n", @@ -184,31 +217,29 @@ { "cell_type": "code", "execution_count": null, + "id": "e01b02d2", "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3).fit(data, target)\n", - "target_predicted = tree.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)" + "tree" ] }, { "cell_type": "code", "execution_count": null, + "id": "9a27773e", "metadata": {}, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "fit_score_plot_regression(tree, title=\"Decision tree regression\")" ] }, { "cell_type": "markdown", + "id": "4d5070e3", "metadata": {}, "source": [ "Instead of having a model which can natively deal with non-linearity, we could\n", @@ -225,6 +256,7 @@ { "cell_type": "code", "execution_count": null, + "id": "28c13246", "metadata": {}, "outputs": [], "source": [ @@ -234,9 +266,8 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "lines_to_next_cell": 2 - }, + "id": "69d0ba50", + "metadata": {}, "outputs": [], "source": [ "data_expanded = np.concatenate([data, data**2, data**3], axis=1)\n", @@ -244,41 +275,46 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "id": "7925141e", "metadata": {}, - "outputs": [], "source": [ - "linear_regression.fit(data_expanded, target)\n", - "target_predicted = linear_regression.predict(data_expanded)\n", - "mse = mean_squared_error(target, target_predicted)" + "Instead of manually creating such polynomial features one could directly use\n", + "[sklearn.preprocessing.PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)." ] }, { "cell_type": "code", "execution_count": null, + "id": "d31ed0f4", "metadata": {}, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "from sklearn.preprocessing import PolynomialFeatures\n", + "\n", + "polynomial_expansion = PolynomialFeatures(degree=3, include_bias=False)" ] }, { "cell_type": "markdown", + "id": "6a7fe453", "metadata": {}, "source": [ - "We can see that even with a linear model, we can overcome the linearity\n", - "limitation of the model by adding the non-linear components in the design of\n", - "additional features. Here, we created new features by knowing the way the\n", - "target was generated.\n", - "\n", - "Instead of manually creating such polynomial features one could directly use\n", - "[sklearn.preprocessing.PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).\n", + "In the previous cell we had to set `include_bias=False` as otherwise we would\n", + "create a constant feature perfectly correlated to the `intercept_` introduced\n", + "by the `LinearRegression`. 
We can verify that this procedure is equivalent to\n", + "creating the features by hand up to numerical error by computing the maximum\n", + "of the absolute values of the differences between the features generated by\n", + "both methods and checking that it is close to zero:\n", "\n", + "np.abs(polynomial_expansion.fit_transform(data) - data_expanded).max()" + ] + }, + { + "cell_type": "markdown", + "id": "269fbe2b", + "metadata": {}, + "source": [ "To demonstrate the use of the `PolynomialFeatures` class, we use a\n", "scikit-learn pipeline which first transforms the features and then fit the\n", "regression model." @@ -287,6 +323,7 @@ { "cell_type": "code", "execution_count": null, + "id": "38ba0c5c", "metadata": {}, "outputs": [], "source": [ @@ -297,58 +334,29 @@ " PolynomialFeatures(degree=3, include_bias=False),\n", " LinearRegression(),\n", ")\n", - "polynomial_regression.fit(data, target)\n", - "target_predicted = polynomial_regression.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the previous cell we had to set `include_bias=False` as otherwise we would\n", - "create a column perfectly correlated to the `intercept_` introduced by the\n", - "`LinearRegression`. We can verify that this procedure is equivalent to\n", - "creating the features by hand up to numerical error by computing the maximum\n", - "of the absolute values of the differences between the features generated by\n", - "both methods and checking that it is close to zero:" + "polynomial_regression" ] }, { "cell_type": "code", "execution_count": null, + "id": "5df7d4a4", "metadata": {}, "outputs": [], "source": [ - "np.abs(polynomial_regression[0].fit_transform(data) - data_expanded).max()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then it should not be surprising that the predictions of the\n", - "`PolynomialFeatures` pipeline match the predictions of the linear model fit on\n", - "manually engineered features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "fit_score_plot_regression(polynomial_regression, title=\"Polynomial regression\")" ] }, { "cell_type": "markdown", + "id": "fe259d20", "metadata": {}, "source": [ + "We can see that even with a linear model, we can overcome the linearity\n", + "limitation of the model by adding the non-linear components in the design of\n", + "additional features. Here, we created new features by knowing the way the\n", + "target was generated.\n", + "\n", "The last possibility is to make a linear model more expressive is to use a\n", "\"kernel\". Instead of learning one weight per feature as we previously did, a\n", "weight is assigned to each sample. 
However, not all samples are used: some\n", @@ -371,32 +379,29 @@ { "cell_type": "code", "execution_count": null, + "id": "7d46da9b", "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVR\n", "\n", "svr = SVR(kernel=\"linear\")\n", - "svr.fit(data, target)\n", - "target_predicted = svr.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)" + "svr" ] }, { "cell_type": "code", "execution_count": null, + "id": "9406b676", "metadata": {}, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "fit_score_plot_regression(svr, title=\"Linear support vector machine\")" ] }, { "cell_type": "markdown", + "id": "fd29730e", "metadata": {}, "source": [ "The predictions of our SVR with a linear kernel are all aligned on a straight\n", @@ -414,30 +419,27 @@ { "cell_type": "code", "execution_count": null, + "id": "ae1550fa", "metadata": {}, "outputs": [], "source": [ "svr = SVR(kernel=\"poly\", degree=3)\n", - "svr.fit(data, target)\n", - "target_predicted = svr.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)" + "svr" ] }, { "cell_type": "code", "execution_count": null, + "id": "c4670a4e", "metadata": {}, "outputs": [], "source": [ - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "fit_score_plot_regression(svr, title=\"Polynomial support vector machine\")" ] }, { "cell_type": "markdown", + "id": "732b2b0f", "metadata": {}, "source": [ "Kernel methods such as SVR are very efficient for small to medium datasets.\n", @@ -448,7 +450,7 @@ "as\n", "[KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)\n", "or\n", - "[Nystroem](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.Nystroem.html).\n", + "[SplineTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.SplineTransformer.html).\n", "\n", "Here again we refer the interested reader to the documentation to get a proper\n", "definition of those methods. 
The following just gives an intuitive overview of\n", @@ -458,6 +460,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e30e6b37", "metadata": {}, "outputs": [], "source": [ @@ -467,19 +470,48 @@ " KBinsDiscretizer(n_bins=8),\n", " LinearRegression(),\n", ")\n", - "binned_regression.fit(data, target)\n", - "target_predicted = binned_regression.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)\n", + "binned_regression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b46eb0ef", + "metadata": {}, + "outputs": [], + "source": [ + "fit_score_plot_regression(binned_regression, title=\"Binned regression\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5403e6b1", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import SplineTransformer\n", "\n", - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", + "spline_regression = make_pipeline(\n", + " SplineTransformer(degree=3, include_bias=False),\n", + " LinearRegression(),\n", ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "spline_regression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0dcdfe92", + "metadata": {}, + "outputs": [], + "source": [ + "fit_score_plot_regression(spline_regression, title=\"Spline regression\")" ] }, { "cell_type": "markdown", + "id": "4b4f0560", "metadata": {}, "source": [ "`Nystroem` is a nice alternative to `PolynomialFeatures` that makes it\n", @@ -491,6 +523,7 @@ { "cell_type": "code", "execution_count": null, + "id": "41d6abd8", "metadata": {}, "outputs": [], "source": [ @@ -500,19 +533,24 @@ " Nystroem(kernel=\"poly\", degree=3, n_components=5, random_state=0),\n", " LinearRegression(),\n", ")\n", - "nystroem_regression.fit(data, target)\n", - "target_predicted = nystroem_regression.predict(data)\n", - "mse = mean_squared_error(target, target_predicted)\n", - "\n", - "ax = sns.scatterplot(\n", - " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", - ")\n", - "ax.plot(data, target_predicted)\n", - "_ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" + "nystroem_regression" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be6a232c", + "metadata": {}, + "outputs": [], + "source": [ + "fit_score_plot_regression(\n", + " nystroem_regression, title=\"Polynomial Nystroem regression\"\n", + ")" ] }, { "cell_type": "markdown", + "id": "7860e12d", "metadata": {}, "source": [ "## Notebook Recap\n", @@ -541,4 +579,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/notebooks/logistic_regression.ipynb b/notebooks/logistic_regression.ipynb index 4c4cf0de7..691283b02 100644 --- a/notebooks/logistic_regression.ipynb +++ b/notebooks/logistic_regression.ipynb @@ -2,9 +2,10 @@ "cells": [ { "cell_type": "markdown", + "id": "b0e67575", "metadata": {}, "source": [ - "# Linear model for classification\n", + "# Linear models for classification\n", "\n", "In regression, we saw that the target to be predicted is a continuous\n", "variable. In classification, the target is discrete (e.g. categorical).\n", @@ -17,18 +18,19 @@ }, { "cell_type": "markdown", + "id": "ac574018", "metadata": {}, "source": [ - "
\n", - "

Note

\n", - "

If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.

\n", - "
" + "```{note}\n", + "If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.\n", + "```" ] }, { "cell_type": "code", "execution_count": null, + "id": "a47d670a", "metadata": {}, "outputs": [], "source": [ @@ -46,6 +48,7 @@ }, { "cell_type": "markdown", + "id": "2165fcfc", "metadata": {}, "source": [ "We can quickly start by visualizing the feature distribution by class:" @@ -54,6 +57,7 @@ { "cell_type": "code", "execution_count": null, + "id": "9ac5a70c", "metadata": {}, "outputs": [], "source": [ @@ -68,6 +72,7 @@ }, { "cell_type": "markdown", + "id": "cab96de7", "metadata": {}, "source": [ "We can observe that we have quite a simple problem. When the culmen length\n", @@ -81,6 +86,7 @@ { "cell_type": "code", "execution_count": null, + "id": "b6a3b04c", "metadata": {}, "outputs": [], "source": [ @@ -97,6 +103,7 @@ }, { "cell_type": "markdown", + "id": "4964b148", "metadata": {}, "source": [ "The linear regression that we previously saw predicts a continuous output.\n", @@ -110,6 +117,7 @@ { "cell_type": "code", "execution_count": null, + "id": "47347104", "metadata": {}, "outputs": [], "source": [ @@ -125,6 +133,7 @@ }, { "cell_type": "markdown", + "id": "bafd8265", "metadata": {}, "source": [ "Since we are dealing with a classification problem containing only 2 features,\n", @@ -132,21 +141,22 @@ "the rule used by our predictive model to affect a class label given the\n", "feature values of the sample.\n", "\n", - "
\n", - "

Note

\n", - "

Here, we use the class DecisionBoundaryDisplay. This educational tool allows\n", + "```{note}\n", + "Here, we use the class `DecisionBoundaryDisplay`. This educational tool allows\n", "us to gain some insights by plotting the decision function boundary learned by\n", - "the classifier in a 2 dimensional feature space.

\n", - "

Notice however that in more realistic machine learning contexts, one would\n", + "the classifier in a 2 dimensional feature space.\n", + "\n", + "Notice however that in more realistic machine learning contexts, one would\n", "typically fit on more than two features at once and therefore it would not be\n", "possible to display such a visualization of the decision boundary in\n", - "general.

\n", - "
" + "general.\n", + "```" ] }, { "cell_type": "code", "execution_count": null, + "id": "dd628d44", "metadata": {}, "outputs": [], "source": [ @@ -172,28 +182,61 @@ }, { "cell_type": "markdown", + "id": "dbd93bf3", "metadata": {}, "source": [ - "Thus, we see that our decision function is represented by a line separating\n", - "the 2 classes.\n", + "Thus, we see that our decision function is represented by a straight line\n", + "separating the 2 classes.\n", + "\n", + "For the mathematically inclined reader, the equation of the decision boundary\n", + "is:\n", + "\n", + " coef0 * x0 + coef1 * x1 + intercept = 0\n", + "\n", + "where `x0` is `\"Culmen Length (mm)\"` and `x1` is `\"Culmen Depth (mm)\"`.\n", "\n", - "Since the line is oblique, it means that we used a combination of both\n", - "features:" + "This equation is equivalent to (assuming that `coef1` is non-zero):\n", + "\n", + " x1 = coef0 / coef1 * x0 - intercept / coef1\n", + "\n", + "which is the equation of a straight line.\n", + "\n", + "Since the line is oblique, it means that both coefficients (also called\n", + "weights) are non-null:" ] }, { "cell_type": "code", "execution_count": null, + "id": "8c76e56c", "metadata": {}, "outputs": [], "source": [ - "coefs = logistic_regression[-1].coef_[0] # the coefficients is a 2d array\n", - "weights = pd.Series(coefs, index=culmen_columns)" + "coefs = logistic_regression[-1].coef_[0]\n", + "weights = pd.Series(coefs, index=[f\"Weight for '{c}'\" for c in culmen_columns])\n", + "weights" + ] + }, + { + "cell_type": "markdown", + "id": "416a9aff", + "metadata": {}, + "source": [ + "You can [access pipeline\n", + "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n", + "by name or position. In the code above `logistic_regression[-1]` means the\n", + "last step of the pipeline. Then you can access the attributes of that step such\n", + "as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,\n", + "`n_features`) an then we access it via its first entry. Alternatively one\n", + "could use `coef_.ravel()`.\n", + "\n", + "We are now ready to visualize the weight values as a barplot:" ] }, { "cell_type": "code", "execution_count": null, + "id": "8c9b19ae", "metadata": {}, "outputs": [], "source": [ @@ -203,34 +246,178 @@ }, { "cell_type": "markdown", + "id": "083d61ff", "metadata": {}, "source": [ - "Indeed, both coefficients are non-null. If one of them had been zero, the\n", - "decision boundary would have been either horizontal or vertical.\n", + "If one of the weights had been zero, the decision boundary would have been\n", + "either horizontal or vertical.\n", "\n", "Furthermore the intercept is also non-zero, which means that the decision does\n", "not go through the point with (0, 0) coordinates.\n", "\n", - "For the mathematically inclined reader, the equation of the decision boundary\n", - "is:\n", + "## (Estimated) predicted probabilities\n", "\n", - " coef0 * x0 + coef1 * x1 + intercept = 0\n", + "The `predict` method in classification models returns what we call a \"hard\n", + "class prediction\", i.e. the most likely class a given data point would belong\n", + "to. 
We can confirm the intuition given by the `DecisionBoundaryDisplay` by\n", + "testing on a hypothetical `sample`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d30ac7e5", + "metadata": {}, + "outputs": [], + "source": [ + "test_penguin = pd.DataFrame(\n", + " {\"Culmen Length (mm)\": [45], \"Culmen Depth (mm)\": [17]}\n", + ")\n", + "logistic_regression.predict(test_penguin)" + ] + }, + { + "cell_type": "markdown", + "id": "6e7141da", + "metadata": {}, + "source": [ + "In this case, our logistic regression classifier predicts the Chinstrap\n", + "species. Note that this agrees with the decision boundary plot above: the\n", + "coordinates of this test data point match a location close to the decision\n", + "boundary, in the red region.\n", "\n", - "where `x0` is `\"Culmen Length (mm)\"` and `x1` is `\"Culmen Depth (mm)\"`.\n", + "As mentioned in the introductory slides 🎥 **Intuitions on linear models**,\n", + "one can alternatively use the `predict_proba` method to compute continuous\n", + "values (\"soft predictions\") that correspond to an estimation of the confidence\n", + "of the target belonging to each class.\n", "\n", - "This equation is equivalent to (assuming that `coef1` is non-zero):\n", + "For a binary classification scenario, the logistic regression makes both hard\n", + "and soft predictions based on the [logistic\n", + "function](https://en.wikipedia.org/wiki/Logistic_function) (also called\n", + "sigmoid function), which is S-shaped and maps any input into a value between 0\n", + "and 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f03d6062", + "metadata": {}, + "outputs": [], + "source": [ + "y_pred_proba = logistic_regression.predict_proba(test_penguin)\n", + "y_pred_proba" + ] + }, + { + "cell_type": "markdown", + "id": "bd3a7c7f", + "metadata": {}, + "source": [ + "More in general, the output of `predict_proba` is an array of shape\n", + "(`n_samples`, `n_classes`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e12bb08c", + "metadata": {}, + "outputs": [], + "source": [ + "y_pred_proba.shape" + ] + }, + { + "cell_type": "markdown", + "id": "67f73ae8", + "metadata": {}, + "source": [ + "Also notice that the sum of (estimated) predicted probabilities across classes\n", + "is 1.0 for each given sample. We can visualize them for our `test_penguin` as\n", + "follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "427587b6", + "metadata": {}, + "outputs": [], + "source": [ + "y_proba_sample = pd.Series(\n", + " y_pred_proba.ravel(), index=logistic_regression.classes_\n", + ")\n", + "y_proba_sample.plot.bar()\n", + "plt.ylabel(\"Estimated probability\")\n", + "_ = plt.title(\"Probability of the sample belonging to a penguin class\")" + ] + }, + { + "cell_type": "markdown", + "id": "053ad22c", + "metadata": {}, + "source": [ + "```{warning}\n", + "We insist that the output of `predict_proba` are just estimations. Their\n", + "reliability on being a good estimate of the true conditional class-assignment\n", + "probabilities depends on the quality of the model. 
Even classifiers with a\n", + "high accuracy on a test set may be overconfident for some individuals and\n", + "underconfident for others.\n", + "```\n", "\n", - " x1 = coef0 / coef1 * x0 - intercept / coef1\n", + "Similarly to the hard decision boundary shown above, one can set the\n", + "`response_method` to `\"predict_proba\"` in the `DecisionBoundaryDisplay` to\n", + "rather show the confidence on individual classifications. In such case the\n", + "boundaries encode the estimated probablities by color. In particular, when\n", + "using [matplotlib diverging\n", + "colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html#diverging)\n", + "such as `\"RdBu_r\"`, the softer the color, the more unsure about which class to\n", + "choose (the probability of 0.5 is mapped to white color).\n", "\n", - "which is the equation of a straight line.\n", + "Equivalently, towards the tails of the curve the sigmoid function approaches\n", + "its asymptotic values of 0 or 1, which are mapped to darker colors. Indeed,\n", + "the closer the predicted probability is to 0 or 1, the more confident the\n", + "classifier is in its predictions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbcece8a", + "metadata": {}, + "outputs": [], + "source": [ + "DecisionBoundaryDisplay.from_estimator(\n", + " logistic_regression,\n", + " data_test,\n", + " response_method=\"predict_proba\",\n", + " cmap=\"RdBu_r\",\n", + " alpha=0.5,\n", + ")\n", + "sns.scatterplot(\n", + " data=penguins_test,\n", + " x=culmen_columns[0],\n", + " y=culmen_columns[1],\n", + " hue=target_column,\n", + " palette=[\"tab:red\", \"tab:blue\"],\n", + ")\n", + "_ = plt.title(\"Predicted probability of the trained\\n LogisticRegression\")" + ] + }, + { + "cell_type": "markdown", + "id": "54133c3a", + "metadata": {}, + "source": [ + "For multi-class classification the logistic regression uses the [softmax\n", + "function](https://en.wikipedia.org/wiki/Softmax_function) to make predictions.\n", + "Giving more details on that scenario is beyond the scope of this MOOC.\n", "\n", - "
\n", - "

Note

\n", - "

If you want to go further, try changing the response_method to\n", - "\"predict_proba\" in the DecisionBoundaryDisplay above. Now the boundaries\n", - "encode by color the estimated probability of belonging to either class, as\n", - "mentioned in the introductory slides \ud83c\udfa5 Intuitions on linear models.

\n", - "
" + "In any case, interested users are refered to the [scikit-learn user guide](\n", + "https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\n", + "for a more mathematical description of the `predict_proba` method of the\n", + "`LogisticRegression` and the respective normalization functions." ] } ], @@ -245,4 +432,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/notebooks/logistic_regression_non_linear.ipynb b/notebooks/logistic_regression_non_linear.ipynb deleted file mode 100644 index ccc05be33..000000000 --- a/notebooks/logistic_regression_non_linear.ipynb +++ /dev/null @@ -1,327 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Beyond linear separation in classification\n", - "\n", - "As we saw in the regression section, the linear classification model expects\n", - "the data to be linearly separable. When this assumption does not hold, the\n", - "model is not expressive enough to properly fit the data. Therefore, we need to\n", - "apply the same tricks as in regression: feature augmentation (potentially\n", - "using expert-knowledge) or using a kernel-based method.\n", - "\n", - "We will provide examples where we will use a kernel support vector machine to\n", - "perform classification on some toy-datasets where it is impossible to find a\n", - "perfect linear separation.\n", - "\n", - "We will generate a first dataset where the data are represented as two\n", - "interlaced half circles. This dataset is generated using the function\n", - "[`sklearn.datasets.make_moons`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "from sklearn.datasets import make_moons\n", - "\n", - "feature_names = [\"Feature #0\", \"Features #1\"]\n", - "target_name = \"class\"\n", - "\n", - "X, y = make_moons(n_samples=100, noise=0.13, random_state=42)\n", - "\n", - "# We store both the data and target in a dataframe to ease plotting\n", - "moons = pd.DataFrame(\n", - " np.concatenate([X, y[:, np.newaxis]], axis=1),\n", - " columns=feature_names + [target_name],\n", - ")\n", - "data_moons, target_moons = moons[feature_names], moons[target_name]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since the dataset contains only two features, we can make a scatter plot to\n", - "have a look at it." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "\n", - "sns.scatterplot(\n", - " data=moons,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_moons,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Illustration of the moons dataset\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "From the intuitions that we got by studying linear model, it should be obvious\n", - "that a linear classifier will not be able to find a perfect decision function\n", - "to separate the two classes.\n", - "\n", - "Let's try to see what is the decision boundary of such a linear classifier. We\n", - "will create a predictive model by standardizing the dataset followed by a\n", - "linear support vector machine classifier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.svm import SVC\n", - "\n", - "linear_model = make_pipeline(StandardScaler(), SVC(kernel=\"linear\"))\n", - "linear_model.fit(data_moons, target_moons)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "

Warning

\n", - "

Be aware that we fit and will check the boundary decision of the classifier on\n", - "the same dataset without splitting the dataset into a training set and a\n", - "testing set. While this is a bad practice, we use it for the sake of\n", - "simplicity to depict the model behavior. Always use cross-validation when you\n", - "want to assess the generalization performance of a machine-learning model.

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's check the decision boundary of such a linear model on this dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.inspection import DecisionBoundaryDisplay\n", - "\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " linear_model, data_moons, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", - ")\n", - "sns.scatterplot(\n", - " data=moons,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_moons,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Decision boundary of a linear model\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As expected, a linear decision boundary is not enough flexible to split the\n", - "two classes.\n", - "\n", - "To push this example to the limit, we will create another dataset where\n", - "samples of a class will be surrounded by samples from the other class." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import make_gaussian_quantiles\n", - "\n", - "feature_names = [\"Feature #0\", \"Features #1\"]\n", - "target_name = \"class\"\n", - "\n", - "X, y = make_gaussian_quantiles(\n", - " n_samples=100, n_features=2, n_classes=2, random_state=42\n", - ")\n", - "gauss = pd.DataFrame(\n", - " np.concatenate([X, y[:, np.newaxis]], axis=1),\n", - " columns=feature_names + [target_name],\n", - ")\n", - "data_gauss, target_gauss = gauss[feature_names], gauss[target_name]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ax = sns.scatterplot(\n", - " data=gauss,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_gauss,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Illustration of the Gaussian quantiles dataset\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, this is even more obvious that a linear decision function is not\n", - "adapted. We can check what decision function, a linear support vector machine\n", - "will find." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "linear_model.fit(data_gauss, target_gauss)\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " linear_model, data_gauss, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", - ")\n", - "sns.scatterplot(\n", - " data=gauss,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_gauss,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Decision boundary of a linear model\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As expected, a linear separation cannot be used to separate the classes\n", - "properly: the model will under-fit as it will make errors even on the training\n", - "set.\n", - "\n", - "In the section about linear regression, we saw that we could use several\n", - "tricks to make a linear model more flexible by augmenting features or using a\n", - "kernel. Here, we will use the later solution by using a radial basis function\n", - "(RBF) kernel together with a support vector machine classifier.\n", - "\n", - "We will repeat the two previous experiments and check the obtained decision\n", - "function." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "kernel_model = make_pipeline(StandardScaler(), SVC(kernel=\"rbf\", gamma=5))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "kernel_model.fit(data_moons, target_moons)\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " kernel_model, data_moons, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", - ")\n", - "sns.scatterplot(\n", - " data=moons,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_moons,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Decision boundary with a model using an RBF kernel\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that the decision boundary is not anymore a straight line. Indeed, an\n", - "area is defined around the red samples and we could imagine that this\n", - "classifier should be able to generalize on unseen data.\n", - "\n", - "Let's check the decision function on the second dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "kernel_model.fit(data_gauss, target_gauss)\n", - "DecisionBoundaryDisplay.from_estimator(\n", - " kernel_model, data_gauss, response_method=\"predict\", cmap=\"RdBu\", alpha=0.5\n", - ")\n", - "ax = sns.scatterplot(\n", - " data=gauss,\n", - " x=feature_names[0],\n", - " y=feature_names[1],\n", - " hue=target_gauss,\n", - " palette=[\"tab:red\", \"tab:blue\"],\n", - ")\n", - "_ = plt.title(\"Decision boundary with a model using an RBF kernel\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We observe something similar than in the previous case. The decision function\n", - "is more flexible and does not underfit anymore.\n", - "\n", - "Thus, kernel trick or feature expansion are the tricks to make a linear\n", - "classifier more expressive, exactly as we saw in regression.\n", - "\n", - "Keep in mind that adding flexibility to a model can also risk increasing\n", - "overfitting by making the decision function to be sensitive to individual\n", - "(possibly noisy) data points of the training set. Here we can observe that the\n", - "decision functions remain smooth enough to preserve good generalization. If\n", - "you are curious, you can try to repeat the above experiment with `gamma=100`\n", - "and look at the decision functions." - ] - } - ], - "metadata": { - "jupytext": { - "main_language": "python" - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file diff --git a/notebooks/trees_ex_01.ipynb b/notebooks/trees_ex_01.ipynb index 99858920b..a2abdea01 100644 --- a/notebooks/trees_ex_01.ipynb +++ b/notebooks/trees_ex_01.ipynb @@ -6,16 +6,13 @@ "source": [ "# \ud83d\udcdd Exercise M5.01\n", "\n", - "In the previous notebook, we showed how a tree with a depth of 1 level was\n", - "working. The aim of this exercise is to repeat part of the previous experiment\n", - "for a depth with 2 levels to show how the process of partitioning is repeated\n", - "over time.\n", + "In the previous notebook, we showed how a tree with 1 level depth works. 
The\n", + "aim of this exercise is to repeat part of the previous experiment for a tree\n", + "with 2 levels depth to show how such parameter affects the feature space\n", + "partitioning.\n", "\n", - "Before to start, we will:\n", - "\n", - "* load the dataset;\n", - "* split the dataset into training and testing dataset;\n", - "* define the function to show the classification decision function." + "We first load the penguins dataset and split it into a training and a testing\n", + "sets:" ] }, { @@ -61,10 +58,35 @@ "metadata": {}, "source": [ "Create a decision tree classifier with a maximum depth of 2 levels and fit the\n", - "training data. Once this classifier trained, plot the data and the decision\n", - "boundary to see the benefit of increasing the depth. To plot the decision\n", - "boundary, you should import the class `DecisionBoundaryDisplay` from the\n", - "module `sklearn.inspection` as shown in the previous course notebook." + "training data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now plot the data and the decision boundary of the trained classifier to see\n", + "the effect of increasing the depth of the tree.\n", + "\n", + "Hint: Use the class `DecisionBoundaryDisplay` from the module\n", + "`sklearn.inspection` as shown in previous course notebooks.\n", + "\n", + "
\n", + "

Warning

\n", + "

At this time, it is not possible to use response_method=\"predict_proba\" for\n", + "multiclass problems. This is a planned feature for a future version of\n", + "scikit-learn. In the mean time, you can use response_method=\"predict\"\n", + "instead.

\n", + "
" ] }, { diff --git a/python_scripts/trees_ex_01.py b/python_scripts/trees_ex_01.py index ecfd6bf55..2d7b1d40b 100644 --- a/python_scripts/trees_ex_01.py +++ b/python_scripts/trees_ex_01.py @@ -14,16 +14,13 @@ # %% [markdown] # # 📝 Exercise M5.01 # -# In the previous notebook, we showed how a tree with a depth of 1 level was -# working. The aim of this exercise is to repeat part of the previous experiment -# for a depth with 2 levels to show how the process of partitioning is repeated -# over time. +# In the previous notebook, we showed how a tree with 1 level depth works. The +# aim of this exercise is to repeat part of the previous experiment for a tree +# with 2 levels depth to show how such parameter affects the feature space +# partitioning. # -# Before to start, we will: -# -# * load the dataset; -# * split the dataset into training and testing dataset; -# * define the function to show the classification decision function. +# We first load the penguins dataset and split it into a training and a testing +# sets: # %% import pandas as pd @@ -48,10 +45,24 @@ # %% [markdown] # Create a decision tree classifier with a maximum depth of 2 levels and fit the -# training data. Once this classifier trained, plot the data and the decision -# boundary to see the benefit of increasing the depth. To plot the decision -# boundary, you should import the class `DecisionBoundaryDisplay` from the -# module `sklearn.inspection` as shown in the previous course notebook. +# training data. + +# %% +# Write your code here. + +# %% [markdown] +# Now plot the data and the decision boundary of the trained classifier to see +# the effect of increasing the depth of the tree. +# +# Hint: Use the class `DecisionBoundaryDisplay` from the module +# `sklearn.inspection` as shown in previous course notebooks. +# +# ```{warning} +# At this time, it is not possible to use `response_method="predict_proba"` for +# multiclass problems. This is a planned feature for a future version of +# scikit-learn. In the mean time, you can use `response_method="predict"` +# instead. +# ``` # %% # Write your code here.