Commit
chore: removed kaggle
Kevin M. Jablonka committed Oct 28, 2020
1 parent 5285591 commit 29563dc
Showing 1 changed file with 28 additions and 74 deletions.
102 changes: 28 additions & 74 deletions molsim_ml.ipynb
@@ -2506,31 +2506,31 @@
"metadata": {},
"source": [
"A key component we have not optimized so far is the set of hyperparameters. These are parameters of the model that we usually cannot learn from the data but have to fix before we train the model. \n",
"Since we cannot learn those parameters it is not trivial to select them. \n",
"Since we cannot learn those parameters it is not trivial to select them. Hence, what we typically do in practice is to create another set, a \"validation set\", and use it to test models trained with different hyperparameters.\n",
"\n",
"The most approach to hyperparameter optimization is to define a grid of all relevant parameters and to search over the grid for the best model performance."
"The most common approach to hyperparameter optimization is to define a grid of all relevant parameters and to search over the grid for the best model performance."
]
},
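{
"cell_type": "markdown",
"metadata": {},
"source": [
"What such a three-way split can look like is sketched below (synthetic data stand in for the features and CO$_2$ uptake targets; the 60/20/20 ratio is only illustrative):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-ins for the feature matrix and targets.\n",
"X = np.random.rand(100, 5)\n",
"y = np.random.rand(100)\n",
"\n",
"# First split off a held-out test set ...\n",
"X_trainval, X_test, y_trainval, y_test = train_test_split(\n",
"    X, y, test_size=0.2, random_state=0)\n",
"# ... then carve a validation set out of the remainder (0.25 * 80 = 20 points).\n",
"X_train, X_val, y_train, y_val = train_test_split(\n",
"    X_trainval, y_trainval, test_size=0.25, random_state=0)\n",
"# Result: 60/20/20. Tune hyperparameters on the validation set,\n",
"# report final metrics on the untouched test set.\n",
"```\n",
"\n",
"With k-fold cross-validation, the explicit validation set is replaced by rotating validation folds inside the training set."
]
},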
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\color{DarkBlue}{\\textsf{Short Exercise}}$\n",
"- Think about which parameters you could optimize in the pipeline. Note that your KRR model has parameters you can optimize. You can also switch off some steps by setting them to `None`.\n",
"- Think about which parameters you could optimize in the pipeline. Note that your KRR model has two parameters you can optimize. You can also switch off some steps by setting them to `None`.\n",
"- For each parameter you need to define a reasonable grid to search over.\n",
"- Run the hyperparameter optimization using 5-fold cross-validation (you can adjust the number of folds according to your computational resources/impatience. It turns out that k=10 is the [best tradeoff between variance and bias](https://arxiv.org/abs/1811.12808)). \n",
"- Recall what k-fold cross-validation does. Run the hyperparameter optimization using 5-fold cross-validation (you can adjust the number of folds according to your computational resources/impatience. It turns out that k=10 is the [best tradeoff between variance and bias](https://arxiv.org/abs/1811.12808)). \n",
"Tune the hyperparameters until you are satisfied (e.g., until you can no longer improve the cross-validated error).\n",
"- Why don't we use the test set for hyperparameter tuning? \n",
"- Why don't we use the test set for hyperparameter tuning but instead test on the validation set? \n",
"- Evaluate the model performance by calculating the performance metrics (MAE, MSE, max error) on the training and the test set.\n",
"- Instead of grid search, try to use random search on the same grid (`RandomizedSearchCV`) and fix the number of evaluations (`n_iter`) to a fraction of the number of evaluations of grid search. What do you observe and conclude?"
"- *Optional:* Instead of grid search, try to use random search on the same grid (`RandomizedSearchCV`) and fix the number of evaluations (`n_iter`) to a fraction of the number of evaluations of grid search. What do you observe and conclude?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" $\\color{DarkRed}{\\textsf{Tips}}$\n",
"- If you want to see what is happening, set the verbosity argument of the `GridSearchCV` object to a higher number.\n",
"- If you want to see what is happening, set the `verbosity` argument of the `GridSearchCV` object to a higher number.\n",
" \n",
"- If you want to speed up the optimization, you can run it in parallel by setting the `n_jobs` argument to the number of workers. If you set it to -1 it will use all available cores.\n",
" \n",
@@ -2544,13 +2544,14 @@
" }\n",
"```\n",
"\n",
"- After the search, you can access the best model with `.best_estimator_` and the best parameters with `.best_params_` on the GridSearchCV instance\n",
"- After the search, you can access the best model with `.best_estimator_` and the best parameters with `.best_params_` on the GridSearchCV instance, for example `grid_krr.best_estimator_`.\n",
"\n",
"- If you initialize the GridSearchCV instance with `refit=True`, it will automatically retrain the best model on all training data (and not only on the training folds from cross-validation).\n",
"\n",
"The double underscore (dunder) notation works recursively and specifies the parameters for any pipeline stage. For example, `ovasvm__estimator__cls__C` would specify the `C` parameter of the estimator in the one-versus-rest classifier. \n",
"The double underscore (dunder) notation works recursively and specifies the parameters for any pipeline stage. \n",
"For example, `ovasvm__estimator__cls__C` would specify the `C` parameter of the estimator in the one-versus-rest classifier `ovasvm`. \n",
"\n",
"You can print all parameters of the pipeline using `pp.pprint(sorted(pipeline.get_params().keys()))`"
"You can print all parameters of the pipeline using `print(sorted(pipeline.get_params().keys()))`"
]
},
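{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the dunder addressing (the step names `scaling` and `krr` match the pipeline used in this notebook; any step names work the same way):\n",
"\n",
"```python\n",
"from sklearn.kernel_ridge import KernelRidge\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"pipe = Pipeline([('scaling', StandardScaler()),\n",
"                 ('krr', KernelRidge(kernel='rbf'))])\n",
"\n",
"# '<step name>__<parameter>' addresses a parameter of a single pipeline step.\n",
"pipe.set_params(krr__alpha=0.1)\n",
"\n",
"# All addressable parameter names, including the dunder ones:\n",
"print(sorted(pipe.get_params().keys()))\n",
"```"
]
},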
{
@@ -2575,16 +2576,17 @@
"source": [
"# Define the parameter grid and the grid search object\n",
"param_grid = {\n",
" 'scaling': [MinMaxScaler(), StandardScaler()],\n",
" 'scaling': [MinMaxScaler(), StandardScaler()], # test different scaling methods\n",
" 'krr__alpha': #fillme,\n",
" 'krr__#fillme': #fillme\n",
" }\n",
"\n",
"grid_krr = GridSearchCV(#your pipeline, param_grid=param_grid, \n",
" cv=#number of folds, verbose=2, n_jobs=-1)\n",
"\n",
"random_krr = RandomizedSearchCV(#your pipeline, param_distributions=param_grid, n_iter=#number of evaluations,\n",
" cv=#number of folds, verbose=2, n_jobs=-1)"
"# optional random search\n",
"#random_krr = RandomizedSearchCV(#your pipeline, param_distributions=param_grid, n_iter=#number of evaluations,\n",
"# cv=#number of folds, verbose=2, n_jobs=-1)"
]
},
{
@@ -2595,7 +2597,8 @@
"source": [
"# run the grid search by calling the fit method \n",
"grid_krr.fit(#fillme)\n",
"random_krr.fit(#fillme)"
"# optional random search\n",
"# random_krr.fit(#fillme)"
]
},
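{
"cell_type": "markdown",
"metadata": {},
"source": [
"After `fit` returns, the search object can be queried for the winning configuration. A self-contained sketch with synthetic data and a deliberately tiny grid (the real grid should be denser):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.kernel_ridge import KernelRidge\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.random((50, 3))\n",
"y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)\n",
"\n",
"pipe = Pipeline([('scaling', StandardScaler()),\n",
"                 ('krr', KernelRidge(kernel='rbf'))])\n",
"param_grid = {'krr__alpha': [1e-3, 1e-1],\n",
"              'krr__gamma': [0.1, 1.0]}\n",
"\n",
"grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)\n",
"grid.fit(X, y)\n",
"\n",
"print(grid.best_params_)           # the winning grid point\n",
"best_model = grid.best_estimator_  # refit on all of X, y (refit=True is the default)\n",
"```"
]
},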
{
@@ -2682,73 +2685,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Submit your results to Kaggle (Project, optional)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Join the [Kaggle competition](https://www.kaggle.com/c/che609/) for this course! There we deposited some features that you have not seen before. Use your model to predict the CO$_2$ uptake for the structures there. Tune your model to get the best predictions as you move through this notebook! Also feel free to explore other models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create `submission.csv` with your predictions to join the competition and upload it to the competition site."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kaggle_data = pd.read_csv('data/features.csv')\n",
"kaggle_predictions = #fillme.predict(kaggle_data[FEATURES])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"submission = pd.DataFrame({\n",
" \"id\": kaggle_data[\"id\"],\n",
" \"prediction\": kaggle_predictions\n",
"})\n",
"\n",
"submission.to_csv(\"submission.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have created this file, you can head over to the [submission page](https://www.kaggle.com/c/molsim2020/submit) and upload your file."
"## 8. Feature Engineering (Project) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Kaggle submission](_static/kaggle_upload.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Feature Engineering (Project) "
"Finally, we would like to remove features with low variance. This can be done by setting a variance threshold."
]
},
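{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what a variance threshold does (toy matrix; the first column is constant and is dropped):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_selection import VarianceThreshold\n",
"\n",
"X = np.array([[0.0, 1.0, 2.0],\n",
"              [0.0, 1.1, 0.5],\n",
"              [0.0, 0.9, 1.5]])  # first feature has zero variance\n",
"\n",
"selector = VarianceThreshold(threshold=0.0)  # drop features with variance <= 0\n",
"X_reduced = selector.fit_transform(X)\n",
"print(X_reduced.shape)  # (3, 2)\n",
"```\n",
"\n",
"Note that the variance of a feature depends on its scale, so where this step sits relative to the scaler in the pipeline matters."
]
},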
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we would like to remove features with low variance. This can be done by setting a variance threshold."
"$\\color{DarkBlue}{\\textsf{Short Question}}$\n",
" \n",
"- What is the reasoning behind doing this? \n",
"- When might it go wrong and why?"
]
},
{
@@ -2810,14 +2764,14 @@
"source": [
"$\\color{DarkBlue}{\\textsf{Short Exercise (optional)}}$\n",
"- replace the variance threshold with a model-based feature selection \n",
"`('feature_selection', SelectFromModel(LinearSVC(penalty=\"l1\")))` or [any feature selection meethod that you would like to try](https://scikit-learn.org/stable/modules/feature_selection.html)"
"`('feature_selection', SelectFromModel(LinearSVC(penalty=\"l1\")))` or [any feature selection method that you would like to try](https://scikit-learn.org/stable/modules/feature_selection.html)"
]
},
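{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `LinearSVC` is a classifier; since this notebook predicts a continuous uptake, an L1-penalized regressor such as `Lasso` may be the closer analogue. A sketch on synthetic data (the `alpha` value and data are illustrative):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_selection import SelectFromModel\n",
"from sklearn.linear_model import Lasso\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.random((100, 5))\n",
"# Only the first two features carry signal.\n",
"y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.standard_normal(100)\n",
"\n",
"# The L1 penalty drives coefficients of uninformative features to zero;\n",
"# SelectFromModel then keeps only features with importance above the threshold.\n",
"selector = SelectFromModel(Lasso(alpha=0.01))\n",
"X_sel = selector.fit_transform(X, y)\n",
"print(X_sel.shape)\n",
"```"
]
},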
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Saving the model (Project)"
"## 9. Saving the model (Project)"
]
},
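{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common way to persist a fitted scikit-learn estimator is `joblib` (sketched here with a toy model; any fitted pipeline can be dumped the same way):\n",
"\n",
"```python\n",
"import os\n",
"import tempfile\n",
"\n",
"import joblib\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"X = np.random.rand(20, 2)\n",
"y = X.sum(axis=1)\n",
"model = LinearRegression().fit(X, y)\n",
"\n",
"path = os.path.join(tempfile.gettempdir(), 'model.joblib')\n",
"joblib.dump(model, path)      # serialize the fitted estimator to disk\n",
"restored = joblib.load(path)  # load it back in a later session\n",
"\n",
"print(np.allclose(model.predict(X), restored.predict(X)))  # True\n",
"```"
]
},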
{
@@ -2862,15 +2816,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Influence of Regularization (Project)"
"## 10. Influence of Regularization (Project)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" $\\color{DarkBlue}{\\textsf{Short Exercise}}$\n",
"- what happens if you set $\\alpha=0$ or to large value? Why is this the case?\n",
"- what happens if you set $\alpha=0$ or to a large value? Why is this the case? Explain what the parameter means using the equation derived in the lectures.\n",
"\n",
" To test this, fix this value in one of your pipelines, retrain the models (re-optimizing the other hyperparameters) and rerun the performance evaluation."
]
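{
"cell_type": "markdown",
"metadata": {},
"source": [
"The qualitative effect can be sketched on synthetic data (the `gamma` value and the grid of `alpha` values are illustrative):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.kernel_ridge import KernelRidge\n",
"from sklearn.metrics import mean_absolute_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.uniform(-3, 3, (80, 1))\n",
"y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(80)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)\n",
"\n",
"mae_train, mae_test = {}, {}\n",
"for alpha in [1e-6, 1e-2, 1e2]:\n",
"    krr = KernelRidge(kernel='rbf', gamma=1.0, alpha=alpha).fit(X_tr, y_tr)\n",
"    mae_train[alpha] = mean_absolute_error(y_tr, krr.predict(X_tr))\n",
"    mae_test[alpha] = mean_absolute_error(y_te, krr.predict(X_te))\n",
"    print(alpha, mae_train[alpha], mae_test[alpha])\n",
"# alpha -> 0: the model can also fit the noise (train error ~ 0, overfitting);\n",
"# very large alpha: the solution is shrunk toward zero (underfitting).\n",
"```"
]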
@@ -2891,7 +2845,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Interpreting the model (Project, optional)"
"## 11. Interpreting the model (Project, optional)"
]
},
{
