{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", " ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", " ps2_blja_problem1\n", "\n", "Submit your homework within one week until next Monday, 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 12: Ridge and Lasso regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last exercise we looked at subset selection techniques for linear regression models.\n", "These methods used standard linear regression on all (or a subset of) possible models incorporating different numbers of predictors.\n", "\n", "In this exercise we consider two common shrinkage techniques for feature selection and model regularization.\n", "These techniques have long been well-established in mathematical optimization, and have received interest for data science due to their ability to shrink the coefficients of a linear model.\n", "This becomes advantageous as it enables one to trade off between variance and bias in our model.\n", "\n", "We start this lab by exploring the methods provided in `scikit-learn`.\n", "In the first two problems, we consider the diabetes data set.\n", "The goal of these problems is to understand the two main functions for shrinkage, i.e., `sklearn.linear_model.Ridge` and `sklearn.linear_model.Lasso`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A - Ridge regression (aka Tikhonov regularization)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: The following code cell loads the diabetes data set in a variable `dia` (the data type is `sklearn.utils.Bunch` which behaves similar to a `dict`).\n", "Set up a `pandas.DataFrame` named `df` which uses the correct column titles and contains the 10 predictor variables.\n", "\n", "**Hint**:The command `print(dia.DESCR)`, displays a description of the data set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "1527cdd33084c2ba2743a183914fe079", "grade": false, "grade_id": "cell-20869e1ace707cf1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "import pandas as pd\n", "\n", "dia = load_diabetes()\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "41bd6131a9fc6d5d4857a48ef00ded9e", "grade": true, "grade_id": "cell-ec40b69214c08b13", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert df.shape == (442,10)\n", "df.columns[4] == 's1'\n", "assert abs(df.iloc[20,6] - 0.000778807997017968) < 1e-7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Append a column with the target variable to the data frame `df`. Name the column `target`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a44ba81362ed989af615b8a090ee830c", "grade": false, "grade_id": "cell-9576adb53b568fc4", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "fde9eb0ab04b1cefbb53d168be006292", "grade": true, "grade_id": "cell-274fee8febd114a1", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert df.shape == (442,11)\n", "assert abs(df.target.mean() - 152.13348416289594) < 1e-7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Split your data randomly into a test set `X_test, y_test` and training set `(X_train, y_train)`.\n", "Use the function\n", "\n", " from sklearn.model_selection import train_test_split\n", " \n", "with `random_state=1`.\n", "\n", "Your test set should contain approx. 30\\% of the data.\n", "\n", "\n", "**Hint**: Use the appropriate optional parameter." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "cf86b959192bf7824f9ae10a693f8969", "grade": false, "grade_id": "cell-a99c28d7d3b4f625", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ea3e42a67906c6f05f4358b2b5a7a475", "grade": true, "grade_id": "cell-ba3f1d542b006c3b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X_train.shape == (309,10)\n", "assert y_train.shape == (309,)\n", "assert abs(X_test.mean() - -0.0029129190427152033) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell applies ridge regression for `m` different regularization parameters $\\alpha$.\n", "As you know from the lecture, ridge regression adds a penalty term to the RSS term in standard linear regression, i.e., instead of considering the optimization problem\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 $$\n", "\n", "we solve in **ridge regression** the regularized problem\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 + \\alpha \\| \\beta \\|_2^2 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 + \\alpha \\sum_{j=1}^p \\beta_j^2$$\n", "\n", "The following code fragment performs ridge regression for different values of $\\alpha$ and stores the coefficients in an array called `Coeffs`.\n", "Afterwards, the coefficients are plotted for different regression parameters.\n", "If you named your training and test data `X_train, X_test` and `y_train, y_test`, the following code cell should be executable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import Ridge\n", "%matplotlib inline\n", "\n", "# Get dimensions of X_train\n", "n,p = X_train.shape\n", "m = 50\n", "Alpha = np.logspace(-4,4,m)\n", "Coeffs = np.zeros((m,p+1))\n", "\n", "for (i,a) in enumerate(Alpha):\n", " lmr = Ridge(alpha=a)\n", " lmr.fit(X_train, y_train)\n", " Coeffs[i,0] = lmr.intercept_\n", " Coeffs[i,1:] = lmr.coef_\n", " \n", "# Plot the output\n", "import matplotlib.pyplot as plt\n", "plt.semilogx(Alpha, Coeffs[:,:])\n", "plt.xlabel('Alpha')\n", "plt.ylabel('Coefficients');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part B - Lasso regression (aka $\\ell^1$-regularization)\n", "\n", "The **Lasso** is another modification of classical linear regression, and uses the $\\ell^1$ norm in the penalization term instead of the $\\ell^2$ norm in ridge regression. 
"The optimization problem reads\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 + \\alpha \\| \\beta \\|_1 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 + \\alpha \\sum_{j=1}^p |\\beta_j|. $$\n", "\n", "Ridge regression leads to a strictly convex optimization problem and therefore always has a unique solution, even in the case $p > n$, in which classical linear regression does not possess a unique solution.\n", "The lasso problem is convex as well, although not strictly convex, so uniqueness of its solution is not guaranteed in general.\n", "While the coefficients in ridge regression generally decrease in absolute value as the penalty parameter $\\alpha$ increases, they never become exactly zero.\n", "In contrast, the coefficients in the lasso can become exactly zero once their influence becomes negligible, which makes the lasso suitable for feature selection.\n", "\n", "**Task (1 point)**: Copy the code used for illustrating the influence of the penalty parameter in ridge regression and modify or expand it to plot the coefficients obtained by the **Lasso** instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "083ce92b8b06fd7340b201dff694da75", "grade": true, "grade_id": "cell-e9f3f6dceff48f23", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }
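, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, independent of the graded task above, the following self-contained cell is a minimal sketch illustrating on made-up synthetic data that the lasso sets negligible coefficients exactly to zero; the names `X_syn` and `y_syn` are hypothetical and not part of the problem sheet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only, on hypothetical synthetic data.\n", "import numpy as np\n", "from sklearn.linear_model import Lasso\n", "\n", "rng = np.random.RandomState(0)\n", "X_syn = rng.randn(50, 8)   # 8 features, but only the first 2 carry signal\n", "y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + 0.1 * rng.randn(50)\n", "\n", "# With a moderate penalty, the lasso shrinks the two informative\n", "# coefficients and sets the negligible ones exactly to zero.\n", "lml = Lasso(alpha=0.5)\n", "lml.fit(X_syn, y_syn)\n", "print(lml.coef_)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }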