{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Sheet 8 - Ridge and Lasso" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last exercise we looked at subset selection techniques for linear regression models.\n", "These methods used standard linear regression on all (or a subset of) possible models incorporating different numbers of predictors.\n", "\n", "In this exercise we consider two common shrinkage techniques for feature selection and model regularization.\n", "These techniques have long been well established in mathematical optimization and have attracted interest in data science due to their ability to shrink the coefficients of a linear model.\n", "Shrinking the coefficients is advantageous because it lets us trade off variance against bias in our model.\n", "\n", "We start this lab by exploring the methods provided in `scikit-learn`.\n", "In the first two problems, we consider the diabetes data set.\n", "The goal of these problems is to understand the two main estimator classes for shrinkage, i.e., `sklearn.linear_model.Ridge` and `sklearn.linear_model.Lasso`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 8.1 - Ridge regression (aka Tikhonov regularization)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Execute the following code cell to import the diabetes data set. The command `print(dia.DESCR)` displays a description of the data set."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "import pandas as pd\n", "dia = load_diabetes()\n", "df = pd.DataFrame(dia.data, columns=dia.feature_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Split your data randomly into a training and a test set.\n", "Use the function\n", "\n", "    from sklearn.model_selection import train_test_split\n", "\n", "with `random_state=1`.\n", "\n", "Your test set should contain approx. 30\\% of the data (*Hint*: Use the appropriate optional parameter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell applies ridge regression for `m` different regularization parameters $\\alpha$.\n", "As you know from the lecture, ridge regression adds a penalty term to the RSS term in standard linear regression, i.e., instead of considering the optimization problem\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 $$\n", "\n", "in **ridge regression** we solve the regularized problem\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 + \\alpha \\| \\beta \\|_2^2 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 + \\alpha \\sum_{j=1}^p \\beta_j^2$$\n", "\n", "Here the norm in the penalty term is taken over $\\beta_1, \\dots, \\beta_p$ only; the intercept $\\beta_0$ is not penalized.\n", "\n", "**Task**: The following code fragment performs ridge regression for different values of $\\alpha$ and stores the coefficients in an array called `Coeffs`.\n", "Afterwards, the coefficients are plotted against the regularization parameter.\n", "If you named your training and test data `X_train, X_test` and `y_train, y_test`, the following code cell should be executable."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import Ridge\n", "\n", "# Get dimensions of X_train\n", "n,p = X_train.shape\n", "m = 50\n", "Alpha = np.logspace(-4,4,m)\n", "Coeffs = np.zeros((m,p+1))\n", "\n", "for (i,a) in enumerate(Alpha):\n", "    lmr = Ridge(alpha=a)\n", "    lmr.fit(X_train, y_train)\n", "    Coeffs[i,0] = lmr.intercept_\n", "    Coeffs[i,1:] = lmr.coef_\n", "\n", "# Plot the output\n", "import matplotlib.pyplot as plt\n", "plt.semilogx(Alpha, Coeffs[:,:])\n", "plt.xlabel('Alpha')\n", "plt.ylabel('Coefficients');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 8.2 - Lasso regression (aka $\\ell^1$-regularization)\n", "\n", "The **Lasso** is another modification of classical linear regression, and uses the $\\ell^1$ norm in the penalization term instead of the $\\ell^2$ norm in ridge regression. The optimization problem reads\n", "\n", "$$ \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\|y - X \\beta\\|_2^2 + \\alpha \\| \\beta \\|_1 = \\min_{\\beta \\in \\mathbb{R}^{p+1}} \\sum_{i=1}^n \\left( y_i - \\sum_{j=0}^p x_{i,j} \\beta_j \\right)^2 + \\alpha \\sum_{j=1}^p |\\beta_j|$$\n", "\n", "Both the lasso and ridge regression lead to convex optimization problems.\n", "For $\\alpha > 0$ the ridge problem is even strictly convex and therefore has a unique solution, also in the case $p > n$, where classical linear regression does not possess a unique solution.\n", "The lasso problem is convex but in general not strictly convex, so its minimizer need not be unique when $p > n$.\n", "While the coefficients in ridge regression generally decrease in absolute value as the penalty parameter $\\alpha$ increases, they typically do not become exactly zero.\n", "In contrast, the coefficients in the lasso can become exactly zero once their influence becomes negligible compared to the penalty.\n", "\n", "**Task**: Copy the code used for illustrating the influence of the penalty parameter in ridge regression and modify or expand the code to plot the coefficients obtained by the **Lasso** instead."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 8.3 - Is scaling always important?\n", "In this problem, we consider a new data set.\n", "The data set consists of 1499 samples of a particular red wine from Minho, Portugal, called *Vinho verde*.\n", "The first 11 columns in the csv file contain different measurements, the last column contains an expert rating of the quality.\n", "This set became popular in a Kaggle competition, but is also publicly available [here](http://www3.dsi.uminho.pt/pcortez/wine/).\n", "The data set is also available on our [webpage](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/).\n", "\n", "**Task**: Download the new csv-files from the lecture's webpage.\n", "The following code cell imports the csv-file `wine-train.csv` (adjust the path if necessary).\n", "Take a brief look at the data set." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./datasets/wine-train.csv', sep=\";\")\n", "X = df.loc[:, df.columns != 'quality'].values\n", "y = df['quality'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to look at the coefficient selection for both the scaled and unscaled case.\n", "In this example, *scaled* means that we shift the mean of each feature to *zero* and scale its standard deviation to *one*.\n", "This can be easily done with the `scale` function from `sklearn.preprocessing`.\n", "\n", "**Task**: Normalize your predictor matrix `X` using the function `scale` and store the scaled matrix as `Xscaled`.\n", "You can check the result using the methods `mean(axis=0)` and `std(axis=0)` of a numpy array."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to compare the coefficients obtained for the scaled and unscaled predictors.\n", "\n", "**Task**: When done correctly, you should be able to execute the following code.\n", "It computes the *Lasso* estimates for `m` different values of the regularization parameter $\\alpha$ and stores the coefficients as well as the mean 10-fold cross-validation score for each $\\alpha$ (for a regressor, `cross_val_score` uses the $R^2$ score by default).\n", "Finally, it plots the coefficients in the upper part of the figure, and the corresponding cv-scores in the lower part." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import Lasso\n", "from sklearn.preprocessing import scale\n", "from sklearn.model_selection import cross_val_score\n", "\n", "# Get dimensions of X\n", "n,p = X.shape\n", "m = 50\n", "Alpha = np.logspace(-4,1,m)\n", "\n", "cscaled = np.zeros((m,p+1))\n", "corig = np.zeros((m,p+1))\n", "cvscaled = np.zeros((m,))\n", "cvorig = np.zeros((m,))\n", "\n", "for (i,a) in enumerate(Alpha):\n", "    lm = Lasso(alpha=a,tol=1e-8)\n", "\n", "    lm.fit(Xscaled, y)\n", "    cscaled[i,0] = lm.intercept_\n", "    cscaled[i,1:] = lm.coef_\n", "\n", "    cvscaled[i] = cross_val_score(lm, Xscaled, y, cv=10).mean()\n", "\n", "\n", "    lm = Lasso(alpha=a,tol=1e-8)\n", "    lm.fit(X, y)\n", "    corig[i,0] = lm.intercept_\n", "    corig[i,1:] = lm.coef_\n", "\n", "    cvorig[i] = cross_val_score(lm, X, y, cv=10).mean()\n", "\n", "# Plot the output\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize']=(15,10)\n", "fig, ax = plt.subplots(2,2)\n", "ax[0][0].semilogx(Alpha, cscaled[:,1:])\n", "ax[0][0].set_title('Scaled predictors')\n", "ax[0][0].set_xlabel('Alpha')\n", "ax[0][0].set_ylabel('Coefficients');\n", "\n", "ax[0][1].semilogx(Alpha, corig[:,1:])\n", "ax[0][1].set_title('Unscaled predictors')\n", "ax[0][1].set_xlabel('Alpha')\n", "ax[0][1].set_ylabel('Coefficients');\n", "\n", "ax[1][0].semilogx(Alpha, cvscaled)\n", "ax[1][0].set_xlabel('Alpha')\n", "ax[1][0].set_ylabel('cv-score');\n", "ax[1][1].semilogx(Alpha, cvorig);\n", "ax[1][1].set_xlabel('Alpha')\n", "ax[1][1].set_ylabel('cv-score');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: What do you observe?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observation**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**:\n", "Compute the values of $\\alpha$ for which the cv-scores are maximized."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Now compare the mean squared errors of both regressions, each using the value of $\\alpha$ that maximizes its cv-score.\n", "What do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observation**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe that the mean squared errors are very close. Indeed, the MSE for the unscaled problem is even slightly lower than the MSE of the scaled problem.\n", "\n", "**Task**: Now import the csv-file `wine-test.csv` and store the predictors as a numpy array `Xtest` and the target variables as `ytest`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Compute the mean squared errors on the test set using the Lasso models from above for the scaled and unscaled version. Don't forget to scale the predictors for your scaled model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Interpret the results. What could be the reason why the unscaled model performs better than the scaled model? Are your predictions good or bad?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }