{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", "\n", "for example\n", "\n", "    ps2_blja_problem1\n", "\n", "Submit your homework within one week, by next Monday at 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "c1d10555f944a7d1cf302f22ca3825f7", "grade": false, "grade_id": "cell-2c0be3d551ee40ab", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Introduction to Data Science\n", "## Lab 4: Further aspects of linear regression" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6c4344f7c14157f80995777e849d035d", "grade": false, "grade_id": "cell-6e6033c0dc952d4a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part A - Limitations of the t-test" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "3fb22f0f77f31ca26dc8fce8dae156e2", "grade": false, "grade_id": "cell-6e50608863f3b9af", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In this notebook, we 
investigate the limitations of a single-variable **t-test** for the predictor coefficients $\\beta$ in a linear regression setting.\n", "Recall the following statements from the lecture (Slide 105):\n", "* Does a single small $p$-value indicate that at least one variable is relevant? No.\n", "* Example: $p=100$ and $H_0 : \\beta_1 = \\dots = \\beta_p = 0$ is true. Then, by chance, about $5\\%$ of the $p$-values fall below $0.05$, so it is almost guaranteed that $p<0.05$ for at least one variable.\n", "* Thus, for large $p$, looking only at $p$-values of individual $t$-statistics tends to discover spurious relationships.\n", "\n", "In what follows, we use slightly different values than in the above-mentioned example, setting $n = 100$ and $p = 20$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "20cfc0dfdc9f2f4d47ee527a4d26ba5a", "grade": false, "grade_id": "cell-9bf98be6407d8b03", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Set parameters n (number of training samples) and p (number of predictor variables)\n", "n = 100\n", "p = 20" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "aab988416708ad77fdb83c3e4da77d67", "grade": false, "grade_id": "cell-6d855e4e79bf6669", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "For this purpose, we generate random uncorrelated input and output vectors.\n", "\n", "**Task**: Write the function `drawSample` that generates **uniformly distributed** arrays of random variables:\n", "* $X$ should be of size (n, p+1) with values in $[0,1]$; the first column is reserved for the intercept and should contain only ones\n", "* $y$ should be of size (n,) with values in $[-0.5,0.5]$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"deletable": false, "nbgrader": { "cell_type": "code", "checksum": "7aa585db04266dd7eea6fc43e5f75839", "grade": false, "grade_id": "cell-8b1fa66b64d265fb", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "def drawSample(n,p):\n", " \"\"\" This function draws a\n", " sample for our experiment. \"\"\"\n", " \n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " \n", " return (X,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "054f97fab92420b6f28b53f990dea3fb", "grade": true, "grade_id": "cell-c21e8f578a5dd23f", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert drawSample(40,4)[0].shape == (40,5), 'Wrong shape of X'\n", "assert drawSample(40,4)[1].shape == (40,), 'Wrong shape of y'\n", "assert all(drawSample(40,4)[0][:,0]==1), 'Check the first column of X'\n", "assert drawSample(40,4)[1].min() > -0.5 and drawSample(40,4)[1].max() < 0.5, 'Wrong range of y'\n", "assert drawSample(40,4)[0].min() > 0 and drawSample(40,4)[0].max() <= 1, 'Wrong range of X'" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "220a4f82353a9b6305bf253437852e67", "grade": false, "grade_id": "cell-6c4fe93ee7103cf8", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The following function computes single-variable t-statistics for the model\n", "$$ y \\approx X \\beta $$\n", "whose parameters $\\beta \\in \\mathbb{R}^{p+1}$ are estimated via\n", "$$ \\hat \\beta = (X^\\top X)^{-1} X^\\top y. 
$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "58b2a92b0a16f788fca540080f1f61f7", "grade": false, "grade_id": "cell-888372a43aa00783", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from scipy.stats import t\n", "\n", "def printTStatistic(X, y, p_threshold = 0.10, print_table=True):\n", " n, m = X.shape\n", " p = m - 1\n", "\n", " # Invert X^T * X\n", " V = np.linalg.inv((X.T).dot(X))\n", " \n", "\n", " # Compute regression coefficients beta\n", " beta = V.dot( X.T.dot(y) )\n", "\n", " # Extract diagonal of matrix (X^T * X)^-1\n", " v = V.diagonal()\n", "\n", " # Predict y using beta\n", " y_pred = X.dot(beta)\n", "\n", " # Compute estimate of sigma\n", " sigma_hat = np.sqrt( 1./(n-p-1) * np.power(y - y_pred,2).sum() )\n", "\n", " # Compute the standard errors\n", " SE = np.sqrt(v) * sigma_hat\n", "\n", " # Compute the values of the t-statistic\n", " t_vals = beta / SE\n", "\n", " # Compute the corresponding p values\n", " p_vals = 2*t.cdf(-np.absolute(t_vals), n-p-1)\n", "\n", " if print_table:\n", " \n", " # Print header\n", " print('| Coefficient | Estimate | SE | t-statistic | p-value | p < %4.2f |' % p_threshold)\n", " print('----------------------------------------------------------------------------')\n", " \n", " # Print \n", " for i in range(p+1):\n", " pval = p_vals[i]\n", " if pval < 0.0001:\n", " pval_str = '< 0.0001'\n", " else:\n", " pval_str = ' %5.4f' % pval\n", " print('| beta_%02d | %6.3f | %6.4f | %5.2f | %s | %d |' % (i, beta[i], SE[i], t_vals[i], pval_str, pval < p_threshold))\n", " \n", " # YOUR CODE HERE\n", " raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "c00a6dfcd14ba5b82ecd3b1e734dfa41", "grade": false, "grade_id": "cell-982d858d0c6e6425", "locked": true, "schema_version": 3, 
"solution": false, "task": false } }, "source": [ "**Task**: Test the function `printTStatistic` using an example drawn with your function `drawSample`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "7e24ae0ba1409b728f705385c3b77c28", "grade": true, "grade_id": "cell-92e5e11d02c2543d", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6358fe85b9acff1f9ab728f98f59f362", "grade": false, "grade_id": "cell-54186beed1ed6c35", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now, we want to find out, how many predictor variables are statistically significant for a threshold of $0.10$ in our setting with `n = 100` and `p = 20`.\n", "\n", "**Task**: Expand the function `printTStatistic` from above. It should **return the proportion of significant predictor variables** at a certain threshold `p_threshold`. Test it using the example below; execute the next cell multiple times (by hitting `Ctrl + Enter`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "44956a27060c2be1608c5e6aa875677c", "grade": true, "grade_id": "cell-cec12dcb08dc21c7", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "cb5b56cc5e6504f7b5b7bf3964d9ea86", "grade": false, "grade_id": "cell-0ada6a2275e9f16e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Write a small script that carries out the experiment `1000` times and computes the mean proportion of significant values in our experiment. It should be around `p_threshold`.\n", "\n", "**Hint 1**: Use the keyword argument `print_table` to suppress the printing of the tables.\n", "\n", "**Hint 2**: You can collect the returned values in a list initialized by `vals = []`. You can append a new value `new_val` using `vals.append(new_val)`." 
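] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remark**: A short sketch of why the mean proportion should come out close to `p_threshold`, under the idealized assumption (not stated in this lab) that each $p$-value is uniformly distributed on $[0,1]$ when $H_0$ holds. The symbols $\\alpha$ (for `p_threshold`) and $p_j$ (for the $p$-value belonging to $\\beta_j$) are introduced here only for illustration.\n", "\n", "$$\n", "\\mathbb{E}\\left[ \\frac{1}{p} \\sum_{j=1}^{p} \\mathbf{1}\\{p_j < \\alpha\\} \\right] = \\frac{1}{p} \\sum_{j=1}^{p} P(p_j < \\alpha) = \\alpha\n", "$$\n", "\n", "If the individual tests were additionally independent, the probability of at least one spurious finding would be $1 - (1-\\alpha)^p$, which for $\\alpha = 0.10$ and $p = 20$ is roughly $0.88$. Since the noise in this experiment is uniform rather than Gaussian, both statements should be read as approximations." 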
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e7d2eff96689491da971484e4e87d239", "grade": true, "grade_id": "cell-900fd83ad0f07f69", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "612e5cbde5a2cfc9bb18094f6997d14d", "grade": false, "grade_id": "cell-ade74e5fc814e08d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part B: \"Nonlinear\" linear regression" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f60c62a1297b2604afb2f429fcfa3b20", "grade": false, "grade_id": "cell-56275c08b9d03356", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The goal of this problem is to approximate given data points $(x_i,y_i)$ for $i=1,\\ldots,n$ by polynomials of degree $p$.\n", "This can be done by solving the linear regression problem:\n", "\n", "$$\n", " y_i \\approx \\beta_0 + \\beta_1 \\, x_i + \\beta_2 \\, x_i^2 + \\ldots + \\beta_p \\, x_i^p\n", "$$\n", "\n", "By splitting our data into a training and test data set, we want to illustrate graphically the problem of overfitting." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f31b01b4889c82cd5383e7e54c083fcd", "grade": false, "grade_id": "cell-160b12607c089949", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Define the 'unknown' function\n", "\n", "$$\n", "f(x) = \\sin(10 \\, x) + 5 \\, \\cos(3 \\, x)\n", "$$\n", "\n", "using `numpy`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "57cbeb5aec80bf01037439faa3650392", "grade": false, "grade_id": "cell-f885a94e9c772a30", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Define the 'unknown' function f\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e40d5dd536dc0711c06b7c67fe619e73", "grade": true, "grade_id": "cell-97535adf15a536a9", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert(np.abs(f(np.pi)+5) < 1e-8)\n", "assert(np.abs(f(np.pi/2)) < 1e-8)\n", "assert(np.abs(f(0)-5) < 1e-8)\n", "assert(np.abs(f(2) - 5.71) < 1e-2)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e230df4127b89855d831e0c900a7b70c", "grade": false, "grade_id": "cell-07c30dcca25eb800", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Generate a uniformly distributed random vector `x` of size `n = 200`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "27a8aca60ea1224422f2aee65d4f5e72", "grade": false, "grade_id": "cell-a3beb882c6199b0e", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Set random seed to make random variables 'predictable'\n", "np.random.seed(0)\n", "\n", "# Generate uniformly distributed data samples over [0,1)\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7350cdd1a4746421372f7fd9e5e3a854", "grade": true, "grade_id": "cell-1c0f265beb29c4ea", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert n == 200\n", "assert np.abs(x.mean() - 0.5004377979051402) < 1e-8\n", "assert x.shape == (200,)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8402eeac2809fc5f90720a049f75d773", "grade": false, "grade_id": "cell-3b0982ae444d1c27", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Determine the vector `y` in the following way\n", "\n", "$$\n", "y_i = f(x_i) + \\varepsilon \\, \\eta_i\n", "$$\n", "\n", "with $\\eta_i$ standard-normal distributed and $\\varepsilon = 1$." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8a13460794c88d0d71743504633bc9e8", "grade": false, "grade_id": "cell-a98eacb6c66093ab", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "65975f7f4595a7a59b3065060086906c", "grade": true, "grade_id": "cell-f632b8c0c5988c29", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert y.shape == (200,)\n", "assert np.abs(y.mean() - 0.2748887916140714) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f311ed30f080dba05e3d87c82b81b580", "grade": false, "grade_id": "cell-e0628aef205798bf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Generate one figure with the following data:\n", "* mark the **data points** $(x_i,y_i)$ as black circles\n", "* draw the **population line** (the line representing the *unknown* function $f$) as a red solid line\n", "* draw the **regression line** for a fitted polynomial with polynomial degree `p = 20` as a blue dashed line\n", "\n", "**Hint**: Use the functions `np.polyfit` and `np.polyval` to determine the regression line." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d8a3a96894091b4fda3cf4bab166c598", "grade": true, "grade_id": "cell-374277908e484443", "locked": false, "points": 4, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = (15,8)\n", "\n", "fig = plt.figure(1, clear = True)\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1ca7f33500a62feff146a47a7621c95e", "grade": false, "grade_id": "cell-c482bc8c1c9bf128", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Split the dataset $(x,y)$ into a training and test set using `np.split`\n", "- the training set should contain `ntrain` samples\n", "- the test set should contain `n - ntrain` samples\n", "\n", "Choose `ntrain = 80`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "65986c9c8b187b17a2d3b11737c2b263", "grade": false, "grade_id": "cell-8f48e7bce19fd7cb", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "71ab6e5dcb6c75b5e0a32f14cc366fe4", "grade": true, "grade_id": "cell-7e8109022605f809", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert(ntrain == 80)\n", "assert(xtrain.shape == (80,))\n", "assert(xtest.shape == (120,))\n", "assert(ytrain.shape == (80,))\n", "assert(ytest.shape == (120,))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to fit polynomial models with varying polynomial degrees ($p= 0,\\ldots,20$).\n", "As a quality measure, we store the training MSE (mean squared error) and the test MSE.\n", "\n", "**Note**: You can ignore the `RankWarning`s!" 
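, "\n", "\n", "For reference, a sketch of the quantity meant by MSE here, assuming a fitted polynomial $\\hat f$ that is evaluated on $m$ samples $(x_i, y_i)$; the names $\\hat f$ and $m$ are chosen for illustration and do not appear elsewhere in this sheet:\n", "\n", "$$\n", "\\mathrm{MSE} = \\frac{1}{m} \\sum_{i=1}^{m} \\left( y_i - \\hat f(x_i) \\right)^2\n", "$$\n", "\n", "The training MSE takes this average over the `ntrain` training samples, the test MSE over the remaining `n - ntrain` samples; in both cases $\\hat f$ is fitted on the training data only." 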
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6784154946fcab7912a32ea128d0cdf7", "grade": false, "grade_id": "cell-fc2d52ce5ecc280d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "def computeMSE(y, fhatx):\n", "    \"\"\"Return the mean squared error between y and fhatx.\"\"\"\n", "    return np.mean(np.power(y - fhatx, 2))\n", "\n", "# Initialize lists that contain test and training mean squared errors\n", "MSEtrain = []\n", "MSEtest = []\n", "\n", "# Set range for different degrees\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "for j in deg_range:\n", "    \n", "    # Fit polynomial of degree 'j' on training data\n", "    # YOUR CODE HERE\n", "    raise NotImplementedError()\n", "    \n", "    # Append training and test MSE to the corresponding lists\n", "    # YOUR CODE HERE\n", "    raise NotImplementedError()\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "aa23629891f618a5069e22dbed92206e", "grade": false, "grade_id": "cell-6f0e245c3fb999fe", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Generate one figure that contains\n", "- the test MSE in a logarithmic plot as a blue dashed line\n", "- the training MSE in a logarithmic plot as a red solid line\n", "\n", "against the polynomial degree.\n", "You should use the function `plt.semilogy` and set meaningful `label`s." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "1329d429604f5a7c4f75dc6ec57d3ab5", "grade": true, "grade_id": "cell-ca2617631310e831", "locked": false, "points": 4, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "fig = plt.figure(2, clear=True)\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "plt.legend()\n", "plt.xlabel(\"Polynomial degree\")\n", "plt.ylabel(\"MSE\")\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }