{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Sheet 4 - Further aspects of linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 1 - Limitations of the t-test\n", "\n", "In this notebook, we investigate the limitations of a single-variable **t-test** for the predictor coefficients $\\beta$ in a linear regression setting.\n", "Recall the following statements from the lecture (Slide 102):\n", "* Does a single small $p$-value indicate at least one variable relevant? No.\n", "* Example: $p=100$, $H_0 : \\beta_1 = \\dots = \\beta_p = 0$ true. Then by chance, $5\\%$ of $p$-values below $0.05$. Almost guaranteed that $p<0.05$ for at least one variable by chance.\n", "* Thus, for large $p$, looking only at $p$-values of individual $t$-statistics tends to discover spurious relationships.\n", "\n", "In what follows, we use slightly different values than in the above mentioned example, setting $n = 100$ and $p = 20$." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Set parameters n (number of training samples) and p (number of predictor variables)\n", "n = 100\n", "p = 20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this purpose, we generate random uncorrelated input and output vectors.\n", "\n", "**Task**: Write a function that generates uniformly distributed random variables\n", "* $y$ should be of size (n,) with values in $[-0.5,0.5]$ \n", "* $X$ should be of size (n, p+1) with values in $[0,1]$; the first column is reserved for the intercept and should contain a only ones" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def generateExample(n,p):\n", " # Put your code here.\n", " return (X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def generateExample(n,p):\n", " y = -0.5+np.random.rand(n,)\n", " X = np.random.rand(n, p+1)\n", " X[:,0] = 1.\n", " return (X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function computes single-variable t-statistics for the model\n", "$$ y \\approx X \\beta $$\n", "whose parameters $\\beta \\in \\mathbb{R}^{p+1}$ are computed via\n", "$$ \\hat \\beta = (X^\\top X)^{-1} X^\\top y. $$" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import t\n", "\n", "def printTStatistic(X, y, p_threshold = 0.10, print_table=True):\n", " n, p = X.shape\n", " p = p - 1\n", "\n", " # Invert X^T * X\n", " V = np.linalg.inv((X.T).dot(X))\n", " \n", "\n", " # Compute regression coefficients beta\n", " beta = V.dot( X.T.dot(y) )\n", "\n", " # Extract diagonal of matrix (X^T * X)^-1\n", " v = V.diagonal()\n", "\n", " # Predict y using beta\n", " y_pred = X.dot(beta)\n", "\n", " # Compute estimate of sigma\n", " sigma_hat = np.sqrt( 1./(n-p-1) * np.power(y - y_pred,2).sum() )\n", "\n", " # Compute the standard errors\n", " SE = np.sqrt(v) * sigma_hat\n", "\n", " # Compute the values of the t-statistic\n", " t_vals = beta / SE\n", "\n", " # Compute the corresponding p values\n", " p_vals = 2*t.cdf(-np.absolute(t_vals), n-p-1)\n", "\n", " if print_table:\n", " # Print header\n", " print('| Coefficient | Estimate | SE | t-statistic | p-value | p < %4.2f |' % p_threshold)\n", " print('----------------------------------------------------------------------------')\n", " # Print \n", " for i in range(p+1):\n", " pval = p_vals[i]\n", " if pval < 0.0001:\n", " pval_str = '< 0.0001'\n", " else:\n", " pval_str = ' %5.4f' % pval\n", " print('| beta_%02d | %6.3f | %6.4f | %5.2f | %s | %d |' % (i, beta[i], SE[i], t_vals[i], pval_str, pval < p_threshold))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import t\n", "\n", "def printTStatistic(X, y, p_threshold = 0.10, print_table=True):\n", " n, p = X.shape\n", " p = p - 1\n", "\n", " # Invert X^T * X\n", " V = np.linalg.inv((X.T).dot(X))\n", " \n", "\n", " # Compute regression coefficients beta\n", " beta = V.dot( X.T.dot(y) )\n", "\n", " # Extract diagonal out of matrix (X^T * X)^-1\n", " v = V.diagonal()\n", "\n", " # Predict y using beta\n", " y_pred = X.dot(beta)\n", "\n", " # Compute estimate of sigma\n", " sigma_hat = np.sqrt( 1./(n-p-1) * np.power(y - y_pred,2).sum() )\n", "\n", " # Compute the standard errors\n", " SE = np.sqrt(v) * sigma_hat\n", "\n", " # Compute the values of the t-statistic\n", " t_vals = beta / SE\n", "\n", " # Compute the corresponding p values\n", " p_vals = 2*t.cdf(-np.absolute(t_vals), n-p-1)\n", "\n", " if print_table:\n", " # Print header\n", " print('| Coefficient | Estimate | SE | t-statistic | p-value | p < %4.2f |' % p_threshold)\n", " print('----------------------------------------------------------------------------')\n", " # Print \n", " for i in range(p+1):\n", " pval = p_vals[i]\n", " if pval < 0.0001:\n", " pval_str = '< 0.0001'\n", " else:\n", " pval_str = ' %5.4f' % pval\n", " print('| beta_%02d | %6.3f | %6.4f | %5.2f | %s | %d |' % (i, beta[i], SE[i], t_vals[i], pval_str, pval < p_threshold))\n", " \n", " return np.mean(p_vals < p_threshold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Test the function `printTStatistic` using an example drawn with your `generateExample` function." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Put your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| Coefficient | Estimate | SE | t-statistic | p-value | p < 0.10 |\n", "----------------------------------------------------------------------------\n", "| beta_00 | -0.231 | 0.2444 | -0.94 | 0.3485 | 0 |\n", "| beta_01 | 0.023 | 0.1025 | 0.23 | 0.8198 | 0 |\n", "| beta_02 | 0.185 | 0.1138 | 1.63 | 0.1078 | 0 |\n", "| beta_03 | 0.079 | 0.1138 | 0.70 | 0.4885 | 0 |\n", "| beta_04 | -0.174 | 0.1078 | -1.62 | 0.1100 | 0 |\n", "| beta_05 | -0.016 | 0.1038 | -0.15 | 0.8785 | 0 |\n", "| beta_06 | 0.030 | 0.1043 | 0.28 | 0.7767 | 0 |\n", "| beta_07 | -0.074 | 0.1025 | -0.72 | 0.4724 | 0 |\n", "| beta_08 | 0.129 | 0.1064 | 1.21 | 0.2283 | 0 |\n", "| beta_09 | -0.040 | 0.1100 | -0.37 | 0.7137 | 0 |\n", "| beta_10 | 0.041 | 0.1192 | 0.35 | 0.7297 | 0 |\n", "| beta_11 | 0.092 | 0.1080 | 0.85 | 0.3984 | 0 |\n", "| beta_12 | 0.065 | 0.1170 | 0.55 | 0.5820 | 0 |\n", "| beta_13 | 0.098 | 0.1019 | 0.96 | 0.3389 | 0 |\n", "| beta_14 | 0.049 | 0.1083 | 0.45 | 0.6518 | 0 |\n", "| beta_15 | -0.063 | 0.1028 | -0.61 | 0.5441 | 0 |\n", "| beta_16 | 0.124 | 0.1095 | 1.13 | 0.2606 | 0 |\n", "| beta_17 | -0.138 | 0.1083 | -1.27 | 0.2077 | 0 |\n", "| beta_18 | -0.013 | 0.1109 | -0.12 | 0.9042 | 0 |\n", "| beta_19 | -0.213 | 0.1047 | -2.03 | 0.0454 | 1 |\n", "| beta_20 | 0.220 | 0.1165 | 1.89 | 0.0630 | 1 |\n" ] }, { "data": { "text/plain": [ "0.09523809523809523" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X,y = generateExample(n,p)\n", "printTStatistic(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we want to find out, how many predictor variables are statistically significant for a threshold of $0.10$.\n", "\n", "**Task 1**: Expand the function `printTStatistic`.\n", "It should return the proportion of significant predictor variables at a certain threshold `p_threshold`.\n", "**Task 2**: Write a small script that carries out the experiment `1000` times and computes the mean proportion of significant values in our experiment.\n", "\n", "**Hint**: You can collect the returned values in a list initialized by `vals = []`. You can append a new value using `vals.append(new_val)`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Put your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.09714285714285714\n" ] } ], "source": [ "vals = []\n", "for i in range(1000):\n", " X, y = generateExample(n, p)\n", " vals.append(printTStatistic(X, y, p_threshold = 0.10, print_table=False))\n", "print(np.mean(vals))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe that approximately $10\\%$ of the predictor variables are significant by chance." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }