{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem Sheet 4 - Further aspects of linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem 1 - Limitations of the t-test\n",
    "\n",
    "In this notebook, we investigate the limitations of a single-variable **t-test** for the predictor coefficients $\\beta$ in a linear regression setting.\n",
    "Recall the following statements from the lecture (Slide 102):\n",
    "* Does a single small $p$-value indicate at least one variable relevant? No.\n",
    "* Example: $p=100$, $H_0 : \\beta_1 = \\dots = \\beta_p = 0$ true. Then by chance, $5\\%$ of $p$-values below $0.05$. Almost guaranteed that $p<0.05$ for at least one variable by chance.\n",
    "* Thus, for large $p$, looking only at $p$-values of individual $t$-statistics tends to discover spurious relationships.\n",
    "\n",
    "In what follows, we use slightly different values than in the above mentioned example, setting $n = 100$ and $p = 20$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Set parameters n (number of training samples) and p (number of predictor variables)\n",
    "n = 100\n",
    "p = 20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this purpose, we generate random uncorrelated input and output vectors.\n",
    "\n",
    "**Task**: Write a function that generates uniformly distributed random variables\n",
    "* $y$ should be of size (n,) with values in $[-0.5,0.5]$ \n",
    "* $X$ should be of size (n, p+1) with values in $[0,1]$; the first column is reserved for the intercept and should contain a only ones"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generateExample(n,p):\n",
    "    # Put your code here.\n",
    "    return (X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generateExample(n,p):\n",
    "    y = -0.5+np.random.rand(n,)\n",
    "    X = np.random.rand(n, p+1)\n",
    "    X[:,0] = 1.\n",
    "    return (X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function computes single-variable t-statistics for the model\n",
    "$$ y \\approx X \\beta $$\n",
    "whose parameters $\\beta \\in \\mathbb{R}^{p+1}$ are computed via\n",
    "$$ \\hat \\beta = (X^\\top X)^{-1} X^\\top y. $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import t\n",
    "\n",
    "def printTStatistic(X, y, p_threshold = 0.10, print_table=True):\n",
    "    n, p = X.shape\n",
    "    p = p - 1\n",
    "\n",
    "    # Invert X^T * X\n",
    "    V = np.linalg.inv((X.T).dot(X))\n",
    "    \n",
    "\n",
    "    # Compute regression coefficients beta\n",
    "    beta = V.dot( X.T.dot(y) )\n",
    "\n",
    "    # Extract diagonal of matrix (X^T * X)^-1\n",
    "    v = V.diagonal()\n",
    "\n",
    "    # Predict y using beta\n",
    "    y_pred = X.dot(beta)\n",
    "\n",
    "    # Compute estimate of sigma\n",
    "    sigma_hat = np.sqrt( 1./(n-p-1) * np.power(y - y_pred,2).sum() )\n",
    "\n",
    "    # Compute the standard errors\n",
    "    SE = np.sqrt(v) * sigma_hat\n",
    "\n",
    "    # Compute the values of the t-statistic\n",
    "    t_vals = beta / SE\n",
    "\n",
    "    # Compute the corresponding p values\n",
    "    p_vals = 2*t.cdf(-np.absolute(t_vals), n-p-1)\n",
    "\n",
    "    if print_table:\n",
    "        # Print header\n",
    "        print('|  Coefficient  | Estimate |    SE    | t-statistic |  p-value  | p < %4.2f |' % p_threshold)\n",
    "        print('----------------------------------------------------------------------------')\n",
    "        # Print \n",
    "        for i in range(p+1):\n",
    "            pval = p_vals[i]\n",
    "            if pval < 0.0001:\n",
    "                pval_str = '< 0.0001'\n",
    "            else:\n",
    "                pval_str = '  %5.4f' % pval\n",
    "            print('|    beta_%02d    |  %6.3f  |  %6.4f  |    %5.2f    | %s  |     %d    |' % (i, beta[i], SE[i], t_vals[i], pval_str, pval < p_threshold))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import t\n",
    "\n",
    "def printTStatistic(X, y, p_threshold = 0.10, print_table=True):\n",
    "    n, p = X.shape\n",
    "    p = p - 1\n",
    "\n",
    "    # Invert X^T * X\n",
    "    V = np.linalg.inv((X.T).dot(X))\n",
    "    \n",
    "\n",
    "    # Compute regression coefficients beta\n",
    "    beta = V.dot( X.T.dot(y) )\n",
    "\n",
    "    # Extract diagonal out of matrix (X^T * X)^-1\n",
    "    v = V.diagonal()\n",
    "\n",
    "    # Predict y using beta\n",
    "    y_pred = X.dot(beta)\n",
    "\n",
    "    # Compute estimate of sigma\n",
    "    sigma_hat = np.sqrt( 1./(n-p-1) * np.power(y - y_pred,2).sum() )\n",
    "\n",
    "    # Compute the standard errors\n",
    "    SE = np.sqrt(v) * sigma_hat\n",
    "\n",
    "    # Compute the values of the t-statistic\n",
    "    t_vals = beta / SE\n",
    "\n",
    "    # Compute the corresponding p values\n",
    "    p_vals = 2*t.cdf(-np.absolute(t_vals), n-p-1)\n",
    "\n",
    "    if print_table:\n",
    "        # Print header\n",
    "        print('|  Coefficient  | Estimate |    SE    | t-statistic |  p-value  | p < %4.2f |' % p_threshold)\n",
    "        print('----------------------------------------------------------------------------')\n",
    "        # Print \n",
    "        for i in range(p+1):\n",
    "            pval = p_vals[i]\n",
    "            if pval < 0.0001:\n",
    "                pval_str = '< 0.0001'\n",
    "            else:\n",
    "                pval_str = '  %5.4f' % pval\n",
    "            print('|    beta_%02d    |  %6.3f  |  %6.4f  |    %5.2f    | %s  |     %d    |' % (i, beta[i], SE[i], t_vals[i], pval_str, pval < p_threshold))\n",
    "        \n",
    "    return np.mean(p_vals < p_threshold)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Test the function `printTStatistic` using an example drawn with your `generateExample` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Put your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|  Coefficient  | Estimate |    SE    | t-statistic |  p-value  | p < 0.10 |\n",
      "----------------------------------------------------------------------------\n",
      "|    beta_00    |  -0.231  |  0.2444  |    -0.94    |   0.3485  |     0    |\n",
      "|    beta_01    |   0.023  |  0.1025  |     0.23    |   0.8198  |     0    |\n",
      "|    beta_02    |   0.185  |  0.1138  |     1.63    |   0.1078  |     0    |\n",
      "|    beta_03    |   0.079  |  0.1138  |     0.70    |   0.4885  |     0    |\n",
      "|    beta_04    |  -0.174  |  0.1078  |    -1.62    |   0.1100  |     0    |\n",
      "|    beta_05    |  -0.016  |  0.1038  |    -0.15    |   0.8785  |     0    |\n",
      "|    beta_06    |   0.030  |  0.1043  |     0.28    |   0.7767  |     0    |\n",
      "|    beta_07    |  -0.074  |  0.1025  |    -0.72    |   0.4724  |     0    |\n",
      "|    beta_08    |   0.129  |  0.1064  |     1.21    |   0.2283  |     0    |\n",
      "|    beta_09    |  -0.040  |  0.1100  |    -0.37    |   0.7137  |     0    |\n",
      "|    beta_10    |   0.041  |  0.1192  |     0.35    |   0.7297  |     0    |\n",
      "|    beta_11    |   0.092  |  0.1080  |     0.85    |   0.3984  |     0    |\n",
      "|    beta_12    |   0.065  |  0.1170  |     0.55    |   0.5820  |     0    |\n",
      "|    beta_13    |   0.098  |  0.1019  |     0.96    |   0.3389  |     0    |\n",
      "|    beta_14    |   0.049  |  0.1083  |     0.45    |   0.6518  |     0    |\n",
      "|    beta_15    |  -0.063  |  0.1028  |    -0.61    |   0.5441  |     0    |\n",
      "|    beta_16    |   0.124  |  0.1095  |     1.13    |   0.2606  |     0    |\n",
      "|    beta_17    |  -0.138  |  0.1083  |    -1.27    |   0.2077  |     0    |\n",
      "|    beta_18    |  -0.013  |  0.1109  |    -0.12    |   0.9042  |     0    |\n",
      "|    beta_19    |  -0.213  |  0.1047  |    -2.03    |   0.0454  |     1    |\n",
      "|    beta_20    |   0.220  |  0.1165  |     1.89    |   0.0630  |     1    |\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.09523809523809523"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X,y = generateExample(n,p)\n",
    "printTStatistic(X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we want to find out, how many predictor variables are statistically significant for a threshold of $0.10$.\n",
    "\n",
    "**Task 1**: Expand the function `printTStatistic`.\n",
    "It should return the proportion of significant predictor variables at a certain threshold `p_threshold`.\n",
    "**Task 2**: Write a small script that carries out the experiment `1000` times and computes the mean proportion of significant values in our experiment.\n",
    "\n",
    "**Hint**: You can collect the returned values in a list initialized by `vals = []`. You can append a new value using `vals.append(new_val)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Put your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.09714285714285714\n"
     ]
    }
   ],
   "source": [
    "vals = []\n",
    "for i in range(1000):\n",
    "    X, y = generateExample(n, p)\n",
    "    vals.append(printTStatistic(X, y, p_threshold = 0.10, print_table=False))\n",
    "print(np.mean(vals))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should observe that approximately $10\\%$ of the predictor variables are significant by chance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}