{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 3: Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous exercises gave an introduction to Python, NumPy and pandas. Beginning with this exercise, we shift our focus to statistical learning itself. To this end, we will use the module scikit-learn, which offers many of the functions we will cover during the rest of the semester." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A: Linear regression for the Advertising dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have not already done so, please download the file [Advertising.csv](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2019/Exercises/ps3/Advertising.csv) and move it into the current directory on the Jupyter Hub." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Import the *Advertising* dataset using the `pandas` function `read_csv` and store it in the variable `adv`. Ensure that the first column is treated as the index column." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "5bc404a50048cbd3310c7f72d759d89a", "grade": false, "grade_id": "cell-8d64ab6b1a68631f", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "# Print first entries of adv\n", "print(adv.head(3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "31e44968e45e337d38e76bc706db30b6", "grade": true, "grade_id": "cell-b1115ea7f17d9f4b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert len(adv.index) == 200\n", "assert adv.TV.mean() == 147.0425\n", "assert adv.shape == (200,4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we extract the values from this `pandas` `DataFrame`.\n", "This can be done with the attribute `values`, which returns a `numpy` array.\n", "\n", "**Task**: Extract the data in the following manner:\n", "- `X` contains all predictor variables, i.e., the values from the columns `TV`, `Radio` and `Newspaper`.\n", "- `Y` contains the dependent variable `Sales`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c2c554648df65c96215bbe534fcaf0c4", "grade": false, "grade_id": "cell-f0d1c2f7e6986c87", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "cfc8f9558132e42bc7e6433a10e65e9b", "grade": true, "grade_id": 
"cell-6af75eca942f8185", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X.shape == (200,3)\n", "assert Y.shape == (200,)\n", "assert type(X) == np.ndarray\n", "assert type(Y) == np.ndarray" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the variables `X` and `Y` contain only the values itself, and no further information like the column title.\n", "\n", "Using the `numpy` function `hsplit`, we can split an array horizontally.\n", "This will become handy in many circumstances.\n", "Here, we split the 2-dimensional numpy array `X` into three \n", "**2**-dimensional slices `tv`, `radio` and `news`.\n", "Note that the returned arrays have still the second dimension (which is one)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tv, radio, news = np.hsplit(X,3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute for each of the 3 predictor variables **TV**, **radio** and **newspaper** simple (1-dimensional) linear regressions, e.g.\n", "\n", "$$ y^{TV}_i \\approx \\beta_0^{TV} + \\beta_1^{TV} \\, x_i^{TV}$$\n", "\n", "Use the following function:\n", "\n", " from sklearn.linear_model import LinearRegression\n", " \n", "Store the intercepts, i.e., the values $\\beta_0$ in variables\n", " \n", " intercept_tv, intercept_radio and intercept_news\n", " \n", "and the linear coefficients, i.e., the values $\\beta_1$ in variables\n", "\n", " lincoef_tv, lincoef_radio and lincoeff_news\n", "\n", "To print your results in a nice fashion, you can use a command similar to\n", "\n", " print('y = %5.4f + %5.4f x TV' % (intercept, lincoef))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e0455dd47c54c42d8fe3169ed3b2ba99", "grade": false, "grade_id": "cell-5cb14fe0e222492c", "locked": false, "schema_version": 3, "solution": true, 
"task": false } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6fee455dec1e4bbaafb6cbe4552dbde4", "grade": true, "grade_id": "cell-ed24b27d868aff40", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(intercept_tv-7.0326) < 1e-4\n", "assert np.abs(lincoef_tv-0.0475) < 1e-4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe, that the regression coefficients for **TV** and **newspaper** are very similar.\n", "As you already know from the lecture, it is not satisfying from a mathematical point of view to restrict our investigation to the absolute values of the coefficients.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part B: Assessing the quality of a linear fit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lecture you learned about different measures for assessing the quality of a linear fit.\n", "In the last exercise, we already implemented a function to compute the mean squared error (MSE).\n", "\n", "This time, we want to compare the $R^2$ scores. 
You can use the method `score()` of a `LinearRegression` to get the $R^2$ values.\n", "Remember that this value is the proportion of variability in $Y$ explained using **TV**, **radio** or **newspaper** as predictor in a 1-dimensional linear regression fit.\n", "\n", "**Task**: Compute the $R^2$ scores and store them in variables\n", " \n", " R2_tv, R2_radio, R2_news" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4c9021fae88e5026f06552e4b278e329", "grade": false, "grade_id": "cell-fd3b6790a9e543c0", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "print(\"R^2 for TV: \", R2_tv)\n", "print(\"R^2 for radio: \", R2_radio)\n", "print(\"R^2 for newspaper: \", R2_news)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ab00ea68c91893ff86e23ca78c996761", "grade": true, "grade_id": "cell-b9cdf81fe534585a", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(R2_tv - 0.611875050850071) < 1e-9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part C: Predicting values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to compute the predicted value of sales if we restrict our prediction to one input, i.e. 
**TV**, **radio** or **newspaper**, respectively.\n", "Predict the values $\\hat{y}^{TV}$, $\\hat{y}^{radio}$ and $\\hat{y}^{newspaper}$ using the `LinearRegression` method `predict()` and store them in the variables\n", "\n", "    y_tv, y_radio, y_news" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6fe1e5444a4de153ad0630265e51d1a0", "grade": false, "grade_id": "cell-9e5459d7aaa2dfc8", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "a61f2d22b1f0a69c6ab82f77ca5dbf8f", "grade": true, "grade_id": "cell-71f652a968725ea2", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(y_tv.mean() -14.0225) < 1e-5\n", "assert np.abs(y_tv.std() - 4.071006120646744) < 1e-5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part D: Plotting\n", "Plot the data points as well as the corresponding regression line for each of the inputs **TV**, **radio** and **newspaper**.\n", "\n", "You can use the functions `subplots` or `fig.add_subplot` to arrange the plots in one figure." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "30331b1e6bf24bcf82d20fb8e3865fc1", "grade": true, "grade_id": "cell-f5323561f0d02b4a", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# We plot our findings using subplots\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = (16,9)\n", "\n", "fig = plt.figure()\n", "fig.add_subplot(1,3,1)\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "plt.title('Lin. 
regr. TV')\n", "\n", "fig.add_subplot(1,3,2)\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "plt.title('Lin. regr. radio')\n", "\n", "fig.add_subplot(1,3,3)\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "plt.title('Lin. regr. newspaper');\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part E: Statistical functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a closer look at the correlation matrix for the `DataFrame` `adv`.\n", "You can use the method `corr()` that is implemented for pandas \n", "`DataFrames`.\n", "Which features are correlated most strongly?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d52f531524556769438cc4f17d1630ab", "grade": false, "grade_id": "cell-4ba6cf2f14622849", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "adv.corr()\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "fb0744d989515ed6dca84235f5ddf361", "grade": false, "grade_id": "cell-aada4e9553f481e3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Name the two features, that are correlated most strongly in the next cell!" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e8fd522b3d24dca209b8dafbd21e2da4", "grade": true, "grade_id": "cell-3d9e79b4a9566134", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "source": [ "YOUR ANSWER HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Investigate the statistical significance of the medium **newspaper** in a linear regression involving only this feature. 
Use a **t-test** for this purpose as described on slide 83 in the lecture notes.\n", "\n", "You should observe the following values:\n", "\n", "| Coefficient | Estimate | SE | t-statistic | p-value |\n", "|:-----------|----------|----|-------------|--------|\n", "| $\\beta_0$ | 12.351 | 0.621 | 19.88 | < 0.0001 |\n", "| $\\beta_{newspaper}$ | 0.055 | 0.017 | 3.30 | 0.00115 |\n", "\n", "\n", "You can obtain the $t$-distribution from `scipy` using\n", "\n", "    from scipy.stats import t\n", " \n", "The cumulative distribution function at a point `x` for `n` degrees of freedom can then be evaluated by\n", "\n", "    t.cdf(x, n)\n", "\n", "For a two-sided test, the p-value is $2 \\, (1 - F(|t|))$, where $F$ is the cumulative distribution function of the $t$-distribution with $N-2$ degrees of freedom and $N$ is the number of observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "77835196b87fb3e28536cb54c67e297c", "grade": true, "grade_id": "cell-02461ff24a250328", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from scipy.stats import t\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "# Print information\n", "print(\"Intercept: %7.4f\" % beta_0)\n", "print(\"Lin. coef: %7.4f\" % beta_1)\n", "print(\"\")\n", "print(\"std-error Intercept: %7.4f\" % SE0)\n", "print(\"std-error Lin. coef: %7.4f\" % SE1)\n", "print(\"\")\n", "print(\"t-statistic Intercept: %6.2f\" % tstat_0)\n", "print(\"t-statistic Lin. coef: %6.2f\" % tstat_1)\n", "print(\"\")\n", "print(\"P-value Intercept: %7.5f\" % pval_0)\n", "print(\"P-value Lin. 
coef: %7.5f\" % pval_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part F: Linear regression on all predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now construct a linear regression on all three predictor variables, i.e.\n", "\n", "$$y_i \\approx \\beta_0 + \\beta_{TV} x^{TV}_i + \\beta_{radio} x^{radio}_i + \\beta_{newspaper} x^{newspaper}_i$$\n", "\n", "Ensure that the intercept and linear regression coefficients are stored in the variables\n", "\n", "    beta_0, beta_tv, beta_radio and beta_news" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ff72f2ca6f70c41a81c2cd4885650d83", "grade": false, "grade_id": "cell-59ddc25633526b2d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print('y = %5.4f + %5.4f x TV + %5.4f x radio + %5.4f x newspaper' % (beta_0, beta_tv, beta_radio, beta_news))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f14a98d83af5cb0fdb5e0f1d66a3c04a", "grade": true, "grade_id": "cell-cd23d2ff6972429b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(beta_0 - 2.9389) < 1e-4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What proportion of the variance (between 0 and 1) is explained by this linear regression fit? Store your answer in the variable `explained_var`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d54de71df8d393960cf28d2075e589bd", "grade": false, "grade_id": "cell-9cb2dad97c8c9bf5", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b92bfbccdcc7db52b276587d0d799aa8", "grade": true, "grade_id": "cell-66b0fe342bdf5b49", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print('The portion of the variance that can be explained by the full model is about %8.6f' % explained_var)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now perform a linear regression that incorporates only the predictors **TV** and **radio**.\n", "Compute the $R^2$-value of this linear regression model, store it in the variable `R2_tv_radio`, and compare it to the $R^2$-value of the full multiple linear regression." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "edd1d18a9dc1f3d8653c5fdc4ab659aa", "grade": false, "grade_id": "cell-caee2298f250d394", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "91d08d7fd84051ca3ae902008c1f59f8", "grade": true, "grade_id": "cell-664727b9fc60e7e8", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print('The R^2 score for the model using TV and Radio as predictors is %8.6f' % R2_tv_radio)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe that the $R^2$-score for the linear regression fit incorporating all three features is only marginally larger than the score using only **TV** and **radio** for the prediction.\n", "Thus, it might be sufficient to exclude the feature **newspaper** from our prediction.\n", "\n", "The procedure we followed today is called *feature selection* and should be one of the first steps in every statistical learning problem." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8d5bda1ca8646d2ae49b11784c1029b1", "grade": false, "grade_id": "cell-694eb0d911bd5ca5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part G: Computing the t-statistic for the full model\n", "\n", "We have already seen, that the **t-test** comes in handy when one has to decide whether a coefficient for a single feature is significant or not.\n", "As has been outlined in the lecture, one can also use the t-test in a multiple linear regression fit\n", "\n", "$$ Y = X \\beta + \\varepsilon $$\n", "\n", "while the intercept is incorporated into $X$, i.e. a column containing only ones is stacked in front of the original matrix $X$.\n", "\n", "The formula to compute the test statistic in this generalized setting is\n", "\n", "$$ t_j = \\frac{\\hat{\\beta}_j}{\\hat{\\sigma} \\sqrt{v_j}} $$\n", "\n", "while $\\hat \\beta_j$ is the $j$-th entry of the coefficient vector\n", "\n", "$$ \\hat \\beta = (X^\\top X)^{-1} X^\\top y, $$\n", "\n", "$\\hat{\\sigma}$ is the unbiased estimate of $\\sigma$, which is determined by\n", "\n", "$$ \\hat{\\sigma} = \\sqrt{\\frac{1}{n-p-1} \\, \\sum_{i=1}^n (y_i - \\hat{y}_i)^2} $$\n", "\n", "and $v_j$ is the $j$-th diagonal element of the matrix $(X^\\top X)^{-1}$.\n", "\n", "Then $t_j$ is distributed according to a $t$-distribution with $n-p-1$ degrees of freedom (dofs). \n", "\n", "**Task**: Compute the values in the following statistic and try to print it in a similar way. 
\n", "\n", "| Coefficient | Estimate | SE | t-statistic | p-value |\n", "|:-----------------|-----------|-------|-------------|---------|\n", "| $\\beta_0$ | 2.939 |0.3119 | 9.42 | < 0.0001|\n", "| $\\beta_{TV}$ | 0.046 |0.0014 | 32.81 | < 0.0001|\n", "| $\\beta_{radio}$ | 0.189 |0.0086 | 21.89 | < 0.0001|\n", "| $\\beta_{news}$ | −0.001 |0.0059 | −0.18 | 0.8599 |\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "58c7450ce29d9163c261b6459343e995", "grade": true, "grade_id": "cell-5de872312430d808", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 1 }