{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", "    ps2_blja_problem1\n", "\n", "Submit your homework within one week, by next Monday at 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "75998ad0be7bb835fdd23aa33879a397", "grade": false, "grade_id": "cell-d3f1af4753cf0349", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Introduction to Data Science\n", "## Lab 6: Test for importance of a subset of predictors" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5c1912f1ab3cd263923b8a3e7c9bddbd", "grade": false, "grade_id": "cell-eb893e9a7222f145", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In this problem, we want to investigate another data set.\n", "The data come from a study of [Stamey et al. 
(1989)](https://www.sciencedirect.com/science/article/pii/S002253471741175X?via%3Dihub).\n", "The data set consists of clinical measurements of 97 patients who were about to undergo a prostatectomy.\n", "\n", "The predictors are:\n", "* lcavol - Logarithm of cancer volume\n", "* lweight - Logarithm of prostate weight\n", "* age - Age of patient\n", "* lbph - Logarithm of amount of benign prostatic hyperplasia\n", "* svi - Seminal vesicle invasion\n", "* lcp - Logarithm of capsular penetration\n", "* gleason - Gleason score\n", "* pgg45 - Percent of Gleason scores 4 or 5\n", "\n", "The variable that we want to predict is:\n", "* lpsa - Logarithm of prostate-specific antigen level\n", "\n", "There are no missing values." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "470dfe63691f20824d634773c68a8a01", "grade": false, "grade_id": "cell-505cbe735d8eeabd", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The first task is to download and import the `prostate` data set.\n", "You can find it [here](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2019/).\n", "\n", "A quick look at the file reveals that the values are separated by tabs.\n", "You can use the convenient `pandas` function `read_csv` with the options `sep = \"\\t\"` and `index_col = 0`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "68f85a36d393578762f6d609ff9ef7cd", "grade": false, "grade_id": "cell-04d76c8cb3e66529", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Import prostate cancer dataset\n", "pc = pd.read_csv('prostate.data', sep=\"\\t\", index_col=0)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "50bfaecc8bee359c78b77ed2416154b5", "grade": false, "grade_id": "cell-8b7539177e190d44", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The method `head` of a `pd.DataFrame` returns by default the first 5 rows of the `DataFrame`, which is suitable for getting an overview of the data.\n", "\n", "**Task**: Try it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "399ea4f51277c6192611fb11f8b21149", "grade": true, "grade_id": "cell-62a814ce0129cf1c", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "7339c24a2ab8666c9d720db6fdd07bb2", "grade": false, "grade_id": "cell-dfc617de98a26c2a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "As you can see, there is a column named **train**, which is either `\"T\"` (True) or `\"F\"` (False).\n", "This means that the data has already been split up into a training and a test set, which we will use shortly." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "cb244a052aea092f4e34a4fedf1a2ce9", "grade": false, "grade_id": "cell-1bc39fbcd62ea9c3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: You should also take a look at the correlation matrix using the method `corr`.\n", "What do you observe, especially concerning the correlation between the features?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6a47c290b31734f9f28ea1099f25d594", "grade": true, "grade_id": "cell-b48f0eca1685a99e", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b811e73d7d87e19f41019f0485ed8439", "grade": false, "grade_id": "cell-10742305a16beda7", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "You should observe that many of the predictor variables are highly correlated with each other.\n", "High correlation between predictors is an *indicator* of multiple features having similar effects on the dependent variable." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "a8567efa4d9b9a52f7ba216ad94852de", "grade": false, "grade_id": "cell-8a405ebfdf734fd5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "From now on, we concentrate on the numerical values of the data set:\n", "- extract the predictors of the dataset as a `numpy.array` $X$ using the attribute `values` of a `pd.DataFrame`\n", "- extract the column **lpsa** as a `numpy.array` $y$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ea6e3a26b10426febb61144dc79bb4b2", "grade": false, "grade_id": "cell-75db3b8e79b3c7b3", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "33e19bdd634a0e0a3ae65dff1a62dbac", "grade": true, "grade_id": "cell-7c379d1604861a94", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "assert np.abs(X.mean() - 12.514554639313143) < 1e-10\n", "assert np.abs(y.mean() - 2.4783868783505154) < 1e-10\n", "assert X.shape == (97,8)\n", "assert y.shape == (97,)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "9f519005cdfbdefaa8818a7376652183", "grade": false, "grade_id": "cell-da34f8777fc4dada", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Before we further analyze our data, we want to normalize the predictor variables.\n", "This technique is often necessary when comparing different kinds of inputs.\n", "\n", "Here, we want to 
normalize the **predictor variables** such that the *normalized predictor variables* have mean $0$ and variance $1$.\n", "\n", "**Caution**: Many functions sound like they do the right thing, but there are several other ways to normalize a data set, e.g. $l^2$-normalization." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2e94075ae423dfd7a368ad8c1f50d8ca", "grade": false, "grade_id": "cell-59f260869b31429f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Normalize the predictor variables. You can use the function `scale` from `scikit-learn`:\n", "\n", "    from sklearn.preprocessing import scale\n", "\n", "You should name the normalized variable `Xnorm`, otherwise the following code might not work." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "dca6eec6fd56c56ec108664baab8582d", "grade": false, "grade_id": "cell-06c9aa04d05180bc", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "26cfe881454e6f0ed3a8cdfeb401c909", "grade": true, "grade_id": "cell-f05591af5072d7d6", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert max(abs(Xnorm.mean(axis=0))) < 1e-10\n", "assert max(abs(Xnorm.var(axis=0)-1)) < 1e-10" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "33a6c46a515a4e5dc5c02d93e0fbe494", "grade": false, "grade_id": "cell-be8c7d7eef70202a", "locked": true, "schema_version": 3, "solution": false, "task": false } 
}, "source": [ "The next step is to divide the data set into a training and a test data set.\n", "We can extract the column **train** by\n", "\n", "    train = (pc[\"train\"]==\"T\")\n", " \n", "which is simply a `pandas Series` object containing the values `True` and `False`.\n", "The nice thing about this object is that we can use it as a *filter* to get our training data by\n", "\n", "    Xtrain = Xnorm[train,:]\n", " \n", "The same indexing can be applied to $y$.\n", "\n", "**Task**: Extract the training samples and store them in the variables `Xtrain` and `ytrain`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "500edd9201a8addff12b22c2efe4190d", "grade": false, "grade_id": "cell-4d64145820558ddc", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "c8b919e5d4785116266979d12e5b991a", "grade": true, "grade_id": "cell-93a67a49cfd901ea", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(Xtrain[-1,-1] - 1.9822513682855871) < 1e-10\n", "assert abs(Xtrain.mean() - 0.011451538411465602) < 1e-10\n", "assert Xtrain.shape == (67,8)\n", "\n", "assert abs(ytrain[50] - 3.3928290999999997) < 1e-10\n", "assert abs(ytrain.mean() - 2.4523450850746267) < 1e-10\n", "assert ytrain.shape == (67,)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4f8791bfb871fd5fc49d8d3228964e8e", "grade": false, "grade_id": "cell-ed8ef2a6c514abb7", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Now, we want to fit a linear regression model on our **training set** 
using all of the predictor variables.\n", "In the last tutorials, we have employed `numpy`'s functions `polyfit` and `polyval`.\n", "Here, you should use the following class:\n", "\n", "    from sklearn.linear_model import LinearRegression\n", " \n", "Have a look at its documentation and fit the model.\n", "Store the trained intercept in the variable `intercept` as well as the regression coefficients (as a list) in the variable `coeffs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "02af79b9ee93d317e51a12411e4aab3a", "grade": false, "grade_id": "cell-6d868b7aa18e75f6", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7aa54103ef923124a3ad464c5f77e557", "grade": true, "grade_id": "cell-2b4ce902b975c47a", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(intercept - 2.464932922123745) < 1e-10\n", "assert abs(np.mean(coeffs) - 0.15837992642962145) < 1e-10" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ee7a34ef18d96698ec909b621d21e305", "grade": false, "grade_id": "cell-2c0bb12f770fc1b6", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "On the webpage, you find the file `multipleTTest.py`.\n", "After downloading the file, you can import it using\n", " \n", "    from multipleTTest import multipleTTest\n", "\n", "This function slightly extends our previous function `computeTStatistic`:\n", "* You may now also provide a `labels` object that contains the names of the predictor variables and modifies only the output 
table. You can get the labels of a `pandas DataFrame` by\n", " \n", "      labels = pc.keys()\n", " \n", "* Additionally, the function `multipleTTest` returns the residual sum of squares of the linear regression fit. \n", "* You may set the optional input variable `includeIntercept` to `True`. This includes the intercept in the estimate, so you do not have to add a column of ones yourself.\n", "\n", "**Task**: Execute the function for this data set. Store the RSS value as `RSS1`. Also store the number of variables in this model as `p1` (count the intercept as one variable!).\n", "How many variables are not significant at a threshold of 5 %?\n", "Store your answer in the variable `notSign1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "283714980220f6ecbfdbb8f86342a1c7", "grade": false, "grade_id": "cell-6ada2c15c960c6b6", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ce963c775e9e45065fb5ae446827c702", "grade": true, "grade_id": "cell-c13af06a1faa4ac4", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert p1 == 9\n", "assert abs(RSS1 - 29.4263844599084) < 1e-10\n", "assert 'notSign1' in locals()\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2ac4e1de10e8f600d625ae30e907f0aa", "grade": false, "grade_id": "cell-aff68d0577de7ebb", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "We want to compare the fit containing all predictor variables against a model containing only those variables that are significant at the 
$5~\\%$ level. The variables that we exclude should be `age`, `lcp`, `gleason` and `pgg45`.\n", "\n", "**Task**: Use the function `multipleTTest` to perform a test using only the significant variables.\n", "Store the residual sum of squares as `RSS0` and the number of variables as `p0`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8d9acb885394e069026e63a8b46b4314", "grade": false, "grade_id": "cell-f51321bcd10789a7", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "9d8015f4fdb31f55ec5edb3ebf73c3e9", "grade": true, "grade_id": "cell-946379ec2bc62a02", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert p0 == 5\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5a37a1286e4aa5f3e96979eda694cc34", "grade": false, "grade_id": "cell-5d331a02b212f7b5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "According to the lecture, slide 104, we may now test the reduced model against the full model using an appropriate $F$-statistic.\n", "Thus, we test the null hypothesis\n", "\n", "$$ \\textbf{H}_0: \\beta_{\\text{age}} = \\beta_{\\text{lcp}} = \\beta_{\\text{gleason}} = \\beta_{\\text{pgg45}} = 0$$\n", "\n", "against\n", "\n", "$$ \\textbf{H}_1: \\text{at least one of the coefficients } \\beta_{\\text{age}}, \\beta_{\\text{lcp}}, \\beta_{\\text{gleason}}, \\beta_{\\text{pgg45}} \\text{ is not zero } $$\n", "\n", "We can do this by computing the $F$-statistic:\n", "\n", "$$ F = \\frac{(RSS_0 - RSS_1) / (p_1 - p_0)}{RSS_1 / (n - p_1)} 
$$\n", "\n", "where $n$ is the number of training samples.\n", "\n", "Under the null hypothesis and our model assumptions, $F$ follows an $F$-distribution with $(p_1-p_0, n-p_1)$ degrees of freedom.\n", "\n", "**Task**: Compute the value of the test statistic and store it in a variable `F`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c0acaf09ea47e825dba6d7349e5dc8bc", "grade": false, "grade_id": "cell-348ff63a3e9ae6d1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "03078066568547262cfeaa5b93a84d29", "grade": true, "grade_id": "cell-94f2551615bec23c", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'F' in locals()\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8dc1f74160759f11e8867d48ca3ababf", "grade": false, "grade_id": "cell-75a2e3c93b1433f5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "To compute the $p$-value, we may import the $F$-distribution by\n", "\n", "    from scipy.stats import f\n", " \n", "**Task**: Determine the corresponding $p$-value and store the value in a variable `pval`.\n", "Assuming a level of significance of 5%, can we safely reject the null hypothesis? Store either `True` or `False` in the variable `reject_null`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ca1428cbf972504cc8b69f5f3eda1535", "grade": false, "grade_id": "cell-b4943a831c958b55", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "eef1b6db097b10eaaf668b3dc8fd2e74", "grade": true, "grade_id": "cell-bf8f4276b433f952", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'pval' in locals()\n", "assert 'reject_null' in locals()\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }