{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 9 Principal Component Regression and Partial Least Squares\n", "\n", "## Part 1 - Data preparation\n", "\n", "We start this homework with probably the most important step in data science: data preparation.\n", "This week, we are going to investigate a data set that contains baseball data from the (North American) Major League during 1986 and 1987." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Import the Hitters data set, available on the class web page, and drop all rows containing missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Identify the three columns that hold categorical variables and store their labels in a list `dummy_vars`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pandas function `get_dummies` converts categorical variables into 0-1 dummy variables. This was discussed in Chapter 3 (slide 112).\n", "\n", "**Task**: Convert the three categorical variables in the dataset into dummy variables and store them in a new `DataFrame` called `df_dummy`.\n", "Take a look at the new `DataFrame` using the method `head`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have done this, you should see that each categorical variable has only two categories.\n", "Thus, we should include only one dummy column per variable in our final data frame.\n", "\n", "**Task**: If you did everything right so far, the following code should execute without errors." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "real_vars = ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',\\\n", " 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks','PutOuts', 'Assists', 'Errors']\n", "dfX = pd.concat([df.loc[:,real_vars], df_dummy.loc[:,['League_A', 'Division_E', 'NewLeague_A']]], axis=1)\n", "dfy = df.Salary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2 - Applying PCR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We studied PCA intensively during the lab.\n", "We will now apply this knowledge to perform a principal component regression (PCR).\n", "\n", "According to [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_regression), the PCR method may be broadly divided into three major steps:\n", "\n", "1. Perform PCA on the observed data matrix for the explanatory variables to obtain the principal components, and then (usually) select a subset, based on some appropriate criteria, of the principal components so obtained for further use.\n", "2. Now regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares regression (linear regression) to get a vector of estimated regression coefficients (with dimension equal to the number of selected principal components).\n", "3. Now transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components) to get the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Scale the data using `StandardScaler` from `sklearn.preprocessing` and perform a (full) principal component analysis." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we vary the number of principal components used in the regression model.\n", "We want to measure the quality of each model by a cross-validated mean squared error (MSE) using 10 folds.\n", "\n", "**Task**:\n", "Implement a loop over the number of principal components in your model.\n", "Use `LinearRegression` as the estimator in `cross_val_score`.\n", "As data, you should choose the first $j$ principal components.\n", "Store the means of the mean squared errors in a list called `mse`.\n", "Pass an appropriate `scoring` option; note that the error-based scorers in scikit-learn return negated values, so flip the sign before storing the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Determine the number of components for which the MSE is smallest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe that the MSE is minimized by taking all but one of the principal components into consideration.\n", "This corresponds to almost no dimensionality reduction, and is close to simply performing a linear regression using all of the variables.\n", "However, we also observe that the values do not change very much: even a single principal component yields a reasonably good fit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot the MSE against the number of components in your model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot the cumulative percentage of variance explained by the first $j$ principal components against the number of principal components $j$.\n", "Use the attribute `explained_variance_ratio_` of your fitted `PCA` object." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3 - Partial Least Squares" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to apply partial least squares (PLS) regression to this data set.\n", "The class `PLSRegression` is provided by sklearn in the module `cross_decomposition`.\n", "\n", "**Task**: Implement a loop over the number of components in your PLS regression model.\n", "Use 10-fold cross-validation and store the means of the mean squared errors in a list called `mse`.\n", "You can use an appropriate `scoring` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Determine the number of components for which the MSE is minimized." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print('Minimum MSE for %d components' % (np.argmin(mse)+1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot the MSE against the number of components in your model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we might draw a similar conclusion as for PCR.\n", "Although the MSE is minimized for 14 components, it is fairly low for other values as well.\n", "\n", "**Task**: Finally, we want to take a look at the explained variance in the response as a function of the number of components used in the PLS regression.\n", "You can copy your code from above and only have to change the `scoring` option.\n", "What do you observe?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observation**:" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }