{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Sheet 11 - A Complete scikit-learn Project - Part 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous lab we discussed a complete scikit-learn project.\n", "In the end, we applied a linear regression algorithm to our transformed data.\n", "This was a lot of new material, including the definition of custom transformers, the setup of `sklearn` pipelines, etc.\n", "To emphasize the advantages of this approach, we will next apply different regression models to our (training) data.\n", "Then, we consider decision tree models as well as random forests and try to determine a set of *good* model parameters (aka hyperparameters).\n", "\n", "Before we go into details, we recall the main steps of Problem Sheet 10.\n", "\n", "## We started by loading the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Set standard figure size\n", "plt.rcParams['figure.figsize'] = (16,9)\n", "\n", "# Read csv file\n", "df = pd.read_csv('datasets/housing.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stratified train-test splitting" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "income_cat = pd.qcut(df.median_income,5)\n", "\n", "ttsplit = train_test_split(df, test_size = 0.2, random_state=1, stratify=income_cat)\n", "\n", "train, test = ttsplit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation of custom transformers for new features" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "ix_rooms = 3\n", "ix_beds = 4\n", "ix_households = 6\n", "\n", "# We derive our new class from 
BaseEstimator and TransformerMixin\n", "class AddBedroomsPerRoom(BaseEstimator, TransformerMixin):\n", "\n", "    # The constructor in Python is defined by the method __init__,\n", "    # we have to pass self as a first argument in the functions to\n", "    # be able to access the attributes of the class object.\n", "    def __init__(self, add_bedrooms_per_room = True):\n", "        self.add_bedrooms_per_room = add_bedrooms_per_room\n", "\n", "    # Now, we define the fit-method, but there is nothing to do\n", "    # here, so we only return the object itself, as you might\n", "    # have noticed before.\n", "    def fit(self, X, y = None):\n", "        return self\n", "\n", "    # Here, we define the transform method. We want to append a new\n", "    # column that gives the number of bedrooms per room\n", "    def transform(self, X, y = None):\n", "        if self.add_bedrooms_per_room:\n", "            new_var = X[:,ix_beds] / X[:,ix_rooms]\n", "            return np.c_[X,new_var]\n", "        else:\n", "            return X\n", "\n", "# We derive our new class from BaseEstimator and TransformerMixin\n", "class AddRoomsPerHousehold(BaseEstimator, TransformerMixin):\n", "\n", "    # The constructor in Python is defined by the method __init__,\n", "    # we have to pass self as a first argument in the functions to\n", "    # be able to access the attributes of the class object.\n", "    # The argument\n", "    #     add_rooms_per_household = True\n", "    # is a parameter with default value 'True'.\n", "    def __init__(self, add_rooms_per_household = True):\n", "        self.add_rooms_per_household = add_rooms_per_household\n", "\n", "    # Now, we define the fit-method, but there is nothing to do\n", "    # here, so we only return the object itself, as you might\n", "    # have noticed before.\n", "    def fit(self, X, y = None):\n", "        return self\n", "\n", "    # Here, we define the transform method. 
We want to append a new\n", "    # column that gives the number of rooms per household\n", "    def transform(self, X, y = None):\n", "\n", "        # We add the 'rooms_per_household' attribute only\n", "        # if add_rooms_per_household = True\n", "        if self.add_rooms_per_household:\n", "            new_var = X[:,ix_rooms] / X[:,ix_households]\n", "            return np.c_[X,new_var]\n", "        else:\n", "            return X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation of custom transformer `AttributeSelector`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "class AttributeSelector(BaseEstimator, TransformerMixin):\n", "\n", "    def __init__(self, attributes):\n", "        self.attributes = attributes\n", "\n", "    def fit(self, X, y = None):\n", "        return self # This again does nothing\n", "\n", "    def transform(self, X, y = None):\n", "        return X.loc[:,self.attributes].values\n", "\n", "num_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',\n", "            'total_bedrooms', 'population', 'households', 'median_income']\n", "cat_cols = ['ocean_proximity']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation of custom transformer `PipelineBinarizer`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelBinarizer\n", "\n", "class PipelineBinarizer(LabelBinarizer):\n", "    def fit(self, X, y=None):\n", "        # Return the fitted object so that fit() can be chained\n", "        return super(PipelineBinarizer, self).fit(X)\n", "\n", "    def transform(self, X, y=None):\n", "        return super(PipelineBinarizer, self).transform(X)\n", "\n", "    def fit_transform(self, X, y=None):\n", "        return super(PipelineBinarizer, self).fit(X).transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## We arrived at a pipeline for the quantitative features ..." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.impute import SimpleImputer\n", "\n", "num_pipeline = Pipeline([('selection', AttributeSelector(num_cols)),\n", "                         ('fill_nas', SimpleImputer(strategy='median')),\n", "                         ('add_rooms_per_household', AddRoomsPerHousehold()),\n", "                         ('add_bedrooms_per_room', AddBedroomsPerRoom()),\n", "                         ('scaling', StandardScaler())])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ... and one for the categorical features" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "cat_pipeline = Pipeline([('selection', AttributeSelector(cat_cols)),\n", "                         ('bin', PipelineBinarizer())])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## With the function `FeatureUnion`, data preparation became a one-liner" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import FeatureUnion\n", "\n", "unite_features = FeatureUnion([('num_pipe', num_pipeline),\n", "                               ('cat_pipe', cat_pipeline)])\n", "\n", "X = unite_features.fit_transform(train)\n", "Xtest = unite_features.transform(test)\n", "y = train['median_house_value'].values\n", "ytest = test['median_house_value'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In the end, we applied a simple linear regression\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE for linear model 68219.00\n", "RMSE for mean prediction 115546.39\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(X, y)\n", "lin_rsme = np.sqrt(mean_squared_error(lin_reg.predict(X),y))\n", "print('RMSE for linear 
model %8.2f' % lin_rsme)\n", "null_rsme = np.sqrt(mean_squared_error(y.mean()*np.ones_like(y),y))\n", "print('RMSE for mean prediction %8.2f' % null_rsme)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision trees\n", "\n", "**Task**: Apply a regression decision tree to our (training) data. Use the `DecisionTreeRegressor` from the module `sklearn.tree` and determine the RMSE (root mean squared error)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, this looks like a perfect fit!\n", "But as you all know, our model might be overfitted.\n", "We don't want to waste our precious test set on checking this claim.\n", "Therefore, we are going to check this with the well-known cross-validation function `cross_val_score`.\n", "\n", "**Task**: Determine the mean of the cross-validated RMSE for the (standard) decision tree model using 10 folds (`cv = 10`).\n", "Set the correct scoring option, see also [here](https://scikit-learn.org/stable/modules/model_evaluation.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the story is quite different.\n", "Our simple linear regression seems to outperform the decision tree model.\n", "But this might be due to the [standard options](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for a `DecisionTreeRegressor` in `sklearn`.\n", "Try to modify the standard options such that the mean of the cross-validated RMSE (root mean squared error) is less than $60000$. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lecture you learned about different methods for decreasing the probability of overfitting: pruning, bagging, boosting and random forests, see [Chapter 8 in the lecture notes](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/Slides/ds-intro-chapter8.pdf).\n", "\n", "Here, we want to test the `RandomForestRegressor`, provided in the `sklearn.ensemble` module.\n", "\n", "**Task**: Implement a `RandomForestRegressor` with `n_estimators = 10`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you take a look at the standard options of a `RandomForestRegressor`, you might observe that there are more than 10 different options (also called *hyperparameters*).\n", "Suppose, that you want to try at least a number of different settings.\n", "\n", "One way to do this would be the trial-and-error approach: Try one setting, then change some of the options. 
Repeat this procedure until you are happy with the model.\n", "\n", "A more rigorous approach can be executed by specifying a Cartesian parameter grid.\n", "Suppose you want to try the values `1, 5, 10` for `min_samples_leaf` and `4, 6, 8, 10` for `max_features`.\n", "Then you have to train $3 \\cdot 4 = 12$ models, each with a different combination of the `min_samples_leaf` and `max_features` hyperparameters.\n", "If you also want to check whether bootstrapping the samples might be advantageous, you end up with $2\\cdot3\\cdot4 = 24$ different models.\n", "\n", "Of course, you can implement this by hand.\n", "Fortunately, `sklearn` provides a nice way to do this with `GridSearchCV` in the module `sklearn.model_selection`.\n", "The grid has to be provided by a dictionary\n", "\n", "    params = { 'bootstrap': [True, False],\n", "               'min_samples_leaf': [1, 5, 10],\n", "               'max_features': [4, 6, 8, 10] }\n", "\n", "**Task**: Use `GridSearchCV` with `cv = 5` and `scoring='neg_mean_squared_error'` to evaluate random forest models (with `n_estimators = 10`) with all combinations of hyperparameters in the dictionary `params`.\n", "Don't forget to call the `fit()` method using our training data (this step can take up to a minute).\n", "\n", "**Question**: How many different models do you have to fit during the training?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Which setting leads to the best results, i.e., the lowest RMSE?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }