# Problem Sheet 11 - A Complete scikit-learn Project - Part 2

In the previous lab we discussed a complete scikit-learn project.
In the end, we applied a linear regression algorithm to our transformed data.
This was a lot of new material, including the definition of custom transformers, setting up `sklearn` pipelines, etc.
To emphasize the advantages of this approach, we will next apply different regression models to our (training) data.
Then, we consider decision tree models as well as random forests and try to determine a set of *good* model parameters (aka hyperparameters).

Before we go into details, we recall the main steps of Problem Sheet 10.

## We started by loading the data.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set standard figure size
plt.rcParams['figure.figsize'] = (16,9)

# Read csv file
df = pd.read_csv('datasets/housing.csv')

## Stratified train-test splitting

In [3]:
from sklearn.model_selection import train_test_split

income_cat = pd.qcut(df.median_income,5)

ttsplit = train_test_split(df, test_size = 0.2, random_state=1, stratify=income_cat)

train, test = ttsplit
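
To see what the `stratify` argument does, the following sketch compares the share of each income category in the full data and in the test split. It is only an illustration under assumptions: synthetic log-normal incomes stand in for `median_income` so that the snippet runs without `datasets/housing.csv`, the names `demo`, `demo_cat`, `demo_train` and `demo_test` are hypothetical, and `labels=False` is used to obtain integer category codes instead of intervals.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: skewed incomes)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'median_income': rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Five equally populated income categories, encoded as the integers 0..4
demo_cat = pd.qcut(demo.median_income, 5, labels=False)

demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=1,
                                         stratify=demo_cat)

# Share of each income category in the full data and in the test split
full_share = demo_cat.value_counts(normalize=True).sort_index()
test_share = demo_cat.loc[demo_test.index].value_counts(normalize=True).sort_index()
print(pd.concat([full_share, test_share], axis=1, keys=['full', 'test']))
```

With stratification, the two columns should agree closely; a purely random split gives no such guarantee for small categories.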

## Implementation of custom transformers for new features

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

ix_rooms = 3
ix_beds = 4
ix_households = 6

# We derive our new class from BaseEstimator and TransformerMixin
class AddBedroomsPerRoom(BaseEstimator, TransformerMixin):

    # The constructor in Python is defined by the method __init__.
    # We have to pass self as the first argument of each method to
    # be able to access the attributes of the class object.
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    # Now, we define the fit method, but there is nothing to do
    # here, so we only return the object itself, as you might
    # have noticed before.
    def fit(self, X, y = None):
        return self

    # Here, we define the transform method. We want to append a new
    # column that gives the number of bedrooms per room.
    def transform(self, X, y = None):
        if self.add_bedrooms_per_room:
            new_var = X[:,ix_beds] / X[:,ix_rooms]
            return np.c_[X,new_var]
        else:
            return X
 
# We derive our new class from BaseEstimator and TransformerMixin
class AddRoomsPerHousehold(BaseEstimator, TransformerMixin):

    # The constructor in Python is defined by the method __init__.
    # We have to pass self as the first argument of each method to
    # be able to access the attributes of the class object.
    # The argument
    #     add_rooms_per_household = True
    # is a parameter with default value 'True'.
    def __init__(self, add_rooms_per_household = True):
        self.add_rooms_per_household = add_rooms_per_household

    # Now, we define the fit method, but there is nothing to do
    # here, so we only return the object itself, as you might
    # have noticed before.
    def fit(self, X, y = None):
        return self

    # Here, we define the transform method. We want to append a new
    # column that gives the number of rooms per household.
    def transform(self, X, y = None):

        # We add the 'rooms_per_household' attribute only
        # if add_rooms_per_household = True
        if self.add_rooms_per_household:
            new_var = X[:,ix_rooms] / X[:,ix_households]
            return np.c_[X,new_var]
        else:
            return X
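
The reason for deriving from both base classes can be made visible with a small sketch: `BaseEstimator` supplies `get_params` and `set_params`, and `TransformerMixin` supplies `fit_transform`. The class `AddRatio` and the array `toy` below are hypothetical and serve only as an illustration; they are not part of Problem Sheet 10.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddRatio(BaseEstimator, TransformerMixin):
    """Hypothetical demo transformer: appends column `num` divided by column `den`."""

    def __init__(self, num=0, den=1, add_ratio=True):
        self.num = num
        self.den = den
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, self.num] / X[:, self.den]]

toy = np.array([[6.0, 2.0],
                [9.0, 3.0]])

adder = AddRatio()
print(adder.get_params())          # inherited from BaseEstimator
print(adder.fit_transform(toy))    # inherited from TransformerMixin

adder.set_params(add_ratio=False)  # the kind of switch a parameter search can toggle
print(adder.transform(toy).shape)
```

Because the constructor arguments are stored under the same names as attributes, `get_params` and `set_params` work without any further code.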

## Implementation of custom transformer `AttributeSelector`

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin

class AttributeSelector(BaseEstimator, TransformerMixin):

    def __init__(self, attributes):
        self.attributes = attributes

    def fit(self, X, y = None):
        return self # This again does nothing

    def transform(self, X, y = None):
        return X.loc[:,self.attributes].values

num_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
            'total_bedrooms', 'population', 'households', 'median_income']
cat_cols = ['ocean_proximity']

## Implementation of custom transformer `PipelineBinarizer`

In [6]:
from sklearn.preprocessing import LabelBinarizer

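class PipelineBinarizer(LabelBinarizer):
    # LabelBinarizer expects only the labels; these wrappers accept
    # the (X, y) call signature that a Pipeline uses.
    def fit(self, X, y=None):
        super(PipelineBinarizer, self).fit(X)
        return self  # fit should return the fitted object

    def transform(self, X, y=None):
        return super(PipelineBinarizer, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(PipelineBinarizer, self).fit(X).transform(X)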

## We arrived at a pipeline for the quantitative features ...

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([('selection', AttributeSelector(num_cols)),
                         ('fill_nas', SimpleImputer(strategy='median')),
                         ('add_rooms_per_household', AddRoomsPerHousehold()),
                         ('add_bedrooms_per_room', AddBedroomsPerRoom()),
                         ('scaling', StandardScaler())])

## ... and one for the categorical features

In [8]:
cat_pipeline = Pipeline([('selection', AttributeSelector(cat_cols)),
                         ('bin', PipelineBinarizer())])

## With the function `FeatureUnion`, data preparation became a one-liner

In [9]:
from sklearn.pipeline import FeatureUnion

unite_features = FeatureUnion([('num_pipe', num_pipeline),
                               ('cat_pipe', cat_pipeline)])

X = unite_features.fit_transform(train)
Xtest = unite_features.transform(test)
y = train['median_house_value'].values
ytest = test['median_house_value'].values

## In the end, we applied a simple linear regression


In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(X, y)
# mean_squared_error expects the true values first, then the predictions
lin_rmse = np.sqrt(mean_squared_error(y, lin_reg.predict(X)))
print('RMSE for linear model %8.2f' % lin_rmse)
null_rmse = np.sqrt(mean_squared_error(y, y.mean()*np.ones_like(y)))
print('RMSE for mean prediction %8.2f' % null_rmse)


RMSE for linear model 68219.00
RMSE for mean prediction 115546.39
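
The mean-prediction baseline has a simple interpretation: the root mean squared error of the constant prediction `y.mean()` is exactly the (population) standard deviation of the target, so the second number above measures the spread of the house values themselves. The sketch below checks this identity on synthetic targets; the array `y_demo` is an assumption and not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = rng.normal(loc=200000, scale=100000, size=500)  # synthetic targets (assumption)

# Root mean squared error of the constant prediction y_demo.mean()
null_rmse_demo = np.sqrt(np.mean((y_demo - y_demo.mean()) ** 2))

# Population standard deviation (ddof=0 is NumPy's default)
print(null_rmse_demo, np.std(y_demo))
```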


## Decision trees

**Task**: Apply a regression decision tree to our (training) data. Use the `DecisionTreeRegressor` from the module `sklearn.tree` and determine the RMSE (root mean squared error).

Wow, this looks like a perfect fit!
But as you all know, our model might be overfitted.
We don't want to waste our precious test set on checking this.
Therefore, we check it with the cross-validation function `cross_val_score` from `sklearn.model_selection`.

**Task**: Determine the mean of the cross-validated RMSE for the (default) decision tree model using 10 folds (`cv = 10`).
Set the correct scoring option, see also [here](https://scikit-learn.org/stable/modules/model_evaluation.html).
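
As a hedged illustration of the mechanics (not a solution for the housing data), the following sketch compares the training error of a fully grown tree with its cross-validated error. The data come from `make_regression` and are purely synthetic, and the names `X_demo`, `y_demo` and `tree_demo` are hypothetical. Note that `sklearn` scorers follow the convention "larger is better", which is why the mean squared error is reported with a negative sign.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0,
                                 random_state=0)

# Error on the data the tree was trained on
tree_demo = DecisionTreeRegressor(random_state=0)
tree_demo.fit(X_demo, y_demo)
train_rmse = np.sqrt(mean_squared_error(y_demo, tree_demo.predict(X_demo)))

# Error on held-out folds: negate the scores before taking the square root
neg_mse = cross_val_score(DecisionTreeRegressor(random_state=0), X_demo, y_demo,
                          cv=10, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-neg_mse)

print('training RMSE        %8.2f' % train_rmse)
print('cross-validated RMSE %8.2f' % cv_rmse.mean())
```

An unrestricted tree can memorize its training samples, so the first number says little about how the model behaves on unseen data.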

Now, the story is quite different.
Our simple linear regression seems to outperform the decision tree model.
But this might be due to the [default options](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for a `DecisionTreeRegressor` in `sklearn`.
Try to modify the default options such that the mean of the cross-validated RMSE (root mean squared error) is less than $60000$.
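
Options such as `min_samples_leaf` or `max_depth` limit how far a tree can grow. One way to compare a few values is a simple loop over candidate settings, sketched here on synthetic data (an assumption; the values that work well for the housing data still have to be found by experiment). The dictionary `cv_rmse_by_leaf` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0,
                                 random_state=0)

cv_rmse_by_leaf = {}
for leaf in [1, 5, 10, 20]:
    # Larger values of min_samples_leaf give smaller, less flexible trees
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0)
    neg_mse = cross_val_score(tree, X_demo, y_demo, cv=10,
                              scoring='neg_mean_squared_error')
    cv_rmse_by_leaf[leaf] = np.sqrt(-neg_mse).mean()
    print('min_samples_leaf = %2d: CV RMSE %8.2f' % (leaf, cv_rmse_by_leaf[leaf]))
```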

In the lecture you learned about different methods for decreasing the probability of overfitting: pruning, bagging, boosting and random forests, see [Chapter 8 in the lecture notes](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/Slides/ds-intro-chapter8.pdf).

Here, we want to test the `RandomForestRegressor`, provided in the `sklearn.ensemble` module.

**Task**: Implement a `RandomForestRegressor` with `n_estimators = 10`.
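
A random forest averages the predictions of `n_estimators` individual trees, each of which is by default trained on a bootstrap sample of the data. The sketch below shows the calls involved on synthetic data (an assumption, not the housing data); the names `forest_demo` and `forest_cv_rmse` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0,
                                 random_state=0)

forest_demo = RandomForestRegressor(n_estimators=10, random_state=0)

# Cross-validated RMSE, computed in the same way as for the decision tree
neg_mse = cross_val_score(forest_demo, X_demo, y_demo, cv=10,
                          scoring='neg_mean_squared_error')
forest_cv_rmse = np.sqrt(-neg_mse).mean()
print('cross-validated RMSE %8.2f' % forest_cv_rmse)

# After fitting, the individual trees are available in estimators_
forest_demo.fit(X_demo, y_demo)
print(len(forest_demo.estimators_))
```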

If you take a look at the default options of a `RandomForestRegressor`, you might observe that there are more than 10 different options (also called *hyperparameters*).
Suppose that you want to try a number of different settings.

One way to do this would be the trial-and-error approach: Try one setting, then change some of the options. Repeat this procedure until you are happy with the model.

A more rigorous approach can be executed by specifying a Cartesian parameter grid.
Suppose you want to try the values `1, 5, 10` for `min_samples_leaf` and `4, 6, 8, 10` for `max_features`.
Then, you should train $3 \cdot 4 = 12$ models, each with a different combination of the `min_samples_leaf` and `max_features` hyperparameters.
If you also want to check whether bootstrapping the samples might be advantageous, you end up with $2\cdot3\cdot4 = 24$ different models.

Of course, you can implement this by hand.
Fortunately, `sklearn` provides a nice way to do this with `GridSearchCV` in the module `sklearn.model_selection`.
The grid has to be provided as a dictionary:

    params = { 'bootstrap': [True, False],
               'min_samples_leaf': [1, 5, 10],
               'max_features': [4, 6, 8, 10] }
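
What "Cartesian grid" means can be checked by hand with the standard library. The following sketch enumerates all combinations of the dictionary above with `itertools.product`; the names `names` and `combinations` are hypothetical helpers.

```python
from itertools import product

params = {'bootstrap': [True, False],
          'min_samples_leaf': [1, 5, 10],
          'max_features': [4, 6, 8, 10]}

# Cartesian product of all value lists: one dict per hyperparameter combination
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

print(len(combinations))  # 2 * 3 * 4 = 24
print(combinations[0])
```

`sklearn.model_selection.ParameterGrid(params)` enumerates the same combinations, possibly in a different order.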
 
**Task**: Use `GridSearchCV` with `cv = 5` and `scoring='neg_mean_squared_error'` to evaluate random forest models (with `n_estimators = 10`) with all combinations of hyperparameters in the dictionary `params`.
Don't forget to call the `fit()` method using our training data (this step can take up to a minute).

**Question**: How many different models do you have to fit during the training?


**Answer**:

**Task**: Which setting leads to the best results, i.e., the lowest RMSE?
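
The attributes that help with this kind of question are `best_params_`, `best_score_` and `cv_results_` of a fitted `GridSearchCV` object. The sketch below shows how they can be read out. It runs on synthetic data with a deliberately small grid (both are assumptions made to keep the run time short), so its output says nothing about the housing data; the names `small_grid`, `search` and `results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=20.0,
                                 random_state=0)

# Deliberately small grid (assumption) to keep the run time short
small_grid = {'min_samples_leaf': [1, 5],
              'max_features': [2, 4]}

search = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=0),
                      small_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE of the best setting
print('best setting:', search.best_params_)
print('best CV RMSE: %8.2f' % np.sqrt(-search.best_score_))

# All settings, sorted by the square root of their mean cross-validated MSE
results = sorted(zip(np.sqrt(-search.cv_results_['mean_test_score']),
                     search.cv_results_['params']),
                 key=lambda pair: pair[0])
for rmse, setting in results:
    print('%8.2f  %s' % (rmse, setting))
```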