Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below.

Rename this problem sheet as follows:

 ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}
 
for example
 
 ps2_blja_problem1

Submit your homework within one week, by next Monday, 9 a.m.

In [None]:
NAME = ""
EMAIL = ""
USERNAME = ""

---

# Introduction to Data Science

## Lab 12 - Is scaling always important?
In this problem, we consider a new data set.
The data set consists of 1599 samples (1278 for training and 321 for testing) of a particular red wine from Minho, Portugal, called *Vinho verde*.
The first 11 columns in the csv file contain different measurements, the last column contains an expert rating of the quality.
This data set became popular through a Kaggle competition, but is also publicly available [here](http://www3.dsi.uminho.pt/pcortez/wine/).
The data set is also available on our [webpage](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/).

**Task (1 point)**: Download the new csv-files `wine-train.csv` and `wine-test.csv` from the lecture's webpage.

Import the data from the csv-file `wine-train.csv` and store it in the `pandas.DataFrame` `df`.
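
As a reminder of how `pandas.read_csv` behaves, here is a minimal sketch that reads a made-up, comma-separated excerpt from an in-memory string instead of a file. The text `csv_text` and the name `demo` are assumptions for illustration only. If a file uses a different delimiter, it can be passed via the `sep` parameter.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for a real csv file
csv_text = "fixed acidity,pH,quality\n7.4,3.51,5\n7.8,3.20,6\n"

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)          # (number of rows, number of columns)
print(list(demo.columns))  # column labels taken from the header line
```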

In [None]:
import pandas as pd

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.shape == (1278,12)
assert df.columns[5] == 'free sulfur dioxide'
assert abs(df['pH'].mean() - 3.2998122065727697) < 1e-8

**Task (1 point)**: Our target variable is stored in the column labeled `'quality'`.
Extract the values of this column into a `numpy.array` named `y` and the remaining data into another array `X`.
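
One possible way to split a `DataFrame` into a target vector and a predictor matrix, shown on a made-up three-row frame. The frame `toy` and the names `X_toy` and `y_toy` are illustrative and not part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up data with two predictors and one target column
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

# Target: the values of a single column as a 1-d array
y_toy = toy["quality"].to_numpy()

# Predictors: every column except the target as a 2-d array
X_toy = toy.drop(columns="quality").to_numpy()

print(X_toy.shape, y_toy.shape)
```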

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert X.shape == (1278,11)
assert y.shape == (1278,)
assert abs(y.mean() - 5.663536776212832) < 1e-8
assert X.dtype == 'float64'

Now we want to look at the coefficient selection for both the scaled and the unscaled case.
In this example, *scaled* means that we shift the mean of each feature to *zero* and scale its standard deviation to *one*.
This is easily done with the `StandardScaler` provided in `sklearn.preprocessing`.

**Task (1 point)**: Set up a `StandardScaler` named `stdScaler` and normalize your predictor matrix `X` using the method `fit_transform()`.
Store the scaled matrix as `Xscaled`.
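
Written out, standardization replaces each entry by $z_{ij} = (x_{ij} - \mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of column $j$. The sketch below checks this on a small made-up matrix `A` (an assumption for illustration) and compares `StandardScaler` with the manual computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix whose columns live on very different scales
A = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

sc = StandardScaler()
A_scaled = sc.fit_transform(A)  # learns column means/stds, then transforms

# The same computation by hand; StandardScaler uses the
# population standard deviation (ddof=0), like numpy's default
A_manual = (A - A.mean(axis=0)) / A.std(axis=0)

print(np.allclose(A_scaled, A_manual))
```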

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert max(abs(Xscaled.mean(axis=0))) < 1e-10
assert max(abs(Xscaled.std(axis=0)-1)) < 1e-10

We want to compare the coefficients obtained for the scaled and unscaled predictors.

**Task (2 points)**: Fill in the two gaps in the following code cell.

When done correctly, it computes the *Lasso* estimates for `m` different values of the regularization parameter $\alpha$ and stores the coefficients as well as the cross-validation score for each $\alpha$ for both the **scaled and unscaled data**.

Finally, it plots the coefficients in the upper part of the figure, and the corresponding cv-scores in the lower part.
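
For orientation: scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$, so a larger $\alpha$ drives more coefficients to exactly zero. The following is a minimal sketch of the fit, store, and score pattern on synthetic data. The arrays `X_syn`, `y_syn`, `alphas`, `coefs`, and `scores` are assumptions made for illustration and are not the variables checked by the tests in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: the third predictor has no influence on the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + 0.1 * rng.normal(size=60)

alphas = np.logspace(-3, 1, 5)
coefs = np.zeros((len(alphas), X_syn.shape[1] + 1))
scores = np.zeros(len(alphas))

for i, a in enumerate(alphas):
    model = Lasso(alpha=a, tol=1e-8)
    model.fit(X_syn, y_syn)
    coefs[i, 0] = model.intercept_   # intercept goes into the first column
    coefs[i, 1:] = model.coef_       # remaining columns hold the coefficients
    # cross_val_score clones and refits the estimator on each fold
    scores[i] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

print(coefs[-1, 1:])  # coefficients for the largest alpha
```

Note that the skeleton in the next cell calls `cross_val_score(lm, ...)`, so it expects the estimator to be named `lm`.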

In [None]:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score

# Get dimensions of X
n,p = X.shape
m = 50
Alpha = np.logspace(-4,1,m)

cscaled = np.zeros((m,p+1))
corig = np.zeros((m,p+1))
cvscaled = np.zeros((m,))
cvorig = np.zeros((m,))

for (i,a) in enumerate(Alpha):
 
 # Task: Perform a Lasso regression with tol = 1e-8
 # using the scaled input and current alpha
 # Store the learned coefficients in the array cscaled with
 # the intercept being stored in the first column of cscaled
 # YOUR CODE HERE
 raise NotImplementedError()

 # Additionally, we perform a cross-validation for the current model
 # with the scaled data
 cvscaled[i] = cross_val_score(lm, Xscaled, y, cv=10).mean()
 
 # Task: Perform a Lasso regression with tol = 1e-8
 # using the original input and current alpha
 # Store the learned coefficients in the array corig with
 # the intercept being stored in the first column of corig 
 # YOUR CODE HERE
 raise NotImplementedError()

 # Additionally, we perform a cross-validation for the current model
 # with the unscaled data
 cvorig[i] = cross_val_score(lm, X, y, cv=10).mean()
 
# Plot the output
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=(15,10)
fig, ax = plt.subplots(2,2)
ax[0][0].semilogx(Alpha, cscaled[:,1:])
ax[0][0].set_title('Scaled predictors')
ax[0][0].set_xlabel('Alpha')
ax[0][0].set_ylabel('Coefficients');

ax[0][1].semilogx(Alpha, corig[:,1:])
ax[0][1].set_title('Unscaled predictors')
ax[0][1].set_xlabel('Alpha')
ax[0][1].set_ylabel('Coefficients');

ax[1][0].semilogx(Alpha, cvscaled)
ax[1][0].set_xlabel('Alpha')
ax[1][0].set_ylabel('cv-score');
ax[1][1].semilogx(Alpha, cvorig);
ax[1][1].set_xlabel('Alpha')
ax[1][1].set_ylabel('cv-score');

In [None]:
assert abs(corig.mean() - 0.29114416793438974) < 1e-6
assert abs(cscaled.mean() - 0.4755654903995176) < 1e-6

You should make the following observations:
- the ranges of the coefficients differ by roughly a factor of 10
- in the unscaled model, some predictors do not enter the model even for the smallest regularization parameter
- the order in which variables enter or leave the model differs

**Task (1 point)**:
Compute the values of $\alpha$ for which the cv-scores are maximized and store them in the variables `alphascaled` and `alphaorig`.
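
Locating the maximizer of a score array is a one-liner with `numpy.argmax`. A minimal sketch on made-up numbers follows; `alphas_demo`, `scores_demo`, and `best_alpha` are hypothetical names.

```python
import numpy as np

alphas_demo = np.logspace(-2, 0, 3)          # 0.01, 0.1, 1.0
scores_demo = np.array([0.30, 0.35, 0.10])   # made-up cv-scores

# argmax returns the position of the largest score;
# indexing the alpha grid with it yields the best alpha
best_alpha = alphas_demo[np.argmax(scores_demo)]

print(best_alpha)
```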

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(alphascaled - 0.01757510624854793) < 1e-8

**Task (1 point)**: Store the indices of the predictors selected by the best models (here, the models with the highest cv-score) for both the scaled and the unscaled data in the variables `idxscaled` and `idxorig`.
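
A predictor counts as selected when its Lasso coefficient is nonzero. The sketch below, on a made-up coefficient row, shows how `numpy.flatnonzero` returns such indices. Because the coefficient arrays above keep the intercept in the first column, that entry is skipped. The names `row` and `selected` are illustrative.

```python
import numpy as np

# Made-up coefficient row: intercept first, then four predictors
row = np.array([5.6, 0.0, -0.12, 0.0, 0.31])

# Skip the intercept, then collect the positions of nonzero coefficients
selected = np.flatnonzero(row[1:])

print(selected)
```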

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(idxscaled) == 7
assert len(idxorig) == 11

**Task (2 points)**: Now we want to compare the training mean squared errors for both regressions using the value of $\alpha$ which maximizes the cv-score.
Store the MSEs in the variables `msescaled` and `mseorig`.
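
The mean squared error is $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. A small sketch on made-up values compares `sklearn.metrics.mean_squared_error` with the manual formula; `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([5.0, 6.0, 5.0, 7.0])   # made-up targets
y_pred = np.array([5.5, 6.0, 4.5, 6.0])   # made-up predictions

mse_sklearn = mean_squared_error(y_true, y_pred)
mse_manual = np.mean((y_true - y_pred) ** 2)

print(mse_sklearn, mse_manual)
```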

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(mseorig - 0.4161264558967955) < 1e-8

In this case, scaling does not improve the mean squared error on the training data.
In fact, you should observe that the MSE using the unscaled data is only slightly better than the MSE using the scaled data.

**Task (1 point)**: Now import the csv-file `wine-test.csv` and store the predictors as a numpy array `Xtest` and the target variable as `ytest`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert abs(Xtest.mean() - 8.069555417728688) < 1e-8
assert abs(ytest.mean() - 5.526479750778816) < 1e-8
assert Xtest.shape == (321, 11)
assert ytest.shape == (321,)

**Task (1 point)**: Compute the test MSEs for the unscaled and scaled data and store their values in the variables `testmseorig` and `testmsescaled`.

Don't forget to scale the test data using the previously trained `StandardScaler`.
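
The reason for reusing the fitted scaler: test data must be transformed with the mean and standard deviation learned from the training data, so `transform()` is called instead of `fit_transform()`. A minimal sketch on made-up one-column arrays (`train` and `test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # made-up training column
test = np.array([[2.0], [4.0]])           # made-up test column

sc = StandardScaler().fit(train)   # statistics come from the training data only
test_scaled = sc.transform(test)   # reuse the training mean and std

# The scaled test data generally does not have mean 0 and std 1
print(test_scaled.ravel())
```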

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print('Test error on orig set:\n\t {}'.format(testmseorig))
print('Test error on scaled set:\n\t {}'.format(testmsescaled))

In [None]:
assert abs(testmseorig - 0.4355191924296641) < 1e-8
assert abs(testmsescaled - 0.439533544980072) < 1e-8

**Task (2 points)**: Interpret the results. What could be the reason that the unscaled model performs (slightly) better than the scaled model?
In absolute terms, are your predictions good or bad?
Which model do you prefer? Why?

In [None]:
# You should use this cell to study both models further

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE