Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below.

Rename this problem sheet as follows:

 ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}
 
for example
 
 ps2_blja_problem1

Submit your homework within one week, by next Monday at 9 a.m.

In [None]:
NAME = ""
EMAIL = ""
USERNAME = ""

---

# Introduction to Data Science
## Lab 12: Ridge and Lasso regression

In the last exercise we looked at subset selection techniques for linear regression models.
These methods used standard linear regression on all (or a subset of) possible models incorporating different numbers of predictors.

In this exercise we consider two common shrinkage techniques for model regularization and feature selection.
These techniques have long been established in mathematical optimization, and they have attracted interest in data science because they shrink the coefficients of a linear model.
Shrinking the coefficients is advantageous because it lets us trade off variance against bias in our model.

We start this lab by exploring the methods provided in `scikit-learn`.
In the first two problems, we consider the diabetes data set.
The goal of these problems is to understand the two main estimator classes for shrinkage, i.e., `sklearn.linear_model.Ridge` and `sklearn.linear_model.Lasso`.

### Part A - Ridge regression (aka Tikhonov regularization)

**Task (1 point)**: The following code cell loads the diabetes data set into a variable `dia` (its data type is `sklearn.utils.Bunch`, which behaves similarly to a `dict`).
Set up a `pandas.DataFrame` named `df` which uses the correct column titles and contains the 10 predictor variables.

**Hint**: The command `print(dia.DESCR)` displays a description of the data set.
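
To make the dict-like behaviour of `sklearn.utils.Bunch` concrete, the following sketch inspects the object returned by `load_diabetes()`. The variable name `bunch` is chosen here for illustration only; the point is that every entry can be reached both as `bunch["key"]` and as the attribute `bunch.key`.

```python
from sklearn.datasets import load_diabetes

bunch = load_diabetes()  # illustrative name; the notebook itself uses `dia`

# A Bunch exposes its entries both as dictionary keys and as attributes.
print(sorted(bunch.keys()))
print(bunch["data"] is bunch.data)

# Shapes of the predictor matrix and the target vector
print(bunch.data.shape, bunch.target.shape)

# Column titles of the predictors
print(bunch.feature_names)
```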

In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd

dia = load_diabetes()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.shape == (442,10)
assert df.columns[4] == 's1'
assert abs(df.iloc[20,6] - 0.000778807997017968) < 1e-7

**Task (1 point)**: Append a column with the target variable to the data frame `df`. Name the column `target`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df.shape == (442,11)
assert abs(df.target.mean() - 152.13348416289594) < 1e-7

**Task (1 point)**: Split your data randomly into a training set `X_train, y_train` and a test set `X_test, y_test`, where `X` holds the 10 predictor variables and `y` the target variable.
Use the function

 from sklearn.model_selection import train_test_split
 
with `random_state=1`.

Your test set should contain approx. 30\% of the data.


**Hint**: Use the appropriate optional parameter.
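
As a rough illustration of how `train_test_split` behaves, here is a minimal sketch on made-up toy data (the arrays `X_toy`, `y_toy` and the values `test_size=0.25`, `random_state=0` are assumptions for this example only and differ from what the task asks for).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the diabetes set): 20 samples, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# test_size=0.25 puts a quarter of the samples into the test set;
# random_state fixes the shuffle so that the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```

Calling the function again with the same `random_state` yields the same split, which is what makes the assertions in this lab reproducible.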

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert X_train.shape == (309,10)
assert y_train.shape == (309,)
assert abs(X_test.mean() - -0.0029129190427152033) < 1e-8

We now apply ridge regression for `m` different values of the regularization parameter $\alpha$.
As you know from the lecture, ridge regression adds a penalty term to the RSS term in standard linear regression, i.e., instead of considering the optimization problem

$$ \min_{\beta \in \mathbb{R}^{p+1}} \|y - X \beta\|_2^2 = \min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n \left( y_i - \sum_{j=0}^p x_{i,j} \beta_j \right)^2 $$

in **ridge regression** we solve the regularized problem

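$$ \min_{\beta \in \mathbb{R}^{p+1}} \|y - X \beta\|_2^2 + \alpha \| \beta \|_2^2 = \min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n \left( y_i - \sum_{j=0}^p x_{i,j} \beta_j \right)^2 + \alpha \sum_{j=1}^p \beta_j^2$$

Note that the penalty sum starts at $j=1$: the intercept $\beta_0$ is not penalized, so the norm $\| \beta \|_2^2$ in the penalty term is understood without $\beta_0$.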
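
The ridge problem above has a closed-form solution, which can be checked numerically. The following is a minimal sketch on synthetic data (the arrays `X_syn`, `y_syn` and the value `alpha = 2.0` are assumptions for illustration): after centering, which takes care of the unpenalized intercept, the coefficients solve $(X^\top X + \alpha I)\beta = X^\top y$, and the result is compared with `sklearn.linear_model.Ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 50 samples, 3 predictors, intercept 4
X_syn = rng.normal(size=(50, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)
alpha = 2.0

# Centering removes the unpenalized intercept from the problem.
X_c = X_syn - X_syn.mean(axis=0)
y_c = y_syn - y_syn.mean()

# Closed-form ridge solution: (X^T X + alpha I) beta = X^T y
beta = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(3), X_c.T @ y_c)
beta_0 = y_syn.mean() - X_syn.mean(axis=0) @ beta

# Compare with scikit-learn's estimator
model = Ridge(alpha=alpha).fit(X_syn, y_syn)
print(np.allclose(beta, model.coef_), np.isclose(beta_0, model.intercept_))
```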

The following code fragment performs ridge regression for different values of $\alpha$ and stores the coefficients in an array called `Coeffs`.
Afterwards, the coefficients are plotted against the regularization parameter.
If you named your training and test data `X_train, X_test` and `y_train, y_test`, the following code cell should be executable.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
%matplotlib inline

# Get dimensions of X_train
n, p = X_train.shape
m = 50                          # number of regularization parameters
Alpha = np.logspace(-4, 4, m)   # alpha values from 1e-4 to 1e4
Coeffs = np.zeros((m, p + 1))   # row i: intercept followed by the p coefficients

for i, a in enumerate(Alpha):
    lmr = Ridge(alpha=a)
    lmr.fit(X_train, y_train)
    Coeffs[i, 0] = lmr.intercept_
    Coeffs[i, 1:] = lmr.coef_

# Plot the coefficient paths (including the intercept) against alpha
plt.semilogx(Alpha, Coeffs)
plt.xlabel('Alpha')
plt.ylabel('Coefficients');

### Part B - Lasso regression (aka $\ell^1$-regularization)

The **Lasso** is another modification of classical linear regression. It uses the $\ell^1$ norm in the penalty term instead of the squared $\ell^2$ norm used in ridge regression. The optimization problem reads

$$ \min_{\beta \in \mathbb{R}^{p+1}} \|y - X \beta\|_2^2 + \alpha \| \beta \|_1 = \min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n \left( y_i - \sum_{j=0}^p x_{i,j} \beta_j \right)^2 + \alpha \sum_{j=1}^p |\beta_j|$$

Both the lasso and ridge regression lead to convex optimization problems.
For $\alpha > 0$ the ridge problem is strictly convex and therefore has a unique solution, even in the case $p > n$, where classical linear regression does not possess a unique solution.
The lasso problem is convex but in general not strictly convex, so for $p > n$ its solution need not be unique.
While the coefficients in ridge regression generally decrease in absolute value as the penalty parameter $\alpha$ increases, they typically never become exactly zero.
In contrast, the coefficients in the lasso can become exactly zero once $\alpha$ is large enough, so the lasso also performs feature selection.
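
The difference in sparsity can be observed directly. The sketch below uses synthetic data in which only the first two of five predictors influence the response (the data and the penalty values are assumptions for illustration). Note that `sklearn.linear_model.Lasso` scales the residual sum of squares by $1/(2n)$, so its `alpha` is not directly comparable to the $\alpha$ in the formula above or to the `alpha` of `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): only the first two of five predictors matter
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=100.0).fit(X_syn, y_syn)
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)

print("ridge:", ridge.coef_)
print("lasso:", lasso.coef_)

# Count coefficients that are exactly zero
print("exact zeros (ridge):", int(np.sum(ridge.coef_ == 0)))
print("exact zeros (lasso):", int(np.sum(lasso.coef_ == 0)))
```

Ridge shrinks all five coefficients but keeps them nonzero, whereas the lasso sets coefficients of uninformative predictors exactly to zero.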

**Task (1 point)**: Copy the code used for illustrating the influence of the penalty parameter in ridge regression and modify or expand the code to plot the coefficients obtained by the **Lasso** instead.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()