# Problem sheet 3
The previous exercises gave an introduction to Python, NumPy and pandas. Beginning with this exercise, we shift our focus to statistical learning itself. To this end, we will employ the module scikit-learn, which offers many of the methods we will cover during the remainder of the semester.

If you have not already done so, please download the file [Advertising.csv](https://www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/exercises/Advertising.csv) and move it into a subfolder called `datasets`.

## Exercise 1:
We start this exercise with the Advertising dataset known from the lecture.

We read the dataset using Pandas:

In [None]:
import pandas as pd
import numpy as np

adv = pd.read_csv('./datasets/Advertising.csv', index_col=0)

# Print first entries of adv
print(adv.head(3))

For convenience, we extract the values from this pandas DataFrame:

In [None]:
X = adv.values[:,0:3]
tv, radio, newspaper = np.hsplit(X,3)
Y = adv.values[:,3]

### Part (a)
For each of the 3 predictor variables **TV**, **radio** and **newspaper**, compute a simple (1-dimensional) linear regression, e.g.


$$ y^{TV}_i \approx \beta_0^{TV} + \beta_1^{TV} \, x_i^{TV}$$

Use the following class:

 from sklearn.linear_model import LinearRegression
 
You can use a command similar to

 print('y = %5.4f + %5.4f x TV' % (intercept, lincoef))
 
to print your results in a nice fashion.
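The following is a minimal sketch of how `LinearRegression` can be fitted and its coefficients read out. It deliberately uses synthetic stand-in data (an assumption, not the Advertising dataset), and the names `x`, `y` and `model` are purely illustrative. Note that `fit()` expects the predictor as a 2-dimensional array of shape `(n_samples, 1)`, which is also the shape `np.hsplit` produces above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): y = 3 + 0.5 x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(50, 1))      # 2-D array of shape (n_samples, 1)
y = 3.0 + 0.5 * x[:, 0] + rng.normal(0, 1, size=50)

# Fit the simple linear regression
model = LinearRegression().fit(x, y)
intercept = model.intercept_               # estimate of beta_0 (a scalar)
lincoef = model.coef_[0]                   # estimate of beta_1

print('y = %5.4f + %5.4f x' % (intercept, lincoef))
```

The same pattern should carry over to `tv`, `radio` and `newspaper` with `Y` as the response.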

In [None]:
# Put your code always into these code blocks

You should observe that the regression coefficients for **TV** and **newspaper** are very similar.
As you already know from the lecture, it is not satisfactory from a mathematical point of view to restrict our investigation to the absolute values of the coefficients.


### Part (b)

In the lecture you learned about different measures for assessing the quality of a linear fit.
In the last exercise, we already implemented a function to compute the mean squared error (MSE).

This time, we want to compare the $R^2$ scores. You can use the method `score()` of a `LinearRegression` to get the $R^2$ values.
Remember that this value is the proportion of variability in $Y$ explained using **TV**, **radio** or **newspaper** as predictor in a 1-dimensional linear regression fit.
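As a sketch of what `score()` returns, the snippet below compares it with the definition $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$ computed by hand. The data are synthetic (an assumption), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption), not the Advertising dataset
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(40, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(0, 1, size=40)

model = LinearRegression().fit(x, y)
r2_sklearn = model.score(x, y)             # R^2 as reported by scikit-learn

# R^2 from its definition: 1 - RSS / TSS
y_hat = model.predict(x)
rss = np.sum((y - y_hat) ** 2)             # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
r2_manual = 1.0 - rss / tss

print(r2_sklearn, r2_manual)
```

Both numbers should agree up to floating-point error, which is a quick sanity check before comparing the three predictors.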

In [None]:
# Put your code for part (b) here.

### Part (c)
Now we want to compute the predicted value of sales if we restrict our prediction to one input, i.e. **TV**, **radio** or **newspaper**, respectively.
Predict the values $\hat{y}^{TV}$, $\hat{y}^{radio}$ and $\hat{y}^{newspaper}$ using the method `predict()`.

In [None]:
# and that of part (c) here...

### Part (d)

Plot the datapoints as well as the corresponding regression line for each of the inputs **TV**, **radio** and **newspaper**.

You can use the functions `subplots` or `fig.add_subplot` to arrange the plots in one figure.
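One possible layout using `plt.subplots` is sketched below. It assumes synthetic data with three hypothetical features (`feature A` to `feature C`), so the arrays `X_demo` and `y_demo` are placeholders for the real predictors and response.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): three features, one response
rng = np.random.default_rng(2)
n = 60
X_demo = rng.uniform(0, 10, size=(n, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, 0.5, 0.0]) + rng.normal(0, 1, size=n)
names = ['feature A', 'feature B', 'feature C']   # hypothetical labels

# One row with three panels that share the y-axis
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for j, ax in enumerate(axes):
    xj = X_demo[:, [j]]                           # keep the 2-D shape (n, 1)
    model = LinearRegression().fit(xj, y_demo)
    # Evaluate the fitted line on an evenly spaced grid
    grid = np.linspace(xj.min(), xj.max(), 100).reshape(-1, 1)
    ax.scatter(xj, y_demo, s=10)
    ax.plot(grid, model.predict(grid), color='red')
    ax.set_xlabel(names[j])
axes[0].set_ylabel('response')
fig.tight_layout()
# In a notebook the figure is rendered automatically; in a script call plt.show()
```

Sharing the y-axis makes the slopes of the three panels directly comparable by eye, which is a design choice rather than a requirement.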

In [None]:
# We plot our findings using subplots
import matplotlib.pyplot as plt

fig = plt.figure()
fig.add_subplot(1,3,1)
# I guess you know what you have to put in here ...

fig.add_subplot(1,3,2)
# ... and here ...

fig.add_subplot(1,3,3)
# and here.


### Part (e)
Take a closer look at the correlation matrix.
You can use the method `corr()` that is implemented for pandas `DataFrames`.
Which features are correlated most strongly?
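A small sketch of `corr()` on a synthetic DataFrame follows. The columns `a`, `b` and `c` are invented for illustration (an assumption), with `b` constructed to be strongly correlated with `a`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): b depends on a, c is independent
rng = np.random.default_rng(3)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

corr = df.corr()                     # pairwise Pearson correlation matrix
print(corr.round(3))

# Mask the diagonal (always 1) and look up the largest absolute correlation
off_diag = corr.abs().where(~np.eye(len(corr), dtype=bool))
pair = off_diag.stack().idxmax()     # tuple of (row label, column label)
print(pair)
```

Masking the diagonal is needed because every column is perfectly correlated with itself.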

In [None]:
# Now, I think you know how to proceed.

**Answer**: 

### Part (f)
Investigate the statistical significance of the medium **newspaper** in a linear regression involving only this feature. Use a **t-test** for this purpose as described on slide 80 in the lecture notes.

You should observe the following values:

|Coefficient | Estimate | SE | t-statistic | p-value|
|:-----------|----------|----|-------------|--------|
| $\beta_0$ | 12.351 | 0.621 | 19.88 | < 0.0001 |
| $\beta_{newspaper}$ | 0.055 | 0.017 | 3.30 | 0.00115 |

You should use `scipy` to get the $t$-distribution using

 from scipy.stats import t
 
The cumulative distribution function at a point `x` for `n` degrees of freedom can then be evaluated with

 t.cdf(x, n)
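The sketch below shows how a two-sided t-test for a simple linear regression might be assembled from `t.cdf`. It assumes the standard textbook formulas for the standard errors (which may be presented differently on the lecture slide) and uses synthetic data, so all variable names are illustrative.

```python
import numpy as np
from scipy.stats import t
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption), not the Advertising dataset
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 50, size=n)
y = 5.0 + 0.1 * x + rng.normal(0, 2, size=n)

model = LinearRegression().fit(x.reshape(-1, 1), y)
beta0, beta1 = model.intercept_, model.coef_[0]

# Unbiased estimate of the noise variance with n - 2 degrees of freedom
residuals = y - model.predict(x.reshape(-1, 1))
dof = n - 2
sigma2_hat = np.sum(residuals ** 2) / dof
sxx = np.sum((x - x.mean()) ** 2)

# Standard errors of slope and intercept (standard formulas, assumed here)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x.mean() ** 2 / sxx))

# t-statistics and two-sided p-values
t_beta0 = beta0 / se_beta0
t_beta1 = beta1 / se_beta1
p_beta0 = 2.0 * (1.0 - t.cdf(abs(t_beta0), dof))
p_beta1 = 2.0 * (1.0 - t.cdf(abs(t_beta1), dof))

print('beta_0: %.3f  SE %.3f  t %.2f  p %.5f' % (beta0, se_beta0, t_beta0, p_beta0))
print('beta_1: %.3f  SE %.3f  t %.2f  p %.5f' % (beta1, se_beta1, t_beta1, p_beta1))
```

The factor 2 accounts for both tails of the distribution, since the alternative hypothesis is that the coefficient differs from zero in either direction.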

### Part (g)
Now construct a linear regression on all three predictor variables, i.e.

$$ y_i \approx \beta_0 + \beta_{TV} x^{TV}_i + \beta_{radio} x^{radio}_i + \beta_{newspaper} x^{newspaper}_i $$

What do you observe? Compare your results with your findings from above.

**Answer**: 

### Part (h)
What portion of the variance is explained by this linear regression fit?

### Part (i)
Now perform a linear regression that incorporates only the predictors **TV** and
**radio**.
Also compute the $R^2$ value and compare it to that of the full multiple linear regression.

**Extra task**: Present the datapoints and the regression plane in a 3-dimensional plot.


**Answer**: 

# Homework:

We have already seen that the **t-test** comes in handy when one has to decide whether a coefficient for a single feature is significant or not.
As has been outlined in the lecture, one can also use the t-test in a multiple linear regression fit

$$ Y = X \beta + \varepsilon $$

where the intercept is incorporated into $X$, i.e. a column containing only ones is prepended to the original matrix $X$.

The formula to compute the test statistic in this generalized setting is

$$ t_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}} $$

where $\hat{\beta}_j$ is the $j$-th entry of the estimated coefficient vector

$$ \hat{\beta} = (X^\top X)^{-1} X^\top y, $$

$\hat{\sigma}$ is the unbiased estimate of $\sigma$, which is determined by

$$ \hat{\sigma} = \sqrt{\frac{1}{n-p-1} \, \sum_{i=1}^n (y_i - \hat{y}_i)^2} $$

and $v_j$ is the $j$-th diagonal element of the matrix $(X^\top X)^{-1}$.

Under the null hypothesis $\beta_j = 0$, $t_j$ is distributed according to a $t$-distribution with $n-p-1$ degrees of freedom (dofs), where $p$ is the number of predictors.
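The formulas above can be sketched in NumPy as follows. The example runs on synthetic data with three hypothetical predictors (an assumption), and names such as `X_raw`, `beta_true` and `names` are introduced only for illustration.

```python
import numpy as np
from scipy.stats import t

# Synthetic stand-in data (assumption): n samples, p predictors
rng = np.random.default_rng(5)
n, p = 120, 3
X_raw = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.0])    # intercept first, last feature irrelevant

# Prepend a column of ones so the intercept becomes part of X
X = np.column_stack([np.ones(n), X_raw])
y = X @ beta_true + rng.normal(0, 1, size=n)

# Least-squares estimate via the normal equations
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased noise estimate with n - p - 1 degrees of freedom
residuals = y - X @ beta_hat
dof = n - p - 1
sigma_hat = np.sqrt(np.sum(residuals ** 2) / dof)

# Standard errors sigma_hat * sqrt(v_j), t-statistics, two-sided p-values
se = sigma_hat * np.sqrt(np.diag(XtX_inv))
t_stat = beta_hat / se
p_val = 2.0 * (1.0 - t.cdf(np.abs(t_stat), dof))

# Print the results as a simple aligned table
names = ['beta_0', 'beta_1', 'beta_2', 'beta_3']
print('%-12s %9s %8s %12s %9s' % ('Coefficient', 'Estimate', 'SE', 't-statistic', 'p-value'))
for name, b, s, ts, pv in zip(names, beta_hat, se, t_stat, p_val):
    print('%-12s %9.3f %8.4f %12.2f %9.4f' % (name, b, s, ts, pv))
```

Inverting $X^\top X$ explicitly mirrors the formulas and gives direct access to the $v_j$; for ill-conditioned problems a solver such as `np.linalg.lstsq` would likely be the more robust choice for the estimate itself.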

**Task**: Compute the values in the following table and try to print them in a similar way.

| Coefficient | Estimate | SE | t-statistic | p-value |
|:-----------------|-----------|-------|-------------|---------|
| $\beta_0$ | 2.939 |0.3119 | 9.42 | < 0.0001|
| $\beta_{TV}$ | 0.046 |0.0014 | 32.81 | < 0.0001|
| $\beta_{radio}$ | 0.189 |0.0086 | 21.89 | < 0.0001|
| $\beta_{news}$ | −0.001 |0.0059 | −0.18 | 0.8599 |

