{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n",
    "\n",
    "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n",
    "\n",
    "Rename this problem sheet as follows:\n",
    "\n",
    "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n",
    "    \n",
    "for example\n",
    "    \n",
    "    ps2_blja_problem1\n",
    "\n",
    "Submit your homework within one week, i.e., by next Monday at 9 a.m."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NAME = \"\"\n",
    "EMAIL = \"\"\n",
    "USERNAME = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Data Science\n",
    "## Lab 2: Data import and linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This course assumes that you are comfortable with the basic functions of Jupyter and Python."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part A: Introduction to Data Import"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the file `Advertising.csv` from the homepage’s exercise section and upload it to your Jupyter Hub folder.\n",
    "Take a short look at the `csv`-file using a spreadsheet, e.g., LibreOffice.\n",
    "The file contains information about the sales of products in different markets, along with the advertising budgets for the three media **TV**, **radio** and **newspaper**.\n",
    "\n",
    "**Task**: Import the `csv`-file using the `numpy` function `genfromtxt` and store it as an array `X`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "4d30a197dd4a5c036d94268c8e5706b9",
     "grade": false,
     "grade_id": "cell-824784e084c340f0",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "bfc1926fc028bb60c9e6a1265b2535f4",
     "grade": true,
     "grade_id": "cell-784ee0c212e8cdfc",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert X.shape == (200,5)\n",
    "assert X[34][1] == 95.7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Extract the columns from the array `X` and store them as 1-dimensional arrays `idx`, `tv`, `radio`, `newspaper` and `sales`, e.g.,\n",
    "    \n",
    "    idx = X[:, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "856514a71d1631744105ae4f81607689",
     "grade": false,
     "grade_id": "cell-983eb3ebbf4d3eed",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "e042a60a6603acebbc4e08c5828d4361",
     "grade": true,
     "grade_id": "cell-05f112d846deb52b",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert idx.ndim == 1\n",
    "assert tv.ndim == 1\n",
    "assert radio.ndim == 1\n",
    "assert newspaper.ndim == 1\n",
    "assert sales.ndim == 1\n",
    "i=27\n",
    "assert idx[i] == 28\n",
    "assert tv[i] == 240.1\n",
    "assert radio[i] == 16.7\n",
    "assert newspaper[i] == 22.9\n",
    "assert sales[i] == 15.9"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Add subplots to plot sales against radio as well as sales against newspaper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "916afdc04b9a31e557a79092cac6787f",
     "grade": true,
     "grade_id": "cell-cc1107edacb5af03",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = (16, 9)\n",
    "fig1 = plt.figure()\n",
    "fig1.add_subplot(1,3,1)\n",
    "plt.plot(tv, sales, 'ro')\n",
    "plt.xlabel('TV budget')\n",
    "plt.ylabel('sales')\n",
    "plt.title('TV ads')\n",
    "\n",
    "fig1.add_subplot(1,3,2)\n",
    "# TASK: Plot sales against radio\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()\n",
    "\n",
    "fig1.add_subplot(1,3,3)\n",
    "# TASK: Plot sales against newspaper\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part B: Creation of a function in Python\n",
    "The mean squared error, or **MSE** for short, is one of the most important performance indicators for the quality of a data fit.\n",
    "The goal of this exercise is to implement the function `computeMSE` with the following **input**:\n",
    "- the observations $y_i \\in Y$, $i = 1, \\ldots, N$ that belong to measurements $x_i \\in X$, $i = 1, \\ldots, N$\n",
    "- the predictions of $f(x_i)$, which are denoted by $\\hat f(x_i)$, $i = 1, \\ldots, N$\n",
    "\n",
    "and corresponding **output**:\n",
    "\n",
    "$$\n",
    "    MSE = \\frac{1}{N} \\sum_{i=1}^N (y_i - \\hat f (x_i))^2\n",
    "$$"
   ]
  },
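  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small worked illustration of the formula above (the numbers are made up for this example and are not part of the lab data), take $N = 3$ observations $y = (1, 2, 3)$ with predictions $\\hat f(x) = (1.5, 2, 2)$:\n",
    "\n",
    "$$\n",
    "    MSE = \\frac{1}{3} \\left( (1 - 1.5)^2 + (2 - 2)^2 + (3 - 2)^2 \\right) = \\frac{0.25 + 0 + 1}{3} \\approx 0.4167\n",
    "$$"
   ]
  },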
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "e08fbc9d8b47bf2e33c1ce3f31f69486",
     "grade": false,
     "grade_id": "cell-656779d5a9b31dc3",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Define function for mean squared error\n",
    "def computeMSE(y, fhatx):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return mse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "1d62c1acc5b99c1b3d5c8826a992f6d4",
     "grade": true,
     "grade_id": "cell-103b708d111d04d3",
     "locked": true,
     "points": 2,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "x = np.array([0.55, 0.72, 0.6 , 0.54, 0.42,\n",
    "    0.65, 0.44, 0.89, 0.96, 0.38,\n",
    "    0.79, 0.53, 0.57, 0.93, 0.07,\n",
    "    0.09, 0.02, 0.83, 0.78, 0.87])\n",
    "\n",
    "y = np.array([6.9 , 7.58, 7.03, 6.61, 5.84,\n",
    "    7.32, 6.29, 8.38, 9.03, 5.75,\n",
    "    7.95, 6.63, 7.  , 8.8 , 4.37,\n",
    "    4.49, 4.01, 7.95, 7.87, 8.37])\n",
    "\n",
    "fhatx = np.array([\n",
    "       6.74792024, 7.61454115, 7.00280875, 6.69694254, 6.08521014,\n",
    "       7.25769725, 6.18716554, 8.48116205, 8.83800595, 5.88129934,\n",
    "       7.97138505, 6.64596484, 6.84987564, 8.68507285, 4.30099064,\n",
    "       4.40294604, 4.04610214, 8.17529585, 7.92040735, 8.37920665])\n",
    "\n",
    "print('MSE: ', computeMSE(y, fhatx))\n",
    "\n",
    "assert np.abs(computeMSE(y, fhatx) - 0.01375124894963537) < 1e-9"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
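    "**Task**: Compute the linear regression coefficients $\\beta_0, \\beta_1 \\in \\mathbb{R}$ for the data `x`, `y` from the previous cell with the `numpy` function `np.polyfit` and store the result in a variable `beta`.\n",
    "Note that `np.polyfit` returns the coefficients ordered from the highest degree to the lowest, so `beta[0]` is the slope $\\beta_1$ and `beta[1]` is the intercept $\\beta_0$.\n",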
    "Remember, linear regression finds the values $\\beta_0, \\beta_1 \\in \\mathbb{R}$ which solve the minimization problem\n",
    "\n",
    "$$\n",
    "\\text{Minimize } \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - (\\beta_0 + \\beta_1 x_i) \\right)^2 \\text{ over } \\beta_0, \\beta_1 \\in \\mathbb{R}\n",
    "$$\n",
    "\n",
    "*Hint*: Again, use the question mark `?` to get help and have a look at the documentation."
   ]
  },
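  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the minimization problem above has a well-known closed-form solution, stated here without derivation. The symbols $\\bar x$ and $\\bar y$ are not used elsewhere in this sheet; they denote the sample means of the $x_i$ and $y_i$, and the formula assumes that the $x_i$ are not all equal:\n",
    "\n",
    "$$\n",
    "\\beta_1 = \\frac{\\sum_{i=1}^N (x_i - \\bar x)(y_i - \\bar y)}{\\sum_{i=1}^N (x_i - \\bar x)^2}, \\qquad \\beta_0 = \\bar y - \\beta_1 \\bar x\n",
    "$$\n",
    "\n",
    "The result of `np.polyfit` should agree with these values up to rounding errors."
   ]
  },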
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "af4f43bb2493a7c6b3d5ec3f91210de6",
     "grade": false,
     "grade_id": "cell-4f6c481b7e4d76bf",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "46f0db18bfa165e6bd85fd359966d927",
     "grade": true,
     "grade_id": "cell-515608379fe4789e",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert np.abs(beta[0] - 5.09777001) < 1e-8\n",
    "assert np.abs(beta[1] - 3.94414674) < 1e-8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Compute the predictions for `x` yourself and store them in a variable `z`.\n",
    "The values should coincide with those of the variable `fhatx` (up to a tolerance of `1e-8`). You can use the function `np.polyval`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "54bc71c014c525600938c37b52d1500d",
     "grade": false,
     "grade_id": "cell-d190b7c4862b9971",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "c9160ffe2ae4135b9e958f3603464292",
     "grade": true,
     "grade_id": "cell-c6db1531a735649e",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert np.max(np.abs(fhatx - z)) < 1e-8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "31c5e6c51e7619d7edea7b769d567c3b",
     "grade": false,
     "grade_id": "cell-135adcc3298dee41",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task**: Plot the data points $(x_i, y_i)$, $i = 1 \\ldots, N$ together with the least squares line aka regression line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "2ca5f0aafff4390bcf281b6a92a0741d",
     "grade": true,
     "grade_id": "cell-ae78dcf4fb4cbe76",
     "locked": false,
     "points": 2,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part C: Introduction to Pandas\n",
    "In **Part A**, you got to know a method to import `csv`-files using the function `np.genfromtxt`.\n",
    "Sooner or later, however, we run into a limitation\n",
    "that is inherent to numpy arrays: a numpy array can only hold a single data type.\n",
    "If we have different kinds of data like booleans, floats, integers or\n",
    "strings, we have to take a different route.\n",
    "One possible solution is the package `csv`.\n",
    "Here, every row is read separately, so special cases can be handled row by row.\n",
    "Another possibility is the package `pandas`, whose complexity lies between that of `numpy` and `csv`. It can be imported by\n",
    "\n",
    "    import pandas as pd\n",
    "\n",
    "and `csv`-files can be imported by the function `pd.read_csv`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Work through the [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/10min.html#min) (this is a link!)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Download the file `Auto.csv` from the lecture’s homepage. Import the `csv`-file using the pandas function `read_csv` as a `DataFrame` named `Auto`.\n",
    "Beware of the missing values in the `csv`-file.\n",
    "You can use the optional parameter `na_values` of the function `read_csv` to mark them.\n",
    "In this problem, we want to **remove those rows** (observations) that contain missing values.\n",
    "You should use the method `dropna(axis=0, inplace=True)` for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "a089138b1429e278c8e68d8aa146f761",
     "grade": false,
     "grade_id": "cell-efd646077862b192",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "d7968c35672bfbc21d4d6b08426b7fd7",
     "grade": true,
     "grade_id": "cell-3f85fae5600a9d5b",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert Auto.shape == (392,9)\n",
    "assert Auto.iloc[301,2] == 85\n",
    "assert np.abs(Auto[\"mpg\"].mean() - 23.445918367346938) < 1e-8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "3ff08c5a392ad68d4f818f3da97f5791",
     "grade": false,
     "grade_id": "cell-8cbd3a65bcdc462a",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task**: Create a short summary of the most important statistics of the data set using the method `describe`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "fd5fc0938406b06943f005ff850a4b03",
     "grade": true,
     "grade_id": "cell-73e33597f66d3327",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "22114029bf59b003e6da3a376027a463",
     "grade": false,
     "grade_id": "cell-8a053852cc6d2989",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task**: Create a graphical overview of the distributions of the input variables of the data set using the method `hist`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "296d5bce2c4b1a566bbc3d9a3940d80b",
     "grade": true,
     "grade_id": "cell-2ccdaf2c47ba29e9",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "1629a45ff9c950531166f489db33368b",
     "grade": false,
     "grade_id": "cell-419dc2360236f76a",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task**: Create a scatter matrix using the function `pd.plotting.scatter_matrix` for the variables `[\"horsepower\", \"mpg\", \"weight\"]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "f74c436b729667fa450a87524184d031",
     "grade": true,
     "grade_id": "cell-9c3cea01f6163f40",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create two figures that relate the variable `\"horsepower\"` to `\"mpg\"` and `\"weight\"`, respectively.\n",
    "Use the plotting functionality provided by pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code cell investigates a linear and a quadratic relationship between horsepower and mpg (miles per gallon)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Investigate linear and quadratic connection between horsepower and mpg\n",
    "x = Auto.horsepower\n",
    "y = Auto.mpg\n",
    "\n",
    "mpgbeta1 = np.polyfit(x,y,deg=1)\n",
    "mpgbeta2 = np.polyfit(x,y,deg=2)\n",
    "\n",
    "Auto.plot(x='horsepower', y = 'mpg', marker='o', alpha = .7, kind='scatter')\n",
    "xr = np.linspace(Auto.horsepower.min(), Auto.horsepower.max(), 100)\n",
    "plt.plot(xr,np.polyval(mpgbeta1,xr), c='r')\n",
    "plt.plot(xr,np.polyval(mpgbeta2,xr), c='b')\n",
    "\n",
    "print('Investigating mpg against horsepower')\n",
    "print('MSE for linear fit: ', computeMSE(y, np.polyval(mpgbeta1, x)))\n",
    "print('MSE for quadratic fit: ', computeMSE(y, np.polyval(mpgbeta2, x)))\n"
   ]
  },
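  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the notation of **Part B**, the quadratic fit in the code cell above (`deg=2`) corresponds to the following least-squares problem. The coefficient $\\beta_2$ is introduced here only for illustration:\n",
    "\n",
    "$$\n",
    "\\text{Minimize } \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - (\\beta_0 + \\beta_1 x_i + \\beta_2 x_i^2) \\right)^2 \\text{ over } \\beta_0, \\beta_1, \\beta_2 \\in \\mathbb{R}\n",
    "$$\n",
    "\n",
    "Because the linear model is the special case $\\beta_2 = 0$, the quadratic fit cannot have a larger MSE than the linear fit on the data used for fitting. A smaller MSE alone therefore does not show that the quadratic model is the better description."
   ]
  },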
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Use the example from above and investigate a possible linear and quadratic relationship between horsepower and weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "75b5231842732b8ed98be951f5b2e475",
     "grade": true,
     "grade_id": "cell-b75fc32748a1111f",
     "locked": false,
     "points": 1,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}