{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", "    \n", "for example\n", "    \n", "    ps2_blja_problem1\n", "\n", "Submit your homework within one week, by next Monday at 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 2: Data import and linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This course assumes that you are comfortable with the basic functions of Jupyter and Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A: Introduction to Data Import" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the file `Advertising.csv` from the homepage’s exercise section and upload it to your Jupyter Hub folder.\n", "Take a quick look at the `csv`-file using a spreadsheet program, e.g., LibreOffice.\n", "The file contains information about the sales of products in different markets, along with the advertising budgets for three media: **TV**, **radio** and **newspaper**.\n", "\n", "**Task**: Import the `csv`-file using the `numpy` function `genfromtxt` and store it as an array `X`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4d30a197dd4a5c036d94268c8e5706b9", "grade": false, "grade_id": "cell-824784e084c340f0", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "bfc1926fc028bb60c9e6a1265b2535f4", "grade": true, "grade_id": "cell-784ee0c212e8cdfc", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X.shape == (200,5)\n", "assert X[34][1] == 95.7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Extract the columns from the array `X` and store them as 1-dimensional arrays `idx`, `tv`, `radio`, `newspaper` and `sales`, e.g.,\n", " \n", " idx = X[:, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "856514a71d1631744105ae4f81607689", "grade": false, "grade_id": "cell-983eb3ebbf4d3eed", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e042a60a6603acebbc4e08c5828d4361", "grade": true, "grade_id": "cell-05f112d846deb52b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert idx.ndim == 1\n", "assert tv.ndim == 1\n", "assert radio.ndim == 1\n", "assert newspaper.ndim == 1\n", "assert sales.ndim == 1\n", "i=27\n", "assert idx[i] == 28\n", "assert tv[i] == 240.1\n", "assert 
radio[i] == 16.7\n", "assert newspaper[i] == 22.9\n", "assert sales[i] == 15.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Add subplots to plot sales against radio as well as sales against newspaper." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "916afdc04b9a31e557a79092cac6787f", "grade": true, "grade_id": "cell-cc1107edacb5af03", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = (16, 9)\n", "fig1 = plt.figure()\n", "fig1.add_subplot(1,3,1)\n", "plt.plot(tv, sales, 'ro')\n", "plt.xlabel('TV budget')\n", "plt.ylabel('sales')\n", "plt.title('TV ads')\n", "\n", "fig1.add_subplot(1,3,2)\n", "# TASK: Plot sales against radio\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "fig1.add_subplot(1,3,3)\n", "# TASK: Plot sales against newspaper\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part B: Creation of a function in Python\n", "The mean squared error, or **MSE** for short, is one of the most important indicators of the quality of a data fit.\n", "The goal of this exercise is to implement the function `computeMSE` with the following **input**:\n", "- the observations $y_i \\in Y$, $i = 1, \\ldots, N$ that belong to measurements $x_i \\in X$, $i = 1, \\ldots, N$\n", "- the predictions of $f(x_i)$, which are denoted by $\\hat f(x_i)$, $i = 1, \\ldots, N$\n", "\n", "and the corresponding **output**:\n", "\n", "$$\n", " MSE = \\frac{1}{N} \\sum_{i=1}^N (y_i - \\hat f (x_i))^2\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e08fbc9d8b47bf2e33c1ce3f31f69486", "grade": false, "grade_id": "cell-656779d5a9b31dc3", 
"locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Define function for mean squared error\n", "def computeMSE(y, fhatx):\n", "    # YOUR CODE HERE\n", "    raise NotImplementedError()\n", "    return mse" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1d62c1acc5b99c1b3d5c8826a992f6d4", "grade": true, "grade_id": "cell-103b708d111d04d3", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "x = np.array([0.55, 0.72, 0.6 , 0.54, 0.42,\n", " 0.65, 0.44, 0.89, 0.96, 0.38,\n", " 0.79, 0.53, 0.57, 0.93, 0.07,\n", " 0.09, 0.02, 0.83, 0.78, 0.87])\n", "\n", "y = np.array([6.9 , 7.58, 7.03, 6.61, 5.84,\n", " 7.32, 6.29, 8.38, 9.03, 5.75,\n", " 7.95, 6.63, 7. , 8.8 , 4.37,\n", " 4.49, 4.01, 7.95, 7.87, 8.37])\n", "\n", "fhatx = np.array([\n", " 6.74792024, 7.61454115, 7.00280875, 6.69694254, 6.08521014,\n", " 7.25769725, 6.18716554, 8.48116205, 8.83800595, 5.88129934,\n", " 7.97138505, 6.64596484, 6.84987564, 8.68507285, 4.30099064,\n", " 4.40294604, 4.04610214, 8.17529585, 7.92040735, 8.37920665])\n", "\n", "print('MSE: ', computeMSE(y, fhatx))\n", "\n", "assert np.abs(computeMSE(y, fhatx) - 0.01375124894963537) < 1e-9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Compute the linear regression coefficients $\\beta_0, \\beta_1 \\in \\mathbb{R}$ for the data `x`, `y` with the `numpy` function `np.polyfit` and store the result in a variable `beta`.\n", "Remember, linear regression finds the values $\\beta_0, \\beta_1 \\in \\mathbb{R}$ which solve the minimization problem\n", "\n", "$$\n", "\\text{Minimize } \\frac{1}{N} \\sum_{i=1}^N \\left( y_i - (\\beta_0 + \\beta_1 x_i) \\right)^2 \\text{ over } \\beta_0, \\beta_1 \\in \\mathbb{R}\n", "$$\n", "\n", "Note that `np.polyfit` returns the coefficients ordered from the highest degree to the lowest, so `beta[0]` is the slope $\\beta_1$ and `beta[1]` is the intercept $\\beta_0$.\n", "\n", "*Hint*: Use the question mark `?` (e.g., `np.polyfit?`) to get help and have a look at the documentation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "af4f43bb2493a7c6b3d5ec3f91210de6", "grade": false, "grade_id": "cell-4f6c481b7e4d76bf", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "46f0db18bfa165e6bd85fd359966d927", "grade": true, "grade_id": "cell-515608379fe4789e", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(beta[0] - 5.09777001) < 1e-8\n", "assert np.abs(beta[1] - 3.94414674) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Compute the predictions for `x` yourself and store them in a variable `z`.\n", "The values should agree with those of the variable `fhatx` up to a tolerance of `1e-8`. You can use the function `np.polyval`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "54bc71c014c525600938c37b52d1500d", "grade": false, "grade_id": "cell-d190b7c4862b9971", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "c9160ffe2ae4135b9e958f3603464292", "grade": true, "grade_id": "cell-c6db1531a735649e", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.max(np.abs(fhatx - z)) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "31c5e6c51e7619d7edea7b769d567c3b", "grade": false, "grade_id": "cell-135adcc3298dee41", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Plot the data points $(x_i, y_i)$, $i = 1 \\ldots, N$ together with the least squares line aka regression line." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2ca5f0aafff4390bcf281b6a92a0741d", "grade": true, "grade_id": "cell-ae78dcf4fb4cbe76", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part C: Introduction to Pandas\n", "In **Part A**, you got to know a method to import `csv`-files using the function `np.genfromtxt`.\n", "Sooner or later, however, we run into a limitation\n", "that is inherent to numpy arrays: a numpy array can only hold one data type at a time.\n", "If the data mixes different types such as booleans, floats, integers or\n", "strings, we have to take a different route.\n", "One possible solution is to use the package `csv`.\n", "Here, every single row is read separately, so special cases can be handled row by row.\n", "Another possibility is to use the package `pandas`, whose complexity lies between the other two. It can be imported by\n", "\n", "    import pandas as pd\n", "\n", "and `csv`-files can be imported by the function `pd.read_csv`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Work through the [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/10min.html#min) (this is a link!)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Download the file `Auto.csv` from the lecture’s homepage. Import the `csv`-file using the pandas function `read_csv` as a `DataFrame` named `Auto`.\n", "Beware of the missing values in the `csv`-file.\n", "You can use the optional parameter `na_values` of the function `read_csv`.\n", "In this problem, we want to **remove those rows** (observations) that contain missing values.\n", "You should use the method `dropna(axis=0, inplace=True)` for this purpose." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a089138b1429e278c8e68d8aa146f761", "grade": false, "grade_id": "cell-efd646077862b192", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d7968c35672bfbc21d4d6b08426b7fd7", "grade": true, "grade_id": "cell-3f85fae5600a9d5b", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert Auto.shape == (392,9)\n", "assert Auto.iloc[301,2] == 85\n", "assert np.abs(Auto[\"mpg\"].mean() - 23.445918367346938) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "3ff08c5a392ad68d4f818f3da97f5791", "grade": false, "grade_id": "cell-8cbd3a65bcdc462a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Create a short summary of the most important statistics of the data set using the method `describe`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fd5fc0938406b06943f005ff850a4b03", "grade": true, "grade_id": "cell-73e33597f66d3327", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "22114029bf59b003e6da3a376027a463", "grade": false, "grade_id": "cell-8a053852cc6d2989", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Create a graphical overview of the distributions of the input variables of the data set using the method `hist`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "296d5bce2c4b1a566bbc3d9a3940d80b", "grade": true, "grade_id": "cell-2ccdaf2c47ba29e9", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1629a45ff9c950531166f489db33368b", "grade": false, "grade_id": "cell-419dc2360236f76a", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task**: Create a scatter matrix using the function `pd.plotting.scatter_matrix` for the variables `[\"horsepower\", \"mpg\", \"weight\"]`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f74c436b729667fa450a87524184d031", "grade": true, "grade_id": "cell-9c3cea01f6163f40", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create two figures that relate the variable `\"horsepower\"` to `\"mpg\"` and `\"weight\"`, respectively.\n", "Use the plotting facilities provided by pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code cell investigates a possible linear or quadratic relationship between horsepower and mpg (miles per gallon)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Investigate linear and quadratic relationship between horsepower and mpg\n", "x = Auto.horsepower\n", "y = Auto.mpg\n", "\n", "# Least-squares fits of degree 1 (linear) and degree 2 (quadratic)\n", "mpgbeta1 = np.polyfit(x, y, deg=1)\n", "mpgbeta2 = np.polyfit(x, y, deg=2)\n", "\n", "# Scatter plot of the data with both fitted curves (red: linear, blue: quadratic)\n", "Auto.plot(x='horsepower', y='mpg', marker='o', alpha=.7, kind='scatter')\n", "xr = np.linspace(Auto.horsepower.min(), Auto.horsepower.max(), 100)\n", "plt.plot(xr, np.polyval(mpgbeta1, xr), c='r')\n", "plt.plot(xr, np.polyval(mpgbeta2, xr), c='b')\n", "\n", "print('Investigating mpg against horsepower')\n", "print('MSE for linear fit: ', computeMSE(y, np.polyval(mpgbeta1, x)))\n", "print('MSE for quadratic fit: ', computeMSE(y, np.polyval(mpgbeta2, x)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Following the example above, investigate a possible linear and quadratic relationship between horsepower and weight." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "75b5231842732b8ed98be951f5b2e475", "grade": true, "grade_id": "cell-b75fc32748a1111f", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }