{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", " ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", " ps2_blja_problem1\n", "\n", "Submit your homework within one week until next Monday, 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 10: Verification of the Leave-One-Out cross-validation (LOOCV) magic formula" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise, we want to verify the magic formula for the Leave-One-Out cross-validation known from the lecture, slide 229.\n", "\n", "According to this, the cross-validation estimate of the error is\n", "\n", "$$\n", "\\text{CV}^{\\text{LOOCV}}_{(n)} = \\frac{1}{n} \\sum_{i=1}^n \\left(\n", "\\frac{y_i - \\hat y_i}{1 - h_i}\n", "\\right)^2,\n", "\\quad \\text{with } \n", "h_i = \\frac{1}{n} + \\frac{(x_i - \\bar x)^2}{\\sum_{j=1}^n (x_j - \\bar x)^2}\n", "$$\n", "being the leverage statistic of observation $i$.\n", "\n", "The general formula for the CV estimate, which is valid for all cross-validition methods, is\n", "\n", "$$\n", "\\text{CV}_{(k)} = \\frac{1}{k} \\sum_{i=1}^{k} \\text{MSE}_i.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we generate an example to verify that $\\text{CV}^{\\text{LOOCV}}_{(n)} = \\text{CV}_{(n)}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "cd64dc79811e1c91b820e1b65bb1a503", "grade": false, "grade_id": "cell-5fb7e0b6a98e747e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "n = 20\n", "X = np.arange(n)\n", "np.random.seed(1)\n", "y = 4*X+0.1*np.random.randn(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Fit a simple linear regression model on our data `X, y`. Here, it should be sufficient to use `numpy`'s `polyfit` function together with `polyval`.\n", "Store the regression coefficients in the variable `p`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "44110611c74841efab5e9e9bceff5415", "grade": false, "grade_id": "cell-a91bdd71d8619f08", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2f076ffa3ab83b4c95e0dcd0415eb707", "grade": true, "grade_id": "cell-af008d8fa468b94f", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.max(np.abs(p - np.array([ 3.99916958, -0.0054475 ]))) < 1e-7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Determine the predicted values of your data set and store them in the variable `ypred`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c468241c9f027deb93ff2f48fdb11a6d", "grade": false, "grade_id": "cell-b7707ff29de634ae", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ff4cb6ddef7dbadc312b05fcb9aa3c37", "grade": true, "grade_id": "cell-8ef2a783a2bee007", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(ypred.mean() - 37.98666353635393) < 1e-8\n", "assert np.abs(ypred.std() - 23.06033677189789) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (2 points)**: Implement the function `magicCV` to compute the $\\text{CV}^{\\text{LOOCV}}_{(n)}$ and evaluate it for our example.\n", "Store the obtained value $\\text{CV}^{\\text{LOOCV}}_{(n)}$ in a variable `cvn1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2eb968392c8bb875108050e688fe475b", "grade": false, "grade_id": "cell-a81d6910b7f18622", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "def magicCV(X, y, ypred):\n", " \"\"\"Function to compute and return the\n", " CV error estimate using the magic formula\n", " from slide 229.\"\"\"\n", " # Implement the magicCV function here\n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " \n", "# Evaluate the function for our example\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b1902d1a8b43b35f240daf4f5cec7fa4", "grade": true, "grade_id": "cell-6216ebbea69b7d80", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(cvn1 - 0.014723772434852839) < 1e-6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (2 points)**: Now, you should determine the value $\\text{CV}_{(n)}$ by performing simple linear regression fits in a LOOCV fashion.\n", "You should use the function `LeaveOneOut().split(...)` from `scikit-learn` to generate the test and training indices for each iterate, and update the quantity `cvn2` in each iterate.\n", "\n", "You should observe that both values `cvn1` and `cvn2` are almost identical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fd8fbc89c30b1691b88b004d5b520cd7", "grade": false, "grade_id": "cell-e0cd0baa3b5ce671", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from sklearn.model_selection import LeaveOneOut\n", "cvn2 = 0.0\n", "for train_idx, test_idx in LeaveOneOut().split(X):\n", " # Determine the mean-squared error \n", " # YOUR CODE HERE\n", " raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "aa4509e77bfc108c324834f13d00bd60", "grade": true, "grade_id": "cell-915d5fbc022f1a91", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert np.abs(cvn2 - 0.014723772434852688) < 1e-6\n", "assert abs(cvn1-cvn2) < 1e-12" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }