{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", " ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", " ps2_blja_problem1\n", "\n", "Submit your homework within one week, by next Monday, 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 10: Cross-validation for parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A: Introduction to the K-Nearest Neighbor classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lecture, you learned about the $K$-nearest neighbor classifier. It often performs very well, although its computational cost is rather high for higher-dimensional problems.\n", "\n", "Without worrying about the implementation details, we want to learn about another application of cross-validation: **parameter tuning**.\n", "\n", "In this problem, we want to use cross-validation to tune the parameter $K$, i.e., the number of neighbors used in the $K$-nearest neighbor classifier.\n", "\n", "We start by importing the iris dataset.\n", "\n", "**Task**: Execute the following code cell to import and load the data."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Use the function `train_test_split` to split the data into a training and test set:\n", "- set the option `stratify` to ensure that both the training and test set contain approximately equal proportions of classes\n", "- the training set should contain 80 % of the data\n", "- set the random seed to 1, i.e., use the option `random_state = 1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "22706bf3c0419d3b303924a783dd9fdb", "grade": false, "grade_id": "cell-8f95c7557af9d3a7", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e04a08b079015d990c37d28eae3f546e", "grade": true, "grade_id": "cell-889601a4368b0717", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "assert Xtrain.shape == (120,4)\n", "assert ytrain.shape == (120,)\n", "assert np.mean(ytrain) == 1\n", "assert abs(Xtest.mean() - 3.3958333333333335) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, you should try the function `KNeighborsClassifier` from the module `sklearn.neighbors`.\n", "\n", "**Task (1 point)**: Fit a model using the K nearest neighbors classifier on the *training data*.\n", "Use $K=5$ and compute the accuracy of the model, i.e., the proportion of correct classifications.\n", "Store the *accuracy on the test data* as `knn_accuracy`.\n", "\n", "Remember, 
the accuracy of a classification task is defined by\n", "\n", "$$\n", "\\text{accuracy} = \\frac{\\text{number of correct predictions}}{\\text{total number of predictions}}\n", "$$\n", "\n", "Use either a routine provided by `scikit-learn`, or compute it by yourself." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "cc0b0a5cfd8faa6cad8393f5d82b50fe", "grade": false, "grade_id": "cell-29d82c6fa028fed3", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e0bba67577cb802f1cbd13560fda8eb3", "grade": true, "grade_id": "cell-27864c973ad26109", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'knn_accuracy' in locals()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe an accuracy of around $96.67\\%$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part B: Cross-validation for parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to use cross-validation to tune the parameter $K$.\n", "You should use the function `cross_val_score` from the `sklearn.model_selection` module to get a reliable estimate of the accuracy for a given value of $K$ (number of neighbors).\n", "A good choice for the optional parameter `cv`, which sets the number of folds used for the cross-validation, is 8.\n", "You should also set the optional parameter `scoring`, so that the function returns an array containing the accuracy of each fold.\n", "\n", "**Task (1 point)**: \n", "Complete the following cell.\n", "Perform $K$-nearest neighbor classification for every $K=1,\\ldots,25$ using cross-validation **with 8 folds**.\n", "Store the *mean of the accuracy scores* in the list `k_scores`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6deae61c819fe8ebeac4fe26868097b0", "grade": false, "grade_id": "cell-4b9bde6eadd475f3", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "n_fold = 8\n", "k_range = list(range(1, 26))\n", "k_scores = []\n", "\n", "# Use a for-loop to perform the task\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "23e0052ec5adb54ce7623f4f067e469a", "grade": true, "grade_id": "cell-de731e8948853574", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "assert abs(np.mean(k_scores) - 0.9665079365079363) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: 
What value of $K$ maximizes the accuracy? Store it in the variable `k_max`. If multiple values of $K$ attain the same maximum, choose the smallest one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fc586ba1700fb253a2c6fa117282b7b6", "grade": false, "grade_id": "cell-ad728b1bc568288e", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8aee433485269a5740ff55d3ac158099", "grade": true, "grade_id": "cell-7e837d2f9e2b4c0e", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'k_max' in locals()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**: Plot the obtained accuracy estimates against the parameter values $K$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "90f6d0e9c1a72a54508bcf5b2a1d8ac1", "grade": true, "grade_id": "cell-87fc88c9d6c23bb6", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }