{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Data Science\n",
    "## Lab 15: Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part A - K-Means Algorithm\n",
    "The goal of this problem is to implement the K-Means algorithm known from slide 444:\n",
    "\n",
    "### Algorithm (K-Means Clustering)\n",
    "\n",
    "1. Fix the number of clusters $K$ and assign (by random) a number, from $1$ to $K$, to each of the observations. These serve as initial cluster assignments for the observations.\n",
    "2. Iterate until the cluster assignments stop changing:\n",
    "  1. For each of the $K$ clusters, compute the cluster centroid. The $k$-th cluster centroid is the vector of the $p$ feature means for the observations in the $k$-th cluster.\n",
    "  2. Assign each observation to the cluster whose centroid is closest (where closest is defined by Euclidean distance).\n",
    "  \n",
    "You should start by generating a test example, e.g., with the `make_blobs` function from `sklearn.datasets`.\n",
    "\n",
    "**Task**: Implement the K-Means Clustering using only `numpy` (and `matplotlib`, if you wish to display your findings graphically)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Execute the following cell and generate a test example with 100 samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "f610f5bd51cfbd3f54d0ea153b1fd667",
     "grade": false,
     "grade_id": "cell-844c94102ff7bba1",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Plot the sample data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "fa2b2d0b431f706b5d7fb8c9cbf530d0",
     "grade": false,
     "grade_id": "cell-545e8ffc1220f7ab",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Perform step 1 from the K-Means Clustering algorithm with $K = 2$, i.e., two classes, and store the initialized labels as `l`.\n",
    "You can use the function `np.random.randint`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "94e2b164120bda7ab9451bf0093888b9",
     "grade": false,
     "grade_id": "cell-c68434ccb6523d8f",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "# Initialize labels of samples randomly\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Plot data with randomly assigned labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "fdfdbd1fc754f0b41a962ea4eb32ad63",
     "grade": false,
     "grade_id": "cell-d2af60ea6e985e12",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "6dcb67f3ec506771e62a4f61c559a512",
     "grade": true,
     "grade_id": "cell-88b9270558ee386d",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert l.shape == (100,)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Compute the mean within each class according to Step 2A."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "4e939e1c866d909880d910d92c4378ce",
     "grade": false,
     "grade_id": "cell-08d8d558850a8373",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Plot the data samples and their class means."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "6481b61a3370137c2cdc29a83b987a4f",
     "grade": false,
     "grade_id": "cell-0da4f8cfe41e2b0b",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Determine distance of sample points to the cluster centers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "b73fcf8796284ee70d14f4497b423b55",
     "grade": false,
     "grade_id": "cell-f0678256901e2fcd",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Determine the new class according to Step 2B."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "bea2ca14dfc2f308b13f5248ba542b1f",
     "grade": false,
     "grade_id": "cell-60804725af0b5c72",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: Now, we have developed all the incredients that we need in our algorithm.\n",
    "\n",
    "Collect the snippets from above and complete the implementation of the function `myKMeans`.\n",
    "\n",
    "Add comments as necessary!\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "1ab0f9a9eb990958606ffb721568acff",
     "grade": false,
     "grade_id": "cell-03a8c168618d1809",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def myKMeans(X, K=2):\n",
    "    # myKMeans perform the K-Means Clustering algorithm\n",
    "    # Input: X - Data matrix of size n,m\n",
    "    #        K - Number of clusters\n",
    "    # Output: l - labels of clusters of size (n,)\n",
    "    #         M - Coordinates of cluster centers \n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return (l, M)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: If done correctly, the following code will work and present an almost perfectly classified sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_blobs\n",
    "n = 100\n",
    "X,y = make_blobs(n_samples=n, n_features=2, centers=4, random_state=1)\n",
    "\n",
    "K = 4\n",
    "l,M = myKMeans(X, K)\n",
    "\n",
    "plt.scatter(X[:,0], X[:,1], c=l)\n",
    "plt.scatter(M[:,0], M[:,1], s=200, alpha=0.5, edgecolors='k', c=range(K));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part B - Further Clustering \n",
    "\n",
    "While the K-Means Clustering algorithm is by far the simplest clustering algorithm, there are many other clustering algorithms available.\n",
    "This [Towards Data Science article](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68) describes 5 important clustering algorithms, the first one being *K-Means Clustering*, the last one being *Agglomerative Hierarchical Clustering*.\n",
    "\n",
    "You should also have a look at [the `sklearn` documentation](https://scikit-learn.org/stable/modules/clustering.html).\n",
    "\n",
    "**Task**: Implement at least two of these using methods from `scikit-learn`. Test the performance of the algorithms for the swiss roll data set, which can be generated and plotted by the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_swiss_roll\n",
    "import mpl_toolkits.mplot3d.axes3d as p3\n",
    "# Generate swiss roll\n",
    "n = 1000\n",
    "eps = 0.05\n",
    "X, _ = make_swiss_roll(n, eps)\n",
    "\n",
    "# Plot the swiss roll\n",
    "fig = plt.figure()\n",
    "ax = p3.Axes3D(fig)\n",
    "ax.view_init(7, -80)\n",
    "ax.scatter(X[:,0], X[:,1], X[:,2]);"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}