{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## Lab 15: Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A - K-Means Algorithm\n", "The goal of this problem is to implement the K-Means algorithm known from slide 444:\n", "\n", "### Algorithm (K-Means Clustering)\n", "\n", "1. Fix the number of clusters $K$ and assign (by random) a number, from $1$ to $K$, to each of the observations. These serve as initial cluster assignments for the observations.\n", "2. Iterate until the cluster assignments stop changing:\n", " 1. For each of the $K$ clusters, compute the cluster centroid. The $k$-th cluster centroid is the vector of the $p$ feature means for the observations in the $k$-th cluster.\n", " 2. Assign each observation to the cluster whose centroid is closest (where closest is defined by Euclidean distance).\n", " \n", "You should start by generating a test example, e.g., with the `make_blobs` function from `sklearn.datasets`.\n", "\n", "**Task**: Implement the K-Means Clustering using only `numpy` (and `matplotlib`, if you wish to display your findings graphically)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Execute the following cell and generate a test example with 100 samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f610f5bd51cfbd3f54d0ea153b1fd667", "grade": false, "grade_id": "cell-844c94102ff7bba1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot the sample data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fa2b2d0b431f706b5d7fb8c9cbf530d0", "grade": false, "grade_id": "cell-545e8ffc1220f7ab", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Perform step 1 from the K-Means Clustering algorithm with $K = 2$, i.e., two classes, and store the initialized labels as `l`.\n", "You can use the function `np.random.randint`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "94e2b164120bda7ab9451bf0093888b9", "grade": false, "grade_id": "cell-c68434ccb6523d8f", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "# Initialize labels of samples randomly\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot data with randomly assigned labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fdfdbd1fc754f0b41a962ea4eb32ad63", "grade": false, "grade_id": "cell-d2af60ea6e985e12", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6dcb67f3ec506771e62a4f61c559a512", "grade": true, "grade_id": "cell-88b9270558ee386d", "locked": true, "points": 0, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert l.shape == (100,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Compute the mean within each class according to Step 2A." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4e939e1c866d909880d910d92c4378ce", "grade": false, "grade_id": "cell-08d8d558850a8373", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Plot the data samples and their class means." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6481b61a3370137c2cdc29a83b987a4f", "grade": false, "grade_id": "cell-0da4f8cfe41e2b0b", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Determine distance of sample points to the cluster centers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b73fcf8796284ee70d14f4497b423b55", "grade": false, "grade_id": "cell-f0678256901e2fcd", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Determine the new class according to Step 2B." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "bea2ca14dfc2f308b13f5248ba542b1f", "grade": false, "grade_id": "cell-60804725af0b5c72", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Now, we have developed all the incredients that we need in our algorithm.\n", "\n", "Collect the snippets from above and complete the implementation of the function `myKMeans`.\n", "\n", "Add comments as necessary!\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "1ab0f9a9eb990958606ffb721568acff", "grade": false, "grade_id": "cell-03a8c168618d1809", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "def myKMeans(X, K=2):\n", " # myKMeans perform the K-Means Clustering algorithm\n", " # Input: X - Data matrix of size n,m\n", " # K - Number of clusters\n", " # Output: l - labels of clusters of size (n,)\n", " # M - Coordinates of cluster centers \n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", " return (l, M)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: If done correctly, the following code will work and present an almost perfectly classified sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "n = 100\n", "X,y = make_blobs(n_samples=n, n_features=2, centers=4, random_state=1)\n", "\n", "K = 4\n", "l,M = myKMeans(X, K)\n", "\n", "plt.scatter(X[:,0], X[:,1], c=l)\n", "plt.scatter(M[:,0], M[:,1], s=200, alpha=0.5, edgecolors='k', c=range(K));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part B - Further Clustering \n", "\n", "While the K-Means Clustering algorithm is by far the simplest clustering algorithm, there are many other clustering algorithms available.\n", "This [Towards Data Science article](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68) describes 5 important clustering algorithms, the first one being *K-Means Clustering*, the last one being *Agglomerative Hierarchical Clustering*.\n", "\n", "You should also have a look at [the `sklearn` documentation](https://scikit-learn.org/stable/modules/clustering.html).\n", "\n", "**Task**: Implement at least two of these using methods from `scikit-learn`. Test the performance of the algorithms for the swiss roll data set, which can be generated and plotted by the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_swiss_roll\n", "import mpl_toolkits.mplot3d.axes3d as p3\n", "# Generate swiss roll\n", "n = 1000\n", "eps = 0.05\n", "X, _ = make_swiss_roll(n, eps)\n", "\n", "# Plot the swiss roll\n", "fig = plt.figure()\n", "ax = p3.Axes3D(fig)\n", "ax.view_init(7, -80)\n", "ax.scatter(X[:,0], X[:,1], X[:,2]);" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }