{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2: Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this lab session is to code a clustering algorithm, apply it to data and compare the performance with other techniques.\n", "\n", "You have to send the filled notebook named **\"L2_familyname1_familyname2.ipynb\"** (groups of 2) by email to aml.centralesupelec.2019@gmail.com by October 10, 2019. Please put **\"AML-L2\"** in the subject. \n", "\n", "We begin with the standard imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import sklearn.cluster as cluster # all clustering techniques except hdbscan\n", "%matplotlib inline\n", "sns.set_context('poster')\n", "sns.set_color_codes()\n", "plot_kwds = {'alpha' : 0.25, 's' : 80, 'linewidths':0}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will have two toy datasets to try the different methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.datasets as data\n", "moons, _ = data.make_moons(n_samples=50, noise=0.05)\n", "blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)\n", "test_data_0 = np.vstack([moons, blobs])\n", "\n", "test_data_1 = np.load('clusterable_data.npy')\n", "\n", "fig, ax = plt.subplots(1, 2, figsize=(16, 6))\n", "ax[0].scatter(test_data_0.T[0], test_data_0.T[1], c='b', **plot_kwds)\n", "ax[0].set_title('Toy Dataset', size=16)\n", "\n", "ax[1].scatter(test_data_1.T[0], test_data_1.T[1], color='b', **plot_kwds)\n", "ax[1].set_title('Noisy Toy Dataset', size=16)\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of clustering algorithms to choose from the `sklearn` library. So what clustering algorithms should you be using? It depends." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "K-Means is the 'go-to' clustering algorithm for many simply because it is fast, easy to understand, and available everywhere (there's an implementation in almost any statistical or machine learning tool you care to use). However, K-Means has a few problems caused by its simplicity. \n", "\n", "We try the `sklearn` implementation in our toy datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "kmeans_0 = KMeans(n_clusters=4, max_iter=200).fit(test_data_0)\n", "kmeans_1 = KMeans(n_clusters=4, max_iter=200).fit(test_data_1)\n", "\n", "fig, ax = plt.subplots(1, 2, figsize=(16, 6))\n", "ax[0].scatter(test_data_0.T[0], test_data_0.T[1], c=kmeans_0.labels_ , **plot_kwds)\n", "ax[0].set_title('Toy Dataset', size=16)\n", "\n", "ax[1].scatter(test_data_1.T[0], test_data_1.T[1], c=kmeans_1.labels_, **plot_kwds)\n", "ax[1].set_title('Noised Toy Dataset', size=16)\n", "\n", "\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aglomerative Single Linkage clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agglomerative clustering is a suite of algorithms all based on the same idea. The fundamental idea is that you start with each point in it's own cluster and then, for each cluster, use some criterion to choose another cluster to merge with. Do this repeatedly until you have only one cluster and you get get a hierarchy, or binary tree, of clusters branching down to the last layer which has a leaf for each point in the dataset. The most basic version of this, single linkage, chooses the closest cluster to merge, and hence the tree can be ranked by distance as to when clusters merged/split.\n", "\n", "**Code your own Aglomerative Single Linkage clustering algorithm**!:\n", "\n", "- Fill in the class \n", "- During the process, keep track of the cluster merges by saving a (num_samples-1,4) np.array being a linkage matrix in scypy format (to use their function to plot dendrogram: `scipy.cluster.hierarchy.dendrogram`). check documentation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class my_SingleLinkageAglomerativeClustering():\n", " \n", " def __init__(self, metric=\"euclidean\", n_clusters=3):\n", " '''\n", " Attributes:\n", " \n", " metric_: {\"euclidean\",\"precomputed\"}\n", " the distance to be used\n", " if precomputed then X is distance matrix\n", " n_clusters: integer\n", " number of clusters to return \n", " linkage_matrix_: (n-1, 4) np.array\n", " in the same format as linkage \n", " labels_: integer np.array\n", " label assigment\n", " hierarchy_: list of np.array\n", " each array corresponds to label assigment\n", " at each level (number of clusters)\n", " hierarchy_[0]=np.array(list(range(n)))\n", " '''\n", " self.metric_ = metric\n", " self.n_clusters_ = n_clusters\n", " self.linkage_matrix_ = None\n", " self.labels_ = None\n", " self.hierarchy_ = None\n", " \n", " def fit(self, X):\n", " \"\"\" Create a hierarchy of clusters\n", " \n", " Parameters:\n", " -----------\n", " X: (n, p) np.array\n", " Data matrix\n", " \n", " Returns:\n", " -----\n", " self: my_SingleLinkageAglomerativeClustering\n", " to have access to labels_\n", " \"\"\"\n", " # if it's not precomputed compute the distance matrix\n", " # using from scipy.spatial import distance \n", " \n", " # HINT:\n", " # You can use a minimum spanning tree and add merge in increasing order\n", " # or modifying the distance matrix \n", " # (add row/column for new clusters and remove/put zero in old row/colums)\n", " \n", " # keep track of merges in linkage_matrix_ and labels in hierarchy_\n", " \n", " # update labels_ from the hierarchy level selected by n_clusters_ \n", " \n", " def plot_dendrogram():\n", " '''Use self.linkage_matrix_ in `scipy.cluster.hierarchy.dendrogram` \n", " to plot the dendrogram of the hierarchical structure\n", " ''' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply the method to our toy datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial import distance \n", "from scipy.cluster.hierarchy import dendrogram\n", "\n", "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the different linkage methods implemented in `sklearn` and comment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DBSCAN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DBSCAN is a density based algorithm -- it assumes clusters for dense regions. It is also the first actual clustering algorithm we've looked at: it doesn't require that every point be assigned to a cluster and hence doesn't partition the data, but instead extracts the 'dense' clusters and leaves sparse background classified as 'noise'. In practice DBSCAN is related to agglomerative clustering. As a first step DBSCAN transforms the space according to the density of the data: points in dense regions are left alone, while points in sparse regions are moved further away. Applying single linkage clustering to the transformed space results in a dendrogram, which we cut according to a distance parameter (called epsilon or `eps` in many implementations) to get clusters. Importantly any singleton clusters at that cut level are deemed to be 'noise' and left unclustered. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply it to the test_data, how do you tune the parameters?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HDBSCAN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HDBSCAN is a recent algorithm developed by some of the same people who write the original DBSCAN paper. Their goal was to allow varying density clusters. The algorithm starts off much the same as DBSCAN: we transform the space according to density, exactly as DBSCAN does, and perform single linkage clustering on the transformed space. Instead of taking an epsilon value as a cut level for the dendrogram however, a different approach is taken: the dendrogram is condensed by viewing splits that result in a small number of points splitting off as points 'falling out of a cluster'. This results in a smaller tree with fewer clusters that 'lose points'. That tree can then be used to select the most stable or persistent clusters. This process allows the tree to be cut at varying height, picking our varying density clusters based on cluster stability." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hdbscan\n", "\n", "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply the algorithms to the following images and comment the results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NASA Curiosity Picture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Look at following NASA photo taken by a robot in mars:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from PIL import Image\n", "\n", "Im_1 = Image.open('im_nasa_reduced.jpg')\n", "\n", "fig = plt.figure(figsize=(8, 6))\n", "plt.imshow(Im_1)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Transform the image to an intensity (I) and saturation (S) representation, it helps to distinguish bright and textures.\n", "\n", "$$I=\\frac{R+G+B}{3}$$\n", "$$S=1-I\\times min(R, G, B)$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO:\n", "# I = \n", "# S = " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Apply the seen algorithms to segment the image in the Intensity-Saturation representation, comment the results and check if you find something on mars' surface. Be careful with hdbscan and memory errors for some parameters choice (use algorithm='boruvka_kdtree')." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Color compression\n", "\n", "One interesting application of clustering is in color compression within images. \n", "For example, imagine you have an image with millions of colors.\n", "In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors.\n", "\n", "Get a simplified 10-colored version of the following image by applying k-means. Plot both images." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_sample_image\n", "china = load_sample_image(\"china.jpg\")\n", "\n", "fig = plt.figure(figsize=(8, 6))\n", "plt.imshow(china);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }