{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 4-6: Mixture Models + Model Order Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this lab session is to study mixture models. In the first part, you will code the EM algorithm to estimate the parameters of a GMM given the number of mixture components; in the second part, you will try different model order selection methods. Submit a single notebook covering both parts.\n", "\n", "You have to send the completed notebook named **\"L4_6_familyname1_familyname2.ipynb\"** (groups of 2) by email to aml.centralesupelec.2019@gmail.com before November 28 at 23:59 and put **\"AML-L4-6\"** in the subject. \n", "\n", "We begin with the standard imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "sns.set_context('poster')\n", "sns.set_color_codes()\n", "plot_kwds = {'alpha': 0.25, 's': 80, 'linewidths': 0}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use two toy datasets to try the different methods:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Once those parameters are estimated, we obtain an estimate of the distribution of our data. For the clustering task, one can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians. 
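\n\nAs a reminder, and assuming the usual notation (which this notebook does not fix; the lecture may use different symbols): with mixing weights $\\pi_k$, means $\\mu_k$ and covariances $\\Sigma_k$, the model density is $p(x) = \\sum_{k=1}^{K} \\pi_k\\,\\mathcal{N}(x \\mid \\mu_k, \\Sigma_k)$. One EM iteration can then be sketched as an E-step that computes the conditional probabilities, followed by an M-step that updates the parameters:\n\n$$\\gamma_{ik} = \\frac{\\pi_k\\,\\mathcal{N}(x_i \\mid \\mu_k, \\Sigma_k)}{\\sum_{j=1}^{K} \\pi_j\\,\\mathcal{N}(x_i \\mid \\mu_j, \\Sigma_j)}, \\qquad n_k = \\sum_{i=1}^{n} \\gamma_{ik},$$\n\n$$\\pi_k = \\frac{n_k}{n}, \\qquad \\mu_k = \\frac{1}{n_k} \\sum_{i=1}^{n} \\gamma_{ik}\\,x_i, \\qquad \\Sigma_k = \\frac{1}{n_k} \\sum_{i=1}^{n} \\gamma_{ik}\\,(x_i - \\mu_k)(x_i - \\mu_k)^\\top.$$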
\n", "\n", "### First part\n", "\n", "Fill in the following class to implement a multivariate GMM:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class my_GMM():\n", "    \n", "    def __init__(self, k):\n", "        '''\n", "        Parameters:\n", "        k: integer\n", "            number of components\n", "        \n", "        Attributes:\n", "        \n", "        mu_: np.array\n", "            array containing means\n", "        Sigma_: np.array\n", "            array containing covariance matrices\n", "        cond_prob_: (n, K) np.array\n", "            conditional probabilities for all data points \n", "        labels_: (n, ) np.array\n", "            labels for data points\n", "        '''\n", "        self.k = k\n", "        self.mu_ = None\n", "        self.Sigma_ = None\n", "        self.cond_prob_ = None\n", "        self.labels_ = None\n", "    \n", "    def fit(self, X):\n", "        \"\"\" Find the parameters mu_ and Sigma_\n", "        that best fit the data\n", "        \n", "        Parameters:\n", "        -----------\n", "        X: (n, p) np.array\n", "            Data matrix\n", "        \n", "        Returns:\n", "        -----\n", "        self\n", "        \"\"\"\n", "        def compute_condition_prob_matrix(X, mu, Sigma):\n", "            '''Compute the conditional probability matrix \n", "            shape: (n, K)\n", "            '''\n", "        \n", "        # TODO:\n", "        # initialize the parameters\n", "        # apply sklearn kmeans or randomly initialize them\n", "        \n", "        # While not(convergence)\n", "        #     Compute conditional probability matrix\n", "        #     Update parameters\n", "        \n", "        # Update labels_\n", "        \n", "        # Return self\n", "    \n", "    def predict(self, X):\n", "        \"\"\" Predict labels for X\n", "        \n", "        Parameters:\n", "        -----------\n", "        X: (n, p) np.array\n", "            New data matrix\n", "        \n", "        Returns:\n", "        -----\n", "        label assignment \n", "        \"\"\"\n", "        # TODO\n", "    \n", "    def compute_proba(self, X):\n", "        \"\"\" Compute probability vector for X\n", "        \n", "        Parameters:\n", "        -----------\n", "        X: (n, p) np.array\n", "            New data matrix\n", "        \n", "        Returns:\n", "        -----\n", "        proba: (n, k) np.array \n", "        \"\"\"\n", "        # TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate 
your own mixture of Gaussian distributions to test the model, choosing parameters so that the GMM performs better than K-Means on it. Use `np.random.multivariate_normal`. \n", "\n", "Plot the data with colors representing predicted labels and marker shapes representing true labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bonus (not graded): Implement a mixture of asymmetric generalized Gaussians (AGGD)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Second part\n", " \n", "- Implement the information criteria from the lecture (AIC, BIC, etc.) to select the number of clusters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Implement the merge criteria \n", "    - Correlation coefficients\n", "    - Measuring Error \n", "    - Comparing the parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Implement cross-validation " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the model selection criteria to choose the number of clusters for the two given datasets (data-MM-i.csv). Compare the results and the computation times. Try to validate your results visually."
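, "\n", "\n", "As a reminder, and assuming the standard textbook definitions (the lecture may use a different sign or scaling convention, in which case follow the lecture), the two most common information criteria for a model with $K$ components can be written as follows. Here $\\hat{L}_K$ denotes the maximized likelihood and $d_K$ the number of free parameters; both symbols are introduced for this sketch only.\n", "\n", "$$\\mathrm{AIC}(K) = -2\\,\\ln \\hat{L}_K + 2\\,d_K, \\qquad \\mathrm{BIC}(K) = -2\\,\\ln \\hat{L}_K + d_K\\,\\ln n.$$\n", "\n", "For a GMM with full covariance matrices fitted to an $(n, p)$ data matrix, the parameter count is $K-1$ mixing weights, $Kp$ mean coordinates and $K\\,p(p+1)/2$ covariance entries:\n", "\n", "$$d_K = (K-1) + Kp + K\\,\\frac{p(p+1)}{2}.$$\n", "\n", "With these conventions, the selected $K$ is the one that minimizes the criterion."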
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are going to work with the following data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape: (1797, 64)\n" ] } ], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "print(\"shape:\", digits.data.shape)\n", "\n", "def plot_digits(data):\n", "    fig, ax = plt.subplots(10, 10, figsize=(8, 8),\n", "                           subplot_kw=dict(xticks=[], yticks=[]))\n", "    fig.subplots_adjust(hspace=0.05, wspace=0.05)\n", "    for i, axi in enumerate(ax.flat):\n", "        im = axi.imshow(data[i].reshape(8, 8), cmap='binary')\n", "        im.set_clim(0, 16)\n", "plot_digits(digits.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model the data with your GMM class, choosing the number of components with a model order selection method, and sample from the fitted model to produce new synthetic handwritten digits. Explain why you chose that model selection method in this case. You should first use PCA to reduce the dimension, as GMMs do not perform well in high-dimensional settings. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }