{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 7: Non-negative Matrix Factorization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this lab session is to code an NMF algorithm and use it in some applications.\n", "\n", "You have to send the filled notebook named **\"L7_familyname1_familyname2.ipynb\"** (groups of 2) by email to aml.centralesupelec.2019@gmail.com before 23:59 on December 5, 2018 and put **\"AML-L7\"** in the subject. \n", "\n", "We begin with the standard imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "sns.set_context('poster')\n", "sns.set_color_codes()\n", "plot_kwds = {'alpha' : 0.25, 's' : 80, 'linewidths':0}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NMF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Non-negative Matrix Factorization is a model where a matrix V is factorized into two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to interpret.\n", "\n", "Fill in the following class, which implements NMF by multiplicative updates using either the Frobenius norm or the Kullback-Leibler divergence as the loss function (implement both). You can add more methods if needed. Try 10 different random initializations and keep the one with the lowest final loss."
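, "\n", "\n", "As a reminder, and assuming the convention $V \\approx WH$ with all entries non-negative, the standard multiplicative updates of Lee and Seung can be sketched as follows. Here $\\odot$ and the fraction bar denote element-wise product and division, and $\\mathbf{1}$ is an all-ones matrix with the same shape as $V$ (this notation is introduced here for illustration and is not part of the lab text).\n", "\n", "For the Frobenius loss $\\lVert V - WH \\rVert_F^2$:\n", "\n", "$$H \\leftarrow H \\odot \\frac{W^\\top V}{W^\\top W H}, \\qquad W \\leftarrow W \\odot \\frac{V H^\\top}{W H H^\\top}$$\n", "\n", "For the (generalized) Kullback-Leibler divergence $D(V \\parallel WH)$:\n", "\n", "$$H \\leftarrow H \\odot \\frac{W^\\top \\frac{V}{WH}}{W^\\top \\mathbf{1}}, \\qquad W \\leftarrow W \\odot \\frac{\\frac{V}{WH} H^\\top}{\\mathbf{1} H^\\top}$$\n", "\n", "In practice, a small positive constant is usually added to the denominators to avoid division by zero."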
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class my_NMF():\n", "    \n", "    def __init__(self, n_components, loss, epsilon, max_iter = 60):\n", "        '''\n", "        Attributes:\n", "        \n", "        n_components_ : integer\n", "            the unknown dimension of W and H\n", "        max_iter_: integer\n", "            maximum number of iterations\n", "        epsilon_: float\n", "            convergence threshold\n", "        loss_ : {\"Frobenius\", \"KL\"}\n", "        W_: np.array\n", "            W Matrix factor\n", "        H_: np.array\n", "            H Matrix factor\n", "        '''\n", "        self.n_components_ = n_components\n", "        self.max_iter_ = max_iter\n", "        self.loss_ = loss\n", "        self.epsilon_ = epsilon\n", "        self.W_ = None\n", "        self.H_ = None\n", "    \n", "    def fit_transform(self, X):\n", "        \"\"\" Find the factor matrices W and H\n", "        \n", "        Parameters:\n", "        -----------\n", "        X: (n, p) np.array\n", "            Data matrix\n", "        \n", "        Returns:\n", "        --------\n", "        self\n", "        \"\"\"\n", "        # TODO:\n", "        # initialize both matrices\n", "        # random(0, 1)\n", "\n", "        # While not(convergence):\n", "            # Update W\n", "            # Update H\n", "        \n", "        # Return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bonus (not graded)**: Implement the regularized version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applications" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first application you are going to analyse the following data to give an interpretation of the factorization:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_olivetti_faces\n", "\n", "dataset = fetch_olivetti_faces(shuffle=True)\n", "\n", "faces = dataset.data\n", "image_shape = (64, 64)\n", "\n", "n_samples, n_features = faces.shape\n", "\n", "def plot_faces(title, images, image_shape, n_col=5, n_row=5, cmap=plt.cm.gray):\n", "    
plt.figure(figsize=(2. * n_col, 2.26 * n_row))\n", "    plt.suptitle(title, size=16)\n", "    for i, comp in enumerate(images):\n", "        plt.subplot(n_row, n_col, i + 1)\n", "        vmax = max(comp.max(), -comp.min())\n", "        plt.imshow(comp.reshape(image_shape), cmap=cmap,\n", "                   interpolation='nearest',\n", "                   vmin=-vmax, vmax=vmax)\n", "        plt.xticks(())\n", "        plt.yticks(())\n", "    plt.subplots_adjust(left=0.01, bottom=0.05, right=0.99, top=0.93, wspace=0.04, hspace=0.)\n", "\n", "plot_faces(\"Some faces\", faces[:25], image_shape)\n", "\n", "faces.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply your NMF algorithm to this dataset and plot the reconstructed (approximated) face pictures." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot the $W$ matrix as images in a $(\\sqrt{r}, \\sqrt{r})$ grid, where $r$ is the number of components.\n", "- Choose one face, plot its corresponding weights (in $H$) in a grid and explain the interpretation of both factor matrices." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Second application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the 20newsgroups dataset (`from sklearn.datasets import fetch_20newsgroups_vectorized`), which contains a collection of ~18,000 newsgroup documents from 20 different newsgroups.\n", "\n", "Model the topics present in a subsample with NMF. Print the most common words of each topic."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }