{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 1: Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to the advanced Machine Learning Course.\n", "\n", "The objective of this lab session is to code a few regression algorithms and to apply them to synthetic and real data.\n", "\n", "We begin with the standard imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Linear Regression\n", "\n", "We will start with the most familiar linear regression, a straight-line fit to data.\n", "A straight-line fit is a model of the form\n", "$$\n", "y = ax + b\n", "$$\n", "where $a$ is commonly known as the *slope*, and $b$ is commonly known as the *intercept*.\n", "\n", "Consider the following data, which is scattered about a line with a slope of 2 and an intercept of -5:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rng = np.random.RandomState(1)\n", "x = rng.rand(30)\n", "y = 2 * x - 5 + 0.1* rng.randn(30)\n", "plt.scatter(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill in the MultivariateLinearRegression class whose method fit takes a matrix $X$ and an array $y$ as input and returns an array of coefficients" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MultivariateLinearRegression():\n", " # Class for linear regression solving least-squares:\n", "\n", " def __init__(self,):\n", " self.coef_ = None\n", " \n", " def fit(self, X, y):\n", " \"\"\" Fit the data (X, y).\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " y: (num_sampes, ) np.array\n", " Output vector\n", " \n", " Note:\n", " -----\n", " Updates self.coef_\n", " \"\"\"\n", " # TODO :\n", " # Create a (num_samples, num_features+1) np.array X_aug whose first column \n", " # is a column of all ones (so as to fit an intercept).\n", " \n", " # Update self.coef_\n", " \n", " def predict(self, X):\n", " \"\"\" Make predictions for data X.\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " \n", " Returns:\n", " -----\n", " y_pred: (num_samples, ) np.array\n", " Predictions\n", " \"\"\"\n", " # TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try your model on the data and plot the data points and the fitted line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the scope and the intercept:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Model slope: \")\n", "print(\"Model intercept:\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the results are very close to the inputs, as we might hope.\n", "\n", "Of course our linear regression estimator is much more capable than this, however—in addition to simple straight-line fits, it can also handle multidimensional linear models of the form\n", "$$\n", "y = a_0 + a_1 x_1 + a_2 x_2 + \\cdots\n", "$$\n", "where there are multiple $x$ values.\n", "Geometrically, this is akin to fitting a plane to points in three dimensions, or fitting a hyper-plane to points in higher dimensions.\n", "\n", "The multidimensional nature of such regressions makes them more difficult to visualize, but we can see one of these fits in action by building a toy example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rng = np.random.RandomState(1)\n", "X = 3 * rng.rand(100, 3)\n", "y = 0.5 + np.dot(X, [1.5, -2., 1.])\n", "\n", "#model.fit(X, y)\n", "#print(model.coef_) # perfect fitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the $y$ data is constructed from three random $x$ values, and the linear regression recovers the coefficients used to construct the data.\n", "\n", "In this way, we fit lines, planes, or hyperplanes to our data.\n", "It still appears that this approach would be limited to strictly linear relationships between variables, but it turns out we can relax this as well.\n", "\n", "## Basis Function Regression\n", "\n", "One trick you can use to adapt linear regression to nonlinear relationships between variables is to transform the data according to *basis functions*.\n", "\n", "The idea is to take our multidimensional linear model:\n", "$$\n", "y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \\cdots\n", "$$\n", "and build the $x_1, x_2, x_3,$ and so on, from our single-dimensional input $x$.\n", "That is, we let $x_n = f_n(x)$, where $f_n()$ is some function that transforms our data.\n", "\n", "For example, if $f_n(x) = x^n$, our model becomes a polynomial regression:\n", "$$\n", "y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \\cdots\n", "$$\n", "Notice that this is *still a linear model*—the linearity refers to the fact that the coefficients $a_n$ never multiply or divide each other.\n", "What we have effectively done is taken our one-dimensional $x$ values and projected them into a higher dimension, so that a linear fit can fit more complicated relationships between $x$ and $y$.\n", "\n", "### Polynomial basis functions\n", "\n", "This polynomial projection is useful enough that it is built into Scikit-Learn, using the ``PolynomialFeatures`` transformer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "x = np.array([2, 3, 4])\n", "poly = PolynomialFeatures(5, include_bias=False) # with or without intercept\n", "poly.fit_transform(x[:, None])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see here that the transformer has converted our one-dimensional array into a three-dimensional array by taking the exponent of each value.\n", "This new, higher-dimensional data representation can then be plugged into a linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this transform, we can use the linear model to fit much more complicated relationships between $x$ and $y$. \n", "For example, here is a sine wave with noise:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rng = np.random.RandomState(1)\n", "x = rng.rand(50)\n", "y = 2 * np.sin(1.8*np.pi*x) + 0.1 * rng.randn(50)\n", "\n", "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try with different maximum degrees. Our linear model can provide an excellent fit to this non-linear data!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bonus:** How can we avoid overfitting?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regularization\n", "\n", "The introduction of basis functions into our linear regression makes the model much more flexible, but it also can very quickly lead to over-fitting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data projected to the 30-dimensional basis, the model has far too much flexibility and goes to extreme values between locations where it is constrained by data.\n", "We can see the reason for this if we plot the coefficients of the Gaussian bases with respect to their locations:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ridge regression ($L_2$ Regularization)\n", "\n", "Perhaps the most common form of regularization is known as *ridge regression* or $L_2$ *regularization*, sometimes also called *Tikhonov regularization*.\n", "This proceeds by penalizing the sum of squares (2-norms) of the model coefficients; in this case, the penalty on the model fit would be \n", "$$\n", "P = \\alpha\\sum_{n=1}^N \\theta_n^2\n", "$$\n", "where $\\alpha$ is a free parameter that controls the strength of the penalty." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Fill in the following class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class RidgeRegularization():\n", " # Class for ridge regression with closed form equation:\n", "\n", " def __init__(self, alpha):\n", " self.coef_ = None\n", " self.alpha_ = alpha\n", " \n", " def fit(self, X, y):\n", " \"\"\" Fit the data (X, y).\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " y: (num_sampes, ) np.array\n", " Output vector\n", " \n", " Note:\n", " -----\n", " Updates self.coef_\n", " \"\"\"\n", " # TODO:\n", " # Create a (num_samples, num_features+1) np.array X_aug whose first column \n", " # is a column of all ones (so as to fit an intercept).\n", " \n", " # Update self.coef_ adding the shrinkage ridge term\n", " \n", " def predict(self, X):\n", " \"\"\" Make predictions for data X.\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " \n", " Returns:\n", " -----\n", " y_pred: (num_samples, ) np.array\n", " Predictions\n", " \"\"\"\n", " # TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the model in our data. Plot the coefficients of the regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The $\\alpha$ parameter is essentially a knob controlling the complexity of the resulting model.\n", "In the limit $\\alpha \\to 0$, we recover the standard linear regression result; in the limit $\\alpha \\to \\infty$, all model responses will be suppressed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bonus:** How can we choose the $\\alpha$ parameter?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lasso regression ($L_1$ Regularization)\n", "\n", "Another very common type of regularization is known as lasso, and involves penalizing the sum of absolute values (1-norms) of regression coefficients: $$\n", "P = \\alpha\\sum_{n=1}^N |\\theta_n|\n", "$$ Though this is conceptually very similar to ridge regression, the results can differ surprisingly: for example, due to geometric reasons lasso regression tends to favor sparse models where possible: that is, it preferentially sets model coefficients to exactly zero.\n", "\n", "We can see this behavior in duplicating the ridge regression figure, but using L1-normalized coefficients.\n", "\n", "First, fill in the following class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class LassoRegularization():\n", " # Class for lasso regression with soft thresholding:\n", "\n", " def __init__(self, alpha, learning_rate=0.01, iterations=1000):\n", " self.coef_ = None\n", " self.alpha = alpha\n", " self.learning_rate_ = learning_rate\n", " self.iterations_ = iterations\n", " \n", " def soft_threshold(self, alpha):\n", " \"\"\" Soft threshold function\"\"\"\n", " #TODO\n", " \n", " def fit(self, X, y):\n", " \"\"\" Fit the data (X, y).\n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " y: (num_sampes, ) np.array\n", " Output vector\n", " \n", " Note:\n", " -----\n", " Updates self.coef_\n", " \"\"\"\n", " # TODO:\n", " # Update self.coef_ by using the coordinate soft thresholding algorithm\n", "\n", " def predict(self, X):\n", " \"\"\" Make predictions for data X.\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " \n", " Returns:\n", " -----\n", " y_pred: (num_samples, ) np.array\n", " Predictions\n", " \"\"\"\n", " # TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try the model in our data. Plot the coefficients of the regression and compare them to the Ridge's coefficients." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Robust regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear least-squares estimates can behave badly when the error distribution is not normal, particularly when\n", "the errors are heavy-tailed. One remedy is to remove influential observations from the least-squares fit. Another approach, termed robust regression, is to employ a fitting criterion that is not as vulnerable as least squares to unusual data.\n", "\n", "The most common general method of robust regression is M-estimation, introduced by Huber (1964).\n", "\n", "Fill in the following class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class RobustRegression():\n", " # Class for robust linear regression:\n", "\n", " def __init__(self, potential, k):\n", " self.coef_ = None\n", " self.potencial_ = potential\n", " self.k_ = k\n", " \n", " def mad(self, x):\n", " \"\"\" Calculate mad.\"\"\"\n", " #TODO\n", " \n", " def weight_function(self, x, potential, k):\n", " \"\"\" Calculate weigth of point residual x.\n", " \n", " Parameters:\n", " -----------\n", " x: standarize by mad residual\n", " potential: name of the potential to use:\n", " \"huber\" or \"bisquare\"\n", " k: parameter of the potential function\n", " \n", " Returns:\n", " -----\n", " weight: weight corresponding to x \n", " \"\"\"\n", " #TODO\n", " \n", " def fit(self, X, y):\n", " \"\"\" Fit the data (X, y).\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " y: (num_sampes, ) np.array\n", " Output vector\n", " \n", " Note:\n", " -----\n", " Updates self.coef_\n", " \"\"\"\n", " #TODO:\n", " \n", " # Create a (num_samples, num_features+1) np.array X_aug whose first column \n", " # is a column of all ones (so as to fit an intercept).\n", " \n", " # Start with initial coefficients \n", "\n", " # Iteratively update coefficients by weighted least-squares until convergence\n", " \n", " def predict(self, X):\n", " \"\"\" Make predictions for data X.\n", " \n", " Parameters:\n", " -----------\n", " X: (num_samples, num_features) np.array\n", " Design matrix\n", " \n", " Returns:\n", " -----\n", " y_pred: (num_samples, ) np.array\n", " Predictions\n", " \"\"\"\n", " #TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try it in the following data with outliers coming from the heavy-tail error and compare with the performance of the other models:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(300)\n", "rng = np.random.RandomState(1)\n", "x = rng.rand(30)\n", "y = 2 * x - 5 + 0.1* np.random.standard_cauchy(30)\n", "plt.scatter(x, y);\n", "\n", "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: Predicting Bicycle Traffic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example, let's take a look at whether we can predict the number of bicycle trips across Seattle's Fremont Bridge based on weather, season, and other factors.\n", "\n", "In this section, we joinned the bike data with another dataset, and try to determine the extent to which weather and seasonal factors—temperature, precipitation, and daylight hours—affect the volume of bicycle traffic through this corridor.\n", "\n", "As you may now, we should use time series techniques to analyze this dataset, instead, as a first simple approach, we will perform a multivariate linear regression to relate weather and other information to bicycle counts, in order to estimate how a change in any one of these parameters affects the number of riders on a given day.\n", "\n", "Let's start by loading the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "daily = pd.read_csv('data.csv', index_col='Date', parse_dates=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this in place, we can choose the columns to use, and fit a linear regression model to our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO:\n", "# Drop any rows with null values\n", "# Apply the previous algorithms to fit the number of bicycles\n", "# Save it in the daily dataframe in a 'predicted' column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can compare the total and predicted bicycle traffic visually:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#daily[['Total', 'predicted']].plot(alpha=0.5);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is evident that we have missed some key features, especially during the summer time.\n", "Either our features are not complete (i.e., people decide whether to ride to work based on more than just these) or there are some nonlinear relationships that we have failed to take into account (e.g., perhaps people ride less at both high and low temperatures)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }