Textbook

Category: Technical



To Read:

  1. The Matrix Cookbook (Petersen and Pedersen, 2006)
  2. Linear Algebra (Shilov 1977)
  3. Probability Theory, The Logic of Science (Jaynes 2003)
  4. Pearl (1988)
  5. Ramsey (1926) - Similar axioms for Bayesians, frequentists
  6. Machine Learning Murphy 2012
  7. Cover and Thomas (2006) Information Theory
  8. MacKay (2003) Information Theory
  9. Convex Optimization - Boyd and Vandenberghe (2004)
  10. Convex Optimization - Rockafellar (1997)
  11. Second order optimization algorithms - Nocedal and Wright (2006)
  12. Machine Learning - Bishop (2006)
  13. No Free Lunch Theorem (Wolpert, 1996)
  14. Non-local generalization (Bengio and Monperrus, 2005, Bengio et al., 2006c)
  15. Universal Approximation Theorem (Hornik et al., 1989; Leshno et al., 1993)
  16. Network representation power exponential in depth (Montufar et al., 2014)
  17. Sequence Learning with Recurrent Neural Networks (Graves 2012)

Introduction

Change - In some cases, a mathematical equation is presented before the variables are defined. This can make it hard to know that I need to skip ahead to understand an equation until I’m halfway through it.

The idea is that information is nicely structured as a hierarchy of concepts, and learning that hierarchy would be ideal.

It’s fascinating that this is sitting in a textbook, and is the first intuitive point given. I currently think that the hierarchical abstraction hypothesis is wrong, but I’m open to considering both sides of the evidence.

I’m also sad that they don’t express the arbitrariness of the functions that can be learned by these algorithms - they’ll say an algorithm can accomplish a particular task, instead of going through the intermediate representation of classification. It makes it sound like ML algorithms are hard coded for particular tasks.

Cartesian vs. Polar coordinates as examples of different ways to represent information is a glorious, eye-opening example.

ReLUs are a step away from neural realism (spiking), but neural realism hasn’t led to an improvement in performance.

Linear Algebra

Inversion

Matrix inversion is presented in two contexts. The first is a definition: the inverse A^-1 is the matrix that yields the identity when multiplied with the original. The second is in application, as a way to solve Ax = b with x = A^-1 b, which has been used as motivation prior. There is a recommendation against using the inverse to solve such a problem in practical application, but it’s not an argument from complexity - the reason is that the inverse matrix can only be represented with limited numerical precision on a machine, and algorithms that take advantage of the specific b will be more accurate.
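A quick NumPy sketch (my own, not from the book) of why you’d call a solver instead of forming the inverse explicitly:

```python
import numpy as np

# A small, well-conditioned system (made-up values).
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

# Preferred: solve Ax = b directly, exploiting b (LU factorization under the hood).
x_solve = np.linalg.solve(A, b)

# Discouraged in practice: form A^-1 explicitly, then multiply.
x_inv = np.linalg.inv(A) @ b

# Both agree on this easy system; solve() is the more accurate route in general.
residual = np.linalg.norm(A @ x_solve - b)
```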

Linear Dependence and Span

There is a column space that’s defined by the columns of the matrix. Multiplied by arbitrary coefficients, the set of all possible outcomes is the column space / span of the matrix. If any two columns point in the same direction, they are linearly dependent (one can be obtained by multiplying the other by a scalar) and the redundant direction doesn’t contribute to the matrix’s span / column space. This can leave the column space as a proper subspace of R^m (m being the number of rows), meaning that a solution to Ax = b may not exist for some b. It’s essential that the columns of A span R^m if you want to guarantee a solution for every possible b.

The content on inversion applies only to square matrices - A times A^-1 has to give a square identity - and for non-square matrices there’s the Moore-Penrose pseudoinverse (covered later), so I should re-read with this in mind.

Norms

A norm measures the size of a vector. The general L^p family has the form ||x||_p = (sum_i |x_i|^p)^(1/p): the absolute value of each component raised to the power p, summed, with the whole thing taken to the 1/p to undo the exponent. Common is the squared L^2 norm, x^Tx, whose derivative with respect to each component depends only on that component (and not the rest of the vector). The L^2 norm (and squaring especially) takes a value that’s less than 1 and makes it even smaller, so it can become hard to differentiate between quite small values and actual 0. In this situation we have the L^1 norm, the sum of the absolute values of the elements. There are three criteria for norms: the norm must be zero only for the zero vector, the norm of the sum of two vectors must be <= the sum of their norms (the triangle inequality), and multiplying a vector by a scalar must scale the norm by the absolute value of that scalar. Often people reference the count of non-zero values in a vector as an “L^0 norm”, but it violates that third condition, so it isn’t a norm. There’s a max (L^infinity) norm that equals the largest absolute value in the vector. For matrices, we have the Frobenius norm: the square root of the sum of every element squared. It’s possible to use norms (and the cosine of the angle between vectors) to compute the dot product: x^Ty = ||x|| ||y|| cos(theta).
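A small NumPy sketch of these norms (my example values, not the book’s):

```python
import numpy as np

x = np.array([0.5, -0.2, 0.0, 3.0])

def lp_norm(v, p):
    # General Lp norm: (sum_i |v_i|^p)^(1/p)
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

l1 = lp_norm(x, 1)            # sum of absolute values
l2 = lp_norm(x, 2)            # Euclidean length
l2_squared = float(x @ x)     # derivative w.r.t. x_i is just 2*x_i
linf = np.max(np.abs(x))      # max (L-infinity) norm

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
frobenius = np.sqrt(np.sum(A ** 2))   # matrix analogue of the L2 norm
```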

Special Kinds of Matrices and Vectors

Diagonal matrices are nice because their product with a vector is easy to compute (element-wise scaling) and because their inverse is easy to compute: the inverse of diag(v) is diag(1/v), which exists whenever every entry of v is non-zero. It’s also possible to cheaply multiply by a non-square diagonal matrix, either adding zeros or dropping terms to account for the shape, though only square matrices are invertible. There are unit vectors, whose L^2 norm equals 1; orthogonal vectors, whose dot product is zero; and orthonormal vectors, which are orthogonal and each unit length. There are also orthogonal matrices, where each column vector is orthonormal to each other column vector, and the same is true for the rows. The inverse of an orthogonal matrix is its transpose.

Eigendecomposition

It’s useful to break mathematical objects into constituent parts. Matrices can be broken into eigenvectors and corresponding eigenvalues. Eigenvectors are vectors which, when multiplied by the matrix, change only in scale and not in direction; the change in scale is the eigenvalue for that eigenvector. For a real symmetric matrix we can construct an eigendecomposition A = Q diag(lambda) Q^T: an orthogonal matrix Q of eigenvectors (columns conventionally in descending order of eigenvalue) multiplied by a diagonal matrix of eigenvalues multiplied by the inverse of Q, which by orthogonality is just its transpose. This decomposition can tell us whether a matrix is singular (it is exactly when it has an eigenvalue of 0). The matrix is called positive definite if all eigenvalues are positive and negative definite if all eigenvalues are negative. It’s called positive semidefinite if all eigenvalues are non-negative; if any eigenvalue is exactly 0, the matrix is also singular.
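Sketching the decomposition in NumPy for a small symmetric matrix (my own toy example):

```python
import numpy as np

# A real symmetric matrix, so A = Q diag(w) Q^T exists with orthogonal Q.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric/Hermitian matrices; eigenvalues come back ascending.
w, Q = np.linalg.eigh(A)

# Reconstruct A from its eigenvectors and eigenvalues.
A_rebuilt = Q @ np.diag(w) @ Q.T

# Definiteness can be read off the eigenvalue signs.
is_positive_definite = bool(np.all(w > 0))
```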

Singular Value Decomposition

Struck gold here. The standard eigendecomposition doesn’t generalize to non-square matrices, but the SVD does, and it maintains many of the important properties. It decomposes a matrix as A = UDV^T: U holds the eigenvectors of AA^T, D is a diagonal (possibly non-square, m x n) matrix of the singular values of A, which are the square roots of the eigenvalues of A^TA, and V holds the eigenvectors of A^TA. The columns of U are the left singular vectors and the columns of V the right singular vectors. They are not eigenvectors of A itself, but of AA^T and A^TA.
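A NumPy check of these SVD relationships (my own toy matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # non-square, so no eigendecomposition of A

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruction: A = U diag(s) V^T
A_rebuilt = (U * s) @ Vt

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals_desc = np.linalg.eigvalsh(A.T @ A)[::-1]   # flip to descending order
```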

Moore-Penrose Pseudoinverse

It’s often not possible to solve for an inverse. The pseudoinverse retains many of the properties we care about, and can be computed using singular value decomposition. It’s defined as the limit, as a constant alpha shrinks to 0, of (A^TA + alpha*I)^-1 A^T - the inverse of A^TA plus a small constant times the appropriately sized identity, multiplied by A^T. Usually instead of computing that limit, it’s found from the SVD as A+ = VD+U^T, where D+ is the pseudoinverse of D, obtained by taking the reciprocal of each non-zero diagonal entry and transposing the result; V and U also come from the SVD of A.
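Building the pseudoinverse from the SVD by hand in NumPy (my own sketch; NumPy’s pinv is itself SVD-based):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))   # tall: more equations than unknowns

# A+ = V D+ U^T, where D+ takes reciprocals of the nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# For a full-column-rank tall matrix, A+ is a left inverse: A+ A = I.
left_product = A_pinv @ A
```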

Trace

The trace is the sum of the values on the matrix’s diagonal. The trace has a number of awesome properties. It’s possible to take a situation where you would have had to use summation (like the Frobenius norm) and couch it in terms of matrix multiplication and the trace operator: crazily, the Frobenius norm is sqrt(Tr(AA^T)). The trace is also invariant to cyclic permutations of a matrix product, even if the intermediate products are differently sized: Tr(ABC) = Tr(CAB) = Tr(BCA). So while matrix multiplication doesn’t commute, the trace of a product survives rotating the factors.
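Verifying both properties numerically (my own sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

# Cyclic invariance: the three products have shapes 2x2, 4x4, 3x3,
# yet their traces agree.
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)

# Frobenius norm via the trace: ||A||_F = sqrt(Tr(A A^T))
fro = np.sqrt(np.trace(A @ A.T))
```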

The Determinant

The determinant is the product of the eigenvalues of a matrix. It’s a real-valued scalar that represents how much a space is contracted or expanded by the transformation. If the determinant is 0 (at least one eigenvalue is zero), the space collapses completely along at least one axis, losing all its volume. If the determinant is 1, the transformation preserves volume. It’s likely useful for avoiding exploding gradients, if determinants can be kept close to 1.

Principal Components Analysis

Let’s find a compressed representation of the data using matrix multiplication. We want an encoding to a lower dimension and a decoding g(c) = Dc from the lower dimension back to the original space that minimizes the distance between the decoded value and the original value. (Also, it feels like there are two flavors of optimization in ML - gradients and distances.) We constrain the decoder matrix D to have columns of unit norm that are orthogonal to one another.

We can find the minimum-distance code by representing the squared L^2 distance as a matrix operation. Distributing the computation, one term can be omitted because it doesn’t depend on c. The orthonormal-columns constraint means D^TD is the identity; taking a derivative using matrix calculus and setting it to zero gives the code that yields minimal distance, c = D^Tx. So we can encode the lower-dimensional representation with f(x) = D^Tx, and the reconstruction is DD^Tx.

We then have to minimize the distance over all points, which we can see as minimizing the Frobenius norm of the error matrix between X and its reconstruction. Working one dimension at a time, the reconstruction error for a point is x - dd^Tx, which can be ‘cosmetically’ rearranged as x - d^Tx d because d^Tx is a scalar. We use the trace representation of the Frobenius norm to get a pure matrix-multiplication objective, distribute the transpose times itself, and eliminate a term that doesn’t depend on d (which we’re minimizing over). Then we rearrange terms using the trace’s invariance to cyclic permutation, and reintroduce the d^Td = 1 constraint to merge a few terms. We end up taking the argmax of d^TX^TXd subject to d^Td = 1, which can be solved with an eigendecomposition: the optimal d is the eigenvector of X^TX corresponding to the largest eigenvalue, and more generally the matrix D is given by the eigenvectors corresponding to the largest eigenvalues.
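A minimal PCA sketch in NumPy following this derivation (toy data of my own, rows as points, assumed centered):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: 200 points in R^3, most variance along the first axis.
X = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.2])
X = X - X.mean(axis=0)                 # PCA assumes centered data

# Optimal decoder D: eigenvectors of X^T X with the largest eigenvalues.
w, V = np.linalg.eigh(X.T @ X)         # ascending eigenvalues
D = V[:, ::-1][:, :2]                  # top-2 principal directions

codes = X @ D                          # encode: c = D^T x (row convention)
X_hat = codes @ D.T                    # decode: reconstruction D c

reconstruction_error = np.linalg.norm(X - X_hat)
```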

Probability Theory

Why Probability?

There are many situations where we need systems to be able to deal with uncertainty. There are three major sources - inherent stochasticity in a system (think quantum mechanics), incomplete information about a system, or an incomplete or approximate model of a system. All are sources of uncertainty that we would like to model. There are simpler, probabilistic rules for a system that are often nicer than complex, precise rules for a system. There’s a divide between thinking about the frequency of events as probability (which probability theory was developed for) and thinking about probabilities as degrees of belief about a system - the divide between frequentist and bayesian frameworks. It’s likely that both have the same foundational axioms.

Random Variables

A random variable can take different values based on some underlying probability distribution. The variable can be continuous or discrete.

Probability Distributions

These represent how probability is assigned to the possible values of a random variable: the probabilities sum (or integrate) to one, all values are non-negative, and the domain encompasses every possible value of the variable. A discrete probability distribution is called a probability mass function; it gives a probability for each discrete value of x (uniform over k values has each probability at 1/k, for example). A probability density function is for continuous random variables. The function integrates to 1 and denotes probability density; to get the probability over a range, integrate the density within that range.

Marginal Probability

Often we’ll want a probability distribution over a subset of possible discrete values or a subset of continuous space. We can compute this by summing up the probability mass or integrating over the density in that particular space.

Conditional Probability

We can condition on another event with P(y|x) = P(y, x) / P(x): the joint probability of both events, divided by the probability of the event being conditioned on.

Chain Rule of Conditional Probabilities

The product rule (the ability to compute the joint probability of events by multiplying) can be derived from the definition of conditional probability: P(x, y) = P(y|x) P(x). Chaining this decomposes any joint distribution into conditionals, e.g. P(x, y, z) = P(x|y, z) P(y|z) P(z).

Independence and Conditional Independence

Two variables are independent if their joint distribution factors into the product of their marginals: p(x, y) = p(x)p(y) at every pair of values. They’re conditionally independent if, when conditioning both on a third variable, the joint still factors: p(x, y | z) = p(x|z) p(y|z).

Expectation, Variance and Covariance

An expected value aggregates the possible values of a random variable weighted by their probability; the result is an average or ‘expected’ value for that random variable. Expectation is linear, so multiplying the random variable by a constant is the same as multiplying the expectation itself: E[aX] = aE[X]. Variance looks at the differences between the values of a random variable and its expected value; it squares those differences, then takes the expected value of the squares. The standard deviation is the square root of this. Covariance considers the deviation from expectation in one variable multiplied by the deviation from expectation in another variable (or in itself): Cov(X, Y) = E[(X - E[X])(Y - E[Y])]. For a random vector there’s a covariance matrix whose (i, j) entry is the covariance between components i and j, with the variances on the diagonal.
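A tiny worked example (my own numbers) for a small discrete distribution:

```python
import numpy as np

# A discrete random variable with three possible values.
values = np.array([1.0, 2.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])

expectation = np.sum(values * probs)                    # E[x]
variance = np.sum(probs * (values - expectation) ** 2)  # E[(x - E[x])^2]
std = np.sqrt(variance)

# Linearity of expectation: E[a x] = a E[x]
a = 4.0
scaled_expectation = np.sum(a * values * probs)
```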

Bernoulli Distribution

The Bernoulli distribution has a single random variable that is 0 or 1, with a single parameter phi that gives the probability of 1. P(x=1) = phi, P(x=0) = 1 - phi, E[x] = phi, Var(x) = phi(1 - phi).

Multinoulli Distribution

Multinoulli is the term given to a distribution over a set of k categories. The reason for the change in terms is that the multinomial is really the distribution over counts from n multinoulli draws - the multinoulli is the n = 1 special case. Usually the values of the multinoulli are category bins, and so don’t have numerical meaning; they’re arbitrary. That means it makes little sense to compute the expected value and variance. The multinoulli is parameterized by a vector p of k - 1 probabilities, one per class - I imagine it’s what comes out of a softmax, where there’s a probability over each class (in this case, given the datapoint). And the final probability can be framed as 1 minus the sum of the other probabilities, which can be represented as a dot product with the ones vector: 1 - 1^Tp.

Normal Distribution

This distribution naturally encodes a minimal amount of prior information about the data, and comes up consistently, in part as a result of the central limit theorem. It’s parameterized by two major params (mu and sigma^2), the mean and variance of the data. Evaluating the density requires inverting sigma^2, so for efficiency the distribution is often parameterized by the precision beta = 1/sigma^2 instead. There’s also a multivariate normal distribution parameterized by a mean vector and the covariance matrix of the data; it analogously has a more efficient parameterization in terms of the precision matrix (the inverse covariance).

Exponential and Laplace Distributions

The exponential distribution multiplies an indicator function into the density, assigning 0 probability to all negative values, and puts a sharp peak of probability at x = 0. The Laplace distribution generalizes this, letting us place a spike of probability at an arbitrary point mu.

The Dirac Distribution and Empirical Distribution

The Dirac distribution assigns probability mass via the Dirac delta, a ‘function’ defined by how it integrates; the result is a distribution with an infinitely dense spike of probability mass at a single point. The empirical distribution builds on it: it places mass 1/m on each of the m training points - a mixture of Diracs for continuous variables (for discrete variables it’s simply the observed frequency distribution).

Mixtures of Distributions

We can construct richer probability distributions by combining simpler probability distributions. The Dirac / empirical distribution is already one example of this, where a Dirac is generated for each training point. A multinoulli is used to select among a number of component distributions. Common is the Gaussian mixture model, which is a multinoulli over normal distributions. Each component has its own parameterization (mean and covariance), and the family is flexible enough to approximate any smooth distribution given enough components. Mixtures also point to the possibility of latent variables, which are unobserved but shape the variables we do observe. We can use conditional probability to get at these values: we have prior probabilities over the components in the multinoulli, and posterior probabilities over which component generated a datapoint once we observe it.
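A sketch of sampling from a Gaussian mixture via its latent multinoulli (component weights and parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# A 1-D Gaussian mixture: a multinoulli over components, each with its
# own mean and standard deviation.
weights = np.array([0.3, 0.7])   # multinoulli prior over components
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

n = 10_000
component = rng.choice(len(weights), size=n, p=weights)  # latent variable
samples = rng.normal(means[component], stds[component])

mixture_mean = weights @ means   # E[x] = sum_k w_k * mu_k
```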

Useful Properties of Common Functions

Covers the sigmoid and softplus, where the softplus is log(1 + exp(x)), a softened version of a ReLU. The sigmoid shows up because it’s a good way to generate a Bernoulli probability. Sigmoids saturate when they reach very low or very high values, becoming insensitive to small changes. Both have relationships and properties that make them nice to work with mathematically.

Bayes Rule

We can compute the probability P(y|x) from P(x|y) and P(y), deriving from conditional probability: P(y|x) = P(x|y)P(y) / P(x). The denominator P(x) can itself be computed as the sum over y of P(x|y)P(y).

Technical Details of Continuous Variables

There are issues around continuous variables and infinity, where it is possible to construct sets to draw from whose total ‘size’ exceeds 1 even though they do not intersect. These are properties of sets that take advantage of infinities, like fractal sets. Measure theory has a few ways of dealing with this and of separating out the spaces that are safe for probability theory. There’s the concept of measure zero sets (lines and points, for example, versus shapes with filled-in volume), and the concept of ‘almost everywhere’, which lets statements that hold for discrete sets carry over to continuous sets except on a set of measure zero. There are also basic principles that don’t easily hold: a simple transformation of a variable can distort space and lead to contradictory results. In that case, accounting for the determinant of the Jacobian (the matrix of partial derivatives of each output with respect to each input) is necessary to make up for the distortion in space caused by the function.

Information Theory

We have an intuition that there’s more information in events that are less likely, and no information in events that are guaranteed to happen. This is captured by the self-information of an event, its negative log probability: -log P(x). The Shannon entropy of a distribution, -E_{x~P}[log P(x)], tells us how much information a draw from it carries on average. For comparing probability distributions there’s the KL divergence, the expected difference of log probabilities: D_KL(P||Q) = E_{x~P}[log P(x) - log Q(x)] (not a true metric, since it’s asymmetric). There is also the related cross entropy, which omits the log P term: H(P, Q) = -E_{x~P}[log Q(x)].
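Checking these quantities numerically for two made-up discrete distributions, including the identity H(P, Q) = H(P) + D_KL(P||Q) and the asymmetry of KL:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

entropy_p = -np.sum(p * np.log(p))             # H(P) = -E_P[log P]
kl_pq = np.sum(p * (np.log(p) - np.log(q)))    # D_KL(P||Q) = E_P[log P - log Q]
cross_entropy = -np.sum(p * np.log(q))         # H(P,Q) = -E_P[log Q]

# KL is asymmetric: D_KL(P||Q) != D_KL(Q||P) in general.
kl_qp = np.sum(q * (np.log(q) - np.log(p)))
```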

Structured Probabilistic Models

These models capture the dependency graph between random variables. The dependencies can either be directed, in which case each variable gets a conditional distribution given its parents and the joint is computed with the product rule over those conditionals, or undirected, in which case groups of interacting variables get non-negative factors and the joint is their product normalized by dividing by a constant (the partition function).

Numerical Computation

Overflow and Underflow

Major issues around using computers are the imprecision of floating point numbers: overflow (values pushed to infinity) and underflow (small values pushed to 0). These can cause divide-by-zero errors, undefined functions, etc. The softmax is a good example of a function that needs to be modified to avoid numerical issues. It computes exp(x_i)/sum_j exp(x_j), and when any x_i is extremely large the exponential will overflow to infinity; similarly, values very far negative can underflow to 0, emptying the denominator. Subtracting max(x) from every component leaves the result mathematically unchanged but avoids the overflow. Keep these issues in mind when programming algorithms.
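A sketch of that standard stabilization (subtracting the max before exponentiating):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting max(x) leaves the result unchanged mathematically,
    # but keeps exp() from overflowing for large inputs.
    e = np.exp(x - np.max(x))
    return e / e.sum()

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive_out = softmax_naive(large)   # exp(1000) overflows: inf/inf -> nan
stable_out = softmax_stable(large)
```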

Poor Conditioning

Some functions are extremely sensitive - their output changes dramatically if there’s a small change in their input. This is a big problem if there are rounding / imprecision errors in your input, because those errors will be amplified by the function. A great example of a poorly conditioned operation is the matrix inversion used to solve least squares. Its sensitivity depends on the condition number - the ratio of the largest to the smallest eigenvalue magnitude of the matrix. When that ratio is very large, the operation will be over-sensitive to tiny changes in the input, leading to bad results.

Gradient-Based Optimization

Often it’s necessary to minimize or maximize a scalar-valued loss function when training machine learning algorithms. This can be done by greedily taking small steps in the direction that decreases the function, even if that function is non-convex. The non-convexity implies that we will have local minima that are not global minima, as well as saddle points that may be surrounded by extremely flat regions, and we will need to design optimization methods that deal well with this. Gradient steps have a learning rate. This can be set to a small constant, or one can search for the optimal step size (the one that most reduces the objective) and take that step, which is called line search (usually done by evaluating the objective at several candidate step sizes and picking the best, rather than posing a full second optimization problem - a kind of real-time adaptive learning rate). While gradients are for continuous problems, the idea of taking small steps in a good direction can be generalized to discrete spaces, which is referred to as hill climbing.

Beyond the Gradient: Jacobian and Hessian Matrices

The Jacobian of a function with multiple inputs and multiple outputs is the matrix of first partial derivatives of each output with respect to each input. The Hessian is the Jacobian of the gradient: a representation of every second partial derivative. In general the order of differentiation doesn’t matter (dxdy == dydx) wherever the second partials are continuous, so the Hessian is symmetric there. The second derivative determines the curvature of the relationship between the objective function and the input. The Hessian represents how the gradient is changing, and can be used to correct for a gradient step over- or under-shooting, since the gradient alone is only an infinitesimal approximation of the curve’s direction. Due to its symmetry, the Hessian can be decomposed into an orthogonal matrix of eigenvectors and a set of real eigenvalues; the second derivative in a direction d is d^THd, a weighted average of the eigenvalues, so the largest and smallest eigenvalues bound the curvature, with the extremes attained along the corresponding eigenvectors. We can use a Taylor series approximation of the function at a point to see how the curvature will affect the quality of our gradient step. That approximation works well for small step sizes, and it also lets us solve for the optimal step size (given the Taylor approximation). We can also use the second derivative to test whether we’re at a saddle point or a local minimum, and this test adapts to the Hessian: if the Hessian is positive definite (all positive eigenvalues) we’re at a local minimum; if it’s negative definite (all negative eigenvalues) we’re at a local maximum; if there’s a zero eigenvalue (mere semidefiniteness) the test is inconclusive; but if there’s both a positive and a negative eigenvalue, we’re at a saddle point - a local minimum in one direction and a local maximum in another.
The Hessian has a condition number that measures the relative curvature - cases where the direction of most curvature has 5x more curvature than the direction with least. Gradient descent can’t see this, so it performs poorly on such functions. Newton’s method consists of using a Taylor series approximation of the function and jumping directly to the minimum of the approximation using the curvature information. If the quadratic approximation isn’t close enough, we can do this repeatedly; it will converge in far fewer steps than gradient descent. We can also put constraints on the functions we optimize by requiring Lipschitz continuity, which bounds the change in the function’s output by a constant times the change in its input, guaranteeing small changes in the function for small changes in the input. Convex optimization has been extremely successful under the strong constraint that the Hessian is positive semidefinite everywhere, which rules out saddle points and makes any local minimum also a global minimum.
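A toy comparison (my own, with a made-up ill-conditioned quadratic) of gradient descent vs. Newton’s method:

```python
import numpy as np

# Poorly conditioned quadratic f(x) = 0.5 x^T H x, curvature ratio 100:1.
H = np.diag([100.0, 1.0])

def grad(x):
    return H @ x

x0 = np.array([1.0, 1.0])

# Gradient descent: the step size must respect the steepest direction,
# so progress along the flat direction is slow.
lr = 0.01
x_gd = x0.copy()
for _ in range(10):
    x_gd = x_gd - lr * grad(x_gd)

# Newton's method jumps straight to the minimum of the quadratic: x - H^-1 grad.
x_newton = x0 - np.linalg.solve(H, grad(x0))
```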

Constrained Optimization

Karush-Kuhn-Tucker conditions can frame a constrained optimization problem as an unconstrained problem in such a way that infeasible solutions will never be optimal. This is called the generalized Lagrangian, and it involves parameterizing the constraints in a nice way; it generalizes Lagrange multipliers to inequality constraints, and it’s a nice solution to constrained optimization that avoids writing constraint-specific algorithms. Inactive constraints will not change the locally optimal solution that has been reached if they are relaxed, even if they did cut off some other portion of the space.

Example: Linear Least Squares

We can solve this with gradient descent by taking the matrix derivative of the squared error and iterating until convergence. We can also use Newton’s method; since this problem is quadratic, it solves it in a single step. Introducing a constraint, we can use Karush-Kuhn-Tucker: the constraint x^Tx <= 1 enters the Lagrangian as lambda*(x^Tx - 1), and we take gradients and solve for a constrained solution.
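A sketch of the gradient-descent version on random data (my own; the fixed step size is just a value that happens to be stable for data of this scale):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

# Minimize f(x) = 0.5 ||Ax - b||^2; gradient is A^T (A x - b).
x = np.zeros(3)
lr = 0.02
for _ in range(5000):
    x = x - lr * (A.T @ (A @ x - b))

# The minimizer is the least-squares solution A+ b.
x_star = np.linalg.pinv(A) @ b
```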

Machine Learning Basics

The generic formulation of a machine learning task is that you have some Task to perform, some Performance metric for the task, and some Experience of the world. And you convert that experience into better performance on the task.

The Task, T

Machine learning algorithms can be used for classification, taking in a set of features and outputting a probability distribution over classes or outputting classes directly.

Missing feature classification comes up often. One approach is to marginalize over the missing features. My own thought here is that you should have a model for every possible set of missing features, and train with all the data that will count for that. Then you use the appropriate model.

Regression is regression.

Translation is interestingly different - you map a sequence of symbols in one language to a sequence in another.

Machine transcription allows us to go from some signal input to text. For example, image to text transcription for reading off the word or words in an image. Or speech to text, converting frequencies into words.

Structured output prediction, where the output is a vector of interdependent components. For example, scene description, where the output is a sentence that needs to be coherent, or generating a parse tree with parts of speech attached.

Anomaly detection, where you want to predict whether a datapoint fits the set of normal behaviors.

Synthesis and sampling, where new data points that resemble points coming from the original distribution are generated. In a video game, textures may be generated instead of having an artist annotate each pixel.

Imputation - missing values can be modeled and predicted based on other features of the data.

Denoising, where a corrupted input is converted into the originally correct output.

Density Estimation, where the probability mass function or probability density function that generated the data is explicitly modeled, and where that density can then be used for other tasks.

The Experience, E

There are a number of ways that experience is framed in ML, most often through a dataset of examples. There’s supervised learning, which splits the data up between features and labels and attempts to learn p(y|x). There’s also unsupervised learning, which looks to learn the underlying distribution of a dataset or some interesting properties of that distribution. We could use supervised learning to find the relationship between every feature and the rest of the data, which can be seen as equivalent to learning the joint distribution of the data (with the chain rule, we end up taking the product over every feature conditioned on the others). There are other representations of experience, like reinforcement learning, where an agent takes actions and receives feedback from an environment.

Example: Linear Regression

This runs through a basic supervised regression algorithm. The optimal weights are derived directly by setting the gradient of the mean squared error with respect to the weights to 0 and solving, with the whole operation in linear algebra terms. With optimal weights we have a function for predicting y. The bias term is also explained as being a “bias” in the sense that without any input to the function, this is the value that will be output - it’s different from statistical bias.
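A sketch of the closed-form solution (the normal equations, from setting that gradient to zero) on synthetic data of my own:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data from a known line, plus a little noise.
true_w, true_b = 2.0, -1.0
x = rng.uniform(-1, 1, size=100)
y = true_w * x + true_b + 0.01 * rng.standard_normal(100)

# Append a constant-1 feature so the bias is learned as just another weight.
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: w = (X^T X)^-1 X^T y (solved, not inverted).
w = np.linalg.solve(X.T @ X, X.T @ y)
```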

Capacity, Overfitting and Underfitting

There’s a great representation of over/underfitting here which is framing it as two sources of error - one is an inability to fit the training set (underfitting) and the other is trying to bridge a gap between the training error and the testing error (overfitting).

There’s a representational power that a model will have, based on the probability distributions and functions that it’s capable of modeling. There’s also an effective representational power that comes from what can actually be learned, instead of what’s theoretically possible to learn. This theory is hard to use, especially around deep networks with ambiguous representational power.

There’s also a clean definition of parametric models here. The parametric model has a vector that is learned (the weights, generally) which is fixed in advance of seeing any data. Non-parametric models are more flexible, using an arbitrary number of parameters based on the data (think KNN).

There’s an optimal amount of capacity for a model that determines how flexible it should be for the task. If the model isn’t flexible enough it won’t be able to fit the training data well, and will underfit. As capacity passes the optimal point, training error keeps shrinking but the gap to test error widens, so generalization error plateaus and then rises.

The No Free Lunch Theorem

The No-Free-Lunch theorem says that every algorithm performs equally well if you average them across all data-generating distributions. This is extremely misleading, in my opinion. It’s completely useless. It’s also saying that if you predict that every point belongs to the same class, you have equal coverage on average over all possible probability distributions. But that’s not how the world works. Occam’s razor exists, and some probability distributions are substantially more common than others - the central limit theorem and the normal distribution are a great example of that. The claim is about a world that is not our own; it feels like someone saying that probability theory is broken because infinite sets break some definitions of probability distributions. I should read Wolpert, 1996 to be sure, but it seems super weak.

Regularization

Regularization is a general term for techniques (both implicit and explicit) that reduce the generalization error of an algorithm. The classic example is weight decay in linear regression, where the L2 norm of the weights is added to the loss function; the weight on that term controls the degree of overfitting. Regularization can also be seen as a preference weighting over the type of function that an algorithm will learn.
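Weight decay in linear regression has a closed-form solution, which makes the shrinkage easy to see directly. A minimal sketch - the data and λ values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# larger lambda expresses a stronger preference for small weights
for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```

The λ term is exactly the “preference weighting” over functions: as it grows, the solution is pulled toward the simpler (smaller-norm) linear functions.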

Hyperparameters and Validation Sets


In general, the values of hyperparameters are not adapted by the learning algorithm itself. This is an obvious limitation that should be lifted, and line search is a good example of a way to lift it. You can imagine using metalearning to learn a function that decides which weights it makes the most sense to disconnect, and applying it systematically to build properly connected architectures.

There are hyperparameters of the model that can’t be well estimated during training. Often this is because the optimal value in the training setting is to maximize model complexity, leading to hyperparameter values that don’t generalize well to the test data. For this reason we use a validation set to optimize our hyperparameters. The split of our training data between true training and validation is usually 80/20 (I wonder if it’s possible to meta-learn optimal values for this split over several datasets). The test set must not be used for this validation - test sets that are widely used in the community eventually get overfit to anyway, leading to a misperception of the state of the art in the field.

Cross Validation

When there is not much data, it is expensive to carve a large validation set out of the training data, and if the validation set is too small it will be hard to compare algorithms against one another - there will be too much noise. Cross validation lets us split our training data into folds and fit our model on each split, getting robust validation of hyperparameters over all of our data.
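The fold-splitting step can be sketched in a few lines of numpy (the fold counts and sizes here are arbitrary):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle n indices and yield k (train, validation) index pairs."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# every point lands in exactly one validation fold
n, k = 10, 5
all_val = np.concatenate([val for _, val in k_fold_indices(n, k)])
print(np.sort(all_val))   # each index 0..9 appears exactly once
```

A hyperparameter setting would then be scored by averaging validation error over the k folds, so each point contributes to validation exactly once.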

Estimators, Bias and Variance Point Estimation

Point estimators are any function that maps from a set of input values to a parameter output. The output doesn’t even have to be restricted to the domain of the true parameter, but a good point estimator will try to find a value as close to the real parameter as possible. You can think of estimating the values of a weight vector as a point estimation process, where there is some true (fixed) but unknown set of parameters that generated the data we are trying to model, with the data being random draws governed by those parameters. Function estimators (approximators) are a reframing of point estimators, where we’re trying to find a function (either directly, or indirectly through parameters) that will make a good estimate of the output. Supervised learning consists of function estimation.

Bias

Bias is the difference between an estimator and the underlying parameters of the distribution that generated the data. Unbiased estimators have no difference between the values of the parameters that define the data generating distribution and the expectation over the data (samples from a random variable). I believe that the more intuitive (less couched in technical language) way to view bias is as the difference between the assumption that your estimator makes about how the data will be structured and the actual structure of the function that generated the data.
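The textbook-standard example of bias is the sample variance: dividing by n systematically underestimates the true variance, while dividing by n - 1 is unbiased. A simulation sketch (the distribution and sample sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0
n, trials = 5, 200_000

# many small samples from the same data-generating distribution
samples = rng.normal(0, np.sqrt(true_var), size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()     # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n - 1

print(f"true variance: {true_var}")
print(f"mean of 1/n estimator:     {biased:.3f}")    # ~ (n-1)/n * 4 = 3.2
print(f"mean of 1/(n-1) estimator: {unbiased:.3f}")  # ~ 4.0
```

The expectation of the biased estimator is (n-1)/n times the true variance, so the bias shrinks as n grows - the estimator is biased but consistent.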

Variance and Standard Error

Variance can be thought of as how much an estimator will change with different draws from the same underlying distribution. We can construct confidence intervals to measure algorithms against one another, using the central limit theorem to get bounds for that standard error of a prediction. Both the sample standard deviation and variance underestimate the true standard deviation in the data, though we still are willing to use them.

Trading off Bias and Variance to Minimize Mean Squared Error

We tend to make tradeoffs between estimators that suffer from high bias and those that suffer from high variance, as increased capacity tends to increase variance and decreased capacity tends to increase bias. We can use the mean squared error, which takes both bias and variance into account, to estimate the optimal capacity. We can also use it in concert with cross validation.

Consistency

We want our estimators to be consistent, meaning that as we see more data points our point estimates will converge to the true value of their corresponding parameters.

Maximum Likelihood Estimation

Maximum likelihood is the idea that, over all of our data points, the best set of parameters for our model are the ones that assign the highest possible probability to each datapoint. We can aggregate these probabilities by multiplying them together. Multiplying many probabilities tends to underflow, so we deal with this by taking the log of the probabilities, which turns the product into a sum without changing the argmax.
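The underflow problem is easy to demonstrate directly (the probabilities here are made up for illustration):

```python
import numpy as np

probs = np.full(1000, 0.1)   # 1000 i.i.d. datapoints, each assigned probability 0.1

product = np.prod(probs)                 # 0.1**1000 underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))   # 1000 * log(0.1), perfectly representable

print(product)          # 0.0 - the likelihood itself is numerically useless
print(log_likelihood)   # ~ -2302.6
```

Since log is monotonic, the parameters that maximize the log likelihood are exactly the ones that maximize the likelihood - we lose nothing by working in log space.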

We can think of maximum likelihood as minimizing the dissimilarity between the empirical distribution defined by the training data and the model distribution - that is, minimizing the KL divergence between those two distributions.

The negative log likelihood of the bernoulli or softmax is often referred to as the cross entropy, though cross entropy is a general concept. For example, mean squared error is the cross entropy between the empirical distribution of the data and a gaussian model fit to it.

Conditional Log-Likelihood and Mean Squared Error

We can take the log probability of our outputs (probability of the correct class) as our loss function. And we can generate those probabilities with a bernoulli or softmax.

We can use maximum likelihood to fit linear regression by assuming that the model is predicting the mean of the underlying gaussian. We substitute in a normal distribution for a given variance to represent the conditional probability of the output given the input. This yields a different cost function in terms of value but the same cost function in terms of the optimal point as the mean square error (we also get the exponential from the normal distribution transformed into a term that looks exactly like the squared error).
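This equivalence can be written out. A sketch of the derivation, writing ŷ(x; w) for the model’s predicted mean and treating the variance σ² as fixed:

```latex
\log p(Y \mid X; w)
  = \sum_{i=1}^{m} \log \mathcal{N}\!\left(y^{(i)};\, \hat{y}(x^{(i)}; w),\, \sigma^{2}\right)
  = -\, m \log \sigma \;-\; \frac{m}{2}\,\log(2\pi)
    \;-\; \frac{1}{2\sigma^{2}} \sum_{i=1}^{m} \left( \hat{y}(x^{(i)}; w) - y^{(i)} \right)^{2}
```

Only the final term depends on w, so maximizing the log likelihood and minimizing the mean squared error pick out the same parameters, even though the two cost functions take different values.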

Properties of Maximum Likelihood

Maximum likelihood can yield a consistent estimator (where with increasing data, we approach the true parameters of the data generating distribution) if we know that the data is being generated by the distribution being modeled by the estimator and that the true distribution corresponds to a single value of the parameters, and is not being generated by multiple versions of the distribution.

Maximum likelihood is also provably efficient as the number of examples m grows large, in terms of how quickly it approaches the true parameters.

In general we use maximum likelihood, and if there isn’t enough data to avoid overfitting we use regularization, converting the maximum likelihood cost function into a variant that penalizes complexity.

Side question - where are the opposite regularizers, the penalize simplicity?

Bayesian Statistics

Instead of taking a point estimate of the parameters approximating the data generating distribution, Bayesian methods integrate over the information about the distribution present in the training data, so every region of the distribution over the parameters contributes continuously, in proportion to its probability, to the prediction made. This comes out of a change in assumption: instead of thinking of the underlying parameter as fixed and the estimator as a random variable (which comes out of the randomness in the underlying dataset that the estimator is trained on), we think of the dataset as information to update on. We have to choose a prior distribution over our parameters, which is generally chosen to assume ‘simplicity’, which cashes out as a high entropy distribution - a gaussian or uniform prior over the parameters. Bayesian methods tend to perform much better on small data but become computationally challenging with large datasets. They’re much more difficult to overfit than frequentist methods because they integrate over the distribution over parameters instead of taking a point estimate.

In practice we establish a conditional distribution (gaussian in the case of linear regression) on the probability of y given x. The normal distribution (parameterized by the weights, a mean, and a covariance) is an exponential function that can be solved to obtain a posterior distribution. The bayesian estimate of the covariance matrix gives an estimate for how likely all values of our weight matrix are, rather than providing a single estimate.

Maximum A Posteriori (MAP) Estimation

Maximum A Posteriori estimation consists of taking the maximal point estimate of the posterior probability distribution instead of integrating over that distribution. This has an important advantage: computational tractability. It still includes information from the prior that is not present in maximum likelihood. That prior ends up being weight decay (assuming a gaussian prior for linear regression)! This means that we can think of maximum likelihood with regularization as doing MAP estimation over a posterior distribution, and it gives strong underpinnings to regularization through weight decay. MAP estimation lowers variance at the expense of increased bias for linear regression - this points to the way that bayesian algorithms tend to overfit less, but since we’re taking a point estimate from the distribution we lose the benefit of integrating over the posterior and keep only the benefit of the prior.
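The weight-decay connection can be sketched directly. Assuming a gaussian likelihood with variance σ² and a gaussian prior w ~ N(0, τ²I):

```latex
w_{\text{MAP}}
  = \arg\max_{w} \Bigl[ \log p(y \mid X, w) + \log p(w) \Bigr],
\qquad
\log p(w) = -\frac{\lVert w \rVert_{2}^{2}}{2\tau^{2}} + \text{const}
```

so maximizing the posterior is the same as minimizing the squared error plus λ‖w‖² with λ = σ²/τ² - exactly L2 weight decay, with the regularization strength set by how confident the prior is relative to the noise.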

Supervised Learning Algorithms Probabilistic Supervised Learning

We can think of learning algorithms in terms of conditional probability distributions. Logistic regression can be thought of as linear regression with the output squashed between 0 and 1 by the logistic sigmoid. We then fit the weights with maximum likelihood, minimizing the negative log likelihood with a gradient descent method.
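The whole pipeline - sigmoid-squashed linear model, negative log likelihood, gradient descent - fits in a short numpy sketch (the toy dataset and learning rate are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the mean negative log-likelihood with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad
    return w

# linearly separable toy data: label is 1 when x0 + x1 > 0
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

The gradient `X.T @ (p - y)` has the same pleasant form as in linear regression - residuals times inputs - which is one reason the sigmoid-plus-NLL pairing is so standard.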

Support Vector Machines

Support vector machines use a kernel to transform the input before learning a linear relationship between the transformed input data and the output. The Gaussian kernel (or Radial Basis Function kernel) is a popular example. It can be expensive to evaluate kernels on large datasets. The gaussian kernel does a kind of “template matching”, where it evaluates the distance between two datapoints as part of a weighting for the predictions. We tend to train SVMs to learn sparse evaluation functions, where most data points do not contribute to the predictions - only a few that are dubbed “support vectors”. The kernel trick consists of doing the dot product first, and then applying a transformation to that scalar. There are a surprising number of ways to compute a complex transformation extremely efficiently by taking the inner product first, and so we can use this to efficiently create an interesting feature space.
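The trick is easiest to see with the degree-2 polynomial kernel, where the scalar-first computation provably matches an explicit feature map (the vectors here are made up for illustration):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input: all products x_i * x_j."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

def poly_kernel(x, z):
    """Kernel trick: take the inner product first, then square the scalar."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))    # inner product in the explicit 4-d feature space
print(poly_kernel(x, z))  # same value, without materializing the features
```

For d-dimensional inputs the explicit map has d² coordinates while the kernel costs only one d-dimensional dot product and a square - that gap is what makes kernel methods feasible at all, though each prediction still touches the training points.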

Other Simple Supervised Learning Algorithms

Coverage of KNN consisted of pointing out that it can approach the Bayes error rate with infinite training data, but performs poorly on sparse data and can’t deal with noise (even if it has one perfect feature, that feature gets drowned out by many less important ones). There’s effectively no training; it’s just a set of lookups on the training data at test time. Decision trees - so much hate! Also, Leo Breiman strikes again. They talk about how a tree can’t learn simple decision boundaries that aren’t axis aligned without stepping back and forth. They don’t mention that it can automatically learn non-linear interactions, that it doesn’t require preprocessing of the input to deal intelligently with missing values, or that it automatically ignores features that aren’t predictive. Pretty biased coverage, to be honest.

Unsupervised Learning Algorithms

They frame this as algorithms that do 4 things - finding a manifold that closely matches the data (I hope they include dimensionality reduction here), finding clusters of data points by some definition of cluster, finding densities, and denoising data. (No generative models here?) They talk about the types of behavior expected from an algorithm that’s trying to learn a representation of the data. The simplest of these is compression of the data to a lower dimensional representation that maintains information. We also have sparse representations, which seek to create a higher dimensional representation that’s mostly zeros, and independent representations, which try to separate out the dimensions so that they’re statistically independent from one another.

Principal Components Analysis

Note - Since Decision trees only learn axis-aligned cuts, doing PCA to create axes of high variance and then running the tree on that representation may be a great way to avoid this issue. Any projection of the data onto axes should be extremely helpful.

PCA can be seen as both reducing the dimensionality and disentangling the axes so that they are linearly uncorrelated. Ideally a dimensionality reduction algorithm would also separate out the non-linear relationships in the data.

They show that the covariance matrix of the result of principal component analysis is diagonal. What this implies is that the correlation is 0 between every axis, showing that the data is successfully disentangled (independent columns) by PCA.
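The diagonal covariance is easy to verify numerically. A sketch using eigendecomposition of the covariance matrix (the correlated toy data is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# correlated 2-d data: mix independent gaussians through a fixed matrix
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])
X -= X.mean(axis=0)

# PCA: eigenvectors of the covariance matrix are the principal axes
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
Z = X @ eigvecs                     # project onto the principal axes

cov_z = Z.T @ Z / len(Z)
print(np.round(cov_z, 6))           # off-diagonal entries are ~0
```

The projected covariance is `V.T @ cov @ V`, which is exactly the diagonal matrix of eigenvalues - so the zero correlations are guaranteed by construction, not a property of this particular dataset.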

K-means Clustering

One-hot encodings are sparse, so most categorical data is sparse - or at least can easily be represented as sparse data.

K-means clustering is presented as an example of creating a sparse representation, where each datapoint belongs to exactly one cluster, represented as a one-hot encoding. One-hot encoding is an extreme example of a sparse representation, and it loses a lot of the value that comes with sparsity. The k-means problem isn’t well posed - the algorithm will reach some convergence, but it won’t be clear whether the converged values are the best ones. It’s also inflexible, and can’t really assign a point to multiple clusters. We can minimize the reconstruction error by minimizing the overall distance to cluster centers, but this doesn’t mean that we’ve found the clustering that best corresponds to the problem we’re looking to solve. Ideally we’d find a distributed representation that can capture multiple aspects of the data - each attribute of a dataset could be a cluster, and the clustering would be an assignment from data points to attributes. E.g., a dataset of red trucks, red cars, blue trucks, and blue cars could be separated along redness/blueness as well as truck/car.
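The alternating assign-then-average loop (Lloyd’s algorithm) is short enough to write out. A deterministic sketch - the blob data and the choice to seed one initial center in each blob are invented to keep the illustration stable:

```python
import numpy as np

def k_means(X, centers, iters=20):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)       # hard, one-hot-style assignment
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# two well-separated blobs; seed one initial center in each blob
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = k_means(X, X[[0, 50]])
print(np.round(centers))   # approximately [[0. 0.], [5. 5.]]
```

The hard argmin assignment is exactly the one-hot representation the notes complain about - a point near the boundary between two clusters still gets assigned entirely to one of them.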

Stochastic Gradient Descent

SGD is gradient descent with small batch sizes, from 1 to a few hundred examples. The gradient can be approximated reasonably well from these batches, and the cost of a single update stays constant as the training set grows. SGD is valuable for training large linear regression models, where the size of the data is prohibitive for other methods. Before neural networks, training algorithms that learn nonlinear structure involved training a kernel machine on the data, which didn’t scale to large datasets because the complexity was O(m²), quadratic in the size of the dataset. SGD doesn’t have this limitation.
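A minibatch SGD loop for linear regression can be sketched in a few lines (the batch size, learning rate, and toy data are invented for illustration):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch=32, epochs=100, seed=0):
    """Minibatch SGD on mean squared error for a linear model."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            b = idx[start:start + batch]
            # gradient of the batch MSE - its cost depends on batch size, not m
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.01, 1000)
w = sgd_linear_regression(X, y)
print(np.round(w, 2))   # close to [1.0, -2.0, 0.5]
```

Each update touches only `batch` rows of X, which is the constant-per-update cost the notes describe - the same loop works unchanged whether m is a thousand or a billion.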

Building a Machine Learning Algorithm

Emphasis here on their model of how machine learning algorithms should be viewed in the abstract - as combinations of a dataset, a model, a cost function and an optimizer. Interesting that they also claim that unsupervised methods can fall under this umbrella; apparently k-means can be framed within this paradigm too. The upside to thinking about ML this way is that we get a nice set of principles for thinking about learning algorithms. They fall into a structure instead of being a grab-bag of techniques. Interesting how this re-framing doesn’t change the reality, just the way we think about it. Maybe the structure was there in reality the whole time and we didn’t see it; or maybe there was never very much structure and we looked until we found something that approximately worked. The framing is at a pretty low level for cost functions and a pretty high level for models and optimizers. A model could be a functional form, but it could also be an architecture, or a method.

Challenges Motivating Deep Learning

There has been a failure of machine learning to solve some central problems in AI, such as recognizing objects and recognizing speech. Deep Learning has performed uniquely well on these tasks. There are also major challenges around generalizing with a learning algorithm in a high dimensional space. Typically this has been computationally intractable.

The Curse of Dimensionality

As the dimensionality increases, data points are increasingly far away from one another. It becomes extremely important to be able to generalize between data points, actually learning the structured relationships between features. Algorithms that rely on local similarity, like KNN (via distance) or decision trees (via axis-aligned partitions), will just say that the output should match that of the nearest training points.

Local Constancy and Smoothness Regularization

There are priors that we take into account with these algorithms - simple assumptions about structure, like the assumption that data points close to one another will have similar outputs and that the function predicting those outputs shouldn’t vary much over a small region. And so smoothness regularization makes sense. The argument that it’s insufficient comes from the fact that there’s an exponential number of possible regions in a high dimensional problem, and in order to fit this large space many methods need a datapoint in each region, so that when classifying a new datapoint there are data points to compare with. There’s an exponential growth in the number of data points necessary to fit a high dimensional space. One great advantage of a deep representation is that it allows the complexity of the model to scale in a structured way that’s not dependent on particular points but capable of learning the relationships between them. This lets us generalize non-locally, between large spans of space. Even if the function is complicated and behaves differently in different regions, a distributed representation can still learn it.

Manifold Learning

There are strong arguments that most data lies on a low dimensional manifold in the high dimensional space we encounter it in. It’s clear that the probability density over data points is condensed, where most images are in a vanishingly small part of all of possible image space and most natural language is in a similarly small part of possible configurations of the alphabet or dictionary. The really strong argument for a manifold is that there is continuity between data, where the space of all possible images has a set of manifolds for each type of object, and small transformations to the contortions and locations of these objects correspond to movement along their manifold. And most of the images that we see in life are a series of variations on a single experience, instead of experiences that are random or out of the blue. Oh, what a powerful idea.

Modern Practical Deep Networks

The split from this text is between industry grade, practical algorithms and algorithms with promise that haven’t made it to widespread adoption as yet. The progression will be from the generic model (deep feedforward networks) to training details (regularization and optimization) to major workhorses on image and sequence data (CNNs and RNNs) to practical methodology for training and applications.

Deep Feedforward Network

The presentation here is feature learning. We have ways to manually engineer nonlinear features, and we have kernel functions that act as nonlinear transformations. Framed as a generalization from linear models, which are limited in capacity, networks learn a composition of hidden functions that constitutes an adaptive nonlinear transformation of the data, followed by a standard linear function. There’s some neural inspiration in the architecture - a network of connected units, with vector-to-vector transformations passed through activation functions. But the linear/nonlinear dichotomy is mathematical, and the optimization of the network is not particularly neurally inspired, so we’re not trying to stick to the brain so much as do whatever works best.

Example: Learning XOR

The presentation here is representation learning. They show how the input space is contorted to be made linearly separable by a nonlinear transformation. It’s also a great step-by-step explicit display of the transformations being done and the input/output at each transformation.

The Relu activation function is nearly linear (it’s piecewise linear), leading to a nearly convex (and so easier to descend on) optimization problem.

Gradient-Based Learning

Question - Are there network architectures where we know what the global minimum is, where we can compare the results from our non-convex optimizers to the true minimum and see how close the local minima we tend to get is to true minimum?

Modern neural networks have a non-convex loss function that cannot be optimized directly in a way that has guarantees around the minimum value. Instead, gradient based methods (SGD) are used. SGD is also used for extremely large SVM and Linear Regression problems. Improvements to SGD are important to NNs and will be covered. The initialization is important to the optimization process - it’s usually small values around zero for the weights, there will be more detail hereafter.

Cost Functions

There are two primary approaches to cost functions - fitting with maximum likelihood, and fitting a statistic of y given x rather than the entire probability distribution over y (hierarchical softmax? I suppose I’ll find out.). There is also generally a regularization term appended to the cost function.

Learning Conditional Distributions with Maximum Likelihood

Our loss function is going to be the cross-entropy loss, or the expectation of the log probability of the labels given the data. The log helps numerically, avoiding the saturation of the gradient - the signals shouldn’t become too weak to get propagated through the network. This hints at a useful intervention point as well, if an equivalent loss function that saturates the gradient barely at all can be found. The great thing about maximum likelihood is that it gives us a generic formulation for deriving a cost function. Anytime we’re searching for a conditional probability distribution p(y|x), there will be this immediate derivation to use the log probability as the loss. The cross-entropy loss has the unusual property of not having a minimum - it will only get arbitrarily close to its minimum. The formulation of the softmax / sigmoid only allows for extremely high or low probability of a class, not 1 or 0. If you try to learn the variance parameter of the gaussian in addition to the mean, (that is, controlling the density of the output distribution) then it’s possible to assign arbitrarily high density to the correct training outputs. This sends the loss to negative infinity. You would need to use regularization to avoid this problem.

Learning Conditional Statistics

This looks like using the mean squared error or the mean absolute error to compute the cost, instead of the cross entropy. There are methods (calculus of variations) that we can use to derive gradients for these conditional statistics, but they’ve remained less popular than cross entropy because output units tend to saturate when combined with these cost functions.

Output Units

Idea - could we use the softmax as an activation function? And for each neuron in the next layer have a number of inputs equal to the number of classes?

Any output unit we have could be used as an activation function. The output unit is very closely tied to the loss function. It converts the output from a hidden layer into an output that can be scored.

Linear Units for Gaussian Output Distributions

This unit multiplies the input from the hidden layer by a set of weights and adds a bias. It can be used to learn the mean of a conditional gaussian. A linear unit doesn’t saturate gradients, and so it is fairly safe to use within any neural network type.

Sigmoid Units for Bernoulli Output Distributions

Sigmoid units form a Bernoulli distribution where there is a single probability being predicted - p(y = 1 | x). We can imagine trying to have a linear function be used for binary classification, where we threshold the output at some point. The downside to this is that there will be ranges where the gradient saturates to 0 and learning stops.

It’s best to train the sigmoid with maximum likelihood. This keeps the gradients in a range that is unlikely to saturate the activation. The maximum likelihood fit leads to the loss function being a function of the softplus.

The exponential in the sigmoid is undone by the log of the likelihood function, keeping the size of the gradient in check (we likely see this behavior from the softmax as well).

When we use loss functions that are not maximum likelihood (say mean squared error), the loss will saturate anytime the sigmoid saturates and learning will crawl to a halt.
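The softplus form of the maximum likelihood loss is also what makes it numerically safe. A sketch - the logit values are extreme on purpose, to show the saturation regime:

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)), computed stably for large |z|."""
    return np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z)))

def sigmoid_nll(z, y):
    """Negative log-likelihood of label y in {0, 1} given logit z.
    -log(sigmoid(z)) for y = 1 and -log(1 - sigmoid(z)) for y = 0
    both reduce to softplus((1 - 2y) * z)."""
    return softplus((1 - 2 * y) * z)

# a naive -np.log(sigmoid(z)) returns inf once sigmoid(z) underflows to 0
# (around z = -800); the softplus form stays finite and nearly linear in z
print(sigmoid_nll(-800.0, 1))   # 800.0 - confidently wrong: huge but finite loss
print(sigmoid_nll(800.0, 1))    # 0.0 (to float precision) - confidently right
```

The loss being nearly linear in the logit when the prediction is badly wrong is exactly the property the text describes: the gradient stays around ±1 instead of vanishing, so a saturated-but-wrong unit can still learn.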

Softmax Units for Multinoulli Output Distributions

The softmax has the nice property of penalizing the most incorrect points heavily and penalizing points that are already correctly classified relatively little, so that the algorithm can focus on classifying incorrect points correctly. The log in the cost function (assuming cross entropy) undoes the exponential in the softmax, ensuring that even if the values are small or large the softmax will not saturate learning. This is not true without a log in the cost function - training a softmax with squared error will lead to a vanishing gradient when a softmax output becomes too small, because the exponential turns strongly negative inputs into near-zero outputs with a tiny gradient.

The softmax is made stable by subtracting the maximum element of the vector, so that the values fed to the exponent lie in (-inf, 0]. This guarantees that the value doesn’t overflow. The softmax is still capable of saturating - when the difference between the largest input and the rest is very large, changes in the input won’t lead to a large change in the softmax output. If the softmax saturates, its corresponding loss function can also saturate.
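The max-subtraction trick fits in three lines (the extreme logits below are invented to trigger the overflow):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max so all exponents are <= 0."""
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])

# the naive version overflows: exp(1000) = inf, so the result is all nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()
print(naive)

print(softmax(z))   # well-defined, roughly [0.090, 0.245, 0.665]
```

The shift is exact, not an approximation: softmax(z) = softmax(z - c) for any constant c, because the factor exp(-c) cancels between numerator and denominator.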

Other Output Types

We have a nice generic formulation for any new cost function in terms of maximum likelihood. Another output type is one where the variance of a conditional gaussian is learned in addition to the mean. This flexibility allows for a heteroskedastic model, where the variance of predictions changes with the input data. The gaussian has to be formulated using precision rather than variance - in the multivariate case we use a diagonal precision matrix diag(beta). This works well with SGD because the formula of the gaussian changes only by multiplying by beta and adding a log(beta), which are well behaved transformations. The transformations that keep the gradient well behaved are addition, multiplication and log; division gets arbitrarily large near 0 and so can be unstable, and squared terms tend to vanish near 0, so the gradient is much more poorly behaved over those terms. Gaussian mixture outputs lead to neural networks that can predict multiple plausible values of y for a given x. Dubbed mixture density networks, these can produce spikes in gradients due to the division by the variance, which can be dealt with using gradient clipping or scaling the gradients heuristically.

Hidden Units

This space is unique to feedforward neural network training. It has little theoretical grounding, and it is often impossible to predict what will perform well; you may have to try multiple hidden units and compare performance on a validation set to get the best possible output. Some hidden units are not differentiable everywhere. This is okay, as most software just returns the left or right derivative (in the case of a relu at 0: 0 or 1) when we’re at the non-differentiable point. This can be seen as okay because it’s unlikely that we’re actually exactly at 0 anyway - due to numerical imprecision, we were just rounded to 0.

Rectified Linear Units and Their Generalizations

The relu computes max{0, z} where z is the output from an affine transformation. There are interesting alterations. The absolute value relu takes |z| and can work well for image recognition.

The maxout unit takes the maximum over k affine functions of its input, effectively learning its own piecewise linear activation. It is capable of approximating the leaky relu and the absolute value relu.

The leaky relu, instead of outputting 0 for z < 0, uses a small nonzero slope there, so its output is slightly below 0 rather than flat.
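These variants are all one-liners, and maxout really does contain leaky relu as a special case. A sketch (the example slope and pieces are chosen for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def abs_relu(z):                  # absolute value rectification: |z|
    return np.abs(z)

def leaky_relu(z, alpha=0.01):    # small slope alpha instead of 0 for z < 0
    return np.where(z > 0, z, alpha * z)

def maxout(z, w, b):
    """Maxout: the max over k affine functions of the input
    (columns of w and b are the k pieces)."""
    return np.max(z @ w + b, axis=-1)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), abs_relu(z), leaky_relu(z))

# maxout with the two pieces (z, 0.01 * z) recovers the leaky relu exactly
w = np.array([[1.0, 0.01]])
print(maxout(z[:, None], w, np.zeros(2)))
```

Similarly, pieces (z, -z) recover the absolute value relu, and (z, 0) recover the plain relu - which is why maxout is described as learning its own activation shape.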

Logistic Sigmoid and Hyperbolic Tangent

Question - In what sense is the softmax a generalization of the sigmoid? Can we adapt it to make it more like a generalization of the tanh?

Question - Can we detect and resist saturation of units during training? “Soften” the sigmoid / tanh operation, make it closer to linear? Adapt its shape, either in a concrete way or by learning its shape dynamically?

Sigmoid units saturate at the high and low ends of their range, which makes gradient-based training difficult; many units may effectively die, or be very hard to update. Tanh units are 0-centered, meaning that near 0 they resemble the identity transformation, which makes the network much closer to linear. If tanh starts out near 0, we get a set of linear transformations in the affine transforms and only deviate to nonlinear in ways that are useful. Sigmoids are used in cases where we can’t have piecewise activations (like the relu), such as autoencoders and some RNNs.

Other Hidden Units

Many standard transformations that are not commonly used actually work well. The authors got <1% error on MNIST with a cos activation, for example; usually we only get excited when an activation outperforms. The softmax can be used as an activation, especially when you want to internally learn to approximate a probability distribution, and is commonly used to manipulate memory. The RBF unit is a similarity/distance metric between the input to the layer and the values of the feature that the unit is being applied to; it is difficult to train because it takes the exponential of this distance, and so saturates to 0 over most of its range (why don’t we fix this?). The softplus (a softer relu) looks like it should do well under our model of how these units work - it saturates less completely than a relu and is differentiable everywhere - but it performs worse than a relu in practice (Hinton did some experiments), which makes me think that our model of why units perform well or poorly may be somewhat broken in general. Or maybe the rectifier has a property that we’re not considering that makes it perform well, and the rest of the model can stay intact; perhaps the softplus isn’t linear enough. The hard tanh creates hard bounds, taking max(-1, min(1, a)) - it’s -1 if the input is less than -1, equal to the input if -1 < a < 1, and 1 if a > 1. Linear activations can be used to reduce the number of parameters in a network: a linear hidden layer is equivalent to factoring the weight matrix into two matrices, turning a network with an extra layer into one with one less layer and a factored weight matrix. This can also be thought of as having no activation at all, so some portions of the network can be purely linear.

Architecture Design

Most networks are organized into groups of units called layers. The architecture (in feedforward networks) is primarily the width and depth of the network: the number of hidden units per layer and the number of layers. It can be harder to train a deep network, but you can often get a better fit to the training data with fewer parameters than with a single large hidden layer. Question - how easily can we metalearn these hyperparameters, or at least warm start a search? Question - can we use RL to learn architectures? Question - are there ways other than regularization to make architectures adaptive? Question - are there architecture datasets filled with the training processes of various architectures to study?

Universal Approximation Properties and Depth

With arbitrarily large hidden layers, a single layer network can approximate any continuous function. This does not guarantee learnability or generalization, just that there exists such a network. Representation ability is exponential in depth, meaning that we can represent families of complex functions with exponentially fewer parameters by adding depth to a network. We can think of the choice of a neural network for a problem as a prior belief that a composition of transformations will represent the problem we're working on well. There is research on which families of functions can and cannot be represented well at particular levels of depth. These families of functions are a great source of training data constructed to represent particular types of relationships. Universal approximation has been proven for ReLUs as well as squashing units (sigmoid/tanh/sin/cos). Deeper models can perform much better on vision tasks, and not just because there are more parameters and the model is larger. Using a deep model expresses a useful preference over the space of functions the model can learn.

Other Architectural Considerations

There are many architectural designs for a more diverse set of networks. Convolutional networks involve convolutional and pooling layer types. Recurrent neural networks encode recurrent connections. And many networks are not fully connected, using sparser connections or skip connections to save computation or to provide gradient paths closer to the input. Most of these specialised networks correspond to particular problem domains - for example, sequence learning over language with RNNs.

Backpropagation and Other Differentiation Algorithms

In forward propagation, information moves through the network until a cost is computed. In Backpropagation, information from that cost is propagated back through the network to obtain a gradient. Then, some other algorithm, say SGD, takes that gradient and modifies the weights of the network appropriately. The reason backprop is valuable is that it’s often easy to analytically state the gradient of weights of a network but difficult to compute them. Backprop gives us an efficient way to compute these gradients with multiple dependencies. Backprop is very general. It can be used to compute gradients that are unconnected to cost functions, for example the Jacobian of a function with multiple outputs. We usually use it to compute the gradient with respect to some parameters so that we can update those parameters.

Computational Graphs

Computational graphs are models of a set of operations that interact with one another. There are objects and transformations, and the objects are the nodes. They’re combined with directed edges that are annotated with the associated transformation between variables.

Chain Rule of Calculus

The chain rule is a method for finding the derivative/gradient of composed functions, by multiplying the corresponding relevant partial derivatives. We can generalize the chain rule to vectors and tensors; in the generalization we multiply Jacobians (which come out of the many-to-many (vector/vector, tensor/vector, tensor/tensor) functions) with gradients (which come out of the many-to-one (vector/scalar, tensor/scalar) functions).
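A tiny worked example with toy functions of my own choosing - the transposed Jacobian from the many-to-many function multiplies the gradient from the many-to-one function:

```python
import numpy as np

# h(x) = g(f(x)): f is many-to-many (R^2 -> R^2), g is many-to-one (R^2 -> R).
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def jacobian_f(x):
    # rows are outputs of f, columns are inputs of f
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

def grad_g(y):
    # g(y) = y0^2 + 3*y1, so grad_g = [2*y0, 3]
    return np.array([2.0 * y[0], 3.0])

def grad_h(x):
    # generalized chain rule: grad_x h = J_f(x)^T @ grad_g(f(x))
    return jacobian_f(x).T @ grad_g(f(x))
```

Expanding h by hand gives h(x) = (x0 x1)^2 + 3(x0 + x1), whose analytic gradient agrees with the chain-rule product.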

Recursively Applying the Chain Rule to Obtain Backprop

Backpropagation can be applied to the computational graph by computing subexpressions once and reusing them. The forward pass over the graph computes the composed output value (in this case a scalar). This first algorithm is for all-scalar computations, a simplified case.

Back-Propagation Computation in Fully-Connected MLP

Much cleaner explanations here. It may just be that the scalar-based notation on a computational graph was extremely foreign and I didn't know what objects they were referring to. The idea here is you compute your initial gradient from the cost function, and then go into a loop over your layers. First you step through the gradient of the activation, using the elementwise derivative of the activation. With that gradient you can compute the bias gradient (just the gradient you have) and the weights gradient (multiply your gradient by the transposed input coming into the layer from the previous layer). Save those gradient updates, and prepare your gradient for the next layer by multiplying it by the transposed weights at that layer.

I should really try coding up MLP backprop and adding it to oracle. I’m so close with the neural network that I already write regularly.
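Here's roughly what that loop looks like in numpy - a sketch under my own assumptions (sigmoid hidden units, linear output, squared error loss), not the book's pseudocode:

```python
import numpy as np

def mlp_backprop(x, y, Ws, bs):
    """Backprop for an MLP: sigmoid hiddens, linear output, loss 0.5*||out - y||^2.
    Ws/bs are lists of weight matrices and bias vectors, input to output."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # forward pass, caching each layer's input and activation
    h, inputs, acts = x, [], []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        inputs.append(h)
        a = W @ h + b
        h = a if i == len(Ws) - 1 else sigmoid(a)
        acts.append(h)
    # initial gradient from the cost: d(0.5*||out - y||^2)/d out = out - y
    g = acts[-1] - y
    grads_W, grads_b = [], []
    for i in reversed(range(len(Ws))):
        if i != len(Ws) - 1:
            # step through the activation with its elementwise derivative
            g = g * acts[i] * (1.0 - acts[i])
        grads_b.insert(0, g.copy())                 # bias gradient is just g
        grads_W.insert(0, np.outer(g, inputs[i]))   # times the layer's (transposed) input
        g = Ws[i].T @ g                             # prepare gradient for the next layer down
    return grads_W, grads_b
```

A finite-difference check on a random small network is a good way to convince yourself the loop is right.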

Symbol-to-Symbol Derivatives

There’s Symbol-to-Number backprop which computes the gradient as a set of numbers for each layer and Symbol-to-Symbol backprop which expresses the gradient as its own computational graph. Torch and Caffe use Symbol-to-Number backprop, while Tensorflow and Theano use Symbol-to-Symbol backprop. Symbol-to-Symbol backprop actually exposes the computational graph, and can be thought of as subsuming the Symbol-to-Number approach within it. Exposing the gradient graph makes it easy to compute second derivatives as well. The description of the derivative is symbolic, in that no computation is done - a graph for computing the derivative is created, and a graph evaluation engine can come along and execute the operations called for by the symbolic graph.

General Back-Propagation

Backprop can be thought of as computing a Jacobian-gradient product through to the input of the graph. When multiple paths supply gradients to a single node, those gradients need to be added together, a la the chain rule. When supplied with a gradient with respect to its output, each operation needs to know how to map that back to gradients on each of its inputs.

Example: Back-Propagation for MLP Training

This describes the process of running backprop for a single layer MLP. The cross entropy loss produces a gradient. Gradient updates need to be found for two values - the weights that multiply the input and the weights that multiply the hidden layer to produce an output layer.

The update of the latter is going to be the gradient through the cross entropy loss multiplied by the input to that layer - in this case, the output of the hidden layer. The update to the first weight layer depends on the gradient through the cross entropy, multiplied back through the second set of weights. That is passed through the elementwise derivative of the activation at the hidden layer, which in this case is a ReLU gradient that zeros out any values less than zero. The gradient through the activation is multiplied by the input to that layer (in this case, the input data) to get the update to the lower set of weights.

Complications

The naive implementation would build up a number of large tensors; instead, gradients need to be accumulated as they're computed so that they don't take up too much memory. Real implementations need to deal with 32- and 64-bit floats and with ints. Users may request a gradient at a point where it is undefined, which needs to be dealt with. Many operations need to return more than one tensor at a time.

Differentiation outside the Deep Learning Community

Automatic differentiation exists. Reverse mode accumulation is the general formulation of the backprop problem. Finding the optimal sequence of computations for backprop is NP-complete. Forward mode accumulation is more expensive for backprop generally but can be valuable for computing gradients of other structures.

Higher-Order Derivatives

Computing the Hessian is usually infeasible in deep learning because its size would be in the billions. Instead, we compute Hessian-vector products: take the gradient of the function with respect to x, dot it with the vector v, and take the gradient of that scalar with respect to x again - the result is the Hessian times v, without ever forming the Hessian.
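A sketch of the idea - frameworks get the Hessian-vector product by differentiating twice symbolically; without autodiff, the same quantity can be approximated with centered finite differences of a user-supplied gradient function:

```python
import numpy as np

def hvp(grad_f, x, v, eps=1e-5):
    """Approximate H(x) @ v without forming the Hessian.
    grad_f(x) returns the gradient of a scalar function f at x."""
    # directional finite difference of the gradient along v
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)
```

For f(x) = 0.5 xᵀAx with symmetric A, the gradient is Ax and the Hessian-vector product should come out as Av.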

Historical Notes

Backprop is sourced from the chain rule, which goes back to Leibniz and L'Hopital, and was rediscovered multiple times. Minsky pointed out that linear models couldn't learn XOR, which created a backlash against the neural network approach. Neural networks were very popular until they were outperformed by other methods in the late 90s and early 00s. There is a new renaissance now.

Regularization for Deep Learning

Question - I write a regression network and normalize my labels, so that they're between 0 and 1 and in reach. Then I take off that normalization and find that my error drops (somewhat dramatically). How can I find a normalization that destroys less information / makes less of a distributional assumption?

The reason there aren’t ‘reverse regularizers’ (techniques that increase variance / reduce bias) is that the techniques that do do that (feature engineering, adding more parameters to a network) are thought of as part of the model / preprocessing. We have techniques for it, we just don’t name them regularization. Regularization is literally defined as that which helps the generalization error w/out helping the training error - additional flexibility would have a different name, and would also decrease the training error and so not be called regularization.

Regularization builds on concepts we’ve covered before. Bias. Variance. Overfitting. Underfitting. Training/test error. Generalization.

Often, regularization takes the form of a restriction on our parameters. For example, a penalization on weight size in our objective function. This amounts to encoding a prior preference for simpler, smoother functions into our model. Regularization also can take the form of ensemble methods, incorporating multiple hypotheses about the training data into a prediction.

Fundamentally, we’re searching for a strong model of the original data generating process. Theoretically, we can envision three scenarios on a performance continuum. One in which our model does not contain the data generating process, or a close approximation. A second in which our model does contain the data generating process or a close approximation. A third in which our model does not contain the data generating process, or a close approximation, but it is only one of many hypothetical possibilities. The goal of regularization is to get from the third domain closer to the second domain.

Our networks are often used in systems where they almost certainly do not literally contain a hypothesis for the data generating process - that would mean, in the case of images or speech, literally simulating the universe. There are so many interacting portions of that process that it would be difficult to even say what the process is. But the hope is that they'll be flexible enough to approximate it.

Due to the complexity of the world, controlling the complexity of the model doesn’t just mean finding the appropriate relationships and limiting your model to learning those relationships. Often a much more effective approach in practice is to create a large, extremely flexible model and regularize it appropriately.

Parameter Norm Penalties

This has the effect of penalizing the weight values, so the network optimizes both for weight values that produce correct labels and for doing so with relatively small values. The penalty is multiplied by a coefficient between 0 and infinity that controls the regularization strength.

We typically don’t regularize the bias terms - those terms contribute little to overfitting, and we can run into real underfitting problems if we do regularize them. That means that there are more parameters in the model than we typically apply regularization to. In the text, theta will refer to all of the parameters in the model (weights and bias) where w will refer to just the wights.

It can be desirable to use a different penalty on the weights for each layer of the network. But it is expensive to search for the right parameters for each layer (an exponentially large space), and so it can also be reasonable just to use the same weight decay at all layers.

L2 Parameter Regularization

In the gradient, L2 regularization cashes out as subtracting a proportion of the weights (the proportion being the learning rate times the regularization strength) from the weights at each iteration of descent.
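That update in code - a one-line sketch, with hypothetical lr and weight_decay values:

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.1, weight_decay=0.01):
    # the L2 penalty 0.5 * wd * ||w||^2 adds wd * w to the gradient,
    # which is the same as shrinking the weights by lr * wd each step
    return w - lr * (grad + weight_decay * w)
```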

We can relate L2 regularization to the impact that it has on directions in gradient descent. When the curvature in a direction is small (the eigenvalue of the Hessian in that direction is small), the regularization dominates and the corresponding component of the weights is decayed away; only along directions with large eigenvalues do the weights survive relatively intact. Because of regularization, the weights used in the next gradient computation will differ meaningfully only along the high-curvature axes. In linear regression, the impact of L2 regularization on the normal equations is to add the identity times the regularization strength to XTX. The diagonal of this matrix represents the variance of each feature, and so regularization makes it look like the data has higher variance than it actually does, which shrinks the weights the optimizer settles on.

L1 Regularization

In the gradient, L1 regularization isn't linearly proportional to the weights. It's proportional to a constant term, and in the subgradient amounts to shrinking the weights towards zero by an amount equal to the learning rate times the regularization strength times the sign of the weight in question. Question - won't the regularization 'overshoot' zero? Why would it end up being the max of zero and the ending value of the descent step? Maybe this is why we have dedicated optimizers for lasso, or why the plain descent method is looked down upon!
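My reading of the answer to the overshoot question - dedicated lasso optimizers use a proximal (soft-thresholding) step, which clamps at zero rather than overshooting past it (this is my gloss, not the book's):

```python
import numpy as np

def soft_threshold(w, t):
    # proximal step for the L1 penalty: shrink each weight toward zero
    # by t, stopping at zero instead of crossing it
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
```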

The lasso regularization term creates sparse solutions. This is fundamentally different behavior than L2 regularization. Parameters often end up at zero. In the linear case, this implies that the features behind those weights can safely be thrown out, turning this regularizer into a feature selection mechanism.

We can also think of the L1 term as a prior over the weights. Just as the L2 norm corresponds to a Gaussian prior over the weights, the L1 penalty corresponds to MAP Bayesian inference under an isotropic Laplace prior over the weights.

Norm Penalties as Constrained Optimization

We can think of these regularization techniques that penalize the weights as establishing a constraint on the region where solutions are found. We can use the KKT conditions to imagine a constraint k that corresponds to some regularization strength alpha. We can optimize a constrained problem by using SGD to take a gradient step and then projecting our weights back onto the constrained region. This imposes stability on the optimization procedure, and prevents a feedback loop where large gradients create large changes to the weights and increase the weights disproportionately given the regularization. Penalties also have an issue where they create local minima at small weights, causing units to die in the network (likely why L1 regularization is frowned upon in NNs).

Hinton suggests that the constraint be applied to the column of each weight matrix (corresponding to each hidden unit) instead of to the frobenius norm over the entire matrix.

Regularization and Under-Constrained Problems

If a data matrix is singular (in the form XTX) then regularization (adding a scalar multiple of the identity) will make the system solvable. This is effectively what the Moore-Penrose pseudoinverse does. In logistic regression, if the data is linearly separable then a gradient descent algorithm will get lower error with scalar multiples of the weights; it will go on increasing the weight sizes until they overflow. Regularization caps the weights at the point where the magnitudes of the regularization gradient and the error gradient are equal.

Dataset Augmentation

A very useful technique for increasing the generalization accuracy of a model is to augment the data, training on variants of the training data. For example, given an image dataset you can transform the data in ways you want your algorithm to be robust to - translating the image, slight rotations, adding noise, etc. Be careful to only compare algorithms trained on the same datasets. Augmentation can lead to substantially better generalization performance. Adding gaussian noise can improve generalization error - we’ll be introduced to dropout which can be seen as multiplying by noise. You can also inject noise at each hidden layer - this can be seen as adding noise at each layer of abstraction, introducing invariance in the hierarchy. Dataset augmentation has been particularly effective for object recognition, and is also effective for speech recognition.
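A minimal augmentation sketch - the choice of transforms (a small random translation plus Gaussian noise) and the parameter values are my own assumptions, not a recipe from the book:

```python
import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    """Return one randomly perturbed variant of a 2D image array.
    rng is a numpy Generator (e.g. np.random.default_rng(0))."""
    # random translation by up to max_shift pixels in each axis
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(image, dx, axis=0), dy, axis=1)
    # additive Gaussian noise
    return shifted + rng.normal(0.0, noise_std, size=image.shape)
```

In a training loop you would call this on each example every epoch, so the model never sees exactly the same input twice.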

Noise Robustness

It can be valuable to inject noise into the input, which in some cases is mathematically equivalent to L2 regularization (yet another argument for the canonical nature of that form of regularization, in addition to the impact on variance in the XTX matrix). In other cases it is a worthy form of regularization in its own right. Weight noise pushes the model toward minima that are not just low cost but also insensitive to small perturbations of the weights, where the region around the minimum is flat.

Injecting Noise at the Output Targets

Datasets can have errors in their output labels, which can be a disaster to attempt to learn. One soft resolution to this is label smoothing, which assigns a probability of 1 - epsilon to the correct class and epsilon/(k - 1) to each of the other k - 1 classes. (An alternative formulation spreads epsilon over all k classes, giving the correct class 1 - epsilon(k - 1)/k.) Label smoothing also solves the problem in the softmax where the weights grow without bound trying to output a perfect 1 for a class probability, which its formulation can never reach.
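A sketch of the smoothing itself, using the 1 - epsilon / epsilon/(k - 1) formulation (written as a transform of one-hot targets):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Turn hard one-hot targets into soft targets:
    the 1 becomes 1 - eps, each 0 becomes eps / (k - 1)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps - eps / (k - 1)) + eps / (k - 1)
```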

Semi-Supervised Learning

Semi-supervised learning is the case where data points without labels and data points with labels are both part of the modeling process. There can be both supervised and unsupervised components to the model. The classic application is doing PCA over all the data (labeled and unlabeled) to generate a robust representation of the data. The goal is to take data points that are clustered in space and give them a similar representation that can be more easily learned by a linear classifier. There is also an approach which does not separate components of the model. Instead, a generative network is trained on unsupervised data and shares weights with a discriminative network. (Note that at this point the reader has no idea what a generative network is.) The generative model minimizes -log P(x) or -log P(x, y) while the discriminative network is optimized for P(y | x). A trade-off parameter weights the importance of each term.

Multi-Task Learning

Different tasks with similar structure can share parameters to achieve better generalization results than using the data from one task alone. Often the structure is to have a set of shared weights and a set of task-specific weights. The shared weights benefit from the shared structure in each task, and are constrained to a representation that is valuable to both tasks.

Early Stopping

Idea - if you want to train with SSE but find weights effective for a median or absolute error, use early stopping with that validation criterion.

Early stopping is an elegant regularization technique where you monitor the validation error during training and save the weights that achieve the lowest validation error so far. You return those weights at the end of training, and end training when there is no improvement in the validation error for some number of steps. This is valuable because generalization error is generally U-shaped with respect to training time - the training error continues to decrease monotonically, but the validation error eventually rises due to overfitting. The initial learning happens over the axes of greatest curvature, and what's left late in training are small noisy properties of the training data.

There is additional computation in evaluating the validation error repeatedly. Remedies are to only compute the validation error every several iterations (decreasing the resolution but saving compute), to compute the validation score in parallel with actual model training so as not to disrupt it (if you have the resources available), or to compute the validation error on a smaller cut of the dataset.

Many methods for reapplying your validation data to training after early stopping are addressed, most of which look like bad methods to me. But who knows. One idea is to retrain your model from scratch for the number of parameter updates or sweeps over the data (unclear which of those will be best) that you learned from early stopping. Another is to start from your early stopped weights and train with the additional data. Early stopping can be shown to be equivalent to L2 regularization in linear regression with gradient descent, by taking a Taylor approximation of the loss.
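The monitoring loop can be sketched like this - train_step and validate are placeholder callables standing in for whatever training and validation you actually run:

```python
def early_stopping(train_step, validate, patience=5, max_steps=1000):
    """train_step() advances training and returns current weights;
    validate(w) returns validation error. Returns the best weights seen."""
    best_err, best_w, since_best = float("inf"), None, 0
    for _ in range(max_steps):
        w = train_step()
        err = validate(w)
        if err < best_err:
            # new best: save the weights and reset the patience counter
            best_err, best_w, since_best = err, w, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # no improvement for `patience` evaluations
    return best_w, best_err
```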

Parameter Tying and Parameter Sharing

We can impose a prior belief about parameter values by constraining them to be close to or the same as a prior value. In L2 regularization this is 0, but it can be the weights of an unsupervised model, or in the case of convolutional neural nets, shared parameters that impose a prior of translational invariance.

Sparse Representations

Thought - this version of sparse representations does all the work of a dense representation and then converts it into a sparse one. It's distributed, but only incidentally. Feels like a lot of work for something that could be a direct process.

The way sparse representations are obtained is with a L1 style regularization, or a hard constraint, over the activations of the network. This forces the value to 0 if it’s not important.

Bagging and Other Ensemble Methods

An ensemble of members will perform strictly better to the degree that the outputs of each member are independent, or uncorrelated with one another. It actually reminds me of information theory 'noisy channel' type problems. The claim is that we shouldn't use ensemble methods in papers because they distract from the algorithm itself. But it's likely that some algorithms benefit more from boosting than others, because they make uncorrelated errors. If some change to a network increases the variance in its output and decreases the bias, it'll benefit more from bootstrap aggregation and that should be part of its performance score. Ensembling can also be built into the algorithms, a la boosting. Here bagging is presented as the models differing in the dataset they're trained on, which is sampled with replacement and ends up with about 2/3 (1 - 1/e, about 63%) of the unique values of the total dataset.
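A quick check of the with-replacement claim - sampling n of n indices with replacement leaves roughly 1 - 1/e ≈ 63% of them unique:

```python
import numpy as np

def bootstrap_sample(n, rng):
    """Draw one bagging sample: n indices with replacement.
    Returns the indices and the fraction of the dataset that is unique."""
    idx = rng.integers(0, n, size=n)
    return idx, np.unique(idx).size / n
```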

Dropout

Here dropout is presented as an ensemble method, before we even know what it is. But if we've been paying attention we know that it's multiplicative noise injected on the units at each layer.

Dropout is presented as both bagging and noise resistance, as well as making the model spread learning over more aspects of the data. The weights need to be adjusted after training to account for dropout - for example, a ½ probability over the binary mask leads to weights that are twice as strong as they need to be at test time, and so we divide them by two. Dropout is equivalent to L2 regularization in linear regression but not in Neural Nets with activations. An upside to multiplicative noise is that a hidden unit can’t (as it could in the case of additive noise) make the output very large so that the noise is small by comparison. (though you hope the noise would be made a function of the input size?) Training runtimes can increase with dropout. We can also get good dropout performance with multiplicative noise that’s normally distributed around 1. This also omits the need for rescaling the weights.
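A sketch of the masking step - this is the 'inverted dropout' variant, where dividing by the keep probability during training replaces rescaling the weights at test time:

```python
import numpy as np

def dropout_train(h, keep_prob, rng):
    """Apply a binary dropout mask to activations h during training.
    Dividing by keep_prob keeps the expected activation unchanged,
    so no weight rescaling is needed at test time."""
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob
```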

Adversarial Training

Thought - Neural networks blur the line between feature engineering and modeling, so benchmarks that disallow feature engineering feel odd. For example, imagine you baked adversarial-training regularization into the network automatically instead of making it a data creation step. It would suddenly be fair to claim a higher score on a benchmark, even though you were exploiting the same concept, and now your network wouldn't benefit from more adversarial training data. It's equivalent in an important sense.

Adversarial examples that look almost identical to the original input can be generated by adding a tiny bit of noise to any input. This can have a large impact on the output due to the linearity of the structure - it means that lots of small noise across all features can have a large impact, especially when the number of features is large. Linear and Logistic Regression are extremely vulnerable to adversarial examples, but aren’t able to account for it with adversarial training because they can’t learn a nonlinear function of the inputs. Adversarial training has a regularization effect on the data, where there’s a local constancy prior that gets reinforced. Datapoints that are close in the space should generate the same predictions. In the context of the manifold hypothesis, the predictions should not jump from one manifold to the next due to a tiny perturbation. We want to enlarge the space around each manifold that is learned to be of a particular class. It’s clear that the “dead space” in classification space is just oddly behaved. Adversarial examples can be used in semi-supervised learning - make predictions on the unlabeled data, and then make sure that those predictions are made robust by turning the datapoint into an adversarial example and making sure you get the same output.
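The tiny-noise construction here is the fast gradient sign method - a sketch, assuming you already have the gradient of the loss with respect to the input:

```python
import numpy as np

def fgsm(x, grad_x_loss, eps=0.1):
    """Fast gradient sign method: nudge every feature by eps in the
    direction that increases the loss. Each nudge is tiny, but with
    many features the small moves add up to a large change in output."""
    return x + eps * np.sign(grad_x_loss)
```

For adversarial training you would generate these perturbed examples on the fly and train the network to give them the same label as the original input.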

Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Tangent Prop is a regularization technique that finds the tangent vectors for the manifold being approximated - asks what directions are parallel to the manifold of a class - so that small perturbations from that manifold can be trained against. This manifests itself as a part of the objective function in tangent prop, or as an approximation of the tangent lines by an autoencoder in the manifold tangent classifier.

Optimization for Training Deep Models

Optimization is extremely important. Training can take days to months for a single neural network, which is why there are specialized tools developed for the task. Many techniques here will take advantage of second order approximations of curvature, intelligently adapt the learning rate, and be applied to a loss function that includes regularization terms as well as the parameters.

How Learning Differs from Pure Optimization

Pure optimization is usually done directly over the task in question - a factory optimizing its output, for example. The greater problem to be solved is the same as the problem the optimizer solves. Learning is usually indirect, creating a loss function that takes an expectation over a sample from the data generating distribution (the training set). If we knew the true distribution p(x, y) we would have an optimization task for an optimization algorithm. When we do not know p(x, y) but have data sampled from it, we have a machine learning problem.

Empirical Risk Minimization

Empirical risk minimization is optimization on the training set. This tends to overfit. Question - I'm not sure what the 0-1 loss function is here. Is it just a utility function that gets 1 util if the class is correct and 0 if not?

Surrogate Loss Functions and Early Stopping

Typically we use a loss function that can continue to train even after getting correct results, instead of a 0-1 loss that stops updating as soon as we have a correct answer. With early stopping, we stop training before overfitting starts. Typical optimization goes until the gradient is very small, where we can say we’ve converged.

Batch and Minibatch Algorithms

Idea - stochastic gradient descent, where you choose your batch from the datapoints that are currently being misclassified.

It’s normal to use an amount of data that’s smaller than the whole dataset to make a gradient update, because the value to more data increases the error of the estimate of the gradient in the square root of the additional data while the runtime increases linearly. Online learning with a single datapoint (sgd) can lead to great generalization accuracy due to the noise in the gradient that comes out of taking a single point estimate of the dataset at a time. The downside to this is an extremely long runtime. SGD is usually implemented by shuffling the indices of the training data, instead of generating new numbers each time (though that’s also an option). Second order approximations of the Hessian require larger batch sizes, because error in computing the gradient is magnified in approximating the hessian.

Challenges in Neural Network Optimization

Ill Conditioning

Often the curvature of the loss is so large that the step size must be made extremely small for learning to happen - even large gradients can lead to an increased cost due to large curvature. Newton’s method is great at accounting for this, but has to be adapted before it can work with NNs.

Local Minima

I do not see the link from the ability to scale incoming weights and biases down and up to the claim that every local minimum lies on an m x n dimensional hyperbola of equivalent local minima. And there’s no reference.

In general, Goodfellow and Bengio say the local minima problem in optimizing non-convex NNs gets overblown, and that if people checked their gradient norms they would see that they're often not converging to 0 - meaning the problem is something other than local minima. They also argue that most local minima are equivalent and are extremely good solutions to the learning problem; there are few high-cost local minima to get stuck in. It can be hard to test positively for local minima - you may be at a saddle point instead.

Plateaus, Saddle Points and Other Flat Regions

Both minima and maxima become exponentially rare in high dimensional space (they require the curvature to have the same sign along every dimension simultaneously). Saddle points are much more common, occurring when there's a local minimum in some directions and a local maximum in the others. Newton's method simply solves for critical points, and so will find saddle points instead of local minima with high probability. Long flat plateaus are weaknesses for all training algorithms.

Cliffs and Exploding Gradients

Cliffs in the parameter space occur when there is extreme nonlinearity due to multiplied parameters. Often gradient descent can compute a huge gradient at the cliff and undo a lot of the optimization work that has been done - this is resolved with gradient clipping, restricting the size of the gradient so that cliffs can be dealt with.

Long-Term Dependencies

When repeatedly doing matrix multiplication, the values of the product can explode or vanish if the eigenvalues of the weight matrix are consistently far from 1 in either direction. If the eigenvalues are large, the gradient will explode in the direction of the corresponding eigenvector; gradient explosion leads to overstepping in optimization. If the eigenvalues are small, the gradient will die and the network will be unable to learn efficiently. This is especially true of RNNs, which multiply by the same weight matrix over and over again.

Inexact Gradients

Often we can only approximate the gradient, and it will be noisy. When we have a loss function that’s intractable, we often have to replace it with a surrogate loss which introduces even more inexactness to the gradient when compared with the true problem.

Poor Correspondence between Local and Global Structure

Often a long training process corresponds to a large distance between the initialization point and a point of sufficiently low cost, because current optimization algorithms focus on choosing the correct local direction. If an initialization puts regions of low cost far away or across local barriers, the optimizer has little chance of getting there without circumnavigating large boundaries. Most training processes do not end at a local minimum - they end at a place with a large gradient norm but sufficiently low cost.

Theoretical Limits of Optimization

Theoretical limits on optimization generally require making a number of incorrect assumptions about the system to constrain things to the point where limits can be derived. These typically have little bearing on the use of neural networks in practice. Realistic assumptions that are still tractable to reason about would be a large benefit to optimization research.

Basic Algorithms: Stochastic Gradient Descent

SGD has a number of benefits - it allows convergence on large datasets without an increase in runtime, has much better computational properties than a batch gradient, and is straightforward to implement. There are theoretical asymptotic results showing that a batch gradient should perform really well, but those results don't account for real advantages that SGD and others have in practice (as well as important constant factors outside of big-O convergence). In training, it's important to lower the learning rate for SGD over time. Generally this is implemented as a linear decrease from an initial learning rate down to a final learning rate, with a small decrease at each iteration. The parameters are the initial learning rate, the final learning rate, and the number of steps over which to decrease the learning rate. The best way to choose the learning rate hyperparameters is to monitor the learning curves that plot training error against iterations. Usually the number of iterations to take before stopping the decay of the learning rate is a few hundred passes through the training set (which depends on the minibatch size as well). The final learning rate should be about 1% of the initial learning rate. If the initial learning rate is too large, you'll see violent oscillations in the loss, and often the loss will increase dramatically. If the initial learning rate is too low, the optimizer will get stuck in a region of high cost.
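The linear decay schedule can be sketched as follows (the specific constants are illustrative; note the final rate is 1% of the initial one, as recommended):

```python
def lr_schedule(k, eps0=0.1, eps_tau=0.001, tau=100):
    """Linearly decay from eps0 to eps_tau over tau iterations, then hold."""
    if k >= tau:
        return eps_tau          # after tau iterations the rate stays constant
    alpha = k / tau             # fraction of the decay completed
    return (1 - alpha) * eps0 + alpha * eps_tau
```

The three hyperparameters mentioned above appear directly: eps0 (initial rate), eps_tau (final rate), and tau (number of decay steps).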

Momentum

Change - The pictures for gradient descent with and without momentum should be side by side. Change - I don't see how the momentum is actually being computed anywhere here? I got all my implementation details from a single sentence in the opening, that it's this decaying average gradient.

Momentum computes an exponentially decaying cumulative gradient and adds that value to the gradient at each step. Momentum is treated here both as friction and as momentum. At an intuitive level, it will have both effects - causing the gradient to move more quickly in a cumulative direction, as well as stopping the gradient from increasing the cost dramatically by taking a step too far in a direction that's not supported by the curvature. Momentum has a hyperparameter that controls how large it is, its own learning rate. This can also be tuned over time, but usually to increase instead of decrease.
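A sketch of the update as I understand it (v is the decaying accumulation of past gradients; lr and beta are hypothetical values):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """v accumulates an exponentially decaying sum of past gradients."""
    v = beta * v - lr * grad
    return theta + v, v

# on f(x) = x^2 / 2 (so grad = x), the iterate spirals in toward 0
theta, v = np.array([1.0]), np.array([0.0])
for _ in range(50):
    theta, v = momentum_step(theta, v, theta)
```

The beta parameter is the "its own learning rate" hyperparameter above: larger beta means past gradients persist longer.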

Nesterov Momentum

Here it says "in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence" - I'm sure this is just in the theoretical big-O sense, and not in practice. May be worth experimenting with momentum SGD in the convex case, and possibly contributing to Spark if it's more efficient.

The difference here is that the gradient is evaluated after the momentum is applied - that is, the weights that we use to compute the gradient have already been updated with momentum.
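The lookahead evaluation can be sketched as follows (grad_fn stands in for whatever computes the gradient; it is not a real API):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Evaluate the gradient at theta + beta*v, i.e. after momentum is applied."""
    g = grad_fn(theta + beta * v)   # gradient at the lookahead point
    v = beta * v - lr * g
    return theta + v, v

# on f(x) = x^2 / 2, the gradient function is just the identity
theta, v = nesterov_step(np.array([1.0]), np.array([0.0]), lambda x: x)
```

The only difference from standard momentum is the argument passed to grad_fn: theta + beta*v instead of theta.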

Parameter Initialization Strategies

The major concept is that we don’t want an initialization that leads to weights being updated in exactly the same ways, learning identical functions of the input, and so being redundant. We also want stable gradients.

Given these principles, initializing weights as orthogonal matrices makes sense. It is also easy and sensible to use small random weight initializations.
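A common way to build an orthogonal initialization, sketched via QR decomposition of a Gaussian matrix (the seed is arbitrary):

```python
import numpy as np

def orthogonal_init(shape, seed=0):
    """QR-decompose a random Gaussian matrix; Q has orthonormal columns."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)
    q, _ = np.linalg.qr(a)
    return q

W = orthogonal_init((4, 4))
```

Orthonormal columns mean the matrix neither grows nor shrinks the norm of the signal it multiplies, which supports the stable-gradients goal above.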

In general, biases are initialized to 0.

We can find better initializations by training on another task beforehand, or by using an autoencoder.

Algorithms with Adaptive Learning Rates

The learning rate is important to model performance and hard to tune. Momentum ameliorates this, but ideally we would have an adaptive learning rate for each parameter in the network.

The delta-bar-delta algorithm is an example of such an adaptive learning rate. The thinking is simple. If the sign of a parameter's gradient stays the same (that is, we keep updating it in the same direction) we should increase the learning rate. If the sign flips (switching back and forth) we should decrease the learning rate. Question / Change - it's not clear to me that this requires a batch gradient at all. This is either wrong or I'm missing something in this algorithm.
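A sketch of the sign-comparison rule (the increase/decrease constants kappa and phi are hypothetical; the actual algorithm also smooths the gradient history, which is omitted here):

```python
import numpy as np

def delta_bar_delta(lr, prev_grad, grad, kappa=0.01, phi=0.1):
    """Same sign: grow the rate additively; sign flip: shrink multiplicatively."""
    same_sign = np.sign(prev_grad) == np.sign(grad)
    return np.where(same_sign, lr + kappa, lr * (1 - phi))
```

Each parameter gets its own rate: the first parameter below kept its sign (rate grows), the second flipped (rate shrinks).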

AdaGrad (Adaptive Gradient!)

AdaGrad takes parameters with historically small gradients and increases their learning rates, and takes parameters with historically large gradients and decreases their learning rates. This should lead to more progress in the gently sloped regions of the space. In practice, if AdaGrad is used from the beginning of training it ends up with bad prior information about the gradient. The metric used for the history of the gradient is the sum of squares of previous gradients, inverted. This makes them all positive and scales the value quadratically. I should code up AdaGrad and RMSProp on some convex problem, at the very least.
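A minimal AdaGrad step (lr and eps are typical defaults, not values from the text):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, eps=1e-8):
    """r accumulates the elementwise sum of squared gradients;
    dividing by sqrt(r) shrinks the rate where gradients have been large."""
    r = r + grad * grad
    theta = theta - lr * grad / (np.sqrt(r) + eps)
    return theta, r
```

Because r only ever grows, the effective learning rate monotonically decreases - this is exactly the "bad prior information" problem noted above when it is used from the start of training.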

RMSProp

A modification to AdaGrad, using an exponentially decaying moving average over the training history to adapt the learning rate for particular parameters. When training on non-convex loss functions, it often takes time to get to a convex bowl where the adaptive learning rate will be most effective. This lets you forget old structure and converge in the bowl efficiently. You can get the exponentially decaying average by just continually multiplying the previous accumulated value by a hyperparameter less than 1!
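The same step with the decaying average swapped in (rho plays the role of the "hyperparameter less than 1" above; the defaults are illustrative):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, eps=1e-8):
    """r is an exponentially decaying average of squared gradients,
    so old gradient structure is gradually forgotten."""
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - lr * grad / (np.sqrt(r) + eps)
    return theta, r
```

The only change from AdaGrad is the first line: multiplying the old accumulator by rho < 1 instead of keeping the full sum.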

Adam

Adam involves incorporating momentum into the adaptive learning rate's gradient computation: when computing the exponentially weighted update, it uses an estimate of the weighted historical gradient (a first-moment estimate) in the numerator instead of the raw gradient, while continuing to use the square root of the decaying sum of squared previous gradients in the denominator, much like in RMSProp. Both moment estimates are also bias-corrected.
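A sketch of the Adam update with the standard bias-corrected moment estimates (the default constants follow the original paper; treat this as my reconstruction, not the book's pseudocode):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """m: decaying average of gradients (momentum-like numerator);
    v: decaying average of squared gradients (RMSProp-like denominator)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)   # bias correction; t counts from 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The bias correction matters early in training, when m and v are still dominated by their zero initialization.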

Choosing the Right Optimization Algorithm

Change - Adadelta is referred to as if we know what it is, but never explained.

The most prominent optimizers are SGD, SGD + momentum, RMSProp, RMSProp + momentum, AdaDelta, and Adam. Which one to use often depends on the user's familiarity with its hyperparameters.

Approximate Second-Order Methods

Newton’s Method

Newton’s method involves computing the hessian as well as the gradient, and taking a step that involves multiplying the inverse hessian by the gradient and adding it to the weights. In the quadratic loss function case, this leads directly to the optimal solution of the quadratic approximation of the loss function. This is a large part of why newton’s method is very popular. The above method can also be iterated for convex non-quadratic loss functions. In neural networks the hessian may not be positive definite, in which case newton’s method will fail (the curvature is at a saddle point, for instance). There are ways to ameliorate the problem - a normalized hessian can be formed by adding a set of values to the diagonal of the hessian, forcing the diagonal to be positive and so forcing the hessian to be positive definite. But in the worst case this normalization will dominate the hessian. Newton’s method is based on using a second order Taylor series expansion to approximate the loss funciton near some parameter. Computationally, the hessian can be extremely large for neural networks. The Hessian is squared in the number of parameters. Newton’s method requires the inversion of a k x k matrix, complexity of O(k3). With even small neural networks the number of parameters can be in the millions.

Conjugate Gradients

Questions - how does line search work, at a low level? How does steepest descent work (is it just line search + GD)? In steepest descent, there's a problem where successive steps are always orthogonal to one another, which leads to an inefficient zig-zag in parameter space. Conjugate gradients are introduced to solve this problem. Two directions are 'conjugate' if you get 0 when you multiply one direction by the Hessian on the left and the other on the right. The other framing is that in Newton's method, computing the inverse Hessian is expensive. We can compute the conjugate directions, often with the ratio of squared gradients across time steps; this ratio is an approximation of the curvature at that point. This lets us refine the search direction for the line search - instead of it just being the gradient, it's the gradient plus a multiple of the previous direction. There's a nonlinear conjugate gradient method that people report works well for training neural networks. Some assumptions break a bit, but that's okay.
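To partly answer my own question: for a quadratic with a symmetric positive-definite Hessian A, conjugate gradient with exact line search looks like this (a standard textbook sketch, not the book's presentation):

```python
import numpy as np

def conjugate_gradient(A, b, x, iters=10, tol=1e-12):
    """Minimize 0.5 x^T A x - b^T x without ever inverting A."""
    r = b - A @ x              # residual = negative gradient
    d = r.copy()               # first direction = steepest descent
    for _ in range(iters):
        rr = r @ r
        if rr < tol:
            break
        alpha = rr / (d @ A @ d)     # exact line search along d
        x = x + alpha * d
        r = r - alpha * (A @ d)
        beta = (r @ r) / rr          # ratio of squared gradient norms
        d = r + beta * d             # new direction, A-conjugate to d
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))
```

The beta line is exactly the "gradient plus a multiple of the previous direction" refinement described above; for an n-dimensional quadratic it converges in at most n steps.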

BFGS

Quasi-Newton methods like BFGS approximate the inverse Hessian using an initialization plus updates built from the difference of weights (past vs. present) and the difference of gradients (past vs. present). It then does line search over the direction given by the inverse-Hessian approximation multiplied by the gradient, takes a step, and re-approximates the inverse Hessian at the new point. Much like conjugate gradients, quasi-Newton methods look to replicate the effect of Newton's method without inverting the Hessian. Unlike conjugate gradients, this method is well behaved even when the line search isn't perfect.

Limited Memory BFGS (L-BFGS)

BFGS also forces you to hold the inverse Hessian approximation in memory. With Limited-Memory BFGS, we initialize the inverse Hessian approximation to the identity at each step, which is equivalent to using a few stored vectors to approximate the inverse Hessian without holding the entire matrix - those corrections, built from differences of parameters and differences of gradients, are applied to the gradient direction. This is equivalent to replacing the previous Hessian approximation M_{t-1} with an identity matrix, which is diagonal and so can be multiplied efficiently.

Optimization Strategies and Meta-Algorithms

Batch Normalization

Batch Normalization at a low level means taking the output from your affine transform / convolution and normalizing it by subtracting the mean of the units in that batch and dividing by the standard deviation of the units in that batch. The model for each update is that, in general, updates are computed without respect for higher-order interactions. That is, each layer is updated assuming that the other layers are held constant. This means that if there's a higher-order interaction - where layer 1 impacts layer 2's ideal update and layer 2's update impacts layer 3's ideal update - it becomes hard to account for, and the layers with many dependencies end up with poor updates. Batch Normalization alleviates the impact of these higher-order interactions, because there are guarantees that the statistics of each layer in the future will be similar to the layer's current statistics. Batch Norm also involves learning parameters that add a scalar to each unit (a new bias) and multiply each unit by a scalar (a flexible scale).
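The normalization itself, sketched for a batch of activations (gamma and beta stand for the learned scale and shift; eps avoids division by zero):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each unit (column) over the batch (rows), then rescale."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 2.0], [3.0, 4.0]])
out = batch_norm(batch)   # each column now has mean ~0, std ~1
```

The statistics are per-unit across the batch dimension, which is what gives the guarantee that each unit's distribution stays stable across updates.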

Coordinate Descent

This involves minimizing one set of parameters at a time, or a single parameter at a time, independent of the other parameters. Works poorly if there are dependencies between parameters.

Polyak Averaging

The idea is that you average the parameters of solutions found during parameter search. Intuitively, if the points are on different sides of a bowl the average of their values will be near the bottom of the bowl. In convex optimization you can get strong convergence guarantees with Polyak Averaging. Empirically, it performs well in non-convex optimization as well. Often it’s employed with an exponentially decaying moving average - it’s unclear what this means in this context, if it’s an average over more recent parameters instead of older parameters or what.
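My reading of the exponentially decaying variant, sketched (alpha controls how quickly old iterates are forgotten; the value is illustrative):

```python
def polyak_average(theta_avg, theta, alpha=0.99):
    """Running average of iterates; an older iterate's contribution
    decays by a factor of alpha at every subsequent step."""
    return alpha * theta_avg + (1 - alpha) * theta
```

Under this reading, the average does weight more recent parameters over older ones, which would resolve the ambiguity noted above.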

Supervised Pretraining

You can train a shallow network and then keep the parameters as the first and last set of layer weights and train the middle weights given ‘already good’ weights. Transfer learning leverages this, where you can replace the output weights and start from previous lower level weights.

In greedy supervised pretraining, each layer is initialized by being trained as a part of a shallow supervised MLP, with the input being the previous layer’s weights.

Designing Models to Aid Optimization

It’s often much more important to choose a model family that is easy to optimize than to use a powerful optimizer. Being close to linear does a lot for ensuring that the gradient signals are good both locally and over the long term. We could use strongly nonlinear activations, put in practice we struggle to optimize that model. There are models that have skip connections between layers to propagate gradient signal more efficiently to intermediate layers. There are also models (inception) with the output at multiple points in the network, all propagating gradient back so that layers can learn more efficiently.

Continuation Methods and Curriculum Learning

Continuation methods are based on the idea that we can initially train a model with a simple cost function, use the result to initialize a slightly more complex model, use that to initialize an even more complex model, and so on. This should make a complex loss easier to learn than trying to learn it directly. Continuation methods often construct this sequence by blurring a non-convex loss function, with the hope that blurring makes it approximately convex, then training on a less blurred version next, and so on.

Similarly, curriculum learning involves going from learning simple concepts to learning more complex concepts that build on the simpler ones. By stochastically presenting a mix of examples, with harder examples given more weight over time, a model can learn a better representation than if it's shown all the data at once. Metaphors can be made with a teacher who teaches the learner simple concepts first, and then builds on those with more complex examples.

Convolutional Networks

They intend to discuss the convolution operation, its variants in DL, its neural inspiration, pooling, and making the operation efficient. The tendency to start with the most general treatment and only then get concrete is really damaging from an understanding viewpoint. Clear examples are much better than abstract structures for understanding; this is the second time that it's been made harder.

The Convolution Operation

The convolution operation can be continuous in the case of a real-valued input and filter/kernel, but will be discrete in most real applications. Cross-correlation is the operation most libraries actually implement, and it is often called convolution as well; the difference is that true convolution flips the kernel, which is what gives it the commutative property in the summation between kernel and image. Discrete convolution can be treated as matrix multiplication. In this case, each row of the matrix is constrained to be equal to the row above shifted by one element, known as a Toeplitz matrix.
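A direct sketch of the discrete "valid" cross-correlation (no kernel flip), to make the operation concrete:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it ('valid' mode)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(9, dtype=float).reshape(3, 3)
k = np.ones((2, 2))
result = cross_correlate2d(img, k)   # each output sums a 2x2 patch
```

Flipping the kernel before this loop would turn it into true convolution; with a symmetric kernel like the all-ones one here, the two operations coincide.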

Motivation

Sparse connectivity, parameter sharing, and equivariant representations.

Sparse connectivity means that the number of parameters we need is orders of magnitude less than what would be required for fully connected networks. We also (in deep networks) still end up with a model where interactions are captured, as an output late in the network pulls input value from a large set of the inputs. With sparse connectivity, the runtime of a forward operation goes from O(m x n) (matrix multiplication time) to O(k x n), where the kernel size k can be orders of magnitude less than the input size m.

Parameter sharing dramatically increases the efficiency of basic linear transformations - it can be billions of times more efficient than dense matrix multiplication. Note - this is a wonderful example of specialized feature learning. Note - perhaps it's possible to think of convolution as expressing a prior belief over the structure of the features that should be learned by the algorithm - that they be local, that they be linear.

Equivariant representations are extremely elegant - what's meant by equivariant is that when you transform the input, the output transforms in the same way. If you shift pixels in an image, the convolved transformation will be the same (same weights) and the output will be shifted in an equivalent manner. Convolution is equivariant to translations. It's not equivariant to the scale of the image (closer in / farther away) or to the rotation of an image. Dealing with those will require other transformations.

Pooling

Pooling takes an average or max over a region of the convolution output. It represents a strong prior that the output be invariant to small translations - small translations will lead to the pooling layer outputting close to the same outputs for each version of the input. Pooling can also learn invariance to other transformations: given multiple features to pool over (where the features are learned with separate parameters), if any of the units activates strongly the pooled output will be the same.

The image of architectures here is very valuable, and answers important questions. Kernels are passed over the image, and so the result is many convolved images (feature maps), as many as there are filters. These are put through a ReLU. The output is then put through a pooling operation - this operates over each feature map / convolved image independently. The result is a smaller output than the original image (convolution preserves size, but pooling reduces size). That output can be put through another convolutional layer, a reshape, or a softmax (depending on size). If put through a reshape, the transformed images are flattened and laid side by side. This vector is an input to a fully connected layer (matrix multiply) which can convert it into a shape appropriate for the softmax. I should reason about what shape a softmax requires.

Question - I have no idea how to backprop through a reshape. This would be a great question to ask, and one that will soon be very relevant. I guess in backprop you just impose a shape on the gradient vector?
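Non-overlapping max pooling over a single feature map can be sketched as:

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Max over non-overlapping size x size windows of one feature map."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]   # drop any ragged edge
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(fmap)   # 4x4 -> 2x2
```

Shifting the input by one pixel changes each window's contents only slightly, so the max in each window often stays the same - that is the translation-invariance prior in action.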

Convolution and Pooling as an Infinitely Strong Prior

We can think of convolution as having a prior on relatedness of local elements and of pooling as an infinitely strong prior (some weights are not possible) that the image should be translationally invariant. This satisfies my thinking of convolution as a prior from before, and implies that the creator of the algorithm, instead of having it learn topology, hard-codes topological assumptions in.

Variants of the Basic Convolution Function

Oh! The reason my convolved size shrank was that my padding has to be a function of the kernel size!! … And it worked perfectly!! The right ratio was kernel_length/2

As a side note, my model for how systems are working has gotten disturbingly close to reality. I’ve been making real and meaningful predictions about how changes to my code will change the running of my algorithm.

There are a few variants of convolution that vary based on the amount of 0 padding they add to the image. A ‘valid’ convolution is one that fits the filter only over spaces that are completely present in the image. This will reduce the size of the image at each iteration of a convolutional layer.

Another variant is ‘same’ - it adds just enough padding to replicate the size of the image, but also underweights the edges of the image relative to the center.

A final variant is 'full' convolution, which adds enough padding that the kernel passes over the outer pixels as many times as the central ones. With this much padding the output will grow to m + k - 1 after the convolution. The output pixels near the border are a function of fewer real pixels than the output pixels near the center, making it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Page 357, Convolutional Gradient: multiply by the transpose of the matrix defined by convolution.

We can use the convolution operation to define the convolutional gradient.

Sequence Modeling: Recurrent and Recursive Nets

RNNs are specialized for modeling sequences, where there are a set of time steps over the data. The inputs are processed one at a time as a sequence, with parameter sharing over the time steps (much like CNNs parameter share over the image). Where an MLP with a fixed input size would have to learn how to deal with the same structure at every feature, an RNN will apply the same set of weights at each point in the sequence to pick up a structure no matter where it is in the sequence. The output is a sequence where each output is influenced by the previous outputs. The ‘recurrence’ comes from the value of a unit influencing the future value of that same unit as the sequence progresses. A shallow version of sequence parameter sharing can be accomplished by a 1D convolution applied over the sequence, modeling the neighbors in a small region. The sequence length processed by RNNs is arbitrary, and typically there are minibatches of sequences that are processed.

Unfolding Computational Graphs

A recurrent sequence is a function of its previous state and some parameters. This can be represented in two ways - one is a state by state representation, where each possible state is shown with a function that transitions from one to the next. The indices of the sequence length inform the length of the unfolded sequence. Another is as a single node that recurs, for an arbitrary number of timesteps. These sequences are usually informed by an input as well as parameters. So we’ll have a hidden unit that is a function of the previous hidden state, some weight matrix, and the sequence input to the model at that time step. The advantage of this structure is that we can share parameters across the sequence. Each input in the sequence is processed by the same weight matrix and a hidden state that’s a function of all previous hidden states. This also allows us to flexibly deal with arbitrary output lengths.

Recurrent Neural Networks

Recurrent neural networks take advantage of the recurrence defined previously. The hidden state of the network is a function of the previous hidden state and the input - we multiply the input by a weight matrix, multiply the previous hidden state by another weight matrix, add a bias to get our hidden layer, then apply an activation element-wise. The output can be represented flexibly. The classic representation is to take the hidden state at that time step, multiply it by an output weight matrix, and then apply a softmax to get the output. Doing that at each timestep gives an output sequence of equal length to the input sequence. Loss gradients can be computed to run backprop on the set of operations. The downside to backprop in this context is that each hidden state depends on every hidden state before it, and so the forward and backward passes are computationally harsh. A single output can be generated at the end of a sequence by making that output a function of only the final hidden state. The RNN can also be made easier to parallelize by making the next hidden state a function of only the previous output. This tends to destroy a lot of information, but can be much easier to train computationally. The RNN can approximate a Turing machine, even with a finite set of weights: when you discretize its output, it has been shown to be able to compute any computable function in theory.
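The classic hidden-state recurrence above, sketched with hypothetical weight names (W: hidden-to-hidden, U: input-to-hidden, V: hidden-to-output):

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b, V, c):
    """h_t = tanh(W h_{t-1} + U x_t + b); o_t = V h_t + c (pre-softmax)."""
    h = np.tanh(W @ h_prev + U @ x + b)
    o = V @ h + c
    return h, o

# tiny example: identity input weights, zero recurrent weights
h, o = rnn_step(np.zeros(2), np.array([1.0, 0.0]),
                W=np.zeros((2, 2)), U=np.eye(2), b=np.zeros(2),
                V=np.eye(2), c=np.zeros(2))
```

The same W, U, V, b, c are reused at every time step - that reuse is the parameter sharing that lets the network handle arbitrary sequence lengths.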

Teacher Forcing and Networks with Output Recurrence

One upside to only using the outputs to inform the hidden state at the next time step is that, because our training labels are given, we can independently compute the gradient in parallel for our weights based on the training labels. This is called teacher forcing, and lets us avoid having to use backpropagation through time, which is much more time intensive. At test time, we use the output at the previous time step to modify the hidden state. It can make sense during training to mix outputs from the network itself as labels with the true labels, so that the network can learn to deal with that kind of input stream.

Computing the Gradient in a Recurrent Neural Network

Backpropagation on an RNN is just backprop on the unfolded computational graph. There's a nested chain rule on parameters that have dependencies. The gradient involves starting with the final output and computing the gradient of the weights between the hidden layer and output, after the gradient through the softmax is computed. The final hidden layer only depends on the final output, but hidden layers earlier in the graph depend on every future hidden layer as well as the gradient from their own output. The gradient on the matrix that transforms the input to the hidden state depends only on its own layer's hidden-state gradient (as well as the activation). The dependencies of the biases have a similar structure.

Recurrent Networks as Directed Graphical Models

Change - The reader doesn’t know graphical models yet - I don’t understand the purpose of a lot of this content, despite having thought ideas in this section already. I really don’t know what they were trying to do here. It may be that I don’t know graphical models, and so couching RNNs as one makes a ton of assumptions about what I already know.

There are a number of ways to encode the end of a sequence. One is to have a binary output that determines whether the sequence should end or not. Another is to have a character in the vocabulary that represents the end of the sequence, and terminate when that value is hit. A final way is to predict the actual length of the sequence, and end at that length. The RNN represents a joint distribution of dependency of previous states on future states. The RNN has an extremely efficient way to represent dependency with parameter sharing. The downside is that the parameterization can be very difficult to learn.

Modeling Sequences Conditioned on Context with RNNs

We may want to use a vector as the input instead of a sequence. One approach is to make each element of the vector into an input.

This is a discussion of how to frame RNNs probabilistically. They're both related to the output at previous timesteps and to the input. They can be framed as a joint distribution over y, or as a distribution over y conditioned on x.

In order to use an RNN for a single vector, we can multiply that vector by a weight matrix and use those outputs as inputs to the hidden units.

There’s generally an independence assumption made between inputs that we can get rid of by adding connections from the output at time t to the hidden layer at time t+1 - this makes the next step dependent on all previous steps.

Bidirectional RNNs

In many sequence to sequence learning tasks, we want to condition the output on the entire input sequence. This is currently only possible by having outputs that come after the sequence, but we can remedy this efficiently for outputting sequences with Bidirectional recurrent neural networks. Bidirectional RNNs have been very successful for speech recognition, handwriting recognition, and bioinformatics. And for state of the art end-to-end machine translation. The Bidirectional RNN simply has hidden units that are recurrent in both directions - from t=1 to n and t=n to 1. Both hidden layers contribute to the output. The representation at any given output depends on both the past and the future, with a focus on inputs close to the time step in question.

Encoder-Decoder Sequence-to-Sequence Architectures

In many applications - machine translation, question answering, speech recognition - we may want to map an input sequence to a sequence of a different length. The input representation is referred to as the context, C. C might be a vector or a sequence of vectors. The framing for sequence-to-sequence is to use two RNNs. The encoder RNN processes the input and produces the context C, typically a fixed-length vector. The decoder RNN is conditioned on C and produces a variable-length output. The entire system's loss is optimized over the training set. One downside of this approach is that if a fixed-length C is too small to represent the input information appropriately, the algorithm will perform poorly. An innovation made to deal with this is to let C be variable length, and have an attention mechanism pull out the most relevant information from the context for the decoder.

Deep Recurrent Networks

There are three major transformations in recurrent neural networks: input to hidden state, hidden state at time t to hidden state at time t+1, and hidden state to output. Each transformation is represented by a weight matrix followed by a nonlinearity. It would be valuable to add depth to these networks, and there are a number of ways to do it. One is hierarchical, where there are multiple hidden states that feed through each other towards the output. Another is to add multiple weight matrices within the input-to-hidden, hidden-to-hidden, or hidden-to-output transformations. A third is to add multiple weight matrices but also add skip connections. Since adding depth to the hidden-to-hidden transformation lengthens the path between any input and output, these skip connections are important for making information available at a distance.

Recursive Neural Networks

Recursive Neural Networks have a computational graph structured as a deep tree, instead of the chain-like structure we've seen in RNNs. They've been used successfully in processing data structures as inputs to neural networks, in feature engineering for NLP and vision. The ideal shape of the tree can depend on the context - parse trees produced by a parser can be learned on top of, or a balanced binary tree may make sense for keeping information close by.

The Challenge of Long-Term Dependencies

Long term dependencies are hard to learn because gradients tend either to vanish or explode over the long term (usually vanish). This is due to many jacobian multiplications, leading to exponentially smaller weights for long-term interactions. This is worse in RNNs because the same weight matrix is applied multiple times. This does lead to extremely nonlinear functions. If its eigenvalues deviate from 1 we’ll get difficulties in learning long-term dependencies. Any component of the hidden state misaligned with the largest eigenvalue will eventually be discarded. The issue with long term dependencies is that they tend to have tiny weights, and so are in a region where gradients will vanish. It’s easy for long-term gradient signals to be overridden by short term fluctuations that don’t have to operate at such an exponential disadvantage.

Echo State Networks

The idea here is that it can be difficult to learn the weight matrix for the input to hidden and hidden to hidden connections. One approach is to only learn the output weight matrix, which is done in Echo State Networks and Liquid State Machines (where LSM use spiking neurons). The important question here is that if we’re only learning the output, how do we set the hidden/input weights to create a rich feature representation for us to learn on top of? The answer given is to treat the network as a dynamical system, and keep it near the edge of stability. One idea is to keep the eigenvalues of the jacobian matrix that orchestrates state transitions between hidden layers close to 1. Especially noting the spectral radius of the jacobian, or the maximum absolute value of its eigenvalues. When an eigenvalue’s absolute value is greater than 1 we get an exponential scaling of the value of the hidden state. When it’s less than one it becomes exponentially small. When we add in the nonlinearity, the effect is much less pronounced - its derivative will approach zero on many time steps, preventing the explosion that comes from a large spectral radius. Current recommended spectral radius is actually much larger than unity (1). We can initialize to a spectral radius of 3 and train, or use echo state networks for pre training RNNs with an initialization around 1.2.

Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long time scales is to create a model that can deal with both fine and coarse time scales. Techniques for accomplishing this include skip connections across time and units that process data at different time scales.

Adding Skip Connections through Time

One way to avoid having gradients vanish is to re-introduce the information through a delayed skip connection. This connects variables from the distant past to variables in the present.

Leaky Units and a Spectrum of Different Time Scales

One way to maintain information over time is with linear self-connection, where a proportion of the hidden state is carried over into the next hidden state. These are called leaky units. They have a parameter that represents how much of the hidden state to remember, which can control how quickly information is remembered or forgotten.
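
The leaky update is simple enough to write down directly: a fraction alpha of the old state is carried over each step, and that parameter sets the forgetting rate.

```python
# Leaky unit sketch: a proportion `alpha` of the old hidden state is
# carried into the next state, so alpha controls how quickly
# information is remembered or forgotten.
def leaky_update(h, new_input, alpha):
    return alpha * h + (1.0 - alpha) * new_input

h = 1.0
for _ in range(10):
    h = leaky_update(h, 0.0, 0.9)  # alpha near 1: slow forgetting
```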

Removing Connections

While skip connections add edges between computations, we can also deal with long term dependencies by removing one-step connections. This will shorten the path for information to travel. One option is to make recurrent units leaky, but have different groups of units associated with different time scales. Another approach is to have updates take place at different times, with a different frequency for different groups of units.

The Long Short-Term Memory and Other Gated RNNs

Gated RNNs are currently state of the art, used often in cases where RNNs are effective. Gated RNNs allow for state to be maintained, much like leaky units - they allow for paths through time that aren’t lost to an exploding or vanishing gradient. With Gated RNNs the network will accumulate information over time. They also allow for state to be forgotten, or set to 0, after it has been used. This forgetting doesn’t happen manually - they learn when they should do this.

LSTM

Change? At first look, the computation of the external input gates looks like two scalars multiplied - certainly not the same shape as the state unit. Likely I’m just confused, but I want this presentation to be enough. In an LSTM, the self-loops that allow information to be maintained for long periods have weights on that loop that are learned from the context. This allows an LSTM to flexibly deal with memory over different durations. The weight of this self-loop is controlled by another hidden unit, which lets that time scale change dynamically based on the input. LSTMs have been extremely successful for machine translation, parsing, speech recognition, handwriting generation, image captioning, and much more. Every cell of the LSTM is a recurrent neural network - RNNs within an RNN. In addition to being an RNN, each cell has gating units that control the flow of information. The major RNN has some internal state that is modified by a self-loop - partially influenced by its previous state and partially by the new input. The degree to which it is influenced by its previous state is determined by the forget gate, which also sees the input at each step and computes its weighting with a sigmoid. The major RNN’s hidden state is also influenced by the input gate. The output of the major RNN is multiplied by a scalar from the output gate, which can shut off the output by giving a value close to 0. In addition to sending the input to each cell/gate, the hidden state of the major RNN can also be an input to each gate. This allows those gates to condition on the current hidden state, with additional parameters to modulate its effects.
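
To resolve my own shape confusion above: in the real thing every quantity is a vector and the gate multiplications are element-wise; here is a scalar sketch of one step, with made-up illustrative weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Scalar sketch of one LSTM step. Weights are made-up numbers; in a
# real LSTM every quantity is a vector and the products are element-wise.
def lstm_step(c, h, x, W):
    f = sigmoid(W["f"][0] * x + W["f"][1] * h + W["f"][2])   # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h + W["i"][2])   # input gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h + W["g"][2]) # candidate
    c = f * c + i * g     # the self-loop, weighted by the forget gate
    h = o * math.tanh(c)  # output gate near 0 shuts the cell's output off
    return c, h

W = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "o", "g")}
c, h = lstm_step(0.0, 0.0, 1.0, W)
```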

Other Gated RNNs

Change? Nothing makes sense because everything is scalar?? YES!! The 1-multiplier I was looking for last time! GRUs have a single update gate that decides what proportion of the state to forget and when to update the state of the RNN. The reset gate chooses which parts of the state get used to compute the future state.
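
Scalar sketch of a GRU step (weights made up, not learned), showing that 1-multiplier explicitly: the (1 - z) share of the old state is carried through unchanged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Scalar GRU sketch: a single update gate z interpolates between the
# old state and a candidate state; the reset gate r decides how much of
# the old state feeds the candidate. Weights are illustrative.
def gru_step(h, x, W):
    z = sigmoid(W["z"][0] * x + W["z"][1] * h)  # update gate
    r = sigmoid(W["r"][0] * x + W["r"][1] * h)  # reset gate
    h_cand = math.tanh(W["h"][0] * x + W["h"][1] * (r * h))
    return (1.0 - z) * h + z * h_cand           # the 1-multiplier

W = {k: (0.5, 0.5) for k in ("z", "r", "h")}
h = gru_step(0.0, 1.0, W)
```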

Optimization for Long-Term Dependencies

Long-term dependencies are often easier to deal with by creating a model that’s easy to optimize than by creating a stronger optimizer. Second-order methods can be seen as dividing the first derivative by the second derivative, or multiplying the gradient by the inverse Hessian. Often both go to zero at the same time. Decent results have been obtained from second-order methods, but similar results can be obtained with SGD, which is generally what is used in LSTM training.

Gradient Clipping

Often we have a step size that doesn’t generalize to the region around the current point. If there’s strong curvature, the step will send the optimizer far off the proper path, especially if the gradient is quite large. One solution is gradient clipping: measure the gradient norm, and if it is above a certain threshold ‘clip’ the gradient by dividing it by the norm and multiplying by the threshold. That keeps the gradient pointing in the proper direction but with smaller magnitude. Often what works just as well is to take a random step when the gradient is too large, to get away from the numerical instability. Note that when you clip the gradient, aggregates of previous gradients will be biased.
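
Clip-by-norm as described, in a few lines:

```python
import math

# If ||g|| exceeds the threshold, rescale g to have norm exactly equal
# to the threshold, keeping its direction; otherwise leave it alone.
def clip_gradient(grad, threshold):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return grad

clipped = clip_gradient([3.0, 4.0], 1.0)  # norm 5 -> rescaled to norm 1
```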

Regularizing to Encourage Information Flow

While gradient clipping is good for exploding gradients, it doesn’t help us with vanishing gradients. One approach is to ensure that the gradient’s information flow has a similar magnitude all the way through the network. Instead of having the gradient from the loss die out over the long-term dependency, we ensure that the gradient at all hidden layer updates has a magnitude equal to the magnitude of the gradient right next to the loss. When applied simultaneously with gradient clipping, this can strongly improve the ability of an RNN to learn long-term dependencies. It is currently outperformed by LSTM on tasks where data is abundant, but I don’t see why they’re incompatible (if they are). This is true for most regularization strategies as well, where tons of data makes them less important. But it also takes much more time to train.

Explicit Memory

The parameters of a network are good at storing implicit knowledge. But remembering facts is extremely hard for SGD, which also needs to see many instances of a piece of information to store it implicitly. Ideally the network could store facts and use them to reason. To resolve this difficulty, memory networks were presented. Networks could read from and write to memory, leading to Neural Turing Machines that could use a soft attention mechanism to learn how to interact with memory. This allowed for the creation of algorithmic neural networks that still allow gradient-based optimization. The idea with the Neural Turing Machine is to have state learn which parts of memory to read from or write to. It’s hard to do this with specific addresses, and so we can read a weighted average from many cells at the same time. Writes modify different cells by different amounts. There are a few ways to label the addresses in memory - one is content-based addressing, where by recalling a few parts of the content of the memory you can retrieve the entire vector-valued memory. The other is with labels, where numbers/slots are associated with each memory. The choice of addresses to read is made by an attention mechanism. We have had more success training these soft decisions, where a number of cells are weighted, than hard decisions where we read a single memory vector and have to use a specialized optimizer.
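
A sketch of the soft, weighted-average read: softmax some (here made-up) scores over the memory cells, then average the cells by those weights, which keeps the read differentiable.

```python
import math

# NTM-style soft read sketch: instead of picking one address, take a
# softmax-weighted average over all memory cells.
def soft_read(memory, scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    return [sum(w * cell[j] for w, cell in zip(weights, memory))
            for j in range(width)]

memory = [[1.0, 0.0], [0.0, 1.0]]
read = soft_read(memory, [10.0, -10.0])  # nearly a hard read of cell 0
```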

Practical Methodology

Couching of applied ML as a decision-making process: deciding between collecting more data, which algorithms to use, which regularizing features to use, debugging the implementation, etc. This is important because all of these actions can be necessary and are costly. The first steps need to be:

  1. Establish your goals
  2. Determine your error metric
  3. Create pipeline end-to-end
  4. Iterate over pipeline

In practice you’ll have to make decisions about which steps to take - important to establish what your bottlenecks are so you can target them (Overfitting? Underfitting? Insufficient data?)

Performance Metrics

Often situations call for performance metrics more complex than straight accuracy. For example, the precision-recall problem, where you have unbalanced classes and want to plot precision against recall in a PR curve. You need an idea of the error rate you need for your application to work, and an idea of how your error interacts with the real problem at hand. Precision and recall can be transformed into a single score, the F-score (2pr / (p + r)). You could also use the area under the curve. In many tasks it’s possible to sense that you have a low probability of getting the correct answer and ship the example out to a human. In that case you include a coverage amount in your error metric, trading a higher score against lower coverage.
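
Quick check of the F-score formula:

```python
# F-score as stated above: the harmonic-style combination 2pr / (p + r),
# which punishes the weaker of precision and recall.
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

f = f_score(0.8, 0.4)  # pulled down toward the weaker recall
```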

Default Baseline Models

A ton of practical advice here. If the problem is one that we know DL performs well on, it makes sense to start with it. Otherwise, linear classifiers are lightweight and easy. Start with Adam or SGD + momentum + learning rate decay if you’re using a feedforward net or convnet for a task. Always use early stopping. Introduce regularization (batch norm, dropout) quickly, especially if you don’t have much data. If your problem has been studied in the past, take a model from that space that performs well and copy it as your baseline. In computer vision, taking a model pretrained on ImageNet often works well. Only include unsupervised learning in your baseline if you know it’s important, like word embeddings for NLP.

Determining Whether to Gather More Data

If you’re performing poorly on the training set, it’ll be less valuable to gather more data. Instead, tune your learning algorithm or look at the quality of the training data. One useful way to decide whether more data is necessary is to train on logarithmically increasing amounts of training data and plot training set size vs. generalization error. Extrapolating this curve can tell you the impact more data will have. Other options are to reduce the size of the model or tune hyperparams.

Selecting Hyperparameters

This can be done manually or automatically.

Manual Hyperparameter Tuning

Manual tuning is a tradeoff between runtime/memory considerations and generalization error. Most hyperparameters have a U shaped relationship with the generalization error. Often this corresponds to an underfitting and overfitting region. Capacity is constrained by three factors:

  1. Representation ability of the model
  2. Ability of the algorithm to successfully minimize the cost function
  3. Degree to which the cost function / training regularizes the model.

There are many hyperparameters with different types of structure (binary, one sided, continuous, categorical options). If you can only tune one hyperparam, tune the learning rate - it has to be the correct value and has a U shaped relationship to generalization error.

If training error is higher than your target error, you have to increase model capacity. This may mean wider or deeper networks.

Networks are in a good regime when training set error is low and the problem is bridging the generalization gap to the test set. Often this can just be solved with more data, or by introducing regularization like dropout and weight decay.

Hyperparameters:

  1. Weight Decay
    1. To Reduce Capacity
  2. Number of Hidden Units
    1. To Increase capacity with more, reduce with less
  3. Depth
    1. To Increase capacity with more, reduce with less
  4. Learning Rate
    1. Worry about optimization failure
  5. Padding
  6. Convolution Width
    1. Increased kernel width increases the number of parameters in the model

Automatic Hyperparameter Optimization

Manual hyperparameter tuning can work when the user has experience in the space or starts from a good set of hyperparameters. NNs often have 40+ hyperparams to tune, which can be costly. Automatic methods include grid search and random search. Automatic methods wrap the algorithm and output a set of good hyperparams. They often have their own hyperparams, like ranges for the hyperparameters. Usually these ranges can be the same across tasks.

Grid Search

Grid search wastes computation when hyperparams are independent by trying the same value of that hyperparam repeatedly.

Grid search takes a logarithmic or uniform scale. It should be performed repeatedly until it finds the range that is best. The search should be refined as well, by zooming in the grid.

The downside is that computational cost grows exponentially with the number of hyperparams: m hyperparams with n possible values each leads to O(n^m) evaluations of the model.

An upside is that Grid Search is easy to exploit in parallel.
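
A sketch of grid search over the cross-product of value lists; the objective here is a stand-in for validation error, and the hyperparameter names and ranges are made up.

```python
from itertools import product

# Evaluate every combination of hyperparameter values; cost is n**m for
# m hyperparameters with n values each.
def grid_search(grids, objective):
    names = list(grids)
    best = None
    for values in product(*(grids[n] for n in names)):
        config = dict(zip(names, values))
        score = objective(config)  # stand-in for validation error
        if best is None or score < best[0]:
            best = (score, config)
    return best

grids = {"lr": [0.001, 0.01, 0.1], "decay": [0.0, 1e-4, 1e-2]}
best = grid_search(grids, lambda c: abs(c["lr"] - 0.01) + c["decay"])
```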

Random Search

Random search converges much faster to good values of the hyperparameters. This is because it doesn’t do redundant computation over independent parameters, and so is much less wasteful than grid search. It can also be refined based off of previous results.
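
Sketch of how the sampling usually works, with a made-up learning rate range: sample the exponent uniformly (log-uniform), so no value is repeated the way a fixed grid repeats values across independent hyperparameters.

```python
import math
import random

# Log-uniform sampling for random search: draw the exponent uniformly,
# then exponentiate, so small and large scales are covered evenly.
def sample_log_uniform(rng, low, high):
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
trials = [sample_log_uniform(rng, 1e-5, 1e-1) for _ in range(20)]
```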

For me, the obvious thing to do is discover which variables are independent and don’t waste time grid searching over them. Do them in isolation, or with the hyperparams that they do interact with.

Model based Hyperparameter Optimization

Typically we don’t actually have a loss function between the hyperparams and validation error. But we can approximate one, including both our expectation and uncertainty over values of hyperparams, and do an exploration/exploitation style search - Bayesian optimization. Software for this is available (Spearmint, TPE, SMAC) and research in the space could be great for all of engineering - there are many systems with hyperparameters to tune and a general method for solving the problem would be powerful. Where’s the metalearning research?!? Bayesian optimization fails catastrophically on some problems. There’s also a problem with this automation where you can’t get results of an experiment until the process is over, which is slow. Software that can quit bad experiments and immediately spin up new ones that benefit from the information gained will be valuable.

Debugging Strategies

It can be difficult to tell if poor performance in an ML system comes from the algorithm itself. We often don’t know, a priori, the intended behavior of the algorithm, and the system can compensate for many bugs. For instance, an error in the bias gradient can be compensated for by the weights. It’s important to visualize the system in action - not just the error rates, but the actual results. Listen to the sound output, visualize the bounding box. Looking at your worst errors will also help the debugging process. This can reveal errors in the data and preprocessing as well. Evaluation errors are dangerous - they can lead to thinking you’re performing well when you’re not. Train on a tiny dataset and overfit it; if this fails you know your learning algorithm has some issue in training. Check your gradients against finite-difference gradients, especially if you’re implementing your own gradient computations. This can be slow if you want an entire Jacobian or gradient vector, since finite differences give only a single derivative at a time. I should really just closely read gradient checking code.
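
A minimal central-difference gradient check, one coordinate at a time:

```python
# Compare an analytic gradient against (f(x+eps) - f(x-eps)) / (2*eps),
# perturbing one coordinate at a time.
def numeric_grad(f, x, eps=1e-6):
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

f = lambda v: v[0] ** 2 + 3 * v[1]  # analytic gradient: [2*x0, 3]
g = numeric_grad(f, [2.0, 5.0])     # should be close to [4.0, 3.0]
```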

Use visualizations of the activations and gradients - look at histograms of their values, and how they change over time. The ratio between gradients and parameter values can be very helpful, to see which parts of the algorithm are learning well over time. The range should be 1% of parameter weights, not 5000% or .00001%.

Example: Multi-Digit Number Recognition

They used a convnet with multiple softmaxes, one for each digit. They also ran an ML system over the data to detect the house numbers before trying to transcribe them. It seems like all real ML problems have this custom flavor where you need a sequence output, or to layer systems on top of one another.

Applications

The goal is to describe how deep learning has been used to solve important problems. Often these domains have required some level of specialization.

Large Scale Deep Learning

Networks need to be large to exhibit quality behavior. The exponential growth in network size is largely responsible for their success. Training on a single CPU is now generally considered insufficient for networks of appropriate size.

Fast CPU Implementations

There can be dramatic differences that depend on the type of hardware being run on (for example, in fixed vs. floating point operation runtimes). Also important are vector instructions (vectorization, I assume) and data structures that avoid cache misses - when ML researchers neglect these, the result can be smaller networks and worse results.

GPU Implementations

GPUs were designed for parallel matrix operations so that video games could efficiently transform a 3D scene into an on-screen 2D rendering. The result is a machine with a high degree of parallelism and high memory bandwidth, traded off against a lower clock speed (my mac has 2.5 GHz) and weaker branching. Neural networks have similar requirements - typically important parameters fall out of cache on a CPU and force the computer to retrieve them slowly from main memory. The GPU allows for fast parallel matrix multiplies across all the neurons in a large network. GPUs used to be so specialized for rendering pixels that they couldn’t be used for other tasks. General-purpose GPUs allowed the research community to start using them for networks. The CUDA programming language, which made GPU programming more accessible, was also a huge development. GPU computing often requires specialized knowledge and can be quite different from writing CPU code - on the CPU, reading from the cache as often as possible is optimal, while on the GPU it’s often faster to recompute a value than to read it from memory. The GPU has many threads that are fastest when running the same instructions at the same time, but there are bundles of transformations that have to happen at different times as well. This means that it is best if GPU code writing can be specialized. A library like Pylearn2 can make calls to Theano and cuda-convnet and so run efficiently on a GPU. It can also run on a CPU, creating flexibility among hardware types. Other libraries with these features include Torch and TensorFlow.

Large Scale Distributed Implementations

Scaling up the computation to multiple machines is often necessary for training large networks and reducing computation time. The two major types of parallelism are data parallelism and model parallelism. Data parallelism involves splitting the data over several nodes. Model parallelism involves splitting the network across multiple nodes. Data parallelism speedup is limited by the synchronous nature of gradient descent - each update step depends on the previous gradient, so waiting for that step to complete and communicating it limits the benefits from parallelism, and additional nodes generally give sublinear benefits. A major innovation and current state of the art is asynchronous stochastic gradient descent. The parameters are read from a parameter server (shared memory across processor cores representing the parameters) by workers, and updates are pushed without a lock on the read/write of parameters. Sometimes the workers overwrite one another, but the general result is much better than sequential updates. Usually one update step won’t undo another’s progress (the process is stochastic), and so the incremental update is still valuable. Each core reads parameters without a lock (that is, parameters can be read by another machine before or after the first core is done making an update), computes the gradient, and increments the parameters without a lock.

Model Compression

Question - the claim is that “Model compression is applicable when the size of the original model is driven primarily by a need to prevent overfitting” - This is weird because more size is generally for preventing underfitting, and the larger the model, the more overfitting is a problem. The value in model compression is that at test time, the user of the model likely has much less compute. It’s important to optimize for test time runs. One approach is to train a large ensemble of deep networks, and use their optimal result to generate infinite amounts of data for a smaller network. That network will learn to replicate the other network’s properties.

Dynamic Structure

Dynamic structure means dynamic to the input - given the type of input, a different line of computations are done. This can come out of the training process where a network learns to process a particular example through a subset of the neurons, or be explicitly included by having an ensemble where particular parts are responsible for particular types of input (you can think of the modeling strategy where you cluster and build a model for each part of the cluster).

One great example of dynamic structure is for problems with a rare event / class. Using a cascade of classifiers, where the first few classifiers ensure that the rare class is not filtered out (and the data that definitely does not contain the rare class is ejected in each classifier). The later classifiers make sure that a high proportion of the data given the rare label actually have the rare label.

Another example of dynamic structure is when you use one ML algorithm to isolate the relevant part of the image and use a second to do close analysis on that image.

Decision Trees can be thought of as having dynamic structure, where each branch of the tree determines which data will see the next branch. Data that can already easily be classified will end up in a shallow pure class quickly.

A neural decision tree can have a neural network at each node determining the optimal split points.

One strategy is to have a neural network decide which of several neural networks a datapoint should be classified by. This decision can be hard or soft, with multiple nets computing a result and having that result be merged.

A major obstacle to dynamic structure in neural nets is decreased parallelism and decreased ability to represent transformation with straightforward matrix multiplication.

Specialized Hardware Implementations of Deep Networks

Level of precision can be reduced at test time. This can be valuable for lower power devices. Fixed point rather than floating point precision can help as well. Over time, improvement in hardware speed has transitioned to being more about the number of cores available (taking advantage of parallelism).

Computer Vision

  1. Object Recognition (Classification)
  2. Object Localization (Bounding Box around object)
  3. Image Segmentation
  4. Transcribing a sequence of symbols (Digits Recognition)

Preprocessing

The only strictly necessary preprocessing is to make sure your pixels are on the same range. Mixing ranges is a no-no. It’s often best to scale to [0, 1]. It can often be important to make sure your images have the same size. Some networks have pooling regions that can deal with size dynamically. Dataset augmentation, where you create new data that are different in small ways from the original data or reduce the data somehow can be valuable for regularizing your model and improving its flexibility. You can also reduce the amount of variation in the training data if you know that variation is unimportant to the actual task and is doing damage to the algorithm’s accuracy.

Contrast Normalization

Contrast normalization involves reducing the variance between pixel values in a region of an image. The contrast is defined as the standard deviation of the pixel values across the channels (RGB) of the image - the square root of the mean of each pixel value minus the image mean, squared. By dividing by the standard deviation (the max of that and epsilon, to avoid division by 0) we can do contrast normalization over the image. Global contrast normalization can be seen as mapping the raw input to a spherical shell. Local contrast normalization normalizes with respect to a sliding window around the target pixels.
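
Sketch of global contrast normalization on a flat list of pixel values: subtract the mean, then divide by max(std, eps) to avoid division by zero.

```python
import math

# Global contrast normalization sketch on a flattened image: center the
# pixels and rescale them to unit standard deviation, guarding against
# a zero-contrast image with eps.
def gcn(pixels, eps=1e-8):
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    return [(p - mean) / max(std, eps) for p in pixels]

normed = gcn([10.0, 20.0, 30.0, 40.0])  # zero mean, unit std
```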

Dataset Augmentation

Changes to the input data that do not change the class of the datapoint can create larger datasets that regularize the classifier, improving generalization ability. Translations, rotations, and rescaling can add a lot of valuable data.

Speech Recognition

The speech recognition problem is to take a sequence of acoustic sounds and generate a sequence of words that best represent the input words. The state of the art up to 2009-2012 were Gaussian Mixture Model - Hidden Markov Model systems, where the GMM would model the relationship between phonemes and sounds/acoustic features and the Hidden Markov Model would deal with the sequence of phonemes. The transition occurred with much larger datasets and much larger network models, where neural networks replaced GMMs for associating audio features with phonemes. Recent state of the art speech recognition systems have been trained as end-to-end deep learning systems with deep LSTM networks.

Natural Language Processing

This domain generally requires that we use word-based models that operate on sparse data. Generally our models generate a probability distribution over words.

N-grams

N-gram models fit a probability distribution over words occurring in sequence. N refers to the number of words we consider in the n-gram. It’s common to train an n-gram and an (n-1)-gram model simultaneously - in this case the (n-1)-gram counts are a subset. Often in n-grams the sparse data leads to there being zero probability for a particular sequence. This can be ameliorated by having a uniform prior over all next values in a sequence, and by using n-grams at different levels - lower n will lead to less sparse estimates. Because the space is so sparse, language models need to be able to share knowledge between similar words. One way to share knowledge between words is to cluster words based on co-occurrence and use the cluster value to update n-grams.
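
A bigram sketch with add-one smoothing standing in for the uniform prior, on a toy corpus, so unseen word pairs still get nonzero probability:

```python
from collections import Counter

# Bigram model with add-one smoothing: unseen pairs get probability
# 1 / (count(w1) + V) instead of zero.
def bigram_prob(corpus, vocab_size):
    pairs = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    def prob(w1, w2):
        return (pairs[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
    return prob

corpus = "the cat sat on the mat".split()
prob = bigram_prob(corpus, vocab_size=6)
p_seen = prob("the", "cat")    # observed pair
p_unseen = prob("cat", "mat")  # never observed, but still > 0
```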

Neural Language Models

Change - this interpretation, where words with similar context impact one another, seems wrong based on my current model of how word vectors work. Neural language models are distributed representations of words that overcome the curse of dimensionality by relating sentences with words that have similar representations to one another. This ‘word embedding’ takes a high-dimensional co-occurrence space and maps it down to a lower-dimensional space. The concept of an embedding is general across neural networks. The hidden layers of a convolutional neural network define an image embedding, for example. It’s a distributed, often compressed representation of the data.

High-Dimensional Outputs

If you want to output words, your output space will be in the hundreds of thousands. This is extremely difficult computationally and difficult for the softmax to output.

Use of a Short List

One way to deal is by limiting the vocabulary to the 10,000-20,000 most frequent words. This restricts the model to those frequent words and damages generalization on the rest.

Hierarchical Softmax

This output structure creates broad categories of words, followed by less broad categories, and so on, getting strong computational savings. This is the classic response to the high-dimensional outputs problem. Multiple paths can also be used to output the same word to introduce flexibility. This is typically done with logistic regression at each node of the tree. The logistic regression models can be trained in supervised fashion with the words in the training set. Learning the best hierarchy can be done but is difficult. The sad reality is that hierarchical softmax is outperformed by importance sampling - maybe due to a bad hierarchy of concepts and words.

Importance Sampling

We can speed up training (since almost all words will have small probability) by sampling a subset of the words. The sampling is from a proposal distribution. It uses the appropriate weights to correct for the bias that comes out of sampling from the wrong distribution. The gradient size is equal to the number of negative samples. Negative words are chosen randomly.

Noise-contrastive Estimation and Ranking Loss

This looks like a hinge loss for the output - it’s a score for each word with a very sparse gradient which only has value if the score for the word is within a margin of 1.

Combining Neural Language Models with n-grams

There are ensemble methods that use both at different levels. Easier is comparing outputs. But a more intimate model involves adding the n-grams features as inputs in the neural network model that connect directly to the output layer, expanding the model’s capacity without a big increase in computational load.

Neural Machine Translation

Often these systems have many components. For example, they would have a component suggest many grammatical forms of the same concept, and have a language model choose among the suggestions. Recently people have been using networks for the language model. Previously they would use n-grams, or a maximum entropy model where a softmax was used to predict the next word based on frequent n-grams. One approach would be to use an MLP to map to output sentences. A disadvantage of this approach is that it forces the input sequence to be of fixed size. A solution is to use an encoder to go from a flexible sized input sequence to a context C, which is a representation of the input that has fixed size. This can be done with an RNN or CNN. Then we go from that fixed size representation to a variable sized output with an RNN. This is the decoder, and so we have an encoder-decoder framework. The same setup for sequence outputs can be used for image captioning and other tasks.

Using an Attention Mechanism and Aligning Pieces of Data

For longer sequences it can be useful to pay attention to a portion of the input at a time. An attention mechanism involves a reader that generates a distributed representation of each input, a memory that stores those representations, and an exploitative output part that reads from memory and converts to the appropriate distributed representation in the other language, and goes to a likely output.
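
A sketch of the attention read itself: score each stored input representation against the current state via dot products, softmax the scores into weights, and read a weighted average. All vectors here are tiny made-up examples.

```python
import math

# Soft attention sketch: dot-product scores, softmax weights, weighted
# average over the stored representations.
def attend(query, memory):
    scores = [sum(q * m for q, m in zip(query, cell)) for cell in memory]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    return [sum(w * cell[j] for w, cell in zip(weights, memory))
            for j in range(width)]

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = attend([4.0, 0.0], memory)  # mostly attends to the first cell
```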

Historical Perspective

Embeddings used to be learned with SVD. The original backprop paper learned an embedding of symbols and relationships. Neural word embeddings were taken up in the early 00s, followed by more focus on word embeddings. Question - this section really confused me, since I feel some conflation between word vectors and embeddings that come out of RNNs. And I feel like word vectors weren’t really explicitly treated as such, but conflated with a bunch of other things.

Other Applications

Recommender Systems

Discusses that networks can be used in recommender systems, but goes on to describe item/product matrix based collaborative filtering… also points to being able to use regression to generate scores or binary classification to generate conditional probabilities. Singular value decomposition can also be used to reduce user/item embeddings to a low dimension. Neural networks can also be used for collaborative filtering - a restricted Boltzmann machine, an undirected probabilistic model, for example. The cold start problem (where a user hasn’t bought anything yet) can be ameliorated with other data about the user (their profile, their browsing, etc.). This leads to content-based recommender systems, mapping from user features to outcomes, which can be learned through deep learning. Convnets can also be used for understanding songs, so that better music predictions can be made.

Exploration Versus Exploitation

Most recommendation settings are actually a contextual bandit problem, where there’s a single option or set of options recommended and we only get data about the classes that we recommend. We get no data about the classes we don’t recommend, and so to understand what would have happened had they been recommended we need to explore the space of recommendations for a given type of user. Exploration - collecting more training data - and exploitation - choosing the current best policy - can be implemented in a number of ways. One is by randomly exploring, hopefully to fill a diverse training set. Another is to model both the expected value of each choice and the uncertainty of each choice, and explore on that basis. There are environment variables that will influence how much you want to explore or exploit in a given problem. For example, if the time scale is short, a preference for more exploitation is necessary to get utility out of that setting. If the time scale is long, a richer exploration phase makes sense so that you can capitalize in future. Supervised learning doesn’t encounter the exploration-exploitation tradeoff because the labels it gets are known to be correct over the entire space of possible labels. Training data contains all of the information of the label space, and so exploration wouldn’t be valuable. Reinforcement learning has a problem where it can be very difficult to evaluate policies against one another, because a policy’s output at time t-1 determines the input it gets at time t.

Knowledge Representation, Reasoning and Question Answering

Deep learning has been successful in natural language processing in large part due to embeddings for words and for symbols. Research on creating embeddings for phrases and for facts is important for the future of knowledge representation.

It would be great to be able to represent relationships between objects. Say, represent relationships between numbers or between mammals. The relationship of less than or greater than or belonging to the same set.

We can represent this information in a structured language format that has a verb (that defines the relation) as well as a subject (the first entity) and an object (the related entity). There is a second kind of relationship, an attribute, where a subject has an attribute.

The ‘relational database’ is generally used to store these types of relationship.

Change - Dude. They use “Neural Language Models” again to represent word vectors, and much more explicitly here - “learn a vector that provides a distributed representation of each word”. This so needlessly breaks with the way that every other resource talks about this concept.

Knowledge bases are large sets of these types of relations.

It would be great if we could use relations in a knowledge base as training labels in tandem with a large corpus of text (say, wikipedia) and learn new relations automatically with high probability to fill out the knowledge base.

Knowledge bases have also been used for word-sense disambiguation.

Deep Learning Research

It’s worth generating samples, or learning when there are no labels. Unsupervised learning is hard for several reasons (inference is intractable, in that learning algorithms don’t scale with the number of dimensions of the input, or computing the normalization term in probabilistic ML is intractable), but progress would be meaningful.

Computing the normalization term (partition function) usually requires taking the gradient of the log of the partition function. This requires sampling to compute, which works poorly when the surface has many modes that are separated from one another.

Linear Factor Models

Linear factor models are one of the simplest examples of learning a probability distribution for the input - it’s actually similar to an idea I’ve always liked, which is a model that can predict any aspect of its environment / features from any other aspect of its features. I don’t see how this interpretation holds for something like PCA yet, but I’ve been re-affirmed in liking that frame. Latent variable models that learn some parameters h and then predict p(x|h) are also framable as representation learning - this is done in autoencoders, including VAEs. But we all know how I feel about going this far with ‘unsupervised learning’. I think this takes it too literally and goes way too far.

For generation in a linear factor model, we sample the factors from a factorial distribution (distribution where the features are independent of one another - can be gaussian) and use that latent variable representation times a weight matrix plus a bias plus noise to generate a sample. Super simple. But not how I thought Probabilistic PCA worked. It does make sense that if you learn the weight matrix and bias by optimizing for reconstruction error (where you have an encoder as well as a decoder) that that decoder matrix would be usable for generation.
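A minimal sketch of that generation process. The dimensions and noise scale are made-up placeholders; in a real model W and b would be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

n_latent, n_obs = 3, 5                       # hypothetical sizes
W = rng.standard_normal((n_obs, n_latent))   # weight (loading) matrix
b = rng.standard_normal(n_obs)               # bias
sigma = 0.1                                  # noise scale

def sample(n_samples):
    # Factorial prior: independent standard-normal factors.
    h = rng.standard_normal((n_samples, n_latent))
    noise = sigma * rng.standard_normal((n_samples, n_obs))
    return h @ W.T + b + noise               # x = W h + b + noise

X = sample(1000)
```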

Types of linear factor models differ in the distribution that they sample the factors of variation from. This can be thought of as a prior over the latent variables.

Probabilistic PCA and Factor Analysis

The goal is for the components of x to be conditionally independent given the sampled factors, with the weight matrix capturing the interaction between the factors and x. This means we can frame a linear factor model as a normal distribution whose mean is the bias and whose covariance is the weight matrix times its transpose plus the noise covariance. The factor prior is a unit-variance gaussian; the models differ in the noise: factor analysis uses a diagonal noise covariance, and for probabilistic PCA we make that diagonal have values equal to one another (isotropic).
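A quick numpy check of that marginal: with h and the noise both gaussian, x is distributed N(b, W W^T + sigma^2 I). Sizes here are invented; the empirical covariance of samples should match the analytic one:

```python
import numpy as np

rng = np.random.default_rng(1)
n_latent, n_obs = 2, 4
W = rng.standard_normal((n_obs, n_latent))
sigma2 = 0.5

# Analytic marginal covariance of x = W h + eps (zero bias for simplicity).
analytic_cov = W @ W.T + sigma2 * np.eye(n_obs)

# Empirical check by sampling: h ~ N(0, I), eps ~ N(0, sigma^2 I).
n = 200_000
h = rng.standard_normal((n, n_latent))
eps = np.sqrt(sigma2) * rng.standard_normal((n, n_obs))
X = h @ W.T + eps
empirical_cov = np.cov(X, rowvar=False)
```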

You can use EM (expectation maximization) to learn the parameters (the weights and the conditional variance? Why is it not the weights and the bias? I’m missing something here. I guess the bias can be folded into the weights). And so you optimize the score for the weights/bias keeping the conditional variance fixed, and then you optimize the conditional variance keeping the weights fixed. (I know that EM is similar to Lloyd’s algorithm.)

You can see probabilistic PCA approaching PCA as the conditional variance term goes to 0.

One frame for this is as projection onto the subspace defined by the weight matrix. The conditional value of the latent variables given the input is the projection of the input onto the learned weight space. Which does make sense.

Independent Component Analysis

  1. Herault and Ans, 1984
  2. Jutten and Herault, 1991
  3. Comon, 1994
  4. Hyvarinen, 1999
  5. Hyvarinen et al, 2001a
  6. Hinton et al., 2001
  7. Teh et al., 2003

Ah, ICA looks to not merely be linearly independent (decorrelated) but non-linearly independent. Mutual information style.

The claim is that ICA is used to disentangle low-level signals (for example, multiple speakers in a room, whose voices are conflated in a single audio channel) rather than learning abstract, higher-level structure in the data.

ICA disentangles neural signals in an EEG machine, for example.

ICA often requires that the distribution that the latent factors are sampled from (p(h)) is not gaussian - otherwise the weight matrix won’t be identifiable (there will be multiple matrices that fit the same data).
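A small numpy illustration of that identifiability failure: if h is gaussian, W and W R (for any rotation R) induce exactly the same distribution over x = W h, since the covariance W W^T is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))

theta = 0.7  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# With h ~ N(0, I), x = W h has covariance W W^T. Rotating W changes nothing,
# because R R^T = I, so gaussian data cannot pin down W.
cov_W = W @ W.T
cov_WR = (W @ R) @ (W @ R).T
```

With a non-gaussian prior, rotated factors generally do change the distribution of x, which is what restores identifiability.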

A version of ICA learns groups of features that are encouraged to interact with one another and not with other groups. It’s called topographic ICA and learns Gabor filters when trained over image data.

How does ICA learn Gabor filters? Are the feature maps set up like convolution?

Slow Feature Analysis

An attempt to use temporal structure to learn abstraction, where the features that are maintained across time (say, whether there’s a zebra in a frame of a video) are more general than features that change quickly (say, the pixel values as the zebra moves over them).

You can make slow feature analysis nonlinear by applying a basis transformation. And it’s actually an extremely understandable learning process, where the learned representation can be theoretically predicted and then is recovered in practice.
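A linear-SFA sketch on toy data (the slow/fast sine mixture is invented for illustration): whiten the signals, then take the whitened direction whose discrete temporal derivative has the least power.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 2000)

slow = np.sin(t)        # slowly varying source
fast = np.sin(20 * t)   # quickly varying source
X = np.c_[slow + 0.5 * fast, 0.5 * slow - fast]  # observed mixtures

# Whiten: zero mean, identity covariance.
X = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
white = X @ eigvec / np.sqrt(eigval)

# Slowest feature: the whitened direction minimizing temporal-difference power.
dcov = np.cov(np.diff(white, axis=0), rowvar=False)
d_eigval, d_eigvec = np.linalg.eigh(dcov)
slow_feature = white @ d_eigvec[:, 0]   # eigh sorts ascending: slowest first
```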

SFA actually learns features that look like V1??

Sparse Coding

Sparse coding models choose a distribution with a peak at 0 (like the Cauchy, Laplace, or factorized Student t distributions) as the prior over the latent features.

Generally the assumption is that linear factor models have gaussian noise with a precision matrix that has the same values along the diagonal (isotropic).

The learning in sparse coding alternates between optimizing the latent factors and the weights, as maximizing the likelihood directly is intractable.

Sparse coding features get (relatively) strong generalization when the amount of training data per class is very small.

The problem with generative linear models is that random sets of latent features are sampled and used for generation, meaning that the generated sample will be a combination of all of the classes in the model.

Autoencoders

Can see autoencoder as having two parts (lol, weird) - an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h). This frame doesn’t make sense because every layer of the autoencoder is both part of the encoding and part of the decoding, depending on how you decide to think about it. How do we determine where the encoding ends and the decoding begins?

Autoencoders generally are forced to learn a representation of the input that requires compression, so that interesting features of the input are learned. They can be practically used to do dimensionality reduction / create a feature representation.
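The encoder/decoder split as a minimal numpy sketch: an 8-dimensional input forced through a 3-unit bottleneck. Sizes and the tanh nonlinearity are arbitrary choices, and there’s no training loop here, just the forward pass and the loss L(x, g(f(x))):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3  # undercomplete: hidden layer smaller than the input

W_enc = 0.1 * rng.standard_normal((n_in, n_hidden))
W_dec = 0.1 * rng.standard_normal((n_hidden, n_in))

def encode(x):
    return np.tanh(x @ W_enc)      # h = f(x)

def decode(h):
    return h @ W_dec               # r = g(h)

def reconstruction_loss(x):
    return np.mean((x - decode(encode(x))) ** 2)

batch = rng.standard_normal((4, n_in))
loss = reconstruction_loss(batch)
```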

There’s a technique for training autoencoders called ‘recirculation’ that compares activations on the input to activations on the decoded output. People are always rooting for some marginal amount of biological plausibility…. But it’s not clear that the system is close enough for that transfer to make sense.

Undercomplete Autoencoders

Undercompleteness just points to the hidden layer being of lower dimensionality than the input.

Convolutional autoencoders may make more sense - do they exist? You can map anything to anything, right? They must exist…

I think undercompleteness as a concept makes sense but may not be true in practice. In practice learning to copy is too hard, and so the network ends up doing something else anyway. But actually, with skip connections it’s totally possible. Autoencoding resnets may make it too easy.

The idea that it learns the same subspace as PCA… I guess it’s similar, in that you have a weight matrix to a lower dimension and some reconstruction error. If you have no non-linear activations, that may just work.

Regularized Autoencoders

I hadn’t thought about this, but regularization in an autoencoder to avoid copying is extremely reasonable. The truth is that you actually want an overcomplete representation of the particulars of your data, because the abstract knowledge that touches your data is much much greater than the datapoint in particular that you’re looking at. And if you want to represent that abstract knowledge, you need dimensionality far greater than the pixel space of your image or whatever.

Interesting! Diff between VAE / generative stochastic networks and autoencoders is looking to maximize the probability of the training data rather than recreate the input.

I didn’t know that VAEs were descendents of helmholtz machines - maybe I should figure out what a helmholtz machine is.

Sparse Autoencoders

For a reason I don’t quite see, sparse autoencoders (through regularization) can’t be interpreted as having priors (in the technical sense) because priors are independent of the data, and these penalty terms are not data independent…

The interesting thing is the frame of autoencoders as generative models, once they’ve learned a good latent representation for the input.

They go on to describe sparse autoencoders as resulting from priors with a heavy weight on zero, like the Laplace distribution and the student t distribution.

Also, I don’t really like our abstractions over the math. There are terms in these models, and what they do doesn’t correspond directly to how we talk about them. There will be a term in the loss that includes the log probability of the hidden representation (maybe this is what they mean by a data-dependent prior - others are over the parameters instead). But a term like the log probability of a hidden state with a Laplace prior, say, likely admits many interpretations beyond that one distributional assumption.

Sparsity is in the structure of information. Should I add ‘slowness’ from the last section? Perhaps as a type of temporal structure?

Denoising Autoencoders

Ah, denoising autoencoders! I like the idea a lot. It pushes the regularization into the data. It forces generalization directly in the loss. Love it when loss functions are cast as being for transfer / generalization and prevent overfitting in some fundamental way. I guess in a way this is what regularization is, but if you can get a gradient that is about how you generalize rather than about how you perform on the training data, that’s just great.

Actually wait, that’s an obvious training paradigm - get a gradient from generalization data… this may not make sense, I have to think about it. A question for a given gradient is how it affects the validation loss. Maybe only apply gradients that generalize? And so in your batch, you threshold an update to pass a generalization bar? You should certainly only keep improvements to the representation that generalize. Maybe you should only let yourself learn things that generalize…

This feels like automated dataset augmentation.

Regularizing by Penalizing Derivatives

The penalty term here says that if there’s a slight modification in x, penalize derivatives in the amount that they change. So you only want those gradients that would be the same under different variants of x. This feels like automating dataset augmentation, in many ways. But this idea, of penalizing your data gradient norm in your loss function, is quite interesting.

Representational Power, Layer Size and Depth

Deep feedforward networks have properties that make it worth representing our autoencoder with a deep net (they have a sentence implying that the decoder is the final weight matrix). Training these can be hard, so sometimes pre-training is done by training shallow autoencoders first.

Denoising Autoencoder

It’s not clear that this should be renamed - it’s really just noise injection at the input layer. This could be thought of as a subcase, as merely a regularized autoencoder.

The ‘score’ is a gradient field, the gradient over the log probability of the data. Score matching lets you train layer wise - somewhat similar to a boltzman machine.

I see - the idea is that the difference between the reconstruction and the original gives you a gradient, along which to move in order to get closer to the original from the reconstruction. And this direction should be the gradient of your autoencoder.

Learning Manifolds with Autoencoders

A manifold can be thought of as typified by a set of tangent planes - these planes indicate what small changes in a datapoint would keep it on the manifold. Your autoencoder is trained to learn a representation of the data generating distribution through regularization + datapoints sampled from that distribution. It learns (especially in the denoising context) to map datapoints that are off the manifold onto the manifold. It learns to ignore factors of variation that are orthogonal to the manifold, and embrace factors of variation that move along the manifold.

There are older methods for manifold learning that involve using the nearest neighbors of a point and modeling the manifold in a locally linear neighborhood. Tons of references for this classical manifold learning space.

There’s an amazing argument against classic manifold learning methods with a specific example showing where the local structure assumption fails that seems strong but that I’m too sleepy to appreciate.

Contractive Autoencoders

Since contractive autoencoders make the reconstruction (decoder) invariant to infinitesimal perturbations of the input, its loss may be a candidate for avoiding adversarial examples.

It’s not clear from reading this exactly why this derivative penalty creates resistance to input perturbation. I feel like no one proofread this.

I see now, in penalizing the derivative it learns a representation that will generate a small gradient.

Contractive autoencoders work by penalizing the Frobenius norm of the Jacobian directly. This leads to updates that incentivize the gradient to be small, in addition to the reconstruction error.
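For a one-layer tanh encoder (sizes made up), h = tanh(xW), the Jacobian dh/dx has the closed form diag(1 - h^2) W^T, so the penalty is cheap to compute:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W = 0.5 * rng.standard_normal((n_in, n_hidden))

def contractive_penalty(x):
    """Squared Frobenius norm of the encoder Jacobian at x, for h = tanh(x W)."""
    h = np.tanh(x @ W)
    J = (1 - h ** 2)[:, None] * W.T   # J[j, i] = dh_j / dx_i
    return np.sum(J ** 2)

x = rng.standard_normal(n_in)
penalty = contractive_penalty(x)
```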

This leads to a model that can be hard to train end to end - it becomes intractable for deep autoencoders. Interestingly, only applying this penalty to the decoder (the back weight matrix in a shallow autoencoder) leads to better classification accuracy (which seems sort of crazy to me, that you would only penalize half of the network - I guess that’s because I don’t buy the encoder-decoder abstraction that they’re putting over an autoencoder).

Predictive Sparse Decomposition

The part of the loss function that subtracts h from f(x) confuses me because I thought that h was f(x). And so there’s a true h, that we know somehow?

Apparently this is a combination of sparse factor models and parametric autoencoders. The term parametric feels like overkill here, but is what it is.

Applications of Autoencoders

The obvious application is dimensionality reduction. This can improve generalization accuracy, map similar datapoints to representations that are close to one another, and tends to be faster to do inference with. The reduced dimensional representation can also be much easier to interpret.

Information retrieval has an application where you train a network whose sigmoids saturate, and you use the binary code representing the input as a hash function to look up similar inputs (lol). Unfortunately this is discrete, and throws away the continuity of distance… but still, hilarious. And a general technique for going from continuous to discrete representations.
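A sketch of that hashing trick with an untrained, made-up code layer (in a trained network the sigmoid outputs would saturate near 0 or 1, so thresholding loses little):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # hypothetical code-layer weights

def hash_code(x):
    code = 1.0 / (1.0 + np.exp(-(x @ W)))   # sigmoid code layer
    return ''.join('1' if c > 0.5 else '0' for c in code)

# Inputs that land in the same bucket are treated as near neighbors.
buckets = {}
for x in rng.standard_normal((100, 16)):
    buckets.setdefault(hash_code(x), []).append(x)
```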

Representation Learning

The way that a problem is represented can make a dramatic difference to the speed and possibility of its solution.

Representations that can be shared between tasks are one essential approach to transferring information between domains. Unsupervised learning can be done by having unlabeled data contribute to the representation. Sharing representations between tasks can lead to faster learning and regularization on both tasks. Distributed representations and deep representations both have properties worth arguing for. That’ll happen in this chapter.

Learned representations in feedforward nets (and models with a linear final layer) are pushed to make the classes linearly separable from one another.

There are properties of the representation that we’d like to have (say, independence of different parts of the representation) which trade off against purely fitting the representation to the data.

Any learned representation can be used downstream for another task (think word vectors) or be trained simultaneously over multiple tasks or modes.

This may be a bridge to the few-, single- and zero-shot learning observed in humans and animals.

Greedy Layer-Wise Unsupervised Pretraining

This greedy representation learning through an autoencoder or an RBM trains a layer at a time of a deep representation, to find an initialization that will allow a deep (feedforward!) net to be trained. This was the first technique to work (pre batch norm, residual connections, Adam, ReLU, etc.) and so is responsible in part for the DL resurgence.

Thought - layerwise pretraining keeps the previous layers fixed, and so avoids the covariate shift overcome by batch normalization. Both of these solutions are rooted in the covariate shift problem - evidence for that explanation for batch normalization.

The theory used to be that networks would get stuck in local minima!! How wrong it was… our current theories (to the degree that they’re empirically unverified) are also likely wrong…

Loll, page 521 is like “it’s important to understand when unsupervised pretraining will and won’t be helpful”, and I’m super skeptical that they have an answer to that. And then on page 522 they admit “Our ability to characterize exactly what aspects of the pretrained parameters are retained during the supervised training stage is limited.” And I still haven’t been given a predictive model for when to expect greedy layer-wise pretraining to work. I remember their second sentence being way more incriminating than it actually is… oh, I got the wrong sentence. The real one says “It is not always possible to predict which tasks will benefit from unsupervised learning in this way”.

The fact that the features learned won’t be linearly separable by construction seems important, as it’s a thing that will change in the move to supervised learning.

It’s weird that they focus on a good notion of distance as what separates the quality of representations between one-hot words and images. Image data is good because every high dimensional aspect of the image (and also the ground truth data about the image) is represented in the image. It’s the data that humans see, where words… words have a mental representation more than they have a physical representation. They’re a different type of object. The one hot encoding isn’t actually the word, while the image is actually the image.

Now they give a predictive model. Unsupervised pretraining is useful when:

  1. The amount of labeled data is small
  2. The function to be learned is very complicated
  3. The network being pretrained is deep

Thought - paying attention to the conditioning of the hessian during training

I like this frame - that the reason pretraining works is that it encourages the network to learn the underlying causes that generated the data.

Transfer Learning and Domain Adaptation

The goal of transfer learning is to take learned information about one distribution to improve training speed and generalization accuracy on another distribution.

In many contexts there are patterns that are shared between classes or tasks. Images, for example, tend to share edges, curves, shapes, lighting and other properties that can improve learning for an additional task. Often this is done by keeping the early weights, which transfer information about the input to a new domain, and replacing the weights closer to the output. In audio, where the information about the output is what matters (creating valid sentences, say, in speech recognition), the transfer is with the weights closest to the output.

Why don’t they just come out and say that all ml is transfer learning? The assumption that all values of the same class are generated from the same distribution, and transfer learning is about transfer between different data generating distributions may be at the core of it. But I see it differently - each datapoint is generated from its own distribution, which just happens to be strongly related (making for easy transfer) to other datapoints with its class. It’s a mere technicality to conflate those distributions into one and call it not transfer learning.

And this notion of concept drift, where the generating distribution for a class shifts, is a good example of recognizing ‘within-class’ differences in distribution. I do want to see learning a classifier on multiple datapoints thought of as multi-x learning. Multitask, with similar distributions. The concept of transfer needs to be more deeply embedded in the minds of ML researchers and practitioners, because it’s what’s happening.

Interesting fact - as the depth of a representation increases, its ability to transfer with less and less data improves, as seen in transfer-learning competition tasks.

In few shot learning, you learn to separate the important factors of variation on some initial data, and then can do few shot learning because the factors that separate that class are already well modeled.

You can do transfer between modes or problems by learning to map representations of different data types (words in different languages, or images vs. words) to one another. Think zero-shot learning through cross modal transfer, or a shared embedding space with pivot words in language translation.

Semi-Supervised Disentangling of Causal Factors

I don’t feel quite comfortable with the word ‘causal’ here - I need some convincing.

Making the factors in the representation independent of one another, or separating the independent causes of the outcome in the feature space, have been motivations for representation learning research.

Learning the underlying causal factors of the data will help predict the outcome when the outcome is associated with one of the causal factors that predicts the data.

I think that this frame is a deep mistake. This is unsupervised representation learning, but in the bad sense that doesn’t take the feedback from the environment about the representation into account. And it’s trying to justify the hope that the data itself is all we should use for learning our representation. Using feedback from the environment / reality that is automatically labeled makes much more sense. Optimizing directly for the optimal representation for tasks makes even more sense. But having no feedback signal at all?

Disentangling all the factors that predict the data is great and useful. But not the real thing. We need to be embedded in a time series to see why that’s true. This is also the source of my opposition to the use of the word causal - we’re fooling ourselves, using that word. There’s no arrow of time here.

They even realize that humans fail to see differences in their environment that aren’t relevant to the task they’re performing!!! They’re sitting right next to the thing!

Oh shit, your autoencoder will just ignore extremely important features if they’re not relevant enough to the reconstruction error (which doesn’t know how to prioritize features).

GANs as better at learning the salient features of an image, because it’s tuned to learn what separates generated images from ungenerated images (an importantly different criterion from MSE). In this sense, VAEs and GANs are importantly different from one another.

Some content on causal direction and its implications for representation learning in Scholkopf 2012. This may be pointing at a counterargument to using labeled data, implying that that would be a broken assumption about causality (something like ‘the causal rules would change as the label changed’). But this comes out of pretending that causal rules in the universe are the same as the way causality is being used as a term here. The conflation killed them. And now their beliefs are a mess.

Distributed Representation

This chapter (15.4) provides a wonderful explanation of why compositionality is powerful. I’d love to be able to explain this both rigorously(with the mathematics contrasting distributed and non-distributed representations) and intuitively.

Abstraction. Representation. Composition. Transfer.

What are they missing?

Structured Probabilistic Models for Deep Learning

Graphs and probabilistic models aren’t merely a frame for thinking about DL anymore, but are contributing to the models that are implemented. Framing probability distributions as interactions between random variables in a computational graph creates a nice setting for representing interactions between random variables (I assume) as edges. This also connects DL to the greater graphical modeling community. Here we’ll see how to represent a probability distribution using a graph.

The Challenge of Unstructured Modeling

One of the goals of research has been to learn over very high dimensional data with rich structure (think vision, audio, text). The DL community has been using very different inference algorithms and models than the rest of the graphical modeling community. There are many applications where DL algorithms can ignore substantial parts of the input, but this will explore situations where the input needs to be modified (and so there are many outputs), and a density is being predicted or an image is being denoised. Techniques for producing these richer outputs will be described here.

Many problems require a rich output model, or a rich understanding of the input - sampling (generating a new sample drawn from the training data’s distribution), denoising (where any element of the input could have noise that needs to be removed), interpolation, and density estimation (where the input’s probability can vary dramatically based on a single feature or set of features in the input).

Probabilistic models model only the direct (and not the indirect) interactions between variables, leading to computational efficiency.

Is this about deep nets as a graph? Or about sparser graphs?

Using Graphs to Describe Model Structure

Bayesian Networks / Belief Networks, where the graph is directed and acyclic and points to situations where one variable influences another.

I guess that this is for discrete probability distributions, which makes the table approach seem like a sensible baseline. But a lot of it feels like they’re showing that they outperform a horrendous baseline.

Representing this is O(k^n) vs. O(k^m), where n is the total number of variables and m is only the largest number of variables that appear together in a single conditional table; it makes sense to me as a representation. You can get a lot of lift out of only looking at the parts of the joint distribution that actually interact with one another, is another way to think about that.
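Plugging in toy numbers (10 binary variables, at most 2 parents each, picked just for illustration) shows the gap between the full table and the factored representation:

```python
# Full joint table vs. factored Bayesian network over n variables,
# each taking one of k values.
k, n = 2, 10
full_joint = k ** n - 1   # one entry per outcome; the last is implied

# If each variable has at most m parents, its conditional table needs
# k^m rows with (k - 1) free entries each.
m = 2
factored = n * (k ** m) * (k - 1)
```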


Source: Original Google Doc
