Abstract Representation Learning
Category: Machine Intelligence
Core Curriculum
- Bengio, Courville, Vincent (2014)
- Early Visual Concept Learning with Unsupervised Deep Learning (beta-VAE)
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- Understanding and Visualizing Convolutional Neural Networks
- Deep Learning of Representations for Unsupervised and Transfer Learning
- Neural Discrete Representation Learning
- Concept Learning via Meta-Optimization with Energy Models
- An Information Theoretic Analysis of Deep Latent-Variable Models
- Deep Variational Information Bottleneck
- Fixing a Broken ELBO
- Measuring Abstract Reasoning in Neural Networks
- How Neural Nets Build up Their Understanding of Images
- Understanding Disentangling in beta-VAE
- Elements of Information Theory
- Efficient Estimation of Word Representations in Vector Space
- GloVe: Global Vectors for Word Representation
- SVCCA, Representational Similarity
- Neural Scene Representation and Rendering
- Representation Learning with Contrastive Predictive Coding
- Towards Conceptual Compression
- Deep Neural Networks Abstract Like Humans
Generate experiment ideas reading through this curriculum.
Fundamental Questions in Representation Learning
Representation Learning Comprehensive Research Ideas
Secondary Curriculum
- Deep Variational Canonical Correlation Analysis
- Generative Models of Visually Grounded Imagination
- Basic Objects in Natural Categories
- The Origin of Concepts
- Concepts
- Knowledge, Concepts and Categories
- The Big Book of Concepts
- Knowledge Representation
- Fluid Concepts and Creative Analogies
- Variational Inference for Monte Carlo Objectives (NVIL, VIMCO)
- Hierarchical Deep RL: Integrating Temporal Abstraction and Intrinsic Motivation
- A Proposal for Learning Compositional Problem Solvers
Latent Variable VAEs
- VQ-VAE
- Beta-VAE
- NVIL
- VIMCO
Brain Specific:
- Challenges and Opportunities in Unsupervised Deep Learning
I want to create a curriculum that’s OLD. Exclusively papers from 2000 or earlier. (Hint: Old papers only cite other old papers. And check ICA.)
To Create: Fundamental Questions in Representation Learning
To Create: Fundamental Questions in X, for each axis of the research frontier.
Contrarian Truths
- The supervised-unsupervised ontology is broken in a way that damages researchers’ decision making
- VAEs (and reconstruction generally) implement a single heuristic for representation learning which is weak in comparison to future prediction or optimizing for future-task-relevant representations.
- The more abstract structure is in the lower layers of the network, leading to conflation / misuses of ‘high’ and ‘low’ everywhere.
- What if the way we thought about Abstraction was broken?
- What if Abstraction is 5 concepts being lumped into one?
- What if shared structure isn’t the same as unification of purpose (which is often what a class is) in functional abstraction?
- All of the forms of abstraction need a distinct name to avoid conflation.
- Naming (the creation of a new abstraction) as essential
- Great idea! Create a list of accepted truths in DL. Especially those believed by the high status (Hinton, Bengio, Lecun, etc.) Ask which accepted truths researchers believe are wrong and why. Invert all.
- ‘Outputs’ should not be at the end of the network, but at the appropriate level of abstraction for a learned representation that extends to levels both more and less abstract than the output.
Concepts Worth Implementing
Valuable properties of representations, born of frustration with the obsession over disentangled representations to the exclusion of other critical properties. Many of these properties exist, to a greater or lesser extent, in human cognition.
- Decomposition of representation
- This gives you a controllable, interpretable, recombinable representation
- Alignment of representation where shared structure exists
- Want concepts with the same mechanisms / structure to update simultaneously when there’s new information that informs their working
- Can be through compositionality
- Trades off against decomposition?
- Modifiability of complexity of the representation depending on task
- Representation that becomes more granular upon zooming in
- Necessary for computational efficiency
- Memory Constraints
- Compute Time
- Attention Constraints
- Ideally would be on a continuum
- Give me the n principal components (non-linear) of the representation, while preserving clean conceptual (semantic) decomposition
- Transferability
- Ability for the representation to be repurposed for different tasks, generally by learning sufficiently high-level structure that there is an appropriate level at which to transfer between representations of problems and solutions
- Appropriate tradeoff of Simplicity / Compressedness vs. Representational capacity
- Sparsity / Discreteness
- Necessary for the discovery of compute intensive structure (say, graphical / relational / network, or concept recombination) in the representation
- Necessary for Concept Learning
- Interpretability
- Optimizability of representation for interpretability.
- Quality translation from representation to natural language.
- Clean isolation of parts of the representation (or a sparse approximation of the used representation) for any prediction made or action taken.
- Control
- Control through modification, freezing, or freeing of sub-parts of the representation
- Discrete and Continuous Modes
- Discreteness
- For Interpretability, self-examination, sparsity.
- Continuity
- For representational capacity, predictive accuracy.
- Discreteness
- Fully general translation into and out of the representation
- Want to be able to flexibly represent any category of object, situation, etc. in a merged representation
- Reserve category errors for a particular mode of action, ‘rigor mode’
- Manifold Learning - is it real? How to check this hypothesis, and leverage it if it’s valuable?
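As a baseline for the “give me the n principal components” bullet above, here is the linear version via SVD (a numpy sketch with synthetic data; the non-linear, semantics-preserving version is the open ask, and a constrained autoencoder would be one hypothetical route to it):

```python
import numpy as np

def top_components(X, n):
    """Linear PCA via SVD: return the top-n principal directions and the
    data projected onto them. X has shape (samples, features)."""
    Xc = X - X.mean(axis=0)              # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n], Xc @ Vt[:n].T         # (n, features), (samples, n)

rng = np.random.default_rng(0)
# Synthetic data that mostly varies along one latent direction [1, 2, 0.5].
t = rng.standard_normal((500, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * rng.standard_normal((500, 3))

dirs, proj = top_components(X, 1)
print(dirs.shape, proj.shape)  # (1, 3) (500, 1)
```

The first recovered direction aligns (up to sign) with the generating direction, which is the “clean conceptual decomposition” being asked for in the linear case.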
Run systematizing creativity over all of these concepts, generating experiment ideas for implementing them in deep learning representations.
Plan:
- Survey all possible papers I could push hard at in Abstract Representation Learning
- Explicate all my categories of idea as low level ideas
- Generate new categories of idea
- List out all of the goals for representation learning as a field and multiple pathways that would fulfill each goal
- Order the goals in terms of importance
- List out the unknowns, the missing categories, the assumptions behind the goals, and the mistakes
- List of likely to be true / likely to be false assumptions, and ways to prove or disprove each assumption
Experiment Ideas
- Concept Learning
- Ungrounded
- New embeddings from relational structure in knowledge bases
- Concept Parsing
- Heuristics for separating out concepts in a sentence. Or label it and learn to separate concepts, perhaps integrating part of speech tagging.
- With concept parser, create concept embeddings
- Grounded
- Cross modal transfer for interpretable representations
- Sentence Level Representations
- Ungrounded
Philosophy
Once you have a low-dimensional discrete representation, a wide body of important algorithms becomes available.
- Concept recombination
- Causality and its establishment
- Credit assignment (to higher level objects) made efficient
- Hierarchical control
- Decision making over conceptual blocks of actions
- Higher level planning
Overview of Representation Learning
Categories
- Deep Convolutional Network
- Hidden layer representations
- Weights as a feature hierarchy
- Audio
- Music Representation
- Speech Representation
- Word Embeddings
- Word2Vec - Hidden layer representation of each word
- Glove
- Sentence Embeddings
- LSTM Hidden Layer
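To make the word-embedding entries concrete: the standard tooling is cosine similarity plus vector arithmetic. A toy sketch with hand-made 3-d vectors standing in for learned word2vec/GloVe embeddings (the vectors are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the standard closeness measure for word embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made toy vectors; real embeddings are learned and high-dimensional.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
}

# The classic analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```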
Papers
- Efficient Estimation of Word Representations in Vector Space
- GloVe: Global Vectors for Word Representation
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- Understanding and Visualizing Convolutional Neural Networks
- Deep Neural Networks for Acoustic Modeling in Speech Recognition
- Deep Learning of Representations for Unsupervised and Transfer Learning
- Neural Discrete Representation Learning
- Concept Learning via Meta-Optimization with Energy Models
- Discovering Interpretable Representations for both Deep Generative and Discriminative Models
Surveys
- Bengio, Courville, Vincent (2014)
- Goyal, Ferrara (2017) [Graph Embeddings]
Thoughts
- This is necessary for effective transfer learning. Transfer means discovering representations and porting their structure at the appropriate level of abstraction.
- Parsing out the relevant filters or axes in a representation for transfer is important, rather than taking the entire overfit representation to a new dataset
- Representations can be made interpretable through concept learning and cross-modal transfer
Representation Learning, Done Properly
I’m going to do a biblio stretch over these guys.
Papers
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
- Early Visual Concept Learning with Unsupervised Deep Learning
- An Information Theoretic Analysis of Deep Latent-Variable Models
- Deep Variational Information Bottleneck
- Fixing a Broken ELBO
Topics to Learn in Representation Learning / Information Theory
- Evidence Lower Bound (ELBO)
- Variational Autoencoder (Deep Learning Textbook 20.10.3)
- Rate (Alemi)
- Distortion (Alemi)
- Beta-VAE
- Generative Mutual Information (Broken ELBO)
- Representational Mutual Information (Broken ELBO)
- How to compute mutual information for discrete distributions
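The last bullet in the list above is easy to make concrete. For discrete distributions, mutual information is a direct sum over a joint probability table (numpy sketch):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Z) in nats from a joint probability table.

    joint[i, j] = p(x_i, z_j), with all entries summing to 1.
    I(X;Z) = sum_ij p(x,z) * log( p(x,z) / (p(x) p(z)) ).
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    pz = joint.sum(axis=0, keepdims=True)   # marginal p(z)
    mask = joint > 0                        # 0 * log 0 = 0 by convention
    return float((joint[mask] * np.log(joint[mask] / (px @ pz)[mask])).sum())

# Independent variables -> zero mutual information.
indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(indep))  # ~0.0

# Perfectly correlated bits -> I = log 2 nats.
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(corr))   # ~0.6931
```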
Offer to talk about:
- Using the maximal information coefficient’s binning patterns to auto-learn the kind of relationship that exists between variables, as well as a latent representation of the space of relationships between variables
- The conversation over On the Information Bottleneck Theory of Deep Learning on Openreview, and whether your experience with Deep Variational Information Bottleneck left you with grounded beliefs on that set of principles / claims
- Fully understanding Fixing a Broken ELBO, Fully understanding An Information-Theoretic Analysis of Deep Latent-Variable Models
- Work out of Deepmind on compositional concepts, valuable properties of representations in general
- Why I care about representation learning
- Upstream of hierarchical model-based learning
- Necessary for learning higher level cross-task regularities allowing effective domain adaptation
- Allows latent concept compositionality for model expressiveness
- Ungrounded concept learning through semantic phrase / sentence embeddings
- The research projects that he’s working on and potential for collaboration
Alemi Paper Notes
Thesis of Fixing a Broken ELBO
Maximizing the ELBO is approximate maximum-likelihood training, which doesn’t necessarily result in a good latent representation. They demonstrate this theoretically and empirically. They derive variational upper and lower bounds on the mutual information between the input and the latent variable, and use the bounds to derive a rate-distortion curve that trades off compression against reconstruction accuracy. There are models with identical ELBO but different quantitative and qualitative characteristics.
(interesting - not between the latent variable and the output, oh I guess that the output is the input in this situation, weird that we’re using reconstruction error)
An Information-Theoretic Analysis of Deep Latent-Variable Models
Analyzes the rate-distortion tradeoff in variational autoencoders, showing that the standard ELBO objective doesn’t let you select points on the rate-distortion frontier. They show how to learn generative models at different rates, achieving a similar ELBO but with very different latent-variable representations.
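Sketching the shared objective of these two papers (notation roughly the papers’, from my reading; m(z) is the variational marginal over latents, d(x|z) the decoder):

```latex
R = \mathbb{E}_{x \sim p^*(x)}\,\mathrm{KL}\!\left(q(z \mid x)\,\|\,m(z)\right),
\qquad
D = -\,\mathbb{E}_{x \sim p^*(x)}\,\mathbb{E}_{q(z \mid x)}\!\left[\log d(x \mid z)\right]

-\mathrm{ELBO} = D + R, \qquad \min_{\theta}\; D + \beta R
```

So -ELBO = D + R collapses rate and distortion into one number, and minimizing D + βR for varying β sweeps the rate-distortion frontier instead of landing on the single point the plain ELBO picks.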
For each paper:
- What are its strengths? What is its value?
- What does it say that is false?
- Is everything in the paper true?
- What are the implicit assumptions that go undefended?
- What is one big idea that you can take away from the frame of the authors?
Concepts to Learn:
- Lagrange multipliers
- Used in beta-VAE to turn the constraint on the representation’s KL capacity into a penalty term (the KKT reading of beta)
- KL divergence in more detail - implement it at least once, play around with it, start to feel familiar with it. Know its shape, its properties, etc.
- Vector Quantisation
- MCMC
- Variational Inference
- Variational Inference: A Review for Statisticians
- High-Level Explanation of Variational Inference
- Reparameterization Trick
- VAEs
- Encoder, parameterizing a posterior distribution q(z|x)
- Decoder, p(x|z) (reconstruction) over the input data
- Priors and posteriors are Gaussian, with diagonal covariance matrices. This allows for the Gaussian Reparameterization trick to be used. (Linear algebra hack)
- Extensions to the priors and posteriors include:
- Autoregressive prior
- Autoregressive posterior
- Normalising flows (Linear algebra hack)
- Inverse autoregressive posteriors
- PixelCNN
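The Gaussian reparameterization trick and the diagonal-Gaussian KL term from the bullets above can be sketched in a few lines (numpy stand-in for illustration, not any particular codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I).

    Sampling becomes a deterministic function of (mu, log_var) plus external
    noise, so gradients can flow through the encoder parameters.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE's KL term."""
    return 0.5 * float((np.exp(log_var) + mu**2 - 1.0 - log_var).sum())

mu = np.array([0.0, 2.0])
log_var = np.array([0.0, np.log(0.25)])   # variances 1.0 and 0.25

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0))                # ~ [0.0, 2.0]
print(samples.var(axis=0))                 # ~ [1.0, 0.25]
print(kl_to_standard_normal(mu, log_var))  # KL in nats; 0 iff mu=0, var=1
```

The diagonal covariance is what makes both the sampling and the KL term this cheap, which is the “linear algebra hack” noted above.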
Paper Ideas / Notes / Summaries
Representation Learning: A Review and New Perspectives
Yoshua Bengio, Aaron Courville, Pascal Vincent
Ideas:
- Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
- A way (or set of ways) to measure the objective
- Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
- The downstream consequences of doing better or worse on the objective
- Compare two different networks over the objective
- The rationale (and intuition pumps) for the objective
- The counterarguments
Concept summary:
- ML depends on data representation
- One notion of quality of representation is how well it disentangles factors of variation in the data
- This paper asks about and discusses the appropriate objectives for representation learning
- It focuses on DL methods, which are defined as the composition of multiple non-linear transformations.
- Goal is to learn more abstract, and so hopefully more useful representations.
Early Visual Concept Learning with Unsupervised Deep Learning
Also called: beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, Alexander Lerchner
Concept summary:
- Goal - learn a representation where single latent units are sensitive to changes in single generative factors, and are relatively invariant to changes in other factors.
- Do this using:
- Redundancy reduction (in the representation, I assume)
- Statistical independence (between parts of the representation)
- Data continuity (neuroinspired, not sure what this means yet)
SCAN: Learning Hierarchical Compositional Visual Concepts
Irina Higgins, Nicholas Sonnerat, Loic Matthey, Arka Pal, Christopher Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, Alexander Lerchner
Concept summary:
- Use beta-VAE to generate latents. Then arrange the learned latents into a hierarchy where composition and logic can be implemented.
SCAN takes visual data and uses beta-VAE to generate a latent space that is a discrete representation of the data. It then uses a decoder to generate images from that space, with disentangling controlled via beta. The latent space is fashioned into a hierarchy through set relationships between the latents. There’s a learned recombination operator that improves SCAN’s ability to generate new concepts by recombining old ones.
Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
Questions:
- What does autoregressive mean here?
- What is Vector Quantisation, why does it work, how does it help circumvent posterior collapse?
Concept Summary:
- VQ-VAE outputs discrete values.
- It uses ‘vector quantisation’ to avoid having the decoder ignore the latent variables output by the encoder.
- The prior is autoregressive and so is learnt rather than static.
Ideas:
- Remembering the thought that probability distributions should be parameterized and approximate rather than assumed to be Gaussian or whatever. Look at the empirical distribution and map it with a parameterized histogram or something like that.
- It’s sad that generative outputs are either continuous or discrete. Both will be missing an important category of representation.
VQ-VAE uses a simple, non-differentiable dictionary-learning algorithm called vector quantization to learn a high-quality, low-dimensional discrete embedding of the latent variables. Then it uses PixelCNN (images) or WaveNet (audio) to generate high-quality samples from that discrete representation. The abstractions learned by the discrete representation are consistently impressive.
Deep Variational Information Bottleneck
Alexander Alemi, Ian Fischer, Joshua Dillon, Kevin Murphy
Concept summary (3 points):
- Use mutual information with the output as the loss function
- Impose an informational constraint on the representation
- Approximate mutual information loss with variational inference.
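The three points above combine into the information bottleneck objective; as I read the paper, the intractable IB Lagrangian on the left is replaced by the variational loss on the right (r(z) approximates the latent marginal):

```latex
\max_{\theta}\; I(Z;Y) - \beta\, I(Z;X)
\quad\leadsto\quad
\mathcal{L} \approx \frac{1}{N}\sum_{n=1}^{N}
\mathbb{E}_{z \sim p(z \mid x_n)}\!\left[-\log q(y_n \mid z)\right]
+ \beta\,\mathrm{KL}\!\left(p(z \mid x_n)\,\|\,r(z)\right)
```

The first term is a cross-entropy stand-in for the mutual information with the output; the KL term is the informational constraint on the representation.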
Deep Learning of Representations for Unsupervised and Transfer Learning (2012)
Yoshua Bengio
Ideas:
- Apply inversion to all of the ideas in each paper. When it works well, you’ve discovered something you think is true that others disagree with. And if it’s a foundational assumption, you can get started on making progress.
- How do we know what we claim to know?
- High and low in the network gets conflated with high and low in abstract space. In that space, the most general features (those shared among many datapoints) are curves and edges, and the less general features are class-specific. Which are the more abstract? The recombinations of the input features? Or the more general features? If abstraction is that which is shared among many datapoints, then the low-level features are the more abstract ones.
- By definition, concepts (like classes) are abstract. Lines and curves are concrete.
Notes:
- Machine learning is a wonderful example of abstraction, where every problem of predicting an output from an input can cleanly fall into an X, Y pair that is fed into an arbitrary algorithm.
- It’s close to ‘function’ as a great abstraction; it’s a subset of that: the subset where the utility of the function is making a prediction, or something near that.
- This example of abstraction is about shared structure between the problems as well as functional unity.
- Are there other implicit meanings in the word abstraction than shared structure and functional unity? Result of a compositional process?
- Bengio used to cite ICA! Independent component analysis! And then he stopped, and ‘invented’ disentangling!!! Now he gets cited instead of them!!!! We went from non-linear ICA to disentangling…
- I guess he does put it in the research section of the deep learning textbook…
Concept Learning via Meta-Optimization with Energy Models
Igor Mordatch
Concept Summary:
- Represent concepts with an energy function that recombines concepts with an attention mask over the entities (where concepts are entities!) that exist in an event, and sum them with one another as a representation for prediction.
Fixing a Broken ELBO
Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif Saurous, Kevin Murphy
Concept Summary:
- Create a parameter to control the rate-distortion tradeoff; use it to improve discrete latent-variable modeling and to address posterior collapse (where a strong decoder ignores the latents)
Measuring abstract reasoning in neural networks
David Barrett, Felix Hill, Adam Santoro, Ari Morcos, Timothy Lillicrap
Source: Original Google Doc