Abstract Representation Learning
Category: Machine Intelligence
Core Curriculum
- Bengio, Courville, Vincent (2014)
- Early Visual Concept Learning with Unsupervised Deep Learning (beta-VAE)
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- Understanding and Visualizing Convolutional Neural Networks
- Deep Learning of Representations for Unsupervised and Transfer Learning
- Neural Discrete Representation Learning
- Concept Learning via Meta-Optimization with Energy Models
- An Information Theoretic Analysis of Deep Latent-Variable Models
- Deep Variational Information Bottleneck
- Fixing a Broken ELBO
- Measuring Abstract Reasoning in Neural Networks
- How Neural Nets Build up Their Understanding of Images
- Understanding Disentangling in beta-VAE
- Elements of Information Theory
- Efficient Estimation of Word Representations in Vector Space
- GloVe: Global Vectors for Word Representation
- SVCCA, Representational Similarity
- Neural Scene Representation and Rendering
- Representation Learning with Contrastive Predictive Coding
- Towards Conceptual Compression
- Deep Neural Networks Abstract Like Humans
Generate experiment ideas reading through this curriculum.
Fundamental Questions in Representation Learning
Representation Learning Comprehensive Research Ideas
Secondary Curriculum
- Deep Variational Canonical Correlation Analysis
- Generative Models of Visually Grounded Imagination
- Basic Objects in Natural Categories
- The Origin of Concepts
- Concepts
- Knowledge, Concepts and Categories
- The Big Book of Concepts
- Knowledge Representation
- Fluid Concepts and Creative Analogies
- Variational Inference for Monte Carlo Objectives (NVIL, VIMCO)
- Hierarchical Deep RL: Integrating Temporal Abstraction and Intrinsic Motivation
- A Proposal for Learning Compositional Problem Solvers
Latent Variable VAEs
- VQ-VAE
- Beta-VAE
- NVIL
- VIMCO
Brain Specific:
- Challenges and Opportunities in Unsupervised Deep Learning
I want to create a curriculum that’s OLD. Exclusively papers from 2000 or earlier. (Hint: Old papers only cite other old papers. And check ICA.)
To Create: Fundamental Questions in Representation Learning
To Create: Fundamental Questions in X, for each axis of the research frontier.
Contrarian Truths
- The supervised-unsupervised ontology is broken in a way that damages researchers’ decision making
- VAEs (and reconstruction generally) implement a single heuristic for representation learning which is weak in comparison to future prediction or optimizing for future-task-relevant representations.
- The more abstract structure is in the lower layers of the network, leading to conflation / misuses of ‘high’ and ‘low’ everywhere.
- What if the way we thought about Abstraction was broken?
- What if Abstraction is 5 concepts being lumped into one?
- What if shared structure isn’t the same as unification of purpose (which is often what a class is) in functional abstraction?
- All of the forms of abstraction need a distinct name to avoid conflation.
- Naming (the creation of a new abstraction) as essential
- Great idea! Create a list of accepted truths in DL. Especially those believed by the high status (Hinton, Bengio, Lecun, etc.) Ask which accepted truths researchers believe are wrong and why. Invert all.
- ‘Outputs’ should not be at the end of the network, but at the appropriate level of abstraction for a learned representation that extends to levels both more and less abstract than the output.
Concepts Worth Implementing
Valuable properties of representations, born of frustration with the obsession over disentangled representations to the exclusion of other critical properties. Many of these properties exist, to a greater or lesser extent, in human cognition.
- Decomposition of representation
- This gives you a controllable, interpretable, recombinable representation
- Alignment of representation where shared structure exists
- Want concepts with the same mechanisms / structure to update simultaneously when there’s new information that informs their working
- Can be through compositionality
- Trades off against decomposition?
- Modifiability of complexity of the representation depending on task
- Representation that becomes more granular upon zooming in
- Necessary for computational efficiency
- Memory Constraints
- Compute Time
- Attention Constraints
- Ideally would be on a continuum
- Give me the n principal components (non-linear) of the representation, while preserving clean conceptual (semantic) decomposition
- Transferability
- Ability for the representation to be repurposed for different tasks, generally by learning sufficiently high-level structure that there is an appropriate level at which to transfer between representations of problems and solutions
- Appropriate tradeoff of Simplicity / Compressedness vs. Representational capacity
- Sparsity / Discreteness
- Necessary for the discovery of compute intensive structure (say, graphical / relational / network, or concept recombination) in the representation
- Necessary for Concept Learning
- Interpretability
- Optimizability of representation for interpretability.
- Quality translation from representation to natural language.
- Clean isolation of parts of the representation (or a sparse approximation of the used representation) for any prediction made or action taken.
- Control
- Control through modification, freezing, or freeing of sub-parts of the representation
- Discrete and Continuous Modes
- Discreteness
- For Interpretability, self-examination, sparsity.
- Continuity
- For representational capacity, predictive accuracy.
- Discreteness
- Fully general translation into and out of the representation
- Want to be able to flexibly represent any category of object, situation, etc. in a merged representation
- Reserve category errors for a particular mode of action, ‘rigor mode’
- Manifold Learning - is it real? How to check this hypothesis, and leverage it if it’s valuable?
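As a baseline for the “give me the n principal components” bullet above, here is the linear version via SVD (a numpy sketch with synthetic data; the non-linear, semantics-preserving version is the open ask, and a constrained autoencoder would be one hypothetical route to it):

```python
import numpy as np

def top_components(X, n):
    """Linear PCA via SVD: return the top-n principal directions and the
    data projected onto them. X has shape (samples, features)."""
    Xc = X - X.mean(axis=0)              # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n], Xc @ Vt[:n].T         # (n, features), (samples, n)

rng = np.random.default_rng(0)
# Synthetic data that mostly varies along one latent direction [1, 2, 0.5].
t = rng.standard_normal((500, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * rng.standard_normal((500, 3))

dirs, proj = top_components(X, 1)
print(dirs.shape, proj.shape)  # (1, 3) (500, 1)
```

The first recovered direction aligns (up to sign) with the generating direction, which is the “clean conceptual decomposition” being asked for in the linear case.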
Run systematizing creativity over all of these concepts, generating experiment ideas for implementing them in deep learning representations.
Plan:
- Survey all possible papers I could push hard at in Abstract Representation Learning
- Explicate all my categories of idea as low level ideas
- Generate new categories of idea
- List out all of the goals for representation learning as a field and multiple pathways that would fulfill each goal
- Order the goals in terms of importance
- List out the unknowns, the missing categories, the assumptions behind the goals, and the mistakes
- List of likely to be true / likely to be false assumptions, and ways to prove or disprove each assumption
Experiment Ideas
- Concept Learning
- Ungrounded
- New embeddings from relational structure in knowledge bases
- Concept Parsing
- Heuristics for separating out concepts in a sentence. Or label it and learn to separate concepts, perhaps integrating part of speech tagging.
- With concept parser, create concept embeddings
- Grounded
- Cross modal transfer for interpretable representations
- Sentence Level Representations
- Ungrounded
Philosophy
Once you have a low-dimensional discrete representation, a wide body of important algorithms becomes available.
- Concept recombination
- Causality and its establishment
- Credit assignment (to higher level objects) made efficient
- Hierarchical control
- Decision making over conceptual blocks of actions
- Higher level planning
Overview of Representation Learning
Categories
- Deep Convolutional Network
- Hidden layer representations
- Weights as a feature hierarchy
- Audio
- Music Representation
- Speech Representation
- Word Embeddings
- Word2Vec - Hidden layer representation of each word
- Glove
- Sentence Embeddings
- LSTM Hidden Layer
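To make the word-embedding entries concrete: the standard tooling is cosine similarity plus vector arithmetic. A toy sketch with hand-made 3-d vectors standing in for learned word2vec/GloVe embeddings (the vectors are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the standard closeness measure for word embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made toy vectors; real embeddings are learned and high-dimensional.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
}

# The classic analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```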
Papers
- Efficient Estimation of Word Representations in Vector Space
- GloVe: Global Vectors for Word Representation
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- Understanding and Visualizing Convolutional Neural Networks
- Deep Neural Networks for Acoustic Modeling in Speech Recognition
- Deep Learning of Representations for Unsupervised and Transfer Learning
- Neural Discrete Representation Learning
- Concept Learning via Meta-Optimization with Energy Models
- Discovering Interpretable Representations for both Deep Generative and Discriminative Models
Surveys
- Bengio, Courville, Vincent (2014)
- Goyal, Ferrara (2017) [Graph Embeddings]
Thoughts
- This is necessary for effective transfer learning. Transfer means discovering representations and porting their structure at the appropriate level of abstraction.
- Parsing out the relevant filters or axes in a representation for transfer is important, rather than taking the entire overfit representation to a new dataset
- Representations can be made interpretable through concept learning and cross-modal transfer
Representation Learning, Done Properly
I’m going to do a biblio stretch over these guys.
Papers
- SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
- beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
- Early Visual Concept Learning with Unsupervised Deep Learning
- An Information Theoretic Analysis of Deep Latent-Variable Models
- Deep Variational Information Bottleneck
- Fixing a Broken ELBO
Topics to Learn in Representation Learning / Information Theory
- Evidence Lower Bound (ELBO)
- Variational Autoencoder (Deep Learning Textbook 20.10.3)
- Rate (Alemi)
- Distortion (Alemi)
- Beta-VAE
- Generative Mutual Information (Broken ELBO)
- Representational Mutual Information (Broken ELBO)
- How to compute mutual information for discrete distributions
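The last bullet in the list above is easy to make concrete. For discrete distributions, mutual information is a direct sum over a joint probability table (numpy sketch):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Z) in nats from a joint probability table.

    joint[i, j] = p(x_i, z_j), with all entries summing to 1.
    I(X;Z) = sum_ij p(x,z) * log( p(x,z) / (p(x) p(z)) ).
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    pz = joint.sum(axis=0, keepdims=True)   # marginal p(z)
    mask = joint > 0                        # 0 * log 0 = 0 by convention
    return float((joint[mask] * np.log(joint[mask] / (px @ pz)[mask])).sum())

# Independent variables -> zero mutual information.
indep = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(indep))  # ~0.0

# Perfectly correlated bits -> I = log 2 nats.
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(corr))   # ~0.6931
```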
Offer to talk about:
- Using the maximal information coefficient’s binning patterns to auto-learn the kind of relationship that exists between variables, as well as a latent representation of the space of relationships between variables
- The conversation over On the Information Bottleneck Theory of Deep Learning on Openreview, and whether your experience with Deep Variational Information Bottleneck left you with grounded beliefs on that set of principles / claims
- Fully understanding Fixing a Broken ELBO, Fully understanding An Information-Theoretic Analysis of Deep Latent-Variable Models
- Work out of Deepmind on compositional concepts, valuable properties of representations in general
- Why I care about representation learning
- Upstream of hierarchical model-based learning
- Necessary for learning higher level cross-task regularities allowing effective domain adaptation
- Allows latent concept compositionality for model expressiveness
- Ungrounded concept learning through semantic phrase / sentence embeddings
- The research projects that he’s working on and potential for collaboration
Alemi Paper Notes
Thesis of Fixing a Broken ELBO
Maximizing the ELBO is approximate maximum-likelihood training, which doesn’t necessarily result in a good latent representation. They demonstrate this theoretically and empirically. They derive variational upper and lower bounds on the mutual information between the input and the latent variable, and use the bounds to derive a rate-distortion curve that trades off compression against reconstruction accuracy. There are models with identical ELBO but different quantitative and qualitative characteristics.
(interesting - not between the latent variable and the output, oh I guess that the output is the input in this situation, weird that we’re using reconstruction error)
An Information-Theoretic Analysis of Deep Latent-Variable Models
Analyzes the rate-distortion tradeoff in variational autoencoders, showing that the standard ELBO objective doesn’t let you select points on the rate-distortion frontier. They show how to learn generative models at different rates, achieving a similar ELBO but with very different latent-variable representations.
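Sketching the shared objective of these two papers (notation roughly the papers’, from my reading; m(z) is the variational marginal over latents, d(x|z) the decoder):

```latex
R = \mathbb{E}_{x \sim p^*(x)}\,\mathrm{KL}\!\left(q(z \mid x)\,\|\,m(z)\right),
\qquad
D = -\,\mathbb{E}_{x \sim p^*(x)}\,\mathbb{E}_{q(z \mid x)}\!\left[\log d(x \mid z)\right]

-\mathrm{ELBO} = D + R, \qquad \min_{\theta}\; D + \beta R
```

So -ELBO = D + R collapses rate and distortion into one number, and minimizing D + βR for varying β sweeps the rate-distortion frontier instead of landing on the single point the plain ELBO picks.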
For each paper:
- What are its strengths? What is its value?
- What does it say that is false?
- Is everything in the paper true?
- What are the implicit assumptions that go undefended?
- What is one big idea that you can take away from the frame of the authors?
Concepts to Learn:
- Lagrange multipliers
- Used in beta-VAE to turn the constraint on the representation’s KL capacity into a penalty term (the KKT reading of beta)
- KL divergence in more detail - implement it at least once, play around with it, start to feel familiar with it. Know its shape, its properties, etc.
- Vector Quantisation
- MCMC
- Variational Inference
- Variational Inference: A Review for Statisticians
- High-Level Explanation of Variational Inference
- Reparameterization Trick
- VAEs
- Encoder, parameterizing a posterior distribution q(z|x)
- Decoder, p(x|z) (reconstruction) over the input data
- Priors and posteriors are Gaussian, with diagonal covariance matrices. This allows for the Gaussian Reparameterization trick to be used. (Linear algebra hack)
- Extensions to the priors and posteriors include:
- Autoregressive prior
- Autoregressive posterior
- Normalising flows (Linear algebra hack)
- Inverse autoregressive posteriors
- PixelCNN
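The Gaussian reparameterization trick and the diagonal-Gaussian KL term from the bullets above can be sketched in a few lines (numpy stand-in for illustration, not any particular codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I).

    Sampling becomes a deterministic function of (mu, log_var) plus external
    noise, so gradients can flow through the encoder parameters.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE's KL term."""
    return 0.5 * float((np.exp(log_var) + mu**2 - 1.0 - log_var).sum())

mu = np.array([0.0, 2.0])
log_var = np.array([0.0, np.log(0.25)])   # variances 1.0 and 0.25

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0))                # ~ [0.0, 2.0]
print(samples.var(axis=0))                 # ~ [1.0, 0.25]
print(kl_to_standard_normal(mu, log_var))  # KL in nats; 0 iff mu=0, var=1
```

The diagonal covariance is what makes both the sampling and the KL term this cheap, which is the “linear algebra hack” noted above.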
Paper Ideas / Notes / Summaries
Representation Learning: A Review and New Perspectives
Yoshua Bengio, Aaron Courville, Pascal Vincent
Ideas:
- Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
- A way (or set of ways) to measure the objective
- Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
- The downstream consequences of doing better or worse on the objective
- Compare two different networks over the objective
- The rationale (and intuition pumps) for the objective
- The counterarguments
Concept summary:
- ML depends on data representation
- One notion of quality of representation is how well it disentangles factors of variation in the data
- This paper asks about and discusses the appropriate objectives for representation learning
- It focuses on DL methods, which are defined as the composition of multiple non-linear transformations.
- Goal is to learn more abstract, and so hopefully more useful representations.
Early Visual Concept Learning with Unsupervised Deep Learning
Also called: beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, Alexander Lerchner
Concept summary:
- Goal - learn a representation where single latent units are sensitive to changes in single generative factors, and are relatively invariant to changes in other factors.
- Do this using:
- Redundancy reduction (in the representation, I assume)
- Statistical independence (between parts of the representation)
- Data continuity (neuroinspired, not sure what this means yet)
SCAN: Learning Hierarchical Compositional Visual Concepts
Irina Higgins, Nicholas Sonnerat, Loic Matthey, Arka Pal, Christopher Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, Alexander Lerchner
Concept summary:
- Use beta-VAE to generate latents. Then arrange the learned latents into a hierarchy where composition and logic can be implemented.
SCAN takes visual data and uses beta-VAE to generate a latent space that is a discrete representation of the data. It then uses a decoder to generate images from that space, with disentangling controlled via beta. The latent space is fashioned into a hierarchy through set relationships between the latents. There’s a learned recombination operator that improves SCAN’s ability to generate new concepts by recombining old ones.
Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
Questions:
- What does autoregressive mean here?
- What is Vector Quantisation, why does it work, how does it help circumvent posterior collapse?
Concept Summary:
- VQ-VAE outputs discrete values.
- It uses ‘vector quantisation’ to avoid having the decoder ignore the latent variables output by the encoder.
- The prior is autoregressive and so is learnt rather than static.
Ideas:
- Remembering the thought that probability distributions should be parameterized and approximate rather than assumed to be Gaussian or whatever. Look at the empirical distribution and map it with a parameterized histogram or something like that.
- It’s sad that generative outputs are either continuous or discrete. Both will be missing an important category of representation.
VQ-VAE uses a simple, non-differentiable dictionary-learning algorithm called vector quantization to learn a high-quality, low-dimensional discrete embedding of the latent variables. Then it uses PixelCNN (images) or WaveNet (audio) to generate high-quality samples from that discrete representation. The abstractions learned by the discrete representation are consistently impressive.
Deep Variational Information Bottleneck
Alexander Alemi, Ian Fischer, Joshua Dillon, Kevin Murphy
Concept summary (3 points):
- Use mutual information with the output as the loss function
- Impose an informational constraint on the representation
- Approximate mutual information loss with variational inference.
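The three points above combine into the information bottleneck objective; as I read the paper, the intractable IB Lagrangian on the left is replaced by the variational loss on the right (r(z) approximates the latent marginal):

```latex
\max_{\theta}\; I(Z;Y) - \beta\, I(Z;X)
\quad\leadsto\quad
\mathcal{L} \approx \frac{1}{N}\sum_{n=1}^{N}
\mathbb{E}_{z \sim p(z \mid x_n)}\!\left[-\log q(y_n \mid z)\right]
+ \beta\,\mathrm{KL}\!\left(p(z \mid x_n)\,\|\,r(z)\right)
```

The first term is a cross-entropy stand-in for the mutual information with the output; the KL term is the informational constraint on the representation.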
Deep Learning of Representations for Unsupervised and Transfer Learning (2012)
Yoshua Bengio
Ideas:
- Apply inversion to all of the ideas in each paper. When it works well, you’ve discovered something you think is true that others disagree with. And if it’s a foundational assumption, you can get started on making progress.
- How do we know what we claim to know?
- High and low in the network gets conflated with high and low in abstract space. In that space, the most general features (those shared among many datapoints) are curves and edges, and the less general features are class-specific. Which are the more abstract? The recombinations of the input features? Or the more general features? If abstraction is that which is shared among many datapoints, then the low-level features are the more abstract ones.
- By definition, concepts (like classes) are abstract. Lines and curves are concrete.
Notes:
- Machine learning is a wonderful example of abstraction, where every problem of predicting an output from an input can cleanly fall into an X, Y pair that is fed into an arbitrary algorithm.
- It’s close to ‘function’ as a great abstraction; it’s a subset of that: the subset where the utility of the function is making a prediction, or something near that.
- This example of abstraction is about shared structure between the problems as well as functional unity.
- Are there other implicit meanings in the word abstraction than shared structure and functional unity? Result of a compositional process?
- Bengio used to cite ICA! Independent component analysis! And then he stopped, and ‘invented’ disentangling!!! Now he gets cited instead of them!!!! We went from non-linear ICA to disentangling…
- I guess he does put it in the research section of the deep learning textbook…
Concept Learning via Meta-Optimization with Energy Models
Igor Mordatch
Concept Summary:
- Represent concepts with an energy function that recombines concepts with an attention mask over the entities (where concepts are entities!) that exist in an event, and sum them with one another as a representation for prediction.
Fixing a Broken ELBO
Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif Saurous, Kevin Murphy
Concept Summary:
- Create a parameter to control the rate-distortion tradeoff; use it to improve discrete latent-variable modeling and to address posterior collapse (where a strong decoder ignores the latents)
Measuring abstract reasoning in neural networks
David Barrett, Felix Hill, Adam Santoro, Ari Morcos, Timothy Lillicrap
Source: Original Google Doc