Abstract Representation Learning

Category: Machine Intelligence



Core Curriculum

  1. Bengio, Courville, Vincent (2014)
  2. Early Visual Concept Learning with Unsupervised Deep Learning (beta-VAE)
  3. SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
  4. Understanding and Visualizing Convolutional Neural Networks
  5. Deep Learning of Representations for Unsupervised and Transfer Learning
  6. Neural Discrete Representation Learning
  7. Concept Learning via Meta-Optimization with Energy Models
  8. An Information Theoretic Analysis of Deep Latent-Variable Models
  9. Deep Variational Information Bottleneck
  10. Fixing a Broken ELBO
  11. Measuring Abstract Reasoning in Neural Networks
  12. How Neural Nets Build up Their Understanding of Images
  13. Understanding Disentangling in beta-VAE
  14. Elements of Information Theory
  15. Efficient Estimation of Word Representations in Vector Space
  16. GloVe: Global Vectors for Word Representation
  17. SVCCA, Representational Similarity
  18. Neural Scene Representation and Rendering
  19. Representation Learning with Contrastive Predictive Coding
  20. Towards Conceptual Compression
  21. Deep Neural Networks Abstract Like Humans

Generate experiment ideas reading through this curriculum.

Fundamental Questions in Representation Learning
Representation Learning Comprehensive Research Ideas

Secondary Curriculum

  1. Deep Variational Canonical Correlation Analysis
  2. Generative Models of Visually Grounded Imagination
  3. Basic Objects in Natural Categories
  4. The Origin of Concepts
  5. Concepts
  6. Knowledge, Concepts and Categories
  7. The Big Book of Concepts
  8. Knowledge Representation
  9. Fluid Concepts and Creative Analogies
  10. Variational Inference for Monte Carlo Objectives (NVIL, VIMCO)
  11. Hierarchical Deep RL: Integrating Temporal Abstraction and Intrinsic Motivation
  12. A Proposal for Learning Compositional Problem Solvers

Latent Variable VAEs

  1. VQ-VAE
  2. Beta-VAE
  3. NVIL
  4. VIMCO

Brain Specific:

  • Challenges and Opportunities in Unsupervised Deep Learning

I want to create a curriculum that’s OLD. Exclusively papers from 2000 or earlier. (Hint: Old papers only cite other old papers. And check ICA.)

To Create: Fundamental Questions in Representation Learning
To Create: Fundamental Questions in X, for each axis of the research frontier.

Contrarian Truths

  1. The supervised-unsupervised ontology is broken in a way that damages researchers’ decision making
  2. VAEs (and reconstruction generally) implement a single heuristic for representation learning which is weak in comparison to future prediction or optimizing for future-task-relevant representations.
  3. The more abstract structure sits in the lower layers of the network, leading to conflation and misuse of ‘high’ and ‘low’ everywhere.
  4. What if the way we thought about Abstraction was broken?
    1. What if Abstraction is 5 concepts being lumped into one?
    2. What if shared structure isn’t the same as unification of purpose (which is often what a class is) in functional abstraction?
    3. All of the forms of abstraction need a distinct name to avoid conflation.
      1. Naming (the creation of a new abstraction) as essential
  5. Great idea! Create a list of accepted truths in DL. Especially those believed by the high status (Hinton, Bengio, Lecun, etc.) Ask which accepted truths researchers believe are wrong and why. Invert all.
  6. ‘Outputs’ should not be at the end of the network, but at the appropriate level of abstraction for a learned representation that extends to levels both more and less abstract than the output.

Concepts Worth Implementing

Valuable properties of representations, born out of the frustration with the obsession over disentangling representations to the exclusion of other critical concepts. Many of these properties exist, to a greater or lesser extent, in human cognition.

  1. Decomposition of representation
    1. This gives you a controllable, interpretable, recombinable representation
  2. Alignment of representation where shared structure exists
    1. Want concepts with the same mechanisms / structure to update simultaneously when there’s new information that informs their working
    2. Can be through compositionality
    3. Trades off against decomposition?
  3. Modifiability of complexity of the representation depending on task
    1. Representation that becomes more granular upon zooming in
    2. Necessary for computational efficiency
      1. Memory Constraints
      2. Compute Time
      3. Attention Constraints
    3. Ideally would be on a continuum
      1. Give me the n principal components (non-linear) of the representation, while preserving clean conceptual (semantic) decomposition
  4. Transferability
    1. Ability for the representation to be repurposed for different tasks, generally through learning sufficiently high level structure that there is an appropriate level at which to do transfer between representations of problems and solutions
  5. Appropriate tradeoff of Simplicity / Compressedness vs. Representational capacity
  6. Sparsity / Discreteness
    1. Necessary for the discovery of compute intensive structure (say, graphical / relational / network, or concept recombination) in the representation
    2. Necessary for Concept Learning
  7. Interpretability
    1. Optimizability of representation for interpretability.
    2. Quality translation from representation to natural language.
    3. Clean isolation of parts of the representation (or a sparse approximation of the used representation) for any prediction made or action taken.
  8. Control
    1. Control through modification, freezing, or freeing of sub-parts of the representation
  9. Discrete and Continuous Modes
    1. Discreteness
      1. For Interpretability, self-examination, sparsity.
    2. Continuity
      1. For representational capacity, predictive accuracy.
  10. Fully general translation into and out of the representation
  11. Want to be able to flexibly represent any category of object, situation, etc. in a merged representation
  12. Reserve category errors for a particular mode of action, ‘rigor mode’
  13. Manifold Learning - is it real? How to check this hypothesis, and leverage it if it’s valuable?

Run systematizing creativity over all of these concepts, generating experiment ideas for implementing them in deep learning representations.

Plan:

  1. Survey all possible papers I could push hard at in Abstract Representation Learning
    1. Explicate all my categories of idea as low level ideas
    2. Generate new categories of idea
    3. List out all of the goals for representation learning as a field and multiple pathways that would fulfill each goal
      1. Order the goals in terms of importance
    4. List out the unknowns, the missing categories, the assumptions behind the goals, and the mistakes
    5. List of likely to be true / likely to be false assumptions, and ways to prove or disprove each assumption

Experiment Ideas

  1. Concept Learning
    1. Ungrounded
      1. New embeddings from relational structure in knowledge bases
      2. Concept Parsing
        1. Heuristics for separating out concepts in a sentence. Or label it and learn to separate concepts, perhaps integrating part of speech tagging.
        2. With concept parser, create concept embeddings
    2. Grounded
      1. Cross modal transfer for interpretable representations
    3. Sentence Level Representations

Philosophy

Once you have a low-dimensional discrete representation, a wide body of important algorithms becomes available.

  1. Concept recombination
  2. Causality and its establishment
  3. Credit assignment (to higher level objects) made efficient
  4. Hierarchical control
    1. Decision making over conceptual blocks of actions
  5. Higher level planning

Overview of Representation Learning

Categories

  1. Deep Convolutional Network
    1. Hidden layer representations
    2. Weights as a feature hierarchy
  2. Audio
    1. Music Representation
    2. Speech Representation
  3. Word Embeddings
    1. Word2Vec - Hidden layer representation of each word
    2. GloVe
  4. Sentence Embeddings
    1. LSTM Hidden Layer

Papers

  1. Efficient Estimation of Word Representations in Vector Space
  2. GloVe: Global Vectors for Word Representation
  3. SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
  4. Understanding and Visualizing Convolutional Neural Networks
  5. Deep Neural Networks for Acoustic Modeling in Speech Recognition
  6. Deep Learning of Representations for Unsupervised and Transfer Learning
  7. Neural Discrete Representation Learning
  8. Concept Learning via Meta-Optimization with Energy Models
  9. Discovering Interpretable Representations for both Deep Generative and Discriminative Models

Surveys

  1. Bengio, Courville, Vincent (2014)
  2. Goyal, Ferrara (2017) [Graph Embeddings]

Thoughts

  • This is necessary for effective transfer learning. Transfer means discovering representations and porting their structure at the appropriate level of abstraction.
  • Parsing out the relevant filters or axes in a representation for transfer is important, rather than taking the entire overfit representation to a new dataset
  • Representations can be made interpretable through concept learning and cross-modal transfer

Representation Learning, Done Properly

I’m going to do a biblio stretch over these guys.

Papers

  1. SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
  2. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
  3. Early Visual Concept Learning with Unsupervised Deep Learning
  4. An Information Theoretic Analysis of Deep Latent-Variable Models
  5. Deep Variational Information Bottleneck
  6. Fixing a Broken ELBO

Topics to Learn in Representation Learning / Information Theory

  1. Evidence Lower Bound (ELBO)
  2. Variational Autoencoder (Deep Learning Textbook 20.10.3)
  3. Rate (Alemi)
  4. Distortion (Alemi)
  5. Beta-VAE
  6. Generative Mutual Information (Broken ELBO)
  7. Representational Mutual Information (Broken ELBO)
  8. How to compute mutual information for discrete distributions
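Item 8 above is small enough to implement directly. A minimal sketch for a discrete joint distribution given as a probability table (pure Python; the function name is my own):

```python
from math import log

def mutual_information(joint):
    """Mutual information I(X;Y) in nats from a joint probability table.

    joint: list of rows, joint[i][j] = p(x_i, y_j); entries sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                     # convention: 0 log 0 = 0
                mi += pxy * log(pxy / (px[i] * py[j]))
    return mi
```

Sanity checks worth running: an independent joint gives 0, and a perfectly correlated binary joint gives log 2.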

Offer to talk about:

  1. Using the maximal information coefficient’s binning patterns to auto-learn the kind of relationship that exists between variables, as well as a latent representation of the space of relationships between variables
  2. The conversation over On the Information Bottleneck Theory of Deep Learning on Openreview, and whether your experience with Deep Variational Information Bottleneck left you with grounded beliefs on that set of principles / claims
  3. Fully understanding Fixing a Broken ELBO, Fully understanding An Information-Theoretic Analysis of Deep Latent-Variable Models
  4. Work out of Deepmind on compositional concepts, valuable properties of representations in general
  5. Why I care about representation learning
    1. Upstream of hierarchical model-based learning
    2. Necessary for learning higher level cross-task regularities allowing effective domain adaptation
    3. Allows latent concept compositionality for model expressiveness
  6. Ungrounded concept learning through semantic phrase / sentence embeddings
  7. The research projects that he’s working on and potential for collaboration

Alemi Paper Notes

Thesis of Fixing a Broken ELBO: Maximizing the ELBO is approximate maximum-likelihood training, which doesn’t necessarily result in a good latent representation. They demonstrate this theoretically and empirically. They derive variational upper and lower bounds on the mutual information between the input and the latent variable, and use the bounds to derive a rate-distortion curve that captures the tradeoff between compression and reconstruction accuracy. There are models with identical ELBO but different quantitative and qualitative characteristics.

(interesting - not between the latent variable and the output, oh I guess that the output is the input in this situation, weird that we’re using reconstruction error)

An Information-Theoretic Analysis of Deep Latent-Variable Models: Analyzes the rate-distortion tradeoff in variational autoencoders, showing that the standard ELBO objective doesn’t let you select points on the rate-distortion frontier. They show how to learn generative models with different rates, achieving a similar ELBO but with very different latent-variable representations.
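The rate-distortion framing in these two papers can be written compactly. A sketch of the decomposition, with q(z|x) the encoder and m(z) the variational marginal (notation assumed from the papers):

```latex
% Distortion: expected reconstruction negative log-likelihood
D = -\,\mathbb{E}_{p(x)}\,\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
% Rate: expected KL from the encoder to the marginal m(z)
R = \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z|x)\,\|\,m(z)\big)\big]
% The standard ELBO collapses the two into a single point:
\mathrm{ELBO} = -(D + R)
% A multiplier \beta sweeps the frontier (the beta-VAE objective):
\min_{q}\; D + \beta R
```

This makes the papers’ claim concrete: many (R, D) pairs share the same sum D + R, hence the same ELBO, while the latents differ drastically.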

For each paper:

  1. What are its strengths? What is its value?
  2. What does it say that is false?
    1. Is everything in the paper true?
  3. What are the implicit assumptions that go undefended?
  4. What is one big idea that you can take away from the frame of the authors?

Concepts to Learn:

  1. Lagrangian multipliers
    1. Used in beta-VAE to enforce the KL constraint on the representation
  2. KL divergence in more detail - implement it at least once, play around with it, start to feel familiar with it. Know its shape, its properties, etc.
  3. Vector Quantisation
  4. MCMC
  5. Variational Inference
    1. Variational Inference: A Review for Statisticians
    2. High-Level Explanation of Variational Inference
  6. Reparameterization Trick
  7. VAEs
    1. Encoder, parameterizing a posterior distribution q(z|x)
    2. Decoder, p(x|z) (reconstruction) over the input data
    3. Priors and posteriors are Gaussian, with diagonal covariance matrices. This allows for the Gaussian Reparameterization trick to be used. (Linear algebra hack)
    4. Extensions to the priors and posteriors include:
      1. Autoregressive prior
      2. Autoregressive posterior
      3. Normalising flows (Linear algebra hack)
      4. Inverse autoregressive posteriors
  8. PixelCNN
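Two of the items above, the KL divergence and the reparameterization trick, are small enough to implement in a few lines. A sketch for the diagonal-Gaussian case used in standard VAEs (function names are my own):

```python
import math
import random

def kl_gauss_std(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one dimension of a diagonal Gaussian.

    Closed form: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so gradients can flow through
    mu and log_var -- the 'linear algebra hack' noted above.
    """
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps
```

Good quick checks: the KL is zero exactly when the posterior matches the standard-normal prior, and shifting the mean by 1 costs 0.5 nats.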

Paper Ideas / Notes / Summaries

Representation Learning: A Review and New Perspectives Yoshua Bengio, Aaron Courville, Pascal Vincent

Ideas:

  1. Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
    1. A way (or set of ways) to measure the objective
      1. Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
    2. The downstream consequences of doing better or worse on the objective
    3. Compare two different networks over the objective
    4. The rationale (and intuition pumps) for the objective
      1. The counterarguments

Concept summary:

  1. ML depends on data representation
  2. One notion of quality of representation is how well it disentangles factors of variation in the data
  3. This paper asks about and discusses the appropriate objectives for representation learning
  4. It focuses on DL methods, which are defined as the composition of multiple non-linear transformations.
  5. Goal is to learn more abstract, and so hopefully more useful representations.

Early Visual Concept Learning with Unsupervised Deep Learning Also called: beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, Alexander Lerchner

Concept summary:

  1. Goal - learn a representation where single latent units are sensitive to changes in single generative factors, and are relatively invariant to changes in other factors.
  2. Do this using:
    1. Redundancy reduction (in the representation, I assume)
    2. Statistical independence (between parts of the representation)
    3. Data continuity (neuroinspired, not sure what this means yet)

SCAN: Learning Hierarchical Compositional Visual Concepts Irina Higgins, Nicholas Sonnerat, Loic Matthey, Arka Pal, Christopher Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, Alexander Lerchner

Concept summary:

  1. Use beta-VAE to generate latents. Then arrange the learned latents into a hierarchy where composition and logic can be implemented.

SCAN takes visual data and uses beta-VAE to generate a latent space that is a discrete representation of the data. It then uses a decoder to generate images off of that space. Disentangling is controlled with beta. The latent space is fashioned into a hierarchy by set relationships between the latents. There’s a learned recombination operator that can improve SCAN’s ability to generate new concepts by recombining old concepts.

Neural Discrete Representation Learning Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu

Questions:

  1. What does autoregressive mean here?
  2. What is Vector Quantisation, why does it work, how does it help circumvent posterior collapse?

Concept Summary:

  1. VQ-VAE outputs discrete values.
  2. It uses ‘vector quantisation’ to avoid having the decoder ignore the latent variables output by the encoder.
  3. The prior is autoregressive and so is learnt rather than static.

Ideas:

  1. Remembering the thought that probability distributions should be parameterized and approximate rather than assumed to be Gaussian or whatever. Look at the empirical distribution and map it with a parameterized histogram or something like that.
  2. It’s sad that generative outputs are either continuous or discrete. Both will be missing an important category of representation.

VQ-VAE uses a simple, non-differentiable dictionary-learning algorithm called vector quantization to learn a high-quality, low-dimensional discrete embedding of latent variables. Then it uses PixelCNN (images) or WaveNet (audio) to generate high-quality samples off of that discrete representation. The abstractions learned by the discrete representation are consistently impressive.
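The quantization step itself is tiny. A minimal sketch of the nearest-neighbor lookup at the heart of VQ-VAE (names are my own; the codebook updates and the straight-through gradient trick are omitted):

```python
def vector_quantize(z, codebook):
    """Snap an encoder output z to its nearest codebook vector.

    z: a latent vector (list of floats).
    codebook: list of candidate embedding vectors of the same length.
    Returns (index, quantized_vector). The decoder sees only the
    quantized vector; during training, gradients are copied straight
    through from it back to z, which is what stops the decoder from
    ignoring the encoder's output.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]
```

Because the output is an index into a finite codebook, the latent space is discrete by construction, which is exactly the property the Philosophy section above wants.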

Deep Variational Information Bottleneck Alexander Alemi, Ian Fischer, Joshua Dillon, Kevin Murphy

3 concept summary:

  1. Use mutual information with the output as the loss function
  2. Impose an informational constraint on the representation
  3. Approximate mutual information loss with variational inference.
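The three points above correspond to the bottleneck objective and its variational bounds. A sketch in the paper’s notation, with q(y|z) the variational decoder and r(z) the variational approximation to the marginal (both assumed from the paper):

```latex
% Information bottleneck objective (maximize):
\max_{p(z|x)}\; I(Z; Y) - \beta\, I(Z; X)
% Variational lower bound on the prediction term:
I(Z; Y) \ge \mathbb{E}\big[\log q(y|z)\big] + H(Y)
% Variational upper bound on the compression term:
I(Z; X) \le \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(z|x)\,\|\,r(z)\big)\big]
```

Both bounds are tractable with the reparameterization trick, which is what makes the deep version trainable end to end.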

Deep Learning of Representations for Unsupervised and Transfer Learning (2012) Yoshua Bengio

Ideas:

  1. Apply inversion to all of the ideas in each paper. When it works well, you’ve discovered something you think is true that others disagree with. And if it’s a foundational assumption, you can get started on making progress.
    1. How do we know what we claim to know?
  2. High and low in the network gets conflated with high and low in abstract space. In that space, the most general features (those shared among many datapoints) are curves and edges, and the less general features are class-specific. Which are the more abstract? The recombination of the input features? Or the more general features? If abstraction is that which is shared among datapoints, then curves and edges would be the most abstract.
    1. By definition, concepts (like classes) are abstract. Lines and curves are concrete.

Notes:

  1. Machine learning is a wonderful example of abstraction, where all problems of predicting an output from an input can cleanly fall into an X, Y pair that is fed into an arbitrary algorithm.
    1. It’s pretty close to ‘function’ as a great abstraction, it’s a subset of that. It’s the subset where the utility of the function is making a prediction, or something nearby that.
    2. This example of abstraction is about shared structure between the problems as well as functional unity.
      1. Are there other implicit meanings in the word abstraction than shared structure and functional unity? Result of a compositional process?
  2. Bengio used to cite ICA! Independent components analysis! And then he stopped, and ‘invented’ disentangling!!! Now he gets cited instead of them!!!! We went from non-linear ICA to disentangling…
    1. I guess he does put it in the research section of the deep learning textbook…

Concept Learning via Meta-Optimization with Energy Models Igor Mordatch

Concept Summary:

  1. Represent concepts with an energy function; recombine concepts with an attention mask over the entities (where concepts are themselves entities!) that exist in an event, and sum them with one another as a representation for prediction.

Fixing a Broken ELBO Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif Saurous, Kevin Murphy

Concept Summary:

  1. Create a parameter to control the rate-distortion tradeoff; use it to improve discrete latent-variable modeling and to solve posterior collapse (where a strong decoder ignores the latents).

Measuring abstract reasoning in neural networks David Barrett, Felix Hill, Adam Santoro, Ari Morcos, Timothy Lillicrap


Source: Original Google Doc
