Book Manuscript

Category: Abstraction


First Draft

Definition

On definitions: what are definitions? There are many representations of that abstract concept. The practical version of a definition that I like is that each concept can be defined as a bundle of properties. And so abstraction has a bunch of properties that distinguish it:

  1. Conceptual rather than concrete
  2. General rather than specific
  3. Compressed representation
    1. Abstraction as hierarchical compression (of lower level objects)
  4. Representation that allows for predictive generalization out of range (both interpolation and extrapolation)
  5. Details have been removed (salient features have been kept)
  6. Shared structure between particular facts has been identified
    1. (Is it an abstraction / is it abstraction if we don’t label it?)
    2. Metaphor as implicit abstraction makes this distinction
  7. Higher up in a hierarchy

And when a process has some of these properties we’ll call it abstracting. The hard question is: which of these are necessary, and which are optional? Is the binary classifier for whether or not something is an abstraction whether it has a single one of these? Might it need all of these? Certainly, this body of properties is a surface from which you can explore a space of definitions, all likely useful in their own way and all close to what we mean when we say abstraction. Settling on one combination of properties with a set of rules like ‘these two are necessary, but if this third property is paired with this fifth property we can disregard the second property’ feels like a painful and terrible way to go about defining things.

What is the right way to think about things? I’d love to just introduce all of the properties and then leave the usage up to context. Or I could make a number of distinctions across the properties that tend to vary most often, giving a PCA-style set of principal-component (variance-maximizing) differences in usage. Each difference in usage would get a name that made the appropriate distinction (say, compressive abstraction, or labelless abstraction). I’d also make the distinctions between transfer and abstraction, and between metaphor and abstraction.

Many properties of the defined object are actually consequences rather than causes. You can then optimize the object for those consequences (say, optimize the way in which you abstract for generalization ability). But it feels unusual to define an object in terms of its consequences.

Need a word for the way that there’s a bundling of related traits into a single concept. Take ‘running’ as the general version of, say, many particular animals running. It could be seen as using legs to move quickly. It becomes a metaphor quickly - “I’m running my mind over this paper”, or “my mind is running back and forth”.

Examples of abstraction that violate each property:

  • Conceptual rather than concrete
    • In object oriented programming, the creation of an abstract class is a solid example of an abstraction that’s just as concrete as its subcomponents.
  • General rather than specific (This one is hard, may want to stick with it)
    • Can argue that classical mechanics is more abstract than quantum mechanics (because it abstracts away the details), but it is actually less general than quantum mechanics.
  • Compressed representation
    • Non-compressive abstraction… This is really strong. Can’t find one after ~2m.
  • Representation that allows for predictive generalization out of range (both interpolation and extrapolation)
    • Going too high usually kills predictive power. Say you think in terms of blocs of countries (which leads to assuming that the countries within the blocs, and their hierarchy of suborganizations, all want the same thing).
  • Details have been removed (salient features have been kept)
    • This is quite strong.
    • Multiplication is an example of the level of detail remaining the same. (Actually wait, multiplication eliminates the particular numbers involved… never mind.)
  • Shared structure between particular facts has been identified
    • You can abstract over a single datapoint, so no. But it’s likely that you won’t do it well.
  • Shared structure between particular facts has been labeled
    • Stores as an abstraction over the production / supply chain violate this. But they do create an easier-to-use, simpler, higher-level interface to the backend.
      • (Is it an abstraction / is it abstraction if we don’t label it?)
    • Metaphor as implicit abstraction makes this distinction
  • Creation of a hierarchy
    • Counterexample - properties are abstractions across animals…
      • Only if animals are your datapoints. If the concept of ‘claws’ is your property (the animal has claws) then you see each instance of claw as your lower level.
      • Learning = going down this hierarchy then going up (a conjugate action)
    • This also feels strong. You can almost always create a hierarchy.
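The abstract-class counterexample above can be made concrete with a short sketch (Python here, with the class names invented for illustration): the ‘abstract’ class is a real, inspectable code object, just as concrete as its subclasses.

```python
from abc import ABC, abstractmethod

class Animal(ABC):
    """An 'abstract' class that is nonetheless a concrete code object."""
    @abstractmethod
    def sound(self) -> str: ...

class Dog(Animal):
    def sound(self) -> str:
        return "woof"

# The abstraction itself is inspectable, just like its subclasses:
print(type(Animal))   # the metaclass abc.ABCMeta
print(Dog().sound())  # woof
```

The abstraction is enforced at the language level: instantiating `Animal` directly raises a `TypeError`, yet `Animal` is as much a first-class object as `Dog`.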

Implicit to this question is that the definition should be that which is invariant across all examples of abstraction. But this is overly focused on consistency, and not focused enough on a principal-component analysis of the properties that are most key to abstraction. (On the battle between intuition and rigor in definitions.)

If you care for rigor:

  • Details have been removed, keeping only the salient features
  • Do not pick conceptual rather than concrete (abstractions can be concrete too, say in programming)

If you care for intuition, ordered list:
  1. General rather than specific
  2. Shared structure (labeled)
  3. Conceptual rather than concrete
  4. Compressed Representation
  5. Details have been removed
  6. Representation that allows for predictive generalization

Justification for Ordered List

First, the things that were removed / didn’t make the list: metaphors as implicit abstraction is interesting (as a broader category of transfer exists, which perhaps should be the true title of this book, since we really care that we do transfer/induction successfully and not necessarily that we use abstraction to do it [whoa. I should think about that]).

General rather than specific is the most important defining quality.

On General rather than specific

This depends on a reference point of generality. Which means that the process of abstraction moves from the more specific to the more general, rather than being objective. Or this is a distinction - you may say that a category like ‘dog’ is abstract, but compared to what? Compared to base reality. But it’s extremely grounded in comparison to animal, or being, or object (of which it’s a sub-class). And so there are two versions of abstraction at work here - the conceptual (relative to the concrete), and the relatively more general (or broader) class of objects.

Types of shared structure (and so, types of abstraction)

  • Function
    • Leading to functional abstraction
  • Presence of a property / sub-object

How General are Conclusions Here about Abstraction?

Context context context. As with all languages, there will be fluid transition between definitions.

Terms, How People Talk about Abstraction

  • High Level vs. Low Level
  • Grain (Coarse Grain vs. Fine Grain)
  • Broad vs. Specific
  • General vs. Specific
  • “Broad brush”
  • Concepts, Conceptualization (Boyd, Dad)
  • Comprehensive Whole vs. Particulars (Boyd)

High Level vs. Low Level

There’s the ubiquitous, reference-dependent ‘high level’ and ‘low level’ type of reference, where the speaker has in mind some reference-class level (often contrasting the high using the low as a reference point, and using the high as a reference point to define the low).

This tends to lead to unnecessarily binarized thinking. Ideally the language would auto-capture that there may be a high level and a low level, but also a level higher than low but lower than high, a level lower than low, a level higher than high, and so on. The use of almost all of these terms relegates us to two levels by default. Though maybe that makes thinking easier for the reader. And perhaps it’s not binary but points to a gradient - more true for coarseness vs. fineness than for high vs. low.

Grain (Coarse Grain vs. Fine Grain)

Coarseness vs. fineness - literally true in the case of abstraction in image processing.

I really enjoy how this version implies a continuum of abstraction - it’s clear that you can become incrementally more coarse, or incrementally more fine. And so it’s appropriate for those situations where the abstraction is continuous. It’s also flexible, and can be used for discrete situations that are more or less fine / coarse than one another. But it struggles in another context, where there’s strong discreteness. Take the version of abstraction involved in creating a function, or in creating a variable instead of using scalar values. The metaphor (coarse vs. fine) starts to break down. This is in part because coarseness vs. fineness assumes that the type of abstraction is exactly the same throughout! In the metaphor, you only get more or less resolution. You never switch to a different type of abstraction. And it’s hard to model a binary discrete situation with this, where there are exactly two levels. May be nicer if there are more levels, but we also have to keep the type of abstraction the same.
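The image-processing sense in which coarse vs. fine is literally true can be sketched as block averaging - a minimal illustration, with the function name and block size my own:

```python
import numpy as np

def coarse_grain(image: np.ndarray, block: int) -> np.ndarray:
    """Reduce resolution by averaging each block x block patch into one value."""
    h, w = image.shape
    h2, w2 = h // block, w // block
    trimmed = image[: h2 * block, : w2 * block]  # drop edges that don't fit
    return trimmed.reshape(h2, block, w2, block).mean(axis=(1, 3))

fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = coarse_grain(fine, 2)
print(coarse)  # each entry is the mean of a 2x2 patch of the fine image
```

Each step up in block size moves you incrementally along the continuum the metaphor implies: the type of abstraction stays the same, and only the resolution changes.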

Broad vs. Specific, General vs. Specific, “Broad brush”

“If you ask a more specific question, I can give you an answer,” is what Scott Kominers would love to tell me. People love to use the term ‘broad brush’ to give themselves permission not to condition on subsets of populations, or to make general claims in a way that’s unrigorous. It’s both valuable and dangerous - dangerous in that they don’t expect to go into details on their claim, which makes these kinds of claims extremely hard to evaluate, verify, or argue against. Often arguing against one consists of picking a counterexample, which the person admitted would exist when they said the statement would be broad brush. It’s valuable in that these summaries across populations end up being critical for decisions that depend on the proclivities of a large number of people. And it becomes extremely difficult to reason about a space if the level of rigor required makes it hard for people to make claims that look true to them.

Concepts, Conceptualization (Boyd, Dad)

This points to the way that abstract objects often move from particular grounded objects to the immaterial concept of the object. This conceptual frame is often more general but more difficult to operate on - the features belonging to a particular instance of the concept, which would allow operations to be run over that object at a detailed level, are generally missing.

A conceptual understanding is an understanding of the way that ideas come out of base data, and often of the way that those ideas interact with one another. The implication is that you can operate over much more data by abstracting a substantial group of data into a concept and then having that concept interact with other concepts - improving at the conceptual level in a way that generalizes to every example of raw data that’s connected to the concept.

Comprehensive Whole vs. Particulars (Boyd)

There’s a wholeness of vision that is capable of considering the interactions of multiple high level objects. Those high level objects (required to see what feels like the whole) have to be constructed out of lower level components in ways that aren’t leaky or overly destructive to predictive ability.

Particulars (e.g., detail-orientedness) allow for interaction with the explicit instantiation of your comprehensive whole, usually put together with concepts.

General-to-specific (Boyd)

Scratch

Explanation of the Reversal

When transferring the notion of abstraction between ‘reductionist’ and ‘pure’ domains like math / computer science and ‘compositional’ and ‘dirty’ domains like concepts or deep learning, it’s important to realize that there’s a reversal in the direction that’s referred to as ‘higher level’ abstraction.

If you were implementing a DL hierarchy in CS, the most abstract layer would be the first layer, with its edges and curves (say in computer vision). That’s the layer that’s most general across all objects. It just gets more specific as the recombination occurs, in the way that less abstract classes get more specific.

If you fail to realize that this reversal exists, expectations around what can and can’t be transferred can become confused. In DL, the lower levels of the network are more likely to transfer. In CS, the more abstract classes are more likely to transfer between lower level datapoints. (though transfer has a different meaning in this context, as seen below - it’s a very restrictive ‘has these variables / properties / functions’)

Topics

  1. How General are Conclusions Here about Abstraction?
  2. Look at the way definitions decompose, and ask what knowledge does and does not generalize across them
    1. Functional Abstraction (We got here by other means)
    2. Modular Abstraction
    3. Property-based abstraction
    4. Physical Abstraction (composition of parts)
    5. Recursive Abstraction
    6. Temporal Abstraction (We got here by other means)

Notes

  • Here I can show the way particular definitions are attractors
  • I can discuss notions of similarity / dissimilarity between definitions
  • I can give examples of abstraction that fall under some definitions and not others
  • A demonstration of conflation, where a debate between supporters of differing definitions is had, motte-and-bailey and all

Similarity

Relationship of Abstraction to Similarity

Similarity is a foundational concept that forms the basis of the ability to compress information across objects. When objects are similar - whether in their properties, their constituent parts, or their function - it becomes possible to transfer information from one object to another via awareness of this shared structure. In the extreme case where objects are identical or equal, we can compress massively: we can throw out one of the objects and merely remember that it’s equal to the other. On the continuum away from equality, we lose compressive power, in that we have to trade the compactness of our representation of the object against the amount of new information that the object holds. As an example, it’s often much easier to store a deviation from an existing object than to construct an entirely new object. Say, these are like headphones, but without the wires (for bluetooth headphones), rather than inventing a new name entirely. This is a phone, but smart. That modification of an existing representation is transferable across people who have the existing representation, and is compact in the amount of new information that exists (only the deviation). In the same way that human attention is drawn merely to what changes in the environment (and, in general, things that are unchanging fade out of awareness), representing new information as a deviation from an old representation is a classic example of efficiently taking advantage of similarity.

Similarity Over Different Features

Similarity can operate over all the features of an object. It’s often the case that features are intercorrelated, and when enough features intercorrelate we tend to create a term or concept for that body of relationships. When objects are similar over one feature but not others, there’s often conflation or confusion when an attempt to generalize from the workings of one object to another fails.

We can think of similarity over different features as often having very different properties - being measured in different ways, allowing or disallowing transfer in different ways, etc.

Measuring Similarity

There is a body of metrics for measuring similarity - the classic example is equality. When two systems are equal, they reflect one another perfectly: there’s no information in one system that isn’t reflected in the other, and so we can do heavy compression. But equality is binary - the objects are either equal or not - and so it’s not granular enough to capture shared structure that is incomplete. Equality strongly limits the complexity of the objects that can be compared to one another, and so looser metrics are critical to modeling real systems.

Scratch

Difficulty in describing continuous similarity, distributional similarity. Similar difficulty in describing distinctions, or breakages of similarity in distributions. General difficulty of reasoning outside of discrete space. Need for cognitive fit.

Continuum of similarity:

  1. Discrete
    1. Equality / Identicality
    2. Property overlap
    3. Edit Distance
  2. Continuous
    1. Euclidean Distance
    2. Angles
    3. Cosine Distance
    4. KL Divergence
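A few of the metrics on this continuum can be written down directly. The implementations below are standard textbook forms, sketched in Python for concreteness:

```python
import math

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via dynamic programming (discrete similarity)
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, or substitution (free if characters match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cosine_distance(u, v):
    # continuous: 1 minus the cosine of the angle between vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1 - dot / norm

def kl_divergence(p, q):
    # asymmetric divergence between probability distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note how each metric presumes a different representation of its objects: strings, vectors, distributions. The choice of metric is inseparable from the choice of representation.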

Think about a distance metric between shapes. Humans have an intuitive sense of the similarity between shapes, as a part of a general ability to judge the similarity of any two objects in an undefined intuitive space. But writing down that metric is incredibly difficult. And so much of the mission of representation learning research is learning how to compare arbitrary objects to one another through a distance metric that’s learned from experiencing arbitrary data.

Similarity is at the core of all learning and cognition. On the neuroscience side, an example is the heuristic that ‘neurons that fire together, wire together’, connecting sets of data that are similar in that they occur close by in the time series that a mind experiences. It also leverages another notion of similarity: if a and b fire together, then previous patterns a’ associated with the firing of a will also become associated with b. This transitivity quickly becomes a wonderfully nuanced and complex form of similarity.

In developing knowledge, there’s a value to knowledge that will generalize. Memorization is fine if you know that the data you see in the future will be identical to the data that you’re seeing now. But there’s a continuum over the richness of similarity metrics, where memorization can be defined by the way similarity breaks down a short distance from the datapoints that have been memorized (or by an inability to return an answer when you’re a short distance away). We get beyond memorization by finding ways to map similar datapoints to one another, and through that connection drawing conclusions about the properties that those similar datapoints will share (this depends on the ways in which the objects are similar). This is stronger than memorization in that when you encounter new information that hasn’t been seen before, you can compare it to what you’ve already seen and use that history to inform your understanding of the new experience.

Often memorization looks like a lookup, which is fragile. If you don’t find something that’s exactly equal to what you’ve seen in the past, you can’t return anything. The more effectively you can connect a new experience to a vast array of old experiences, the more data you’ll have to draw on when you make inferences about it.
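The fragility of lookup versus the flexibility of a similarity-based answer can be shown in a few lines - a toy sketch, with the data and function names invented for illustration:

```python
def lookup(memory: dict, query):
    # memorization: fails on anything not exactly equal to a stored key
    return memory.get(query)

def nearest_neighbor(memory: dict, query: float):
    # a richer similarity metric: answer from the closest stored datapoint
    key = min(memory, key=lambda k: abs(k - query))
    return memory[key]

memory = {1.0: "low", 5.0: "mid", 9.0: "high"}
print(lookup(memory, 5.1))            # None - exact match is fragile
print(nearest_neighbor(memory, 5.1))  # "mid" - similarity generalizes
```

The lookup returns nothing for a query a short distance from a memorized point; the nearest-neighbor version degrades gracefully, which is exactly the continuum described above.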

One implication is that broadening the types of connection that you make between objects allows you to transfer more information and connect more types of object. Seeing more of an object’s properties creates a larger surface for comparison, enriching the kinds of connection you can make between your object and others.

As you move to a more expressive and nuanced similarity metric (ex. from binaries to a continuum - say, from whether a person is good or not to how good a person is), the distortions that come from conflating datapoints that are relatively close but not identical can disappear. But the tradeoff to eliminating that conflation is needing to store a much more complex representation of goodness for every person (and likely moving from an intentional, deliberate process for thinking about it to an intuitive one).

Properties of data that make certain kinds of similarity metric more relevant:

Imagine trying to use edit distance for semantics, where you tried to map words’ conceptual similarity to the number of changes you’d need to make to the letters in one word to get to another word. The lack of overlap between a word’s spelling and its meaning makes this notion of similarity irrelevant. But instead, imagine looking at the similarity of the contexts in which words appear - specifically, say, the number of occurrences of each word in a small 2-5 word window of surrounding words. Suddenly, words with similar contexts and meanings can be mapped to each other - ‘cat’ and ‘dog’ are often used with similar surrounding words, and so will be close to one another. ‘King’ and ‘queen’ are also used with similar surrounding words. And when they differ, they tend to differ in a way that has to do with the meaning of the words. So distance on this similarity metric is meaningful in ways that aren’t captured by edit distance. The design of distance metrics that capture the structure that matters for a task is extremely important.
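A minimal version of this context-window idea - a tiny invented corpus, counts of surrounding words as vectors, and cosine similarity between them - might look like:

```python
from collections import Counter
import math

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the king ruled the land . the queen ruled the land").split()

def context_vector(word: str, window: int = 2) -> Counter:
    # count the words appearing within `window` positions of each occurrence
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            counts.update(corpus[max(0, i - window): i] + corpus[i + 1: i + 1 + window])
    return counts

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

# cat/dog share contexts ("sat", "on"); cat/king mostly share "the"
print(cosine(context_vector("cat"), context_vector("dog")))
print(cosine(context_vector("king"), context_vector("queen")))
```

Even on this toy corpus, ‘cat’ lands closer to ‘dog’ than to ‘king’, while ‘king’ and ‘queen’ coincide: distance in context space tracks meaning in a way edit distance cannot.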

Implication: languages need to merge semantics with representation. A language that merged its word / letter representation with the actual meaning of the words (where, say, base sounds / letters represented the principal components of learned word vectors) wouldn’t require memorization to learn, because the mapping from concepts to reality would be on a continuum and its meaning would be grounded in the language itself rather than being an arbitrary mapping from a word to a concept.

Ideally you’d create a meaning-to-text mapping so good that whatever a writer or speaker of a language drew or spoke in response to their felt meaning could be interpreted by anybody. Pictorial representations in Chinese are a step in this direction - you can begin to infer the meaning of a character just from its appearance. The English alphabet is an arbitrary (and therefore difficult to learn and use) symbolic layer between felt meaning and representation in language.

Generator: Why does language learning require so much memorization? In general, needing to memorize indicates that you’re using a degenerate distance metric and that your learning will fail to generalize.

Invariance

Invariance is a very strong form of shared structure. To the degree that you can see similarity between experiences, you can auto-generate the world of other experiences that you haven’t had but are capable of generalizing to.

These strong examples of transfer show up across the human visual system. Scale invariance, or the ability to recognize a face, for example, at many different distances, is an example of transfer through invariance. Properties like these allow a person to see a face once, and then be able to recognize it at many angles of rotation and at many distances, in many parts of their field of vision.

Invariance as a pattern - if you can recognize the pattern under which the transformation is invariant, your learning speed increases dramatically. If you never see the pattern, you can require many orders of magnitude more experience in order to learn (because you fail to auto-generate the world of other examples that come out of perceiving the invariance). This is a very strong and wonderful example of transfer.

In machine learning, the use of a simple invariance like translation invariance in convolutional neural networks dramatically improved the performance of computer vision systems, which alongside large datasets in many cases reach superhuman levels.
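The mechanism can be sketched in one dimension: sliding a shared pattern across every position (cross-correlation) and keeping only the strongest response makes detection indifferent to where the pattern occurs. A toy illustration, not a full CNN:

```python
import numpy as np

def detect(signal: np.ndarray, pattern: np.ndarray) -> float:
    # slide one shared pattern across every position of the signal,
    # then global max pooling discards *where* the match occurred
    responses = np.correlate(signal, pattern, mode="valid")
    return responses.max()

pattern = np.array([1.0, -1.0, 1.0])
a = np.array([1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # pattern at the left
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0])  # same pattern, shifted
print(detect(a, pattern), detect(b, pattern))  # equal: detection is shift-invariant
```

Because the pattern’s weights are shared across positions, seeing the pattern at one location suffices to recognize it at every other - the auto-generation of unseen examples described above.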

Similarity 1.0

What is this ‘shared structure’?

Underlying the concept of ‘shared’ is a notion of similarity.

Closeness to equality.

Question: Are all forms of similarity captured by ‘how much you have to change the object to get equivalence’, where equivalence is ‘these objects are the same’?

Topics:

  1. Relationship of abstraction to similarity
    1. One type of abstraction identifies shared structure across objects and compresses it into a single concept or abstract object.
      1. Similarity as existing over different features
    2. Type of similarity as a function of the nature of the thing being compared
  2. Types of similarity measures
    1. Discrete:
      1. Equivalence
      2. Edit Distance
      3. Number of properties in common
      4. (Having a property in common is similarity over that property)
    2. Continuous:
      1. Cosine Distance
      2. Euclidean Distance
      3. KL Divergence / Cross Entropy
      4. Wasserstein Distance
      5. Hinge Loss
    3. Generating Similarity Metrics
      1. Concept Representations
        1. Word Embeddings, same up to angle
      2. Networks
        1. PCA over learned representation
  3. Types of Similarity
    1. Have the same function / accomplish the same task, consequently
      1. Laptop is actually a functional abstraction, more than a compressive technique over sub-parts.
        1. But it also cares about the sub-parts… the Surface and iPad are functionally similar but are called ‘tablets’ instead of ‘laptops’.
    2. Use the same mechanism
    3. Have the same property
      1. Ex. shape, color, density
    4. Have a set of shared properties
  4. Question: Are all forms of similarity captured by ‘how much you have to change the object to get equivalence’, where equivalence is ‘these objects are the same’?
  5. Cognitive Fit, human notions of concept similarity

Topics

  1. Learned / adapted similarity measures / Metric Learning
  2. Gardenfors’ geometry
  3. Tversky’s set theory
  4. Old geometrical approach (Carnap)

Notes

Transfer

Abstraction is one way to map shared structure from one problem / solution / dataset to another. Transfer is what allows for efficient modeling and learning: upon encountering a situation, an intelligent being or system can bring patterns seen in data and problems from the past to bear on the new situation, and so act as if it has already seen the ‘new’ situation before, informed by situations like it. Abstraction involves identifying a property or pattern that exists (usually across many datasets / problems / examples) and then naming that pattern in a way that is general across the examples (and so which you would expect to generalize to similar examples).

There are forms of transfer that don’t move from the specific to the general, but instead relate the specific directly to the specific - one quality example of this is metaphor. Metaphor works by pointing out that one object is like another object, without necessarily spelling out exactly in which way it is similar (often, the metaphor will draw upon many intuitive and emotional properties simultaneously). For example, take the metaphor ‘she has a heart of gold’. At one level it’s asking you to feel about her heart the way that you feel about gold. In abstract terms, you could say that the sense of awe you feel at gold’s lustre, rarity, and globally recognized value should also be felt about her heart. The way that metals can’t be changed, the strength implicit in them, the sense of being able to depend on them, and (more importantly) the body of felt associations that any person moving through the world has with gold all get transferred to this different object, the heart.

Abstraction does transfer through surprisingly similar mechanisms. Over time we build up associations between concepts and the properties and emotions that are experienced in tandem with those concepts. When we bring a concept (take an example from above: strength, or rarity, or metal) to bear in a situation, the properties associated with that concept are naturally brought to awareness. The difference is that in abstraction, an explicit label is given to the set of shared properties. Take ‘strength’. What do strong things have in common? There’s an element of power, robustness, capability - there’s transfer between everything that has been described that way in the past and every other object with that description. And now we have an abstract object, strength, which holds that bundle of properties. This natural build-up of descriptive transfer allows us to use language to connect multitudes of experiences with one another without ever explicitly comparing them to each other. We merely have to experience the same abstract object in each context, and the transfer will follow naturally from using that abstract object to describe a new situation.

Transfer of solutions can be particularly powerful: if you realize that some abstract descriptions of problems share solutions, then finding a way to represent your problem in the way that others are represented often gives you a surface for problem solving and decision making that didn’t exist before.

Transfer is at the core of induction itself. And even in deduction,[s] the rules that it makes sense to trust are trusted for their demonstrated ability to yield correct predictions in an inductive sense. Much of thinking is pattern matching between situations you’ve seen in the past and the current situation, or is intuiting about a situation by calling upon implicit knowledge that you have gained across time or that has been built into you through evolutionary transfer, where your instincts have been finely tuned for (say) social environments, so that you can pick up on leagues of implicit knowledge emotionally through a set of ancient, natural instinctual responses. Fight, flight, or freeze responses are a pattern match to situations where that response saved the lives of the beings making it in the past, and so are present as a form of transfer from situations where past stimuli were similar enough to current stimuli to trigger the same response.[t]

Machine learning algorithms are all driven by transfer. The goal is to observe some set of datapoints, and on that basis find a set of patterns that will allow knowledge about new datapoints to be inferred. The[u] process of transferring knowledge can be driven by different notions of similarity (which is the basis on which you expect the transfer to succeed or fail). Often algorithms will weight the transfer by the degree of similarity to other datapoints, e.g. kNN. That set of principles is general across all learning algorithms, and while in machine learning ‘transfer’ usually refers to between-dataset transfer, each algorithm’s focus is this within-dataset transfer for generalization.
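The within-dataset version of transfer can be made concrete with a toy sketch (illustrative code, not from any library; the names and numbers are my own):

```python
# k-nearest neighbors as transfer: a new datapoint inherits knowledge
# (here, a label) from the observed datapoints most similar to it.
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Transfer the majority label of the k most similar datapoints."""
    # Similarity = small squared Euclidean distance.
    order = sorted(
        range(len(train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], query)),
    )
    # The transfer step: the query inherits its neighbors' labels.
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["low", "low", "high", "high"]
print(knn_predict(train, labels, (0.95, 0.9)))  # "high", transferred from nearby points
```

The entire algorithm is the choice of similarity metric plus the decision to let similar points transfer their labels; everything else is bookkeeping.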

Examples of Transfer

  1. Taleb, Power Law Distributions from the Black Swan
  2. Taleb, Volatility and Growth from Antifragile
  3. Entropy (Communication Theory, Thermodynamics, Probability Distributions)
  4. Eigenvectors
  5. Differential Equations
  6. Economic Laws (esp. Game Theory)
  7. All of math? Not sure how to frame this, the concepts don’t fit nicely.
  8. Graph Analysis (Linked, Degree Distributions, etc.)

Fields by Transferability

There’s a sense that the knowledge it’s most important to gain first is the knowledge that can be re-used to successfully model other domains. By proceeding from generality to specificity, the learner will speed up dramatically.

Very rough version of fields ranked by transferability:

Reality:

  1. Mathematics
  2. Probability Theory / Statistics
  3. Philosophy (especially Logic, Epistemology)
  4. Physics
  5. Economics
  6. Computer Science
  7. Engineering
  8. Biology
  9. Chemistry

Social Reality:

  1. Psychology
  2. Folklore & Mythology
  3. History
  4. Classics
  5. Sociology
  6. Linguistics
  7. Religion
  8. Anthropology
  9. Government / Law
  10. Gender / African American Studies

A natural thought is to start learning with mathematics and statistics, immediately acquiring the parts and the modes of thought from which the other fields borrow. Much nicer to learn principles that generalize across fields (say, statistical testing, which is used as the standard of evidence across umpteen sciences) and then have that body of thoughts at hand when learning the details of a specific discipline.

The challenge in learning something overly specific first is that when the learner moves into a new space, much less of their knowledge will transfer, making the move less likely to happen in the first place and emotionally unpalatable. There’s also the challenge of contributing in ways that are valuable via transfer when drawing from a body of ideas that tends to be overfit to its domain.

What is transfer?

Great examples of this are planning, pattern recognition (transfer between datapoints)

In planning, there’s a hierarchy of tasks over which you can do transfer. Say you need to save someone just hit by a car by getting them to the hospital.

At a high level, there’s a transportation problem. It can be instantiated in a few ways - ambulance, car-transport, etc. Ambulance breaks down into calling 911, identifying your current location (orienting to the street signs), communicating with the emergency operator, etc. Those are composed of lower level actions which often disappear beneath conscious awareness, say taking out a phone, opening the interface, opening the phone app, pressing the keys, putting it to your head, etc. This process is modular and transfers across domains. Regardless of who you’re calling, when you’re calling, the motions are the same. And so you think in terms of the abstracted motion (call this person / number).

Implicit Transfer

Metaphors are examples of transfer that at first look to be free of abstraction, but since they draw their power from shared structure between the domains, we can see them as using abstraction implicitly.

One result is that metaphors are a source of explicit abstractions, which can then extend representations and solutions to their farthest reaches. Often the emphasis is on emotional transfer, and so the metaphor communicates more than pure information. But that transfer matters too - communication is extremely high dimensional, and the felt dimensions are more important for compelling action and building understanding than pure information is. More on this in engineering abstractions.

Take a classic Christian verse, “we are the clay, you are our potter”. There’s implicit abstraction around agency vs. the acted-upon, intentionality, moldability, and craftsmanship, as well as the load of connotations that go along with the profession and the nature of the objects.

Giving examples as grounding for an abstraction is common practice. Often more than one example is required to pull generalization off properly: any single example has details that don’t generalize, and a second or third example makes clear which details can be omitted from the transfer.

Glorious Heights of Representation

There’s power to the heights of abstraction. As the height grows, the number of objects being acted over grows. There are a huge number of nodes in a belief structure, and the impact of information grows as it becomes relevant to every node in that structure.

Say you learn something about capitalism. All of a sudden your model of every single transaction that has occurred across time, and of the implications of those transactions, updates. There’s power here insofar as one can take a new concept and notice the breadth of implications that come with it.

The way to generate power from lower level information is by abstracting. Say you notice that when someone takes a lower-paying job they don’t have as much time to apply and interview for higher-paying jobs. This is a nice insight which may help you make that tradeoff more effectively in your life, simply by being aware of it. But the power comes out of abstracting this insight to opportunity cost, and realizing what the implications of opportunity cost are for the application of every finite resource (time, attention, energy, money). Even ‘resource’ in that previous sentence is an abstraction over a world of objects that often needs to be explicated if the impact of an update to that abstraction is going to be understood, experienced, and lead to a better planning process.

Pitfalls

Transfer and Generalization

  1. Overfitting / Bias-Variance Tradeoff
    1. Grab not enough datapoints, create a complex model and assume that it will transfer cross-domain
    2. As the level of abstraction rises, there’s more data to be seen and so it’s easier to create a biased sample accidentally
    3. Large data also allows you to more accurately validate the abstraction
    4. Magic numbers… everything feels overfit
    5. Each Abstraction has to trade generality for nuance
      1. Without complexity (conditionals, context dependence, additional features / factors)
  2. Forming an abstraction in a way such that it fails to transfer
    1. Ex., Statistics / Probability Theory
    2. Wrong level of analysis to see across many tasks

Scratch

Transfer, Across Definitions Idea List

  1. Idea list this!
  2. Compression transfers
  3. Operating recursively sometimes transfers
  4. Shared Structure
    1. Pure abstractions, CS & Mathematics
    2. Which properties transfer across these definitions through examples
    3. Map / Territory

Transfer of properties of abstraction across definitions

Definitions - is-a relationships (math, object-oriented) vs. hierarchical compression (in my mind just unidirectional, but what we care about is shared properties). Is abstraction necessarily virtual? How virtual - do we execute it to be consistent with the data generating process and so map cleanly to the territory, or is a map clearly apart from the territory that we don’t expect to adhere to things perfectly? Different expectations of abstraction in math/physics/cs vs. the dirty version everywhere else.

  1. Shared between math / physics / cs abstraction (is-a relationship, purity, consistency with territory, always recursively executable)
  2. Hierarchical Compression

Both involve:

  1. Compression
  2. Level of analysis at which to interact
    1. Though math / physics / cs will throw category errors if you try to interact at the wrong level of analysis
  3. Heights of abstraction operate over more information
  4. Leakiness
    1. A problem in (and defined by) CS APIs. APIs as an abstraction over implementations.
    2. Though perhaps worse in hierarchical compression, where there’s so much more behind the abstractions.

Compression

  • Compression lowers working memory requirements
  • Compression lets you store more

Lower working memory[v] requirements[w] for compressed objects mean that more information can be fit into working memory. The reality is that most thinking requires that you put multiple objects into slots in working memory and then watch them interact with one another, using the recombination of those objects, their properties, and their relationships to generate insights or plans. With a compressed object representation, something that may have taken up three, four, or more slots in working memory can take up just one. (Functional abstraction is particularly useful in this context.) That means that at a high level of abstraction you can think through the interactions of hundreds or thousands of sub-objects and sub-sub-objects while holding only a few high level objects in working memory.

Compression also likely has deep impacts on long term memory. There’s a sense in which total storage space is dramatically improved by not having to remember the details of situations, by representing the memory conceptually. There’s also the sense that recall is generative, with memories in effect stored as a time series of neural activations. This further compresses the pure capacity necessary for memory.[x]

On the relationship between compression and generalization:

Generalization is driven by correct pattern recognition. Compression is not directly related to generalization; it's indirectly related, through pattern recognition. When you do efficient compression, you often accomplish it by discovering patterns in the data. If those patterns occur in other similar data, you can force a model to generalize by having it compress its representation, so that it reuses the patterns that made for efficient compression.
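The compression/pattern link is easy to demonstrate with an off-the-shelf compressor (a toy illustration; the exact byte counts depend on the compressor, so take the numbers as directional):

```python
# Data with discoverable regularity compresses far better than patternless
# data: the compressor's gains come entirely from pattern recognition.
import random
import zlib

random.seed(0)
patterned = bytes(i % 7 for i in range(10_000))               # a repeating pattern
patternless = bytes(random.randrange(256) for _ in range(10_000))  # no structure

print(len(zlib.compress(patterned)))    # tiny: the period-7 pattern is found and reused
print(len(zlib.compress(patternless)))  # near 10,000: nothing to exploit
```

Ten thousand bytes collapse to a few dozen when a pattern exists, and barely shrink at all when none does.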

Hierarchical Compression Idea List

  1. Compositionality
    1. Curse of dimensionality
    2. Width-wise and depth-wise compositionality
  2. Recursivity
  3. Unifying lower level structure vs. breaking down higher level structure
  4. Read Bengio on hierarchical compression

There’s amazing power to hierarchy, in that it can introduce relatively complex models that tame an extremely high dimensional and complex world.

A classic example of hierarchical compression comes from vision, where it’s used to create state of the art computer vision systems. Images at a low level can be filtered for lines and curves - in a cat’s mind (and likely ours) the visual system lights up in the presence of edges in a way that it doesn’t for other extremely low level shapes. Those edges and curves can be re-composed into shapes. And those shapes are composed into parts of objects, which can be composed again into higher level objects. These all expose grains at which to interact with reality.

That’s an example of bottom up hierarchical compression, but we’re also familiar with going top down - being introduced to a high level concept (say, in the form of language) and over time needing to fill in the details.
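The lowest rung of that bottom-up hierarchy can be sketched in a few lines (a hypothetical toy image; real vision systems learn whole banks of such filters):

```python
# A toy vertical-edge detector: take horizontal differences of pixel
# intensities. The response is large only where intensity changes
# left-to-right, i.e. at a vertical edge.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

edge_response = [
    [row[j + 1] - row[j] for j in range(len(row) - 1)]
    for row in image
]
print(edge_response)  # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
```

The filter fires only at the boundary between the dark and bright halves; those firings are the compressed, lowest-level features that shapes are then composed from.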

Scratch

You don't need the data if you have a hypothesis / model that accurately generates the data. (Oh, this is connected to generative memory in human minds... still underexplored in dl)

Models are about being able to throw data away in memory, and remember the lesson / pattern that the data represented.

Having a hypothesis for the data generating model is one extremely effective way to do compression. A potentially infinite number of datapoints, replaced by a simple data-generating function.
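That replacement of data by a generator can be made literal (a toy sketch with made-up numbers: a thousand points on a line collapse to two fitted parameters):

```python
# Compression via a data-generating hypothesis: instead of storing every
# point of y = 3x + 2, store a two-parameter model fit to the data.
xs = list(range(1000))
ys = [3 * x + 2 for x in xs]  # the raw data: 1000 stored values

# Closed-form least-squares fit for a line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The compressed form: a generator that reproduces any point on demand,
# including points that were never stored.
def regenerate(x):
    return slope * x + intercept

print(round(slope, 6), round(intercept, 6))  # 3.0 2.0
```

The thousand datapoints can now be thrown away; the hypothesis regenerates them, and extrapolates beyond them, from two numbers.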

One way to think of data generation is as generating one or more features conditional on seeing a subset of the features. And so you can know the density distribution of the data, and so compress the data by not needing to know most of the features for most datapoint (you can effectively infer all other information).

I love the frame of machine learning algorithms as algorithms that output (hopefully simple) hypotheses that are consistent with the data. It’s refreshingly meta, a way of thinking that doesn’t take for granted the assumptions that our algorithms make. I’ll try to spend more time in this frame.

Inverting Driven by Compression Progress

  1. A relatively powerful contrarian truth generator - go line by line through a research paper and invert each statement, assuming the opposite or that it is false or that it isn't the only path, and then generate 1-2 arguments in favor of the inverse or opposite or an alternative for each line. See what survives (and so is stronger than you had realized) and what is flimsy. And simultaneously generate hypotheses that can generate experiments that offend the paradigm and actually update collective beliefs.

A reaction to this abstract: “I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, nonarbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems.”

  1. I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful.
    1. One issue is that in understanding the data you get bored by it. But he’s pointing at the time before you’re bored, as you’re understanding it.
    2. Data is interesting when you realize you can’t compress it, or it breaks your model. It’s also interesting as you successfully model and integrate it, but the interestingness isn’t in the compression, it’s in the violation. And the most interesting data is that which is incompressible, or which has no explanation, upon which religions are built.
      1. And just like that, I discovered a contrarian truth! (Relative to schmidhuber)
  2. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon (what sense is this?) but in the sense that it allows for compression progress because its regularity was not yet known.
    1. Because you can’t tell what’s non-random in advance, it’s actually easy for humans to find patterns in the random (and be more fascinated by it than by things which truly do have regularities)
      1. But I agree that he has what the definition ‘should’ be, and this is merely a failure, a maladaptive misfiring of the curiosity system
  3. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve.
    1. Beauty is often about complexity and awe as much as it is about simplicity and structure.

Compression as a frame for prediction in physics: “If the history of the entire universe were computable, and there is no evidence against this possibility, then its simplest explanation would be the shortest program that computes it. Unfortunately there is no general way of finding the shortest program computing any given data. Therefore physicists have traditionally proceeded incrementally, analyzing just a small aspect of the world at any given time, trying to find simple laws that allow for describing their limited observations better than the best previously known law, essentially trying to find a program that compresses the observed data better than the best previously known program. For example, Newton’s law of gravity can be formulated as a short piece of code which allows for substantially compressing many observation sequences involving falling apples and other objects. Although its predictive power is limited—for example, it does not explain quantum fluctuations of apple atoms—it still allows for greatly reducing the number of bits required to encode the data stream, by assigning short codes to events that are predictable with high probability under the assumption that the law holds. Einstein’s general relativity theory yields additional compression progress as it compactly explains many previously unexplained deviations from Newton’s predictions.”

“You can imagine the ultimate compression, where we speech to text your message, send it as a zipped string, and recreate the audio with your voice on the other side.”

Josh - values could be written as programs, ... Juergen - we have to align the rewards with the rl framework. It will compress information, everything that is regular and beautiful corresponds to compressibility. That will become boring eventually. You're saving bits as you encode compression. That goes straight to the reward maximizer.

Nando - question: why are our symbols - language and code - discrete? Juergen - at some point we invented language, which forced us to go through this channel. Our little RNNs compress it all the way down so that we can describe patterns. Josh - let me disagree. Language starts right before writing. Discreteness exists in the world. At the common sense grain that we interact with the world, there are objects and discreteness. At the grain we interact with the world at, that's the representation. I think this is important, because people think they only have to deal with symbols if they're dealing with language. But objects exist, and interacting with the world demands it. Juergen - I agree, but that's just a byproduct of compression. Josh - that paper of yours was inspiring. I agree. Don't know if we get values from it, but yes. 4 - discreteness had to do with object independence... this question comes up with nns replacing discrete structure. There's always going to be discreteness in things. Without a discrete foundation, you cannot grow as much.

Topics

  1. React to Driven By Compression Progress
  2. React to Information Bottleneck

Composition & Decomposition

Conceptual Decomposition on Composition (so meta!)

What are all of the ways in which the concept is used? One tempting place to start is with the types of objects that can be composed. Neural outputs in deep learning are the obvious place to start: each neuron in the next layer linearly composes a set of neurons from the previous layer (for feedforward and recurrent nets). After the values are summed, a nonlinear activation is applied. This is width-wise composition, and it allows varied combinations of the inputs to be represented at the next layer. The second type of composition in neural networks is depth-wise, where the functions that compute new layers are nested, leading to many levels of transformation. That functional composition gives the model an ease of expressivity.
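A minimal sketch of those two compositions side by side (the weights are arbitrary illustrative numbers, not a trained model):

```python
# Width-wise composition: each output neuron is a weighted sum of all
# inputs followed by a nonlinearity. Depth-wise composition: layers are
# nested function applications, layer(layer(...)).
import math

def layer(weights, biases, inputs):
    # Width-wise: every output neuron linearly combines all inputs,
    # then a nonlinear activation (here tanh) is applied.
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def network(x):
    # Depth-wise: functional nesting gives many levels of transformation.
    h = layer([[0.5, -1.0], [1.2, 0.3]], [0.1, -0.2], x)
    return layer([[2.0, -0.7]], [0.0], h)

print(network([1.0, 0.5]))  # a single output in (-1, 1)
```

The same `layer` function does the width-wise work at every depth; the depth-wise structure is nothing more than nesting it.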

Those are two types of composition in machine learning. This version is relatively clean and mathematically tractable. Another extremely clean form of composition is the composition of atoms into molecules, and molecules into macro-molecules. We can see matter as structured composition.

But there are much dirtier versions as well. Conceptual composition, where we form sentences like this one through the combination of a body of meaningful concepts, is much harder to formalize but clearly is an extremely effective method for communicating meanings.

Composition in this way is related to abstraction as hierarchical compression, where a high level meaning is the combination of more basic meanings. But it’s also tied to abstraction as shared structure, where the words being re-used are porting over different facts and connotations depending on their past usage, and so accomplish transfer. This happens at the sentence level as well, where sentences with similar grammar form the same shape of meaning as one another.

How is the way we use the concept misleading? One easy answer is the varied level of structure that can exist in a composition. One extremely common use of the concept is to say that some object is a composition of other objects, say that a pen is a composition of a cap, a shaft, an ink cartridge and a tip. These parts are indeed combined to create the pen; it’s a basic part-to-whole notion of composition. There’s no recursive composition of similar molecules at multiple levels of scale. The pattern to the combination of parts is ad-hoc and specific to the purpose of the pen, rather than being grounded in innate structure or intended to be extremely general. The parts have interfaces to one another (for example, the cap fits the shape of the top of the body of the pen), but those interfaces are also very specific to the object in question and not general across objects. There’s a shallowness to this form of composition, and so it fails to share the power that comes out of the patterned composition we find in neural networks or in the atomic basis of matter.

What other valuable conceptual schemes are we pushed off of? The obvious answer is that conflating recursive composition with shallow composition reduces the felt power of the concept. There’s an expressivity to the interchangeability and depth that come out of replaceable parts with flexible, modular interfaces. That expressivity tends not to exist in all things that are composed of sub-parts.

We’re pushed off of many of these concepts. Modularity, for starters. Where we have modular and non-modular composition. This draws our attention to the interfaces between the sub-parts in our decomposed object. It asks how general their interfaces are, and how those interfaces scale with the composition. As the flexibility of the interface grows the expressivity of the composition also grows. Specificity in the interface can allow behaviors that are unreachable otherwise, but the question is whether that specificity is hard-coded or comes out of a particular composition of sub-parts itself.

What is insufficient about the concept?

What gives the concept its value? Simplicity. The parts from which the heights are composed can be easy to understand and simple. Then you specify an interface for their interaction with one another, and an algorithm for exploring the space of compositions that satisfy some objective. Quickly you can get extremely complex behavior out of a system whose parts are deeply well understood. Composition is a path from simplicity to complexity, without the designer having to interface with the complexity themselves.

Levels of abstraction. Composition allows for the interaction with a complex system at multiple levels of analysis with similar tooling. Take function composition in computer science as an example. A function with some high level behavior (say, sorting) can be interacted with, without having to deal with the layer below it (a particular sorting algorithm) or layers far beneath it (the bits on the computer that need to be manipulated). The function can be interacted with at its most practical level, in the form of what the function accomplishes rather than how it’s encoded or even how its particular implementation works.
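The sorting example can be sketched directly (the function names here are my own, for illustration):

```python
# Callers interact with what `sort_names` accomplishes, not how. The
# algorithm beneath can be swapped freely without touching any caller.

def sort_names(names):
    """High-level behavior: return names in alphabetical order."""
    return sorted(names)  # implementation detail, invisible to callers

def insertion_sort_names(names):
    """Same abstract behavior, different implementation underneath."""
    out = []
    for n in names:
        i = 0
        while i < len(out) and out[i] < n:
            i += 1
        out.insert(i, n)
    return out

names = ["carol", "alice", "bob"]
print(sort_names(names))            # ['alice', 'bob', 'carol']
print(insertion_sort_names(names))  # identical: the abstraction holds
```

Both functions present the same surface at the practical level of analysis; everything below that surface, down to the bits being manipulated, is free to vary.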

So what can be said about composition itself?

Abstraction is how you construct a compositional conceptual hierarchy, and so benefits from compositionality are akin to benefits from quality abstraction. The answer generally given is informational efficiency. Recombination of abstract concepts allows you to hit so many more possible meanings than having a particular concept for each importantly different object.

The conceptual building blocks automatically allow for changes at lower levels to impact many higher level recombinations of concepts.

Compositionality makes rapid learning possible through flexible generalization, without a need to see data for every recombination of parameters in an environment.

There’s a generality to decomposition + recombination for problem solving.

Creativity through recombination of existing concepts.

Compositionality is at the core of productivity in building objects, building software, building ideas, almost all creation.

Easier generalization, as there’s much more data informing a sub-concept in a conceptual hierarchy since you can draw from every instance where the sub part exists (abstraction extends dataset size)

Existing models of new data that can be decomposed into existing parts.

Decomposed problem / goal representations

  1. Combinatorial Representational Capacity
    1. Broadens space of possible meanings through finite concepts
  2. Elegant Updates
    1. Updates to a sub-concept flexibly updates all recombinations with that concept, capturing shared structure between more specific concepts that include the sub-concept
  3. Speeds learning through efficient generalization
  4. Creativity - creative solutions often take the form of recombinations of existing concepts
  5. Decomposition + recombination for problem solving
  6. Abstraction allows for learning about sub-concepts, broadening dataset to all instances of the sub-concept
  7. Stronger Generalization - existing models of new data that can be decomposed into existing parts.

Examples:

  • Building Objects
    • Car / Bike / Train = composing wheels + frame + engine / power
  • Building Software
    • Statements, variables, conditionals compose to create functions / classes, which compose to flexibly and consistently solve software problems
  • Building ideas
    • All of language as concept recombination, each sentence merging composable meanings.

The project of efficiently representing unbelievable amounts of knowledge generally relies on finding a body of recombinable patterns which can flexibly represent an extremely wide array of objects. Atoms up through objects work this way. Human and programming languages work this way. This is the most likely generative model for our universe.

Decomposition is inverse abstraction. You can do extremely direct transfer between them.

Linearity (sum of parts) as allowing for decomposition and recombination to allow for solutions in linear problem solving that are difficult in non-linear systems.

The types of decomposition correspond to types of abstraction -

  • Functional Abstraction (We got here by other means)
  • Modular Abstraction
  • Property-based abstraction
  • Physical Abstraction (composition of parts)
  • Recursive Abstraction
  • Temporal Abstraction (We got here by other means)
  1. What kind of x’s exist?
    1. Ex, what kinds of deconstruction are there?
  2. What properties does x have?
    1. Elimination over the set of properties as fodder for new similar but different objects
    2. Take love. Say there are 7 types of love, all of which share some properties and not others. The deconstruction creates nuances, but also creates a space of possibilities where each parameter in the space is turned off or on (if it’s a binary property), or can be in one of many states (if it’s discrete non-binary) or can vary along its continuum.
  3. Modular Decomposition
    1. How does x accomplish its purpose?
      1. For each step in that causal graph, create a node
  4. Functional Deconstruction
    1. How does x accomplish its purpose?
      1. For each step in that causal graph, create a node
        1. Ask what the function of each node is (why it is there)
        2. Generate new options that accomplish the same function
  5. Physical Decomposition
  6. Recursive Decomposition
  7. Temporal Deconstruction
    1. Often a process has a dependency set that implies a temporal ordering.
    2. Recombining the orderings of a process can dramatically change it.
    3. Parallelizable processes can be even more valuable to deconstruct, because they can be scaled arbitrarily.
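The property-space idea from the deconstruction of love above can be sketched directly (the three binary properties are illustrative, loosely borrowed from Sternberg's triangular theory of love, which is one way to recover the "7 types" count):

```python
# Deconstructing a concept into binary properties turns it into a
# searchable space of variants: 2**3 = 8 on/off configurations, and
# dropping the all-off case ('non-love') leaves 7 types.
from itertools import product

properties = ["intimacy", "passion", "commitment"]

variants = [
    {p: on for p, on in zip(properties, bits)}
    for bits in product([False, True], repeat=len(properties))
]
types_of_love = [v for v in variants if any(v.values())]
print(len(variants), len(types_of_love))  # 8 7
```

The deconstruction both names the nuances and, as the text above says, opens a space of possibilities: each parameter can be toggled, and with discrete non-binary or continuous properties the space only grows.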

Challenge in decomposition:

  1. Incomplete decomposition, missing important sub-categories, makes the model feel clearly broken (even if it comes close to capturing the principal components for many goals)
    1. OMG, we need a prediction focused PCA which balances the goals of variance maximization and maintaining predictive capacity
      1. I guess that this is what LULZ is supposed to be
    2. Ex., Intelligence -> analytical, creative and practical intelligence (Sternberg)
    3. There’s this immense harm in thinking that knowledge representations need to be literally true. Sternberg’s model may be the most useful for many tasks, efficiently making the tradeoffs that are necessary for any model. Bringing a strict standard of truth to it and evaluating it on that basis fails to respect the reality that there are multiple objectives for these models.
    4. We end up in a world where mathematics and data are all that survive the demand for purity, a world where making conceptual progress is impossible because all concepts can be destroyed by our standards.
      1. Yet every day, we live by these concepts, we think with these concepts. And we’re neutering the process by which we improve them.

Generalization

What is generalization?

The critical property of an abstraction is its ability to generalize.

Within a domain, learning often looks like generalization - in sports, you pattern match one situation to similar past situations, and react instinctively in a way that your intuition hopes will lead to success. In mathematics, solving a problem with an algebraic manipulation once lets you recognize that type of manipulation in other situations. And often you then abstract the manipulation into a rule or into an operation that you can run more flexibly.

But you’d also like to do out-of-domain generalization - learn something in one context and apply it to an entirely different context. Learning language in one domain (say, at home) and then applying it at school, or with friends rather than parents, or in speeches. Take the concept of specialization and lift it from economics to understanding sexual dimorphism in evolution. Take the concepts of a replicator and differential selection from evolution and apply them to the ideas and tunes and fashions that replicate by imitation. (add something about decomposition, modularity, causality)

Generalization as a Standard

Generalization accuracy is a great standard to hold abstractions to. What we want is for our representations to aid us in problem solving, which often takes the form of predicting what the impact of our actions will be or what the state of our system will be in the future.

When adjudicating between representations and when constructing them, we’ll optimize for the representation that is easy to evaluate (simplicity heuristic) and that makes the most accurate predictions. And simplicity is also a part of accuracy - models that are simpler have the advantages of capturing more data (because in general they’re more abstract), being more robust to small differences between observations and so better able to capture the higher level regularity, and being straightforward to update when they’re mistaken.

Tradeoffs need to be adjudicated, and so having a downstream task to determine the appropriate lines for those tradeoffs is invaluable. Yet in practice people use other standards either as proxies for generalization or to preserve longer term value - truth is a classic example. Instead of asking if a representation is predictive, you can ask how closely it corresponds to reality via some similarity metric between your map of the territory and the territory. Often there are tradeoffs between the true model and the model that leads to the best generalization accuracy.

Even generalization accuracy should be seen as an intermediate standard or proxy, with utility as the true goal. And perhaps we should resolve the tension between predictive accuracy and utility in favor of utility. Many models or pieces of information that improve predictive accuracy do dramatic damage to agent utility (ex., noble lies - rights, objective grounding to values, sacred & religious beliefs, etc.). It’s not clear that predictive accuracy is the metric we want to evaluate these pieces of information by, and so that evaluation needs to happen in a decision function that adjudicates between standards based on their contribution to utility.

What is Transfer?

A central part of the efficient use of models is their generality. Generality means that a model works across many tasks and domains, and so a central component of an intelligent worldview is the ability to recognize and model shared structure. This flexibility allows us to deal with unforeseen problems efficiently, by co-opting ideas and models that are already well fit by world experience and leveraging them for the task at hand. It also lets us compress the information in our world model. But there’s also this wonderful exponential structure to composing models of systems that lets us counter the exponential complexity of the problems we face.

Transfer is identifying shared structure between objects, ideas, or domains, and assuming that additional information that you have about one domain will generalize to being true of the other domain.

All machine learning is transfer. Transfer is the ‘identical’ in independent and identically distributed. At a low level, we have transfer between datapoints, which leads to a standard machine learning model. It’s transfer between datasets that we usually refer to as transfer, but that’s simply a level of abstraction above the datapoint level.

Distributions can be closer to or farther from one another. Overfitting classically refers to an inability to ‘generalize’ to data drawn from the exact same distribution. We also need some conception of the generality of models to distributions that are of different data but that have a similar structure.

There are two major components to Transfer. The first is the recognition that transfer is possible - the recognition of shared structure between domains. The second is the application of a model whose properties were learned in one domain to another domain.

For example, say that you’re trying to determine the proper sorted order of 1-n. You’re given a few lists of numbers, and shown that after the sort they all go to 1-n. You may think of sorting, but you could also only learn that the data you’ve seen goes to the particular n-length vector you see. Trained at a low level, your model for this system will perform perfectly on the information you’ve seen thus far. But it’s the wrong level of abstraction if you want it to generalize. When we move up, to a sorting function, we gain the ability to generalize ‘internally’, where data is generated from the same distribution as the old and we still get the sorted output correct - this won’t happen if we just memorize instances seen before, unless we look at every possible combination. But the concept of sorting itself can generalize to other problems - say, choosing the closest neighbors from a set of distances in kNN.
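The sorting example can be sketched in code - a minimal illustration (the function names and data here are invented for the illustration) of a model trained at too low a level versus the sorting abstraction:

```python
# A memorizer stores every observed (input, output) pair verbatim - the
# 'low level' model in the text. The abstraction is sorting itself.

def train_memorizer(examples):
    table = {tuple(x): y for x, y in examples}
    def model(x):
        return table.get(tuple(x))  # None on anything unseen
    return model

# Training data: a few lists, each mapping to the sorted vector 1..n.
examples = [([3, 1, 2], [1, 2, 3]), ([2, 3, 1], [1, 2, 3])]

memorizer = train_memorizer(examples)
abstraction = sorted  # the right level of abstraction for this task

print(memorizer([3, 1, 2]))    # [1, 2, 3] -- perfect on seen data
print(memorizer([1, 3, 2]))    # None -- same distribution, but unseen
print(abstraction([1, 3, 2]))  # [1, 2, 3] -- generalizes 'internally'
print(abstraction([9, 4, 7]))  # [4, 7, 9] -- and beyond 1..n entirely
```

The memorizer is flawless on the data it has seen and useless one step outside it; the abstraction costs nothing more and generalizes both internally and to new problems.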

What gets transferred? The information gained from training on some domain / task, which is stored in a model, gets transferred to a new domain.

Using language forces analogy / transfer over all domains - to describe a situation, we have to build a model in order to use words to describe things. And that brings in information from every other instance of the modular words used to describe the situation. Which makes it much harder to memorize the particular task at hand. Instead, we have to map it to a known structure, and in so doing transfer structure and models to it. Say you tell someone “I’m having trouble deciding what my course load should be.” The word ‘load’ brings in information about resource use and the heaviness implies there is a cost to be optimized or minimized. The word ‘decision’ implies that a set of models needs to be applied to it. You can barely describe the problem without transferring information to it.

The Recognition of Shared Structure

The recognition of shared structure is what allows transfer to happen. Shared structure can come in many forms. Often, fields are formed around the shared structure in some set of problems, and then a body of techniques is built to flexibly solve those problems.

A great example is in economics - there is shared structure in resource allocation problems. Take opportunity cost. It can be applied to any situation where there are scarce resources being allocated that are associated with actions that lead to some utility.

There’s this story where I want to finish my monthly goals, but it’s late and swing dancing starts soon. If I go to swing I can’t finish my goals - the opportunity cost to my going to swing becomes finishing these goals. At a very low level, I could take a lesson away from this which is extremely specific. Swing dancing leads to forgone goal setting. One form of learning looks like memorizing that relationship.

There’s a more abstract form of learning that comes out of abstracting, introducing time to the picture. There’s time that I won’t have to work on goals if I’m dancing. If I abstract at that level, I can take the lesson that time will make actions mutually exclusive - I can’t do one if I do the other. At that level, I can learn about other situations from this one. In any case where I spend time on one thing, that time is lost to being spent on another thing. And I can learn about the tradeoff from time in the context of swing dancing vs. goal setting and apply it to a number of other situations, thus achieving transfer.

I can amplify that transfer even more by pushing to a higher level of abstraction. Abstracting up from time to resource, I can realize that this mutual exclusivity / scarcity framework works on money and on energy (oil, gas, nuclear), works on computer memory and human memory, attention, and so on. The breadth of impact that a discovery in any of these spaces can have on the others is a function of our ability to abstract to the appropriate level to recognize the shared structure that exists around them.

Topics

  1. React to The Mathematics of Generalization
  2. React to Surfing Uncertainty and Being There

Generating New Abstractions

Scratch

Examples:

  1. Moving ideas out of and across domains
    1. The making of consilience, the reification of the unity of knowledge
  2. Invert creativity (non-novel value)
  3. Making the discrete continuous
  4. The mechanism that selects identities
    1. Both identity in this moment and the identity(ies) worth moving towards
    2. Meta-identity?
  5. Re-name the curse of dimensionality
  6. Invert Chaos
    1. Statistical Mechanics? This is how I’ve been describing it, as stat-mech vs. chaos.
    2. Need a temporal version of distributions
      1. Gaussian / Poisson Process?
  7. Need a verb for mutual information, for ‘has information about’
    1. Can say x is correlated with y, but can’t say x is informationed with y
    2. This is a problem because you can say ‘anti-correlated’, but can’t say anti-has information about
      1. Actually nvm, this is a tragedy where correlation can say that the relationship is positive or negative because linearity implies that the relationship always is positive or negative. You can’t do that with mutual information, it captures non-linear relationships.
  8. For the way that the future is always in the present
    1. Ex., how differently you treat someone you’ll be with forever
    2. I consistently frame this in terms of the iterated vs. uniterated prisoner’s dilemma, but it needs a word
  9. Words for the different forms of meaning
    1. Meaning that comes from devotion / sacrifice to something higher than yourself
    2. Meaning that comes from connection, community, relationship
  10. The use of dramatic social upheaval for a fundamental values shift
  11. He needs an <insert word>
  12. He needs a social rite of passage
  13. He needs a social cleansing
  14. None of these are close..
  15. Making the binary probabilistic
  16. Map-territory conflation
  17. Concept-instantiation conflation
  18. There needs to be a word for the ‘bayesian problem’ where you just don’t take the prior into account when they’re doing causal inference.
  19. Ex., canonical success traits that work but that are underrepresented in the base population
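The point in item 7 about correlation vs. mutual information can be made concrete with a minimal sketch (data and helper functions here are illustrative): Pearson correlation carries a sign but only sees linear relationships, while empirical mutual information is non-negative - it detects non-linear dependence but can’t report a direction.

```python
import math
from collections import Counter

def pearson(xs, ys):
    # Signed, but only sensitive to linear relationships.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

def mutual_information(xs, ys):
    # Empirical MI in bits: non-negative, captures non-linear
    # dependence, but has no sign to report.
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

x = [-2, -1, 0, 1, 2]
linear = [-v for v in x]        # y = -x: perfectly anti-correlated
nonlinear = [v * v for v in x]  # y = x^2: dependent, correlation-blind

print(pearson(x, linear))                # -1.0: the sign is informative
print(pearson(x, nonlinear))             # 0.0: the dependence is invisible
print(mutual_information(x, linear))     # positive, but no sign to give
print(mutual_information(x, nonlinear))  # positive: MI sees the dependence
```

This is exactly the tragedy described above: you can say ‘anti-correlated’ because linearity gives the relationship a direction, but mutual information, precisely because it covers non-linear relationships, has no direction to report.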

These are some fascinating objects, many of them conceptual - a set of examples of creativity in finding patterns or shared structure between our own stories or experiences that can be named, and in being named can be efficiently referred to and more easily built on top of (due to the freeing up of working memory).

When a new concept is created and codified as a word (say, you give a name to the moving of an idea out and across domains), suddenly the ability to think with and communicate with that concept is dramatically enhanced. In our example of moving an idea out and across domains, we may recall several examples to mind. Take Antifragile, by Nassim Nicholas Taleb. It’s built out of the inversion of another concept (fragility) but does helpful work by allowing us to gather all of the examples of a particular property of systems under the same concept, and in so doing understand it more deeply and wield it more fluently.

Many problems can be discovered completely conceptually, and (often) would only have been discovered conceptually. People ignore selection effects all over the place. Importantly, they ignore base rates. In what I call a Bayesian problem (where you fail to take the prior into account) there’s a body of counterintuitive truths about the functioning of our minds that can be discovered and used. Without the conceptual representation, selection effects would have driven the data to be interestingly in violation of naive expectation.

A concrete example is the effect of college on students. Demographics consistently show wonderful outcomes for students who went to Ivy League schools. One reasonable counterhypothesis to the value of the schools’ education system is that students are selected for quality in advance.

WHERE CONCEPTS COME FROM: A THEORY OF CONCEPT ACQUISITION William Sabo, Directed by Jesse Prinz

Summary:

Author goes after the same broken theory that Gardenfors goes after, the theory of innate concepts set up by Quine, Fodor and Chomsky to defend a pure concept based model of theory building (learning? Wonder why they call it theory building?). They want all concepts to come from existing concepts. Sabo argues that we can satisfy the explanatory goals of that theory (which he calls the Conceptual Mediation Thesis, CMT) without positing innateness.

The author’s main thesis is a distinction between indicating states and representing states. The indicating states are something like raw perception or sensing. They’re about direct correlates to the input signal. The representing states come afterwards, as the recording devices take systematically similar indicating states and create representations of those states. And so it’s through repeated similar perceptual experience that we begin to represent, rather than merely indicate, experience.

Recent Approaches to Explaining Concept Acquisition

He covers two major approaches. The first is Ned Block’s argument that a “conceptual-role semantics” (how obnoxious is that?) for concepts is responsible for concept acquisition. The second is Laurence and Margolis’ frame that concept acquisition is about acquiring a sustaining mechanism.

Conceptual-role semantics - The identity of a concept is determined by its relation to other concepts. This is close to Hikari’s thesis that everything is about its connections to other things rather than its properties in and of itself.

Of course this thesis has merit to it. Concepts compose with and support one another. But some are obviously about grounded objects, like ‘Rabbit’ or ‘Chair’. The natural kinds are good examples. The need for the One-factor / Two-factor distinction is ridiculous. Everybody should be two-factor. Spending space and time on that dumb assumption is a waste of resources. They call the other side ‘referential content’, which is in fact better than natural kinds.

Ideas: 1.

Implications: Similarity comes in here as the way to group perceptions so that they can generate a representing state that corresponds to the other perceptions which were deemed similar. Association

So much of this is so silly. A sustaining mechanism? Really? The survival benefit is so far downstream that it’s such a silly way to look at concept formation. This just justifies it, but in a way that’s so remote that it’s not useful or predictive. These ‘innate concepts’ people are being silly too. The language is so deeply in the way of understanding reality. The people who advocate for natural kinds which are then built on are reasonable.

The expectation of completeness is so silly. Why would you expect all concepts to come from the same mechanism? For an abstraction as huge as ‘concept’?? Same thing for ‘CMT’, the Conceptual Mediation Thesis. It’s clear that some concepts come from other concepts. But people forced them to cover all possible concepts with the same theory, and then when they bit the bullet and went for innate concepts to defend that position people moved to throw away their entire theory. Everyone demands a single feature model to make perfect predictions, refusing to believe that you could use multiple features simultaneously. Don’t they know that they’re not doing physics? That they’re studying an evolved system, which will happily use dozens and dozens of mechanisms to perform something as complex as concept acquisition?

Attention, similarity, and the identification-categorization relationship. Robert Nosofsky

Summary:

There’s a distinction made between identifying a stimulus and categorizing a stimulus. Using this distinction, the author designs a study which tests subjects on each to show that they’re related. In the test, a subject will be given a set of stimuli that could be confused for one another but which vary on dimensions that make it possible to separate out the stimuli from one another.

There’s an assumption that categorization works by examples which are stored in the mind of the person doing the categorization. A stimulus is compared to the examples and, if it’s similar to them, is categorized as being part of that group.

This categorization model is described as a generalization of the context theory of Medin and Schaffer. And it’s all that I really want - using the identification-categorization relationship (which I see as corresponding to the ability to represent something being tied to your ability to categorize it)

What does it mean to identify a stimulus? In my reading, it means that you find (‘identify’) features of the stimulus. Another read says that it means that you can tell that the item is self-similar, and so you identify it as a unique item. So it’s the difference between self-similarity and group similarity. Identification is about this unique object.

What does it mean to categorize a stimulus?

I realize it’s more interesting to move into Categories and Concepts here.

There are three major theories of concepts (this is up to 1981). The classical view of concepts (which goes back to Aristotle) says that concepts are groups of properties. A concept can be clearly defined as a set of properties, and objects that have those properties are categorized as belonging to that concept. The properties have to be true of all members. In the probabilistic view, instances of a concept vary in the degree that they share different properties that correspond to the concept, and so they are more or less faithful representations of the concept. The exemplar view escapes this frame entirely: instead of a summary of properties, the concept is represented by stored examples of it.

Exemplar theory explains conceptual drift - the way a concept changes meaning as the exemplars used to refer to it change.

Ideas:

  1. Implement the Classical View
  2. Implement the probabilistic classical (prototype) view
  3. Implement a hybrid, where we use the exemplar view to create categories and then examine the shared features that are determining the categories. Those properties can become definitions for concepts on which we do logic. This would allow for the discovery of mathematics or formal systems, and the system would realize that some concepts behave very differently than others.
  4. Run with k-Medoids to be more cleanly exemplar theory.
  5. Create a new theory that’s exemplar theory without examples but instead with abstractions. Cluster centers in k-means are abstracted versions of the concept they represent.

Implications:

What I’m implementing is closer to exemplar theory than anything, but I’m running k-Means and not k-Medoids so there are no specific examples at the root of my clusters. Instead, there’s a prototypical (be careful calling it prototypical, because that implies a discrete property based representation of the concept which I am not using) version of the concept (or an abstract form). At some level though, you could see this as an implementation of prototype theory where the axes in the vector embedding are the ‘properties’ and each property is measured along a continuum. The fact that you can’t name the properties doesn’t get rid of that notion of discreteness. And discreteness is all you need to map to properties. But, the fact that you can make the embedding as large or as small as you like and it will flexibly adapt makes it feel like a fundamentally different type of object. This is similar to exemplar theory in that you use actual groups of examples to define the category. If I run with k-Medoids it feels like it’s cleanly exemplar theory, where my clustering algorithm literally stores examples of the concept I’m discovering.

Topics

  1. Present Concept Discovery paper as new paradigm in cognitive science, merging connectionism, symbolism and geometric frames into functional systems.
  2. New abstractions as discovered via similarity / shared structure

Conceptual Schemes

Abstraction is actually (partially) about founding the field of conceptual analysis

It's about making the conceptual scheme explicit, along with the questions, answers, representations and assumptions that come out of that scheme.

It's about how we fit all of our data into our existing conceptual scheme, and about how to minimize the bias induced by our scheme (and optimize the bias that does exist for predictive value or other kinds of value)

Kuhn’s insight was that this is sociological, but it’s also individual.

Example: 'Products' are an abstraction over unlike objects. The natural attempt at decomposition yields 'product categories', grouping similar products and lumping them together. This aids transfer by letting you consider transfer within categories, which is between much more similar objects. Without a concept of product categories, many connections would not be made between products, because the general class of 'products' doesn't support those connections across all products. In many cases, we lack concepts like 'product category', and fail to make umpteen extraordinarily valuable connections because our conceptual scheme is insufficient.

The natural thought from cognitive economy is that conceptual schemes should efficiently structure experience so that 1. The category of an experience lets us identify it as similar to other experiences in that category and 2. The categorization of an experience lets us identify it as not being in a body of other categories.

It’s worth spelling out exactly what the resources are (prime candidates are memory and attention) and what the consequences of their scarcity are. It’s unclear how this transfers to ML.

When you have a mutually exclusive conceptual scheme, categorizing marks the object as not being in many other categories, which may contain even more information than marking it as being in a particular category. That moment when you hear a sound in the night, and upon realizing that it’s merely machinery get little relief from the fact that it’s machinery but oodles of relief from the fact that it’s not a thief, a dangerous animal or a murderer. A thank you to Eleanor Rosch, Principles of Categorization (in Concepts, Chapter 8).

Topics

  1. Kuhn
  2. Piaget (A long time ago, Piaget [49] already explained the explorative learning behavior of children through his concepts of assimilation (new inputs are embedded in old schemas - this may be viewed as a type of compression) and accommodation (adapting an old schema to a new input).)

Representation Properties

Everything has a representation. Raw sensory content is represented to sensors as a stream of sets of pixels, or frequencies of vibration.[y] Re-representing information is powerful - for example, re-representing information so that solutions can be transferred from one situation to another. Representing the internet as a graph (with web pages as nodes and links as edges) rather than as an amorphous concept or as TCP/IP allowed the use of PageRank (a graph algorithm) to turn into Google. Other representations of the internet don’t lead to the same solutions for searching the internet. The concepts you use to describe a situation carry implicit assumptions (a frame) that can be considered your default representation. For example, when you represent numbers as Arabic numerals (2, 3, 5), multiplication becomes easy (for a few reasons - ex., the digit counts of the two numbers being multiplied determine the digit count of the result, and ex., you can use the distributive property to break a multiplication down into 10*x + y, say distributing multiplication by 32 into multiplying by 3 and appending a zero, multiplying by 2, and summing the two). When you use Roman numerals, multiplication becomes much, much more difficult. This re-representation of our numbers turned a problem that used to require expert-level mathematicians into a task that most 10-year-olds can do reliably.
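The distributive decomposition that positional (Arabic) notation affords can be sketched directly (the function name here is illustrative):

```python
# Positional notation lets any multiplication be broken into digit-sized
# multiplications plus shifts: 32 * x = 3*x*10 + 2*x. That decomposition
# is what the schoolbook algorithm exploits.

def multiply_positional(x, multiplier):
    total, place = 0, 1
    for digit in reversed(str(multiplier)):  # walk digits right to left
        total += int(digit) * x * place      # digit * x, shifted by place
        place *= 10
    return total

print(multiply_positional(47, 32))  # 1504 == 47 * 32
```

Roman numerals support no such decomposition, which is the point: the same quantity, represented differently, makes the same operation easy or hard.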

One easy way to see the way that a change in representation can be valuable is to ask where a discrete representation of some information would be better off continuous and where a continuous representation would be better off discrete. Imagine if, instead of measuring, say, the speed of a car on a continuum, it was measured as a binary - fast or not fast - triggered at some MPH threshold. The speedometer would merely tell drivers whether or not they were going ‘fast’. Immediately many more people would die in speeding accidents that come from an inability to make distinctions between levels of speed. And when it comes time to debate the speed limit, instead of seeing the hidden assumption (a binary speed representation), the debate will often center on the number at which the reading transitions from not fast to fast.

This is how we represent very important concepts, like truth. While in some domains a binary representation (is it true or not?) makes sense, in all too many domains our expectation of binaries on truth makes sense-making nearly impossible. [z]

Think about the re-frame from binary belief (where statements are true or not true) to probabilistic belief (where statements have higher or lower probability as a function of your knowledge that’s relevant to the belief). This reframe dramatically improves thinking on many philosophical issues (from ‘is it just or not just’ to ‘to what degree does it serve justice’) and practical issues (ex., thinking that the probability of a cyber attack is low, rather than that an attack will or will not happen).

Valuable Properties of Representations

Valuable properties of representations, born out of the frustration with the obsession over disentangling representations to the exclusion of other critical concepts. Many of these properties exist, to a greater or lesser extent, in human cognition.

  1. Decomposition of representation
    1. This gives you a controllable, interpretable, recombinable representation
  2. Alignment of representation where shared structure exists
    1. Want concepts with the same mechanisms / structure to update simultaneously when there’s new information that informs their working
    2. Can be through compositionality
    3. Trades off against decomposition?
  3. Modifiability of complexity of the representation depending on task
    1. Representation that becomes more granular upon zooming in
    2. Necessary for computational efficiency
      1. Memory Constraints
      2. Compute Time
      3. Attention Constraints
    3. Ideally would be on a continuum
      1. Give me the n principal components (non-linear) of the representation, while preserving clean conceptual (semantic) decomposition
  4. Transferability
    1. Ability for the representation to be repurposed for different tasks, generally through learning sufficiently high level structure that there is an appropriate level at which to do transfer between representations of problems and solutions
  5. Appropriate tradeoff of Simplicity / Compressedness vs. Representational capacity
  6. Sparsity
    1. Necessary for the discovery of compute intensive structure (say, graphical / relational / network, or concept recombination) in the representation
  7. Interpretability
    1. Optimizability of representation for interpretability.
    2. Quality translation from representation to natural language.
    3. Clean isolation of parts of the representation (or a sparse approximation of the used representation) for any prediction made or action taken.
  8. Control
    1. Control through modification, freezing, or freeing of sub-parts of the representation
  9. Discrete and Continuous Modes
    1. Discreteness
      1. For Interpretability, self-examination, sparsity.
    2. Continuity
      1. For representational capacity, predictive accuracy.
  10. Fully general translation into and out of the representation
  11. Want to be able to flexibly represent any category of object, situation, etc. in a merged representation
  12. Reserve category errors for a particular mode of action, ‘rigor mode’
  13. Canonical Representations
  14. Propose a way to map representations that are functionally identical onto one another. Use this to align representation for transfer. (Hikari) What this means is that you’d like to guarantee that making a single update will apply over the entire affected part of the representation.

For example, if you’re using the connotations of words to do the transfer that makes concepts valuable as language, it’s possible to run into Russell conjugates (where two different words refer to the same thing but with different connotations). In that case, updating one may not update the other - an ideal representation would realize that those words have similar contexts and make an update to one consistently lead to an update in all of its Russell conjugates.

Often it’s efficient to update representations compositionally, where an update to a part of the representation that’s used in many downstream components can effortlessly feed into those components.
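A toy sketch of the Russell-conjugate point (the words and values here are invented): when two words for the same underlying concept are stored as separate entries, updating one leaves the other stale; when both point at a shared concept node, a single update propagates everywhere it should.

```python
# Flat representation: each word carries its own value independently.
flat = {"frugal": 0.5, "stingy": 0.5}
flat["frugal"] = 0.9
print(flat["stingy"])  # 0.5 -- the conjugate is now stale

# Aligned representation: both words point at one shared concept.
shared = {"value": 0.5}
aligned = {"frugal": shared, "stingy": shared}
aligned["frugal"]["value"] = 0.9
print(aligned["stingy"]["value"])  # 0.9 -- the update propagated
```

This is also the compositional-update point in miniature: a change to a shared sub-part feeds effortlessly into everything built on top of it.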

This has implications for concept creation and usage. When you introduce or create a new concept, you split the data between it and the substitute set of concepts you’ve likely been using to communicate the same idea.

Representational Alignment

  1. Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
    1. A way (or set of ways) to measure the objective
      1. Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
    2. The downstream consequences of doing better or worse on the objective
    3. Compare two different networks over the objective
    4. The rationale (and intuition pumps) for the objective
      1. The counterarguments

Aligning representations is often the very purpose of communication. There’s been a way that you’ve been thinking about a situation, about a person, about a kind of problem, and you will feel that you’ve communicated your state to your conversation partner when your representations align. The standard is whether they can express your experience and thinking with all of the emotion and intimate detail that you can, as a confirmation of shared state and alignment of conceptual scheme and felt senses.

The transfer of knowledge from a teacher to a student often begins with the establishment of a shared and agreed upon basis of knowledge on which to build. In the technical disciplines, definitions and formal representations (math, program code, and diagrams are strong examples) facilitate the alignment of representations. It’s on top of this aligned state that learning proceeds, and if misalignment goes undiscovered there are consequences for the student’s progress - those holes lead to misunderstandings, missed inferences, and incomprehension.

When alignment is taken for granted, conversations can proceed without the realization that the speaker’s mental state isn’t actually known. This leads to an illusion of transparency, where the speaker projects and anticipates that representations are already aligned, and so their intended interpretation is all that they can see their conversation partner as seeing.

There’s a desire in representations for internal alignment, where in a similar situation the similar parts of the representation are re-used rather than being connected to another part of the representation. Generally this is driven by concept re-use, where (for example) you have a concept of learning that you port whether you’re learning mathematics, the piano, or a new language. That re-use lets you take techniques like spaced repetition and deliberate practice and port them across all learning tasks, rather than leaving them specific to their context. Often transfer fails to happen despite this - language learners deeply appreciate the value of immersion in their target language, but immersive learning is expensive enough that it’s rarely attempted outside of language learning.

Splitting concepts means that you need to go through more experiences to acquire the same collective amount of knowledge. If you’re reading about golden retrievers and don’t realize that they’re dogs, you’ll need to learn a body of facts over again if you want to understand how golden retrievers behave.

The discovery of concepts that share some structure that wasn’t previously perceived allows for internal learning, free of new experience. It’s the taking of every experience associated with one concept and re-seeing the experiences associated with the second concept in light of the first. You realize that there’s a connection between compositionality and hierarchy trees, and are suddenly free to port notions of how sub-parts can be represented so as to be combinable, or the speed of operations across the hierarchy, or the idea to decompose and recombine when you want the properties that hierarchy promises.

As importantly, you realize that each system whose representation is compositional, whether it be an object-oriented programming language or a biological taxonomy, is impacted by what is learned through this alignment process.

Measure

How do we measure alignment? A failure to do this properly leads to miscommunication, failed attempts at transfer, conflation and ambiguity. The tests look quite different across contexts.

Conversational (and human conceptual) alignment is often tested by making a statement or prediction that would only be possible under conditions of alignment (a non-trivial inference). Repetition of words is a weak standard of measure, since it’s possible to mimic the interface of a representation without actually having the underlying representation. There’s a hope that stating a thought in one’s own words, or some other metric that forces generalization, can avoid overfitting to the specific way that a concept was phrased by the first speaker.

This distinction in measure - generalization rather than replication - is an important one across machine learning and thought. Can a person actually use the representation they’ve been aligned to on new data, or do they have a version of the representation that’s too closely tied to exactly what was communicated? It’s akin to learning vs. memorizing, and captures a lot of what it means to ‘understand’.

Measuring alignment in neural networks can proceed by the same mechanism. Put an identical datapoint through both networks and ask whether the same activations trigger. It’s a notion of network similarity, and can be more or less in depth, where at surface level you ask if the predictions made are the same, at a slightly deeper layer you ask if the probability distribution over possible predictions is the same, at even deeper levels you ask if the bodies of activated neurons are the same, on through the depth of the network.
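A minimal sketch of this layer-by-layer comparison, with toy stand-in "networks" (the networks, functions, and tolerance here are all invented for illustration):

```python
# Sketch: measure alignment between two networks by running the same
# datapoint through both and checking how deep their activations agree.
# The "networks" are toy lists of layer functions, purely illustrative.

def layer_activations(network, x):
    """Run x through each layer in turn, collecting the activations."""
    activations = []
    for layer in network:
        x = layer(x)
        activations.append(x)
    return activations

def alignment_depth(net_a, net_b, x, tol=1e-9):
    """Depth up to which the two networks' activations agree on input x."""
    depth = 0
    for a, b in zip(layer_activations(net_a, x), layer_activations(net_b, x)):
        if abs(a - b) > tol:
            break
        depth += 1
    return depth

# Two toy networks that agree at the first layer but diverge afterwards.
net_a = [lambda x: 2 * x, lambda x: x + 1]
net_b = [lambda x: 2 * x, lambda x: x - 1]

print(alignment_depth(net_a, net_b, 3.0))  # 1: aligned only at depth 1
```

The same skeleton generalizes from comparing scalar outputs to comparing prediction distributions or whole bodies of activated neurons, as described above.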

There’s a notion of alignment in word vectors which allows for unsupervised machine translation: force words known to have the same meaning to use the same vectors. All other words are learned around and with respect to those anchor vectors, and so will also be aligned with words in the other language. Translation can then proceed at a word-by-word level.
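One standard formulation of this idea is orthogonal Procrustes alignment: given anchor word pairs known to share meaning, solve for the rotation mapping one embedding space onto the other. A sketch with tiny invented vectors (real systems use learned embeddings with thousands of dimensions):

```python
import numpy as np

# Sketch: align two embedding spaces given anchor pairs. Rows of X are
# source-language word vectors, rows of Y their known translations.

def procrustes_align(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (rows are word vectors)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy anchors: Y is X rotated by 90 degrees in the plane.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R = np.array([[0.0, 1.0], [-1.0, 0.0]])
Y = X @ R

W = procrustes_align(X, Y)
# W recovers the rotation, so unseen source vectors land near their
# target-language counterparts and nearest-neighbor lookup translates them.
print(np.allclose(X @ W, Y))  # True
```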

The question of prediction often can be made concrete - if a teacher wants to check that the student’s representation is correct, they assign a set of problems whose collective predictions will only be correct if the representation is correct. Depending on the task, there’s a level of depth that can be required. In some cases, the student can use the formal representation (say, a set of equations) and apply them naively to the input question to produce the proper output. This can create misalignment between the representation the student uses to solve the problem and the conceptual representation that the student really has, where they’ve outsourced their understanding to an external formal representation and are merely replicating the interface of that representation’s use.

The comparison of networks question motivates another metric for internal representational alignment - efficiency. With how few neurons can the network effectively perform its task? What fraction of its units does it actually use, and need to use? A kind of alignment can be forced by a challenging task that the network doesn’t have enough data to learn on without re-using one experience for another.

A challenge in measuring alignment is that you have to know what datapoints share structure that could benefit from re-use in the model. This can be challenging, especially if you don’t have the generator for that data. Your ability to interpret the data may be limited.

In the conversational context, it can also be a deep challenge to measure alignment. You and the person you’re talking with may use the same words to express things, but mean very different things by that word and have different connotations over its usage. Even if you realize that there’s misalignment, it can be difficult to tell where it is and how much, depending on your introspective power over your own representation and your ability to communicate with language the thoughts and experiences that underlie the creation of your representation.

The obvious rationale for optimizing for alignment is its impact on generalization and its impact on learning efficiency.

Representations that re-use the relevant components when making predictions are capable of naturally generalizing from one example to the next. People whose representations are aligned are capable of generalizing in that they’ll make the same predictions as one another when using that aligned representation, and so have a shared context on top of which they can continue to push their (now collective) knowledge forwards.

There’s a limited scope in working memory, and so to the degree that more information can be bundled together under a concept, the predictions of the representation will be more general.

Representations that are aligned allow for efficient learning, in that the same experience doesn’t have to occur twice to two concepts that could have been aligned but were not. In so many ways, people have to learn the same lessons over and over again because they fail to represent their problem in a way that naturally allows transfer from a context where they experienced the same lesson.

There’s another notion of efficiency, efficiency with cognitive resources, which comes out of alignment. There’s generally a limited number of neurons, computational units, or concepts in the representation being used for prediction, compression, etc. That limitation manifests itself in the amount of composition that can be leveraged up and across the representational hierarchy, in the amount and quality of differences that can be picked up by the model, and in its overall capacity.

As with creating a new abstract object and using it, there’s a tradeoff between conflation and shared structure when it comes to alignment. You want to align the parts of the representation for which the transfer over shared structure aligns with reality, and not align the parts for which that’s untrue. Extremely high alignment (say, the dichotomy between yin and yang / binary / dialectic / thesis-antithesis-synthesis) may actually yield more conflation and so incorrect predictions than if the representation stays at a level that can safely distinguish between the important differences between those objects.

Scratch

Experiment ideas:

  1. Distillation as alignment of representations,
  2. Language alignment through word vectors, can do something similar with concept learning by clustering available images
  3. Internal learning, or learning by doing transfer between parts of your own representation, seeing all learned functions as hypotheses that can be applied to new datapoints.
  4. Communication in multi-agent RL could look like representational alignment
    1. Canonical correlation analysis checks which principal components are most aligned. That alignment step, applied to the feature space before model averaging.
    2. This allows for recombination in network space? Perhaps?
    3. Canonical Correlation Analysis
    4. Representation averaging without alignment
  5. Internal alignment where there’s shared structure (re-use for cleaner global updating)
    1. I bet that networks learn the same ways to recombine base elements repeatedly. This can be re-routed to use the same abstract transformations.
  6. Consequences of the absence of alignment

Fundamental Questions in Representation

Why Representation Learning?

The way that a problem is represented can dramatically change our perceptions of what the problem is and how to solve it. Problems that look nearly impossible, if re-represented, suddenly become tractable.

One obvious gap is from a conceptual representation of the problem to a mathematical one. Representing a social network or communications network as a graph allows us to do operations like search or optimization over the representation. Representing a person by their connections with a vector lets us measure the similarity between ostensibly very different people quickly and efficiently. Representing a word as a vector that depends on its surrounding words does the same.
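As a sketch of the connections-as-vector idea (the people and connection sets below are invented), similarity between ostensibly different people reduces to a cheap vector computation:

```python
import math

# Sketch: represent each person by a binary vector over who they're
# connected to, then compare people with cosine similarity.

people = ["ana", "bo", "cy", "dee"]

def connection_vector(connections):
    """1.0 on each axis whose person is in the connection set, else 0.0."""
    return [1.0 if p in connections else 0.0 for p in people]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

alice = connection_vector({"ana", "bo", "cy"})
bob = connection_vector({"ana", "bo", "dee"})
print(round(cosine_similarity(alice, bob), 3))  # 0.667: two of three shared
```

Word vectors built from surrounding words support exactly the same comparison, which is what makes the representation choice so powerful.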

There always is a representation. And so you can improve your ability to solve a problem on the representation side (by reformulating the problem) or on the solution side (by reformulating your solution assuming the given representation). The reason for representation learning is that you can optimize the representation of the problem for the task at hand, with a particular parameterization of the space of possible representations of the problem.

Downstream from a quality representation is the ability to re-use parts of that representation to quickly adapt to new environments, experiences and tasks. And so learning modular representations is critical. Humans have modularity built into our grammar, and so can use language to effectively slot descriptions of actions and objects into our conceptual and communicative scheme. Attempting to transfer with a representation that is at the wrong level of analysis will fail, and so the best representations allow for flexible motion between multiple levels of analysis, which creates a surface for transfer.

Why Learn Discrete / Sparse Representations?

Once you have a low dimensional discrete representation, a wide body of important algorithms are made available.

  1. Concept recombination
    1. With sparsity it’s possible to run inefficient algorithms, for example looking at the set of combinations of a set of concepts. Those interactions can only be looked at for a small subset of the feature space - doing it at the level of pixels and in a continuous space would be computationally intractable.
  2. Causality and its establishment
    1. If you want to do counterfactual learning, you have to find an abstractive representation of time (both the time over the action and over the outcome) and you have to find an abstractive representation of the action space.
      1. Counterfactuals are based on discrete changes in state
      2. ‘Continuous counterfactuals’ require convergence of the rollout of states from the counterfactual change
        1. By continuous counterfactuals the claim is that a small change in the input leads to a small and predictable change in the entire rollout of events following the change in the input.
      3. If rollouts diverge (and I suspect that in our world they tend to diverge quickly), continuous counterfactuals aren’t possible
    2. Creating a causal graph requires a high level representation of parts of your environment and actions in that environment.
  3. Credit assignment (to higher level objects) made efficient
  4. Hierarchical control
    1. Decision making over conceptual blocks of actions
  5. Higher level planning
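The combinatorial point in item 1 can be made concrete: with a handful of active concepts, exhaustively enumerating their interactions is cheap, while the same enumeration at the pixel level explodes (the concepts and counts below are illustrative):

```python
from itertools import combinations

# Sketch: a sparse, discrete representation makes exhaustive algorithms
# over concept combinations feasible. Five active concepts yield only 10
# pairs; pairing every pixel of a 64x64 image yields millions.

active_concepts = ["dog", "running", "park", "morning", "leash"]

pairs = list(combinations(active_concepts, 2))
print(len(pairs))  # 10 pairwise interactions to examine

pixel_count = 64 * 64
pixel_pairs = pixel_count * (pixel_count - 1) // 2
print(pixel_pairs)  # 8386560 - intractable for anything beyond pairs
```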

Why Learn Continuous Representations?

In a continuous representation, one option you have is to have each axis represent a quality that can be had in greater or lesser proportion. And so you represent your object as a vector of many qualities (decomposition) which vary across a continuum (fluidity) and so have extraordinary flexibility.

An example of that flexibility is that if you have two objects, alike in some ways but different in others, whose similarities you want to take advantage of in some contexts (say, the weight of two balls being similar, but their colors being different), you can represent that with a two-dimensional continuous vector for each ball. In a discrete representation, both color and weight are represented very differently (with approximations or collapses) and can’t be merged in useful ways.

In the continuous representation you can also re-combine the representation in umpteen ways. You can use masking to average some vectors and not others (when merging two concepts) to create novel vectors which a generative model can render as a low-level representation, and which can be compared (in terms of performance on some downstream objective) to the concepts that are already in your representation.
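A sketch of that masked merging, with invented axes and values:

```python
# Sketch: recombine two concept vectors with a mask - keep some axes
# from one concept, take others from the second, and blend the rest.

def masked_merge(u, v, mask):
    """mask[i] = 1 keeps u's axis, 0 keeps v's, values in between blend."""
    return [m * a + (1 - m) * b for a, b, m in zip(u, v, mask)]

# Invented axes: [weight, color_hue, bounciness]
red_ball = [0.5, 0.9, 0.7]
blue_ball = [0.5, 0.1, 0.2]

# Keep the shared weight, take blue's hue, average the bounciness.
hybrid = masked_merge(red_ball, blue_ball, [1.0, 0.0, 0.5])
print([round(x, 3) for x in hybrid])  # [0.5, 0.1, 0.45]
```

The resulting novel vector is exactly the kind of candidate concept that can be handed to a generative model or scored on a downstream objective.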

When you’re abstracting, you want to notice where there’s conflation and where there’s shared structure.

I’d like to note that this is an example of a dense continuous representation. There are sparse discrete and sparse continuous representations as well.

The data is also often high dimensional, with each high level concept having innumerable sub-concepts or attributes or properties that can be seen as axes in the concept vector.

Sparsity

Economy of attention: importance is a zero-sum game. If you add weight to one feature you tacitly take it away from other features.

Information as a reduction in uncertainty: one objective of a representation is to be that which maximally reduces uncertainty (about relevant predictions).

Sparsity in representations

  1. Sparsity and inefficient algorithms (ex., recombination)
  2. Compression (for speed of evaluation, related to 1)
  3. Compression (for improved pattern recognition)
  4. Compression (for improved use of scarce neural resources)
    1. High memory capacity (so cool that I rederived this myself)
  5. High representational capacity
  6. Connected to…
    1. Numenta’s Sparse Distributed Representations
    2. The State of Sparsity in DL
    3. Spiking in neurons (sparse activations are neuroinspired)
    4. Sparse Coding
      1. Sparse Coding in the Primate Cortex
  7. Sparsity as regularization improving generalization
    1. Sparsity as a notion of simplicity (and so invoking Occam)
  8. Way / ways to measure sparsity
  9. Sparsity at a high level (say, concepts) vs. at a low level (say, perception)
  10. Sparseness as discreteness out of continuity
  11. Other types of sparsity (sparsity as a concept with many instantiations)
  12. Downsides / weaknesses to sparsity
  13. High fault tolerance (as most pathways are independent) (so cool that I rederived this myself)
  14. Allows for simultaneity (so cool that I rederived this myself)

OMG, one huge model may be possible if you use sparse networks but not if you use dense networks

Increasing the width of a sparse representation while keeping the number of sparse activations constant increases the representational power of a network without adding computational burden. It’s an extremely cheap increase in potential connections.

Sparsity leads to the ability to run inefficient algorithms. In computer science, an algorithm’s computational complexity represents the amount of time it will take to run given the size of its inputs. There are many algorithms that scale poorly with their number of inputs (say, algorithms that look at all of the combinations of the input elements), but yet capture extremely valuable relationships and structure. Sparsity in the input allows for the running of these algorithms.

Speed of evaluation is often a function of the number of values that need to be considered. In a dense representation, every operation can trigger a huge number of downstream operations across the entire network. With a sparse representation, particular pathways can be triggered selectively. The total number of activations can shrink to below 1% the number of dense activations, saving on all resources involved in computing. Faster evaluations allow for more evaluations in the same time period, increasing the quality of the transformations.
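A sketch of why sparse activation is cheap: store only the active units, and the cost of an operation scales with the number of active units rather than the layer's width (the indices, values, and width below are invented):

```python
# Sketch: a sparse activation pattern stored as {index: value}. The cost
# of propagating it depends on the number of active units, not on the
# width of the layer - so widening the layer is nearly free.

def sparse_dot(sparse_acts, weights):
    """Dot product touching only the active units."""
    return sum(value * weights[i] for i, value in sparse_acts.items())

width = 1_000_000                                  # a very wide layer...
sparse_acts = {3: 0.5, 777: 1.0, 999_999: 0.25}    # ...only 3 active units

weights = {3: 2.0, 777: -1.0, 999_999: 4.0}        # weights for active units
print(sparse_dot(sparse_acts, weights))  # 1.0, computed from 3 terms
```

Three multiplications out of a potential million: this is the sub-1% activation regime described above.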

Compression forces more effective pattern recognition, and we can see sparsity as a form of compression that improves pattern recognition via simplicity. A simple example is the way that a small set of primary features predicting an output is more likely to generalize to a new situation than looking at many many features which may by chance share some statistical connection to a prediction. That feature selection and refinement forces a model that looks only at the primary drivers of an outcome, which are drivers you expect are more likely to remain true at generalization.

Because there’s a limit to the capacity of any representational system (number of neurons in a brain, size of a network on a machine) there’s value in using resources efficiently. Sparsity allows for the extremely efficient use of space, and so enables more relationships to be encoded simultaneously. This likely leads to an expanded capacity for long term memory.

How do we measure sparsity? It depends on the type of sparsity and the level of analysis. One easy answer is the fraction of activations in a network representation, where the fraction of neurons that fail to activate or are zero corresponds to the sparsity level.
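That fraction-of-inactive-units measure is a one-liner:

```python
# Sketch: the simplest sparsity measure - the fraction of units in an
# activation vector that are zero (or numerically negligible).

def sparsity(activations, eps=1e-8):
    inactive = sum(1 for a in activations if abs(a) < eps)
    return inactive / len(activations)

print(sparsity([0.0, 0.0, 0.3, 0.0, 1.2, 0.0, 0.0, 0.0]))  # 0.75
```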

Sparsity is one path from continuity to discreteness. A representation that allows an arbitrary number of active features is continuous in the sense that there’s an approximation of continuity that comes with having an extremely large and arbitrary number of discrete values. A sparsity prior pushes that number down to a small finite value that shifts dramatically upon the addition or expulsion of one additional feature, which is yet another heuristic for the representation’s discreteness.

Aside: Types of continuity

  1. Continuity in the amount of a feature that is present (in the coefficient)
  2. Continuity in the value of the feature itself (ex., male / female is on or off, as a discrete feature)
  3. Continuity in the number of features that are being used for predictions (where as the number grows it gets closer to feeling continuous, in the same way that floating point digits are technically discrete but are approximating a continuum)

Interpretability of high level sparse representations (where you can say exactly which high level concepts informed your prediction or decision) is an extremely helpful property for a model to have when a human intends to intervene on its predictions or even to understand its model. There’s an uninterpretability to the mass conflation that comes out of looking at the combinations of every feature with every other feature on a continuum, where isolating the relationships that matter is critical to finding a causal intervention strategy or to communicating a convincing explanation of the representation’s behavior.

There’s a level of fault-tolerance that comes with using a sparse representation, where if parts of the representation are damaged or disabled, since it’s only used for a tiny fraction of the computations made the damage to the system as a whole is limited. When operations are dense, with every part of the representation being used for every prediction, it’s much easier for damage or faulty computations to disable the system in its entirety.

Downsides / weaknesses to sparsity - The obvious place to begin is that sparsity can damage accuracy, especially in excessive amounts. The representation isn’t as capable of modeling complex interactions, or needs to reduce the degree of sparsity in order to get there. Which introduces another issue - choosing the optimal degree of sparsity in the representation. This is a complex choice to make, and while in the context of machine learning it can be tuned as a hyperparameter, it’s not clear how to resolve it in the absence of a clear objective or in the context of non-computational representations.

Bias / Variance Tradeoff

Bias-Variance and Abstraction

[Need: The way that Abstraction / Compression increases Bias, Lowers Variance] There’s a powerful generalization from the bias-variance tradeoff in modeling to the standards by which we create concepts, the quality of those concepts’ representation of the world, and generalization through their use in future thought.

When we decide to create a new concept (one quality example is a new word), we use it to tacitly collect data that corresponds to the category that it outlines. Take the concept of dog as a label that shares information through a class of objects that are similar along many axes. It effectively bundles information about that object, allowing anything that’s recognized as a dog to be treated as likely to share the properties of the other dogs that have been seen in the past.

(Need an example that can demonstrate fluidity more effectively) One important downside of this binary classification (where all objects are either dogs or not dogs) is the conflation it creates between different dogs.

Another consequence is that without the creation of a higher level class (say, animal), it isolates the data that belongs to the dog class to that class. Because categories are attractors, it’s not feasible to have a category that’s close by without it being subsumed / replaced.

It can be useful to share learned information even more broadly.

Bias variance is one instantiation of Occam’s razor.

Point 1: Bias-Variance Tradeoff and the variance of a probability distribution. Variance in the bias-variance tradeoff refers to the concept that when you’re searching over models, some models have more flexibility. When they fit a dataset, models with more flexibility tend to overfit, because they find a separating hyperplane that is overly accommodating to particular datapoints. There are many ways to tend to overfit, and variance is an abstraction over all of them.

  1. There’s valuing fit over smoothness.
  2. There’s valuing a single datapoint in a region with sparse data over the impact from other datapoints farther away that you can interpolate or extrapolate from. (Looking too much at particular datapoints)
  3. Arbitrarily overweighting one representation of the features over valuable others, incomplete search over the set of feature representations
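A minimal sketch of flexibility driving overfitting, using nearest-neighbor regression on invented noisy data: a 1-nearest-neighbor model replicates its training set perfectly, while a smoother model averages several neighbors away from the noise.

```python
import random

random.seed(0)

# Noisy samples from y = x^2 on [0, 1]; the function and noise level
# are invented for illustration.
def sample(n):
    xs = [random.random() for _ in range(n)]
    return [(x, x * x + random.gauss(0, 0.1)) for x in xs]

train, test = sample(40), sample(200)

def knn_predict(train, x, k):
    """Average the y-values of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# k=1 memorizes the training set: zero training error, but it carries the
# noise of individual datapoints into every prediction. A larger k trades
# flexibility (variance) for smoothness (bias).
print(mse(train, train, 1))                       # 0.0: perfect replication
print(mse(train, test, 1), mse(train, test, 5))   # flexible vs. smoothed test error
```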

Related: Decomposition over Regularization https://docs.google.com/document/d/1tCoaZEzERE3XP_4SzJWJhQ17bnY7vfUGYPGnCCEO54I/edit?usp=sharing

Levels of Abstraction, Abstracting Over an Incomplete Subset https://docs.google.com/document/d/18FvL9mlKTDlxQVXju1v8vOV63U73-6f9WVGh9-d8ScE/edit?usp=sharing

Treating Variance in the bias-variance tradeoff as a concept, there are many ways we could instantiate it.

  1. The standard way, watching your model overfit. This approximation of variance is the difference between the training error and the validation error. (Bias will affect both your training and validation error equally)
  2. Bootstrap sampling variants of the dataset
    1. Split between in and out of bag examples
    2. Train on the training sample, test on the testing sample
    3. Variance is the ordinary (distribution) variance of your predictions on a given datapoint (assuming regression). You can compute the average variance across datapoints for your model’s variance.
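The bootstrap procedure above can be sketched as follows (a trivial mean-predictor on invented data stands in for the model):

```python
import random

random.seed(1)

# Sketch of bootstrap variance estimation: resample the dataset with
# replacement, refit a (trivial) model on each resample, and measure the
# spread of its predictions on a fixed datapoint.

data = [2.0, 3.0, 5.0, 4.0, 6.0, 3.5, 4.5]  # invented observations

def fit_mean_model(sample):
    """A minimal 'model': always predict the sample mean."""
    mean = sum(sample) / len(sample)
    return lambda x: mean

def bootstrap_variance(data, n_resamples=200):
    preds = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        model = fit_mean_model(resample)
        preds.append(model(0.0))  # prediction on an arbitrary datapoint
    mean_pred = sum(preds) / len(preds)
    return sum((p - mean_pred) ** 2 for p in preds) / len(preds)

# The spread across bootstrap refits approximates the model's variance
# term; a more flexible model would show a larger spread.
print(bootstrap_variance(data))
```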

The number of hypotheses that can be learned by a model is one more approximation (say, the number of features in a dataset for a decision stump, and all of their interactions for a singly branched tree). Across different representations of a hypothesis space (parameters, freedom over those parameters, number of parameters, rules, freedom of rules), these are different approximations of the variance. But while a wide hypothesis space tends to cause high variance, it’s not variance itself.

Say that your model’s predictions of a datapoint are Cauchy distributed. Would you say that since its variance is undefined, it’s not subject to the bias-variance tradeoff?

Variance of a distribution

Variance is complex hypothesis classes leading to overfitting

Just because a concept is formalizable doesn’t mean that the concept is its formalization. There’s something like map-territory here. But it’s higher-map lower-map. We need a clean way to distinguish between concepts and their formalization. Would you say that ‘attention’ in deep learning is attention? Of course not. Attention is so much bigger than that.

Examples of Bias Variance, and how Abstraction increases bias while lowering variance: Myers-Briggs works extremely well. The 16 categories are structure that introduces bias - a simplification that comes from the assumptions, traded against the variance you’d face given constant data. You can imagine having 1600 categories and the same amount of data, matching the underlying structure of personality much more closely. This would be a dramatic reduction of the bias at the cost of blowing up the variance (because very few people would fall into each category, and their idiosyncrasies would overly influence our sense of what each personality category was like). These categories were created by generating 4 abstract properties of people and looking at the categories that come out of treating those 4 properties as binary, one-or-the-other switches: extroverted vs. introverted, feeling vs. thinking, and so on. When we overcategorize (say, generating orders of magnitude more properties of personality) we thin out the data of people in each category (and so may need a system that can effectively do transfer from similar-but-not-identical categories).
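The 4-binary-axes construction, and the way extra axes thin the data per category, can be sketched directly (the population figure is invented):

```python
from itertools import product

# Sketch: 4 binary personality axes generate 2^4 = 16 categories; adding
# axes multiplies categories and thins the data available to each one.

axes = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

categories = ["".join(combo) for combo in product(*axes)]
print(len(categories))  # 16

# With a fixed population, finer structure means fewer people per
# category: lower bias from the finer cut, higher variance per cell.
population = 1600
print(population / len(categories))       # 100.0 people per category
print(population / len(categories) ** 2)  # 6.25 per category with 8 axes
```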

If you fixate on system 2 type thinking - logical, deliberate thinking - you have high variance. You don’t have variance problems that you can point to directly, just in the way that any single predictor will be higher variance than if it’s part of an ensemble.

But you also have massive pointable-to bias problems if you’re a man with a hammer. It means that the assumptions that you make to apply your hammer are out of line with the underlying reality. Man with a hammer syndrome points to high bias resulting from misfits in model and the world.

Basically the goal is to put all of the relevant (where relevant is defined as sufficiently low bias, where the assumptions fit the situation) models into an ensemble that can be low variance as well and so generate accurate predictions.

Audience: To Andrew

Examples are hard because any complex example won’t be able to decisively validate your point, even if your point is very general. There will need to be a lot of definitions if I’m going to communicate.

Bias-Variance tradeoff as a mental model for understanding modeling failures.

Every mind is constantly engaged in prediction and inference. Prediction gives us information for decision making, which is done constantly at a subconscious level and also more intentionally and slowly at a system 2 conscious level.

Bias-variance isn’t about the world, it’s about the model used to approximate the world. It’s at the core of why that model can only approximate, and it’s aimed at reducing the reducible part of the error.

Take a basic linear model. Negative correlation between time spent working out and body weight.

Models are about assumptions. Bias is about how closely your assumptions fit the actual model. Variance is about how much flexibility you allow in your model. Too much and it maps too closely to the training data.

One reason (not getting into the psychological) why Occam’s razor works: Occam’s razor is about simplicity. Simplicity is about lowering variance. But there’s generally an increase in bias - there’s a tradeoff that needs to be noted when modeling complex systems. But it’s so easy to overfit and miss widely that we like simplicity.

I could provide a full treatment of statistical analysis of thinking.

Simplicity is just a subset of the applications. The tradeoff between bias and variance is nonlinear, and so models can be way off if they allow for too much flexibility - small reductions in the degrees of freedom are huge.

The work here is in finding a lot of working examples to bring bias variance to an applicable point in my mind. It will work on literally every single model I make, but the practice in finding points of flexibility and assumptions will be great thought training. This is all about driving intuition - finding examples that drive intuition.

Meta modeling - an example doesn’t work if it’s not a principle component in the model for predicting an outcome. Take Venkat Rao’s optimism-pessimism good-evil dichotomy, with examples of nazis/communists at optimistic evil, religion at optimistic good, hedonism at pessimistic evil and police officers at pessimistic good.

We can imagine a prediction for behavior as taking the culture of the society or organization as a variable in the model. And that input is a model, itself. If PCA doesn’t say that it’s primary - if the thing is more driven by something else - the example feels invalid. There are 10 models for Japan’s failure, and all seem valid.

But these narratives are necessary for intuition! And so maybe non-primary examples are fine, just so that our minds can see the pattern and nod along. And start seeing it in other places. Take antifragility, example driven. If your examples are weaker, you need more of them to see the pattern clearly.

Since this is where Occam’s razor is from, there are probably equivalently powerful heuristics. Occam’s razor is likely weak in situations where it lowers variance at the price of heightening the bias too much, actually harming the prediction accuracy. But you maintain interpretability.

Ensemble methods

The mental models approach is about ensemble methods for thinking. Lowering variance by combination instead of by making assumptions that increase the bias. We want flexible models that make few assumptions about the structure of the data.

Levels of Analysis

  1. Truth is only right or wrong after you choose a level of abstraction at which to determine truth.
    1. The ways in which you verify alignment of an idea with reality change depending on the level.

Front and center evidence of this claim comes from the ever-raging debate over free will. In many cases, there’s a levels of analysis question - do we ask about free will at the level of beliefs, states of mind, and high level actions? Or do we ask about free will at the level of electrical signals and chemical reactions?

Often a hearty defense of free will looks something like ‘my brain state causes my future action’, with an implication that if the brain state had been different, the action would have been different. And what follows is that ‘I have free will because had I willed something else (had a different high level brain state), I would have taken a different action’.

There’s a different sense of free will at a lower level of analysis, where bodies of chemical interactions that drive hormone releases and trigger limbic responses seem clearly determined by natural laws.

Someone who takes free will to mean ‘my current state of mind determines my actions’ and someone who takes free will to mean ‘I have control over my high and low level mental processing’ can make identical predictions about what actions a person will take and what experience they will have. They can have close to identical models of the system. But because they answer the question of how free will is defined at different levels, failing to notice this will lead to ongoing debate.

  1. Scientific disciplines as distinguished by the grade at which we interact with reality, from lower level to higher level, and the relative ease of formalizing lower levels of abstraction

One clear distinguishing factor among the sciences is the level at which they explore reality. Take physics as the low level, dealing with the basics of matter and natural laws. One level higher is chemistry, exploring the composition of particular kinds of organic and inorganic matter with one another. Organic chemical interactions are composed in biology, where living systems contain networks of cells and neurons that leverage chemical reactions at scale to accomplish valuable high level behavior. Bodies of neurons firing in tandem across time create psychological phenomena, which can be studied at a behavioral level. Bodies of people interacting create even higher notions of social cohesion and tribal behavior in anthropology and sociology.

By seeking to understand systems at a new level of analysis, sciences make discoveries and run experiments that share many properties with one another that are not shared with lower or higher levels of analysis. The kinds of solutions that work well and the kinds of questions that it makes sense to ask shift with the scale at which the system is being analyzed.

The process of abstraction is what facilitates the transition between these levels of analysis. The molecule is a general composition of many atoms or other molecules, which have their instantiations. Equations that describe the bonding of these molecules in the abstract are highly predictive, and the continued composition of organic molecules turns into tissues and organs in biological systems. The choice to give labels to particular groupings of these molecules allows us to describe organs and organ systems succinctly and predictably.

One major power of abstraction is to go from a particular instance of an element to that element in the abstract. The periodic table is a wonderful example of an abstraction that cuts reality at the joints. Choice of shared structure: number of protons in the nucleus. Amazing descriptive and predictive value (see ‘Ontology is Overrated’).

That particular choice of shared structure creates an ordering over the elements, and there are umpteen properties that change in unison as we move across that ordering. Patterns in interaction fall out of that choice of representation.

We can imagine choosing other kinds of shared structure to represent elements, and losing out on valuable patterns that fall out of the current representation. But we also always want to ask what tradeoffs are made by embracing our current representation, and what patterns tend not to fall out of it.

Too High

For example, often there is value in a set of ideas, but that value is only apparent at a particular level of abstraction. Take regularization. The concept of regularization exists at a high level, and it captures many techniques - ensembling, weight penalties (L2, L1, weight decay), dropout, pruning, etc. The concept interacts with ideas around bias and variance in a model.

But the truth is that there are fundamentally different types of regularization. There are different ways that models overfit, and a given regularization technique tends to work for some of them but not others. And so there may be a model that overfits, and the advice is to increase the regularization strength. Let’s contrast ensembling and pruning. In pruning, you take a tree fit to training data and cut off the lower level splits it makes, which are the most likely to be overfit. In ensembling, the hope is that differences among the equilibria found by many overfit models let them combine into a system that escapes the problem.

This distinction matters because it’s not that your model is simply ‘too high variance’ in general - it’s high variance in a particular way. Perhaps it’s looking too much at particular datapoints, or looking at a feature set in just one of many possible ways that are predictive. In the first case you want pruning, and in the second you want ensembling. It’s not as simple as ‘high variance and overfitting, so increase regularization’. But if regularization is the level at which you conceptualize the problem, ‘increase regularization’ looks like the solution whenever training error is lower than validation error.
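The pruning-vs-ensembling contrast can be sketched numerically. Below is a minimal illustration of ensembling as variance reduction, with 1-nearest-neighbor standing in for a deliberately overfit model; the parameters and setup are illustrative, not canonical:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_predict(x_train, y_train, x0):
    # A deliberately high-variance model: predict the label of the
    # single nearest training point.
    return y_train[np.argmin(np.abs(x_train - x0))]

def prediction_variances(n_trials=300, n=40):
    """Variance of a single 1-NN prediction vs. a bagged ensemble of
    1-NN models, measured across freshly drawn training sets."""
    singles, ensembles = [], []
    for _ in range(n_trials):
        x_tr = rng.uniform(0, 1, n)
        y_tr = x_tr + rng.normal(0, 0.3, n)  # true function: y = x
        singles.append(one_nn_predict(x_tr, y_tr, 0.5))
        # Bagging: refit the same overfit model on bootstrap resamples
        # and average, hoping the individual errors partially cancel.
        boots = [one_nn_predict(x_tr[i], y_tr[i], 0.5)
                 for i in (rng.integers(0, n, n) for _ in range(25))]
        ensembles.append(np.mean(boots))
    return np.var(singles), np.var(ensembles)

var_single, var_ensemble = prediction_variances()
print(var_single, var_ensemble)
```

The ensemble’s predictions vary less across training sets than the single model’s do, without any single member being made simpler - which is exactly the sense in which ensembling differs from pruning.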

It’s easy for incoherence to develop at too high a level of abstraction. It’s also easy to ask questions that don’t have objective answers. This often happens when you want to summarize around a culture, or a social system. It feels necessary for efficient communication, but leaves systematic errors untouched.

We think of the world through hierarchical concepts. These concepts can be broken down into their individual parts. These parts can be further subdivided. We can then solve, think, or recombine these parts. This also lets us take advantage of levels of analysis/abstraction.

For example, most disagreements can be solved by decomposition. Two people talk about love, but one means brotherly love and one means romance. Obviously these are not the same. Higher-level concepts shroud reality. If we decomposed into component parts, often there is no argument. Novel words are often an output.

Levels of Analysis in Machine Learning Algorithms

I’ll think of multiple levels as being centered at ‘natural’, and go to low, lower, lowest, high, higher and highest.

It’s hard because there are so many ways to move ‘up’ or ‘down’ in levels of analysis. From the details to the conceptual, is one progression. But from

One frame for idea generation is at each level, you look at how / why / what. Leading question. Great leading question - what are the principles of x?

How much time should I spend at each level of analysis?

  1. Lowest
    1. Bit representation of memory. Hardware optimization for operations.
  2. Lower
    1. How to efficiently compute the mathematical operations involved (matmuls, sqrts, convolutions)
    2. Optimal implementation of the optimizer, model, etc.
    3. Kernel level improvements to critical operations
  3. Low
    1. The mathematical operations involved in each algorithm
    2. The mechanics of the optimizer (grad descent vs lbfgs, adam vs rmsprop)
    3. The mathematics of the model - the function class it represents / is biased towards
  4. Natural
    1. The natural level of analysis to understand machine learning at… is practical. Where you download a library and apply it to a task.
    2. How the algorithm works, at the level of its subparts and how they interoperate - model, optimizer, loss function, data.
  5. High
    1. A conceptual understanding of what tools can be applied to which problems
    2. A conceptual understanding of how the algorithm works. SVMs by max margin. Linear models by separating hyperplane.
      1. This is the level at which orgs are living contrarian truths
  6. Higher
    1. Principles of ml algorithms - simplicity vs. complexity. Bias vs. Variance
      1. Too high to differentiate orgs. I think.
  7. Highest
    1. Similarity. IID as allowing transfer between similar distributions. Compression as enabling generalization.

Scratch

I need to give a name to the idea that you need to look at the right level of abstraction to do transfer between problems / datasets.

His names:

  1. Causal Emergence
  2. Supervenience

This shared mutual information is just like empowerment in RL! I came up with this - the idea to use the mutual information between actions and outcomes, and between actions and environmental states, as the way to build a great representation for doing causal inference. But I never put this in abstraction... oh wait, I did! A good representation is at the right level to interact with reality! I suppose that doesn't explicitly point at causal inference.

Right! Causal relationships as conditional. If x then y type statements.

It's like my research with Shane... there's a very powerful idea that you should use the mutual information between causally related states to create representations. But I didn't decide to use causal intervention to measure the mutual information - I was just going to measure mutual information bare, to get the problem representation on which to learn the causal relationships.

'Effective information' as a rebranding of 'strength of causal relationship' lets him import all of causal inference. I guess they did need a name for strength of causal relationship.

And so the claim from last night is that as you move up levels of analysis to the right level, you maximize causal relationship to the outcome.

I wonder how effective information is measured between high dimensional state spaces. We need a many-to-many mutual information metric.

The reality is that if you want to learn causal relationships, there's a noise issue at the micro scale that gets much better when you abstract to the macro scale. And the scale of your interventions also changes. I'm in agreement that if you see it as a grouping of micro-scales there's not actually a violation of reductionism. But the effects of the macro representation on generalization are undeniable. You overfit otherwise.

There's a definitional issue with emergence. There are umpteen situations where finding patterns at the macro scale requires orders of magnitude less compute than finding them at the micro scale, where the length of the description of those patterns differs enormously. This feels like emergence. (It's one intuition pump for using the word.) This is also not a violation of reductionism. It's a claim about the efficiency (or difficulty) of computation.
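A toy calculation makes the effective-information claim concrete. The sketch below follows the standard definition of effective information as the mutual information between a uniform intervention over states and the resulting next-state distribution; the specific transition matrices are invented for illustration:

```python
import numpy as np

def effective_information(tpm):
    """Effective information (in bits) of a transition probability
    matrix: EI = (1/n) * sum_i KL( P(.|i) || P_avg ), i.e. mutual
    information under a uniform do() distribution over states."""
    tpm = np.asarray(tpm, dtype=float)
    n = tpm.shape[0]
    avg = tpm.mean(axis=0)  # effect distribution under uniform intervention
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = tpm * np.log2(tpm / avg)
    return np.nansum(terms) / n  # nansum treats 0*log(0) terms as 0

# Micro scale: states 0-2 transition noisily among themselves,
# state 3 is deterministic.
micro = [[1/3, 1/3, 1/3, 0],
         [1/3, 1/3, 1/3, 0],
         [1/3, 1/3, 1/3, 0],
         [0,   0,   0,   1]]

# Macro scale: group {0, 1, 2} into one state. The coarse-grained
# dynamics become deterministic, and EI rises.
macro = [[1, 0],
         [0, 1]]

print(effective_information(micro), effective_information(macro))
```

The macro matrix scores a full bit of effective information while the micro matrix scores less, which is the sense in which abstracting to the macro scale can strengthen the measured causal relationship.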

'Pass to higher abstraction layers' is how Scott talks about moving through levels of analysis.

The uniform assumption, the simplicity prior, Occam's razor - it brings additional information (a true assumption) into your representation of the macro state.

Scott's claim - All of the work is done by the prior, and if you put that prior on the micro states, you get the same result.

I would claim that in the real world, where you have to learn the relationships from limited data, there's a different form of gain in effective information that comes out of leveraging shared structure.

I hate that it's called 'causal emergence'. The idea that it's about the experimenter bringing information to bear on the situation is exactly right. The experimenter has a weighting over hypothesis space, a standard for representations that lets them create a better-than-default representation.

Topics

  1. A very grounded example of this would be helpful. Geopolitics, Marr’s levels, a software program.

Base Units / Types of Abstraction

Base Unit

Base Units show up to ground abstractions in physics (often atoms), mathematics (axioms) and computer science (bits). There will be abstractions built over these base units (say, molecules, theorems & functions, or types) and the system will follow hierarchically through composition of base units. These systems have strong ties to causal structure, where outcomes are the result of sequential operations over lower level units.

There’s another representation that’s much less grounded that we can use for comparison - ensemble modeling. Rather than mapping out a space from first principles, generate a set of representations that seems to break the space up (much like we’re doing here, breaking abstraction into types to compare their properties).

Consistency

There are different types of abstraction that show up in conceptual abstraction space. Types can either be mixed with one another, or done with consistency. [Add motivation]

For example, take a big organization. We can break it into a sales division, an engineering division, a marketing division, etc. This type of abstraction is a function of what the parts of the company do for the company.

Say now that you want to measure the extraversion of these groups. This involves a different type of abstraction: down to the individual people in each group, and then a function (a measure of extraversion, averaged) over the set of individual people.

We’re mixing:

  1. Function done in the company
  2. Set of People in group

And could have continued going to function at a lower level (say sales is divided into sales managers, sales call persons, lead generation people). And for some operations, like management structure, this is the appropriate way to deconstruct the space.

This works cleanly because the sets that underlie the abstractions are cleanly separated from one another.

It is oft overlooked that the abstraction process implies a choice - whether conscious or not, whether announced or not - about what the ‘base unit’, or atom, of analysis is. That is, there is no notion of meaningful abstraction, or creation of abstract objects, without some implicit choice regarding what is meaningful to consider fundamental. To use a mathematical analogy, we imply some basis with which to talk about the structures within the vector space.

The choice of base unit informs the structure of the abstraction process - e.g., the composition of the larger groups that form higher-level structures - and vice versa; whether we think of it as a choice of atomic unit or as a consideration of the abstraction structure is merely a function of available information or personal philosophy. Importantly, there isn’t a functional distinction between the two views.

Concretely: the interpretation of choice over the atom - call it the Atomic Choice Interpretation - is to consider the question, “What is the composition of the most specific instantiation in this abstraction chain?” or, “What is the smallest meaningful unit over which we can make abstractions consistent with the chain, and, in particular, that preserve the structure established throughout the chain?” Any of these ‘structures’ can be thought of as rules that govern the movement from one level of abstraction to another.

To note: in general, and especially for abstraction structures over objects in reality, the construction of abstraction moves upward. Since real, observed phenomena are specific instantiations, it’s the abstract classification that does not exist, and so must be constructed. The structure associated with (and that fundamentally defines) some abstraction over a set of instantiations then manifests as rules regarding the salient characteristics shared across all objects, and also regarding the axes of variation between the objects that are considered insubstantial and over which the abstraction glosses.

Returning to the Atomic Choice Interpretation (AC), the important decision and basic axiom regards an object - the fundamental base atom. This informs the way abstractions are created over the atom. Defining the base unit implies a canonical abstraction procedure: the abstract set consists of the fundamental object, and any instantiation procedure should specify an object that is of the same type as the chosen atom - that is, the atom.

Not sure if any of this actually makes sense. The idea is that if the atom is chosen, then the canonical abstraction is to ‘type of ThingThatIsTheAtom’. But except in cases where there already is an ontology (e.g., species groupings in biological trees, hierarchical social/employee status/power trees), what does ‘type of Thing’ even mean?

To illustrate the meaningful distinction of AC, we can consider an alternative regime: instead of specifying the atomic unit, we axiomatize the rule that governs the creation of abstraction. Instead of an implication of canonical structure by atom choice, the structure is itself made explicit, and the implication made regards the canonical atom choice. It should be understood that both interpretations of choice in an abstraction model are functionally isomorphic. However, the understanding of and thinking regarding the models changes depending on the philosophical outlook (for, all told, these distinctions are only philosophical); the kind of information that is specified is different in the two cases.

This is all worth making concrete, explicit, in an example: Suppose we have a set of human beings, all of which work for some company. There are, naturally, different roles in the company, and there are different people who perform different tasks, and thus are part of communities within the company that come out of shared or similar duties. There are many different ways to create abstractions within this particular framework; that is, many different rules of subsetting exist with regard to this overall set of people.

^This assumes that the abstraction structures moving up are consistent. ^^ Need to concretely define ‘abstraction chain’.

What is significant is that this implicit choice about the fundamental basis of measurement informs any evaluation of how ‘good’ or ‘consistent’ an abstraction or abstraction process is. It is only after this choice of atom that we can make further choices about what ‘good’ should mean. This notion of value assignment itself varies depending on the intention behind the abstracting; in particular, it seems most general, and thus here most meaningful, to consider ‘good’ness as a measurement of how some abstraction performs with regard to some objective. This is important to make explicit because it’s not obvious that this is how we should evaluate the ‘good’ness of an abstraction; if, for instance, we took goodness to be some measure against an absolute and fixed notion - e.g., some fundamental beliefs about some capital-T Truth about reality and the way it should be represented - then the evaluation criteria would be very different. However, the author considers it most useful to allow for a more general notion of objective, itself chosen by the abstractor.

To make the evaluation of an abstraction more concrete, we will choose a sufficiently general but still chosen objective, and in particular one that covers most practical uses of abstraction: the ability to use the abstraction’s structure to predict the membership of yet unencountered instantiations, or more generally, how well the abstraction generalizes to other examples of the abstracted unit.

Scratch

I’m going to enumerate the type of abstraction done in each of the examples I have, and attempt to categorize them in some coherent way (taking the idea of abstraction and trying to go to a lower level, dividing it into particular types).

The class dog contains properties of dogs. Instances of dog have individual properties and shared properties. This is an example of going from a set of objects into a concept that unifies them. But the other way to abstract is from individuals to species - they’re unified by the ability to interbreed. And so it’s a way to create a subset.

The appropriate intermediate abstraction here may be set theory - we divide a space into subsets based on some criteria or shared property.

Orange, Apple, Banana - unified properties

What does a function do, fundamentally? It takes some behavior, and creates a shared structure where the behavior is modified on the basis of arguments. It abstracts over individual instances of the thing that the function does. Say you want to create a box, and the thing that changes is the location.

The removal of particular aspects of a problem, like generalization in mathematics.

Structure as constraints? Properties as a particular type of structure (akin to an existence constraint)

  1. Shared Properties

    1. Dog vs. instances of dog (interbreeding)
    2. Dog vs. instances of dog ([categories of property] - shared snout size property, breed property, color property)
    3. The class for Orange, vs. instance of Orange
    4. Apple, vs. instance of Apple
    5. Banana, vs. instance of Banana
    6. Animal, vs instances of animal (dog, cow, etc.) - shared properties (eats, sleeps, makes noise)
    7. Fruits vs. instances of fruit (shared ripeness duration, water content, some color)
    8. Graphic object vs. instances (lines, rectangles, circles) where they share position, orientation, etc.
  2. Shared Structure

    1. Functions, generally
  3. Compositionality

  4. The ‘type’ of abstraction done tends to differ between discrete and continuous hierarchies

    1. Continuous hierarchies are compositional, and consistently use the same type of abstraction all the way up / down the hierarchy
    2. Discrete hierarchies can see abstractions made over different categories or types.

Counterarguments to Abstraction

Against Hierarchy

  1. The mistake of using a single tag for all elements (down in a hierarchy, say the way that books are categorized) rather than using many tags. Where your data (e.g., a book) must be declared to be mainly about one thing in order to be put in a single place on the shelf, but in reality is about many things. And in the internet world, we can switch from broken and overbearing single-property categorization schemes to flexible multi-property schemes.
    1. This is a more general argument against hierarchy
    2. A system of tags is… what? Feels like a hash table on top of a graph, where the nodes have relations but also have tags held in the nodes for immediate access.
      1. But what really happens is you drop the nodes and just add arbitrary categories to each datapoints.
    3. This is paradigm breaking. Instead of classification (is it in category 1 or category 2), the answer is ‘both’. Where your label vector has positive values for multiple classes… but that doesn’t even instantiate the concept properly. You have multiple values of 1, where labels don’t have to trade off against one another anymore.
      1. This leads to a new paradigm for file systems (as a graph rather than a hierarchy)
      2. General super powerful algorithm - every time you see a hierarchy, realize that it induces bias and check whether making it into a graph (or at least a DAG) can dramatically improve its value.
        1. Ex., Object Oriented Programming requires that the programmer define a static hierarchical class scheme
      3. Somehow, Clay Shirky didn’t realize this was a graph?
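The ‘hash table on top of a graph’ intuition can be sketched with a minimal in-memory tag index, where each item carries arbitrarily many tags and no tag has to trade off against another (the names and data here are hypothetical):

```python
from collections import defaultdict

class TagIndex:
    """Multi-tag categorization: each item may carry many tags,
    and each tag gives immediate access to its items."""
    def __init__(self):
        self.by_tag = defaultdict(set)

    def add(self, item, tags):
        for tag in tags:
            self.by_tag[tag].add(item)

    def query(self, *tags):
        # Intersection: items carrying every requested tag.
        sets = [self.by_tag[t] for t in tags]
        return set.intersection(*sets) if sets else set()

idx = TagIndex()
idx.add("Godel, Escher, Bach", {"math", "music", "philosophy"})
idx.add("The Art of Fugue", {"music"})
print(idx.query("math", "music"))
```

Unlike a shelf, the same book surfaces under every tag it carries, and queries combine tags freely - the ‘is it in category 1 or category 2’ question simply disappears.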

It’s natural that the strongest arguments against hierarchical reasoning come out of graphical reasoning. It’s a question of whether or not to lift the restrictions on relatedness, but in so doing lose the ability to generalize immediately across everything that falls under an abstract class in a hierarchy.

The claim is that if your conceptual scheme is rightfully hierarchically structured (say, every Harvard athlete is a Harvard student, every Harvard student is a college student) then learning anything about college students immediately generalizes to every Harvard student and every Harvard athlete, and everything you learn about Harvard students generalizes to Harvard athletes (to the degree that those categories are actually similar). So you learn that every college student has a transcript, and can presume that the athlete you’re watching on television has a transcript. Or you learn that every college student has a major, and can presume the same.

Stronger still are biological examples - you know that every hamster is a mammal, and every mammal is an animal. So in learning that every animal has a circulatory system, you know that your hamster has a circulatory system. In learning that mammals bear live young, you know that your hamster bears live young. That transfer is straightforward.
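This kind of ‘is a’ transfer is exactly what class inheritance gives a programmer: a fact attached at a higher level is inherited, for free, by everything below it. A toy sketch of the examples above (class names hypothetical, and glossing over biological edge cases):

```python
# A strictly hierarchical scheme: each class has exactly one parent,
# and properties flow down the chain automatically.
class CollegeStudent:
    has_transcript = True
    has_major = True

class HarvardStudent(CollegeStudent):
    pass

class HarvardAthlete(HarvardStudent):
    plays_sport = True

class Animal:
    has_circulatory_system = True

class Mammal(Animal):
    bears_live_young = True

class Hamster(Mammal):
    pass

# Facts learned about the general class apply at every level below.
print(HarvardAthlete.has_transcript, Hamster.has_circulatory_system)
```

The power and the restriction are the same thing: the transfer is broad and certain precisely because every relationship is forced to be ‘is a’.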

Once you allow for more flexible relationships (in this case lifting the exclusively ‘is a’ relationship constraint), it’s harder to do transfer with the breadth and certainty of a hierarchical scheme. Say you have a system of tags for books.

This is an example of somebody who fit the wrong data structure to his data and then, rather than getting frustrated with the bias induced by his choice, decided that the data structure itself was at fault. He noticed that books are difficult to tie into a single hierarchical conceptual scheme and went after hierarchy in general, rather than after the choice.

Inheritance in object-oriented programming has a similar flavor. The desire for mixins and the flexibility they bring is sensible - making valuable tools available and availing of them when it’s valuable is a natural approach to tool construction.

The value of forcing hierarchical structure, to get the transfer that it promises, may prove too valuable to give up. And so the push against hierarchy is a reaction to the overuse of a structure, or to the design of an entire language around that structure.

There’s a desire for synthesis that can be achieved after overly powerful languages lead to the overuse of complex features and to code complexity in general.

Transfer and Generalization

  1. Overfitting / Bias-Variance Tradeoff
    1. Grab too few datapoints, create a complex model, and assume that it will transfer cross-domain
    2. As the level of abstraction rises, there’s more data to be seen and so it’s easier to create a biased sample accidentally
    3. Large data also allows you to more accurately validate the abstraction
    4. Magic numbers… everything feels overfit
    5. Each Abstraction has to trade generality for nuance
      1. Without complexity (conditionals, context dependence, additional features / factors)
  2. Forming an abstraction in a way such that it fails to transfer
    1. Ex., Statistics / Probability Theory
    2. Wrong level of analysis to see across many tasks

Information Loss

  1. Information lost in classification, or as we move up levels of abstraction
    1. Ex, so much more information in a particular dog (its color, its personality, its sound) than in the category of dog in general.
  2. Information lost in binning
  3. Major pitfall is to have lost so much information that the representation isn’t predictive of important outcomes anymore
  4. Computational difficulties in maintaining information
    1. If you abstract over every property of the underlying object and hold the intersection of all of those abstractions, you may be able to keep the same amount of information. But now your model is a different type of complex.

Conflation, Unlike Objects & False Transfer

  1. Labeling a group by abstracting over objects that share one property but not others creates conflation between the unlike properties
    1. Properties shared among some but not all group members get generalized to being a property of the group
  2. When two objects share a property, a metaphor can be made between them. But the metaphor breaks down over some properties and not others. And so recognizing the limitations of transfer is a process in and of itself, that can fail and lead to erroneous generalization.
    1. Classic example is a metaphor to a positively connoted thing, to compliment or make something good. Or a metaphor to a negatively connoted thing.
    2. Russell conjugates in description (making the metaphor’s implicit abstraction concrete) are also good examples of the manipulability of transfer.
  3. Motte and Bailey - moving between defensible and indefensible definitions under the same word.

Excessive Height

  1. Height obscures underlying content, making grounding more difficult
  2. Greater height exposes more datapoints, making it easier to tell a story for any thesis
    1. Height demands more rigorous statistical claims
    2. Positive example generation is the default mode in making arguments / thinking about theses, and is much easier at height
  3. ‘Category error’ problems start to get bad…
    1. What is a category error? It implies logic, or fit, or rules / constraints around objects… There’s this question
      1. Ex., adding unlike units together, adding pressure to height
      2. Not all category errors made equal. Adding a height to a volume is understandable… it can give you an approximation of another volume.
        1. At a higher level of analysis, adding ‘space’ to ‘space’ seems fine… And so abstraction can fuzz the details.
    2. Often there are multiple representations, at different levels of fuzz, and you pick the one that allows it to make sense…
    3. Sympathetic view - Structure is preserved among unlike objects
      1. Sometimes, adding height to volume gives you stronger predictive validity
    4. Unsympathetic view - The heights of abstraction obscure the category errors that would be clear at a lower level
    5. With height, more category errors occur - breakdowns in the fit between objects at lower levels of analysis.

Abstraction as an Unseen Assumption

Lack of Grounding

  1. Remembering the model, forgetting the data

  2. Generalization from few (1) datapoint

  3. Lack of depth means it’s difficult to expose failed transfer

    1. Unlike Object & False Transfer
      1. Memetics
    2. Excessive Height
    3. Abstraction as an unseen assumption
      1. Reductionism
    4. Lack of grounding to abstractions
      1. Social reality
    5. Conflation

One way to see inductive / empirical / analogizing / abstractive thinking is as operating by doing transfer between similar datapoints. (As opposed to causal / deductive / rational mode, though this distinction becomes fluid.)

Abstraction alters:

  1. The amount of data informing inference
  2. The closeness of the connection between the objects in question

Abstraction tries to bridge a similarity gap between datapoints, but simultaneously makes more datapoints accessible for inference.

For example, abstracting from (time, money, attention) up to (resource) gives you many more relevant datapoints, but the intersection that’s relevant shrinks. For example, time as a resource is finite and unchangeable, in a way that money and attention are not. And so in crossing the similarity gap there’s conflation that can damage your ability to think, if you assume that you can always acquire more of a resource and expect that model to cleanly generalize to time. But you do want to generalize concepts like opportunity cost to all of them.

What is the value of additional data? When building a model over small data, it’s easy to overfit to those datapoints. Generalization fails because the probability distribution the data is drawn from has a range far outside what exists in the training data. Small data makes the prior more important, and so the models we fit need to be much simpler (often excessively simple) to maintain generalization capacity.

What damage can be done by attempting to bridge the gap? Conflation of unlike objects can lead to illegitimate inference. It’s critical to know which structure is shared and which is not - this allows clean navigation of the tradeoff, by only doing transfer from datapoints which are similar enough to a new instance to validate the transfer. Or by smoothing over differences between datapoints, giving smaller weight to those that are less similar.
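The ‘smoothing over differences’ strategy is essentially kernel-weighted transfer: every datapoint contributes, but its weight falls off with its dissimilarity to the new instance. A minimal sketch, assuming a Gaussian kernel on distance (the function name and data are hypothetical):

```python
import numpy as np

def similarity_weighted_estimate(x_new, xs, ys, bandwidth=1.0):
    """Transfer from past datapoints to a new case, weighting each
    point by its similarity (a Gaussian kernel on distance) to x_new."""
    w = np.exp(-0.5 * ((xs - x_new) / bandwidth) ** 2)
    return np.sum(w * ys) / np.sum(w)

xs = np.array([0.0, 1.0, 5.0])
ys = np.array([0.0, 1.0, 10.0])

# The dissimilar point at x=5 contributes almost nothing at x=0.5,
# so the estimate is dominated by the two nearby points.
print(similarity_weighted_estimate(0.5, xs, ys, bandwidth=0.5))
```

The bandwidth is the dial on the tradeoff: widen it and more datapoints become accessible for inference, but conflation with less similar cases grows.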

Tradeoff between working memory and conflation

Crudely saving working memory and dropping the distinctions in shared structure is often more efficient, but you eat the tradeoff whole. Doing deconstruction over an abstraction and making the relevant distinctions prior to transfer trades off against computational time and working memory, and so ideally you create abstractions that are clean enough to be efficient. Constantly going back and making distinctions is expensive, and so is the damage done to thought through conflation.

And so the abstractions that minimize conflation while compressing the representation bridge this tradeoff, improving efficiency and generalization in thought.

Topics

  1. This section could be about pitfalls / dangers, but it could also be about the strongest arguments to the core thesis presented in the book (which is much more fun).

Deep Truths

Known Deep Truths:

  1. Heights of Abstraction. Ability to describe vast amounts of territory & knowledge with a single word. Thinking at varying levels of analysis.
  2. Proper representation enables transfer and generalization, which is the core of all knowledge.
  3. Similarity is at the core of abstraction quality, abstraction creation, and associative learning.
  4. Overfitting & excessive height as dangers of abstraction in thought.
  5. Representation. The deep impact of a change in representation on problem solving, understanding, problem tractability.
  6. Composition is at the center of every major representation and system.

Contrarian Deep Truths:

  1. Manifold Conceptual Prior & Geometric Concept Discovery.
  2. Mathematization of thought.
  3. Meta-Structure and High Transfer (transfer between patterns, not data).
  4. Mathematization of form / patterns as data / structure learning / function & shape data analysis / Structuralism.
  5. Connections between objects / nodes as being more important than the nodes themselves.
  6. Abstraction is actually (partially) about founding a field of conceptual analysis, asking what the implications of your conceptual scheme are for your thought.

More Deep Truths:

  1. Composition is at the center of every major representation and system.

More Contrarian Truths:

Notes & Thoughts

  • I may want to rename this book to ‘Representation’

    • And also orient the content towards that
    • That’s the true motivating idea
    • I care about decomposition / deconstruction as well as abstraction
  • New definition - Abstraction is going to the general from the specific.

  • I need a section on how abstraction is core to intelligence, core to thinking

  • I need a section on how language and writing facilitate abstract thought

  • Another good source of examples: The creation of Theories (that unify) out of sparse facts is one of the most powerful examples of abstraction.

    • Take evolution. It was generalizing from the way that Darwin’s finch #1 and finch #2 populations transformed over time to the way that every single living creature has transformed across time.
    • It would have been interesting if Darwin had only learned something about the way that finch populations transition across time; some early zoologists might have been fascinated. But it’s the motion of abstraction and generalization that makes what he did immensely powerful.
    • Now when we come to a new animal we’re loaded with generalities (like evolution) about the way that creatures like that work which allow us to much more efficiently model what is happening with ostensibly ‘new’ data.
  • Elucidate the differences between abstraction and compression

  • Find cases where the whole is more than the sum of the parts. This is an upside to abstraction (emergence! Get this from particle physics and anywhere else that it’s grounded) that I haven’t elucidated (or seen, tbh)

  • Trade-off between compositionality in a concept representation and appropriate engagement, so that well delineated concepts can be flexibly recombined to hit a large part of semantic space and so that updates to concepts that have shared structure affect all of the concepts with that structure.

  • There are some contexts where ‘idealized abstraction’ just looks like going to the lowest level possible and dealing with all of the information that exists (say, you could write one-off code for every object you create). There are other contexts where the higher level representation (say, groups of people) has impacts as a collective that don’t exist at a lower level, and so the only accurate model of the system needs to abstract, understand the impact of the collective on the individual, and generalize from that.

    • The mysticism around ‘Emergence’ is just a modeling error where people only abstract in one way (say, down to cells) and don’t include something important like the interaction between cells in their reductionist model of the system. It’s like creating a graph without the edges. And so when those effects have manifest consequences at a higher level, it feels like they appeared as if by magic.
  • The ‘type’ of abstraction done as a structure that is transferred between domains. Ex, the function of the lower level, the units at the lower level, the properties of objects at the lower level.

    • Perhaps, in the abstract, we can enumerate many / most of the types
  • Can think of abstraction as used to learn a representation that non-compositional, non-hierarchical structures can reason about (as abstraction to the appropriate level of analysis reveals the structure that allows the problem representation and solution to be compressed / made maximally efficient).

  • A notion of cognitive fit is a general felt sense representation of similarity. Another form of tacit abstraction, without concretely knowing what the abstract object is.

  • Two modes - Hierarchical Compression, Shared Structure

    • Shared structure can come from the use of a tool or concept, or from the components that make up the tool or concept
  • Inverse Abstraction (Deconstruction, or Instantiation)

    • Generative - Creates new information
    • Break into component pieces, in multiple directions
      • Ex. Machine learning becomes Linear Algebra + Calculus + Probability Theory + Computer Science, which break into their own subregions
      • Ex. Scientific Field becomes Major Papers + Categories of the topic + Conferences + Major Researchers + Quality Sites
  • Biology, in creating an ontology, has tried to find unambiguous set separators with some success. Examine this abstraction in detail, it has very interesting properties.

    • Evolution divvied up the space for us to come and make abstractions over.
  • Need to distinguish between composition and abstraction. Composition implies hierarchy (even 1D), but abstraction is just an instance of hierarchy - not all hierarchies are abstraction.

  • Distinction between abstractions that cover all possible subsets, vs. abstractions that are ‘dirtier’

  • Proper level of abstraction example is to think about the economy - at the entrepreneur's level, you dislike volatility, as it kills you. But at the level of the system, volatility leads to growth. [Taleb]

  • Difference in kind when your abstraction models the causal model for the data, as opposed to some derivative features.

  • (Perhaps) You can only switch between type of abstraction if you use discrete steps. If it’s like coarseness vs. fineness, or on a continuum, you don’t run into the switching between types of abstraction problem.

  • Chunking in the memory literature as a form of abstraction, where patterns found in information let a person create a group that distinguishes those bits of information.

  • Contrast between variance as information and true probability distributions.

  • You can use the probabilities that people assign to events as well as their beliefs to infer the abstractions that they created over subsets of a class.

  • The ways that people violate results from set theory in the way that they use / apply abstractions as a source of content (in the same way that behavioral econ looked for violations of probability theory)

  • Language is an example of where abstraction has occurred in going from higher to lower levels; there are common roots to words, which were later differentiated to form specific use cases.

    • Chemistry and the naming of chemical compounds (IUPAC) has a very explicit way of doing this, and is a formal system
  • Speaking of differentiation, stem cells!

  • Idea: modularity (taking advantage of combinatorial structure)

    • Computer Science (Libraries, microcomponents, functions, etc.)
    • All of language (words as modules that are combined in flexible ways)
    • Chemistry Compounds
  • Fractal / layers of abstraction. Also, conceptualization. So many different words for referring to levels of abstraction.

  • Appropriate level of abstraction for economic intervention by the government - changing the structure of economic growth (legal system, economic freedom, monetary policy) rather than direct intervention in particular economic spaces (stimulus, public works, subsidies)

  • Tradeoff between volatility and expected value in improving your representations of concepts

    • Diminishing marginal returns to search - high expected value ideas first
  • Transfer and Metalearning are downstream from richer and dissectable representations, especially concept-level representations.

    • Some words are concepts themselves, but sentence and paragraph level representations let you encode the concepts that are a recombination of words.
    • Concept-level representations are the appropriate level of abstraction to transfer across domains and between modes in a world model, including transfer to human minds (interpretability) necessary for alignment, interruptibility, etc.
  • Ontological mistakes in thinking about representation learning

  • Concept Learning

    • Ungrounded
      • New embeddings from relational structure in knowledge bases
      • Concept Parsing
        • Heuristics for separating out concepts in a sentence. Or label it and learn to separate concepts, perhaps integrating part of speech tagging.
        • With concept parser, create concept embeddings
    • Grounded
      • Cross modal transfer for interpretable representations
    • Concept Level Representations through NMT
  • Optimize representations for transferability between tasks. This is “the real key” to useful abstract knowledge.

    • It’s a representation optimized for performance across many many tasks. That’s how you know which abstractions are worthwhile, which ones are shared.
  • Remembering the thought that probability distributions should be parameterized and approximate rather than assumed to be Gaussian or whatever. Look at the empirical distribution and map it with a parameterized histogram or something like that.
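The histogram idea above, sketched with numpy (the data here is synthetic and the bin count arbitrary - a minimal illustration, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic skewed data: a Gaussian assumption would badly misplace the mass.
data = rng.exponential(scale=2.0, size=10_000)

# Parameterize the empirical distribution as a normalized histogram.
counts, edges = np.histogram(data, bins=50, density=True)

def empirical_pdf(x):
    # Density estimate read off the histogram (0 outside the observed range).
    i = np.searchsorted(edges, x, side="right") - 1
    return counts[i] if 0 <= i < len(counts) else 0.0

# Near zero the empirical density is large, as the data demands - a fitted
# Gaussian with the same mean and std would put far less mass there.
print(empirical_pdf(0.1))
```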

  • It’s sad that generative outputs are either continuous or discrete. Whichever you choose, an important category of representation goes missing.

  • High and low in the network gets conflated with high and low in abstract space. In that space, the most general features (those shared among many datapoints) are curves and edges, and the less general features are class-specific. Which are the more abstract? The recombination of the input features? Or the more general features? If abstraction is that which is shared among datapoints, which features count as abstract?

    • By definition, concepts (like classes) are abstract. Lines and curves are concrete.
  • Topics in Computer Science

    • Abstraction and de-duplication
    • The way abstraction can simplify representations (simplify code)
    • The way abstraction can make interfaces flexible (with a single interface working well for all sub-cases, rather than many interfaces for all sub-cases)
    • Full survey of data structures as examples of abstraction and representation
    • Functional programming (monads)
    • Hiding implementation (as a downside)
    • Allowing flexible modification of implementation (without the users of the abstract interface having to modify behavior)
    • Classes as examples of abstraction
    • Functions / methods as examples of abstraction
  • See generalization as 1. Having something that generalizes and 2. Something to generalize over.

    • X generalizes to Y
  • See generalization itself as an operator that relates a thing that generalizes to a thing being generalized over.

    • How do we operate over that operator?
    • Generalization is a function. What higher level function takes that function as input, and operates over it?
    • What changes can be made to generalization itself?
    • What is generalization an example of, a single datapoint of?
  • Categorization as used to do search, or to make lookups of relevant information efficient

    • Google as dropping hierarchy: rather than search through intentional direction of user attention to a particular categorization scheme (Yahoo), it encouraged direct lookup of sites with a term
  • Standards for whether to create a static classification scheme:

    • Stability of the entities
    • Restrictedness of the entities
    • Clarity of the edges between entities
      • Is it possible to cleave nature at the joints?
  • All data structures as spatial, encoding different notions of closeness or similarity between datapoints

    • Hierarchy as spatial, keeping related information ‘close’
    • Write down exactly what kinds of similarity are encoded in the data structure
  • Centralize concepts as manifolds in your presentation of how abstraction works / what abstraction is.

    • Dude! Philosophers of the ontology of concepts desperately need 'concepts as manifolds' - concepts as a low dimensional surface in an extremely high dimensional space of possible concepts, where instances of the concept populate the surface as subtly different variants on the same conceptual manifold. They need concepts as efficient compressions, where decisions about representation are grounded in usefulness (predictive capacity, generalization & transfer, ease of computation, etc.)
  • The lack of naturally compositional categories is the source of failures and limitations of thought. Often something doesn't fall into category 1 or 2, but into both simultaneously. Our languages' inability to naturally compose those categories (and fit that category with all of the data of the two component categories, naturally and immediately) is damning.

  • Mutually Exclusive / Collectively Exhaustive representations (vs. entertaining violations of these - often this is what is meant by ‘clean’ abstractions)

  • From Superintelligence:

    • “Subtle and potentially treacherous challenges arise even in specifying the relatively simple motivation system needed to drive an oracle. Suppose, for example, that we come up with some explication of what it means for the AI “to minimize its impact on the world, subject to achieving certain results” or “to use only designated resources in preparing the answer.” What happens if the AI, in the course of its intellectual development, undergoes the equivalent of a scientific revolution involving a change in its basic ontology?4 We might initially have explicated “impact” and “designated resources” using our own ontology (postulating the existence of various physical objects such as computers). But just as we have abandoned ontological categories that were taken for granted by scientists in previous ages (e.g. “phlogiston,” “élan vital,” and “absolute simultaneity”), so a superintelligent AI might discover that some of our current categories are predicated on fundamental misconceptions. The goal system of an AI undergoing an ontological crisis needs to be resilient enough that the “spirit” of its original goal content is carried over, charitably transposed into the new key.”
  • Mathematize the entire perspective

    • Ex., set theory
  • Every concept is a distance metric

  • Huge, fundamental idea: Abstraction (and concepts) are fundamentally discrete. Go to continuity in conceptual space.

    • Instead of concepts, decompose into bundles of properties and intuition pumps
    • Instead of categories, have a continuous similarity score across every axis
    • Instead of losing data to creating more categories, use all of the data to inform every prediction and just actually weight all the data
    • The implicit weighting in categorization is 0-1 for every datapoint
    • So broken
    • Need to embrace fluidity
    • This is what minds that are better than ours will be able to do, naturally
      • Compression is healthy for generalization
    • Learn continuous compression
    • Generalization is actually about transfer - not about discrete abstraction.
  • To add to definitions:

    • Hegel’s substituting a part for a whole (James Carse Ch. 14)
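The ‘continuity in conceptual space’ note above can be sketched directly: replace the implicit 0-1 weighting of categorization with a continuous membership score, treating every concept as a distance metric. A minimal sketch, assuming each concept is summarized by a single numeric prototype (a toy stand-in for a full conceptual manifold; all names here are illustrative):

```python
import math

def soft_membership(x, prototypes, temperature=1.0):
    # Continuous category scores: membership is a normalized similarity to
    # each concept's prototype, rather than a hard 0-or-1 assignment.
    scores = {name: math.exp(-abs(x - p) / temperature)
              for name, p in prototypes.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

prototypes = {"small": 1.0, "large": 10.0}
# A datapoint midway between prototypes belongs equally to both categories,
# instead of being forced into one bucket and losing the other's data.
print(soft_membership(5.5, prototypes))
```

Every datapoint now informs every prediction with a weight strictly between 0 and 1, rather than the 0-1 weighting that hard categorization imposes.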

Related Fields / Idea Spaces to Cover

  • Object oriented programming
  • Bayesian Probability Theory
  • Set Theory
  • Language
    • Connotation
  • Analogizing (As an instance of transfer, read Douglas Hofstadter on the topic)

Resources Worth Creating

  1. To Create: Fundamental Questions in Representation Learning
  2. Create a list of accepted truths in DL. Especially those believed by the high status (Hinton, Bengio, LeCun, etc.) Ask which accepted truths researchers believe are wrong and why. Invert all.
  3. I want to create a curriculum that’s OLD. Exclusively papers from 2000 or earlier. (Hint: Old papers only cite other old papers. And check ICA.)
  4. Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
    1. A way (or set of ways) to measure the objective
      1. Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
    2. The downstream consequences of doing better or worse on the objective
    3. Compare two different networks over the objective
    4. The rationale (and intuition pumps) for the objective
      1. The counterarguments
  5. Apply inversion to all of the ideas in a frontline paper. When it works well, you’ve discovered something you think is true that others disagree with. And if it’s a foundational assumption, you can get started on making progress.
  6. How do we know what we claim to know? Ask this of a shortlist of ‘sacred beliefs’.
  7. Create a ‘sacred beliefs’ in representation learning list.
  8. Create a ‘consistently questioned beliefs’ in representation learning list.
  9. Why learn Discrete / Sparse Representations? Be able to give a fully fledged, fully throated defence and attack.
  10. Why ‘representation’ is this crazily important concept. The dramatic, windfall differences that come out of slightly different representations.
  11. Survey all possible papers I could push hard at in Abstract Representation Learning
  12. Explicate all my categories of idea as low level ideas
  13. Generate new categories of idea
  14. List out all of the goals for representation learning as a field and multiple pathways that would fulfill each goal
    1. Order the goals in terms of importance
  15. List out the unknowns, the missing categories, the assumptions behind the goals, and the mistakes
  16. List of likely to be true / likely to be false assumptions, and ways to prove or disprove each assumption
  17. Transfer between each related field and representation learning
  18. Decide what the goal is. Work backwards to research paths that accomplish the goal. Value parts of the research frontier insofar as they relate to the goal.

Book Organization

At this point I’d like to structure things more finely. Look at what’s general across what I’ve written, and get to a set of unifying high level concepts to make chapters out of. I’m going to list the painfully obvious and necessary topics (in order of necessity) first.

  1. Similarity
  2. Transfer / Generalization
  3. Compression
  4. Properties of Representations
    1. Composition
    2. Modularity
    3. Discreteness vs. Continuity
  5. Structure / Invariances / Types of Relationship / Pattern Recognition
  6. Pitfalls & Failure Modes
    1. Conflation
    2. Due to over-compression
    3. Leakiness
    4. Grounded vs. Ungrounded

Potential chapters

  7. The way that abstraction in language leads to unnoticed poor models of systems (abstraction in social systems, politics)
  8. The value of abstraction in transfer, in generalizing across tasks
  9. Power of abstraction, being able to represent huge amounts of information (information theoretic perspective) with extremely succinct concepts, being able to build on those representations
  10. Working memory limitations and the use of abstraction to overcome it
  11. Leaky Abstractions, which seem to summarize information but in reality expose the underlying information in a way that forces the user of the abstraction to be extremely careful with it - danger of building on top of leaky abstractions
  12. Importance of operating at multiple levels of abstraction. Bottom up and top down, for full understanding of systems.
  13. Doing tasks / communicating / operating at the proper level of abstraction, and the results when this is done at the wrong level (ex, human language being too high level for many tasks)
    1. Example from Physics - Newton is right at one level of abstraction, and wrong at another level of abstraction
  14. Truth is only right or wrong after you choose a level of abstraction at which to determine truth.
    1. The ways in which you verify alignment of an idea with reality change depending on the level.
  15. The discreteness of abstraction
    1. The idea that a canonical solution to the level of abstraction problem is to go to the lowest possible level, and this being prohibited by the discrete nature of reality
    2. The true representation of concepts is continuous (in the brain) and then discrete again - we’re operating two levels above the true representation.
  16. Scientific disciplines as distinguished by the grade at which we interact with reality, from lower level to higher level, and the relative ease of formalizing lower levels of abstraction
  17. The planning process and the way that we treat near concrete events at a low level and far events at a high level
  18. Holistic reasoning (broad, System 1) vs. sequential (Analytic, causal, System 2) reasoning
  19. Causal models, and the way we chain through reasoning up chain of abstraction
  20. Contrast causal inference with intuitive, probabilistic inference over some high dimensional ‘feeling about’ or ‘sense of’. Transfer between these types of domains is very different, as is the way we reason about the role of abstraction in them.
  21. Abstraction and generalization - generalization as a way to evaluate abstraction quality
  22. The discrete quality of abstraction, and the problems that result
  23. At a high enough level, questions start to encompass values, which are not objective, and people often don’t notice when their question is so abstract that it involves values that are arbitrary and the question has become subjective
  24. The way that information is lost up the chain of abstraction. Coarseness and fineness. Fundamental tradeoff between amount of compute necessary and the accuracy of the abstractions reasoned over.
  25. Abstraction and variance - because information is in variance, the high variance (not necessarily representative) portions of an abstraction may take over the conception of a system or a set of people. Contrast between variance as information and true probability distributions.
  26. Abstraction as compression. And compression as basically everything.
  27. When faced with a bad abstraction, people pick sides on how to think about the bad abstraction instead of going a level lower and in so doing immediately resolving all disagreement. (In the case of Islam: ‘religion’, which implies a false equivalence between Islam and Christianity that doesn’t capture the differences in secularization, enlightenment, empire (caliphate) building, and sharia.)
  28. Empiricism at different levels of abstraction -
  29. Language as a proxy for thought, and the way that the link of abstraction between thought and language shapes the way we experience ideas and communication.
  30. Inferential distance, where ideas build on each other and the lack of foundation makes it impossible to move forward with a particular idea.
  31. Overview of the fundamentally related big ideas - composition, hierarchy, transfer, causality, structure, compression, math, cognitive biases, information availability, memory, attention
  32. The construction of mathematics as an explicit capture of abstraction models - also, analogies between mathematics and other abstraction processes (NB, this is fundamentally doing mathematics, but philosophy aside…)
  33. Models - Mental models, predictive models, causal models - any map of a territory - as an act of abstraction
  34. Need for Abstraction in formal systems - defining something like a utility (Econ), an environment (RL), etc.
  35. Distinction between objects and relationships - graphical / relational frame, differences in abstraction necessary to account for this
  36. Clustering (or feature group clustering) as abstraction
  37. Temporal Abstraction
  38. As necessary for causal inference
  39. Alien philosophy - Can never do similarity without the generative process
  40. Operate without operating - doing transfer / rerepresenting one as another does this
  41. Intelligence Relevant Topics
  42. Having a model
  43. Abstract Knowledge Representation
  44. Ability to Generalize
  45. Learning / Adaptive
  46. Creativity
  47. Information processing
  48. Computation
  49. Goal accomplishment (ugh)
  50. Generality (over environments, tasks, representations)
  51. Memory (Working, Episodic, Long Term)
  52. Attention (though this feels like a limitation)
  53. To turn into general versions:
  54. Sensitivity score (MAML-based).
    1. Re-framed for abstraction: think of representation sensitivity as a way to do transfer. By finding the representation that is most adaptable (say, you have a body of recombinable primitives that effectively capture a decomposed representation of whatever your target is) it’s easier to quickly learn to deal with something unique. There are properties of representations that make this easier or harder. In many ways, overfitting looks like representing the world in a way that makes information about particular tasks difficult to reuse for other tasks. You can see episodic memory systems as enabling very fast learning and fitting, but also as representing information in a way that will be more difficult to generalize.
  55. Do close analysis of how the representation represents a single image.
    1. Credit assignment over particular filters.
    2. Look at the way the filters are recombined with one another - find simple examples of composition that models a particular part of the image.
    3. It’s useful to know which sub-parts of a representation are involved (say, causally) in making a prediction. If you can do that credit assignment, you can know which parts of the representation aren’t worth keeping (say, if you have constraints on total memory), where e.g. the availability heuristic pushes parts of the potential representation to spaces that are less accessible to attention the more time that has passed since those circuits have been activated.
    4. It can also be made clear where the representation can / should be made richer. Say you’re using a part of the representation that hasn’t been self reflected on deeply over and over and over again. When you decide to thicken the quality of the representation, those parts will be the low hanging fruit, the best place to start.
  56. Some filters will be composed with more or less other filters, which is a different metric than their importance. Which filters activate over the most images? At each level of the network?
    1. Can we use this metric of generality as a heuristic for transfer? Say, only filters that are sufficiently general get used in transfer?
    2. You want to see which parts of your representation get their value directly vs. getting their value through recombination with other parts of the representation. The parts that underperform in recombination may need to be made more modular or interacted with at a lower level of analysis as a standard step in improving the quality of the representation.
  57. Find ‘conflation’ in a representation. (this may be hard)
    1. Have a notion of which features should be recombined to create a higher level feature
      1. Look for overlapping activations where they should not exist (misclassified examples should be really good for this)
  58. Get to a ‘why’ for misclassified examples
    1. Look at the ways that the representation couldn’t distinguish between particular parts of an input, look at the mistakes made over 4-5 examples and diagnose them
    2. Are you allowed to publish a paper titled ‘why our networks fail?’
      1. This may be hard to get causal on, but could be extremely useful.
    3. It’s important to be able to introspect on the thinking that led to a particular decision so that the appropriate part of it can be updated (credit assignment style). In the absence of that, the entire mode of thinking or body of interconnected thoughts needs to take the update.
  59. Do VQ-VAE, but with a forward predictive model. Generative model of future, rather than present. Auto-regressive generative model.
    1. I guess this is what the forward predictive lstm over VAE state is, in a way.
  60. Take Ben Poole’s Gaussian Mixture Model VQ-VAE and apply it to something like world models (where you want this ability to go discrete or continuous)
    1. Is this idea for generative future prediction a thing in general? VAE + LSTM to do it is awesome, but is it the best in its class for that task?
  61. Check manifold learning hypothesis
    1. Dog manifold on animal manifold, for example
  62. Causal representation learning - which filtermaps have counterfactual impact on the output?
    1. Use filter-level dropout to estimate this
  63. Metric Learning is beautiful because similarity is foundational to compression, abstraction, and generalization, and the kinds of similarity that your representation exposes are a critical property of your representation. It’s an attempt to answer the question of what the most relevant axes for comparison are. The concepts that are necessary are the ones that look at similarity along one or a handful of features, not all of them; abstractions or concepts can thus be seen as a multitude of learned similarity metrics across different subsets of features of the data.
  64. One definition of Ontology is asking the question, what kinds of things exist and what kind of relations can exist between those things. Which is clearly a graph representation of reality, with objects and their connections.
  65. The mistake of using a single tag for all elements (down in a hierarchy, say the way that books are categorized) rather than using many tags. Where your data (ex., a book) must be declared to be mainly about one thing in order to be put in a single place on the shelf, but in reality is about many things. And in the internet world, we can switch from broken and overbearing single property categorization schemes to flexible multi-property schemes.
  66. This is a more general argument against hierarchy
  67. A system of tags is… what? Feels like a hash table on top of a graph, where the nodes have relations but also have tags held in the nodes for immediate access.
    1. But what really happens is you drop the nodes and just add arbitrary categories to each datapoint.
  68. This is paradigm breaking. Instead of classification (is it in category 1 or category 2), the answer is ‘both’. Where your label vector has positive values for multiple classes… but that doesn’t even instantiate the concept properly. You have multiple values of 1, where labels don’t have to trade off against one another anymore.
    1. This leads to a new paradigm for file systems (as a graph rather than a hierarchy)
    2. General super powerful algorithm - every time you see a hierarchy, realize that it induces bias and check whether making it into a graph (or at least a DAG) can dramatically improve its value.
      1. Ex., Object Oriented Programming requires that the programmer define a static hierarchical class scheme
    3. Somehow, Clay Shirky didn’t realize this was a graph?
  69. The way a grammar limits the space of possible sayable meanings
  70. The way a lack of concepts limits the space of sayable meanings
  71. Attempts at insanely high abstraction
  72. Duality (opposites) in the I Ching, Tao te Ching
  73. Hegel’s Thesis, Antithesis, Synthesis
  74. Abstraction as allowing the buildup of memetic knowledge across time (as distinct from language)
  75. Be explicit about the shared structure in abstraction across domains (computer science, representation learning, ontologizing in metaphysics, category theory, etc.) by writing down exactly what shared structure exists and how it manifests across each domain (in the style of Cover / Thomas on information theory, another consilience book)
  76. Interface as a major concept in modular composition.
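
The tag idea above can be sketched concretely. A minimal hypothetical version (item names and tags invented for the example): each item freely carries many tags instead of one shelf position, and a reverse index gives immediate access by tag, which is the ‘hash table on top of a graph’ intuition, minus the graph edges.

```python
# A minimal sketch of a tag system: each item carries many tags instead of
# a single position in a hierarchy. Item names and tags are made up.
from collections import defaultdict

class TagStore:
    def __init__(self):
        self.tags_of = {}                   # item -> set of tags
        self.items_with = defaultdict(set)  # tag -> set of items (the "hash table")

    def add(self, item, tags):
        self.tags_of[item] = set(tags)
        for t in tags:
            self.items_with[t].add(item)

    def query(self, *tags):
        """Items carrying ALL the given tags -- no single shelf position needed."""
        sets = [self.items_with[t] for t in tags]
        return set.intersection(*sets) if sets else set()

store = TagStore()
store.add("Godel, Escher, Bach", {"math", "music", "philosophy"})
store.add("The Information", {"math", "history"})
print(store.query("math"))             # both books appear under 'math'
print(store.query("math", "history"))  # narrowing by a second tag
```

Note that nothing here forces a book to be ‘mainly about’ one thing: membership in one tag set never trades off against membership in another.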

Potential Interactions: Properties of Thought & Information Processing

Uncategorized:

  1. Transfer
  2. Creativity
  3. Feedback
    1. Positive Feedback
    2. Negative Feedback
  4. Applicability (getting to the right level of abstraction)
  5. Generalization Error
  6. Extrapolation
  7. Interpolation
  8. Observation
  9. Regularity
  10. Prediction / Inference
  11. Validation
  12. Common Sense
  13. Logic
  14. Induction
  15. Deduction
  16. Tradeoffs
  17. Arbitrage (Between two models)
  18. Exploration - Exploitation
  19. Smoothness
  20. Manifolds
  21. Sparsity

Biases

  1. Selection Bias
  2. Measurement Error
  3. Latent Variables (in a causal context)
  4. Autocorrelation
  5. Distribution Error
  6. False Assumption

Structure / Reframes

  1. Graphical / Relational Structure
  2. Hierarchical Structure
  3. Temporal Structure
    1. Positive / Negative Feedback
  4. Recursive Structure
  5. Discreteness / Continuity
  6. Causal Structure
  7. Combinatorial Structure

Tools:

  1. Distributions
    1. Gaussian Distribution
    2. Pareto / Power Law Distribution
  2. First Principles
  3. Invert - Reverses the impact of a card
  4. Counterfactual
  5. Elucidate Assumptions

Increase Bias, Lower Variance

  1. Abstraction
  2. Compression
  3. Dimensionality Reduction
  4. Regularization
  5. Linearity
  6. Discreteness
    1. Binary / Multinomial

Increase Variance, Lower Bias

  1. Continuity
  2. Nonlinearity
  3. Conditional / Distinction
  4. Nuance?

Resources

  1. Working Memory
  2. Long-Term Memory
  3. Attention
  4. Processing Speed

Attributes

  1. Creativity?
  2. Curiosity / Intrinsic reward for compression progress
  3. Education

Outcomes

  1. Interestingness
    1. Novelty
    2. Feeling of Compressing model
  2. Originality / Uniqueness
  3. Interpretability

Citations (As well as links to as many things I’ve read in writing this as possible)

  1. Concepts - Stanford Encyclopedia
  2. Abstraction - Wikipedia

Reading List

  1. Philosophy
    1. Concepts - Stanford Encyclopedia
    2. Abstract Objects - Stanford Encyclopedia
    3. Abstraction - Wikipedia
    4. Books
      1. Fluid Concepts and Creative Analogies [Hofstadter]
      2. The Philosophy of Information [Floridi]
    5. Philosophers
      1. Plato (Forms)
    6. A Human’s Guide to Words (LW Sequence)
    7. Philosophy of Mathematics
    8. Specificity as Selecting a lower level element
    9. Gears as Metaphor for the Value of Decomposition in Causality
    10. Ontology Evolution: A Process Centric Survey
    11. Models in Science
  2. Cognitive Science
    1. Knowledge, Concepts and Categories [Lamberts]
      1. They wrote me down
    2. Knowledge Representation [Markman]
    3. Cognition and Categorization [Rosch]
    4. Categories and Concepts [Smith]
    5. The Big Book of Concepts [Murphy]
    6. The Origin of Concepts [Carey]
    7. Concepts [Margolis]
    8. Working Memory / Education - Chunking, Geoffrey Miller
    9. Cognitive Development: Its Cultural and Social Foundations
  3. Machine Learning & Artificial Intelligence
    1. Bengio, Courville, Vincent (2014)
    2. Building Machines that Learn and Think Like People
    3. Generality in Artificial Intelligence
    4. SCAN: Learning Abstract Hierarchical Compositional Visual Concepts
    5. Driven by Compression Progress
    6. Universal Intelligence: A Definition of Machine Intelligence
    7. Concepts, Ontologies, and Knowledge Representation [Jakus]
    8. A Survey on Metric Learning for Feature Vectors and Structured Data (Learned Similarity)
    9. Learning a Similarity Metric
    10. A Survey of Inductive Biases for Factorial Representation Learning
  4. Mathematics
    1. Encyclopedia of Distances
    2. Applied Category Theory
    3. Category Theory
      1. Categorical Foundations of Mathematics
    4. Equality vs. Equivalence in Mathematics
    5. Barry Mazur’s When is one thing equal to another thing?
  5. Linguistics
  6. Computer Science
    1. Simple Made Easy
    2. The Complexity Trap
    3. Ontology is Overrated: Categories, Links, and Tags
    4. Can Programming Be Liberated from the Von Neumann style?
    5. From Design Patterns to Category Theory
    6. Poker Solved, in part with an explicit Abstraction Step
    7. Lambda the Ultimate
    8. C4 compositional visualization scheme for software
    9. Bjarne Stroustrup interview with Lex Fridman, minute 42
  7. Conversation / Disagreement (Representation issues and false generalization)
    1. Constructive Feedback
  8. Uncategorized
    1. Judea Pearl interview with Sam Harris minutes 40-60
    2. The Art of Insight in Science and Engineering, Ch. 2
    3. the glass cage: automation and us
    4. The Information: James Gleick. The invention of categories by the Greeks.

Old Materials

Unstructured Content

Opening

Abstraction’s power is to allow the wielder to represent huge amounts of information in a single object. Layers of abstraction allow the entire universe to be summarized by a single node, on top of a hierarchy of lower level representations. The tool is dangerous, making ideas inaccessible and hiding distorted information. It’s foundational to the way that we think. This book is about how Abstraction operates. How it allows us to create, to analogize, and to build knowledge out of information.

Representing Information

Abstraction itself is a wonderful example of abstraction. From the child learning language to a college student studying category theory to a computer scientist designing a representation of the world, the term covers a staggeringly large span of tools and techniques across all modalities of thinking about reality.

When fields develop their own language to build ideas on top of one another, we see abstraction at work. The delineation of fields themselves is but an abstraction, one that attempts to create clarity and allow us to summarize and specialize and focus. But also one that separates ideas from one another.

Elaborate on scientific fields as abstractions that let us represent huge amounts of information under them. And transfer between them.

Book Summary

It’s called Abstraction. It’s about the way that compression allows us to climb to the heights of abstraction, encapsulating hilarious amounts of information into a single concept. About the way that we load our language with these concepts, implicitly transferring information across umpteen experiences across our own lives and the lives of all others who have held and used the concept.

Definitions

On definitions: What are definitions? The version that I like is that in practice, each concept is a bundle of properties. And so abstraction has a bunch of properties:

  1. Conceptual rather than concrete
  2. General rather than specific
  3. Compressed representation
  4. Abstraction as hierarchical compression (of lower level objects)
  5. Representation that allows for predictive generalization out of range (both interpolation and extrapolation)
  6. Details have been removed (salient features have been kept)
  7. Shared structure between particular facts has been identified
  8. Shared structure between particular facts has been labeled
  9. (Is it an abstraction / is it abstraction if we don’t label it?)
  10. Metaphor as implicit abstraction makes this distinction
  11. Higher up in a hierarchy
  12. Hegel’s substituting a part for a whole (James Carse Ch. 14)

And when a process has some of these properties we’ll call it abstracting. The hard question is which of these are necessary? Which are optional? Is the binary classifier for whether or not something is an abstraction whether it has a single one of these? Might it need all of them? Certainly, this body of properties is a surface from which you can explore a space of definitions that are all likely useful in their own way and close to what we mean when we say abstraction. Settling on one combination of properties with a set of rules like ‘these two are necessary, but if this third property is paired with this fifth property we can disregard the second property’ feels like a painful and terrible way to go about defining things.

What is the right way to think about things? I’d love to just introduce all of the properties and then leave the usage up to context. OR I could make a number of distinctions across the properties that tend to vary most often. And so we’d have a PCA-style set of principal component (variance maximizing) differences in usage. And each difference in usage would get a name that made the appropriate distinction (say, compressive abstraction, or labelless abstraction). And also make the distinctions between transfer and abstraction, between metaphor and abstraction.

Many properties of the defined object are actually consequences rather than causes of consequences. And then you can optimize the object for those consequences (say, optimize the way in which you abstract for generalization ability). But it feels unusual to define the object in terms of its consequences.

Need a word for the way that there’s a bundling of related traits into a single concept. Take ‘running’ as the general version of many animals running, say. It could be seen as using legs to move quickly. It becomes a metaphor quickly - “I’m running my mind over this paper”, or “my mind is running back and forth”.

Examples of abstraction that violate each property:

  • Conceptual rather than concrete
    • In object oriented programming, the creation of an abstract class is a solid example of an abstraction that’s just as concrete as its subcomponents.
  • General rather than specific (This one is hard, may want to stick with it)
    • Can argue that classical mechanics is more abstract than quantum mechanics (because it abstracts away the details), but it is actually less general than quantum mechanics.
  • Compressed representation
    • Non-compressive abstraction… This is really strong. Can’t find one after ~2m.
  • Representation that allows for predictive generalization out of range (both interpolation and extrapolation)
    • Going too high usually kills predictive power. Say you think in terms of blocs of countries (which leads to assuming that the countries with the blocs and their hierarchy of suborganizations all want the same thing)
  • Details have been removed (salient features have been kept)
    • This is quite strong.
    • Multiplication is an example of the level of detail remaining the same. (Actually wait, multiplication eliminates the particular numbers involved… nvm.)
  • Shared structure between particular facts has been identified
    • You can abstract over a single datapoint, so no. But it’s likely that you won’t do it well.
  • Shared structure between particular facts has been labeled
    • Stores as abstraction over the production / supply chain violate this. But they do create an easier to use, higher level simpler interface to the backend.
    • (Is it an abstraction / is it abstraction if we don’t label it?)
    • Metaphor as implicit abstraction makes this distinction
  • Creation of a hierarchy
    • Counterexample - properties are abstractions across animals…
      • Only if animals are your datapoints. If the concept of ‘claws’ is your property (the animal has claws) then you see each instance of claw as your lower level.
    • This also feels strong. You can almost always create a hierarchy.
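
The first counterexample above (an abstract class that is just as concrete as its subcomponents) can be made concrete itself. A small hypothetical sketch, with class names invented for the example: the abstraction is real code, yet it only defines an interface that concrete subclasses fill in.

```python
# A hedged illustration of the 'conceptual rather than concrete' counterexample:
# an abstract class in OOP is an abstraction, yet it is code as concrete as its
# subclasses. Class names are invented for the example.
from abc import ABC, abstractmethod

class Shape(ABC):            # the abstraction: no instances, only an interface
    @abstractmethod
    def area(self): ...

class Square(Shape):         # a concrete subclass filling in the detail
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

print(Square(3).area())      # 9
# Shape() itself cannot be instantiated -- it raises TypeError
```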

Implicit to this question is that the definition should be that which is invariant across all examples of abstraction. But this is overly focused on consistency, and not focused enough on a principal component analysis of the properties that are most key to abstraction. (On the battle between intuition and rigor in definitions).

If you care for rigor:

  • Details have been removed, keeping only the salient features
  • Do not pick conceptual rather than concrete (abstractions can be concrete too, say in programming)

If you care for intuition, ordered list:
  1. General rather than specific
  2. Shared structure (labeled)
  3. Conceptual rather than concrete
  4. Compressed Representation
  5. Details have been removed
  6. Representation that allows for predictive generalization

Justification for Ordered List

First, the things that were removed / didn’t make the list: metaphor as implicit abstraction is interesting (as a broader category of transfer exists, which perhaps should be the true title of this book, since we really care that we do transfer/induction successfully and not necessarily that we use abstraction to do it [whoa. I should think about that]).

General rather than specific is the most important defining quality.

On General rather than specific

This depends on a reference point of generality. Which means that the process of abstraction moves from the more specific to the more general, rather than being objective. Or this is a distinction - you may say that a category like ‘dog’ is abstract, but compared to what? Compared to base reality. But it’s extremely grounded in comparison to animal, or being, or object (of which it’s a sub-class). And so there are two versions of abstraction at work here - the conceptual (relative to the concrete), and the relatively more general (or broader) class of objects.

Types of shared structure (and so, types of abstraction)

  • Function
    • Leading to functional abstraction
  • Presence of a property / sub-object

Transfer

Abstraction is one way to map shared structure from one problem / solution / dataset to another. Transfer is what allows for efficient modeling and learning: upon encountering a situation, an intelligent being or system can bring patterns seen in data and problems from the past to bear on the new situation, and so act as if it’s already seen the ‘new’ situation before, informed by situations like it. Abstraction involves identifying a property or pattern that exists (usually across many datasets / problems / examples) and then naming that pattern in a way that is general across the examples (and so which you would expect to generalize to similar examples).

There are forms of transfer that don’t move from the specific to the general, but instead relate the specific directly to the specific - one quality example of this is metaphor. Metaphor works by pointing out that one object is like another object, without necessarily spelling out exactly in which way it is similar (often, the metaphor will draw upon many intuitive and emotional properties simultaneously). For example, take the metaphor ‘she has a heart of gold’. At one level it’s asking you to feel about her heart the way that you feel about gold. In abstract terms, you could say that the sense of awe you feel at the lustre and rarity and globally recognized value should also be felt about her heart. The way that metals can’t be changed, the strength implicit in it, the sense of being able to depend on it, and (more importantly) the body of felt associations that any person moving through the world has associated with gold get transferred to this different object, the heart.

Abstraction does transfer through surprisingly similar mechanisms. Over time we build up associations between concepts and the properties and emotions that are experienced in tandem with those concepts. When we bring a concept (take an example from above: strength, or rarity, or metal) to bear in a situation, the properties associated with that concept are naturally brought to awareness when the concept is used. The difference is that in abstraction, an explicit label is given to the set of shared properties. Take ‘strength’. What do strong things have in common? There’s an element of power, robustness, capability - there’s transfer between everything that has been described in the past and every other object with that description. And now we have an abstract object, strength, which holds that bundle of properties. This natural build-up of descriptive transfer allows us to use language to connect multitudes of experiences with one another without ever explicitly comparing them to each other. We merely have to experience the same abstract object in each context, and the transfer will follow naturally from using that abstract object to describe a new situation.

Transfer of solutions can be particularly powerful: once you realize that some abstract descriptions of problems share solutions, finding a way to represent your problem in the way that others are represented often gives you a surface for problem solving and decision making that didn’t exist before.

Transfer is at the core of induction itself. And even in deduction, the rules that it makes sense to trust are trusted for their ability to yield correct predictions in an inductive sense beforehand. Much of thinking is pattern matching between situations that you’ve seen in the past and the current situation, or is intuiting about a situation, which calls upon implicit knowledge that you have gained across time or that has been built into you through evolutionary transfer, where your instincts have been finely tuned for (say) social environments, letting you pick up on leagues of implicit knowledge emotionally, based on a set of ancient and natural instinctual responses. Fight, flight, or freeze responses are a pattern match to situations where that response saved the lives of beings in the past, and so are present as a form of transfer from situations where stimuli were similar enough to current stimuli to trigger the same response.
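
Pattern matching between past situations and a current one can be sketched as a nearest-neighbor procedure. A toy illustration with invented data, not a claim about any particular system: the prediction for a new point is transferred from the stored points most similar to it.

```python
# A sketch of transfer via similarity: k-nearest-neighbors predicts a new
# point from its most similar stored points. Data is invented.
import math

def euclidean(a, b):
    return math.dist(a, b)

def knn_predict(train, query, k=3):
    """Average the labels of the k points most similar to the query."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    return sum(label for _, label in neighbors) / k

train = [((0, 0), 0.0), ((0, 1), 0.0), ((1, 0), 0.0),
         ((5, 5), 1.0), ((5, 6), 1.0), ((6, 5), 1.0)]
print(knn_predict(train, (0.5, 0.5)))  # near the first cluster -> 0.0
print(knn_predict(train, (5.5, 5.5)))  # near the second cluster -> 1.0
```

The choice of distance function is doing all the work here: it encodes the basis on which we expect transfer from old situations to the new one to succeed.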

Machine learning algorithms are all driven by transfer. The goal is to observe some set of datapoints, and on that basis find a set of patterns that will allow knowledge about new datapoints to be inferred. The process of transferring knowledge can be driven by different notions of similarity (which is the basis on which you expect the transfer to succeed or fail). Often algorithms will weight the ability to do transfer on the basis of the degree of similarity to other datapoints, e.g. kNN. That set of principles is general across all learning algorithms, and while transfer in machine learning often refers to between-dataset transfer, each algorithm’s focus is this within-dataset transfer for generalization.

Similarity

Similarity, Distance, and Memorization

This is so so long in the making.

Think about: Distance and intuitive physics

Continuum of similarity:

  1. Discrete
    1. Equality / Identicality
    2. Property overlap
    3. Edit Distance
  2. Continuous
    1. Euclidean Distance
    2. Angles
    3. Cosine Distance
    4. KL Divergence
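
A few points along this continuum can be sketched with toy inputs: edit distance as a discrete measure (counting single-character changes) and cosine distance as a continuous one (comparing directions of vectors).

```python
# A sketch of discrete vs. continuous similarity measures. Inputs are toy examples.
import math

def edit_distance(a, b):
    """Levenshtein distance: number of single-character edits (discrete)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def cosine_distance(u, v):
    """1 - cos(angle between u and v): similarity of direction (continuous)."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - dot / (math.hypot(*u) * math.hypot(*v))

print(edit_distance("cat", "cart"))         # 1 (one insertion)
print(edit_distance("cat", "dog"))          # 3 (three substitutions)
print(cosine_distance((1, 0), (0, 1)))      # orthogonal -> 1.0
print(cosine_distance((1, 1), (2, 2)))      # same direction -> ~0.0
```

Edit distance can only step through a lattice of whole edits; cosine distance varies smoothly, which is what lets it capture partial, graded similarity.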

Think about a distance metric between shapes. Humans have an intuitive sense of the similarity between shapes, as a part of a general ability to judge the similarity of any two objects in an undefined intuitive space. But writing down that metric is incredibly difficult. And so much of the mission of representation learning research is learning how to compare arbitrary objects to one another through a distance metric that’s learned from experiencing arbitrary data.

Similarity is at the core of all learning and cognition. On the neuroscience side, an example is the heuristic that ‘neurons that fire together, wire together’, connecting sets of data that are similar in that they occur close by in the time series that a mind experiences. It also leverages another notion of similarity - if a and b fire together, then previous patterns a’ associated with the firing of a will also become associated with b. This transitivity quickly becomes a wonderfully nuanced and complex form of similarity.

In developing knowledge, there’s a value to knowledge that will generalize. Memorization is fine if you know that the data you see in future will be identical to the data that you’re seeing now. But there’s a continuum over the richness of similarity metrics, where memorization can be defined as the way similarity breaks down a short distance from the datapoints that have been memorized (or an inability to return an answer when you’re a short distance away). We get beyond memorization by finding ways to map similar datapoints to one another, and through that connection drawing conclusions about the properties that those similar datapoints will share (this depends on the ways in which the objects are similar). This is stronger than memorization in that when you encounter new information that hasn’t been seen before, you can compare it to what you’ve already seen and use that history to inform your understanding of the new experience.

Often memorization looks like a lookup, which is fragile. If you don’t find something that’s exactly equal to what you’ve seen in the past, you can’t return anything. The more effectively you can connect a new experience to a vast array of old experiences, the more data you’ll have to draw on when you make inferences about it.
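
The fragility of lookup can be shown directly. A toy contrast (invented data): exact-match memorization returns nothing a short distance from its datapoints, while similarity-based recall falls back on the nearest stored experience.

```python
# A sketch of memorization-as-lookup vs. similarity-based recall.
# The stored 'experiences' are invented toy data.
def lookup(memory, query):
    """Pure memorization: exact match or nothing."""
    return memory.get(query)            # None for anything unseen

def recall(memory, query):
    """Similarity-based: fall back to the nearest stored key."""
    nearest = min(memory, key=lambda k: abs(k - query))
    return memory[nearest]

memory = {1.0: "small", 10.0: "large"}
print(lookup(memory, 1.1))   # None -- breaks a short distance from the data
print(recall(memory, 1.1))   # 'small' -- generalizes via nearness
```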

One implication is that broadening the types of connection that you make between objects allows you to transfer more information and connect more types of object. Seeing more of an object’s properties creates a larger surface for comparison, enriching the kinds of connection you can make between your object and others.

As you move to a more expressive and nuanced similarity metric (ex. from binaries to a continuum, say from whether a person is good or not to how good a person is) the distortions that come from conflating datapoints that are relatively close but not identical with each other can disappear. But the tradeoff to eliminating that conflation is needing to store a much more complex representation of goodness for every person (and likely move from an intentional, deliberate process for thinking about it to an intuitive one).

Properties of data that make certain kinds of similarity metric more relevant:

Imagine trying to use edit distance for semantics, where you tried to map words’ conceptual similarity to the number of changes you’d need to make to the letters in one word to get to another word. The lack of overlap between a word’s spelling and its meaning makes this notion of similarity irrelevant. But instead, imagine looking at the similarity of the contexts in which words appear. Specifically, say, the number of occurrences of each word within a small 2-5 word window of surrounding words. Suddenly, words with similar contexts and meanings can be mapped to each other - e.g., ‘cat’ and ‘dog’ are often used with similar surrounding words, and so will be close to one another. ‘King’ and ‘queen’ are also used with similar surrounding words. And when they differ, they tend to differ in a way that has to do with the meaning of the words. So distance on this similarity metric is meaningful in ways that aren’t captured by edit distance. The design of distance metrics that capture the structure that matters for a task is extremely important.
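
The context-window idea can be sketched with a made-up toy corpus; the numbers only illustrate the mechanism, not real semantics. Each word is represented by counts of its neighbors, and words are compared by the cosine of those count vectors.

```python
# A toy sketch of context-window similarity: represent each word by counts of
# its neighbors within a +/-2 word window, then compare words by cosine
# similarity. The corpus is a made-up handful of sentences.
import math
from collections import Counter, defaultdict

corpus = [
    "the cat chased the ball",
    "the dog chased the ball",
    "the cat ate the food",
    "the dog ate the food",
    "the king ruled the land",
    "the queen ruled the land",
]

contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                contexts[w][words[j]] += 1

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# 'cat' and 'dog' share contexts, so they come out closer than 'cat' and 'king'.
print(cosine(contexts["cat"], contexts["dog"]))
print(cosine(contexts["cat"], contexts["king"]))
```

Edit distance would put ‘cat’ no closer to ‘dog’ than to ‘king’; the context metric recovers the distinction because it measures the structure that matters.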

Implication: languages need to merge semantics with their surface forms. A language that merged its word / letter representation with the actual meaning of the words (where, say, base sounds / letters represented the principal components of learned word vectors) wouldn’t require memorization to learn, because the mapping from concepts to reality would be on a continuum and its meaning grounded in the language itself rather than being an arbitrary mapping from a word to a concept.

Generator: Why does language learning require so much memorization? In general, needing to memorize indicates that you’re using a degenerate distance metric and that your learning will fail to generalize.

Similarity 1.0

What is this ‘shared structure’?

Underlying the concept of ‘shared’ is a notion of similarity.

Closeness to equality.

Question: Are all forms of similarity captured by ‘how much you have to change the object to get equivalence’, where equivalence is ‘these objects are the same’?

Topics: Relationship of abstraction to similarity

  1. One type of abstraction identifies shared structure across objects and compresses it into a single concept or abstract object.
    1. Similarity as existing over different features
  2. Type of similarity as a function of the nature of the thing being compared
  3. Types of similarity measures
    1. Discrete:
      1. Equivalence
      2. Edit Distance
      3. Number of properties in common
      4. (Having a property in common is similarity over that property)
    2. Continuous:
      1. Cosine Distance
      2. Euclidean Distance
      3. KL Divergence / Cross Entropy
      4. Wasserstein Distance
      5. Hinge Loss
    3. Generating Similarity Metrics
      1. Concept Representations
        1. Word Embeddings, same up to angle
      2. Networks
        1. PCA over learned representation
  4. Types of Similarity
  5. Have the same function / accomplish the same task, consequently
    1. Laptop is actually a functional abstraction, more than a compressive technique over sub-parts.
      1. But it also cares about the sub-parts… the Surface and iPad are functionally similar but are called ‘tablets’ instead of ‘laptops’.
  6. Use the same mechanism
  7. Have the same property
    1. Ex. shape, color, density
  8. Have a set of shared properties
  9. Question: Are all forms of similarity captured by ‘how much you have to change the object to get equivalence’, where equivalence is ‘these objects are the same’?
  10. Cognitive Fit, human notions of concept similarity

Relationship of Abstraction to Similarity

Similarity is a foundational concept that forms the basis of the ability to compress information across objects. When objects are similar, whether it be in their properties, or their constituent parts, or in their function, it becomes possible to transfer information from one object to another via awareness of this shared structure. In the extreme case where objects are identical or equal, we can compress massively - we can throw out one of the objects, and merely remember that it’s equal to the other object. On the continuum away from equal, we lose compressive power in that we have to trade the compactness of our representation of the object against the amount of new information that the object holds. As an example, it’s often much easier to store a deviation from an existing object than to construct an entirely new object. Say, ‘this is like headphones without the wires’ (for Bluetooth headphones), rather than inventing a new name entirely. ‘This is a phone, but smart.’ That modification of an existing representation is transferable across people who have the existing representation, and is compact in the amount of new information that exists (only the deviation). In the same way that human attention is drawn merely to what changes in the environment (and in general things that are unchanging fade out of awareness), representing new information as a deviation from an old representation is a classic example of efficiently taking advantage of similarity.
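
Storing a deviation rather than a whole new object can be sketched as a base concept plus a small diff; the concept definitions here are invented for illustration.

```python
# A sketch of 'store the deviation from an existing object': a new concept is
# a base concept plus a compact diff of properties. Definitions are invented.
def with_changes(base, **deviations):
    """New representation = shared structure from base + compact deviation."""
    return {**base, **deviations}

headphones = {"plays_audio": True, "worn_on": "head", "wired": True}
bluetooth_headphones = with_changes(headphones, wired=False)  # 'without the wires'

phone = {"calls": True, "pocket_sized": True, "smart": False}
smartphone = with_changes(phone, smart=True)                  # 'a phone, but smart'

print(bluetooth_headphones)  # only the deviation had to be newly specified
print(smartphone)
```

The compactness is the point: the new object costs only the size of the diff, provided the listener already holds the base representation.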

Similarity Over Different Features

Similarity can operate over all the features of an object. It’s often the case that features are intercorrelated, and so when enough features intercorrelate we tend to create a term or concept for that body of relationships. When objects are similar over one feature but not in others, there’s often conflation or confusion when an attempt to generalize from the workings of one object to another fails.

We can think of similarity over different features as often having very different properties - being measured in different ways, allowing or disallowing transfer in different ways, etc.

Measuring Similarity

There is a body of metrics for measuring similarity - the classic example is equality. When two systems are equal, they reflect one another perfectly. There’s no information in one system that isn’t reflected in the other system. And so we can do heavy compression. But equality is binary - the objects are either equal or not. And so it’s not granular enough to capture shared structure that is incomplete. Equality strongly limits the complexity of the objects that can be compared to one another, and so looser metrics are critical to modeling real systems.

Difficulty in describing continuous similarity, distributional similarity. Similar difficulty in describing distinctions, or breakages of similarity in distributions. General difficulty of reasoning outside of discrete space. Need for cognitive fit.

Types of Similarity

  • Difference from Equivalence (shared structure over metrics)
  • Cognitive Fit (similarity in human intuition)

Compression

  • Compression lowers working memory requirements
  • Compression lets you store more

Lower working memory requirements for compressed objects mean that more information can fit into working memory. The reality is that most thinking requires that you put multiple objects into slots in working memory and then watch them interact with one another, using the recombination of those objects, their properties, and relationships to generate insights or plans. With a compressed object representation, something that may have taken up three, four, or more slots in working memory can take up just one. (Functional abstraction is particularly useful in this context.) That means that at a high level of abstraction you can think through the interactions of tens or thousands of sub-objects and sub-sub-objects while looking at only a few high level objects which you hold in working memory.

Compression also likely has deep impacts on long-term memory. There’s a sense that total storage space is dramatically improved by not having to remember the details of situations, by representing the memory conceptually. There’s also a sense that recall is generative, and so in effect stored as a time series of neural activations. This further compresses the pure capacity necessary for memory.

Temporal Abstraction & Causality

One of the most fundamental types of abstraction is across time. We organize time into a natural hierarchy, from sets of seconds that make up each minute, to sets of minutes that make up each hour, to sets of hours that make up each day, and on up. Thinking in these terms is liberating. Imagine needing to make a plan, and having to do it at the insanely granular level of detail of seconds or minutes rather than weeks or months at a time. The ability to operate at a higher level is what makes planning feasible.

Thinking causally is one of the most powerful sources of our prediction capacity, and it’s enabled by this ability to abstract across time. One easy way to think about causal reasoning is to think about ‘what would have happened otherwise’, or counterfactuals. In order to simulate what would have happened with or without an event occurring, or whether a particular action was the difference between an important outcome happening or not happening, it’s necessary both to group time periods where the action could have happened together and to group time periods where the outcome is expected together. Otherwise, assigning causal impact to an event fails.
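
The hierarchy of time described above can be sketched as repeated roll-ups of a fine-grained stream; the readings here are invented toy values.

```python
# A sketch of temporal abstraction as hierarchical grouping: roll a stream of
# per-second measurements up into per-minute and per-hour summaries.
def roll_up(values, group_size):
    """Summarize each block of `group_size` values by its mean."""
    return [sum(values[i:i + group_size]) / group_size
            for i in range(0, len(values), group_size)]

seconds = [1.0] * 3600            # one hour of per-second readings (toy data)
minutes = roll_up(seconds, 60)    # 60 per-minute summaries
hours = roll_up(minutes, 60)      # 1 per-hour summary

print(len(seconds), len(minutes), len(hours))  # 3600 60 1
```

Planning at the ‘hours’ level means reasoning over one number instead of 3600, which is exactly the liberation the paragraph describes.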

Temporal abstraction lets us compare the same action or event happening at slightly different points in time to other experiences where that action or event happened at slightly different points in time. It makes our thinking robust to small shifts in the patterns we see between actions and outcomes.

Our mind is constantly making predictions (whether or not we recognize them as predictions) at multiple time scales. A prediction like whether or not a project will go as planned, which plays itself out over the course of weeks, is at an entirely different time scale than whether we’re going to get crushed or not by a heavy closing door (and in the case of a positive prediction, need to dart out of the way). Abstraction is what makes feedback at a meaningful scale possible. It also allows us to make predictions arbitrarily far into the future, filtering out huge messes of detail that could feasibly crop up in the interim.

Decomposition

Decomposition is inverse abstraction. You can do extremely direct transfer between them.

Linearity (a whole that is the sum of its parts) allows for decomposition and recombination, enabling solutions in linear problem solving that are difficult in non-linear systems.

The types of decomposition correspond to types of abstraction -

  • Functional Abstraction (We got here by other means)
  • Modular Abstraction
  • Property-based abstraction
  • Physical Abstraction (composition of parts)
  • Recursive Abstraction
  • Temporal Abstraction (We got here by other means)
  1. What kind of x’s exist?
    1. Ex, what kinds of deconstruction are there?
  2. What properties does x have?
    1. Elimination over the set of properties as fodder for new similar but different objects
    2. Take love. Say there are 7 types of love, all of which share some properties and not others. The deconstruction creates nuances, but also creates a space of possibilities where each parameter in the space is turned off or on (if it’s a binary property), or can be in one of many states (if it’s discrete non-binary) or can vary along its continuum.
  3. Modular Decomposition
  4. Functional Deconstruction
    1. How does x accomplish its purpose?
    2. For each step in that causal graph, create a node
      1. Ask what the function of each node is (why it is there)
      2. Generate new options that accomplish the same function
  5. Physical Decomposition
  6. Recursive Decomposition
  7. Temporal Deconstruction
    1. Often a process has a dependency set that implies a temporal ordering.
    2. Recombining the orderings of a process can dramatically change it.
    3. Parallelizable processes can be even more valuable to deconstruct, because they can be scaled arbitrarily.

Explicating an Example

Scientific disciplines are distinguished by the level at which they interact with reality, from lower to higher, and by the relative ease of formalizing lower levels of abstraction.

Take biology (say, using the cell as an example of an object in biology) to be a particular composition of a body of chemical interactions. Notice that across almost every object in biology, sets of chemical interactions characterize the interactions and effects we see in living creatures. In this sense, biology is more specific, and chemistry is a wider set of tools which can be combined to create objects in biology. In this context there's functional hierarchical abstraction, in that sets of chemical reactions paired together can be represented as a process that has some higher-level function or purpose. The grouping of these interactions (by location in the living creature, or by time, or by similarity in the kinds of bond between kinds of atom or molecule) drives a compressive process that lets us think about the abstract representation as a single entity. The upside to this representation is that often processes will involve the entity as a whole (say, the cell dying), and without the abstract representation it would be a challenge to label all of the sub-processes involved in cell-level phenomena like that without an explicit way to refer to it. There's a question of how many layers of representation is optimal for thinking about problems in the space. For example, a canonical representation of the biological levels of organization is (from the simplest to the most complex): organelles, cells, tissues, organs, organ systems, organisms, populations, communities, ecosystems, up through the biosphere. But between, e.g., tissues and organs there is a huge gap, for which having a representation may reveal sets of phenomena that would not have been properly conceived of otherwise. Chemistry is in a similar position with respect to physics, where most chemical interactions are driven by forces (e.g., the electromagnetic force between electrons and protons) identified in physics.
Application: Learning

The implications of abstraction for learning are multitudinous. The obvious place to start is that the concepts we use and tie together constitute our intellectual surface or profile.

Creating New Abstractions

Examples:

  1. Moving ideas out of and across domains
  2. The making of consilience, the reification of the unity of knowledge
  3. Invert creativity (non-novel value)
  4. Making the discrete continuous
  5. The mechanism that selects identities
  6. Both identity in this moment and the identity(ies) worth moving towards
  7. Meta-identity?
  8. Re-name the curse of dimensionality
  9. Invert chaos
  10. Statistical mechanics? This is how I've been describing it, as stat-mech vs. chaos.
  11. Need a temporal version of distributions
    1. Gaussian / Poisson process?
  12. Need a verb for mutual information, for 'has information about'
  13. Can say x is correlated with y, but can't say 'x is informationed with y'
  14. This is a problem because you can say 'anti-correlated', but can't say 'anti-has-information-about'
    1. Actually, never mind: correlation can say that the relationship is positive or negative because linearity implies that the relationship is always positive or negative. You can't do that with mutual information; it captures non-linear relationships.
  15. A word for the way that the future is always in the present
  16. Ex., how differently you treat someone you'll be with forever
  17. I consistently frame this in terms of the iterated vs. un-iterated prisoner's dilemma, but it needs a word
  18. Words for the different forms of meaning
  19. Meaning that comes from devotion / sacrifice to something higher than yourself
  20. Meaning that comes from connection, community, relationship
  21. The use of dramatic social upheaval for a fundamental values shift
  22. He needs an <insert word>
  23. He needs a social rite of passage
  24. He needs a social cleansing
  25. None of these are close.
  26. Making the binary probabilistic
  27. Map-territory conflation
  28. Concept-instantiation conflation
  29. There needs to be a word for the 'bayesian problem' where you just don't take the prior into account when doing causal inference.
  30. Ex., canonical success traits that work but that are underrepresented in the base population
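The mutual-information items above can be made concrete. A minimal stdlib sketch, using a hand-made discrete dataset (the function names and data are illustrative): Pearson correlation reports the sign of a linear relationship but is blind to y = x², while mutual information detects the dependence yet has no sign to report.

```python
import math
from collections import Counter

def pearson(xs, ys):
    # Signed and linear: can report positive or negative, or miss entirely.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mutual_information(xs, ys):
    # Unsigned and non-linear: discrete I(X;Y) in bits.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [-2, -1, 0, 1, 2] * 10
ys = [x * x for x in xs]              # fully determined by x, but non-linear

print(round(pearson(xs, ys), 3))             # 0.0 - correlation sees nothing
print(round(mutual_information(xs, ys), 3))  # 1.522 bits - MI sees the dependence
```

The asymmetry is exactly the "tragedy" noted above: MI can say "x has information about y", but it has no notion of "anti-has-information-about".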

These are some fascinating objects, many of them conceptual: a set of examples of creativity in finding patterns or shared structure between our own stories or experiences that can be named, and in being named can be efficiently referred to and more easily built on top of (due to the freeing up of working memory).

When a new concept is created and codified as a word (say, you give a name to the moving of an idea out and across domains), suddenly the ability to think and communicate with that concept is dramatically enhanced. In our example of moving an idea out and across domains, we may recall several examples to mind. Take Antifragile, by Nassim Nicholas Taleb. It's built out of the inversion of another concept (fragility), but does helpful work by allowing us to gather all of the examples of a particular property of systems under the same concept, and in so doing understand it more deeply and wield it more fluently.

Many problems can be discovered completely conceptually, and (often) would only have been discovered conceptually. People ignore selection effects all over the place. Importantly, they ignore base rates. In what I call a bayesian problem (where you fail to take the prior into account) there's a body of counterintuitive truths about the functioning of our minds that can be discovered and used. Without the conceptual representation, the ways selection effects drive data to violate naive expectations would go unnoticed.

A concrete example is the effect of college on students. Demographics consistently show wonderful outcomes for students who went to Ivy League schools. One reasonable counterhypothesis to the value of the schools' education system is that the students were selected for quality in advance.

Conceptual Schemes

Abstraction is actually (partially) about founding the field of conceptual analysis

It's about making the conceptual scheme explicit, along with the questions, answers, representations and assumptions that come out of that scheme.

It's about how we fit all of our data into our existing conceptual scheme, and about how to minimize the bias induced by our scheme (and optimize the bias that does exist for predictive value or other kinds of value)

Example 1: 'Products' are an abstraction over unlike objects. The natural attempt at decomposition yields 'product categories', grouping similar products and lumping them together. This aids transfer by letting you consider transfer within categories, which is between much more similar objects. Without a concept of product categories, many connections would not be made between products, because the general class of 'products' doesn't support those connections across all products. In many cases, we lack concepts like 'product category', and fail to make umpteen extraordinarily valuable connections because our conceptual scheme is insufficient.

Example 2:

Venkatesh Rao's conceptual scheme in Weirding Diary 10

In many ways, Venkat is writing in a different language. For better or for worse, his writing is dense with neologisms, many of his own creation. From a 2400-word blog post (Weirding Diary: 10) we get roughly 20 relatively unconventional mental models / neologisms. Venkat's own are separated from those of others below. This is a list of the concepts he tends to use that others rarely use, identifying features of his unique style of thought. This representation also lets us see the similarities between the types of idea he prefers (e.g., the economic thread through the Hemingway bankruptcy pattern and the free lunch).

  • Manufacturing Consent (Chomsky)
  • Hemingway bankruptcy pattern - gradually, and then suddenly (Hemingway)
  • Lion Values vs. Fox Values (Vilfredo Pareto)
  • Hanlon's Razor = Stupidity over Malice (Hanlon)
  • Circulation of Elites (Vilfredo Pareto)
  • Metcalfe's Law (Robert Metcalfe)
  • Free Lunch (Gambling / Economics)
  • Hobbesian battlefield (Thomas Hobbes)
  • Hydra (Greek Mythology)

  • Premium Mediocre (Venkatesh Rao)
  • Great Weirding (Venkatesh Rao)
  • Glamorous Institutions (Venkatesh Rao)
  • Vision Surplus (Venkatesh Rao)
  • Charisma Engineering (Venkatesh Rao)
  • Underdefended Underproductivity (Venkatesh Rao)
  • Literary Industrial Complex (Venkatesh Rao)
  • Counter-programming of ideological competition (Venkatesh Rao)
  • Script Unraveling (Venkatesh Rao)
  • Jankiness Worth Fixing (Venkatesh Rao)

This form represents a demonstration of the 'latticework of mental models' style of analysis and creativity proffered by Charlie Munger, Shane Parrish and umpteen others.

I want to express a technique for critiquing, generating and improving ontologies called conceptual decomposition.

Leading Questions for Conceptual Decomposition:

  1. What are all of the ways in which the concept is used?
  2. How is the way we use the concept misleading?
    1. What other valuable conceptual schemes are we pushed off of?
  3. What is insufficient about the concept?
  4. What gives the concept its value?
  5. What are many examples of the concept, and how do they differ from one another? What is truly invariant across them?
  6. Is the concept part of a larger conceptual scheme? What concepts does it block, or support?
  7. What is the simplest possible version of the concept? The most complex version?
  8. What are all of the definitions that exist?
  9. What is the concept often conflated with?
  10. What major assumptions does applying or using the concept make? When do these assumptions differ from reality?
  11. What are the differences between the concept in its breadth and the particulars of its instantiation?

This body of leading questions should establish nuance, distinctions, and ground for an improved conceptual scheme.

Kuhn Reaction

It’s easy to get carried away with a beautiful abstract theory, so I look to appreciate the conditionals on the structure of scientific research that will let me tune predictions to an instance. That said, many of these concepts carry cleanly to science that I’ve seen.

Still, applications abound. The easy examples from machine learning come out of questioning the basic methodology, as well as questioning the things that are worth spending time thinking about and studying.

My first major critique is the (relative) unimportance of understanding when it comes to the deep learning literature. There’s an objective function that looks like performance on a dataset, where a nicer objective function would have a very clear element that was about understanding the nature of the algorithm, whether it be the structure, the types of function that are fit, the properties of important alterations to the algorithm and reasons for their success or failure, etc.

It’s almost as if the space is insufficiently experimental. It would be wonderful to have a concrete working mental model for how these functions are created that we could use to intuitively make changes.

But this is also a local success. It's likely that there are more fundamental changes to the model (akin to going from feed-forward nets to convnets) that should be taking precedence over improvements to the existing thing. And there's the real tradeoff between marginal improvements on a current solution and fundamental changes that are much less likely to succeed but much more important when they do succeed.

This is also why it’s important to understand machine learning generally as well as networks. There are general principles that are much better evinced by other models, and the standard for understanding a change’s impact with a mental model of the system is in a much better place.

That said, Kuhn - he has this concept of paradigm, which I've mapped to (in my language) local equilibria. But he may say something importantly different, which I assume away. Still, the economic model (god, this really is how-to-think material) reaped rewards in explanation, which may mean that it reaps rewards in understanding reality.

The major (and obvious, so extremely important) insight is that there are assumptions that one has to make before any research can be productive. Those assumptions on the methodology and choice of direction are often more impactful on the value of the research than whether or not the research is done well within the paradigm. When those assumptions remain hidden it becomes impossible to condition on the situations that the assumptions were built around. It also creates intellectual “dead space” where valuable research could be but is not occurring, which the space will pivot to once they see weaknesses or limits to their methods. But if they cannot see those limits, it’s likely that those spaces go undiscovered. Limits require an expectation of performance.

I should think about the interrelatedness of fields as well, and watch as history cuts them up into “clean” classes and so defines away large swaths of research space.

This is worth thinking about.

Prelude

There’s this sense that the social sciences, psychology, etc. consistently debate the legitimate scientific methodology and the problems they should solve in their field. But in the natural sciences, despite similar amounts of change in those methodologies, there are no controversies over the fundamentals of the field.

Kuhn’s going to call these local equilibria “paradigms”.

Local incentives drive the problems that science tends to work on. Take automated predictive modeling and the time it’s taken to move anywhere with it despite its huge value, compared with the alacrity in deep learning.

He admits that his split into pre/post paradigm periods is too simplistic a model. “Much too schematic”.

Large open source projects, which seek to parallelize a body of work that needs to be done, have similar human-effort scaling problems to fields of study: many researchers are connected by the field as a whole, but there's a struggle to make problems modular and then connect them back into the body of research.

He says, repeatedly, that despite “scientific” standards, there end up being many legitimate schools of thought around a space, informed by different arbitrary modes of thinking about how a field works.

There are all sorts of meta questions that must be answered before any productive research occurs, and often answers are assumptions that the field makes about how things should be done. And because scientific education is rigorous and rigid, these assumptions “come to exert a deep hold on the scientific mind”.

“Research is a strenuous and devoted attempt to force nature into the conceptual boxes supplied by professional education.” Unreal.

I’m really looking forward to this. Also, to thinking about what it would look like to break out of as many conceptual boxes as possible. I sense a bias-variance tradeoff here - you either pigeonhole everything into a predetermined space (too high bias) or have so much freedom that your thinking can go in too many contradictory directions (too high variance).

Growth (the rejection of the old theory and the adoption of another) only comes out of competition between segments of the scientific community. Interesting how this is positional, likely with tribes, confirmation bias, the whole human nine yards. I wonder whether silent revolutions happen that nobody realizes, and Kuhn is only counting the ones that happen loudly in the open.

There’s this paradox where you need reputation in a field to impact its fundamental assumptions, which is reputation that you can only gain by following those assumptions. There’s a big difference between leading the community in a new direction and being an outcast. Much has to do with whether you have followers.

Similarity / Distance

Similarity, Distance, and Memorization

This is so so long in the making.

Think about: Distance and intuitive physics

Continuum of similarity:

  1. Discrete
    1. Equality / Identicality
    2. Property overlap
    3. Edit Distance
  2. Continuous
    1. Euclidean Distance
    2. Angles
    3. Cosine Distance
    4. KL Divergence
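These continuous metrics differ in what they treat as "the same". A small stdlib sketch with toy vectors (the vectors and values are illustrative): Euclidean distance cares about magnitude, cosine distance only about direction, and KL divergence compares probability distributions rather than points.

```python
import math

def euclidean(u, v):
    # Straight-line distance: sensitive to magnitude.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cos(angle between u and v): sensitive only to direction.
    dot = sum(a * b for a, b in zip(u, v))
    return 1 - dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

def kl_divergence(p, q):
    # KL(p || q) in bits; assumes q is nonzero wherever p is.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

u, v = [1.0, 2.0], [2.0, 4.0]   # same direction, twice the magnitude
print(round(euclidean(u, v), 3))        # 2.236 - "far apart" in Euclidean terms
print(cosine_distance(u, v))            # effectively zero - identical up to scale
print(round(kl_divergence([0.5, 0.5], [0.9, 0.1]), 3))  # 0.737 bits
```

Which of these is "right" depends entirely on which structure matters for the task: word embeddings are usually compared by angle, physical positions by Euclidean distance, and beliefs by divergence.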

Think about a distance metric between shapes. Humans have an intuitive sense of the similarity between shapes, as a part of a general ability to judge the similarity of any two objects in an undefined intuitive space. But writing down that metric is incredibly difficult. And so much of the mission of representation learning research is learning how to compare arbitrary objects to one another through a distance metric that’s learned from experiencing arbitrary data.

Similarity is at the core of all learning and cognition. On the neuroscience side, an example is the heuristic that 'neurons that fire together, wire together', connecting sets of data that are similar in that they occur close by in the time series that a mind experiences. It also leverages another notion of similarity: if a and b fire together, previous patterns a′ associated with the firing of a will also become associated with b. This transitivity quickly becomes a wonderfully nuanced and complex form of similarity.
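A minimal sketch of that heuristic, assuming three abstract units and a made-up activity history (the unit names, patterns, and learning rate are all illustrative): co-active pairs gain connection weight, and a′ ends up linked to b only indirectly, through their shared connection to a.

```python
# Minimal Hebbian sketch ("neurons that fire together, wire together").
units = ["a", "a_prime", "b"]
w = {(i, j): 0.0 for i in units for j in units if i != j}

history = [
    {"a": 1, "a_prime": 1, "b": 0},   # early: a and a' co-occur
    {"a": 1, "a_prime": 1, "b": 0},
    {"a": 1, "a_prime": 0, "b": 1},   # later: a co-occurs with b
    {"a": 1, "a_prime": 0, "b": 1},
]

rate = 0.1
for pattern in history:
    for (i, j) in w:
        w[(i, j)] += rate * pattern[i] * pattern[j]   # strengthen co-active pairs

# a is now wired to both a' and b; a' and b are connected only through a -
# the transitive form of similarity described above.
print(w[("a", "a_prime")], w[("a", "b")], w[("a_prime", "b")])  # 0.2 0.2 0.0
```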

In developing knowledge, there's value in knowledge that will generalize. Memorization is fine if you know that the data you'll see in the future will be identical to the data you're seeing now. But there's a continuum over the richness of similarity metrics, where memorization can be defined by the way similarity breaks down a short distance from the datapoints that have been memorized (or by an inability to return an answer when you're a short distance away). We get beyond memorization by finding ways to map similar datapoints to one another, and through that connection drawing conclusions about the properties those similar datapoints will share (which depends on the ways in which the objects are similar). This is stronger than memorization: when you encounter new information that hasn't been seen before, you can compare it to what you've already seen and use that history to inform your understanding of the new experience.

Often memorization looks like a lookup, which is fragile. If you don’t find something that’s exactly equal to what you’ve seen in the past, you can’t return anything. The more effectively you can connect a new experience to a vast array of old experiences, the more data you’ll have to draw on when you make inferences about it.

One implication is that broadening the types of connection you make between objects allows you to transfer more information and connect more types of object. Seeing more of an object's properties creates a larger surface for comparison, enriching the kinds of connection you can make between your object and others.

As you move to a more expressive and nuanced similarity metric (ex. from binaries to a continuum, say from whether a person is good or not to how good a person is) the distortions that come from conflating datapoints that are relatively close but not identical with each other can disappear. But the tradeoff to eliminating that conflation is needing to store a much more complex representation of goodness for every person (and likely move from an intentional, deliberate process for thinking about it to an intuitive one).

Properties of data that make certain kinds of similarity metric more relevant:

Imagine trying to use edit distance for semantics, where you tried to map words' conceptual similarity to the number of changes you'd need to make to the letters in one word to get to another word. The lack of overlap between a word's spelling and its meaning makes this notion of similarity irrelevant. But instead, imagine looking at the similarity of the contexts in which words appear - specifically, say, the number of occurrences of each word within a small 2-5 word window of surrounding words. Suddenly, words with similar contexts and meanings can be mapped to each other: 'cat' and 'dog' are often used with similar surrounding words, and so will be close to one another. 'King' and 'queen' are also used with similar surrounding words. And when they differ, they tend to differ in a way that has to do with the meaning of the words. So distance on this similarity metric is meaningful in ways that aren't captured by edit distance. The design of distance metrics that capture the structure that matters for a task is extremely important.
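A sketch of the contrast, using a hand-made mini-corpus (the corpus, window size, and helper names are all illustrative, not real data): edit distance puts 'cat' three changes from 'dog', while context-window similarity puts them close together and keeps 'cat' farther from 'king'.

```python
import math
from collections import Counter

def edit_distance(a, b):
    # Classic Levenshtein dynamic program over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

corpus = ("the cat chased the mouse . the dog chased the mouse . "
          "the cat ate food . the dog ate food . "
          "the king ruled the land . the queen ruled the land .").split()

def context_vector(word, window=2):
    # Count words appearing within `window` positions of each occurrence.
    v = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    v[corpus[j]] += 1
    return v

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    return dot / (math.sqrt(sum(c * c for c in u.values())) *
                  math.sqrt(sum(c * c for c in v.values())))

print(edit_distance("cat", "dog"))   # 3: spelling says they're maximally far
print(cosine(context_vector("cat"), context_vector("dog")))   # high: similar contexts
print(cosine(context_vector("cat"), context_vector("king")))  # lower: different contexts
```

This is the same intuition that co-occurrence-based word embeddings scale up: meaning is recovered from distributional context, not from the arbitrary surface form.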

Implication: A language that merged its word / letter representation with the actual meaning of the words (where, say, base sounds / letters represented the principal components of learned word vectors) wouldn't require memorization to learn, because the mapping from concepts to reality would be on a continuum and its meaning grounded in the language itself rather than being an arbitrary mapping from a word to a concept.

Generator: Why does language learning require so much memorization? In general, needing to memorize indicates that you’re using a degenerate distance metric and that your learning will fail to generalize.

Similarity 1.0

What is this ‘shared structure’?

Underlying the concept of ‘shared’ is a notion of similarity.

Closeness to equality.

Question: Are all forms of similarity captured by 'how much you have to change the object to get equivalence', where equivalence is 'these objects are the same'?

Topics:

  1. Relationship of abstraction to similarity
  2. One type of abstraction identifies shared structure across objects and compresses it into a single concept or abstract object.
    1. Similarity as existing over different features
  3. Type of similarity as a function of the nature of the thing being compared
  4. Types of similarity measures
  5. Discrete:
    1. Equivalence
    2. Edit Distance
    3. Number of properties in common
    4. (Having a property in common is similarity over that property)
  6. Continuous:
    1. Cosine Distance
    2. Euclidean Distance
    3. KL Divergence / Cross Entropy
    4. Wasserstein Distance
    5. Hinge Loss
  7. Generating Similarity Metrics
    1. Concept Representations
      1. Word Embeddings, same up to angle
    2. Networks
      1. PCA over learned representation
  8. Types of Similarity
  9. Have the same function / accomplish the same task
    1. Laptop is actually a functional abstraction, more than a compressive technique over sub-parts.
      1. But it also cares about the sub-parts… the Surface and iPad are functionally similar but are called 'tablets' instead of 'laptops'.
  10. Use the same mechanism
  11. Have the same property
    1. Ex. shape, color, density
  12. Have a set of shared properties
  13. Question: Are all forms of similarity captured by 'how much you have to change the object to get equivalence', where equivalence is 'these objects are the same'?
  14. Cognitive Fit, human notions of concept similarity

Relationship of Abstraction to Similarity

Similarity is a foundational concept that forms the basis of the ability to compress information across objects. When objects are similar, whether in their properties, their constituent parts, or their function, it becomes possible to transfer information from one object to another via awareness of this shared structure. In the extreme case where objects are identical or equal, we can compress massively - we can throw out one of the objects, and merely remember that it's equal to the other. On the continuum away from equality, we lose compressive power in that we have to trade the compactness of our representation of the object against the amount of new information that the object holds. As an example, it's often much easier to store a deviation from an existing object than to construct an entirely new object. Say, 'this is like headphones, without the wires' (for bluetooth headphones), rather than inventing a new name entirely. 'This is a phone, but smart.' That modification of an existing representation is transferable across people who have the existing representation, and is compact in the amount of new information that exists (only the deviation). In the same way that human attention is drawn merely to what changes in the environment (and in general things that are unchanging fade out of awareness), representing new information as a deviation from an old representation is a classic example of efficiently taking advantage of similarity.
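The "deviation from an existing object" move can be sketched directly as delta encoding (the "phone" dicts and field names are made up for illustration): the new object is stored as a base concept plus only what differs.

```python
# Delta-encoding sketch: a new concept as base + deviation.
base_phone = {"calls": True, "screen": "small", "keyboard": "physical"}

# "A phone, but smart": store only what deviates from the base concept.
smartphone_delta = {"screen": "large touchscreen", "keyboard": "on-screen",
                    "apps": True}

def realize(base, delta):
    """Reconstruct the full object from the base plus its deviation."""
    obj = dict(base)
    obj.update(delta)
    return obj

smartphone = realize(base_phone, smartphone_delta)
print(smartphone["calls"])                      # True: inherited from the base
print(len(smartphone_delta) < len(smartphone))  # True: the delta is the compact part
```

Everyone who already holds the base representation only needs to receive the delta, which is exactly why "headphones, without the wires" communicates faster than a definition from scratch.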

Similarity Over Different Features

Similarity can operate over all the features of an object. It's often the case that features are intercorrelated, and so when enough features intercorrelate we tend to create a term or concept for that body of relationships. When objects are similar over one feature but not in others, there's often conflation or confusion when an attempt to generalize from the workings of one object to another fails.

We can think of similarity over different features as often having very different properties - being measured in different ways, allowing or disallowing transfer in different ways, etc.

Measuring Similarity

There is a body of metrics for measuring similarity - the classic example is equality. When two systems are equal, they reflect one another perfectly. There's no information in one system that isn't reflected in the other. And so we can do heavy compression. But equality is binary - the objects are either equal or not - and so it's not granular enough to capture shared structure that is incomplete. Equality strongly limits the complexity of the objects that can be compared to one another, and so looser metrics are critical to modeling real systems.

Difficulty in describing continuous similarity, distributional similarity. Similar difficulty in describing distinctions, or breakages of similarity in distributions. General difficulty of reasoning outside of discrete space. Need for cognitive fit.

Types of Similarity

  • Difference from Equivalence (Shared structure over metrics)
  • Cognitive Fit (Similarity in human intuition)

Temporal invariance is so critical to abstraction. "I" am that which is invariant across time. Because HOW ELSE WOULD YOU ACCOMPLISH THE TRANSFER???

Who are you?

I am he who selects value systems. But as soon as you can see that, you become he who selects he who selects value systems. And so who you are is the level at which your self awareness of the emotions that are driving your decisions bottoms out due to the constraints of your working memory, attention and knowledge.

Generalize

Generalize to all labels. Labels have power insofar as they can be applied consistently across time. Generalization requires similarity in the training and test distributions.

"I" being that which is invariant across time is strong, because you expect transfer into the future. That which has varied in the past can vary in future. And so true destruction of self looks like varying every aspect of your identity, substantially, repeatedly. Until not even the fact or degree to which you vary your identity is consistent across time. This is a nod to that which stands the test of time, and to the lindy effect.

'Invariance' is just a particular nod to similarity. 'Non-changing' implies similarity across that which is invariant. And so it's just one of umpteen kinds of similarity. And similarity is the grounding for transfer.

In many ways, though, invariance is stronger than similarity when you're looking for a sense of certainty that the transfer you intend to pull off will actually hold. There's an ease of transfer when the match is exact - when the match is inexact, suddenly the properties or effects that do or don't successfully transfer are called into question. In the abstract (I can find a few examples later), there's a body of motions involved in transfer (which should be enumerated), which include properties being shared, solutions being shared, levels of representation being shared, and relationships between objects being shared. All of these depend on the similarity between the two high-level objects that contain these properties, problem and solution representations, objects and relationships. And so when you get invariance (which is closer to being identical than mere similarity), many of these properties transfer more cleanly than if there's a loose analogy or a continuous (rather than a discrete) notion of similarity binding the objects together.

First, this calls for a breakdown of similarity: discrete and continuous modes, and an analysis of the looseness or tightness of metaphors or connections between systems. There's a body of conditionals, where the interactions between the type and degree of similarity and the kinds of transfer that are or aren't acceptable have patterns that let you know whether the inference you're attempting is legitimate.

Second, this asks for a kind of breakdown of transfer - what kinds of transfer are possible, between what kind of objects, what kinds are common, what are the properties and failure modes of those popular kinds of transfer. Examples that have been explicated that spring to mind include reference class forecasting and k-nearest-neighbors, both of which can be analyzed statistically and intuitively. In both cases there’s some notion of a set of features or properties of each datapoint that has a notion of similarity attached to each feature / property, which in aggregate (usually through a weighted aggregation) can be compared to other datapoints and used for inference insofar as their similarity score to another datapoint is high.
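The k-nearest-neighbors case can be sketched in a few lines (the features, labels, and inverse-distance similarity are illustrative choices, not a canonical recipe): a new datapoint borrows its prediction from the stored experiences most similar to it, weighted by closeness.

```python
import math

def similarity(u, v):
    # Inverse-distance similarity over shared features (one possible choice).
    return 1.0 / (1.0 + math.dist(u, v))

def knn_predict(memory, query, k=3):
    # memory: list of (features, label). Returns a similarity-weighted label.
    nearest = sorted(memory, key=lambda m: similarity(m[0], query),
                     reverse=True)[:k]
    total = sum(similarity(f, query) for f, _ in nearest)
    return sum(similarity(f, query) * label for f, label in nearest) / total

memory = [([1.0, 1.0], 10.0), ([1.2, 0.9], 12.0),
          ([5.0, 5.0], 50.0), ([5.1, 4.8], 48.0)]

print(knn_predict(memory, [1.1, 1.0], k=2))  # between 10 and 12, nowhere near 50
```

Reference class forecasting has the same shape: pick the stored cases most similar to the current one, and let their outcomes, weighted by that similarity, drive the inference.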

It's likely that particular kinds of similarity (say, over one property but not others) lead to the ability to do particular kinds of transfer. We learn these relationships by spending time in the domains in question, and likely also learn them in the abstract (though I haven't looked at examples closely enough to say just yet).

Spatial and temporal invariance are grounded in physics. There’s a question of whether temporal invariance is special, whether spatial invariance is special. Temporal structure has a lot to say about this kind of invariance.

In what way should your transfer decay as your similarity score drops? Should you do transfer in the same way, but merely drop your confidence in the results? Obviously you have to look at the features across which the distributional shift occurred, and ask how those features causally interact with the outcomes you’re attempting to predict.

Temporal invariance has to be measured across multiple timescales. What is invariant at a large timescale (say, years) may not be invariant over the weeks timescale or the days timescale. And it’s the choice of the scale at which to analyze relationships that lets us do long term planning or abstract over the experience of entire civilizations.
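A toy illustration of the point, with an invented signal: stable at the year scale, variable at the week scale. All numbers are made up for illustration:

```python
# Sketch: a quantity can look invariant at one timescale and not another.
# Here a hypothetical daily signal oscillates within the week, but its
# yearly mean is stable.
import math

def daily_signal(day):
    """Stable long-run level (10.0) plus a weekly oscillation."""
    return 10.0 + 3.0 * math.sin(2 * math.pi * day / 7.0)

def mean_over(days):
    return sum(daily_signal(d) for d in days) / len(days)

# Half-week means vary noticeably...
week1 = mean_over(range(0, 4))
week2 = mean_over(range(3, 7))
# ...while year-scale means are nearly identical.
year1 = mean_over(range(0, 364))
year2 = mean_over(range(364, 728))

print(abs(week1 - week2) > 1.0)   # True: variation at the small timescale
print(abs(year1 - year2) < 0.01)  # True: invariance at the large timescale
```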

Dynamical systems in which interactions across time blow up the complexity of the model needed to capture them - since interactions can be measured at any of many possible timesteps - call for temporal abstraction to fit reasonable hypotheses to the system’s dynamics. Learning how to judge the timescale at which to do the analysis is therefore extremely important, and something we do intuitively.

Calming down, there’s a want to cover all invariances. Temporal and spatial invariances ground our thinking in intuitive physics. Which is strong and helpful, but not nearly as strong and helpful as being able to capture invariances in informational space generally, and not just physical space. There’s this sense that there are invariances across concepts, or across systems, or that are necessary for consistency that can be leveraged to make intellectual progress. And the focus on neuroinspiration comes with a focus on physics inspired invariances. But the principle is scarily general.

Value of Representations

Everything has a representation. Raw sensory content is represented to sensors as a stream of sets of pixels [should include whatever the brain uses], or frequencies of vibration. (Taste? Touch? Smell?) [Describe the way the brain re-represents reality so that creatures like us can interact with it]

Re-representing information is powerful - ex., re-representing information so that it can be transferred from one situation to another. Representing the internet as a graph (with links as edges between web pages) rather than as an amorphous concept or as TCP/IP allowed the use of PageRank (a graph algorithm) to turn into Google. Other representations of the internet don’t lead to the same solutions for searching it. The concepts you use to describe a situation carry implicit assumptions (a frame) that can be considered your default representation.

For example, when you represent numbers as Arabic numerals (2, 3, 5), multiplication becomes easy, for a few reasons: the number of digits of the two numbers being multiplied corresponds to the number of digits in the result, and you can use the distributive property to break a multiplication down by writing a number as 10*x + y - say, distributing multiplication by 32 into multiplying by 3 and adding a 0, plus multiplying by 2. When you use Roman numerals, things become much, much more difficult. This re-representation of our numbers turned a problem that used to require expert-level mathematicians into something that most 10 year olds can do reliably.
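The positional-numeral argument can be sketched in code. This is an illustrative decomposition (the helper names are my own), not a claim about how multiplication must be implemented:

```python
# Sketch of why positional (Arabic) numerals make multiplication easy:
# a number decomposes into digit * power-of-ten parts, and the
# distributive law turns one big product into small, tabulated ones.

def digits_with_place(n):
    """32 -> [(3, 10), (2, 1)]: each digit paired with its place value."""
    out, place = [], 1
    while n > 0:
        out.append((n % 10, place))
        n //= 10
        place *= 10
    return list(reversed(out))

def long_multiply(a, b):
    """Multiply via the distributive property: sum of partial products."""
    total = 0
    for digit, place in digits_with_place(a):
        # "multiply by 3 and add a zero" is digit * place
        total += digit * place * b
    return total

print(long_multiply(32, 47))  # 1504, same as 32 * 47
```

Roman numerals admit no such decomposition, which is why the same product was once expert work.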

One easy way to see how a change in representation can be valuable is to look at where a discrete representation of some information would be better off continuous, and where a continuous representation would be better off discrete. Imagine if, instead of measuring the speed of a car on a continuum, it were measured as a binary (fast or not fast) triggered at some MPH threshold. Immediately many more people would die in accidents, because drivers would be unable to make distinctions between levels of speed. And when it came time to debate the speed limit, instead of examining the hidden assumption (a binary speed representation), the debate would center on the number at which speed transitions from not fast to fast.
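A tiny sketch of the binary speed representation and the distinctions it destroys (the threshold value is an invented assumption):

```python
# Sketch: collapsing a continuous measurement to a binary loses the
# distinctions that downstream decisions depend on.

FAST_THRESHOLD = 55.0  # mph; the hidden assumption the debate centers on

def binary_speed(mph):
    return "fast" if mph >= FAST_THRESHOLD else "not fast"

# 56 mph and 110 mph become indistinguishable under the binary scheme,
# though they call for very different responses.
print(binary_speed(56.0) == binary_speed(110.0))  # True
print(binary_speed(54.0))  # "not fast", one mph from the opposite category
```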

Think about the re-frame from binary belief (where statements are true or not true) to probabilistic belief (where statements have higher or lower probability as a function of your knowledge relevant to the belief). This reframe dramatically improves thinking on many philosophical issues (from ‘is it just or not just?’ to ‘to what degree does it serve justice?’) and practical issues (ex., the probability of a cyber attack is low, rather than thinking that an attack will or will not happen).
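The probabilistic reframe can be sketched with a one-line Bayes update; all the numbers here are invented for illustration:

```python
# Sketch: replacing a binary "attack will / won't happen" belief with a
# probability that moves with evidence.

def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(H | evidence) via Bayes' rule."""
    numerator = likelihood_if_true * prior
    return numerator / (numerator + likelihood_if_false * (1 - prior))

p_attack = 0.02  # low prior probability of a cyber attack
# Suspicious traffic: ten times more likely under attack than not.
p_attack = bayes_update(p_attack, 0.50, 0.05)
print(round(p_attack, 3))  # belief rises, but stops well short of certainty
```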

Compositionality

Abstraction is how you construct a compositional conceptual hierarchy, and so the benefits of compositionality are akin to the benefits of quality abstraction. The answer generally given is informational efficiency: recombination of abstract concepts lets you hit many more possible meanings than having a particular concept for each importantly different object.

The conceptual building blocks automatically allow for changes at lower levels to impact many higher level recombinations of concepts.

Compositionality makes rapid learning possible through flexible generalization, without a need to see data for every recombination of parameters in an environment.

There’s a generality to decomposition + recombination for problem solving.

Creativity through recombination of existing concepts.

Compositionality is at the core of productivity in building objects, building software, building ideas, almost all creation.

Easier generalization, as there’s much more data informing a sub-concept in a conceptual hierarchy since you can draw from every instance where the sub part exists (abstraction extends dataset size)

Existing models of new data that can be decomposed into existing parts.

Decomposed problem / goal representations

  1. Combinatorial Representational Capacity
    1. Broadens space of possible meanings through finite concepts
  2. Elegant Updates
    1. Updates to a sub-concept flexibly updates all recombinations with that concept, capturing shared structure between more specific concepts that include the sub-concept
  3. Speeds learning through efficient generalization
  4. Creativity - creative solutions often take the form of recombinations of existing concepts
  5. Decomposition + recombination for problem solving
  6. Abstraction allows for learning about sub-concepts, broadening dataset to all instances of the sub-concept
  7. Stronger Generalization - existing models of new data that can be decomposed into existing parts.

Examples:

  • Building Objects
    • Car / Bike / Train = composing wheels + frame + engine / power
  • Building Software
    • Statements, variables, and conditionals compose into functions; functions and classes compose to flexibly and consistently solve software problems
  • Building ideas
    • All of language as concept recombination, each sentence merging composable meanings.
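The vehicle example above can be made concrete: a few part concepts recombine into distinct vehicles, and an update to a shared part flows into every composition that uses it. Class and attribute names here are illustrative:

```python
# Sketch of "Car / Bike = composing wheels + frame + engine": reusable
# parts recombine into many vehicles; updating a shared sub-concept
# (the wheel) updates every composition built from it.

class Part:
    def __init__(self, name, weight):
        self.name, self.weight = name, weight

class Vehicle:
    def __init__(self, name, parts):
        self.name, self.parts = name, parts
    def weight(self):
        return sum(p.weight for p in self.parts)

wheel = Part("wheel", 10)
frame = Part("frame", 50)
engine = Part("engine", 200)

bike = Vehicle("bike", [wheel, wheel, frame])
car = Vehicle("car", [wheel] * 4 + [frame, engine])

print(bike.weight())  # 70
print(car.weight())   # 290

# One update to the shared sub-concept updates every recombination.
wheel.weight = 12
print(bike.weight())  # 74
print(car.weight())   # 298
```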

The project of efficiently representing unbelievable amounts of knowledge generally relies on finding a body of recombinable patterns which can flexibly represent an extremely wide array of objects. Atoms up through objects work this way. Human and programming languages work this way. This is the most likely generative model for our universe.

Generalization

What is generalization?

The critical property of an abstraction is its ability to generalize.

Within a domain, learning often looks like generalization - in sports, you pattern-match one situation to similar past situations, and react instinctively in a way that your intuition hopes will lead to success. In mathematics, solving a problem with an algebraic manipulation once lets you recognize that type of manipulation in other situations, and often you then abstract the manipulation into a rule or an operation that you can run more flexibly.

But you’d also like to do out-of-domain generalization - learn something in one context and apply it in an entirely different one. Learning language in one domain (say, at home) and then applying it at school, with friends rather than parents, or in speeches. Take the concept of specialization and lift it from economics to understanding sexual dimorphism in evolution. Take the concepts of a replicator and differential selection from evolution and apply them to ideas and tunes and fashions that replicate by imitation. (add something about decomposition, modularity, causality)

Generalization as a Standard

Generalization accuracy is a great standard to hold abstractions to. What we want is for our representations to aid us in problem solving, which often takes the form of predicting what the impact of our actions will be, or what the state of our system will be in the future.

When adjudicating between representations and when constructing them, we’ll optimize for the representation that is easy to evaluate (simplicity heuristic) and that makes the most accurate predictions. And simplicity is also a part of accuracy - models that are simpler have the advantages of capturing more data (because in general they’re more abstract), being more robust to small differences between observations and so better able to capture the higher level regularity, and being straightforward to update when they’re mistaken.

Tradeoffs need to be adjudicated, and so having a downstream task to determine the appropriate lines for those tradeoffs is invaluable. Yet in practice people use other standards either as proxies for generalization or to preserve longer term value - truth is a classic example. Instead of asking if a representation is predictive, you can ask how closely it corresponds to reality via some similarity metric between your map of the territory and the territory. Often there are tradeoffs between the true model and the model that leads to the best generalization accuracy.

Even generalization accuracy should be seen as an intermediate standard or proxy, with utility as the true goal. And perhaps we should resolve the tension between predictive accuracy and utility in favor of utility. Many models or pieces of information that improve predictive accuracy do dramatic damage to agent utility (ex., noble lies - rights, objective grounding for values, sacred & religious beliefs, etc.). It’s not clear that predictive accuracy is the metric we want to evaluate these pieces of information by, and so that evaluation needs to happen in a decision function that adjudicates between standards based on their contribution to utility.

Alignment of Representation over Shared Structure

What this means is that you’d like to guarantee that making a single update will apply over the entire affected part of the representation.

For example, if you’re using the connotations of words to do the transfer that concepts need to be valuable as language, it’s possible to run into Russell conjugates (pairs of words with the same denotation but different connotations - ‘I am firm, you are obstinate’). In that case, updating one may not update the other; an ideal representation would recognize that those words share contexts, and make an update to one consistently lead to an update in all of its Russell conjugates.

Often it’s efficient to update representations compositionally, where an update to a part of the representation that’s used in many downstream components can effortlessly feed into those components.

This has implications for concept creation and usage. When you introduce or create a new concept, you split the data between it and the substitute set of concepts you’ve likely been using to communicate the same idea.

Abstraction in Reality

To read: The Glass Cage: Automation and Us, on abstraction creating distance from reality.

Example fodder:

  • Grocery stores as an abstraction over supply chain / production of foods
  • Machines as abstractions over the few parameters that need to be manually controlled in a process
  • Computer APIs are literally referred to as abstraction layers

Almost every interface is a wonderful example of abstracting away the details of the implementation behind it.

Many profitable opportunities (as well as ways to make people’s lives easier) involve abstracting away the painful details of a process that people need to go through. Food is one easy example. Instead of having people interact with the details of a complex supply chain and act as farming experts, grocery stores let people interact with a high-level interface (a shelf in a store) which hides all of the details of the work required to produce and transport the product, and reduces decision making to a comparison of prices.

This bears similarities to abstraction layers in computer science. These eliminate the need to deal with the details of (say) the implementation of a particular sorting algorithm by having a generic sort function that’s backed by code that can change without changing the way that the user calls that code. This allows for improvement and change over time without any need for the high-level behavior to change (so improvements can be seamlessly integrated), but also exposes just the few parts of the system that really matter for it to operate properly (like which foods to pick, in the grocery store example) and allows the other details to be abstracted away.

Leaky abstractions in computer APIs (where you actually need to understand the underlying implementation in order to use the interface or abstract representation properly) also show up in reality, where a store that’s serving as an interface to an implementation may not expose whether some food was factory farmed, or was exposed to pesticides or carcinogens, or was genetically modified. In the absence of this information, customers (or the FDA) would propose or demand that information about the details of the production process be made available to the users of the interface. In some cases (whether a food was prepared next to peanuts that could trigger an allergy, for example) there’s a clear need for somebody to have knowledge of the details of the production process in order to make reasonable decisions.
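A minimal sketch of such an abstraction layer: a generic sort interface whose backing implementation can be swapped without changing caller code. Both implementations here are illustrative:

```python
# Sketch: callers use a stable `sort` interface while the backing
# implementation can change freely behind it.

def insertion_sort(items):
    out = []
    for x in items:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def merge_sort(items):
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged = []
    while left and right:
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

_backend = insertion_sort  # the hidden implementation detail

def sort(items):
    """The stable, user-facing interface."""
    return _backend(items)

print(sort([3, 1, 2]))  # [1, 2, 3]
_backend = merge_sort   # implementation improves; callers are unchanged
print(sort([3, 1, 2]))  # [1, 2, 3]
```

The leaky-abstraction failure is when callers start depending on which `_backend` is installed, just as shoppers may need to know how the food on the shelf was actually produced.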

Consilience

The project of the unification of knowledge is grounded in finding shared representations across domains. Problems can be re-represented as one another, and intermediate, generic abstract representations (ex., data structures like graphs or hierarchies) can link the kinds of solution (algorithms, re-representations) that work across systems that at surface level look distinct.

The project of the unification of knowledge:

  1. Find a sufficiently general knowledge representation
  2. Map it to a body of mathematical structures whose frame can be used to automate transfer
  3. Fill the general knowledge representation with data
  4. Learn a notion of conceptual similarity that allows the comparison of arbitrary words, phrases, sentences, paragraphs, texts
  5. Link this to the internal brain state of a person thinking about the concept (Really, everything)

Examples

It’s extremely important to have a set of examples to reason from. I’m going to list a number (perhaps 3+) of examples for each component of abstraction that exists. I currently have two decompositions of the concept of abstraction: one over the domains that abstraction works in, and one over the contrast between idealized abstraction and abstraction in reality. I want clean examples for all.

  1. Computer Science
    1. Object Oriented - Classes as a collection of properties.
      1. The class dog contains properties that all dogs have (snout size, color, breed), and the classes for cat and for cow contain cat and cow specific information. Objects (particular instances of animal) instantiate the class.
      2. The class Orange, or Apple, or Banana. They’ll have the type of apple, the number of spots, the density of the orange peel, etc.
      3. Some graphic objects will have curves with eccentricity (oval), or an aspect ratio (rectangle), and those properties can be contained in their classes.
    2. APIs - an abstraction over implementations.
    3. Abstract Classes - A collection of properties of classes.
      1. Animals Abstract class, where each animal eats food, sleeps, makes noise, etc. Then Dog extends animal, cow extends animal, etc.
      2. Fruits abstract class, where each fruit has some weight, some time to ripeness, some water content, some color. Then apple extends fruit, orange extends fruit, banana extends fruit
      3. Graphic Object, where each graphical object has properties in common like position, orientation, line color, fill color. Rectangle extends graphic object, line extends graphic object, circle extends graphic object.
    4. Functions within Functions
    5. Recursion
  2. Mathematics
    1. Algebra, as opposed to manipulating scalars (particular scalars) directly
      1. The variable
        1. This can be my first and foremost example
      2. Abstract algebra and its complexities
    2. Functions as the abstraction of an operation
    3. Rules / Operations (operations as a codification of more specific manipulations)
      1. Addition
      2. Multiplication
        1. Exponentiation
    4. Category Theory
    5. Cartesian plane / geometry, as an abstraction over numbers and geometric objects.
    6. Bayes rule as built on top of conditional probability. Conditional probability built on top of set theory (the condition is a subset of the space) and general probability.
  3. Logic
    1. Abstraction over thoughts / beliefs / hypotheses
      1. Ex., If A is true then B is true => B is false, therefore A is false
      2. Ex., If A is true then B is true => B is true, therefore A becomes more plausible
  4. Probability Theory
    1. As an abstraction over relationships between arbitrary variables
  5. Physics
    1. Mechanics vs. Quantum
    2. The level of specificity/precision at which measurements of systems are/can be made
      1. Height as the appropriate level for measuring the meaningful content that humans care to model [Hikari on height], as opposed to the thing that we care about when we’re measuring height (say it’s predictive of age, and better would be the lower level chemistry of the organism that led it to grow to the height it is at)
      2. Choice of certain features as being more “fundamental” than other features, and some features as being manifestations of many interactions between other, lower-level features
  6. Language
    1. The concept of abstraction
  7. Society
    1. Harvard students vs. student organization / student concentrations vs. individual students
    2. Countries vs. organizations within countries / states within the country vs. people running organizations / collective population
  8. Pattern Recognition / Signal Processing
    1. Humans perceive color as discrete, whereas computers measure pixel values that are often still discrete but at a much finer grain, and in reality there are photons bouncing off of objects that absorb particular frequencies of light.
    2. Images with hierarchical structure, where edges and curves become shapes and shapes become objects
    3. Sounds to phonemes, phonemes to words
  9. Intuition
    1. Generalizing across dangerous animals by picking out elements like strength, sharp teeth, claws
    2. Generalizing across human emotion - anger looks different across different people, but shares enough characteristics that we label it in the abstract.
  10. Biology / Chemistry (This may simply be hierarchical structure… need to think about whether an abstraction is necessarily ‘virtual’)
  11. Atom, molecule, cell, tissue, organ, organism, population, ecosystem
    1. Atoms to molecules, ‘stability’ as important at the more abstract level, to maintain atoms in a particular structure.
    2. The same foundational building blocks (atoms, molecules, cells containing genes) generalize up to diversity in animals, plants, bacteria, viruses.
  12. Species - Genus - Family - Order - Class - Phylum - Kingdom - Domain
    1. Species - genus (unified by common ancestry)
    2. Individuals to species (unified by ability to interbreed)
  13. Periodic Table of the Elements (as a classification scheme that truly cleaves nature at the joints)
    1. Choice of shared structure: Number of protons in the nucleus.
      1. Amazing descriptive and predictive value (See ontology is overrated)
  14. Turning a continuum into a class
    1. Color
      1. Red, Orange, etc.
      2. People speaking languages that cut up color space differently remember colors differently than English speakers
    2. Length
    3. Weight
  15. Money
    1. As an abstraction over the value of labor / combined resource costs (I guess this is hierarchical compression)
  16. Machine learning is a wonderful example of abstraction, where all problems of predicting an output from an input can cleanly fall into an (X, Y) pair that is fed into an arbitrary algorithm.
    1. It’s pretty close to ‘function’ as a great abstraction - it’s the subset of ‘function’ where the utility of the function is making a prediction, or something nearby.
    2. This example of abstraction is about shared structure between the problems as well as functional unity.
      1. Are there other implicit meanings in the word abstraction than shared structure and functional unity? Result of a compositional process?
  17. Graph
  18. Relational Structure
    1. Object Oriented Structure
      1. Object (Entity)
      2. X is a Y relationships (Classification, Inheritance)
      3. X has a Y relationships (Composition / Aggregation)
      4. Properties of an Object
    2. Causal Graph - X leads to Y
    3. Dependency - X depends on Y
    4. Subject - Object relationships (in sentences)
      1. Linking verbs - ‘is’, ‘has’, ‘are’, ‘being’, ‘sense’ etc. between Object and Subject
    5. Co-occurrence
      1. Ex. Words mentioned in concert with one another
    6. Link - are connected
      1. Linkage Distribution
    7. Locality
    8. Edge Density
    9. Examples of Relational Structure
      1. Categories
      2. Connections
    10. Similarities between relational and hierarchical structure
    11. Number of connected components
    12. Topology & Graphs (manifold structure)
  19. Abstraction gone wrong
    1. Weapons of mass destruction is a beautiful example of abstraction gone wrong.
      1. Conflation
      2. (See page 1 of Deadly Arsenals)
    2. Illegal Drugs
  20. All online systems of tagging, tags, tag creation, category creation
    1. Ex. User-based tags of videos
  21. Libraries
    1. Categorizing books, ex. The Dewey Decimal System
    2. The Library of Congress categorization scheme
  22. File systems
    1. Directory, sub-directory, sub-sub-directory
  23. Bookmarks
    1. As user-created categorization schemes
  24. Programming Sequence of Abstraction
    1. Machine Language
    2. Assembly Language
    3. Fortran / C / C++
    4. C# / Haskell / Javascript
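The abstract-class examples earlier in this list (an Animal abstract class, with Dog and Cow extending it) can be sketched directly; the specific methods are illustrative:

```python
# Sketch of the Animals abstract class example: shared behavior lives
# in the abstract class; Dog and Cow extend it with specifics.
from abc import ABC, abstractmethod

class Animal(ABC):
    def sleep(self):          # behavior shared across all animals
        return "zzz"
    @abstractmethod
    def make_noise(self):     # every animal makes some noise
        ...

class Dog(Animal):
    def make_noise(self):
        return "woof"

class Cow(Animal):
    def make_noise(self):
        return "moo"

print(Dog().make_noise())  # woof
print(Cow().sleep())       # zzz, inherited from the abstraction
```

Note that `Animal()` itself cannot be instantiated: the abstraction exists only to be extended, which is exactly the sense in which it sits ‘above’ its instances.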

Structure as Abstraction over Relationships

Structural Thinking

The key to low-bias models is alignment between the model’s structure and the relationships in the data - not merely the representational capacity of the model. So rather than thinking of capacity or representational breadth as sources of low bias, we should recognize that bias arises when a model’s structure misaligns with the structure present in the data. Bias is not a property of a particular model or algorithm, but comes out of the interaction between a particular model and the dataset it’s trained on. That said, increasing the representational capacity of a model will expand the space of functions that it’s capable of finding through optimization, and so will be one proxy for how close the function that model can learn is to the real relationships that exist in the data.

In contexts where bottom-up hierarchical (compositional) models are applied to data with bottom-up hierarchical structure (audio, vision), we see strong generalization. This alignment makes generalization easier for the same reason it’s straightforward to get strong generalization with linear regression over linear relationships: whether interpolating or extrapolating, if the genuine relationship was captured it will continue to hold outside the training data range. If the structure in the training data isn’t captured by the model, we require heavy regularization (in the case of networks, lots of dropout and weight decay) to learn a relationship that will generalize to the validation set. (cite the comparison)
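The linear-regression point can be made concrete with a tiny sketch (synthetic data, closed-form least squares):

```python
# Sketch: when the model's structure matches the data's structure
# (a linear model on a linear relationship), the fit extrapolates
# cleanly far outside the training range.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Training data from a genuinely linear relationship, y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]

a, b = fit_line(xs, ys)
# Extrapolating far outside the training range still holds, because
# the captured relationship is the real one.
print(a * 100.0 + b)  # 201.0
```

A high-capacity model fit to the same four points could match them equally well, yet extrapolate to almost anything at x = 100 - the capacity isn’t what buys generalization; the structural match is.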

Implications for Research

  1. Intuition for Model Creation
  2. Transfer Learning
  3. Multi-Task / Multi-Dataset learning
  4. AutoML

Intuition for Model Creation

In choosing which models to build, a common approach is to take structure that is present in important datasets but cannot be learned, and generate models that incorporate that kind of structure. A classic example is learning long-term dependencies in RNNs, which motivated gating mechanisms and cell state for maintaining information across time. This source of inspiration differs from, say, neural inspiration.

Transfer Learning

One way to dichotomize transfer across domains is to separate feature-wise transfer (transferring information about relationships between specific features) from structure-wise transfer. Structure-wise transfer involves abstracting up from particular relationships to the types of relationship that exist between sets of features. (Write more about structural transfer, add example (probably word embeddings))

Multi-Task / Multi-Dataset Learning

Learning structure from data rather than encoding it manually will require multiple examples of the type of relationship being learned. This can come from multiple examples of that kind of relationship between features in a single dataset, but often the structure we care about exists only once inside a dataset, and so we need to learn over multiple datasets to find the structure that transfers across them. In general this information exists in the mind of the researcher (who finds datasets that serve as examples of the kind of concept or structure they’re looking to learn), and so automating this search will require improved dataset representation.

AutoML

Automating machine learning research could dramatically speed up innovation. Approaches to metalearning have been taken through 1, 2 and 3. But learning the model’s structure has much more potential for aligning the model’s structure with the data than tuning the hyperparameters of existing structures. (Deciding what algorithm to run on a dataset)

Categories of Structure That Exist, Examples in Data, Mapping to Algorithms

Fact List

  1. Why this matters
    1. Key to a new & powerful category of transfer
    2. Key to strong generalization
    3. Key to low bias models
    4. Excellent intuition for model creation
    5. Critical for metalearning new algorithms
  2. Patterns / Regularities exist at multiple levels of abstraction. Modern machine learning algorithms capture regularities within datasets, but not regularities across datasets.
    1. Transfer at the ‘shared’ feature level, as in CNNs, is powerful if the data is homogeneous. But that is a strong limit to the generality of the transfer.
  3. Strong generalization requires a model that captures the actual relationships between features, as opposed to approximating those relationships with a flexible model.
  4. Low bias models come out of alignment between the structure of the model and the structure in the data, rather than simply coming out of more flexibility.
  5. Model creation succeeds when an inductive bias that dramatically cuts down search space is shared by the data. It’s necessary to make true ‘assumptions’ in order to learn at all. But these assumptions don’t need to be made blindly - understanding structure is the way to consistently make correct assumptions.
  6. Common types of structure [Define all, give examples]
    1. Hierarchical / Compositional / Combinatorial Structure
    2. Relational / Graphical Structure
    3. Recursive Structure
    4. Temporal / Sequential Structure
    5. Clustering Structure
    6. Discreteness - quantized, distributions
    7. Continuity - distribution
    8. Smoothness
    9. Sparsity
    10. Locality
    11. Linearity / Polynomial / Exponential Structure
  7. Principles of Structure: (Operate on your operators) [Define all, give examples]
    1. Simplicity vs. complexity
    2. Bias - Variance Decomposition
    3. Abstraction - level of abstraction at which more or less structure, or different types of structure are present
    4. Framed as Compression
      1. Degree of Compression
    5. Directionality
    6. Discrete vs. Continuous
    7. Abstraction - fine vs. coarse grain structure
    8. Similarity, say, with a feature or set of features
    9. Randomness, degree to which there is structure, compressibility of data
    10. Homogeneity - degree to which the same operations can be run over objects in the structure
    11. Dimensionality - Interactions between features vs. single feature structure
  8. Across all types of structure:
    1. Description and examples of structure
    2. Interaction with principles of structure
    3. Properties belonging to that structure
    4. Transfer between differing forms of information that share that structure
    5. Different types of that structure
  9. Properties and Examples of each category of structure
    1. Hierarchical Structure
      1. Bottom Up
        1. Compositional
          1. Width-wise compositional
          2. Depth-wise compositional
        2. Combinatorial
        3. Counters curse of dimensionality with representative capacity
        4. Discrete
        5. Offers multiple levels of abstraction for interaction and prediction
        6. Types of information with Hierarchical Structure
          1. Abstraction
          2. Images
            1. Objects - Object Parts - Shapes - Lines / Curves
          3. Audio
            1. Words - Phonemes
          4. Businesses / Governments
          5. Sciences
            1. Physics
            2. Chemistry
            3. Biology
              1. Ontology of Species
              2. Organ Systems - Organs - Tissues - Cells - Nuclei + Organelles
              3. Brain
          6. Natural Language
            1. Fields - Concepts - Words (Combinatorial as well)
            2. Paragraph - Sentence - Phrase - Word - Character
          7. Time
            1. Centuries - Decades - Years - Months - Weeks - Days - Hours - Minutes - Seconds
          8. Measurement
            1. Kilometers - Meters - Centimeters - Millimeters
          9. Object Oriented Systems
            1. Classes - Objects
          10. Economy
            1. GDP = Consumer Spending + Investment + Government Spending + Exports - Imports
      2. Top Down
        1. Recursion
          1. Homogeneous
          2. Discrete
        2. Generative
          1. Deconstruction
            1. Offers multiple levels of abstraction for interaction and prediction
    2. Relational Structure
      1. Object Oriented Structure
        1. Object (Entity)
        2. X is a Y relationships (Classification, Inheritance)
        3. X has a Y relationships (Composition / Aggregation)
        4. Properties of an Object
      2. Causal Graph - X leads to Y
      3. Dependency - X depends on Y
      4. Subject - Object relationships (in sentences)
        1. Linking verbs - ‘is’, ‘has’, ‘are’, ‘being’, ‘sense’ etc. between Object and Subject
      5. Co-occurrence
        1. Ex. Words mentioned in concert with one another
      6. Link - are connected
        1. Linkage Distribution
      7. Locality
      8. Edge Density
      9. Examples of Relational Structure
        1. Categories
        2. Connections
      10. Similarities between relational and hierarchical structure
      11. Number of connected components
      12. Topology & Graphs (manifold structure)
    3. Temporal / Sequential Structure
      1. Periodicity
        1. Hierarchical Periodicity
        2. Seasonality
      2. Burstiness
        1. Messages
        2. Words in documents
      3. Stationary vs. Non-Stationary Distributions
      4. Permanence / Option Structure
      5. Quantized
        1. Ex. hitting lights when predicting arrival time
      6. Autoregression / Autocovariance
      7. Feedback
        1. Positive Feedback
        2. Negative Feedback
        3. Length of feedback loops
      8. Synchronicity vs. Asynchronicity
        1. Discrete vs. Continuous
      9. Exponential Decay vs. Windowing
        1. Continuity vs. Discreteness
      10. Stability & Equilibrium
      11. Derivatives - change over time
      12. Asymmetry between past and future
      13. Exclusive ability to directly impact present
      14. Strong predictor of causality / anti-causality
      15. Examples - All Data is Time Series Data
    4. Clustering Structure
      1. Distance & Similarity
      2. Interaction with hierarchical structure

Clean vs. Dirty Explanation of the Reversal

When transferring the notion of abstraction between ‘reductive’, ‘pure’ domains like math and computer science and ‘compositional’, ‘dirty’ domains like concepts or deep learning, it’s important to realize that there’s a reversal in which direction is referred to as ‘higher level’ abstraction.

If you were implementing a DL hierarchy in CS, the most abstract layer would be the first layer, with its edges and curves (say in computer vision). That’s the layer that’s most general across all objects. It just gets more specific as the recombination occurs, in the way that less abstract classes get more specific.

If you fail to realize that this reversal exists, expectations around what can and can’t be transferred can become confused. In DL, the lower levels of the network are more likely to transfer. In CS, the more abstract classes are more likely to transfer between lower level datapoints. (though transfer has a different meaning in this context, as seen below - it’s a very restrictive ‘has these variables / properties / functions’)

In Transfer and Generalization

Take OOP. Abstract classes generalize to new classes; classes generalize to new objects. The assumption is identity - there are no moving parts. When the instances (objects) vary, that variation isn’t captured at all by the abstraction. Take DL representations. Abstractions are constructed to generalize to datapoints that vary slightly from each other. Variation in the inputs is expected to be properly processed by the model.
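The contrast can be sketched in a few lines of Python. The `Shape`/`Square` classes and the toy nearest-prototype classifier below are my own illustrations, not anything from the manuscript: the OOP abstraction is an exact contract, while the learned-style abstraction is built to absorb inputs it has never seen.

```python
from abc import ABC, abstractmethod

# OOP side: the abstraction is an exact contract. An instance either
# satisfies the interface identically or fails outright - identity,
# no moving parts.
class Shape(ABC):
    @abstractmethod
    def area(self) -> float: ...

class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:
        return self.side ** 2

# DL-ish side: a toy nearest-prototype classifier stands in for a learned
# abstraction, constructed to handle inputs that vary from anything seen.
def classify(x: float, prototypes: dict) -> str:
    return min(prototypes, key=lambda name: abs(prototypes[name] - x))

prototypes = {"small": 1.0, "large": 10.0}
label = classify(3.7, prototypes)  # an unseen value still maps to a category
```

Instantiating `Shape()` directly raises a `TypeError` - the abstract class admits no partial match - whereas `classify` happily processes values between its prototypes.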

Heights of Abstraction as operating over more information

Strong distinction here! There’s an inversion: in CS, the structure that’s shared across all instances is the highest level of abstraction, but in a conceptual hierarchy, height means a recombination of lower, more general building blocks. This is a dangerous fact that will lead to much confusion in transfer. I believe that James Koppel pointed this out to me a few months ago.

  1. The ‘type’ of abstraction done tends to differ between discrete and continuous hierarchies
    1. Continuous hierarchies are compositional, and consistently use the same type of abstraction all the way up / down the hierarchy
    2. Discrete hierarchies can see abstractions made over different categories or types.
  2. Dirtiness
    1. Notions of similarity can be much stricter over discrete abstractions.
      1. Equivalence
      2. Edit Distance
      3. Number of properties in common
      4. (Having a property in common is similarity over that property)
    2. Continuous notions of similarity
      1. Cosine Distance
      2. Euclidean Distance
      3. KL Divergence / Cross Entropy
      4. Wasserstein Distance
      5. Hinge Loss
      6. Mahalanobis (Distributional Distance)
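A few of the discrete and continuous similarity notions above can be sketched side by side. This is a minimal sketch; the function names are mine, and the KL computation assumes discrete distributions with matching support:

```python
import math

# Discrete similarity: Levenshtein (edit) distance - a strict notion.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ca != cb),  # substitution (free if chars match)
            )
    return dp[-1]

# Continuous similarities over vectors / distributions.
def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (nu * nv)

def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def kl_divergence(p, q):
    # Assumes q is nonzero wherever p is nonzero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the difference in character: edit distance is exact and integer-valued, while the continuous measures vary smoothly and admit degrees of closeness.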

Groundedness is much stronger in executable domains like CS, where every abstract object gets compiled down to bytecode that runs. There’s reality everywhere. In conceptual space, high-level structure can seem to appear when it does not, or the reverse. There’s a dirtiness that makes powerful mistakes possible, and memes take advantage of that ungroundedness to evolve in ways that allow them to propagate more easily.

What does the bias-variance tradeoff mean in CS abstraction? Mathematical Abstraction? There’s no attempt at generalization, so it seems like there’s no generalization error to learn from. It’s a different process, computationally.

The thing with perfect abstractions is that often there’s no possible change in generalization. That’s how it works in math or CS - it’s as if all that exists is the training data, which is discrete and low dimensional. The value of compression in the abstraction is just taking up less space; it’s not generalization capacity.

Valuable Properties of Representations

Valuable properties of representations, born out of the frustration with the obsession over disentangling representations to the exclusion of other critical concepts. Many of these properties exist, to a greater or lesser extent, in human cognition.

  1. Decomposition of representation
    1. This gives you a controllable, interpretable, recombinable representation
  2. Alignment of representation where shared structure exists
    1. Want concepts with the same mechanisms / structure to update simultaneously when there’s new information that informs their working
    2. Can be through compositionality
    3. Trades off against decomposition?
  3. Modifiability of complexity of the representation depending on task
    1. Representation that becomes more granular upon zooming in
    2. Necessary for computational efficiency
      1. Memory Constraints
      2. Compute Time
      3. Attention Constraints
    3. Ideally would be on a continuum
      1. Give me the n principal components (non-linear) of the representation, while preserving clean conceptual (semantic) decomposition
  4. Transferability
    1. Ability for the representation to be repurposed for different tasks, generally through learning sufficiently high level structure that there is an appropriate level at which to do transfer between representations of problems and solutions
  5. Appropriate tradeoff of Simplicity / Compressedness vs. Representational capacity
  6. Sparsity
    1. Necessary for the discovery of compute intensive structure (say, graphical / relational / network, or concept recombination) in the representation
  7. Interpretability
    1. Optimizability of representation for interpretability
    2. Quality translation from representation to natural language
    3. Clean isolation of parts of the representation (or a sparse approximation of the used representation) for any prediction made or action taken
  8. Control
    1. Control through modification, freezing, or freeing of sub-parts of the representation
  9. Discrete and Continuous Modes
    1. Discreteness
      1. For interpretability, self-examination, sparsity
    2. Continuity
      1. For representational capacity, predictive accuracy
  10. Fully general translation into and out of the representation
    1. Want to be able to flexibly represent any category of object, situation, etc. in a merged representation
    2. Reserve category errors for a particular mode of action, ‘rigor mode’

Abstraction as Religion

“More beautiful than the vision from any mountain top… is the view of the world from the heights of abstraction.”

There’s a sense of awe that accompanies vastness. The extremes of size, from the vastness of the ocean or the vast blackness of outer space or the vastness of grand armies move us emotionally.

When you move sufficiently high up levels of abstraction, the vastness becomes beautiful and overwhelming. When you look at reality from multiple levels of abstraction simultaneously, the vastness of perspectives becomes beautiful and overwhelming.

Abstraction is the making of consilience, the unification of knowledge. Each abstraction is an instance of shared structure, binding together that which always was together but had not been recognized as such. There’s this corrective motion, from the separateness of things to the connectedness of things, which is accompanied by a sense of wholeness.

Abstracting over this process rightfully generates awe. The pattern is that you watch the connections between objects appear over and over again, and you generalize to the sense that even if you’ve failed to see or appreciate it, connections exist between every object. The feeling of ubiquity and presence of connectedness is that generalization. And so you invert what was a hidden assumption - you move from assuming by default that connections don’t exist and that boundaries are real, to assuming by default that there is some notion of similarity that will hold between any two objects.

It’s fairly ridiculous to claim that two objects have nothing in common (taken literally). Even the existence of boundaries is a type of connectedness. But say it were true, that two objects had nothing in common - that no pattern found in one was found in the other. Then, immediately, knowledge of one object would give you fascinating knowledge about the other: every pattern that occurred in one would be known not to occur in the other. And because many patterns repeat frequently across almost all objects, if the objects had any level of complexity there would be a fascinating challenge in maintaining the patternlessness between them. There is no disconnectedness (for even disconnection is a form of connection); there is only randomness.

There is extraordinary power in perceiving the connectedness of things. In many contexts (category theory, neurons) the connectivity structure dominates or is all that matters. You can know an object by what it is connected to - this principle shows up in word vectors, using this strong heuristic to create rich semantic representations of words that can be compared to one another. In a neural network, the learning of the connection strength between neurons is a general path to representing knowledge.

Abstraction to Transcendence

There are a few conceptions of transcendence which are accomplishable via abstraction.

  • Gestalts (wholes that are more than sum of parts, via generalization)
  • Lower state as trivially accomplishable by a higher state
  • Moving from the concrete to the conceptual
  • Being unknowable, unmeasurable, unseeable from the lower level
  • Set theoretic - one dominates the other, in a way that leads to full containment (but also much much more)
  • Ultimate generality / universality
  • Generation of universes

For each, 1. why it relates to transcendence and 2. How abstraction accomplishes it:

Gestalts

The creation of wholes that are more than the sum of their parts can be accomplished by abstraction. Take the creation of axes from properties. You can take two or three objects, and notice their shapes. For each aspect of shape (their heights, the eccentricity of their curves, etc.) there’s a particular measurement for each object. But taken together, the objects can be seen as datapoints in a space of possible objects that is defined by the axes that measure the properties that the objects have. This space is much more general than any of the particular objects, and immediately creates a generative model for new objects: propose a new point in the space, and something new (though admittedly, it’s the restricted newness of interpolation / extrapolation rather than the creation of new axes) is brought into existence. The perception of a gestalt - that is, a more general frame out of access to limited information - is powered by the discovery and generalization of shared structure found between sets of facts. It’s the generalization that can follow pattern recognition.
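The axes-from-properties move can be sketched concretely. The objects and measurements below are invented for illustration; the point is that the shared axes define a space, and any point in that space proposes a new object:

```python
# Invented measurements of three objects on two shared property axes:
# (height in cm, eccentricity of the dominant curve).
objects = {
    "vase":   (30.0, 0.8),
    "bottle": (25.0, 0.3),
    "bowl":   (10.0, 0.9),
}

# The shared axes define a space of possible objects; any point in that
# space - not just the measured ones - proposes a new object.
def interpolate(a, b, t):
    """A point t of the way from object a to object b in property space."""
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))

new_object = interpolate(objects["vase"], objects["bowl"], 0.5)
# A shape halfway between vase and bowl on both axes - the restricted
# newness of interpolation, as noted above, rather than new axes.
```

The gestalt is the space itself: none of the individual objects contains it, yet together they define it.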

Lower state as trivially accomplishable from a higher state

In mathematics, representations of operations in the abstract (say, addition or multiplication) are general, and much more valuable than trying to deal with each problem on an ad hoc basis. The transformation of the operations from being about particular objects into numbers in the abstract is what allows umpteen situations to be efficiently compared to one another, turning every particular instance of a problem into a trivial and solved operation. In noticing the shared structure across addition or multiplication problems, and learning to map situations to that abstract representation (and filtering out the details that are irrelevant, say of the kind of object being added or multiplied) we can do strong transfer between umpteen situations.

Moving from the concrete to the conceptual / spiritual

The conceptual is tied in with the transcendent because until the transcendent level is reached, it exists as a conceptual possibility, but not concrete reality. There’s also a body of properties that concepts have that tie them to religious notions. They exist outside of material reality in that the patterns that are observed (say, the laws of physics) are invariant to any action that we know how to take. Abstract representations of numbers are deeply valuable to us in an immediate way without having a clear material instantiation. And so the sense that 1. these things are real and 2. we are incapable of intervening on them puts them in the regime of the transcendent.

The conceptual and the spiritual are abstract patterns, rather than concrete objects that can be physically intervened on in the way that we are conventionally accustomed to. There’s a sense that the spiritual world (which exists in the minds of believers, as does other currently understood experience in this solipsistic world) is composed of these conceptual objects.

Being unknowable, unmeasurable, unseeable from the lower level

There are operations which can be performed trivially at the abstract level (say, because efficient algorithms have been discovered that operate over a dataset) that would be unseeable from the lower-level representation of the problem as seen at first glance. Re-representing the problem in the abstract makes umpteen solutions available (both from transfer and from the way that solutions interact with the new shape of your problem), making possible what would otherwise feel magical.

Ethics

Expanding the scope of tribal instincts has been a classic path for ethics. Familial and tribal bonds enable tight-knit trust and cooperation. Abstracting out to the scale of towns, cities, nations, humanity, and eventually all living beings as a whole has been a natural path for ethicists looking for generality in the principles that govern interactions between living creatures.

Tradeoff Between Concept-relevant Data and Precision

One way to see inductive / empirical / analogizing / abstractive thinking is as operating by doing transfer between similar datapoints. (As opposed to causal / deductive / rational mode, though this distinction becomes fluid.)

Abstraction alters: 1. the amount of data informing inference, and 2. the closeness of the connection between the objects in question.

Abstraction tries to bridge a similarity gap between datapoints, but simultaneously makes more datapoints accessible for inference.

For example, abstracting from (time, money, attention) up to (resource) gives you many more relevant datapoints, but the intersection that’s relevant shrinks. For example, time as a resource is finite and unchangeable, in a way that money and attention are not. And so in crossing the similarity gap there’s conflation that can damage your ability to think, if you assume that you can always acquire more of a resource and expect that model to cleanly generalize to time. But you do want to generalize concepts like opportunity cost to all of them.
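A sketch of the shared-vs-unshared structure in the (time, money, attention) → resource example. The `Resource` class and its fields are hypothetical, purely to make the point that some structure (opportunity cost) transfers across the abstraction while other structure (acquirability) does not:

```python
from dataclasses import dataclass

# Hypothetical 'Resource' abstraction. Opportunity cost generalizes to
# every resource; acquirability does not, so it stays an explicit
# per-instance field rather than part of the shared concept.
@dataclass
class Resource:
    name: str
    amount: float
    acquirable: bool  # can you always get more of it? False for time.

    def opportunity_cost(self, spent: float) -> float:
        # Spending on one use forecloses others - true for ALL resources.
        return spent / self.amount

money = Resource("money", 1000.0, acquirable=True)
hours = Resource("time", 24.0, acquirable=False)

# Transferring opportunity_cost across the abstraction is safe;
# assuming acquirability generalizes from money to time is the conflation.
cost = money.opportunity_cost(100.0)  # fraction of the budget foreclosed
```

Keeping `acquirable` on the instance rather than the concept is one way to "know which structure is shared and which is not," as the next paragraph puts it.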

What is the value of additional data? When building a model over small data, it’s easy to overfit to those datapoints. Generalization fails because the probability distribution the data is drawn from has a range far outside what exists in the training data. Small data makes the prior more important, and so the models we fit tend to need to be much simpler (often excessively simple) to maintain generalization capacity.

What damage can be done by attempting to bridge the gap? Conflation of unlike objects can lead to illegitimate inference. It’s critical to know which structure is shared and which is not - this allows clean navigation of the tradeoff, by only doing transfer from datapoints which are similar enough to a new instance to validate the transfer. Or by smoothing over differences between datapoints, giving smaller weight to those that are less similar.

Tradeoff between working memory and conflation

Crudely saving working memory and dropping the distinctions in shared structure is often more efficient, but you eat the tradeoff whole. Doing deconstruction over an abstraction and making the relevant distinctions prior to transfer trades off against computational time and working memory, and so ideally you create abstractions that are clean enough to be efficient. Constantly going back and making distinctions is expensive, and so is the damage done to thought through conflation.

And so the abstractions that minimize conflation while compressing the representation bridge this tradeoff, improving efficiency and generalization in thought.

Ways in Which Abstraction is Done

I’m going to enumerate the type of abstraction done in each of the examples I have, and attempt to categorize them in some coherent way (taking the idea of abstraction and trying to go to a lower level, dividing it into particular types).

The class dog contains properties of dogs. Instances of dog have individual properties and shared properties. This is an example of going from a set of objects into a concept that unifies them. But the other way to abstract is from individuals to species - they’re unified by the ability to interbreed. And so it’s a way to create a subset.

The appropriate intermediate abstraction here may be set theory - we divide a space into subsets based on some criteria or shared property.

Orange, Apple, Banana - unified properties

What does a function do, fundamentally? It takes some behavior, and creates a shared structure where the behavior is modified on the basis of arguments. It abstracts over individual instances of the thing that the function does. Say you want to create a box, and the thing that changes is the location.
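The box example can be written out directly; `make_box` is a hypothetical name for illustration. The shared structure (a rectangle's corners) is fixed once in the function body, and the thing that changes (location) becomes an argument:

```python
# Without the abstraction, each box is a separate block of drawing commands.
# With it, the shared structure is written once and the varying part
# (location) becomes an argument.
def make_box(x, y, width=10, height=10):
    """Return the four corners of a box; the position varies per call."""
    return [(x, y), (x + width, y), (x + width, y + height), (x, y + height)]

box_a = make_box(0, 0)     # one instance of the shared behavior
box_b = make_box(50, 20)   # same structure, different location
```

Every call is an instance of the abstraction; the function abstracts over the individual box-drawings exactly as described above.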

The removal of particular aspects of a problem, like generalization in mathematics.

Structure as constraints? Properties as a particular type of structure (akin to an existence constraint)

  1. Shared Properties
    1. Dog vs. instances of dog (interbreeding)
    2. Dog vs. instances of dog ([categories of property] - shared snout size property, breed property, color property)
    3. The class for Orange, vs. instance of Orange
    4. Apple, vs. instance of Apple
    5. Banana, vs. instance of Banana
    6. Animal, vs instances of animal (dog, cow, etc.) - shared properties (eats, sleeps, makes noise)
    7. Fruits vs. instances of fruit (shared ripeness duration, water content, some color)
    8. Graphic object vs. instances (lines, rectangles, circles) where they share position, orientation, etc.
  2. Shared Structure
    1. Functions, generally
  3. Compositionality

Difficulties in Thinking About Abstraction

  1. What you mean when you say a word is usually to activate some but not all of the word’s associations. And the distinctions required to disentangle those associations are innumerable.
  2. Multiple objectives to a set of thoughts, say
    1. to be communicable (that is communicate the right message),
    2. to be useful in a practical sense,
    3. to be truthful,
    4. to be efficiently represented (even though brevity requires abandoning complexity that may be necessary in some situations but not others)
    5. Etc. whose conflict involves tradeoffs where you may be willing to make a number of mistakes so that what you’re saying can be easily understood. But then you start thinking with that representation yourself.
  3. There are no words which fail to conflate unlike objects that describe what you’re trying to describe
  4. Collapsing a space into a single object
  5. Using the wrong axes to perform an evaluation (e.g., ‘how similar are these objects?’ is very easy to evaluate in a way that doesn’t respect the goal of the comparison)
    1. Worse, the sense that the evaluation is conditional on the purpose of the evaluation may be lost. The notion of similarity will be taken to be objective, true for all possible goals.
  6. Incomplete decomposition, missing important sub-categories in a way that makes the model feel clearly broken (even if something close to the principal components for many goals is captured)
    1. OMG, we need a prediction focused PCA which balances the goals of variance maximization and maintaining predictive capacity
      1. I guess that this is what PLS (partial least squares) is supposed to be
    2. Ex., Intelligence -> analytical, creative and practical intelligence (Sternberg)
    3. There’s this immense harm in thinking that knowledge representations need to be literally true. Sternberg’s model may be the most useful for many tasks, efficiently making tradeoffs that are necessary for any model. Bringing a strict standard of truth to it and evaluating it on that basis fails to respect the reality that there are multiple objectives to these models.
    4. We end up in a world where mathematics and data are all that survive the demand for purity, a world where making conceptual progress is impossible because all concepts can be destroyed by our standards.
      1. Yet every day, we live by these concepts, we think with these concepts. And we’re neutering the process by which we improve them
  7. In so many cases, things aren’t right or wrong but are more right or more wrong. Spheres are reasonable approximations to the space of weights SGD can get to in n steps, even though the reality is much more complicated.

Terms, How People Talk about Abstraction

  • High Level vs. Low Level
  • Grain (Coarse Grain vs. Fine Grain)
  • Broad vs. Specific
  • General vs. Specific
  • “Broad brush”
  • Concepts, Conceptualization (Boyd, Dad)
  • Comprehensive Whole vs. Particulars (Boyd)

High Level vs. Low Level

There’s the ubiquitous reference-dependent ‘high level’ and ‘low level’ type of reference, where the speaker has in mind some reference-class level (often contrasting the high using the low as a reference point, and using the high as a reference point to define the low).

This tends to lead to unnecessarily binarized thinking. Ideally the language would auto-capture that there may be a high level and a low level, but also a level higher than low level but lower than high level, and lower than low level, and higher than high level, etc. The use of almost all of these terms relegates us to two levels by default. Though maybe that makes thinking easier for the reader. And also, perhaps it’s not binary but points to a ‘gradient’. This is more true for coarseness vs. fineness than for high vs. low.

Coarseness vs. Fineness (literally true in the case of abstraction in image processing), Grain (Coarse Grain vs. Fine Grain)

I really enjoy how this version implies a continuum of abstraction - it’s clear that you can become incrementally more coarse, or incrementally more fine. And so it’s appropriate for those situations where the abstraction is continuous. It’s also flexible, and can be used for discrete situations that are more or less fine / coarse than one another. But it struggles in another context, where there’s strong discreteness. Take the version of abstraction involved in creating a function, or in creating a variable instead of using scalar values. The metaphor (coarse vs. fine) starts to break down. This is in part because coarseness vs. fineness assumes that the type of abstraction is exactly the same throughout! In the metaphor, you only get more or less resolution. You never switch to a different type of abstraction. And it’s hard to model a binary discrete situation with this, where there are exactly two levels. May be nicer if there are more levels, but we also have to keep the type of abstraction the same.

Broad vs. Specific, General vs. Specific, “Broad brush”

“If you ask a more specific question, I can give you an answer.” is what Scott Kominers would love to tell me. People love to use the term ‘broad brush’ to give themselves permission not to condition on subsets of populations, or to make general claims in a way that’s unrigorous. It’s both valuable and dangerous - dangerous in that they don’t expect to go into details on their claim, which makes these kinds of claims extremely hard to evaluate, verify, or argue against. Often arguing against consists of picking a counterexample, which the person admitted would exist when they said the statement was broad brush. It’s valuable in that these summaries across populations end up being critical for decisions that depend on the proclivities of a large number of people. And it becomes extremely difficult to reason about a space if the level of rigor required makes it hard for people to make claims that look true to them.

Concepts, Conceptualization (Boyd, Dad)

This points to the way that abstract objects often move from particular grounded objects to the immaterial concept of the object. This conceptual frame is often more general but more difficult to operate on - there are generally missing features, belonging to a particular instance of the concept, that would allow operations to be run over that object at a detailed level.

A conceptual understanding is an understanding of the way that ideas come out of base data, and often of the way that those ideas interact with one another. The implication is that you can operate over much more data by abstracting a substantial group of data into a concept and then having that concept interact with other concepts - improving at the conceptual level in a way that generalizes to every example of raw data that’s connected to the concept.

Comprehensive Whole vs. Particulars (Boyd)

There’s a wholeness of vision that is capable of considering the interactions of multiple high level objects. Those high level objects (required to see what feels like the whole) have to be constructed out of lower level components in ways that aren’t leaky or overly destructive to predictive ability.

Particulars (ex., detail orientedness) allows for the interaction with the explicit instantiation of your comprehensive whole, usually put together with concepts.

General-to-specific (Boyd)

Whether the specific is across time, across objects, or across other relationship types, there is the motion from a particular to a general representation of it (say, from a particular sponge to the concept of sponge). If you’re trying to re-generate the input we can use a VAE - or elsewise (for sequence data) we can do some machine learning over the video, over the content, etc.


Bias Variance

Bias-Variance and Abstraction

[Need: The way that Abstraction / Compression increases Bias, Lowers Variance] There’s a powerful generalization from the bias-variance tradeoff in modeling to the standards by which we create concepts, the quality of those concepts’ representation of the world, and generalization through their use in future thought.

When we decide to create a new concept (one quality example is a new word), we use it to tacitly collect data that corresponds to the category that it outlines. Take the concept of dog as a label that shares information across a class of objects that are similar along many axes. It effectively bundles information about that object, allowing anything that’s recognized as a dog to be presumed likely to share the properties of the other dogs that have been seen in the past.

(Need an example that can demonstrate fluidity more effectively) One important downside of this binary classification (where all objects are either dogs or not dogs) is that the conflation between different dogs

Another consequence is that without the creation of a higher level class (say, animal) it isolates the data that belongs to the dog class to that class. Because categories are attractors, it’s not feasible to have a category that’s closeby without it being subsumed / replaced.

It can be useful to share learned information even more br

Bias variance is one instantiation of Occam’s razor.

Point 1: Bias-Variance Tradeoff and the variance of a probability distribution

Variance in the bias-variance tradeoff refers to the concept that when you’re searching over models, some models have more flexibility. When they fit a dataset, models with more flexibility tend to overfit, because they find a separating hyperplane that is overly accommodating to particular datapoints. There are many ways to tend to overfit, and variance is an abstraction over all of them.

  1. There’s valuing fit over smoothness.
  2. There’s valuing a single datapoint in a region with sparse data over the impact from other datapoints farther away that you could interpolate or extrapolate from. (Looking too much at particular datapoints)
  3. Arbitrarily overweighting one representation of the features over valuable others, incomplete search over the set of feature representations
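These overfitting modes can be seen in a small experiment. This is a sketch assuming NumPy, with a high-degree polynomial standing in for a flexible, high-variance model and a line for a rigid, high-bias one; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying line y = 2x.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + rng.normal(0.0, 0.2, size=8)
x_val = np.linspace(0.05, 0.95, 50)
y_val = 2.0 * x_val  # noiseless truth, standing in for fresh data

def fit_and_errors(degree: int):
    """Fit a polynomial of the given degree; return (train MSE, val MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_err = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return train_err, val_err

low_train, low_val = fit_and_errors(1)    # rigid model: more bias, less variance
high_train, high_val = fit_and_errors(7)  # flexible model: can chase the noise
# The degree-7 fit drives training error toward zero by accommodating each
# noisy point; its validation error typically suffers for it.
```

Because the degree-7 family contains the degree-1 family, its training error can only be lower - that guaranteed drop in training error, paid for out of generalization, is the variance side of the tradeoff.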

Related: Decomposition over Regularization https://docs.google.com/document/d/1tCoaZEzERE3XP_4SzJWJhQ17bnY7vfUGYPGnCCEO54I/edit?usp=sharing

Levels of Abstraction, Abstracting Over an Incomplete Subset https://docs.google.com/document/d/18FvL9mlKTDlxQVXju1v8vOV63U73-6f9WVGh9-d8ScE/edit?usp=sharing

Treating Variance in the bias-variance tradeoff as a concept, there are many ways we could instantiate it.

  1. The standard way, watching your model overfit. This approximation of variance is the difference between the training error and the validation error. (Bias will affect both your training and validation error equally)
  2. Bootstrap sampling variants of the dataset
    1. Split between in and out of bag examples
    2. Train on the training sample, test on the testing sample
    3. Variance is the ordinary (distribution) variance of your predictions on a given datapoint (assuming regression). You can compute the average variance across datapoints for your model’s variance.
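The bootstrap procedure above can be sketched in code. This is a toy illustration, not a standard library routine: the data, model degrees, and seed are all invented, and a flexible polynomial stands in for a high-variance model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise (invented for illustration).
x = rng.uniform(0, 1, 60)
y = 2 * x + rng.normal(0, 0.3, 60)

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns a prediction function."""
    coeffs = np.polyfit(x, y, degree)
    return lambda q: np.polyval(coeffs, q)

def bootstrap_variance(x, y, degree, query, n_boot=200):
    """Variance of predictions at `query` across bootstrap refits."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))  # resample with replacement
        preds.append(fit_poly(x[idx], y[idx], degree)(query))
    return np.var(preds)

query = 0.5
simple = bootstrap_variance(x, y, degree=1, query=query)
flexible = bootstrap_variance(x, y, degree=9, query=query)
print(simple, flexible)  # the flexible model's predictions vary far more
```

On this toy data the degree-9 model's bootstrap variance dwarfs the line's, which is exactly the "ordinary (distribution) variance of your predictions" reading of point 3.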

The number of hypotheses that can be learned by a model is another approximation (say, the number of features in a dataset for a decision stump, and all of their interactions for a singly branched tree). Across different representations of a hypothesis space (parameters, freedom over those parameters, number of parameters, rules, freedom of rules), these are different approximations of the variance. A wide hypothesis space tends to cause high variance, but it is not variance itself.

Say that your model’s predictions of a datapoint are Cauchy distributed. Would you say that since its variance is undefined, it’s not subject to the bias-variance tradeoff?

Variance of a distribution

Variance is complex hypothesis classes leading to overfitting

Just because a concept is formalizable doesn’t mean that the concept is its formalization. There’s something like map-territory here. But it’s higher-map lower-map. We need a clean way to distinguish between concepts and their formalization. Would you say that ‘attention’ in deep learning is attention? Of course not. Attention is so much bigger than that.

Examples of Bias Variance, and how Abstraction increases bias while lowering variance: Myers-Briggs works extremely well. Its 16 categories are structure that introduces bias; the simplification bought by those assumptions reduces variance for a fixed amount of data. You can imagine having 1600 categories and the same amount of data, matching the underlying structure of personality much more closely. This would dramatically reduce the bias at the cost of blowing up the variance (because very few people would fall into each category, and their idiosyncrasies would overly influence our sense of what each personality category was like). These categories were created by generating 4 abstract properties of people and looking at the categories that come out of treating those 4 properties as binary, one-or-the-other switches. Extroverted or introverted. Feeling vs. Thinking. When we overcategorize (say, generating orders of magnitude more properties of personality), we thin out the data of people in each category (and so may need a system that can effectively do transfer from similar-but-not-identical categories).
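A toy simulation makes the tradeoff concrete (the numbers are invented, not real personality data): spreading a fixed pool of people over more categories makes each category's estimated profile far noisier.

```python
import numpy as np

rng = np.random.default_rng(1)

def category_mean_variance(n_people, n_categories, n_trials=500):
    """Variance of one category's estimated mean trait score when
    n_people are spread evenly over n_categories (toy model)."""
    per_category = max(n_people // n_categories, 1)
    estimates = [rng.normal(0, 1, per_category).mean() for _ in range(n_trials)]
    return np.var(estimates)

# Same data budget, two category schemes: 16 vs. 1600 categories.
coarse = category_mean_variance(n_people=1600, n_categories=16)    # ~100 people each
fine = category_mean_variance(n_people=1600, n_categories=1600)    # ~1 person each
print(coarse, fine)  # the fine-grained scheme's estimates are far noisier
```

With one person per category, each person's idiosyncrasies fully determine the category profile; with a hundred, they average out. That is the variance cost of removing the bias-inducing structure.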

If you fixate on system 2 type thinking (logical, deliberate thinking), you have high variance. You don’t have variance problems that you can point to directly, just in the way that any single predictor will be higher variance than if it’s part of an ensemble.

But you also have massive, pointable-to bias problems if you’re a man with a hammer. It means that the assumptions you make to apply your hammer are out of line with the underlying reality. Man-with-a-hammer syndrome points to high bias resulting from a misfit between the model and the world.

Basically the goal is to put all of the relevant (where relevant is defined as sufficiently low bias, where the assumptions fit the situation) models into an ensemble that can be low variance as well and so generate accurate predictions.

Abstraction and Systems Thinking

  • Proper level of abstraction example is to think about the economy - at the entrepreneur's level, you dislike volatility, as it kills you. But at the level of the system, volatility leads to growth. [Taleb]

The way that systems operate is general, in a way that abstracts away from the details of systems in particular. This makes it a beautiful example of thinking at multiple levels of analysis. One quality example is the entrepreneur (object level) vs. the system (say, the body of businesses and the capitalistic system driving them). What is good for the system and what is good for the individual are often in conflict. Similarly to the way that an individual animal is unlikely to benefit from the mutations that lead to variability in its genome relative to its brethren, entrepreneurs don’t benefit from the reality that they are likely to die in the same way that the system as a whole benefits from their collective risks. This is a beautiful explication of transfer. By abstracting from the individual level to the system level, we can compare patterns across systems as diverse as evolutionary processes, the growth of a capitalistic economy, and the personal journeys we all find ourselves on.

Beautiful Example of Shared Structure in Systems

There are a number of interrelated ideas that I’d like to connect. Those are Optionality, Trial and Error, Exploration-Exploitation, Experimentation, Natural Selection, Comfort Zones, Creative Destruction, and certainly a swath of other critical ideas. These are unified by the notion of volatility, and the effect that volatility has at the level of every system that’s impacted by these concepts.

The central idea is Optionality - in many cases, you get the upside from volatility. You sample over and over from a distribution (you can imagine meeting potential friends/romantic partners, or trying out classes or a major at college, or there being many mutations of a gene in a population) and take the best, or at least the better, option.

In all situations like this (and they are everywhere), volatility is more important than the average. In Exploration-Exploitation, optionality lets you exploit by taking the best solution thus far and using it as your model for behavior. When there’s high variance in what you’re exploring - say you’re sampling different dishes at a restaurant - if some dishes are amazing and some are awful, you get the upside - amazing dishes all the time - as soon as you start exploiting. If there isn’t much volatility - if every dish is basically the same - even if the dishes are quite good, you don’t end up as well off as when there’s variation that you can take advantage of. Personal experimentation is a form of exploration, where you try out a new behavior, a style of living, or a new habit. When there’s more variation in the change in life quality, you get much more out of experimentation.

Natural selection benefits from high levels of mutation in a population, because that allows for faster adaptation to an environment and faster improvements in fitness at the species level. In Trial and Error, there’s a binary outcome and you sample over and over until you get a success. Then you can use that success again and again. People who are said to be staying inside their comfort zones are suffering from the absence of optionality - by refusing to explore the space, they end up with a weak payoff. Creative destruction is critical for the growth of economic systems, and thrives off of the volatility inherent in the life and death of industries. By taking the upside to variance, capitalistic societies grow off of optionality.

This foundational principle underlies almost all value creation. It calls for us to optimize systems for volatility, not average capacity. Education, business, personal lifestyle - all of these have much to gain from volatility. And so the common heuristic that volatility is bad or dangerous or scary is only true at the lower fractal level. At the system level, one step up, the variance is essential to growth.
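The core optionality claim, that with a convex payoff you want volatility rather than a good average, can be checked in a few lines. This is a sketch with invented distributions: sample n options from a calm versus a volatile distribution with the same mean and keep the best.

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_best(mean, std, n_samples, n_trials=2000):
    """Average payoff when you sample n options and keep the best one."""
    draws = rng.normal(mean, std, size=(n_trials, n_samples))
    return draws.max(axis=1).mean()

# Same average quality, different volatility; ten options each time.
calm = expected_best(mean=5.0, std=0.1, n_samples=10)
volatile = expected_best(mean=5.0, std=2.0, n_samples=10)
print(calm, volatile)  # the volatile distribution wins best-of-10 by a wide margin
```

The calm restaurant's best dish is barely better than its average dish; the volatile one hands you its amazing outliers. Taking the max is the convexity that turns variance into value.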

Search as Abstraction

We can see search as a problem that cuts across many domains; once we re-represent our problem as search, it lends itself to immediate and powerful transfer from every other domain where the problem has been solved.

Search in artificial intelligence, social communities, and physical reality all share umpteen properties. The way that you map currently seen options, the way that you evaluate future opportunities for search (option value in behavioral economics, where there is an expected value to uncovering new information), the way that successful search leads to the ability to double and triple down on successful options - all of these domains share enough properties that by having a term we apply to all of them we can start doing implicit transfer, and if we apply ourselves we can achieve strong gains by doing explicit transfer.

The shared structure in the problems looks like a number of points (in person space, real space, model space) which have some cost of transport to other points and some revealed value of those points as the space is explored.
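That shared skeleton can be written once. Below is a minimal uniform-cost search over a hypothetical graph (the nodes and costs are invented); only the points and the cost function change when the space is people, places, or models.

```python
import heapq

# Hypothetical search space: points with transport costs between them.
edges = {
    "start": [("a", 1.0), ("b", 3.0)],
    "a": [("goal", 2.0)],
    "b": [("goal", 1.0)],
    "goal": [],
}

def cheapest_path_cost(source, target):
    """Uniform-cost search: the same skeleton regardless of what the
    points represent; only the edges/cost function are domain-specific."""
    frontier = [(0.0, source)]
    best = {source: 0.0}
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == target:
            return cost
        for neighbor, step in edges[node]:
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return None

print(cheapest_path_cost("start", "goal"))  # → 3.0, via "a"
```

Solving the abstract version once is what lets every concrete domain inherit the solution; that is the explicit transfer the re-representation buys.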

Knowledge Representation, Reasoning and Question Answering

Deep learning has been successful in natural language processing in large part due to embeddings for words and for symbols. Research on creating embeddings for phrases and for facts is important for the future of knowledge representation.

It would be great to be able to represent relationships between objects. Say, represent relationships between numbers or between mammals. The relationship of less than or greater than or belonging to the same set.

We can represent this information in a structured language format that has a verb (that defines the relation) as well as a subject (the first entity) and an object (the related entity). There is a second kind of relationship, an attribute, where a subject has an attribute.

The ‘relational database’ is generally used to store these types of relationships.

Change - Dude. They use “Neural Language Models” again to represent word vectors, and much more explicitly here - “learn a vector that provides a distributed representation of each word”. This so needlessly breaks with the way that every other resource talks about this concept.

Knowledge bases are large sets of these types of relations.
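A minimal sketch of such a knowledge base, with invented triples: relations stored as (subject, verb, object) and queried the way a relational database might query them, including following a relation transitively.

```python
# Invented example relations, as (subject, verb, object) triples.
triples = [
    ("dog", "is_a", "mammal"),
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
]

def related(subject, verb, kb):
    """All objects linked to `subject` by `verb`."""
    return [o for s, v, o in kb if s == subject and v == verb]

def is_a_transitive(subject, target, kb):
    """Follow 'is_a' links upward through the knowledge base."""
    frontier = [subject]
    while frontier:
        current = frontier.pop()
        if current == target:
            return True
        frontier.extend(related(current, "is_a", kb))
    return False

print(is_a_transitive("dog", "animal", triples))  # → True
```

Attributes fit the same shape, with the attribute name in the verb slot and the value as the object, e.g. `("dog", "has_legs", 4)`.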

It would be great if we could use relations in a knowledge base as training labels in tandem with a large corpus of text (say, wikipedia) and learn new relations automatically with high probability to fill out the knowledge base.

Knowledge bases have also been used for word-sense disambiguation.

Fundamental Questions in Representation Learning

Why Representation Learning?

The way that a problem is represented can dramatically change our perceptions of what the problem is and how to solve it. Problems that look nearly impossible, if re-represented, suddenly become tractable.

One obvious gap is from a conceptual representation of the problem to a mathematical one. Representing a social network or communications network as a graph allows us to do operations like search or optimization over the representation. Representing a person by their connections with a vector lets us measure the similarity between ostensibly very different people quickly and efficiently. Representing a word as a vector that depends on its surrounding words does the same.
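The connection-vector idea can be sketched directly (the people and contacts are invented): represent each person as a 0/1 vector over the same set of contacts and compare them with cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Angle-based similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical people as 0/1 vectors over the same five mutual contacts.
alice = [1, 1, 0, 1, 0]
bob = [1, 1, 0, 0, 0]
carol = [0, 0, 1, 0, 1]

print(cosine_similarity(alice, bob))    # high: mostly shared connections
print(cosine_similarity(alice, carol))  # zero: no shared connections
```

The same comparison works whether the axes are contacts, graph edges, or a word's surrounding context counts; the representation makes ostensibly different objects commensurable.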

There always is a representation. And so you can improve your ability to solve a problem on the representation side (by reformulating the problem) or on the solution side (by reformulating your solution given the representation). The reason for representation learning is that you can optimize the representation of the problem for the task at hand, with a particular parameterization of the space of possible representations of the problem.

Downstream from a quality representation is the ability to re-use parts of that representation to quickly adapt to new environments, experiences and tasks. And so learning modular representations is critical. Humans have modularity built into our grammar, and so can use language to effectively slot descriptions of actions and objects into our conceptual and communicative scheme. Attempting to transfer with a representation that is at the wrong level of analysis will fail, and so the best representations allow for flexible motion between multiple levels of analysis, which creates a surface for transfer.

Why Learn Discrete / Sparse Representations?

Once you have a low dimensional discrete representation, a wide body of important algorithms are made available.

  1. Concept recombination
    1. With sparsity it’s possible to run inefficient algorithms, for example looking at the set of combinations of a set of concepts. Those interactions can only be examined for a small subset of the feature space; doing it at the level of pixels in a continuous space would be computationally intractable.
  2. Causality and its establishment
    1. If you want to do counterfactual learning, you have to find an abstractive representation of time (both the time over the action and over the outcome) and you have to find an abstractive representation of the action space.
    2. Creating a causal graph requires a high level representation of parts of your environment and actions in that environment.
  3. Credit assignment (to higher level objects) made efficient
  4. Hierarchical control
    1. Decision making over conceptual blocks of actions
  5. Higher level planning
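The computational point in item 1 can be made concrete: enumerating pairings of a handful of discrete concepts is trivial, while the same enumeration over raw pixels already explodes at the pairwise level.

```python
from itertools import combinations

# A handful of discrete concepts; enumerating their pairings is cheap.
concepts = ["red", "round", "heavy", "hollow"]
pairs = list(combinations(concepts, 2))
print(len(pairs))  # 6 candidate concept combinations to evaluate

# The same enumeration over raw pixels is hopeless: even a tiny 28x28
# image has 784 "features", giving 306,936 pairs, and higher-order
# combinations grow combinatorially from there.
print(len(list(combinations(range(784), 2))))  # 306936
```

Sparse, low-dimensional discrete representations are what make brute-force operations like this affordable at all.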

Why Learn Continuous Representations?

In a continuous representation, one option you have is to have each axis represent a quality that can be had in greater or lesser proportion. And so you represent your object as a vector of many qualities (decomposition) which vary across a continuum (fluidity) and so have extraordinary flexibility.

An example of that flexibility: if you have two objects, alike in some ways but different in others, whose similarities you want to take advantage of in some contexts (say, the weight of two balls being similar, but their colors being different), you can represent that with a two-dimensional continuous vector for each ball. In a discrete representation, both color and weight are represented very differently (with approximations or collapses) and can’t be merged in useful ways.

In the continuous representation you can also re-combine the representation in umpteen ways. You can use masking to average some vectors and not others (when merging two concepts) to create novel vectors, which a generative model can turn into a low-level representation and which can be compared (in terms of performance on some downstream objective) to the concepts that are already in your representation.
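A sketch of the two-ball example and the masked merge described above; the axes and numbers are invented for illustration.

```python
import numpy as np

# Hypothetical two-axis concept vectors: [weight_kg, hue].
ball_a = np.array([0.45, 0.10])   # red-ish ball
ball_b = np.array([0.47, 0.80])   # blue-ish ball of similar weight

# Shared structure on one axis, conflation avoided on the other.
weight_gap = abs(ball_a[0] - ball_b[0])   # small: transferable
hue_gap = abs(ball_a[1] - ball_b[1])      # large: kept distinct

# Masked merge: average the weight axis, keep ball_a's hue.
avg_mask = np.array([True, False])
merged = np.where(avg_mask, (ball_a + ball_b) / 2, ball_a)
print(merged)  # averaged weight, ball_a's hue
```

The mask decides, axis by axis, where two concepts share structure and where they stay separate; a discrete scheme would have to collapse or duplicate the balls instead.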

When you’re abstracting, you want to notice where there’s conflation and where there’s shared structure.

I’d like to note that this is an example of a dense continuous representation. There are sparse discrete and sparse continuous representations as well.

The data is also often high dimensional, with each high level concept having innumerable sub-concepts or attributes or properties that can be seen as axes in the concept vector.

Truth at Levels of Analysis, Scientific Disciplines at Levels of Analysis, Planning and Abstraction

  1. Truth is only right or wrong after you choose a level of abstraction at which to determine truth.
    1. The ways in which you verify alignment of an idea with reality change depending on the level.

Front and center evidence of this claim comes from the ever-raging debate over free will. In many cases, there’s a levels-of-analysis question - do we ask about free will at the level of beliefs, states of mind, and high-level actions? Or do we ask about free will at the level of electrical signals and chemical reactions?

Often a hearty defence of free will will look something like ‘my brain state causes my future action’, with the implication that if the brain state had been different, the action would have been different. And what follows is that ‘I have free will because had I willed something else (had a different high-level brain state), I would have taken a different action’.

There’s a different sense of free will at a lower level of analysis, where bodies of chemical interactions that drive hormone releases and trigger limbic responses seem clearly determined by natural laws.

Someone who takes free will to mean ‘my current state of mind determines my actions’ and someone who takes free will to mean ‘I have control over my high and low level mental processing’ can make identical predictions about what actions a person will take and what experience they will have. They can have close to identical models of the system. But because they answer the question of how free will is defined at different levels, failing to notice this will lead to ongoing debate.

  1. Scientific disciplines as distinguished by the grade at which we interact with reality, from lower level to higher level, and the relative ease of formalizing lower levels of abstraction

One clear factor distinguishing the sciences is the level at which they explore reality. Take physics as low level, dealing with the basics of matter and natural laws. One level higher is chemistry, exploring the composition of particular kinds of organic and inorganic matter with one another. Organic chemical interactions are composed in biology, where living systems contain networks of cells and neurons that leverage chemical reactions at scale to accomplish high level valuable behavior. Bodies of neurons firing in tandem across time create psychological phenomena, which can be studied at a behavioral level. Bodies of people interacting create even higher notions of social cohesion and tribal behavior in anthropology and sociology.

By seeking to understand systems at a new level of analysis, sciences make discoveries and run experiments that share lots of properties with one another that are not shared with lower or higher levels of analysis. The kinds of solutions that work well and the kinds of questions that it makes sense to ask shift with the scale at which the system is being analyzed.

The process of abstraction is what facilitates the transition between these levels of analysis. The molecule is a general composition of many atoms or other molecules, which have their instantiations. Equations that describe the bonding of these molecules in the abstract are highly predictive, and the continued composition of organic molecules turns into tissues and organs in biological systems. The choice to give labels to particular groupings of these molecules allows us to describe organs and organ systems succinctly and predictably.

One major power of abstraction is to go from a particular instance of an element to that element in the abstract. The periodic table is a wonderful example of an abstraction that cuts reality at the joints. Choice of shared structure: the number of protons in the nucleus. Amazing descriptive and predictive value (see ‘Ontology is Overrated’).

That particular choice of shared structure creates an ordering over the elements, and there are umpteen properties that change in unison as we move across that ordering. Patterns in interaction fall out of that choice of representation.

We can imagine choosing other kinds of shared structure to represent elements, and losing out on valuable patterns that fall out of the current representation. But we also always want to ask what tradeoffs are made by embracing our current representation, and what patterns tend not to fall out of it.

  1. The planning process and the way that we treat near concrete events at a low level and far events at a high level

Easy everyday examples of abstraction come out of our planning processes. For example, take an action like ‘go to Antarctica’. That action isn’t executable - it needs to be decomposed into sub-parts (book a flight, go from where you are to an airport, navigate the airport, fly to southern Argentina, travel to the boat, call a tour operator, acquire a permit, etc.). The sub-parts are decomposed in turn until a concrete action that can be executed falls out of the plan, bottoming out in unconscious behavior (the tapping of keyboard keys, the dialing of a cell phone). All of the representations we use to think about the problem are high level, which is necessary for us to hold the entire picture of how they get us from where we are to our destination. There’s a working memory limitation that we need to overcome in tying parts of our plan together. The holistic representation of our plan requires that we have nice high level summaries of wide bodies of actions or behaviors that cut the parts of our plan into pieces that can be understood.

Planning at a level that’s too low happens in the reinforcement learning literature, and is a source of the insane hunger for data in the systems attempting to learn from that data. The challenge is with an agent, for example, that needs to accomplish some high-level goal (say, moving from its current position to a door in a level of a game) but only has access to low-level actions like moving left, right, up or down. In instructing a human, any reasonable teacher would issue an instruction like ‘go to the door’, and the person would compose the set of basic motions required to fulfill the higher-level motion. Quickly that would become intuitive, and the player would think about problems at this abstract level. Teaching reinforcement learning systems to do this is a critical research frontier, often referred to as hierarchical reinforcement learning.

Abstraction, Invariance, and the Efficiency of Learning, Formal Languages, Mutual Exclusivity

Abstraction, Invariance, and the Efficiency of Learning

Invariance is a very strong form of shared structure. But to the degree that you can see similarity between experiences, you can auto-generate the world of other experiences that you haven’t had but are capable of generalizing to.

These strong examples of transfer show up across the human visual system. Scale invariance, or the ability to recognize a face, for example, at many different distances, is an example of transfer through invariance. Properties like these allow a person to see a face once, and then be able to recognize it at many angles of rotation and at many distances, in many parts of their field of vision.

Invariance as a pattern - if you can recognize the pattern under which the transformation is invariant, your learning speed increases dramatically. If you never see the pattern, you can require many orders of magnitude more experience in order to learn (because you fail to auto-generate the world of other examples that come out of perceiving the invariance). This is a very strong and wonderful example of transfer.

In machine learning, the use of a simple invariance like translation invariance in convolutional neural networks dramatically improved the performance of computer vision systems, and alongside large datasets in many cases reaches superhuman levels.
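Translation invariance can be shown in a few lines. This is a hand-built stand-in for a real CNN, assuming a 1-D signal: slide a template over the signal (convolution) and keep the best match (a global max pool). The detector's output is identical wherever the pattern sits.

```python
import numpy as np

def detect(signal, template):
    """Slide a template over a 1-D signal and take the best match:
    convolution followed by a global max pool."""
    scores = [
        float(np.dot(signal[i:i + len(template)], template))
        for i in range(len(signal) - len(template) + 1)
    ]
    return max(scores)

template = np.array([1.0, 2.0, 1.0])
pattern = [1.0, 2.0, 1.0]

# The same pattern at two different positions in an otherwise empty signal.
left = np.array(pattern + [0.0] * 7)
right = np.array([0.0] * 7 + pattern)

print(detect(left, template), detect(right, template))  # identical scores
```

Because the invariance is built into the architecture, one sighting of the pattern "auto-generates" correct behavior at every position; nothing about position has to be learned from data.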

Formal Languages, Clean vs. Dirty Concepts

I should really frame this as a comparison between formal languages and natural languages.

Ugh, Wittgenstein’s rant on how all concepts are dirty concepts in Philosophical Investigations, sections 65-78… Though some great examples of dirty concepts: ‘game’, ‘number’ (perhaps). (Perhaps add the examples of love games, game theory, zero-sum games.) The notion that the fuzziness of the boundary of a concept shows up in Frege. (An area with vague boundaries cannot be called an area.) It’s weird, he’s a mathematician - certainly he sees the difference between discrete and continuous representations of concepts? That these don’t have to be geometric objects, or have to be solid?

Can see synesthesia as the free association of concepts with one another - the color of numbers, the shape of colors - where there are regularities in the contexts in which numbers show up. Many of these connections are transitive, where a number leads to a situation that leads to a color, say. E.g., beige feels blobular because the feeling you get when you see a beige wall is the same feeling you get when you see a comfy and nondescript amorphous shape. And so emotional triggers can be used as an intermediary to do transfer between seemingly disconnected concepts.

Humans’ ability to do transfer between arbitrary objects has the same feel to it, where the ability to experience the warmth of a story can also move transitively from lived experience to words to that feeling of warmth. All of it is associative intuition, where levels of freedom of association are unlocked when you allow for transitive association.

Representation Learning & Language Creation

Seeing language as a general representation of information, a medium for communication of shared concepts, etc… It makes sense to decompose language and then run through all of the ways that representation learning is language creation and the corresponding benefits and downsides of that mode of language creation.

One frame is that networks create a new language each time they’re trained. Another frame is that each language is an instance of a class of languages that all share certain unifying properties, and that it’s very much worth discovering and creating new classes of language.

OMG Eleanor is so beautiful - ‘Cognitive Economy’. I can do full transfer, now. Thank you.

  1. Concepts and Incentives
  2. Concepts and Opportunity Cost
  3. Concepts and Marginal Value
  4. Concepts and Diminishing Marginal Returns
  5. Concepts and Supply / Demand
  6. Concepts and Option Value
  7. Concepts and Comparative Advantage / Specialization
  8. Concepts and Equilibrium
  9. Concepts and Deadweight Loss
  10. Concepts and Elasticity
  11. Concepts and Tragedy of the Commons
  12. Concepts and Externalities
  13. Concepts and Moral Hazard / the Principal Agent Problem
  14. Concepts and Arbitrage

The natural thought from cognitive economy is that conceptual schemes should efficiently structure experience so that 1. The category of an experience lets us identify it as similar to other experiences in that category and 2. The categorization of an experience lets us identify it as not being in a body of other categories.

It’s worth spelling out exactly what the resources are (prime candidates are memory and attention) and what the consequences of their scarcity are. It’s unclear how this transfers to ML.

I never spelled this out - when you have a mutually exclusive conceptual scheme, categorizing marks the object as not being in many other categories, which may contain even more information than marking it as being in a particular category. That moment when you hear a sound in the night, and upon realizing that it’s merely machinery get little relief from the fact that it’s machinery but oodles of relief from the fact that it’s not a thief, a dangerous animal or a murderer. A thank you to Eleanor Rosch, Principles of Categorization (in Concepts, Chapter 8).

Take the question of how humans decide which set of features they deem relevant to creating a new concept. There’s some exploration over the space of possible concepts, and the ones that survive tend to be those that are valuable. But there’s something about the tightness of the association of the features in question, combined with the value of attributing something learned about one datapoint in the class to every datapoint in the class, that makes for much more efficient credit assignment.

The tragedy here is that category assignment is discrete rather than continuous, where ideally you’d have a smooth notion of similarity and do transfer insofar as the similarity holds. Human intuition does this relatively well. We also preserve ambiguity in language which lets us approximate continuity with our concepts. The dirtiness is what allows for a more accurate form of transfer. (Need examples of the value of ambiguity)

There’s a challenge in comparing network representations to human language, because one is continuous while the other is discrete. The dimensionality of the true representation is likely low in practice, where there’s a low dimensional meaning manifold in a high dimensional space. - that’s why humans can share state so easily in the space of felt meanings. If the dimensionality was actually extremely high, communication would be close to impossible.

Language as Choice of Basis, Pattern Recognition over Compression, Much More on Decomposition

Language as Choice of Basis

A basis is a chosen low level upon which all spoken meaning is founded. There is composition there, and so the properties of composition hold.

Abstraction, Compression and Pattern Recognition

There’s a sense that compression leads to improved generalization. The mechanism by which generalization is improved is that the compressive process forces the discovery of patterns that are present in the current example and also present in future examples. When the compression proceeds by patterns that fail to generalize or that are irrelevant to prediction, the generalizing motion fails.

And so compression is less fundamental than pattern recognition, as pattern recognition is the lurking variable that drives compressed representations to generalize. That distinction compels us to take all of the value that we think we get from compressed schemes and reconceive of them as being value gotten from improved pattern recognition.

Because of this, it’s common to believe that the compressed representation is better because it is smaller. Really, the better representation is the one that has captured more (and better) patterns that generalize to future examples.

  1. Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
    1. A way (or set of ways) to measure the objective
      1. Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
    2. The downstream consequences of doing better or worse on the objective
    3. Compare two different networks over the objective
    4. The rationale (and intuition pumps) for the objective
      1. The counterarguments

Often when you’re choosing a basis you want to measure its quality. The obvious choice is its use: how effectively can you generate the meanings you want to represent with the sub-pieces of your representation (whether they be learned neural network representations or concepts)? There’s a sense of distance between true meaning and representable meaning, which in language is measured by a felt sense that tells you whether or not you’ve really captured the meaning that you intended to convey.

Another measure to optimize your decomposition for is succinctness. Coding theory says that you want the shortest encoding for the most common signals, and so we naturally make words like ‘I’, ‘a’, ‘of’ and ‘the’ extremely short and leave words like ‘succinctness’ to be extremely long (given that they’re likely to be used infrequently). This makes the average word length used a heuristic for the overall probability of a sentence. It also points to a desire to allow for different levels of complexity in learned representations, where the length of elements of a basis is learned along with its content.
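The coding-theory claim can be checked with a small Huffman-coding sketch; the word frequencies below are invented, and the code lengths come out shortest for the most common words.

```python
import heapq

def huffman_code_lengths(frequencies):
    """Code length per symbol from a Huffman tree: frequent symbols
    get short codes, rare symbols long ones."""
    # Heap entries: (total frequency, tiebreaker, {symbol: depth}).
    heap = [(freq, i, {symbol: 0})
            for i, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, lengths1 = heapq.heappop(heap)
        f2, _, lengths2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**lengths1, **lengths2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Invented word frequencies: common words should get the shortest codes.
freqs = {"the": 50, "of": 30, "a": 25, "succinctness": 1}
lengths = huffman_code_lengths(freqs)
print(lengths)  # "the" gets the shortest code, "succinctness" the longest
```

This is the mechanism behind ‘I’, ‘a’, and ‘the’ being short: average message length is minimized when code length tracks inverse frequency.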

The downstream consequences of a lack of succinctness include not being able to express messages that would take too long or take up too much space. There are many meanings that would be left on the table. There’s a forced compression of meaning (think the way that message can be compressed by using a text message, using twitter with its character limitation, or writing by hand rather than on a computer) which leads to a different character of message entirely.

The downstream consequence of a basis that has the wrong parts is that entire swaths of possible meanings become unsayable. (Write this up myself first, and then read Wittgenstein’s PI)

How are we to compare the relative decomposability of representations? One answer is to ask how many predictions can effectively be made from one and not the other, and how accurate those predictions are in particular contexts.

There are useful properties the decomposition can be made to have - recursive decomposability is one obvious property, where sentences can be broken into words, but words can also be broken into a sub-word level (e.g. ‘un-’, ‘-ing’, ‘-s’ for the plural) and so can be efficiently re-represented or constructed on the fly. Properties like this improve the value of every single word in the language, and so to the degree that they are effectively encoded, the language can be simultaneously more expressive and more efficient.

Decomposable network representations paired with functions that take a part of the representation and operate on it in a simple way to make it more flexible are extremely valuable. You can see the encoding of an invariance (say, translation invariance) into the representation as making the representation work for every example to which it’s invariant, accomplishing a similar effect to this function. An example of adding operators that add general value across the representation is the creation of attention mechanisms, which choose which parts of their input to pass through and which not to attend to. These are learned, so that the parts of a representation that are valuable to a prediction get the focus of the network. But this is a single, lightweight example of a function that takes a representation and returns an improved version of it (for the task at hand), and in general this kind of operator is rare at the representation learning research frontier.
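
A minimal sketch of such an operator - scaled dot-product attention in its simplest form, over invented toy vectors:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention: weight each value by how well its
    key matches the query, so the relevant parts get the focus."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the values: the parts judged relevant dominate.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Toy data: the query matches the first key, so the first value dominates.
out, weights = attention([1.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[10.0, 0.0], [0.0, 10.0]])
```

The weights are learned quantities in a real network; here they simply illustrate the pass-through-what-matters behavior described above.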

The counterarguments to decomposability start with discreteness and continuity. Decomposition is straightforward in a linear system (where the decomposed features can be trivially recombined to create the whole) but is a deep challenge in a continuous setting (where the features blur into one another). This notion of ‘goo’ has the advantage of mixing more of the connected parts of the representation and allowing a set of complex connections to learn to disentangle the relevant features, leveraging that connectedness to improve accuracy.

Decomposability also creates an issue of separating concerns. What should the boundaries be, between concepts, or between features? How do we extend a concept to stretch beyond its bounds? In creating discrete representations, the boundaries that separate the parts of the representation must exist. Their existence cuts off smoothness in meaning space, as well as limiting the regime of particular concepts to a confined region (where everything is or is not an element of the category that the concept represents).

This limitation suggests that our decomposition should be over continuous, rather than discrete objects. But it’s not clear how to decompose in that context - where does one sub-part of the representation end, and another begin? Split points can be made flexible, or distributions can decay, or features can be discrete, but all of these still posit some number of splits / distributions / features, ending in discreteness in some form.

Sparsity

Economy of attention. Importance is a zero-sum game: if you add weight to one feature you tacitly take it away from other features.

Information as a reduction in uncertainty. One objective of a representation is that it maximally reduces uncertainty (about relevant predictions).

Sparsity in representations

  15. Sparsity and inefficient algorithms (ex., recombination)
  16. Compression (for speed of evaluation, related to 1)
  17. Compression (for improved pattern recognition)
  18. Compression (for improved use of scarce neural resources)

  1. High memory capacity (so cool that I rederived this myself)
  2. High representational capacity
  3. Connected to…
  4. Numenta’s Sparse Distributed Representations
  5. The State of Sparsity in DL
  6. Spiking in neurons (sparse activations are neuroinspired)
  7. Sparse Coding
    1. Sparse Coding in the Primate Cortex
  8. Sparsity as regularization improving generalization
  9. Sparsity as a notion of simplicity (and so invoking Occam)
  10. Way / ways to measure sparsity
  11. Sparsity at a high level (say, concepts) vs. at a low level (say, perception)
  12. Sparseness as discreteness out of continuity
  13. Other types of sparsity (sparsity as a concept with many instantiations)
  14. Downsides / weaknesses to sparsity
  15. High fault tolerance (as most pathways are independent) (so cool that I rederived this myself)
  16. Allows for simultaneity (so cool that I rederived this myself)

OMG - it may be that one huge model is feasible if you use sparse networks but not if you use dense networks

Increasing the width of a sparse representation while keeping the number of active units constant increases the representational power of a network without adding computational burden. It’s an extremely cheap increase in potential connections.

Sparsity leads to the ability to run inefficient algorithms. In computer science, an algorithm’s computational complexity represents the amount of time it will take to run given the size of its inputs. There are many algorithms that scale poorly with their number of inputs (say, algorithms that look at all of the combinations of the input elements), but yet capture extremely valuable relationships and structure. Sparsity in the input allows for the running of these algorithms.
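
A sketch of the point: a pairwise-interaction pass that would be quadratic in the full input becomes cheap when run only over the active entries (the vector below is an invented example):

```python
from itertools import combinations

def pairwise_interactions(sparse_vec):
    """Examine all pairs of *active* features. On a dense vector of
    length n this is O(n^2); with only k active features it is O(k^2)."""
    active = [i for i, v in enumerate(sparse_vec) if v != 0]
    return [(i, j, sparse_vec[i] * sparse_vec[j])
            for i, j in combinations(active, 2)]

# A length-1000 vector with only 3 active features: 3 pairs examined,
# rather than roughly 500,000 for the dense version.
vec = [0.0] * 1000
vec[5], vec[42], vec[900] = 1.0, 2.0, 3.0
pairs = pairwise_interactions(vec)
```

The combinatorial structure is still fully captured; sparsity just shrinks the input the expensive algorithm sees.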

Speed of evaluation is often a function of the number of values that need to be considered. In a dense representation, every operation can trigger a huge number of downstream operations across the entire network. With a sparse representation, particular pathways can be triggered selectively. The total number of activations can shrink to below 1% the number of dense activations, saving on all resources involved in computing. Faster evaluations allow for more evaluations in the same time period, increasing the quality of the transformations.

Compression forces more effective pattern recognition, and we can see sparsity as a form of compression that improves pattern recognition via simplicity. A simple example is the way that a small set of primary features predicting an output is more likely to generalize to a new situation than looking at many many features which may by chance share some statistical connection to a prediction. That feature selection and refinement forces a model that looks only at the primary drivers of an outcome, which are drivers you expect are more likely to remain true at generalization.
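
One standard mechanism with this flavor - not named above, but the classic way an L1 penalty performs the feature selection just described - is soft-thresholding, sketched here:

```python
def soft_threshold(weights, lam):
    """Proximal operator of the L1 penalty: shrink every weight toward
    zero by lam, and zero out anything whose magnitude is below lam.
    This is the mechanism by which L1 regularization produces sparsity."""
    out = []
    for w in weights:
        if w > lam:
            out.append(w - lam)
        elif w < -lam:
            out.append(w + lam)
        else:
            out.append(0.0)
    return out

# Small, incidental weights vanish; the primary drivers survive (shrunk).
weights = [2.0, -0.05, 0.3, -1.5, 0.01]
sparse = soft_threshold(weights, lam=0.1)
```

The surviving nonzero weights are exactly the "primary drivers" the paragraph above expects to remain true at generalization time.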

Because there’s a limit to the capacity of any representational system (number of neurons in a brain, size of a network on a machine) there’s value in using resources efficiently. Sparsity allows for the extremely efficient use of space, and so enables more relationships to be encoded simultaneously. This likely leads to an expanded capacity for long term memory.

How do we measure sparsity? It depends on the type of sparsity and the level of analysis. One easy answer is the fraction of activations in a network representation, where the fraction of neurons that fail to activate or are zero corresponds to the sparsity level.
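
That fraction-of-zero-activations measure is simple enough to state directly (the activation vector is an invented example):

```python
def sparsity(activations, tol=1e-8):
    """Fraction of units that are (numerically) zero."""
    inactive = sum(1 for a in activations if abs(a) <= tol)
    return inactive / len(activations)

# Six of eight units inactive -> sparsity of 0.75.
acts = [0.0, 0.7, 0.0, 0.0, 1.2, 0.0, 0.0, 0.0]
```

Other types of sparsity (sparsity in weights, in concepts used, in pathways triggered) would need their own analogues of this counter.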

Sparsity is one path from continuity to discreteness. A representation with an arbitrary number of active features is continuous in the sense that the set of active features is large and unconstrained. There’s an approximation of continuity that comes with having an extremely large and arbitrary number of discrete values, and a sparsity prior pushes that number down to a small, finite value that shifts dramatically upon the addition or expulsion of a single feature - which is yet another heuristic for the representation’s discreteness.

Aside: types of continuity

  1. Continuity in the amount of a feature that is present (in the coefficient)
  2. Continuity in the value of the feature itself (ex., male / female is on or off, as a discrete feature)
  3. Continuity in the number of features that are being used for predictions (where as the number grows it gets closer to feeling continuous, in the same way that floating point digits are technically discrete but approximate a continuum)

Interpretability of high-level sparse representations (where you can say exactly which high-level concepts informed your prediction or decision) is an extremely helpful property for a model to have when a human intends to intervene on its predictions or even to understand its model. There’s an uninterpretability to the mass conflation that comes from looking at the combinations of every feature with every other feature on a continuum; isolating the relationships that matter is critical to finding a causal intervention strategy or to communicating a convincing explanation of the representation’s behavior.

There’s a level of fault-tolerance that comes with using a sparse representation, where if parts of the representation are damaged or disabled, since it’s only used for a tiny fraction of the computations made the damage to the system as a whole is limited. When operations are dense, with every part of the representation being used for every prediction, it’s much easier for damage or faulty computations to disable the system in its entirety.

Downsides / weaknesses to sparsity - the obvious place to begin is that sparsity can damage accuracy, especially in excessive amounts. The representation isn’t as capable of modeling complex interactions, or needs to reduce its degree of sparsity in order to get there. Which introduces another issue - choosing the optimal degree of sparsity in the representation. This is a complex choice to make, and while in the context of machine learning it can be tuned as a hyperparameter, it’s not clear how to resolve it in the absence of a clear objective or in the context of non-computational representations.

Classification, High Abstraction, Against Hierarchy

Abstraction is so central to machine learning that the major task - classification - is a process of mapping a body of features to a single class label. This presumes a conceptual scheme, where the class label (whether it be cat vs. dog or positive / negative sentiment) is provided by a human who attempts to teach the machine how to map input data to a human concept.

This task shares the pitfalls and upsides of abstraction. It’s a mapping to a compressed, much lower dimensional representation than the raw features that led to the prediction. In comparison with something like density modeling, which forces a model to generate the full distribution of the feature space, it is information sparse.

The information density of a task is critical for learning effectively. One challenge in conceptual learning is that in the absence of grounding in experience, there’s merely the mimicking of the interface that humans use to communicate about reality rather than the direct experience of reality itself.

In one sense, it’s fascinating and heroic that language is capable of teaching in the absence of felt experience. This breakthrough is dramatic in comparison to having to learn everything experientially, unable to map your experiences to abstractions and then generalize from those abstractions to new experiences. It’s what allows us to build up collective knowledge across time and in so doing share experiences with untold numbers of beings alive now and in the past. But it does still by default lack the richness of experience, and the density of information that’s inherent within that.

It’s not until the generality of abstraction allows for the unification of myriad experiences under a single label that the absence of density in the feedback stream is made up for.

  • Talk about how density modeling requires abstraction

There’s a notion that writing has allowed for the build up of knowledge across time. This is true and powerful. But there’s a more subtle and more general form of the building of knowledge, in the form of language.

Language holds untold amounts of implicit information. The learning of language is the learning of a frame for viewing everything that that language touches, due to the transfer across generations of the users of that language.

  • Talk about the way that language and the conceptual scheme holds generations of built up implicit knowledge

Attempts at High Abstraction

Philosophical and religious traditions are often founded on extremely abstract generalizations over many many experiences.

One clear example of this is duality in its many forms - Hegel’s Thesis and Antithesis (resolving in synthesis), Chinese philosophy’s Yin and Yang, Duality in Mathematics, Binary in Logic and Computer Science, Inversion, and on.

Duality often points at opposing forces. Action and reaction. Order and chaos. It tries to generalize across all that opposes, and describe the kinds of integration, compartmentalization, conflict or synthesis that occurs between them.

The amount of shared structure between the varied objects that these high concepts are applied to is extraordinarily ambiguous and dirty. There’s a sense of the power of the concept’s generality, generating an (often rightful!) sense of religious awe. The sheer volume of applications leads to a sense that any insight into duality itself propagates immediately across a global spectrum of experience.

Scientifically minded thinkers struggle to apply valuable epistemological techniques (falsification, say) at the level of concepts that are intended to be this general. It’s all too easy to transfer a sense of knowledge where no relationship exists, and so these dirty abstractions are often greeted with skepticism rather than with the reverence of wisdom.

Against Hierarchy

  1. The mistake of using a single tag for all elements (down in a hierarchy, say the way that books are categorized) rather than using many tags. Where your data (ex., a book) must be declared to be mainly about one thing in order to be put in a single place on the shelf, but in reality is about many things. And in the internet world, we can switch from broken and overbearing single-property categorization schemes to flexible multi-property schemes.
    1. This is a more general argument against hierarchy
    2. A system of tags is… what? Feels like a hash table on top of a graph, where the nodes have relations but also have tags held in the nodes for immediate access.
      1. But what really happens is you drop the nodes and just add arbitrary categories to each datapoint.
    3. This is paradigm-breaking. Instead of classification (is it in category 1 or category 2), the answer is ‘both’. Where your label vector has positive values for multiple classes… but that doesn’t even instantiate the concept properly. You have multiple values of 1, where labels no longer have to trade off against one another.
      1. This leads to a new paradigm for file systems (as a graph rather than a hierarchy)
      2. General super powerful algorithm - every time you see a hierarchy, realize that it induces bias and check whether making it into a graph (or at least a DAG) can dramatically improve its value.
        1. Ex., Object Oriented Programming requires that the programmer define a static hierarchical class scheme
      3. Somehow, Clay Shirky didn’t realize this was a graph?
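
The ‘hash table on top of a graph’ intuition above can be sketched as a tag index, where an item carries many labels at once (the book titles and tags below are invented examples):

```python
from collections import defaultdict

class TagStore:
    """Items carry arbitrarily many tags; no single shelf position."""
    def __init__(self):
        self.tags_of = {}                   # item -> set of tags
        self.items_with = defaultdict(set)  # tag -> set of items

    def add(self, item, tags):
        self.tags_of[item] = set(tags)
        for t in tags:
            self.items_with[t].add(item)

    def query(self, *tags):
        """Items carrying ALL of the given tags."""
        sets = [self.items_with[t] for t in tags]
        return set.intersection(*sets) if sets else set()

store = TagStore()
store.add("Godel, Escher, Bach", ["math", "music", "philosophy"])
store.add("The Art of Fugue", ["music"])
```

Nothing here forces a book to be ‘mainly about’ one thing: the same item surfaces under every tag it carries.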

It’s natural that the strongest arguments against hierarchical reasoning come out of graphical reasoning. It’s a question of whether or not to lift the restrictions on relatedness, but in so doing lose the ability to generalize immediately across everything that falls under an abstract class in a hierarchy.

The claim is that if your conceptual scheme is rightfully hierarchically structured (say, every harvard athlete is a harvard student, every harvard student is a college student) then learning anything about a college student immediately generalizes to every harvard student and every harvard athlete, and everything you learn about harvard students generalizes to harvard athletes (to the degree that those categories are actually similar). So you learn that every college student has a transcript, and can presume that the athlete you’re watching on television has a transcript. Or you learn that every college student has a major, and can presume the same.

Stronger still are biological examples - you know that every hamster is a mammal is an animal. And so in learning that every animal has a circulatory system, you can know that your hamster has a circulatory system. In learning that every mammal has live young, you know that your hamster has live young. And so that transfer is straightforward.
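
In a programming language this hierarchical transfer is exactly what class inheritance provides; a minimal sketch mirroring the hamster example:

```python
# Hierarchical transfer via 'is a' chains: a fact learned at one level
# applies to everything beneath it.
class Animal:
    has_circulatory_system = True

class Mammal(Animal):
    bears_live_young = True

class Hamster(Mammal):
    pass

# Nothing was stated about hamsters directly, yet both facts transfer
# down the chain automatically.
```

This is the breadth-and-certainty of transfer that the next paragraph notes becomes harder once the exclusive ‘is a’ constraint is lifted.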

Once you allow for more flexible relationships (in this case lifting the exclusively ‘is a’ relationship constraint), it’s harder to do transfer with the breadth and certainty of a hierarchical scheme. Say you have a system of tags for books.

This is an example of somebody who fit the wrong data structure to his data and then, rather than getting frustrated with the bias induced by his choice, decided that the data structure itself was at fault. He noticed that books are difficult to fit into a single hierarchical conceptual scheme and went after hierarchy in general, rather than the choice.

Inheritance in object-oriented programming has a similar flavor. The desire for mixins and the flexibility they bring is sensible - making valuable tools available and availing of them when it’s valuable is a natural approach to tool construction.

The value of forcing hierarchical structure, to get the transfer that it promises, may prove too valuable to give up. And so the rejection of hierarchy reads as a reaction to the overuse of a structure (or the design of an entire language around that structure).

There’s a want for synthesis that can be achieved after overly powerful languages lead to the overuse of complex features and code complexity in general.

Conceptual Decomposition

  1. Map-territory conflation

Map-territory is a wonderful example of a concept that comes out of seeing the shared structure between models (and the way that a model can be in error itself, and is uncertain) as different from that between elements of reality (which are generated by deterministic or quantum physical processes). The base elements that drive the creation of this concept look like experiences where somebody expects that an impact on their model of reality (say, changing the way that they describe something) will directly change reality itself. In trying to bring somebody to notice their confusion, it’s very valuable to have a distinction.

Distinctions tend to generate two concepts where there used to be one. A concept that used to be unconditional becomes conditional.

Concept generation

I want to express a technique for critiquing, generating and improving ontologies called conceptual decomposition.

Leading Questions for Conceptual Decomposition:

  1. What are all of the ways in which the concept is used?
  2. How is the way we use the concept misleading?
    1. What other valuable conceptual schemes are we pushed off of?
  3. What is insufficient about the concept?
  4. What gives the concept its value?
  5. What are many examples of the concept, and how do they differ from one another? What is truly invariant across them?
  6. Is the concept part of a larger conceptual scheme? What concepts does it block, or support?
  7. What is the simplest possible version of the concept? The most complex version?
  8. What are all of the definitions that exist?
  9. What is the concept often conflated with?
  10. What major assumptions does applying or using the concept make? When do these assumptions differ from reality?
  11. What are the differences between the concept in its breadth and the particulars of its instantiation?

This body of leading questions should establish nuance, distinctions, and ground for an improved conceptual scheme.

<Go one by one through these leading questions, demonstrate their use (I used one on Abstraction itself) and the properties / advantages that they hold>

Transfer Examples, Composition through Conceptual Decomposition

Examples of Transfer

  1. Taleb, Power Law Distributions from the Black Swan
  2. Taleb, Volatility and Growth from Antifragile
  3. Entropy (Communication Theory, Thermodynamics, Probability Distributions)
  4. Eigenvectors
  5. Differential Equations
  6. Economic Laws (esp. Game Theory)
  7. All of math? Not sure how to frame this, the concepts don’t fit nicely.
  8. Graph Analysis (Linked, Degree Distributions, etc.)

Conceptual Decomposition on Composition (so meta!)

What are all of the ways in which the concept is used? One tempting place to start is with the types of objects that can be composed. Neural outputs in deep learning are the obvious place to start, with linear composition of each set of neurons into a single neuron in the next layer (for feedforward and recurrent nets). After the values are summed there’s a nonlinear activation. This is width-wise composition, and it allows varied combinations of the inputs to be represented at the next layer. The second type of composition in neural networks is depth-wise, where the functions that compute new layers are nested, leading to many levels of transformation. That functional composition gives the model an ease of expressivity.
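
The two flavors of composition just described can be sketched in a few lines (a toy example with invented weights, not any particular architecture):

```python
import math

def neuron(inputs, weights, bias):
    """Width-wise composition: a linear sum of the inputs,
    followed by a nonlinear activation (here tanh)."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def network(x, layers):
    """Depth-wise composition: each layer's function is nested inside
    the next, giving many levels of transformation."""
    for weight_rows, biases in layers:
        x = layer(x, weight_rows, biases)
    return x

# Two stacked layers over a 2-d input (weights are arbitrary examples).
layers = [([[0.5, -0.2], [0.1, 0.9]], [0.0, 0.1]),
          ([[1.0, 1.0]], [-0.5])]
out = network([1.0, 2.0], layers)
```

The width-wise sum lives inside `neuron`; the depth-wise nesting lives in the loop of `network`.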

Those are two types of composition in machine learning. This version is relatively clean and mathematically tractable. Another extremely clean form of composition is the composition of atoms into molecules, and molecules into macro-molecules. We can see matter as structured composition.

But there are much dirtier versions as well. Conceptual composition, where we form sentences like this one through the combination of a body of meaningful concepts, is much harder to formalize but clearly is an extremely effective method for communicating meanings.

Composition in this way is related to abstraction as hierarchical compression, where a high level meaning is the combination of more basic meanings. But it’s also tied to abstraction as shared structure, where the words being re-used are porting over different facts and connotations depending on their past usage, and so accomplish transfer. This happens at the sentence level as well, where sentences with similar grammar form the same shape of meaning as one another.

How is the way we use the concept misleading? One easy answer is the varied level of structure that can exist in a composition. One extremely common use of the concept is to say that some object is a composition of other objects - say that a pen is a composition of a cap, a shaft, an ink cartridge and a tip. These parts are indeed combined to create the pen; it’s a basic part-to-whole notion of composition. There’s no recursive composition of similar molecules at multiple levels of scale. The pattern to the combination of parts is ad hoc and specific to the purpose of the pen, rather than being grounded in innate structure or intended to be extremely general. The parts have interfaces to one another (for example, the cap fits the shape of the top of the body of the pen), but those interfaces are also very specific to the object in question and not general across objects. There’s a shallowness to this form of composition, and so it fails to share the power that comes out of the patterned composition we find in neural networks or in the atomic basis of matter.

What other valuable conceptual schemes are we pushed off of? The obvious answer is that conflating recursive composition with shallow composition reduces the felt power of the concept. There’s an expressivity to the interchangeability and depth that comes out of replaceable parts with flexible, modular interfaces. That expressivity tends not to exist in all things that are composed of sub-parts.

We’re pushed off of many of these concepts. Modularity, for starters. Where we have modular and non-modular composition. This draws our attention to the interfaces between the sub-parts in our decomposed object. It asks how general their interfaces are, and how those interfaces scale with the composition. As the flexibility of the interface grows the expressivity of the composition also grows. Specificity in the interface can allow behaviors that are unreachable otherwise, but the question is whether that specificity is hard-coded or comes out of a particular composition of sub-parts itself.

What is insufficient about the concept?

What gives the concept its value? Simplicity. The parts from which the heights are composed can be easy to understand and simple. Then you specify an interface for their interaction with one another, and an algorithm for exploring the space of compositions that satisfy some objective. Quickly you can get extremely complex behavior out of a system whose parts are deeply well understood. Composition is a path from simplicity to complexity, without the designer having to interface with the complexity themselves.

Levels of abstraction. Composition allows for the interaction with a complex system at multiple levels of analysis with similar tooling. Take function composition in computer science as an example. A function with some high level behavior (say, sorting) can be interacted with, without having to deal with the layer below it (a particular sorting algorithm) or layers far beneath it (the bits on the computer that need to be manipulated). The function can be interacted with at its most practical level, in the form of what the function accomplishes rather than how it’s encoded or even how its particular implementation works.
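
The sorting example can be made literal with function composition, where each function is used for what it accomplishes rather than how it’s implemented (the `compose` helper is a generic utility written here, not a standard library function):

```python
from functools import reduce

def compose(*fns):
    """Compose functions right-to-left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: (lambda x: f(g(x))), fns)

# Interact with 'sorted' at the level of what it accomplishes, without
# touching the algorithm beneath it or the bits beneath that.
normalize = compose(sorted, set, list)
# normalize([3, 1, 3, 2]) -> [1, 2, 3]
```

`normalize` deduplicates and orders its input, and the caller never needs to know which sorting algorithm (or hash table layout) does the work.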

So what can be said about composition itself?

Alignment

Representational Alignment

  1. Take every objective in ‘What Makes a Representation Good’, add my own objectives, and for each one specify:
    1. A way (or set of ways) to measure the objective
      1. Distinguish between the concept of the objective and the mathematical instantiation of the objective (unless they’re truly identical)
    2. The downstream consequences of doing better or worse on the objective
    3. Compare two different networks over the objective
    4. The rationale (and intuition pumps) for the objective
      1. The counterarguments

Aligning representations is often the very purpose of communication. There’s been a way that you’ve been thinking about a situation, about a person, about a kind of problem, and you will feel that you’ve communicated your state to your conversation partner when your representations align. The standard is whether they can express your experience and thinking with all of the emotion and intimate detail that you can, as a confirmation of shared state and alignment of conceptual scheme and felt senses.

The transfer of knowledge from a teacher to a student often begins with the establishment of a shared and agreed upon basis of knowledge on which to build. In the technical disciplines, definitions and formal representations (math, program code, and diagrams are strong examples) facilitate the alignment of representations. It’s on top of this aligned state that learning proceeds, and if misalignment goes undiscovered there are consequences for the student’s progress - those holes lead to misunderstandings, missed inferences, and incomprehension.

When alignment is taken for granted, conversations can proceed without the realization that the speaker’s mental state isn’t actually known. This leads to an illusion of transparency, where the speaker projects and assumes that representations are already aligned, and so can only see the conversation partner receiving the intended interpretation.

There’s a desire in representations for internal alignment, where in a similar situation the similar parts of the representation are re-used rather than being connected to another part of the representation. Generally this is driven by concept re-use, where (for example) you have a concept of learning that you port whether you’re learning mathematics, to play piano or a new language. That re-use lets you take techniques like spaced repetition and deliberate practice and port them across all learning tasks, rather then leaving them to be specific to their context. Often transfer fails to happen despite this - language learners deeply appreciate the value of immersion in their target language, but immersive learning is expensive enough that it’s rarely attempted outside of language learning.

Splitting concepts means that you need to go through more experiences to acquire the same collective amount of knowledge. If you’re reading about golden retrievers and don’t realize that they’re dogs, you’ll need to learn a body of facts over again if you want to understand how golden retrievers behave.

The discovery of concepts that share some structure that wasn’t previously perceived allows for internal learning, free of new experience. It’s the taking of every experience associated with one concept and re-seeing the experiences associated with the second concept in light of the first. You realize that there’s a connection between compositionality and hierarchy trees, and are suddenly free to port notions of how sub-parts can be represented so as to be combinable, or the speed of operations across the hierarchy, or the idea to decompose and recombine when you want the properties that hierarchy promises.

As importantly, you realize that each system whose representation is compositional, whether it be an object-oriented programming language or a biological taxonomy, is impacted by what is learned through this alignment process.

Canonical Correlation Analysis
Representation averaging without alignment
Internal alignment where there’s shared structure (re-use for cleaner global updating)
Consequences of the absence of alignment
Measure

How do we measure alignment? A failure to do this properly leads to miscommunication, failed attempts at transfer, conflation and ambiguity. The tests look quite different across contexts.

Conversational (and human conceptual) alignment often looks like making a statement or prediction that would only be possible under conditions of alignment (a non-trivial inference) - noticing the weakness of using repetition of words as a standard of measure, since it’s possible to mimic the interface of a representation without actually having the underlying representation. There’s a hope that stating a thought in one’s own words, or some other metric that forces generalization, can avoid overfitting to the specific way that a concept was phrased by the first speaker.

This distinction in measure - the measurement of generalization rather than replication - is an important distinction across machine learning and thought. Can a person actually use the representation they’ve been aligned to across new data, or do they have a version of the representation that’s too closely tied to exactly what was communicated? It’s akin to learning vs. memorizing, and captures a lot of what it means to ‘understand’.

Measuring alignment in neural networks can proceed by the same mechanism. Put an identical datapoint through both networks and ask whether the same activations trigger. It’s a notion of network similarity, and can be more or less in depth, where at surface level you ask if the predictions made are the same, at a slightly deeper layer you ask if the probability distribution over possible predictions is the same, at even deeper levels you ask if the bodies of activated neurons are the same, on through the depth of the network.
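
A sketch of that layer-by-layer comparison, using cosine similarity over activation vectors from two networks fed the same datapoint (the activations below are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def layerwise_alignment(acts_a, acts_b):
    """Feed both networks the same datapoint, then compare their
    activations layer by layer."""
    return [cosine(a, b) for a, b in zip(acts_a, acts_b)]

# Hypothetical activations from two networks on one shared input:
# identical first layers, diverging second layers.
net_a = [[0.2, 0.8, 0.0], [1.0, 0.5]]
net_b = [[0.2, 0.8, 0.0], [0.5, 1.0]]
scores = layerwise_alignment(net_a, net_b)
```

Shallower checks (same prediction, same output distribution) and deeper ones (same activation bodies, as here) are all instances of this same comparison at different depths.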

There’s a notion of alignment in word vectors which allows for unsupervised machine translation, by asking a model to force words that are known to have the same meaning to use the same vectors to represent those words. All other words are learned around and with respect to those vectors, and so will also be aligned with words in the other language. Translation can proceed at a word-by-word level.

The question of prediction often can be made concrete - if a teacher wants to check that the student’s representation is correct, they assign a set of problems whose collective predictions will only be correct if the representation is correct. Depending on the task, there’s a level of depth that can be required. In some cases, the student can use the formal representation (say, a set of equations) and apply them naively to the input question to produce the proper output. This can create misalignment between the representation the student uses to solve the problem and the conceptual representation that the student really has, where they’ve outsourced their understanding to an external formal representation and are merely replicating the interface of that representation’s use.

Alignment, Similarity, Fields that Transfer

More Representational Alignment

Experiment ideas:

  1. Distillation as alignment of representations
  2. Language alignment through word vectors; something similar can be done with concept learning by clustering available images
  3. Internal learning, or learning by doing transfer between parts of your own representation, seeing all learned functions as hypotheses that can be applied to new datapoints
  4. Communication in multi-agent RL could look like representational alignment

  1. Canonical correlation analysis checks which principal components are most aligned. That alignment step could be applied to the feature space before model averaging.
  2. This allows for recombination in network space? Perhaps?
  3. Canonical Correlation Analysis
  4. Representation averaging without alignment
  5. Internal alignment where there’s shared structure (re-use for cleaner global updating)
  6. I bet that networks learn the same ways to recombine base elements repeatedly. This can be re-routed to use the same abstract transformations.
  7. Consequences of the absence of alignment

The comparison of networks question motivates another metric for internal representational alignment - efficiency. With how few neurons can the network effectively perform its task? What fraction of its units does it actually use, and need to use? A kind of alignment can be forced by a challenging task that the network doesn’t have enough data to learn on without re-using one experience for another.
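A toy version of this efficiency metric, counting what fraction of a layer’s units ever fires over a dataset (the activation matrix and threshold below are invented for illustration):

```python
import numpy as np

def fraction_used(activations, threshold=1e-6):
    # activations: (num_datapoints, num_units), e.g. post-ReLU outputs.
    # A unit counts as "used" if it is ever meaningfully active.
    used = (np.abs(activations) > threshold).any(axis=0)
    return used.mean()

acts = np.array([[0.0, 1.2, 0.0, 0.4],
                 [0.0, 0.3, 0.0, 0.0],
                 [0.0, 2.1, 0.0, 0.9]])
eff = fraction_used(acts)   # 2 of this layer's 4 units ever fire
```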

A challenge in measuring alignment is that you have to know what datapoints share structure that could benefit from re-use in the model. This can be challenging, especially if you don’t have the generator for that data. Your ability to interpret the data may be limited.

In the conversational context, it can also be a deep challenge to measure alignment. You and the person you’re talking with may use the same words to express things, but mean very different things by that word and have different connotations over its usage. Even if you realize that there’s misalignment, it can be difficult to tell where it is and how much, depending on your introspective power over your own representation and your ability to communicate with language the thoughts and experiences that underlie the creation of your representation.

The obvious rationale for optimizing for alignment is its impact on generalization and on learning efficiency.

Representations that re-use the relevant components when making predictions are capable of naturally generalizing from one example to the next. People whose representations are aligned are capable of generalizing in that they’ll make the same predictions as one another when using that aligned representation, and so have a shared context on top of which they can continue to push their (now collective) knowledge forwards.

There’s a limited scope in working memory, and so to the degree that more information can be bundled together under a concept, the predictions of the representation will be more general.

Representations that are aligned allow for efficient learning, in that the same experience doesn’t have to occur twice to two concepts that could have been aligned but were not. In so many ways, people have to learn the same lessons over and over again because they fail to represent their problem in a way that naturally allows transfer from a context where they experienced the same lesson.

There’s another notion of efficiency, efficiency with cognitive resources, which comes out of alignment. There’s generally a limited number of neurons, computational units, or concepts in the representation being used for prediction, compression, etc. That limitation manifests itself in the amount of composition that can be leveraged up and across the representational hierarchy, in the amount and quality of differences that can be picked up by the model, and in its overall capacity.

As with creating a new abstract object and using it, there’s a tradeoff between conflation and shared structure when it comes to alignment. You want to align the parts of the representation for which the transfer over shared structure aligns with reality, and not align the parts for which that’s untrue. Extremely high alignment (say, the dichotomy between yin and yang / binary / dialectic / thesis-antithesis-synthesis) may actually yield more conflation and so incorrect predictions than if the representation stays at a level that can safely distinguish between the important differences between those objects.

Fields by Transferability

There’s a sense that the knowledge it’s most important to gain first is the knowledge that can be re-used to successfully model other domains. By proceeding from generality to specificity, the learner will speed up dramatically.

Very rough version of fields ranked by transferability:

Reality:

  1. Mathematics
  2. Probability Theory / Statistics
  3. Philosophy (especially Logic, Epistemology)
  4. Physics
  5. Economics
  6. Computer Science
  7. Engineering
  8. Biology
  9. Chemistry

Social Reality:

  1. Psychology
  2. Folklore & Mythology
  3. History
  4. Classics
  5. Sociology
  6. Linguistics
  7. Religion
  8. Anthropology
  9. Government / Law
  10. Gender / African American Studies

A natural thought is to start learning with mathematics and statistics, immediately acquiring the parts and the modes of thought from which the other fields borrow. Much nicer to learn principles that generalize across fields (say, statistical testing, which is used as the standard of evidence across umpteen sciences) and then have that body of thought at hand when learning the details of a specific discipline.

The challenge in learning something overly specific first is that when the learner moves into a new space, much less of their knowledge will transfer, making the move less likely to happen in the first place and emotionally unpalatable. There’s also the challenge of contributing in ways that are valuable via transfer from a body of ideas that tend to be overfit to their domain.

Notions of Network Similarity

Candidate research ideas in network similarity.

  1. Jensen-Shannon Divergence over set of Logits
    1. Wasserstein Distance over Logits
  2. Do alignment over convolutional kernels (akin to the alignment step in canonical correlation analysis)
    1. One algorithm for doing the alignment looks like seeing which activations are active over a sample of datapoints, and using the shared activation patterns for the alignment.
    2. After the representations are aligned, directly computing the distance between convolutional layers starts to be meaningful.
  3. Fraction of datapoints similarly classified
  4. Decision boundary metrics
  5. Architecture Similarity
    1. Comparing depth, width of hiddens, counts of layer types, orderings, etc.
  6. Ways to embed neural networks?
    1. Learn a neural network autoencoder that creates a latent space of networks. Measure similarity in the latent space.
  7. Similarity of the embedding over the Representation of the network in NAS, where the LSTM has some hidden state that generates a network which can be compared to the hidden state that generates another network
    1. Would have to find a way to, instead of going from data to representation architecture, go from architecture to representation to data (inversion)
  8. Represent the output of each neuron over the dataset as a vector, and create one vector for each neuron in the network. Compare the resulting matrices after a decomposition (SVCCA)
    1. Can use the intermediate matrices and compute aligned neurons with activation patterns rather than CCA
  9. Cosine Distance / Angles between tensors?
  10. Convergence / divergence of learning throughout training (this is an instance of a temporal / course of optimization metric, which is actually an entire family of new metrics)
  11. Recombine with the above, evaluating each notion of network similarity over the course of training
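The first metric in the list can be made concrete in a few lines of numpy; the logits below are invented, and this is a sketch rather than a benchmark implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(logits_a, logits_b):
    # Jensen-Shannon divergence between the two networks' predictive
    # distributions, averaged over datapoints. Symmetric and bounded.
    p, q = softmax(logits_a), softmax(logits_b)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    return np.mean(0.5 * kl(p, m) + 0.5 * kl(q, m))

la = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])
js_same = js_divergence(la, la)    # identical networks: divergence is zero
js_diff = js_divergence(la, -la)   # disagreeing networks: divergence is positive
```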

Data to Feature, Features to Relationship, Relationships to Structure, and Structures to Meta-Structure

  • An abstract hierarchy over the structure of information

There are different forms of abstraction here, but hang with me.

One way to compress data or experience is to categorize it into features. Say you see a house. The house can be seen to have many features like the number of bedrooms, its location, square footage, the presence or absence of a pool, the noise level around it, its distance to food and to social events, and many others. A nice summary of the house looks like a listing of each feature. A particular house can be seen as being similar to another house as a function of the similarity across each feature. When you see many houses, you begin to see the things that they have in common and the things that differ across them.

Those attributes over which there’s similarity or variance are features, and they can be seen as abstract representations of concrete instances (ex. this house has 2 rooms, vs. ‘number of rooms’ as a holder for a number which will vary across many houses). We naturally generate these abstractions as we try to distinguish between events and experiences, saying that the square footage of this house is larger than that of another.
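The house example, written as feature vectors (all numbers and the scaling choices below are invented): similarity becomes a comparison across features rather than over raw experience.

```python
import numpy as np

# Each house is summarized by the same list of features.
features = ["bedrooms", "sqft", "has_pool", "noise_level"]
house_a = np.array([3, 1500, 1, 2.0])
house_b = np.array([3, 1550, 1, 2.5])
house_c = np.array([1, 600, 0, 8.0])

def similarity(x, y, scale):
    # Per-feature scaling keeps sqft from dominating the comparison;
    # higher (less negative) means more similar.
    return -np.abs((x - y) / scale).sum()

scale = np.array([1, 500, 1, 5.0])
ab = similarity(house_a, house_b, scale)
ac = similarity(house_a, house_c, scale)
# house_a comes out closer to house_b than to house_c
```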

We can see relationships as being between features, and named relationships (like linear, exponential, or quadratic) as not just formal functions that define particular relationships, but as abstractions over those relationships, where there is a category, say linear, into which all linear relationships fall.

There are many other types of link or relationship. All categories imply an is-a relationship, where some element belongs to the category.

This compression is very hard to do in general, where there are many relationships that don’t fall into a well known function category. We’ll use correlation to ask whether there’s a positive or negative linear relationship, and perhaps use mutual information or distance correlation to ask whether a relationship exists at all. But these notions are relatively simple and crude, failing to capture the complexity and the nuances of most real relationships. It was in the face of this challenge that a field (machine learning) came to study the generation of functions that do capture those relationships.
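A sketch of this gap: Pearson correlation misses a clean quadratic relationship that distance correlation detects. These are minimal numpy versions of both measures, not production implementations:

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def distance_correlation(x, y):
    # Naive O(n^2) distance correlation via double-centered distance matrices.
    def centered(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

x = np.linspace(-1, 1, 201)
y = x ** 2                          # purely quadratic: no linear trend at all
r = pearson(x, y)                   # essentially zero
dcor = distance_correlation(x, y)   # clearly positive: a relationship exists
```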

Structure is what I’ll use for an abstraction over sets of relationships, one that describes the way relationships relate to one another. One clean example is hierarchy - you have some elements which can relate to other elements below or above. You can call an element below a child (and an element can have many children) and an element above a parent (where each child can only have one parent). You can follow this tree or this hierarchy from any single bottom node up to the top nodes - those nodes that have no parents.

This structure is seen across umpteen fields and parts of nature, from your computer’s file system to family lineage. Here, structure defines some rules of relating, and the structure itself will take on properties.
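The hierarchy example in its most minimal form (the file paths below are invented): each child points to exactly one parent, and any node can be followed up to a root, a node with no parent.

```python
# Parent relation for a tiny file-system-like hierarchy.
parent = {
    "/home/ada/notes.txt": "/home/ada",
    "/home/ada": "/home",
    "/home": "/",
    "/": None,          # root: no parent
}

def path_to_root(node):
    # Follow the single-parent rule from any node up to the top.
    chain = [node]
    while parent[node] is not None:
        node = parent[node]
        chain.append(node)
    return chain

chain = path_to_root("/home/ada/notes.txt")
```

The same few lines would describe a family lineage or an org chart; only the node names change, which is the point of calling hierarchy a structure.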

Much like the way that correlation, mutual information, distance correlation and others fail to capture the complexity and nuances of real relationships, most kinds of structure are too difficult to write down formally. Locality describes the relative importance of relationships between features that are local rather than farther away. Periodicity describes the way that sets of features undergo changes in tandem. But many structures are too complex to capture so simply or succinctly, and can’t easily be written down formally. And the formalisms for structures that we do know about often change dramatically from system to system (ex. binary trees vs. posets, though binary trees have a recursive definition using set theory).

There’s a relationship between the basic operations that generate a structure (say recursion or composition generating a hierarchy) and the structure itself.

Old Draft

Old Draft Table of Contents

  1. What is Abstraction?
    1. Introduction
    2. How to Think About Abstraction
      1. Hierarchical Compression
        1. Shared properties of compression
      2. Shared Structure
      3. Similarity
      4. Pure abstractions, CS & Mathematics
      5. Which properties transfer across these definitions through examples
      6. Map / Territory
      7. Where people have seen abstraction / how it’s talked about
    3. Types of Abstraction
      1. Examples
      2. Base unit - lowest level for composition
      3. Consistency - whether the type of object is the same down to the base unit
      4. Discreteness vs Continuous
      5. Comprehensive (covers entire space) vs. cleaner generalization
  2. Power
    1. Transfer & Generalization
      1. Metaphor as tacit abstraction
      2. Causal vs. Intuitive vs. Probabilistic vs. Structural Transfer
      3. Importance of appropriate level of abstraction
        1. Tractability with reality
    2. Representation heights as powerful
    3. Hierarchical Structure
    4. Memory, Computation
    5. Multiple levels of abstraction
      1. Moving between levels of abstraction
    6. Abstracting away reality
    7. Creativity
      1. Creativity as transfer
      2. Creativity as recombination of representations
      3. Creativity as deconstruction
  3. Pitfalls
    1. Unlike Object & False Transfer
      1. Memetics
    2. Excessive Height
    3. Abstraction as an unseen assumption
      1. Reductionism
    4. Lack of grounding to abstractions
    5. Social reality
    6. Conflation
  4. Construction & Deconstruction
    1. Processes that create abstractions
      1. Intuitive
      2. Intentional
    2. Engineering abstractions for outcomes
  5. Using Abstraction
    1. Improving abstractions
      1. Feedback
      2. Consistency
      3. Groundedness
    2. Noticing breakdowns
      1. Overly coarse
        1. Optimal Reactions
      2. Overly fine
        1. Optimal reactions
    3. Seeing the level of abstraction being used
    4. Choosing the level of abstraction to use
      1. Frame
  6. Future
    1. Machine Intelligence & Representation
    2. Future of academic fields damned by the consequences of dirty abstractions
    3. Conceptual past and future

Thoughts list

  1. Category errors as failure in abstraction type
  2. Business leaders as using high level abstractions over low level behavior
    1. Constant question is where / if their abstractions break or become ungrounded

What is Abstraction?

Introduction

Idea List

  1. Nice anecdote / grounding
  2. Motivation
    1. power of height & representation,
    2. transfer, generalization,
    3. problem solving,
    4. working memory,
    5. creativity,
    6. interestingness and intellectual beauty,
    7. deconstructing disagreement & confusion,
    8. structure of reality

Lines:
  • Abstraction as synonymous with evasiveness, a lack of clarity
  • Abstraction as about untamed power

How to Think About Abstraction - Abstraction as an Abstraction

Idea List

  1. Overview: hierarchical compression, shared structure, purity, transfer

As a term that unifies shared structure, abstraction sits above a number of critical compressive forces that underlie thought.

Shared structure means that when looking at a set of objects, you’ll invariably find some objects that have similarities to one another. Those objects are ‘closer’ to one another, and sometimes they’re close enough that it makes sense to summarize their closeness (and reduce the number of objects that you need to keep in mind) by giving them a shared name.

Shared structure can be thought of as a simple form of hierarchical compression, in the case where there’s only one lower level layer and a single higher level concept. But that’s rarely where things end - as soon as you have abstracted, you’re looking at a new object. And yet more compressive power can be found in abstracting recursively, generating another layer and adding depth to the hierarchy. That said, there are other forms of hierarchical compression that distinguish it from shared structure. Often, unlike objects will be composed into a cohesive and coherent higher level object, and so we’ll abstract over unlike objects that work together in a way that makes it efficient to compress them, often in terms of their function. A laptop is a great example of this. It brings together wiring, a compute chip, memory, tools for interaction like your keypad and mouse, a glass screen, and more. We abstract over all of this, calling the object a laptop. This compression is important so that the level at which we interact with the object is simple and consistent with the tools being used to do the interaction - moving the object from place to place in a bag, buying or selling the object, or clarifying its presence or absence.

There’s a purity that comes out of abstracting in a way that doesn’t dirty things up. It’s hard to come by that purity in the world (say, when abstracting over things like shapes, objects, or emotions) but it happens all the time in mathematics or computer science. Mathematical functions as the abstraction of an operation, say - we take many instances of addition and abstract them into a function that operates over any data. That abstraction is pure in that it’s clear what the shared structure is between instances of addition, and it transfers cleanly (all of the properties are maintained) to a new setting. This purity comes out of the limitedness of the abstraction, as well as doing it in an abstract context (math) rather than in reality, where something like adding groups of animals together may be done heuristically and imperfectly rather than mathematically - especially for groups that haven’t developed a number system.
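The addition example can be made concrete: many instances of combining values are abstracted into one operation that runs over any data supporting it (the function name is mine, and this is a sketch of the idea rather than anything formal):

```python
def total(items, start):
    # One abstract operation; integers, floats, and strings are all
    # instances over which the shared structure (repeated +) holds.
    result = start
    for item in items:
        result = result + item
    return result

ints = total([1, 2, 3], 0)
floats = total([0.5, 0.25], 0.0)
words = total(["ab", "cd"], "")
```

The purity shows in the transfer: every property of the operation carries over unchanged to each new type of data, with no heuristic adjustment needed.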

Across these definitions of abstraction, there are properties of abstraction that hold throughout, like compression and informational power (due to height allowing one to operate over many more objects). But other properties break down - the abstraction over unlike objects that perform a unified task gives the user of the abstraction a new and improved level at which to interact with reality. But it’s not pure in the way that a type system in computer science is pure, where you abstract over the bit representation of integers, or characters, and so are confident that you can operate over every instance of that abstraction in an identical way.

Hierarchical Compression

Idea List

  1. Compositionality
  2. Curse of dimensionality
  3. Width-wise and depth-wise compositionality
  4. Recursivity
  5. Unifying lower level structure vs. breaking down higher level structure
  6. Read Bengio on hierarchical compression

There’s amazing power to hierarchy, in that it can introduce relatively complex models that tame an extremely high dimensional and complex world.

A classic example of hierarchical compression comes from vision, where it’s used to create state of the art computer vision systems. Images at a low level can be filtered for lines and curves - in a cat’s mind (and likely ours) the visual system lights up in the presence of edges in a way that doesn’t happen for other extremely low level shapes. Those edges and curves can be re-composed into shapes. And those shapes are composed into parts of objects, which can be composed again into higher level objects. These all expose grains at which to interact with reality.
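The lowest layer of this visual hierarchy can be sketched directly: a tiny vertical-edge filter responds where an image goes from dark to light and stays quiet on flat regions (the image and filter values below are invented):

```python
import numpy as np

# A 3x4 "image": dark on the left, bright on the right.
image = np.array([
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
])
kernel = np.array([-1.0, 1.0])   # responds to left-to-right increases

def filter_rows(img, k):
    # Cross-correlate each row with the kernel (np.convolve flips its
    # second argument, so we pre-flip to get cross-correlation).
    return np.array([np.convolve(row, k[::-1], mode="valid") for row in img])

response = filter_rows(image, kernel)
# Each row responds only at the dark-to-light boundary, and nowhere else.
```

Stacking further layers that compose these edge responses into shapes, then parts, then objects is exactly the bottom-up compression described above.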

That’s an example of bottom up hierarchical compression, but we’re also familiar with going top down - being introduced to a high level concept (say, in the form of language) and over time needing to fill in the details.

Shared Structure

Similarity

  1. Cognitive Fit
  2. Distance Metrics (As examples of different types of similarity)
    1. Edit Distance vs. Euclidean Distance vs. Geometric Distance vs. Angles vs. Shape vs. Color vs. KL Divergence, etc.

Pure Abstractions

Transfer, Across Definitions

Idea List

  1. Idea list this!
  2. Compression transfers
  3. Operating recursively sometimes transfers
  4. Shared Structure
  5. Pure abstractions, CS & Mathematics
  6. Which properties transfer across these definitions through examples
  7. Map / Territory

Transfer of properties of abstraction across definitions

Definitions - is-a relationships (math, object oriented) vs. hierarchical compression (in my mind just unidirectional, but what we care about is shared properties). Is abstraction necessarily virtual? How virtual - do we execute it to be consistent with the data generating process, so that it maps cleanly to the territory, or is the map clearly apart from the territory, something we don’t expect to adhere perfectly? There are different expectations of abstraction in math/physics/cs vs. the dirty version everywhere else.

  1. Shared between math / physics / cs abstraction (is-a relationship, purity, consistency with territory, always recursively executable)
  2. Hierarchical Compression

Both involve:

  1. Compression
  2. Level of analysis at which to interact

  1. Though math / physics / cs will throw category errors if you try to interact at the wrong level of analysis
  2. Heights of abstraction operate over more information
  3. Leakiness
    1. A problem in (and defined by) CS APIs. APIs as an abstraction over implementations.
    2. Though perhaps worse in hierarchical compression, where there’s so much more behind the abstractions.

Distinctions:

  1. Transfer and generalization
    1. Math / object oriented cs (not machine learning!) / physics have very narrow and precise rules around transfer and generalization. And they throw category errors when those rules are unmet.
    2. Hierarchical compression works in looser ways in a statistical sense. You can generate a representation of a concept and use it in extremely refined or less refined ways, to better or worse results.
  2. Base units
    1. Consistently build up compositionally from base units in math / physics / cs. Not true elsewhere - can start at an intermediate or high level and go down, or start low and go up. Reductionism is much less effective elsewhere. You can’t transfer reductionism across domains, because the systems aren’t constructed (as in cs) or fundamental (as in physics), and so you don’t know what all of the ways to abstract down are.
  3. Discreteness vs. Continuity
    1. Usually discrete in language and in math / physics / cs, continuous in machine learning / in our minds. There’s a fuzziness that lets us do cross-domain ‘cognitive fit’.
  4. Dirtiness
    1. Math / physics / cs is extremely clean (even pure), elsewhere people are happy to recombine distributed representations of concepts
  5. Limited vs. Defined and Loose
    1. Extremely limited in math / physics / cs. Category errors, discretely represented.
  6. Groundedness
    1. Often ungrounded in hierarchical compression space, you don’t even really need to know the low level.
    2. Can be ungrounded in physics (say, before we knew about atoms), but very grounded in math (bottom-up construction from axioms) and in object oriented programming (regardless of the true data being modeled, the cs representation is grounded in bits)
  7. Mental modeling as abstraction often doesn’t expect to conform to the territory. Math / CS / Physics try to create abstractions that correspond very closely to the territory, and to the process that generated the data being modeled.

Types of Abstraction

Examples Base Unit

Base Units show up to ground abstractions in physics (often atoms), mathematics (axioms) and computer science (bits). There will be abstractions built over these base units (say, molecules, theorems & functions, or types) and the system will follow hierarchically through composition of base units. These systems have strong ties to causal structure, where outcomes are the result of sequential operations over lower level units.

There’s another representation that’s much less grounded that we can use for comparison - ensemble modeling. Rather than mapping out a space from first principles, generate a set of representations that seems to break the space up (much like we’re doing here, breaking abstractions into types to compare their properties).

Consistency

There are different types of abstraction that show up in conceptual abstraction space. Types can either be mixed with one another, or done with consistency. [Add motivation]

For example, take a big organization. We can break it into a sales division, and an engineering division, and a marketing division, etc. The type of abstraction here is a function of what the parts of the company do for the company.

Say now that you want to measure the extraversion of these groups. This involves a different type of abstraction, down to the individual people in each group, and then a function (a measure of extraversion, with an average) over the set of individual people.

We’re mixing:

  1. Function performed in the company
  2. Set of people in each group

And we could have continued by function at a lower level (say, sales is divided into sales managers, sales call persons, and lead generation people). For some operations, like management structure, this is the appropriate way to deconstruct the space.

This works cleanly because the sets that underlie the abstractions are cleanly separated from one another.
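The two abstractions in this example can be written out directly (all names and scores below are invented): divisions abstract by function, and the extraversion measure abstracts over the set of people within each division.

```python
# Function-level abstraction: divisions, defined by what they do.
# Person-level abstraction: each division is also a set of people.
company = {
    "sales":       [("Ann", 0.8), ("Raj", 0.7)],
    "engineering": [("Mei", 0.3), ("Tom", 0.4)],
    "marketing":   [("Zoe", 0.9)],
}

def division_extraversion(org):
    # Average the person-level measure within each function-level group.
    return {div: sum(s for _, s in ppl) / len(ppl) for div, ppl in org.items()}

scores = division_extraversion(company)
```

The mixing works cleanly here because each person belongs to exactly one division - the sets underlying the abstractions don’t overlap.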

Discreteness vs. Continuity

Terms, How People Talk about Abstraction

  • High Level vs. Low Level
  • Grain (Coarse Grain vs. Fine Grain)
  • Broad vs. Specific
  • General vs. Specific
  • “Broad brush”
  • Concepts, Conceptualization (Boyd, Dad)
  • Comprehensive Whole vs. Particulars (Boyd)

High Level vs. Low Level

There’s the ubiquitous reference dependent ‘high level’ and ‘low level’ type of reference, where the speaker has in mind some reference class level (often contrasting the high using the low as a reference point, and using the high as a reference point to define the low).

This tends to lead to unnecessarily binarized thinking. Ideally the language would auto-capture that there may be a high level and a low level, but there’s also a level higher than low level but lower than high level, and lower than low level, and higher than high level, etc. The use of almost all of these terms relegates us to two levels by default. Though maybe that makes thinking easier for the reader. And also, perhaps it’s not binary but points to a ‘gradient’. More true for coarseness vs. fineness than high vs. low.

Coarseness vs. Fineness, Grain (Coarse Grain vs. Fine Grain)

(Literally true in the case of abstraction in image processing)

I really enjoy how this version implies a continuum of abstraction - it’s clear that you can become incrementally more coarse, or incrementally more fine. And so it’s appropriate for those situations where the abstraction is continuous. It’s also flexible, and can be used for discrete situations that are more or less fine / coarse than one another. But it struggles in another context, where there’s strong discreteness. Take the version of abstraction involved in creating a function, or in creating a variable instead of using scalar values. The metaphor (coarse vs. fine) starts to break down. This is in part because coarseness vs. fineness assumes that the type of abstraction is exactly the same throughout! In the metaphor, you only get more or less resolution. You never switch to a different type of abstraction. And it’s hard to model a binary discrete situation with this, where there are exactly two levels. It may be nicer if there are more levels, but we also have to keep the type of abstraction the same.

Broad vs. Specific, General vs. Specific, “Broad brush”

“If you ask a more specific question, I can give you an answer,” is what Scott Kominers would love to tell me. People love to use the term ‘broad brush’ to give themselves permission not to condition on subsets of populations, or to make general claims in a way that’s unrigorous. It’s both valuable and dangerous - dangerous in that they don’t expect to go into details on their claim, which makes these kinds of claims extremely hard to evaluate, verify, or argue against. Often arguing against them consists of picking a counterexample, which the person admitted would exist when they said the statement was broad brush. It’s valuable in that these summaries across populations end up being critical for decisions that depend on the proclivities of a large number of people. And it becomes extremely difficult to reason about a space if the level of rigor required makes it hard for people to make claims that look true to them.

Concepts, Conceptualization (Boyd, Dad)

Comprehensive Whole vs. Particulars (Boyd)

General-to-specific (Boyd)

Power

Transfer and Generalization

Idea List

  1. What transfer is, implications for generalization (non-local / out-of-domain transfer)

  2. Generalization accuracy as a standard for abstraction quality

  3. Metaphor as tacit abstraction

  4. Causal vs. Intuitive vs. Probabilistic vs. Structural Transfer

  5. Importance of appropriate level of abstraction

    1. Tractability with reality
  6. Transfer & Generalization

    1. Metaphor as tacit abstraction
    2. Causal vs. Intuitive vs. Probabilistic vs. Structural Transfer
    3. Importance of appropriate level of abstraction
      1. Tractability with reality

Transferring learned knowledge across domains requires abstraction, explicitly or implicitly. The abstractions are the shared structure between domains that allows the knowledge gained from one domain to transfer to another.

What is transfer?

Great examples of this are planning and pattern recognition (transfer between datapoints).

In planning, there’s a hierarchy of tasks over which you can do transfer. Say you need to save someone just hit by a car by getting them to the hospital.

At a high level, there’s a transportation problem. It can be instantiated in a few ways - ambulance, car-transport, etc. Ambulance breaks down into calling 911, identifying your current location (orienting to the street signs), communicating with the emergency operator, etc. Those are composed of lower level actions which often disappear beneath conscious awareness, say taking out a phone, opening the interface, opening the phone app, pressing the keys, putting it to your head, etc. This process is modular and transfers across domains. Regardless of who you’re calling, when you’re calling, the motions are the same. And so you think in terms of the abstracted motion (call this person / number).

What is generalization?

The critical property of an abstraction is its ability to generalize.

Within a domain, learning often looks like generalization - in sports, you pattern match one situation to similar past situations, and react instinctively in a way that your intuition hopes will lead to success. In mathematics, solving a problem with an algebraic manipulation once lets you recognize that type of manipulation in other situations. And so you often abstract the manipulation into a rule or into an operation that you can run more flexibly.

But you’d also like to do out of domain generalization - learn something in one context and apply it to an entirely different context. Learning language in one domain (say, in a home) and then applying it at school, or with friends rather than parents, or in speeches. Take the concept of specialization and lift it from economics to understanding sexual dimorphism in evolution. Take the concepts of a replicator and differential selection from evolution and apply them to ideas and tunes and fashions that replicate by imitation.

Generalization as a Standard

Generalization accuracy is a great standard to hold abstractions to. What we want is for our representations to aid us in problem solving, which often takes the form of predicting what the impact of our actions will be, or what the state of our system will be in the future.

When adjudicating between representations, and when constructing them, we optimize for the representation that is easy to evaluate (a simplicity heuristic) and that makes the most accurate predictions. Simplicity is also part of accuracy: simpler models capture more data (because in general they’re more abstract), are more robust to small differences between observations and so better able to capture the higher-level regularity, and are straightforward to update when they’re mistaken.
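As a toy sketch of this standard, here is a comparison with invented data and invented models: two candidate representations are adjudicated purely by predictive accuracy on held-out data. The simple linear rule, which captures the underlying regularity, beats the complex model that merely memorizes its training observations.

```python
# Invented data: roughly y = 2x, with a little noise.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
held_out = [(4, 8.0), (5, 10.1)]

def linear(x):
    """Simple model: one abstracted regularity (y = 2x)."""
    return 2 * x

def memorizer(x):
    """Complex model: replays training data exactly, so it has zero
    training error but no idea what to predict off the table."""
    table = dict(train)
    return table.get(x, 0.0)

def held_out_error(model):
    """Generalization accuracy: squared error on unseen datapoints."""
    return sum((model(x) - y) ** 2 for x, y in held_out)

models = {"linear": linear, "memorizer": memorizer}
best = min(models, key=lambda name: held_out_error(models[name]))
```

The same comparison run on the training set alone would favor the memorizer; holding abstractions to the generalization standard is what reverses the verdict.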

Implicit Transfer

Metaphors are examples of transfer that at first look to be free of abstraction, but since they draw their power from shared structure between domains, we can see them as using abstraction implicitly.

One result is that metaphors are a source of explicit abstractions, letting representations and solutions be extended to their farthest reaches. Often the emphasis is on emotional transfer, and so the metaphor communicates more than pure information. That transfer matters too: communication is extremely high-dimensional, and the felt dimensions matter more for compelling action and building understanding than raw information does. More on this in engineering abstractions.

Take a classic Christian verse, “we are the clay, you are our potter”. There’s implicit abstraction around agency versus the acted-upon, intentionality, moldability, and craftsmanship, as well as the load of connotations that go along with the profession and the nature of the objects.

Giving examples as grounding for abstraction is common practice. Often more than one example is required to pull generalization off properly: each example carries details that don’t generalize or transfer, and a second or third example shows which details can be omitted.

Glorious Heights of Representation

There’s power at the heights of abstraction. As the height grows, so does the number of objects being acted over. There are a huge number of nodes in a belief structure, and the impact of a piece of information grows as it becomes relevant to every node in that structure.

Say you learn something about capitalism. All of a sudden your model of every single transaction that has occurred across time, and of the implications of those transactions, updates. There’s power here insofar as one can take a new concept and notice the breadth of implications that come with it.

The way to generate power from lower-level information is by abstracting. Say you notice that when someone takes a lower-paying job, they don’t have as much time to apply and interview for higher-paying jobs. This is a nice insight which, once you’re aware of it, may help you make that tradeoff more effectively in your own life. But the power comes from abstracting this observation into opportunity cost, and realizing what opportunity cost implies for the application of every finite resource (time, attention, energy, money). Even “resource” in that previous sentence is an abstraction over a world of objects that often needs to be explicated if the impact of an update to that abstraction is going to be understood, experienced, and lead to a better planning process.

Pitfalls

Transfer and Generalization

  1. Overfitting / Bias-Variance Tradeoff
    1. Grab too few datapoints, create a complex model, and assume that it will transfer cross-domain
    2. As the level of abstraction rises, there’s more data to be seen, and so it’s easier to accidentally create a biased sample
    3. Large data also allows you to more accurately validate the abstraction
    4. Magic numbers… everything feels overfit
    5. Each Abstraction has to trade generality for nuance
      1. Without complexity (conditionals, context dependence, additional features / factors)
  2. Forming an abstraction in a way such that it fails to transfer
    1. Ex., Statistics / Probability Theory
    2. Wrong level of analysis to see across many tasks

Information Loss

  1. Information lost in classification, or as we move up levels of abstraction
    1. Ex., so much more information in a particular dog (its color, its personality, its sound) than in the category of dog in general.
  2. Information lost in binning
  3. Major pitfall is to have lost so much information that the representation isn’t predictive of important outcomes anymore
  4. Computational difficulties in maintaining information
    1. If you abstract over every property of the underlying object and hold the intersection of all of those abstractions, you may be able to keep the same amount of information. But now your model is a different type of complex.
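The first point above can be made concrete with a tiny illustration (the animals and attributes here are invented). Once individual dogs are binned into the bare category “dog”, the attributes that distinguished them are gone, and any prediction that depended on those attributes is no longer possible.

```python
# Two datapoints rich in attributes that the category label will discard.
dogs = [
    {"name": "Rex",  "color": "black", "weight_kg": 30},
    {"name": "Fifi", "color": "white", "weight_kg": 4},
]

def abstract_to_category(animal):
    """Move up a level of abstraction: keep only the category label,
    discarding every individuating attribute."""
    return "dog"

labels = [abstract_to_category(d) for d in dogs]

# Before binning the two animals are distinguishable; afterwards they
# are not, so a downstream question like "will it fit in a handbag?"
# (which depends on weight) can no longer be answered from the labels.
distinct_before = len({d["name"] for d in dogs})
distinct_after = len(set(labels))
```

The loss is not a bug but the price of the abstraction; the pitfall is losing so much that the representation stops predicting the outcomes you care about.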

Conflation, Unlike Objects & False Transfer

  1. Labeling a group by abstracting over objects that share one property but not others creates conflation between the unlike properties
    1. Properties shared among some but not all group members get generalized to being a property of the group
  2. When two objects share a property, a metaphor can be made between them. But the metaphor breaks down over some properties and not others. And so recognizing the limitations of transfer is a process in and of itself, that can fail and lead to erroneous generalization.
    1. A classic example is a metaphor to a positively connoted thing, to compliment or elevate something - or a metaphor to a negatively connoted thing.
    2. Russell conjugates in description (making the metaphor’s implicit abstraction concrete) are also good examples of the manipulability of transfer.
  3. Motte-and-bailey - moving between defensible and indefensible definitions under the same word.

Excessive Height

  1. Height obscures underlying content, making grounding more difficult
  2. Greater height exposes more datapoints, making it easier to tell a story for any thesis
    1. Height demands more rigorous statistical claims
    2. Positive example generation is the default mode in making arguments / thinking about theses, and is much easier at height
  3. ‘Category error’ problems start to get bad…
    1. What is a category error? It implies logic, or fit, or rules / constraints around objects… There’s this question
      1. Ex., adding unlike units together, adding pressure to height
      2. Not all category errors are made equal. Adding a height to a volume is understandable… it can give you an approximation of another volume.
        1. At a higher level of analysis, adding ‘space’ to ‘space’ seems fine… And so abstraction can fuzz the details.
    2. Often there are multiple representations, at different levels of fuzz, and you pick the one that allows it to make sense…
    3. Sympathetic view - Structure is preserved among unlike objects
      1. Sometimes, adding height to volume gives you stronger predictive validity
    4. Unsympathetic view - The heights of abstraction obscure the category errors that would be clear at a lower level
    5. With height, more category errors occur - breakdowns in the fit between objects at lower levels of analysis.

Abstraction as an Unseen Assumption / Lack of Grounding

  1. Remembering the model, forgetting the data

  2. Generalization from too few datapoints (even just one)

  3. Lack of depth means it’s difficult to expose failed transfer

    1. Unlike Objects & False Transfer
      1. Memetics
    2. Excessive Height
    3. Abstraction as an unseen assumption
      1. Reductionism
    4. Lack of grounding to abstractions
      1. Social reality
    5. Conflation


