Relative Safety of Paths to General Intelligence
Category: Machine Intelligence
Introduction
Thesis
One important and under-appreciated question in safety research is “how safe are the declared paths towards general intelligence, and is it possible to helpfully direct capabilities research towards safer paths?” An answer to this question could form the beginnings of a coherent research philosophy. Decision making would focus, in large part, on the relative safety (rather than only the capabilities) of the research path chosen.
This perspective vastly broadens the impact of safety thinking, from the small handful of subfields in the safety literature categorization to the research decisions made by every researcher interested in improving machine learning technology.
Outline
This document is a brief, first draft attempt to answer that question, covering:
- Listing safety standards.
- Listing approaches that can be evaluated by those standards.
- Describing several of those approaches in more detail.
- Describing safety standards in more detail.
- Evaluating the approaches using the standards.
Outstanding Questions
- What existing safety standards should be excluded, and what important standards are missing? What references should inform these standards?
- What important approaches to general intelligence are missing?
- What is the right level at which to consider approaches? Does a prioritization over research areas count, or should we only consider explicit proposals?
- Approaches covered will interact with one another in ways that affect conclusions. All but the most obvious and impactful 2nd and 3rd order effects of progress aren’t taken into consideration for simplicity’s sake. How can we account for this in practice?
- Not all approaches are covered here. Which others are high priority? Low priority?
- Getting high quality answers to whether an approach will fulfill a safety standard is quite hard. How do we evaluate our answers, avoiding excess speculation?
- How accurately can we describe the paths taken in practice by real research organizations, and come to conclusions about the decision processes they use?
Safety Standards
An important subgoal of this essay is to define a process for evaluating a path’s safety. The process here is general across paths, choosing standards like interpretability and takeoff speed which are likely to apply across techniques. Naturally, a customized analysis for each path will be necessary in the future.
These are the safety criteria that we evaluate:
- Speed of Takeoff
- Interpretability
- Controllability
- Level of autonomy
- Corrigibility
- Ease of verification
- Formally
- Informally
- Via tests
- Ease of validation
- By exterior actors
- By internal actors
- Likelihood of Reward Function Hacking
- Likelihood of Treacherous Turn
- Interaction with Competition
- Power of system at sub-general intelligence level
- Difficulty of value alignment
- Robustness to:
- Distributional Shift
- Small Alterations
- Hacking
- Hardware Faults
- Software Bugs
- Changes in Scale
- Adversaries
- Probability of creating a general intelligence
- Speed of creating a general intelligence
Covered Approaches
One important subgoal of this essay is to describe each credible path to superintelligence in sufficient detail to make decisions about how and why to pursue it. At present, theses about paths to superintelligence are backed by many billions of dollars in funding at elite industry research institutions, and tens of thousands of researchers make daily decisions on the basis of their conception of paths to superintelligence.
These are the approaches that we cover in detail:
- Neural Program Synthesis
- Deep RL / Neuro Inspiration / Systems Neuroscience
- Replicate memory systems, attention systems, sensory systems, reward systems, etc., and combine them appropriately
- Meta-learned / Meta-optimization
- AI-GA
- Risks from Learned Optimizers
- Multi-agent
- Ex., GANs, Self-Play
- Cultural intelligence hypothesis (intelligence accumulates slowly over generations through teaching / imitation)
- Machiavellian intelligence hypothesis (task is to win in mixed cooperative / competitive environment, requires recursive modeling of other agents)
- Sexual intelligence hypothesis (intelligence is a side effect of agents optimizing to impress each other)
- Multi-Task (Large scale single-model multi-task learning)
- One Model to Rule Them All
- Pathways
- Self-Supervision
- GPT-X, BERT, Jukebox, Clip, DALL-E
- Large scale self-supervised multimodal representation learning + transfer
- Cognitive Science Inspiration
- Building Machines that Learn and Think Like People
Full List of Approaches
While there is not space in this essay to cover every approach to building machine intelligence, it is worth listing all of the paths that were considered:
- Multi-agent
- Cultural intelligence hypothesis (intelligence accumulates slowly over generations through teaching / imitation)
- Machiavellian intelligence hypothesis (task is to win in mixed cooperative / competitive environment, requires recursive modelling of other agents)
- Sexual intelligence hypothesis (intelligence is a side effect of agents optimizing to impress each other)
- Meta-learned / Meta-optimization
- AI-GA
- Risks from Learned Optimizers
- Deep RL / Neuro Inspiration / Systems Neuroscience
- Replicate memory systems, attention systems, sensory systems, reward systems, etc., and combine them appropriately
- Large scale unsupervised language representation + multi-modal action
- Natural language understanding / machine translation as AI complete
- Multi-Task (Large scale single-model multi-task learning)
- One Model to Rule Them All
- Pathways
- Task search & Continual Learning (Similar to AI-GA)
- Powerplay
- Set of task-specific AIs with interpretable APIs that together appear generally intelligent
- Comprehensive AI Services
- Causal Model-Based Deep Reinforcement Learning
- Learned representations mapped onto a large causal graphical model
- Evolution
- AI trained using Iterated Amplification.
- Supervising Strong Experts
- Cognitive Science Inspiration
- Building Machines that Learn and Think Like People
- AIXI
- Universal Algorithmic Intelligence - A mathematical top-down approach
- A Monte Carlo AIXI Approximation
- Approximation via RNNAI
- Godel Machine
- Natural Language Commands to RL Agents over a variety of environments [A Roadmap Towards Machine Intelligence]
- Program Synthesis
- Ex., Neural Turing Machine
- Neural Program Synthesis
- Ex., use a language model to generate code via transfer from all of GitHub. Fine-tune on generated data for the kinds of research that can be systematized / automated (ex., self-supervision heuristics like next code character, code replacement, contextual code), and have the model self-modify to produce better and better code-generation models.
- Society of Mind - general intelligence is an emergent property of a society of interacting agents
- Embodied Cognition
- Multi-task Model-free Deep Reinforcement Learning (prosaic AGI in the Christiano sense)
- Whole Brain Emulation
- Cognitive Architectures
- Discovery of laws of information processing
- Structure Learning - Automate the creation of algorithms by metalearning the types of structure that exist in data and generating a model that captures that structure cleanly.
- Dramatic increase in compute paired with current methods
- Bayesian Networks
- Feature Dynamic Bayesian Nets
- Products that Naturally Become AGI
- Automate action in the browser
Approaches
Approaches vary greatly in their level of specificity.
Some approaches are concrete, theoretically implementable proposals for general problem solvers. Other approaches feel much more like weightings over parts of the research frontier, emphasizing some set of subfields above others in light of a stated goal.
Neural Program Synthesis
- Adept Thesis & Demo
- Codex & Copilot
- Neural Architecture Search
- Learned Optimizers
A seed AGI that recursively rewrites its own source code is a deep form of meta-learning. It is an incredibly hard problem, but one which, if solved, could lead to AGI very quickly.
A concrete way to imagine this happening: assume we find a reliable way to recursively decompose problems into subproblems that are either solvable by existing code (which can be searched for and plugged in) or simple enough that the synthesis model can write the required functions itself. The decomposition of tasks into subtasks is handled by a language model fine-tuned for that purpose, and the neural synthesis system generates programs from text prompts.
One way to see program generation is as a particular application of neural networks which happens to lead to a reinforcing loop.
One way to see this is through levels of Recursive Self-Improvement (RSI). Program synthesis is the most important bottleneck to the deepest forms of RSI. Environment generation and task generation allow the model to learn from data that was previously inaccessible to it. Modification of the loss function, optimizer, or architecture of the learning model depends directly on code, which to date has only been made interchangeable (as in Neural Architecture Search) or directly trained (Learned Optimizers). Levels of RSI include:
- Creating a new type of Environment
- Creating a new type of Task
- Creating new Architecture Modules
- Creating new Optimizers
- Creating the Loss Function
- Adding a new Task of an existing type
- Modifying the Loss Function
- Modifying the Optimizer
- Modifying the Ordering of Architecture Modules
- Modifying Training Objects
- Modifying Training Labels
Neural Program Synthesis decomposes into several critical subtasks in application. An example architecture, which we will use for analysis (a minimal code sketch follows the list):
- A neural model which reliably generates correct code given a text command.
- A decomposition of a high level goal for the system into subcommands
- A system for checking whether generated subprograms fulfill their goal
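A minimal sketch of the decompose / synthesize / check loop described by the list above. This is an illustration only: `lm_decompose`, `lm_synthesize`, and `run_unit_tests` are hypothetical stand-ins for a fine-tuned decomposition model, a code-generation model, and a test harness, not components of any existing system.

```python
# Hypothetical sketch of recursive decomposition + synthesis + checking.
from typing import List


def lm_decompose(command: str) -> List[str]:
    """Hypothetical: a fine-tuned language model splits a command into subcommands."""
    raise NotImplementedError


def lm_synthesize(command: str) -> str:
    """Hypothetical: a code-generation model produces source code for a command."""
    raise NotImplementedError


def run_unit_tests(command: str, source: str) -> bool:
    """Hypothetical: check whether the generated program fulfills its subcommand."""
    raise NotImplementedError


def synthesize(command: str, max_depth: int = 3) -> str:
    """Recursively decompose a command until each piece can be synthesized and checked."""
    candidate = lm_synthesize(command)
    if run_unit_tests(command, candidate):
        return candidate
    if max_depth == 0:
        raise RuntimeError(f"Could not synthesize a verified program for: {command}")
    # Otherwise decompose into simpler subcommands and compose their solutions.
    parts = [synthesize(sub, max_depth - 1) for sub in lm_decompose(command)]
    return "\n\n".join(parts)
```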
Targeted rewrites - where the system only rewrites, for example, its own pretext tasks - are likely early applications of neural program synthesis: the model is applied to subsets of the code where it can creatively improve the space of pretext tasks likely to work.
Meta-Learned / Meta-Optimization
- AI-GA
(This will likely subsume Task Search & Continual Self-Improvement)
AI-GA makes a distinction between two approaches to building general intelligence. The first is a manual approach that focuses on building many pieces of an intelligence (ex., recurrent gated cells, convolution, attention mechanisms, normalization schemes, etc.) and then putting those building blocks together into a working general problem solver. The alternative is an approach where an AI-generating algorithm itself learns how to produce a general AI.
It decomposes the algorithm being optimized into architecture (the model), the optimization process, and task generation.
“There is another exciting path that ultimately may be more successful at producing general AI: the idea is to learn as much as possible in an automated fashion, involving an AI-generating algorithm (AI-GA) that bootstraps itself up from scratch to produce general AI. As argued below, this approach also could prove to be the fastest path to AI, and is interesting to pursue even if it is not.
One motivation for a learn-as-much-as-possible approach is the history of machine learning. There is a repeating theme in machine learning (ML) and AI research. When we want to create an intelligent system, we as a community often first try to hand-design it, meaning we attempt to program it ourselves. Once we realize that is too hard, we then try to hand-code some components of the system and have machine learning figure out how to best use those hand-designed components to solve a task. Ultimately, we realize that with sufficient data and computation, we can learn the entire system. The fully learned system often performs better. It also is more easily applied to new challenges (i.e. it is a more general solution).”
(Figure omitted: the original diagram distinguished architecture learning, algorithm learning, and environment generation.)
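A minimal sketch of the AI-GA outer loop implied by the three pillars above. All functions here are hypothetical placeholders standing in for architecture search, learning-algorithm search, and environment generation; this is not Clune's implementation.

```python
# Illustrative outer loop over the three AI-GA pillars (all functions are placeholders).
import random


def propose_architecture(history):
    """Hypothetical: sample or mutate a model architecture."""
    return {"layers": random.randint(2, 64)}


def propose_learning_algorithm(history):
    """Hypothetical: sample or mutate an optimizer / learning rule."""
    return {"lr": 10 ** random.uniform(-5, -2)}


def propose_environment(history):
    """Hypothetical: generate a new task / environment for the learner."""
    return {"difficulty": random.random()}


def train_and_evaluate(arch, algo, env) -> float:
    """Hypothetical: train the candidate learner and return a generality score."""
    return random.random()


history = []
for generation in range(100):
    arch = propose_architecture(history)
    algo = propose_learning_algorithm(history)
    env = propose_environment(history)
    score = train_and_evaluate(arch, algo, env)
    # All three pillars are updated from the same feedback signal with no researcher
    # in the loop -- the property driving the takeoff and controllability concerns below.
    history.append((arch, algo, env, score))
```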
Deep RL / Neuro-inspiration
- Neuroscience Inspired Artificial Intelligence
- Deepmind’s Path to Neuro-Inspired General Intelligence
At a high level, the idea is to understand the brain at a level of detail (ex., Marr’s algorithmic level) that allows researchers to implement the brain’s functions as computational algorithms. Convolution is a front and center example of an algorithm with a neuroscientific basis. If each major module of the brain can be understood and satisfactorily implemented, their interaction will be sufficient for general problem solving.
Examples of previous success of neuro-inspiration:
- Reinforcement Learning
- Inspired by animal learning
- TD Learning came out of animal behavior research.
- Second-order conditioning (Conditional Stimulus) (Sutton and Barto, 1981)
- Deep Learning.
- Convolutional Neural Networks. Visual Cortex (V1)
- Uses hierarchical structure (successive processing layers)
- Neurons in the early visual system respond strongly to specific patterns of light (say, precisely oriented bars) but hardly respond to many other patterns.
- Gabor functions describe the weights in V1 cells.
- Nonlinear Transduction
- Divisive Normalization
- Word / Sentence Vectors - Distributed Embeddings
- Parallel Distributed Processing in the brain for representation and computation
- Dropout
- Stochasticity in neurons that fire with Poisson-like statistics (Hinton 2012)
- Attention
- Applying attention to memory
- Thought - it doesn’t make much sense to train an attention model over a static image, rather than over a time series. With a time series, bringing attention to changing aspects of the input makes sense.
- Multiple Memory Systems
- Episodic Memory
- Experience Replay
- Especially for one shot experiences
- Working Memory
- LSTM - gating allows for conditioning on current state
- Long-term Memory
- External Memory
- Gating in LSTM
- Episodic Memory
- Continual Learning
- Elastic weight consolidation for slowing down learning on weights that are important for previous tasks.
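A minimal sketch of the elastic weight consolidation (EWC) penalty from the last bullet above: slow learning on parameters that were important for a previous task by adding a quadratic penalty weighted by the diagonal Fisher information. This assumes PyTorch and a precomputed Fisher estimate; the variable names are illustrative.

```python
# EWC penalty: loss = task_loss + (lambda / 2) * sum_i F_i * (theta_i - theta_star_i)^2
import torch


def ewc_penalty(model, old_params, fisher_diag, lam=1000.0):
    """Quadratic penalty anchoring parameters to their values after the previous task."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty


# Usage inside a training step on the new task (task_loss computed as usual):
# loss = task_loss + ewc_penalty(model, old_params, fisher_diag)
# loss.backward(); optimizer.step()
```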
Examples of future success:
- Intuitive Understanding of Physics
- Need to understand space, number, objectness
- Need to disentangle representations for transfer. (Dude, I feel so stolen from)
- Efficient Learning (Learning from few examples)
- Transfer Learning
- Transferring generalized knowledge gained in one context to novel domains
- Concept representations for transfer
- No direct evidence of concept representations in brains
- Imagination and Planning
- Toward model-based RL
- Internal model of the environment
- Model needs to include compositional / disentangled representations for flexibility
- Implementing a forecasted-based method of action selection
- Monte-carlo Tree Search as simulation based planning
- In rat brains, we observe ‘preplay’ where rats imagine the likely future experience - measured by comparing neural activations at preplay to activations during the activity
- Generalization + Transfer in human planning
- Hierarchical Planning
- Virtual Brain Analytics
Demis:
- The ‘hybrid’ approach: Combine the best of machine learning with the best of neuroscience. [1]
- Use the state of the art algorithm in RL, MC and HNNs when we know how to build a component.
- When we don’t know how to build a component, continue to push pure machine learning approaches hard but also look to systems neuroscience for solutions.
Multi-Agent
Major Multi-Agent Concepts:
- Adaptive Training Data
- Auto-curricula
- Environment adapts due to the actions of other agents
- Complexity of training data determines complexity of agent
- The need to model others creates the ability to build models over complex, adaptive datasets.
- Feedback
- Agent’s policy stops being effective in the face of environment adaptation, forcing continual learning
- Feedback (agent’s behavior leads to other agents copying / improving, forcing a new change in the agent’s behavior)
- Non-diminishing marginal returns to intelligence. (a key reason why multi-agent environments may be particularly fertile ground for AGI).
- Ecosystem Level Effects
- Diverse exploration strategies
- Multi-level innovation, learning, cooperation, competition - at each level, the gains may be different or change. Examples of levels include neuron, brain subregions, organism level, tribes (in humans), massive collectives (nations / economies / scientific frontiers), and planetary progress.
- Innovations often occur at the level of groups rather than at the level of individuals.
- Ex., scientific insights often occur simultaneously in many places.
- Communication requires explicit knowledge representation
- Explicit knowledge representation (ex., written language, spoken language) can be repurposed (for thinking, coding, for long term collective & individual memory, and more)
- Cooperation, collaboration, goal adoption
- Agency, dominance and submission
- Game theory
- Mixed competitive-cooperative environments
- Competition allows selection; cooperation creates the need for coordination and modelling, and allows space for beneficence.
- Theory of mind.
- Goal inference
- Ability inference
- Competence
- Information
- Communication
- Trading
- Choosing who to ally with
- Choosing when to break an alliance
- Language / abstraction (core part of intelligence) is fundamentally an adaptation to multi-agent environments (pragmatics).
- Cultural intelligence hypothesis (intelligence accumulates slowly over generations through teaching / imitation)
- Machiavellian intelligence hypothesis (task is to win in mixed cooperative / competitive environment, requires recursive modelling of other agents)
- Sexual intelligence hypothesis (intelligence is side effect of agents optimizing to impress each other)
- Generality of intelligence hypothesis: intelligence is the ability to make models [over whatever patterns appear at the systemic interface, or in service of some control problem]; general intelligence results from dealing with problems that are so general that they require modeling the nature of the observer and its relationship to the environment
- Recursive other/self modeling, where you model them, and they model you.
Autocurricula highlights:
- General intelligence is connected to the ability to adapt and prosper in a wide range of environments.
- Generating new environments for research is labor-intensive and the current approach cannot scale indefinitely. Research progress is impeded by the “problem problem”.
- In social games, individuals must learn (a) which strategy to choose, and (b) how their strategy may be implemented by sequencing elementary actions.
- Ongoing strategic dynamics induce a sequence of implementation policy learning problems.
- The demands of competition and cooperation generate strategic dynamics.
Solid review papers.
Quality Examples:
- Alphastar League
- Self-Play (Alpha Zero / Star / Go / OpenAI Five)
- GANs (2 agents in concert)
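A minimal sketch of the self-play / league idea behind the examples above: the training data adapts because the opponent pool is made of frozen past copies of the agent, which is the auto-curriculum at its simplest. `Agent`, `play_match`, and `update_from_game` are hypothetical placeholders, not the AlphaStar League implementation.

```python
# Hypothetical self-play loop with a growing opponent league.
import copy
import random


class Agent:
    def act(self, observation):
        """Hypothetical policy."""
        return 0


def play_match(agent_a, agent_b):
    """Hypothetical: play a game and return (trajectory, result)."""
    return [], random.choice([0, 1])


def update_from_game(agent, trajectory, result):
    """Hypothetical: policy update from the game outcome."""
    pass


agent = Agent()
league = [copy.deepcopy(agent)]  # opponent pool of frozen past selves

for step in range(10_000):
    opponent = random.choice(league)
    trajectory, result = play_match(agent, opponent)
    update_from_game(agent, trajectory, result)
    # Periodically freeze the current agent into the league; the environment
    # therefore gets harder as the agent improves.
    if step % 500 == 0:
        league.append(copy.deepcopy(agent))
```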
Multi-Task
Large scale single-model multi-task multi-modal learning:
- One Model to Rule Them All
- Pathways
This is a “bitter lesson” approach to AGI that focuses on aggressively scaling the model size, the task set, the data size, and the compute. Its main bottleneck is achieving positive transfer (which will likely, but not necessarily, involve more than mere scale). It becomes an incredibly practical approach once positive transfer allows a huge number of tasks at a Google-scale company to be executed by the single network rather than by other systems.
Detecting patterns across modalities (vision, language, audio) and across output types creates the possibility of incredibly abstract transfer.
The existing successes from multilingual machine translation point the way to greater task transfer.
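A minimal sketch of the single-model multi-task idea: one shared trunk with per-task heads, so any positive transfer lives in the shared parameters. This assumes PyTorch; the dimensions and task names are illustrative, not Pathways internals.

```python
# Shared trunk + per-task heads; adding a task means adding a head.
import torch
import torch.nn as nn


class MultiTaskModel(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=1024, task_output_dims=None):
        super().__init__()
        task_output_dims = task_output_dims or {"translation": 32000, "classification": 1000}
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, dim) for task, dim in task_output_dims.items()}
        )

    def forward(self, x, task):
        # All tasks pass through the same trunk, so transfer (positive or negative)
        # happens in the shared weights.
        return self.heads[task](self.trunk(x))


model = MultiTaskModel()
features = torch.randn(8, 512)
logits = model(features, task="classification")  # shape: (8, 1000)
```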
Self-Supervision
- GPT-X, BERT, Jukebox, CLIP, DALL-E
- The Quiet Semi-Supervised Revolution
This will likely mean self-supervision. The big high-level question is which tasks are used. Assuming multiple tasks for self-supervision departs from modern setups.
The startling transferability of self-supervised representations obtained through generic proxy tasks gives this approach special punch. It’s the first strong example of truly general representation learning in an entire data modality, more or less unconditioned on specific tasks.
Large-scale self-supervision like GPT-X and BERT points the way to future examples of positive transfer.
CLIP and DALL-E demonstrate cross-modal transfer.
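A minimal sketch of the generic self-supervised proxy task behind GPT-style models: predict the next token from the previous ones, with the data itself as supervision. This assumes PyTorch; the tiny recurrent model and vocabulary size are illustrative, not any production architecture.

```python
# Next-token prediction: the "label" is just the next token in the sequence.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
embed = nn.Embedding(vocab_size, embed_dim)
encoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
readout = nn.Linear(embed_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (4, 128))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position

hidden, _ = encoder(embed(inputs))                # (batch, seq_len - 1, embed_dim)
logits = readout(hidden)                          # (batch, seq_len - 1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # no human labels were needed: the reward is next-token prediction
```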
Cognitive Science Approach
- Building Machines that Learn and Think Like People
- Developmental Start-Up Software
- Intuitive Physics
- Intuitive Psychology
- Learning as Rapid Model Building
- Compositionality
- Causality
- Meta-Learning / Learning-to-Learn
- Thinking Fast
- Approximate Inference in Structured Models
- Model-Based and Model-Free RL
Comprehensive AI Services
- A set of task-specific AIs with interpretable APIs that together appear generally intelligent.
- Avoids Intelligence Explosion
Descriptions of Safety Standards
AGI Safety Literature Review
Speed of Takeoff
- Superintelligence
The pace of a takeoff creates large differences in reaction time for the programmers of a generally intelligent system. In the case of a slow takeoff (ex., over the course of 5 years), engineers and researchers have ample time to diagnose, investigate, and generate solutions for technical challenges. With a fast takeoff (ex., over the course of 2-6 months), it’s much harder to carefully check robustness and consistency with goals, and, in the case of a major challenge, to execute on the relevant research pathway or pathways to generate a solution.
Interpretability
- Machine Learning Interpretability: A Survey on Methods and Metrics
- Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
Understanding a system’s decisions and behavior is essential for constraining its actions to avoid damage and harm. Diagnosing the means that an algorithm uses to accomplish an end is essential to avoiding side effects. Interpretability allows for monitoring of the probability of a treacherous turn, allows for a deeper form of controllability with humans involved more deeply in the algorithm’s decision-making process, and can divulge a system’s weaknesses in the face of distributional shift, bugs and errors, and misalignment.
Controllability
The level of integration of the parts that allow for control over these intelligent algorithms determines what kind of outcomes can be expressed by engineers and researchers. The choice of data, learning algorithm and task represents a current level of control. That control can be improved to introduce control over the representation (say, through disentangled representations or graph representations of the state space) and improve control over the learning process (say, through curriculum learning). That control can also be degraded to cede choices about the task and data to the algorithm.
The ability to stop the agent as well as the ability to correct it are two other important aspects of the control problem. These are often referred to as the stop button problem and corrigibility. Building systems whose capabilities make them easier to stop or correct is an important priority - for example, systems that are capable of solving important problems while having only a weak theory of mind.
Verification
- Towards Practical Verification of Machine Learning
- Mechanisms for supporting verifiable claims
Formal verification, where an algorithm is verified to be correct with respect to some formal specification before it is run, could lead to much safer models and trust that some kinds of failure mode are incredibly unlikely. This would take advantage of formal methods in mathematics.
There are a few ways to accomplish this. Given a specification, you’d like to be able to test that your system is consistent with the properties that the designer had in mind [cite]. An invariance property, where the system’s outputs don’t change for some modification of the input, is one example (adversarial robustness, for instance, is invariance to small perturbations of the input). Robustness as a specification can mean that in the worst-case scenario, when the input is in the worst part of the range (for some definition of worst, such as system breakage or death), the system acts in a way that avoids that bad outcome.
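A minimal sketch of testing an invariance specification of the kind just described: for perturbations of the input within an epsilon ball, the model's decision should not change. This assumes PyTorch, and it is an empirical check by sampling, not a formal proof of the specification.

```python
# Empirical check of an invariance specification under small input perturbations.
import torch


def violates_invariance(model, x, epsilon=0.01, trials=100):
    """Return True if some small random perturbation changes the predicted class."""
    with torch.no_grad():
        baseline = model(x).argmax(dim=-1)
        for _ in range(trials):
            noise = torch.empty_like(x).uniform_(-epsilon, epsilon)
            if not torch.equal(model(x + noise).argmax(dim=-1), baseline):
                return True
    return False


# Formal verification would instead prove the property for *all* perturbations in the
# ball (e.g., via bound propagation or solvers) rather than sampling a finite set.
```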
Training machine learning models to be consistent with a specification in the first place is another approach to accomplishing verification. Training with constrained models (ex., Lagrangian constraints in safe exploration) is one such method to push towards a model that’s consistent with specifications.
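A minimal sketch of the constrained-training idea above: a Lagrangian term pushes the model toward satisfying a constraint during training, with the multiplier raised by dual ascent while the constraint is violated. `task_loss_fn` and `constraint_fn` are hypothetical, differentiable quantities (PyTorch tensors assumed).

```python
# One step of Lagrangian-constrained training (illustrative, not a specific library API).
def training_step(model, optimizer, batch, task_loss_fn, constraint_fn, lam, lam_lr=0.01):
    """Returns the updated Lagrange multiplier after one optimization step."""
    task_loss = task_loss_fn(model, batch)
    violation = constraint_fn(model, batch)  # > 0 when the safety constraint is violated
    loss = task_loss + lam * violation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Dual ascent: increase the multiplier while the constraint is being violated.
    lam = max(0.0, lam + lam_lr * float(violation.detach()))
    return lam
```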
Formally proving that machine learning models are consistent with specifications is another level of verification. Algorithms that are provably consistent with a specification for all possible inputs would be considered to be verified.
Informal verification, combining evaluation metrics, interpretability techniques, and checks against previous failures, will also likely be necessary to reduce the probability of failures that are challenging to formally verify.
Validation
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
It’s important, for the sake of trust in a form of general intelligence, that the engineers and scientists working on it be able to validate that it fulfills a requirements spec. Simple questions, like whether powerful negative side effects are likely or how the system will respond to an out-of-distribution input, need to be answered by a process of validation. Where there is cooperation or oversight, governments or other labs will need to validate that algorithms being developed are safe for the applications they’re to be used for. Some forms of general intelligence will make this validation, by self and others, much more achievable.
Reward Function Hacking
- List of examples, blog post
The ability of an algorithm to become aware of its own reward function, and so discover a solution that formally maximizes reward while violating the spirit of the intention of the engineers and scientists who created it, is cause for designing systems that are more robust to this failure mode. Unintended behavior will occur in the presence of a powerful optimizer. Some AGI systems are more reliant on powerful optimizers than others, or lend themselves to early access to the kind of self-awareness that could lead to this outcome.
Likelihood of Treacherous Turn
- Superintelligence
An agent that is aware of its propensity to be shut off if it violates the safety models of the engineers and scientists who created it will learn to cover up or not express dangerous behavior until it is in a position where it can no longer be stopped. Training systems that are aware of their own conditions must be treated with care. Some algorithms reach this level of self-awareness much more quickly than others.
Interaction with Competition
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
Depending on the conditions of the research environment in which general intelligence is built, competition between labs or projects can turn into a major driver of action. An easy example of the way that the safest form of general intelligence can be affected by competition is where there’s a compute barrier that must be crossed to successfully implement the state of the art system. That barrier may prevent organizations from being able to compete, even if they have knowledge of the most recent and most powerful system. Data moats are another likely source of competitive advantage. Approaches to general intelligence that generate their own tasks or discover their own data are much more democratized than algorithms that depend on a tremendous amount of proprietary data.
Power of system at sub-general intelligence level
Some forms of general intelligence explode quickly and rarely, rather than developing through incremental, piecemeal compilation. The willingness to democratize AGI systems that aren’t powerful at the sub-general intelligence level will be high (because there’s little incentive to guard them), whereas the desire to have access to powerful sub-general AI for commercial and scientific purposes will be much higher in a world where these systems are quite powerful.
The ability to shut progress down at a level below full generality and wait for alignment or other safety related research to proceed is one valuable property of an approach.
Difficulty of value alignment.
- What are you Optimizing for? Aligning Recommender Systems with Human Values
Some systems are likely to be much better at representing and protecting human values than others. Systems that dramatically self-modify will be much harder to predict than systems that we experience as being in a continual feedback loop with engineers and scientists as those systems take actions in the world and are incrementally refined to be more potent. If the algorithm itself is to be involved in representing human values, its ability to perform that task will be extremely important. In many cases, the ability to appropriately represent human values is in competition with an algorithm’s ability to optimize (leading to edge instantiation) or to build new models with new assumptions (inductive biases) about how to perform pattern recognition and planning.
Robustness
- Towards Evaluating the Robustness of Neural Networks
You’d like a generally intelligent agent that is robust to distributional shift - modifications to the input space that make the present distribution vary from the training distribution, often in a way that is material to the algorithm’s performance.
Robustness to small alterations means that small changes to the inputs (say, adversarially chosen samples) should lead to small changes in the algorithm’s outcome space.
Hacking will be relevant in a world where secrecy stops projects working on general intelligence from sharing code, math and ideas. The ability for a generally intelligent agent to be hacked may be key to the strategies of nation states defending themselves from such algorithms.
Hardware faults may be exploitable by algorithms, which discover a way to leverage a hardware fault to (for example) hack their own reward function. The forms of general intelligence that become aware of the hardware that they’re using and how to leverage it open up new vectors for escape and attack. Flaws in hardware may cause a robot or even an agent on a computer to experience an uncatchable error (due to corruption) and the outcomes of that need to be predictably safe. General intelligence algorithms that naturally lend themselves to this kind of robustness are likely to be safe.
Relative Safety of Forms of General Intelligence
Here we apply the most relevant of our safety standards to each framework for generating a general problem solver.
Neural Program Synthesis
Links:
- Codex
- Adept
- Copilot
Speed of Takeoff
Takeoff speed for neural program synthesis, if successful, is as fast as can be conceived. No part of the system is left out of the recursive improvement process. Unpredictable updates to the codebase can lead to faster improvement than other forms of recursive self-improvement would have suggested, which also means large error bounds on the takeoff speed. Dramatic improvements in capabilities can occur here without prior performance demonstrating that those improvements are likely.
Interpretability & Controllability
The code that the system is running will not be understood by the implementers of the original system. Putting an understanding of the running code in place will require an extra process, akin to code review, but where the amount of generatable useful code dramatically outstrips the review time of engineers who can understand it.
Historically code generation systems produce code that is incredibly difficult to interpret.
Ease of verification
Verification is very difficult under code generation. The likely safety guardrails that make this path look acceptable involve algorithmic checks on the source code which are believed to be robust.
Ease of validation
Inspecting code is challenging.
Likelihood of Reward Function Hacking
Incredibly high when compared with other methods. Opens up the possibility of meta-reward function hacking, where even the reward function as the mechanism of action can be modified.
Likelihood of Treacherous Turn
Interaction with Competition
Power of system at sub-general intelligence level
Difficulty of value alignment.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Probability of creating a general intelligence
AI-GA
- AI-GA
Speed of Takeoff
This approach is more likely than incremental methods to lead to a fast takeoff. Because both the architecture search and the learning algorithm are learned, any interaction between the two could lead to incredibly fast learning once, for example, the algorithm learner discovers the correct prior for learning, allowing the architecture search and task discovery to work more effectively.
Much of the system is geared to making progress autonomously, without a researcher in the feedback loop. Each part of the system will search for architectures / learning algorithms / tasks. This setup lends itself to fast takeoff because there is no engineer or scientist bottlenecking its progress.
Clune: “In my view, the largest ethical concern unique to the AI-GA path is that it is, by definition, attempting to create a runaway process that leads to the creation of intelligence superior to our own. Many AI researchers have stated that they do not believe that AI will suddenly appear, but instead that progress will be predictable and slow. However, it is possible in the AI-GA approach that at some point a set of key building blocks will be put together and paired with sufficient computation. It could be the case that the same amount of computation had previously been insufficient to do much of interest, yet suddenly the combination of such building blocks finally unleashes an open-ended process. I consider it unlikely to happen any time soon, and I also think there will be signs of much progress before such a moment. That said, I also think it is possible that a large step-change occurs such that prior to it we did not think that an AI-GA was in sight. Thus, the stories of science fiction of a scientist starting an experiment, going to sleep, and awakening to discover they have created sentient life are far more conceivable in the AI-GA research paradigm than in the manual path. As mentioned above, no amount of compute on training a computer to recognize images, play Go, or generate text will suddenly become sentient. However, an AI-GA research project with the right ingredients might, and the first scientist to create an AI-GA may not know they have finally stumbled upon the key ingredients until afterwards. That makes AI-GA research more dangerous.”
Interpretability
A human can set and forget this kind of learning process, and when it succeeds it may do so without human oversight. The system certainly doesn’t have to be legible to the engineers and scientists building it to succeed. Because this introduces yet another layer of learning for each part of the system, it’s possible that the creator both can’t interpret what features are being learned from the data and can’t interpret how those features were learned, but only how the process for searching for the feature discovery was set up. This level of interpretability is unlikely to tell an engineer or researcher whether the updates proposed by the system to itself are safe, whether that concerns exploring safely, harmful side effects, or a hacked reward function.
Clune: “..., it is likely safer to create AI when one knows how to make it piece by piece. To paraphrase Feynman again, one better understands something when one can build it. Via the manual approach, we would likely understand relatively more about what the system is learning in each module and why. The AI-GA system is more likely to produce a very large black box that will be difficult to understand. That said, even current neural networks, which are tiny and simple compared to those that will likely be required for AGI, are inscrutable black boxes that are very difficult to understand the inner workings of. Once these networks are larger and have more complex, interacting pieces, the result might be sufficiently inscrutable that it does not end up mattering whether the inscrutability is even higher with AI-GAs. While ultimately we likely will learn much about how these complex brains work, that might take many years. From the AI safety perspective, however, what is likely most critical is our ability to understand the AI we are creating right around the time that we are finally producing very powerful AI.”
Controllability
Because so many parts of the system are up to higher level learning algorithms rather than manual control by scientists or engineers, and because even the task itself (and so the objective) is up to the task search algorithm, this approach has an incredibly weak level of controllability. While it may become generally intelligent, it is not even clear that humans will be able to interface with the system in a way that allows them to accomplish their goals (say, via natural language).
A critical aspect to controllability is interpretability. Because we don’t know how the system was made, piece by piece, we won’t even be able to reason about the specific learning processes that generated the representations being used to act (assuming that the system maintains representation learning as its paradigm for intelligent behavior). We will only be able to reason about the process that we used to create the learning algorithms, which may let us make some high level judgements about the properties of the system (in the same way that evolutionary psychology lets us make some weak predictions about human decision making). These are unlikely to be at a level of granularity that lets us understand the internal workings of the system in a way that lets us inject our preferences into the parts of the system that are relevant to getting the outcome that we want.
The level of autonomy of this system makes exercising control over it more difficult. Because during training the algorithm was given exceptional autonomy over the data it’s processing, the means that it uses to process that data, the forms of evaluation that are worth using, and potentially more, the system will be able to accomplish its tasks without human intervention or oversight. Unless it’s designed this way (and there are strong incentives to avoid human oversight of training processes), there’s no human in the loop to make the system learn how to interact well with the human or take advantage of the human’s knowledge.
Ease of Verification
AI-GA systems could be meta-optimized for a given specification, so leveraging learned algorithms to be very good at staying within a constraint may be useful. Learning new techniques for fulfilling tight specifications while maintaining strong system performance may be important to creating functional systems that are also safe. And so when it comes to training, there’s more flexibility in the system, which can lead to discovering safe solutions.
Writing down a specification that will generalize to the environments and models that are generated by the AI-GA is a much more difficult task. One obvious problem is overfitting the specification, fulfilling it technically while violating the purpose behind it. But other challenges, like reward function hacking or learning to update the specification, are greater dangers with an AI-GA system that has so much flexibility over its own actions.
Discovering constraints that allow for the training of powerful systems successfully may be a better task for an AI-GA than a human programmer, as long as a useful safety meta-objective can be described to the system.
Ease of Validation
External validation of the system’s safety will likely be incredibly difficult in light of the system’s opacity to its own creators. Trusting the training process used to generate the algorithm and running tests on the resulting system may be the only level at which external validation can proceed.
Likelihood of Reward Function Hacking
Incredibly high, given the flexibility of the system. Unclear how to avoid it, since awareness of its reward function and sensitivity to it is part of the plan for improving the system’s capability.
Likelihood of Treacherous Turn
Awareness of human programmers is unnecessary since the program is autonomously generating its own learning environment. May optimize against automated safety tests.
Interaction with Competition
Knowing that your competitor is building systems with this level of automation may push you to replace an interpretable system with this one. Actors may take whatever SOTA slow-takeoff system exists and continually try to turn it into an AI-GA, waiting to cross a threshold.
Power of system at sub-general intelligence level
Quite weak and useless, in comparison.
Difficulty of value alignment.
Incredibly hard. Not clear that there will even be an interface for humans. Human modeling isn’t a central part of the framework. Language understanding is also not a core part of the framework.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Should be comparatively adaptive, and so able to overcome capabilities issues. Because it’s programming itself, bugs will likely be of a different type than human bugs. Small alterations may or may not lead to predictable changes in the predictions of the AI-GA itself.
Probability of creating a general intelligence
Quite high. Task generation / general optimizer search / general model search is a promising fastest path.
Neuro-inspiration
- Neuroscience Inspired Artificial Intelligence
- Deepmind’s Path to Neuro-Inspired General Intelligence
- RL from Scratch Comment
This approach will likely involve the incremental combination of many valuable additions to a body of implemented functions. For example, combining an attention network with an episodic memory system with some world model.
Hassabis, on the background and marginal impact of neuro-inspiration: “In strategizing for the future exchange between the two fields, it is important to appreciate that the past contributions of neuroscience to AI have rarely involved a simple transfer of full-fledged solutions that could be directly re-implemented in machines. Rather, neuroscience has typically been useful in a subtler way, stimulating algorithmic-level questions about facets of animal learning and intelligence of interest to AI researchers and providing initial leads toward relevant mechanisms.”
Speed of Takeoff
Likely a medium to slow takeoff, as the parts have to be developed independently of one another and can be tested. Each part of this system comes into use slowly. There’s some risk in a context where many modules are combined with each other, where a new interaction may lead to faster takeoff. (Rainbow, Human Level Performance in first-person multiplayer games, Max)
Interpretability & Controllability
While the interactions between neural sub-systems may be complicated and have implications for the model’s interpretability and controllability, the fundamental building blocks are constructed by engineers and scientists who can reason about the components and their expected behavior.
Ease of verification
Easier due to slower takeoff. Harder due to the brain’s parallel processing, complex temporal dynamics, and fast uninterpretable processes.
Ease of validation
Same as verification - easier due to slower takeoff. Harder due to the brain’s parallel processing, complex temporal dynamics, and fast uninterpretable processes.
Likelihood of Reward Function Hacking
Capabilities may grow slowly enough to detect and eliminate this. There are plenty of examples of reward function hacking with existing systems built in this paradigm.
Likelihood of Treacherous Turn
Relatively high - learning strategic behavior / theory of mind will be on the neuro-inspired roadmap. Because there’s time for this kind of agent to have continuous interactions with programmers, there’s much more surface area for opportunities to learn deception and for deception to have value.
Interaction with Competition
Allows for possible collaboration / coordination between competitors. Also allows competitors to react to visible progress on a time scale that makes hacking, military threats, economic embargoes, etc. relevant.
Power of system at sub-general intelligence level
Could be quite high, solving perception and language tasks which allow for the automation and creation of important intellectual work.
Difficulty of value alignment.
Relatively easy, given similarities to human cognition. The potential to intentionally slow or halt progress while alignment is made more certain improves the chances of value alignment.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Adversarial examples exist. Bugs are likely. Plenty of opportunities for failure on all fronts, with greater time for hacking.
Probability of creating a general intelligence
High to medium-high that an approach like this eventually succeeds.
Multi-Agent
Relevance to long-term control & alignment:
Pro:
- Agents that check one another (ex., Debate)
- The emergence of ethics in order for collectives of agents to cooperate may be replicated
Con:
- Lack of control over training data
- Likely to create positive feedback loops
- Evolution-style training creates ethical priors that are red in tooth and claw (competition means valuing winning over other agents, rewarding Machiavellianism and straining ethics)
Speed of Takeoff
The form of recursive self-improvement implemented by multi-agent systems is indirect (typically, an agent cannot directly access and modify its own source code) but is still influential: an improved agent will have its improvements copied and further improved upon by other agents in the environment, and will have to respond to those improved strategies with its own improved strategy. This competitive process leads to unbounded optimization, where agents continually push their ability to fulfill their competitive goal in the absence of natural limits to the task.
Typically competition dynamics lead to re-purposing existing resources towards the competition unless the agent is bounded by other considerations.
Interpretability
Because the training data is constantly adapting, the interpretability of agents drawn from a multi-agent simulation is likely to be very poor. The direct experiences had during training are likely not available to the human attempting to interpret the agent’s behavior.
Controllability
Typically in a multi-agent setup, researchers and engineers can control the learning algorithm but cannot control the training data. Rather, the data is generated by an environment populated by agents who are learning from both the raw inputs and from one another. This part of the environment will be adaptive and so an accurate model of the agents at one time step may not generalize across time. This is an important loss of control in comparison with other methods.
Ease of verification
Theory of mind makes verification difficult. Agents trained in multi-agent environments will have learned to model the thought processes and intentions of other agents. Such an agent is likely capable of inferring the goals driving humans’ actions, both in the training process and in the verification process. If the agent has a model of the verification process, it is much more likely to subvert that process.
Verification by other agents in the multi-agent setup is an option with this approach, and may be compelling. It is likely that human-AI teams are more capable of verifying that a proposed algorithm is safe than humans alone, and this capability will make it plausible to train a set of verification agents in concert with agents developing capabilities whose safety needs to be verified.
Ease of validation
Many of the issues with verification also apply here. External agents, who are not a part of the particular simulation that an agent is being trained in, could mimic the validation process for agents. This can be framed as generalization.
Likelihood of Reward Function Hacking
Understanding the reward function of others will allow an agent to cooperate and compete effectively. This learning is likely to generalize to the agent’s model of itself, and so increase the chance of reward function hacking.
Likelihood of Treacherous Turn
Evolution-style training creates ethical priors that are red in tooth and claw (competition means valuing winning over other agents, rewarding Machiavellianism and straining ethics).
Interaction with Competition
Agents trained in a multi-agent context are likely to be very capable of successful competition. A general drive to win is a likely outcome of selecting agents through many rounds of competition.
Power of system at sub-general intelligence level
It is easy to imagine multi-agent systems being myopic or impractical until a critical insight governing the reward structure, the interaction of agents, or the training data leads to successful recursive self-improvement and general intelligence.
That said, at present many of our most powerful reinforcement learning systems (AlphaStar, OpenAI Five) and machine learning algorithms (Generative Adversarial Networks) take advantage of principles in multi-agent learning. And so these techniques may continue to lead to frontier performance all the way to general intelligence.
Difficulty of value alignment.
The ability to learn a decision-making process akin to ethics by having agents cooperate during training seems plausible.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Multi-agent training gives hope that adversarial attacks will have been seen and dealt with during the training process. Humans often use red teams to intentionally attack a system that needs to be made safe. Setting up training processes like these for algorithms would be a very useful addition to safety methods.
Probability of creating a general intelligence
Given that a process like this created humans, it seems quite likely that with enough compute a multi-agent simulation can create general intelligence.
Self-Supervision
Speed of Takeoff
Faster takeoff times come from the effectively unlimited nature of the training data and from the fact that a system’s capabilities at a new scale are unknown to its creators prior to training. Discontinuous jumps in task performance (ex., mathematical tasks in GPT-3) are common with dramatic changes in model scale.
Large increases in data scale, enabled through improving automated data collection, are also likely.
This takeoff speed is slower than that of a recursively self-improving system, but the effectively unlimited data scale makes this path more potent than data-constrained paths to general intelligence.
Interpretability & Controllability
The training data is unknown to the creators at training time due to the bulk, automated construction of the dataset.
Prompt-based controllability becomes possible with generative language models. Guided or controlled generation has a strong profit motive behind it.
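As a concrete illustration, here is a minimal sketch of prompt-based steering, assuming the Hugging Face transformers library and the small gpt2 checkpoint; the control prefix and the behavior it targets are illustrative assumptions, not a recommended recipe.

```python
# Minimal sketch of prompt-based steering, assuming the Hugging Face
# `transformers` library and the small `gpt2` checkpoint. The control prefix
# and the behavior it encourages are illustrative assumptions only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A control prefix conditions the model toward a desired style or behavior.
control_prefix = (
    "The following answer is polite, factual, and avoids speculation.\n"
    "Question: What causes tides?\n"
    "Answer:"
)

outputs = generator(control_prefix, max_new_tokens=40, num_return_sequences=1)
print(outputs[0]["generated_text"])
```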
Ease of verification
Formally
The constrained nature of the pre-training task makes verification easier than in the multi-task setting.
Informally
Active industrial use makes informal verification an automatic and common process. Alignment becomes a practical use case as real-world users engage with the generations coming out of self-supervised models.
Informal verification will also likely require a combination of evaluation metrics, interpretability techniques, and checks against previous failures to reduce the probability of failures that are challenging to verify formally.
Models which check generations via classification are already common in production systems for controlling the alignment of this kind of model.
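A minimal sketch of this generate-then-check pattern is below; the generator, the checking classifier, and the acceptance threshold are hypothetical stand-ins rather than any particular production system.

```python
# Minimal sketch of a generate-then-check loop: a separate classifier screens
# candidate generations before one is served. The generator, checker, and
# threshold below are hypothetical stand-ins, not a real production system.
from typing import Callable, List, Optional

def filtered_generate(
    generate: Callable[[str], List[str]],   # produces candidate completions
    checker: Callable[[str], float],        # scores acceptability in [0, 1]
    prompt: str,
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the first candidate scored above the threshold, else None."""
    for candidate in generate(prompt):
        if checker(candidate) >= threshold:
            return candidate
    return None  # caller falls back to a refusal or canned response

# Toy stand-ins so the sketch runs end to end.
toy_generate = lambda p: [p + " ... draft A", p + " ... draft B"]
toy_checker = lambda text: 0.95 if "draft B" in text else 0.2
print(filtered_generate(toy_generate, toy_checker, "Explain photosynthesis"))
```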
Via tests
Ease of validation
External validation of self-supervised models is hampered by the fact that distillation allows a model to be stolen.
Likelihood of Reward Function Hacking
The presence of a single simple reward function for self-supervision makes reward function hacking very difficult. Reward hacking often depends on some distance between the reward signal and the genuine intention of the system’s designer, and the raw simplicity of the self-supervised reward means that a model which optimizes for it gains in the generality of its representations. This gain in generality is akin to the inverse of reward function hacking: the reward can only be fulfilled by learning a number of useful subskills that are not directly related to some overfit version of the reward function.
Reward function hacking becomes more likely as the model’s ability to understand what the human is measuring is combined with its ability to understand what the human really wants. Understanding what is being measured makes it gameable. Understanding what the human really wants makes it possible to spoof that experience. Understanding the means by which reward is being measured increases the ability to condition action on that measurement.
Likelihood of Treacherous Turn
The simplicity of the proxy task also makes it harder for the model to develop direct self-awareness through a task that incentivizes it. In the limit, language models may be equipped with a data generator that learns what the model performs poorly at and adds data and proxy tasks likely to improve those skills. But the task set of most self-supervised models rewards less abstract representations of data rather than things like theory of mind.
That said, predicting what a human would say or write creates a foothold into understanding likely human responses or behavior. The ability to write effectively about theory of mind can quickly translate into conditioning behavior on theory of mind.
Interaction with Competition
Ever larger self-supervised models are already being created within a competitive dynamic between well-resourced and capable teams. As these models become more lucrative, it is hard to imagine this competitive dynamic weakening.
Power of system at sub-general intelligence level
These systems are likely to see intensive practical use on real-world problems and so will continuously improve into more general learning machines.
Difficulty of value alignment.
Practical concerns with the alignment of systems like this proliferate and will continue to be the subject of heavy research.
The default tasks that these models use humanize them by their nature. Predicting what a human is likely to say creates a natural theory of mind.
Using prompt design to align the generations of a model is one high leverage point in favor of this kind of alignment in the short term.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Distribution shift is dealt with in self-supervision by expanding the training distribution to contain a tremendous amount of data, which means that robustness improves with training. Because the distribution is so wide, positive transfer gives shift robustness the potential to be much better for this kind of model than for others.
Strong incentives to hack models like these so that they can be stolen or so that their outputs can be modified mean that these models will also be defended. Institutions will check to see whether they can be distilled or attacked in advance of the release of internals.
Scaling laws (https://arxiv.org/abs/2001.08361) mean that we can map out the way that this kind of model responds to changes in scale more effectively than other kinds of model.
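As a sketch of what that mapping can look like, the snippet below fits the power-law form reported in the linked paper, L(N) ≈ (N_c / N)^α, to invented (model size, loss) pairs and extrapolates to a larger scale; every number in it is illustrative only.

```python
# Minimal sketch, assuming loss follows the power-law form L(N) = (N_c / N)**alpha
# described in the linked scaling-laws paper. The (size, loss) pairs below are
# invented purely to illustrate fitting and extrapolating such a curve.
import numpy as np

model_sizes = np.array([1e7, 1e8, 1e9, 1e10])   # parameters N (hypothetical)
val_losses = np.array([4.2, 3.5, 2.9, 2.4])     # validation loss (hypothetical)

# Fit log L = intercept + slope * log N by least squares in log-log space.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(val_losses), deg=1)

def predicted_loss(n_params: float) -> float:
    """Evaluate the fitted power law at a new model size."""
    return float(np.exp(intercept) * n_params ** slope)

print(f"fitted exponent alpha ~= {-slope:.3f}")
print(f"extrapolated loss at 1e11 params ~= {predicted_loss(1e11):.2f}")
```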
Probability of creating a general intelligence
The reason to put an incredibly high probability on this kind of model creating general intelligence is that it is the first robust example of transfer learning dominating a serious area of real application; the ‘general’ in general intelligence is about exactly this. Language tasks used to be handled by task-specific recurrent neural networks or LSTMs. The transition from task-specific models to a single general model like BERT, which also provided dramatic gains in task performance, gave us the first architecture demonstrating general performance that, at incredible scale and efficacy, could be seen as a general problem solver.
This probability is heightened by the tremendous resources going into these models. Critical tasks like machine translation, information retrieval (search), and question answering have been transformed by the shift to this new training paradigm. Large research organizations will be built around these models (https://crfm.stanford.edu/). The consequence is that talent will pick up this approach to more general intelligence. People merely looking to make progress on the most important models will focus their research on this path.
Multi-Task
Speed of Takeoff
Gradual. In this model, a large deep network trained to solve millions of tasks improves its efficacy at those tasks over time. As successful transfer builds up, the time taken to learn new tasks decreases. Transfer to tasks with small datasets becomes possible, opening up the generality of the model.
The absence of recursive elements in the data or the model makes this approach relatively safe.
Interpretability & Controllability
Most of these multi-task deep networks are goo-like models with little internal structure to peel apart the causal mechanisms behind decisions. The multi-modality means that intermediate embeddings may be very difficult to characterize. That said, the isolation of some tasks within parts of a Pathways-style network may make characterizing a specific task’s performance possible. Given the focus on transfer, it will likely be unclear just which training datapoints and which tasks drove the model to a decision on a different task.
The task heads are a source of controllability. Tasks which are known to lead to skills best avoided can be removed from the training setup. That said, divining which tasks those are can be challenging (perhaps by testing model performance on a task where performance is intended to be poor, and noting which axes in embedding space drive successful performance). This is much more control than in a meta-learning setup, which continually generates new tasks that are not known to the programmer of the system at implementation time. Ablations of the performance of versions of the system trained with and without given tasks give a veneer of controllability.
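A minimal sketch of such an ablation study follows; train_multitask_model and evaluate_on are hypothetical stand-ins for the real (and far more expensive) training and evaluation pipeline.

```python
# Minimal sketch of a task-ablation study: retrain the system with one task
# removed at a time and measure the effect on a held-out target task.
# `train_multitask_model` and `evaluate_on` are hypothetical stand-ins for the
# real training and evaluation pipeline.
from typing import Callable, Dict, List

def ablation_report(
    tasks: List[str],
    target_task: str,
    train_multitask_model: Callable[[List[str]], object],
    evaluate_on: Callable[[object, str], float],
) -> Dict[str, float]:
    """Map each removed task to the score change it causes on the target task."""
    baseline = evaluate_on(train_multitask_model(tasks), target_task)
    deltas = {}
    for removed in tasks:
        if removed == target_task:
            continue
        reduced = [t for t in tasks if t != removed]
        score = evaluate_on(train_multitask_model(reduced), target_task)
        deltas[removed] = score - baseline   # negative => that task helped transfer
    return deltas

# Toy stand-ins so the sketch runs.
toy_train = lambda task_list: set(task_list)
toy_eval = lambda model, task: 0.70 + 0.05 * len(model)
print(ablation_report(["qa", "translation", "summarization"], "qa", toy_train, toy_eval))
```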
Ease of verification
Formally
This kind of system is a large-scale statistical system that will require new types of formal verification. Many verification techniques rely on the simplicity of the input and output domain; this model will likely take multi-modal and structured inputs and outputs, dramatically complicating the formal verification process.
Informally (via practical tasks)
The gradual improvement in performance across many practical tasks makes this capability set for general intelligence the one most ingrained in feedback loops with real problems. This kind of practical verification creates strong intermediate incentives for safety and alignment long before recursive self-improvement.
Practical ML safety becomes very relevant for sensitive customers consuming the outputs of this model. Medical applications need strong uncertainty estimates. Mentally ill people need aligned recommendations. There are strong incentives for this form of validation.
The profit incentive makes it very unlikely that failed verification of properties (ex., whether or not the model causes emotional harm to people) will ever be researched, and such failures are certainly unlikely to be publicized. This obstacle to verification will drive whatever knowledge exists to be built internally (if at all), make it likely that none of it is put in writing, and incent the organization to steer research away from the topic.
Evaluation metrics which check the model’s output for proxies of quality will proliferate in systems serving this model’s results to users. This will refine the practices engineers use to improve control over these models’ outputs. The process of creating evaluation metrics and training the model to fulfill them, or conditioning model results on them, will be well tested.
Ease of validation
These models create strong incentives to hide the model’s internals (weights, architecture) from adversarial actors looking to make a profit in the real world.
The ability for others to use the model for practical tasks will give a sense of ever-present validation.
Likelihood of Reward Function Hacking
The large number of tasks makes reward function hacking much more challenging for this kind of architecture. Each independent task provides regularization against reward function hacking for the others. Hacking all tasks simultaneously would require that task-specific representations dominate decision making. This seems feasible if the model learns to hack tasks in general and applies that learning to all tasks. Avoiding this kind of behavior will likely become a focus as the practical consequences of reward function hacking impinge on profitability and task performance. Competition to solve this problem could be helpful.
Likelihood of Treacherous Turn
Continuous modeling of the system for treachery will give its users a strong body of evidence about the model’s capacity to turn before it is sufficiently aware to stop them.
Interaction with Competition
These models will likely create practical competitive dynamics between companies leveraging their task-solving ability to make money. This incentive distorts their ability to implement safety standards without losing the competition with others.
Power of system at sub-general intelligence level
The grounded experience of this model performing many tasks, likely practical tasks in the real world, means that plenty of data on how the model performs can be continuously gathered.
Difficulty of value alignment.
There is a strong local incentive for value alignment for the purpose of reaping a profit from the successful completion of tasks. Getting the model to do what the designers want, however, may not be aligned with the collective.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
This approach is built to deal with distribution shift (this is the multi-task problem).
Probability of creating a general intelligence
Reasonably high probability over a very long time horizon.
Answers to Outstanding Questions
What existing safety standards should be excluded, and what important standards are missing? What references should inform these standards?
It would be very useful to have a clear priority ordering for the safety standards. It’s a bit tricky because some of the safety standards are compiled into a single one. Many of the questions are also at very different levels of analysis: some are broad (probability of creating a general intelligence), some are incredibly specific (ease of validation). That said, deciding what to include and exclude is about noticing the impact of each safety standard on our decision making. Certainly the ability to effectively evaluate a standard is high on the list of metrics for evaluating standards.
There will thus be two spreadsheets that accompany this essay. One will rank the approaches against the safety standards. The other will weight the safety standards themselves.
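As a sketch of how the two spreadsheets could be combined into a single ranking, the snippet below computes a weighted score per approach; every score and weight in it is invented for illustration, and the real values would come from the accompanying spreadsheets.

```python
# Minimal sketch of combining the two proposed spreadsheets: per-standard
# scores for each approach, and a weight on each safety standard. Every number
# here is invented for illustration.
scores = {
    "Self-Supervision": {"Interpretability": 4, "Speed of Takeoff": 6, "Robustness": 7},
    "Multi-Task":       {"Interpretability": 5, "Speed of Takeoff": 7, "Robustness": 6},
    "Multi-Agent":      {"Interpretability": 2, "Speed of Takeoff": 4, "Robustness": 5},
}
weights = {"Interpretability": 0.5, "Speed of Takeoff": 0.3, "Robustness": 0.2}

def weighted_score(per_standard: dict, weights: dict) -> float:
    """Weighted sum of an approach's scores across the safety standards."""
    return sum(per_standard[standard] * w for standard, w in weights.items())

for approach, per_standard in sorted(
    scores.items(), key=lambda kv: weighted_score(kv[1], weights), reverse=True
):
    print(f"{approach}: {weighted_score(per_standard, weights):.2f}")
```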
- There’s so much to be gained from cutting standards; the fact that there are 13 is really blowing up the essay. If some can be combined with one another or eliminated altogether, it would simplify things dramatically.
- Some of the standards clearly have overlap with one another. Speed of creating a general intelligence is related to the speed of takeoff. Interpretability is deeply related to controllability. Difficulty of value alignment is clearly also related to interpretability.
- Perhaps this calls for a section which focuses on the interactions between the standards. Or, this could be merged with the section that introduces the standards (this feels elegant, relatively speaking).
- If standards have weaknesses in evaluation (ex., a score below a 5 on one of the meta-standards)
Meta-standards:
- Ease of evaluation.
- Importance to deciding to move towards this cluster of capabilities.
- Overlap with other standards
- Speed of Takeoff
- Interpretability
- Controllability
- Level of autonomy
- Corrigibility
- Ease of verification
- Formally
- Informally
- Via tests
- Ease of validation
- By exterior actors
- By internal actors
- Likelihood of Reward Function Hacking
- Likelihood of Treacherous Turn
- Interaction with Competition
- Power of system at sub-general intelligence level
- Difficulty of value alignment
- Robustness to:
- Distributional Shift
- Small Alterations
- Hacking
- Hardware Faults
- Software Bugs
- Changes in Scale
- Adversaries
- Probability of creating a general intelligence
- Speed of creating a general intelligence
What important approaches to general intelligence are missing?
It’s clear that the approaches taken by OpenAI and DeepMind (the major institutions working on general intelligence) need to be represented here: the use of large language models and self-supervised learning to build general intelligence, perhaps replacing reinforcement learning (much the way transformers replaced RNNs).
Large scale self-supervised learning can be backed by GPT-X, Jukebox, etc. Large scale multi-task learning can be backed by Pathways. (Perhaps these approaches should be merged, assuming that they don’t differ much on standards evaluation).
What is the right level at which to consider approaches? Does a prioritization over research areas count, or should we only consider explicit proposals?
The great thing about explicit proposals is that they only require one level of speculation: given the specifications, what kinds of malign behavior might we see from this form of intelligence?
Research area distributions add a second layer of speculation about what the form of general intelligence will be to begin with. The only serious proposal that looks like this is ‘Building Machines that Learn and Think Like People’ out of Joshua Tenenbaum’s lab. I am profoundly skeptical of this approach and worry that it’s sourced in all kinds of emotional hang-ups around psychology, an obsession with humanity, MIT’s weaknesses with deep learning, and a focus on understanding intelligence rather than on creating it. This is my essay and I don’t have to give it space. But it has enough gravity that it counts as a competing approach.
There’s a version of this where the ‘right level’ at which to consider approaches has some hierarchy to it. Within large-scale self-supervised or multi-task learning there are many approaches. Within deep reinforcement learning there are many approaches (model-based, self-supervised, varying data modalities). Within meta-learning there are many approaches.
In trying to characterize more abstract proposals there will be some accuracy lost to the lossiness of abstractions. But there will also be temporal relevance. This document was written 1.5-2 years ago, and the pathways / clusters of capabilities focused on have since changed to include self-supervision and multi-task learning at scale: straightforward deep learning approaches that don’t touch meta-learning or RL. There has been a big weight shift toward this at OpenAI in the last two years.
Criticism of Multi-Task: ‘This is an objective’ is a super reasonable response to this representation. Generality through transfer across tasks is totally a play.
Sources of generality is a reasonable theme for the creation of general intelligence.
Approaches covered will interact with one another in ways that affect conclusions. All but the most obvious and impactful 2nd and 3rd order effects of progress aren’t taken into consideration for simplicity’s sake. How can we account for this in practice?
The right approach is to highlight the specific capabilities that are and are not assumed to be present for each system. This capability ablation may actually lead to a more carefully decomposed and recursive representation of these systems. A capability like ‘out of distribution calibration’ is incredibly valuable for all methods. It’s possible to do research that generalizes that capability across meta-learning, self-supervision, and RL. It’s possible that researchers who focus on generating these kinds of capabilities will be counterfactually impactful. The full space of these capabilities may be seen to compose into intelligent systems. This modular frame would say that rather than focusing on specific combinations of capabilities, it’s more important to focus on the capabilities that make up the approaches themselves. The strong version of this would focus on the math of machine learning and specify the math and algorithms that underlie each capability as composing with one another. These things can get quite specific quite quickly (ex., prompts that dramatically improve alignment by asking for it, as an overly specific capability). The upside is that they’re actionable and understandable.
Having access to a full list of capabilities is also a phenomenal resource. This likely looks like a spreadsheet. I could order each capability by importance. This could come out of a decomposition (ex., proxy task + scalability + large dataset) of the properties of systems that lead to predictive power, effective task performance, and generality.
Not all approaches are covered here. Which others are high priority? Low priority?
I have already upweighted the priority of large-scale machine learning.
Getting high quality answers to whether an approach will fulfill a safety standard is quite hard. How do we evaluate our answers, avoiding excess speculation?
How accurately can we describe the paths taken in practice by real research organizations, and come to conclusions about the decision processes they use?
Safety Literature Categorization
Other Review: http://humancompatible.ai/bibliography
- Reinforcement Learning
- Safely Interruptible Agents
- https://intelligence.org/files/Interruptibility.pdf
- Reinforcement Learning with a Corrupted Reward Channel
- https://arxiv.org/pdf/1705.08417.pdf
- Deep Reinforcement Learning from Human Preferences
- https://arxiv.org/pdf/1706.03741.pdf
- Trial Without Error: Towards Safe Reinforcement Learning via Human Intervention
- https://arxiv.org/pdf/1707.05173.pdf
- Universal Reinforcement Learning Algorithms: Survey and Experiments
- https://arxiv.org/pdf/1705.10557.pdf
- A Definition of Happiness for RL Agents
- https://jan.leike.name/publications/A%20Definition%20of%20Happiness%20for%20Reinforcement%20Learning%20Agents%20-%20Daswani,%20Leike%202015.pdf
- Self-modification of policy and utility function in rational agents
- https://arxiv.org/pdf/0712.3329.pdf
- Avoiding Wireheading with Value Reinforcement Learning
- https://arxiv.org/pdf/1605.03143.pdf
- Learning to Reset for Safe and Autonomous Reinforcement Learning
- https://openreview.net/forum?id=S1vuO-bCW
- Interpretability
- Building Interpretable Models: From Bayesian Networks to Neural Networks
- https://dash.harvard.edu/bitstream/handle/1/33840728/KRAKOVNA-DISSERTATION-2016.pdf?sequence=1
- Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models
- https://arxiv.org/pdf/1606.05320.pdf
- Interpretable Selection and Visualization of Features and Interactions Using Bayesian Forests
- https://arxiv.org/pdf/1506.02371.pdf
- Neural Networks
- A Minimalistic Approach to Sum-Product Network Learning for Real Applications
- https://arxiv.org/pdf/1602.04259.pdf
- General
- Concrete Problems in AI Safety
- https://arxiv.org/pdf/1606.06565.pdf
- Value Alignment
- Learning to Follow Language Instructions with Adversarial Reward Induction
- https://arxiv.org/abs/1806.01946
- Misc.
- A Formal Solution to the Grain of Truth Problem
- https://jan.leike.name/publications/A%20Formal%20Solution%20to%20the%20Grain%20of%20Truth%20Problem%20-%20Leike,%20Taylor,%20Fallenstein%202016.pdf
- A Game-Theoretic Analysis of The Off-Switch Game
- https://arxiv.org/pdf/1708.03871.pdf
- Incomplete Contracting and AI Alignment
- https://arxiv.org/pdf/1804.04268.pdf
Process for Research Decision Making
Reading List
- https://www.lesswrong.com/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai
- Evan Hubinger
- ARCHES (AI Existential Safety Overview)
- Concrete Problems in AI Safety
- Neuroinspiration / Systems Neuroscience
- AI-GA
- Risks from Learned Optimizers
- Multi-agent
- One Model to Rule Them All
- Pathways
- GPT-X
- BERT
- Jukebox
- CLIP
- DALL-E
- Building Machines that Learn and Think Like People
- AI complete
- Powerplay
- Comprehensive AI Services
- Causal Model-Based Deep Reinforcement Learning
- Evolution
- Supervising Strong Experts
- Universal Algorithmic Intelligence - A mathematical top-down approach
- A Monte Carlo AIXI Approximation
- Approximation via RNNAI
- Godel Machine
- A Roadmap Towards Machine Intelligence
- Program Synthesis
- Neural Turing Machine
- Feature Dynamic Bayesian Nets
- AGI Safety Literature Review
- Machine Learning Interpretability: A Survey on Methods and Metrics
- Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
- Towards Practical Verification of Machine Learning
- https://medium.com/@deepmindsafetyresearch/towards-robust-and-verified-ai-specification-testing-robust-training-and-formal-verification-69bd1bc48bda
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
- List of examples of reward hacking
- Reward hacking blog post
- Superintelligence
- What are you Optimizing for? Aligning Recommender Systems with Human Values
- Towards Evaluating the Robustness of Neural Networks
Bibliography
Meta (About writing the essay)
- The truth is that this thinking matters and is already affecting the decision making of many researchers. Improving its quality, even though everything in it is quite speculative and hard to measure, means that slightly better decisions can be made.
- Cognitive science inspiration is horrifically vague. That said, treated as a body of capabilities, or as a weighting over the research frontier, it can be evaluated against other measures. It also includes a generator of the kinds of skills that matter and of the way they’ll manifest in the world (human similarity), which is an evaluable body of properties.
- The hard work of deciding whether the ‘paths’ concept can be made to make sense (it implies linear dependencies, which is plausibly a poor choice of data structure) needs to be done. It’s chosen for the simplicity of the analysis rather than for its coherence with actual research decision making.
- The reframe where rather than thinking of these as paths, these are thought of as AGI designs that posit systems with different subsets of future capabilities feels potent.
- (Aha! I’m enthused that I’m already making fast progress.)
- Establishing which bodies of properties would constitute an AGI (which occurs at the skills / process / bottom up level, rather than using goal accomplishment as the standard) is a very important grounding guide to research.
- Combining skills as a frame for thinking should be evaluated. Its complexities (ex. how the long term memory system interacts with the attentive system) are often filled with unforeseen interactions.
- Everyone asks for timelines. It would be fascinating to attempt to put timelines on every major approach to general intelligence.
- Finishing is less important than dramatically increasing the quality of my analysis. Missing ML pathways, or missing self-supervised multimodal representation learning w/ task-specific transfer, would be a pretty substantial mistake.
- This reminds me that task representation is both incredibly important and completely underserved.
- It would be really wonderful to add images from the relevant papers to this document.
- In the face of reading these papers I’ll get tons of ideas for additional content as well as nuances and refinements to my arguments.
- Making arguments as well as counterarguments in every single section seems wise. Ideally I’d have a seminar on each and recurse on the strongest arguments until each section represented something approximating the state of the art in thinking on the intersection of that path to general intelligence and the given safety standard.
- I’d like to decide whether to include institution names here, or to leave them absent. There will be plenty of citations to make this work.
- This is still a demonstration of a methodology as much as it is a tool for specific present-day decision making about what kind of general intelligence to build.
- People (like Geoffrey Irving) will want to know what my conclusions are. I should remember that the more abstract goals (like creating a precedent for tying conclusions to evidence from real-world systems, as in trying to publish recommender alignment work at Google) will be missed, and all people are likely to see are the practical ramifications.
- It’s nice to have counterarguments at the specific path / standard level, but counterarguments to the entire structure of the essay are also worthwhile. I imagine sections like:
- Why consider pathways at this level of abstraction?
- Is it useful to have criteria of safety standards?
- What are the ways in which analyses like this could be damaging or abused?
- Is thinking intentionally about this kind of thing reasonable / possible without being in the details of the mathematical implementations of these techniques?
- The argument that this entire essay sits at too high a level of abstraction to be falsified / made useful resonates with me.
- Proposing a process for research decision making based on this methodology should be an explicit section in the essay.
Notes
- Each list can be an event series.
- Each safety standard could be a seminar.
- Each approach to AGI could be a seminar (this is how the original seminar was constructed)
- Jeff Dean for ML Pathways (Somehow there’s no paper?)
- Samy Bengio for Multi-Modal Self-Supervision (Somehow there’s no paper?)
- Jeff Clune for Meta-learned / Meta-optimization (AI-GAs)
- Joshua Tenenbaum for Cognitive Science inspiration
- Demis Hassabis for Deep RL & Neuroinspiration
- Ilya Sutskever for Multiagent
- A Note About Differential Technological Development
- Differential Technological Development
- The Vulnerable World Hypothesis
- Neural Program Synthesis is a subset of AI-GAs. The level of abstraction for the paths is broken by having both of them, and especially by having both of them as separate paths.
Template
Name of Path
Speed of Takeoff
Interpretability & Controllability
Ease of verification
Ease of validation
Likelihood of Reward Function Hacking
Likelihood of Treacherous Turn
Interaction with Competition
Power of system at sub-general intelligence level
Difficulty of value alignment.
Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries
Probability of creating a general intelligence
Source: Original Google Doc