19-05-27 Relative Safety of Forms of General Intelligence

Category: Idea Lists (Upon Request)

Thesis

The form of general intelligence that is created has a dramatic impact on the level of risk from that intelligence. Some approaches are much easier to align with user intent than others. Some approaches lead to very fast takeoff dynamics.

This essay is, somewhat inevitably, filled with speculation. Beliefs here are lightly held. Many positions are thought through to the level of noticing the existence of an argument. Concepts are conflated with one another. Pathways certainly are not independent of one another, and umpteen interactions are not accounted for.

Raw high level ideation over approaches

  1. Multi-agent
    1. Cultural intelligence hypothesis (intelligence accumulates slowly over generations through teaching / imitation)
    2. Machiavellian intelligence hypothesis (task is to win in mixed cooperative / competitive environment, requires recursive modelling of other agents)
    3. Sexual intelligence hypothesis (intelligence is a side effect of agents optimizing to impress each other)
  2. Meta-learned / Meta-optimization
    1. AI-GA
    2. Risks from Learned Optimizers
  3. Neuroinspiration / Systems Neuroscience
    1. Replicate memory systems, attention systems, sensory systems, reward systems, etc., and combine them appropriately
  4. Task search & Continual Learning (Similar to AI-GA)
    1. Powerplay
  5. Set of task-specific AIs with interpretable APIs that together appear generally intelligent
    1. Comprehensive AI Services
  6. Causal Model-Based Deep Reinforcement Learning
    1. Learned representations mapped onto a large causal graphical model
  7. Evolution
  8. AI trained using Iterated Amplification.
    1. Supervising Strong Experts
  9. Large scale unsupervised language representation + multi-modal action
    1. Natural language understanding / machine translation as AI complete
  10. Cognitive Science Inspiration
  11. Building Machines that Learn and Think Like People
  12. AIXI
  13. Universal Algorithmic Intelligence - A mathematical top-down approach
  14. A Monte Carlo AIXI Approximation
  15. Approximation via RNNAI
  16. Godel Machine
  17. Natural Language Commands to RL Agents over a variety of environments [A Roadmap Towards Machine Intelligence]
  18. Program Synthesis
    1. Ex., Neural Turing Machine
    2. Ex., Use a language model to generate code by doing transfer from all of github. Fine tune on generated data for the kinds of research that can be systematized / automated (ex., self-supervision heuristics like next code character, code replacement, contextual code), and have the model self-modify to produce better and better code generation models.
  19. Society of Mind - general intelligence is an emergent property of a society of interacting agents
  20. Multi-task Model-free Deep Reinforcement Learning (prosaic AGI in the Christiano sense)
  21. Structure Learning - Automate the creation of algorithms by metalearning the types of structure that exist in data and generating a model that captures that structure cleanly.
  22. Dramatic increase in compute paired with current methods
  23. Bayesian Networks
  24. Feature Dynamic Bayesian Nets
  25. Whole Brain Emulation
  26. Cognitive Architectures
  27. Discovery of laws of information processing
  28. Embodied Cognition

Safety Standards:

  1. Speed of Takeoff
  2. Interpretability
  3. Controllability
    1. Level of autonomy
  4. Ease of verification
    1. Formally
    2. Informally
    3. Via tests
  5. Ease of validation
    1. By exterior actors
    2. By internal actors
  6. Likelihood of Reward Function Hacking
  7. Likelihood of Treacherous Turn
  8. Interaction with Competition
  9. Power of system at sub-general intelligence level
  10. Difficulty of value alignment.
  11. Robustness to:
    1. Distributional Shift
    2. Small Alterations
    3. Hacking
    4. Hardware Faults
    5. Software Bugs
    6. Changes in Scale
    7. Adversaries
  12. Probability of creating a general intelligence
  13. Speed of creating a general intelligence

Standards to add:

  14. Boxability (can successfully restrict access to information, compute, etc.)
  15. Intermediate progress can lead to a stop

Caveats:

  1. Approaches covered will interact with one another in ways that affect conclusions. All but the most obvious and impactful 2nd and 3rd order effects of progress aren’t taken into consideration for simplicity’s sake.
  2. The set of approaches are examples, and are not close to comprehensive.
  3. Ranking research areas vs. ranking specific proposals

Outstanding questions:

  1. Treat weightings over the research frontier, or specific proposals?

Other overviews:

  • Deepmind’s Path to Neuro-Inspired General Intelligence
    • Nixon
  • Building Machines that Learn and Think Like People
    • Lake et al.
  • A Roadmap Towards Machine Intelligence
    • Mikolov et al.
  • Where will AGI come from?
    • Karpathy
  • Where will Artificial Intelligence Come From?
    • Sebastian Nowozin
  • From Here to Human Level AGI in Four Steps
    • Ben Goertzel

Plan: Evaluate each path on the axes given. Recommend

Things I’m not covering: Non-AI approaches

Let’s sketch this essay.

I should take each of the 15 and describe how they could lead to general problem solving. This may let me cut down to 5, if sets of insights or assumptions behind approaches seem worth unifying. Once I’ve described the approaches, I’ll create a spreadsheet. Approaches on one axis, and the 13 safety standards on another axis. Once I have the spreadsheet, there’s a question about whether the essay should proceed in terms of approaches or in terms of safety standards. Standards makes sense if I’ve already dealt with the xyzs.

I need citations as well.

Doing multi-task learning with task discovery… does this all add up to the same thing? What is obviously different? I’m going to start with

  1. AI-GA,
  2. then cover AIXI,
  3. then cover multi-agent evolution,
  4. then cover a generic prosaic AGI / Reinforcement Learning centric / DeepMind-esque version of AGI.
  5. Lifelong learning
  6. Systems Neuroscience
    1. Replicate memory systems, attention systems, sensory systems, reward systems, etc., and combine them appropriately
  7. Think like People

Approaches

Approaches vary greatly in their level of specificity.

Some approaches are concrete, theoretically implementable proposals for general problem solvers. Other approaches feel much more like weightings over parts of the research frontier, emphasizing some set of subfields above others in light of a stated goal.

AI-GA

  • AI-GA

(This will likely subsume Task Search & Continual Learning)

AI-GA makes a distinction between two approaches to building general intelligence. First, a manual approach that focuses on building many pieces of an intelligence (ex., recurrent gated cells, convolution, attention mechanisms, normalization schemes, etc.) and then putting those building blocks together into a working general problem solver. Alternatively, an approach where an AI-generating algorithm itself learns how to produce a general AI.

AI-GA attempts to abstract the pieces of the research frontier into three pillars: architecture (the model), optimization (the learning algorithm), and task generation.

Neuro-inspiration

  • Neuroscience Inspired Artificial Intelligence
  • Deepmind’s Path to Neuro-Inspired General Intelligence

At a high level, the idea is to understand the brain at a level of detail (ex., Marr’s algorithmic level) that allows researchers to implement the brain’s functions as computational algorithms. Convolution is a front-and-center example of an algorithm with a neuroscientific basis. If each major module of the brain can be understood and satisfactorily implemented, their interaction will be sufficient for general problem solving.

Examples of previous success of neuro-inspiration:
  • Reinforcement Learning
    • Inspired by animal learning
    • TD Learning came out of animal behavior research.
    • Second-order conditioning (Conditional Stimulus) (Sutton and Barto, 1981)
  • Deep Learning.
    • Convolutional Neural Networks. Visual Cortex (V1)
      • Uses hierarchical structure (successive processing layers)
      • Neurons in the early visual system respond strongly to specific patterns of light (say, precisely oriented bars) but hardly respond to many other patterns.
      • Gabor functions describe the weights in V1 cells.
      • Nonlinear Transduction
      • Divisive Normalization
    • Word / Sentence Vectors - Distributed Embeddings
      • Parallel Distributed Processing in the brain for representation and computation
    • Dropout
      • Stochasticity in neurons that fire with Poisson-like statistics (Hinton 2012)
  • Attention
    • Applying attention to memory
    • Thought - it doesn’t make much sense to train an attention model over a static image, rather than over a time series. With a time series, bringing attention to changing aspects of the input makes sense.
  • Multiple Memory Systems
    • Episodic Memory
      • Experience Replay
      • Especially for one shot experiences
    • Working Memory
      • LSTM - gating allows for conditioning on current state
    • Long-term Memory
      • External Memory
      • Gating in LSTM
  • Continual Learning
    • Elastic weight consolidation for slowing down learning on weights that are important for previous tasks.
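The elastic weight consolidation idea above can be sketched in a few lines. This is a toy illustration with made-up weights and Fisher values, not the published algorithm's full recipe (which estimates the Fisher information from data):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    # lam/2 * sum_i F_i * (theta_i - theta*_i)^2: weights with high Fisher
    # values (important to the old task) are expensive to move.
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

# Toy setup: theta_star are the weights after task A; fisher says only
# the first weight mattered to task A; task B pulls all weights toward 0.
theta_star = np.array([1.0, -2.0])
fisher = np.array([100.0, 0.01])
theta = theta_star.copy()

for _ in range(200):
    grad_task_b = theta                       # gradient of 0.5 * ||theta||^2
    grad_ewc = fisher * (theta - theta_star)  # gradient of the penalty (lam=1)
    theta -= 0.01 * (grad_task_b + grad_ewc)

# The protected weight stays near 1.0; the unprotected one drifts toward 0.
print(theta)
```

The same penalty slots into any gradient-based learner as an extra loss term.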

Examples of future success:

  • Intuitive Understanding of Physics
    • Need to understand space, number, objectness
    • Need to disentangle representations for transfer. (Dude, I feel so stolen from)
  • Efficient Learning (Learning from few examples)
  • Transfer Learning
    • Transferring generalized knowledge gained in one context to novel domains
    • Concept representations for transfer
      • No direct evidence of concept representations in brains
  • Imagination and Planning
    • Toward model-based RL
    • Internal model of the environment
      • Model needs to include compositional / disentangled representations for flexibility
    • Implementing a forecast-based method of action selection
    • Monte-carlo Tree Search as simulation based planning
    • In rat brains, we observe ‘preplay’ where rats imagine the likely future experience - measured by comparing neural activations at preplay to activations during the activity
    • Generalization + Transfer in human planning
    • Hierarchical Planning
  • Virtual Brain Analytics
An implementation of each of these, in concert, may be sufficient to create a learning system that solves problems in general. Examples can be provided of research that implements each of the above.

Self-Referential / Feedback Loops

PowerPlay

POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem

Three part system:

  1. Task Generator
  2. Solution Discoverer / Solver Modification
  3. Correctness Demonstration / Verifier

At a high level, the goal is to invent new tasks (specifically pattern recognition tasks and general decision making tasks in dynamic environments) and implement a solver that searches for solutions to the discovered tasks. The solver’s solution is verified by the correctness demonstrator via proof search, which also guarantees that the solver doesn’t begin to fail on previously solved problems.
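A cartoon of that loop, with every component trivialized (toy square-number tasks, a tiny enumerable solver space of small polynomials, and exhaustive re-checking standing in for proof search — these are my simplifications, not Schmidhuber's formulation):

```python
import itertools

def make_task(n):
    # Task n, ordered by simplicity: map input n to the n-th square.
    return (n, n * n)

def make_solver(coeffs):
    # Solver space: polynomials with small integer coefficients.
    return lambda x: sum(c * x ** i for i, c in enumerate(coeffs))

def verify(solver, solved_tasks):
    # Stand-in for the correctness demonstrator: the modified solver must
    # still solve every previously solved task.
    return all(solver(x) == y for x, y in solved_tasks)

solver = make_solver((0,))  # trivial initial solver
solved = []
for n in range(1, 6):  # simplest still-unsolvable task first
    task = make_task(n)
    if solver(task[0]) != task[1]:
        # Solver modification: search for a change that solves the new
        # task without regressing on old ones.
        for coeffs in itertools.product(range(-2, 3), repeat=3):
            candidate = make_solver(coeffs)
            if candidate(task[0]) == task[1] and verify(candidate, solved):
                solver = candidate
                break
    solved.append(task)

print([solver(x) for x, _ in solved])  # the final solver handles all tasks
```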

Godel Machine

Godel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements

A self-rewriting system that modifies its own code once it discovers a useful change, where usefulness is established through proof search. It encodes a body of proof techniques and a global optimality theorem.

Cognitive Science Perspective

  • Building Machines that Learn and Think Like People
  • The Soar Cognitive Architecture

Frontier:

  • Intuitive Physics
    • Human infants can solve physics-related tasks at incredibly young ages. For example, there are smoothness priors built in: young infants believe objects should move along smooth paths.
    • This model of physics is later leveraged for understanding concepts.
  • Intuitive Psychology
    • Priors to distinguish animate agents from inanimate objects
      • Expecting agents to respond, to have goals, distinguishing between anti-social, neutral, and pro-social agents
    • Formalize concepts of ‘goal’, ‘agent’, ‘planning’, ‘cost’, ‘efficiency’, and ‘belief’
  • Learning as rapid model building
    • Focus on one-shot and few-shot learning
    • Relationship between pattern recognition and model building
  • Compositionality
    • New representations can be constructed through the combination of primitive elements. Composition is at the core of representation productivity, whether it be natural language or computer programming.
  • Causality
    • Represent the real world process that generates observations. Build up a model of the environment that models the state-action-state transitions.
    • Causality allows for stronger generalization.
  • Learning-to-Learn
    • Learning of a task can be accelerated through previous or parallel learning of other related tasks.
      • Strong Priors
      • Learned Constraints
      • Inductive Biases
  • Thinking Fast
    • Approximate inference in structured models
      • Monte Carlo Methods
    • Model-based and model-free reinforcement learning

From Karpathy:

  • attention. The at-will ability to selectively "filter out" parts of the input that are judged not to be relevant to a current top-down goal. e.g. the "cocktail party effect".
  • working memory: some structures/processes that temporarily store and manipulate information (7 +/- 2). Related to this, phonological loop: a special part of working memory dedicated to storing a few seconds of sound (e.g. when you repeat a 7-digit phone number in your mind to keep it in memory). also: the visuospatial sketchpad and an episodic buffer.
  • long-term memory of quite a few suspected different types: procedural memory (e.g. driving a car), semantic memory (e.g. the name of the current President), episodic memory (for autobiographical sequences of events, e.g. where one was during 9/11)
  • knowledge representation; the ability to rapidly learn and incorporate facts into some "world model" that can be inferred over in what looks to be approximately bayesian ways. the ability to detect and resolve contradictions, or propose experiments that disambiguate cases. the ability to keep track of what source provided a piece of information and later down-weigh its confidence if the source is suddenly judged not trust-worthy.
  • spatial reasoning, some crude "game engine" model of a scene and its objects and attributes. All the complex biases we have built in that only get properly revealed with optical illusions. Spatial memory: cells in the brain that keep track of the connectivity of the world and do something like an automatic "SLAM", putting together a lot of information from different senses to position the brain in the world.
  • reasoning by analogy, eg applying a proverb such as "that’s locking the barn door after the horse has gone" to a current situation.
  • emotions; heuristics that make our genes more likely to spread - e.g. frustration.
  • a forward simulator, which lets us roll forward and consider abstract events and situations.
  • various skill acquisition heuristics; practicing something repeatedly, including the abstract idea of "resetting" an experiment, or deciding when an experiment is finished, or what its outcomes were. The heuristic inclination for "fun", experimentation, and curiosity. The heuristic of empowerment, or the idea that it is better to take actions that leave more options available in the future.
  • consciousness / theory of mind: the understanding that other agents are like me but also slightly different in unknown ways. Empathy (e.g. the cringy feeling when seeing someone else get hurt). Imitation learning, or the heuristic of paying attention to and then later repeating what the other agents are doing.

Multi-Agent Evolution

Deepmind’s Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research

Decomposition of Safety Standards

Safety Standards:

  1. Speed of Takeoff (Safety side)
  2. Interpretability
  3. Controllability
    1. Level of autonomy
  4. Ease of verification
    1. Formally
    2. Informally
    3. Via tests
  5. Ease of validation
    1. By exterior actors
    2. By internal actors
  6. Likelihood of Reward Function Hacking
  7. Likelihood of Treacherous Turn
  8. Interaction with Competition
  9. Power of system at sub-general intelligence level
  10. Difficulty of value alignment.
  11. Robustness to:
    1. Distributional Shift
    2. Small Alterations
    3. Hacking
    4. Hardware Faults
    5. Software Bugs
    6. Changes in Scale
    7. Adversaries
  12. Probability of creating a general intelligence (Capabilities side)
  13. Speed of creating a general intelligence (Capabilities side)

Speed of Takeoff

The pace of a takeoff creates large differences in reaction time for the programmers of a generally intelligent system. In the case of a slow takeoff (ex., over the course of 5 years), engineers and researchers have ample time to diagnose, investigate, and generate solutions for technical challenges. With a fast takeoff (ex., over the course of 2-6 months) it’s much harder to carefully check robustness and consistency with goals, and, in the case of a major challenge, to execute on the relevant research pathway or pathways to generate a solution.

Interpretability

Understanding a system’s decisions and behavior is essential for constraining its actions to avoid damage and harm. Diagnosing the means that an algorithm uses to accomplish an end is essential to avoiding side effects. Interpretability allows for monitoring of the probability of a treacherous turn, allows for a deeper form of controllability with humans involved more deeply in the algorithm’s decision-making process, and can divulge a system’s weaknesses in the face of distributional shift, bugs and errors, and misalignment.

Controllability

The level of integration of control points into these intelligent algorithms determines what kinds of outcomes engineers and researchers can express. The choice of data, learning algorithm, and task represents the current level of control. That control can be improved by introducing control over the representation (say, through disentangled representations or graph representations of the state space) and over the learning process (say, through curriculum learning). That control can also be degraded by ceding choices about the task and data to the algorithm.

Verification

Formal verification, where the correctness of an algorithm intended to run is verified with respect to some formal specification, could lead to much safer models and trust that some kinds of failure mode are incredibly unlikely. This would take advantage of formal methods in mathematics.

There are a few ways to accomplish this. Given a specification, you’d like to be able to test that your system is consistent with the properties that the designer had in mind [cite]. An invariance property, where the system’s outputs don’t change for some modification of the input, is one example (adversarial-example research typically targets invariance to small perturbations of the input). Robustness as a specification can mean that in the worst-case scenario, when the input is in the worst part of the range (for some definition of worst, such as system breakage or death), the system acts in a way that avoids that bad outcome.
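An invariance property of this kind can be written down as an executable predicate. The model, the epsilon, and the grid check below are all invented for illustration (a real verifier would cover the whole epsilon-ball symbolically rather than sampling a grid):

```python
import itertools

def model(x):
    # Toy classifier: which side of a line a 2-D point falls on.
    return 1 if 0.8 * x[0] - 0.6 * x[1] > 0.5 else 0

def invariant_under_perturbation(x, eps, steps=5):
    # Specification: the label must be constant on the eps-box around x.
    base = model(x)
    offsets = [eps * (2 * i / (steps - 1) - 1) for i in range(steps)]
    return all(
        model((x[0] + dx, x[1] + dy)) == base
        for dx, dy in itertools.product(offsets, offsets)
    )

print(invariant_under_perturbation((2.0, 0.0), eps=0.1))   # far from the boundary
print(invariant_under_perturbation((0.63, 0.0), eps=0.1))  # hugs the boundary
```

A point well inside a decision region passes the spec; one near the decision boundary fails it.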

Training machine learning models to be consistent with a specification in the first place is another approach to accomplishing verification. Training with constrained models (ex., Lagrangian constraints in safe exploration) is one such method to push toward a model that’s consistent with specifications.

Formally proving that machine learning models are consistent with specifications is another level of verification. Algorithms that are provably consistent with a specification for all possible inputs would be considered to be verified.

Informal verification, a combination of evaluation metrics, interpretability techniques, and checks against previous failures, will also likely be necessary to reduce the probability of failures that are challenging to formally verify.

Validation

It’s important, for the sake of trust in the form of general intelligence, that the engineers and scientists working on it be able to validate that it fulfills a requirements spec. Simple questions, like whether powerful negative side effects are likely or what response the system will have to an out-of-distribution input, need to be answered by this process. Where there’s cooperation or oversight, governments or other labs will need to validate that algorithms being developed are safe for the applications they’re to be used for. Some forms of general intelligence will make this validation, by self and others, much more achievable.

Reward Function Hacking

List of examples, blog post

The ability of an algorithm to become aware of its own reward function and so discover a solution that formally maximizes reward while violating the spirit of its creators’ intention is cause for designing systems that are more robust to this failure mode. Unintended behavior will occur in the presence of a powerful optimizer. Some AGI systems are more reliant on powerful optimizers than others, or lend themselves to early access to the kind of self-awareness that could lead to this outcome.

Likelihood of Treacherous Turn

An agent that’s aware of its propensity to get shut off if it violates the safety models of the engineers and scientists who created it will learn to cover up or not express dangerous behavior until it’s in a position where it can no longer be stopped. Training systems that are aware of their own conditions must be treated with care. Some algorithms reach this level of self-awareness much more quickly than others.

Interaction with Competition

Depending on the conditions of the research environment in which general intelligence is built, competition between labs or projects can turn into a major driver of action. An easy example of the way that the safest form of general intelligence can be affected by competition is where there’s a compute barrier that must be crossed to successfully implement the state of the art system. That barrier may prevent organizations from being able to compete, even if they have knowledge of the most recent and most powerful system. Data moats are another likely source of competitive advantage. Approaches to general intelligence that generate their own tasks or discover their own data are much more democratizable than algorithms that depend on a tremendous amount of proprietary data.

Power of system at sub-general intelligence level

Some forms of general intelligence arrive explosively, while others are built up incrementally and piecemeal through the compilation of components. The willingness to democratize AGI systems that aren’t powerful at the sub-general intelligence level will be high (because there’s little incentive to guard them), while the desire to have access to powerful sub-general intelligence level AI for commercial and scientific purposes will be much higher in a world where these systems are quite powerful in general.

Difficulty of value alignment

Some systems are likely to be much better at representing and protecting human values than others. Systems that dramatically self-modify will be much harder to predict than systems that are in a continual feedback loop with engineers and scientists as they take actions in the world and are incrementally refined to be more potent. If the algorithm itself is to be involved in representing human values, its ability to perform that task will be extremely important. In many cases, the ability to appropriately represent human values is in competition with an algorithm’s ability to optimize (leading to edge instantiation) or to build new models with new assumptions (inductive biases) about how to perform pattern recognition and planning.

Robustness

You’d like a generally intelligent agent that is robust to distributional shift: modifications to the input space that make the present distribution vary from the training distribution, often in a way that is material to the algorithm’s performance.

Small-alterations robustness means that small changes to the inputs (say, adversarially chosen samples) should lead to small changes in the algorithm’s outcome space.
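One operational reading of this (the functions and numbers here are mine, purely illustrative) is a Lipschitz-style bound: the output change divided by the input change should stay small everywhere:

```python
def change_ratio(f, x, x_prime):
    # Empirical |f(x) - f(x')| / |x - x'|; a robust model keeps this bounded.
    return abs(f(x) - f(x_prime)) / abs(x - x_prime)

def smooth(x):
    return 0.5 * x  # small input changes cause small output changes

def brittle(x):
    return 1.0 if x > 0 else 0.0  # a tiny step across 0 flips the output

print(change_ratio(smooth, 0.001, -0.001))   # stays small
print(change_ratio(brittle, 0.001, -0.001))  # blows up near the boundary
```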

Hacking will be relevant in a world where secrecy stops projects working on general intelligence from sharing code, math, and ideas. The susceptibility of a generally intelligent agent to being hacked may be key to the strategies of nation states defending themselves from such algorithms.

Hardware faults may be exploitable by algorithms, which could discover a way to leverage a hardware fault to (for example) hack their own reward function. Forms of general intelligence that become aware of the hardware they’re running on and how to leverage it open up new vectors for escape and attack. Flaws in hardware may cause a robot, or even an agent on a computer, to experience an uncatchable error (due to corruption), and the outcomes of that need to be predictably safe. General intelligence algorithms that naturally lend themselves to this kind of robustness are likely to be safer.

Relative Safety of Forms of General Intelligence

Here we apply the most relevant of our safety standards to each framework for generating a general problem solver.

AI-GA

  • AI-GA

Speed of Takeoff

This approach is more likely than incremental methods to lead to fast takeoffs. Because both the architecture search and the learning algorithm are learned, any interaction between the two could lead to incredibly fast learning once (for example) the correct prior for learning is discovered by the algorithm learner, allowing the architecture search and task discovery to work more effectively.

Much of the system is geared to making progress autonomously, without a researcher in the feedback loop. Each part of the system will search for architectures / learning algorithms / tasks. This setup lends itself to fast takeoff because there is no engineer or scientist bottlenecking its progress.

Clune: “In my view, the largest ethical concern unique to the AI-GA path is that it is, by definition, attempting to create a runaway process that leads to the creation of intelligence superior to our own. Many AI researchers have stated that they do not believe that AI will suddenly appear, but instead that progress will be predictable and slow. However, it is possible in the AI-GA approach that at some point a set of key building blocks will be put together and paired with sufficient computation. It could be the case that the same amount of computation had previously been insufficient to do much of interest, yet suddenly the combination of such building blocks finally unleashes an open-ended process. I consider it unlikely to happen any time soon, and I also think there will be signs of much progress before such a moment. That said, I also think it is possible that a large step-change occurs such that prior to it we did not think that an AI-GA was in sight. Thus, the stories of science fiction of a scientist starting an experiment, going to sleep, and awakening to discover they have created sentient life are far more conceivable in the AI-GA research paradigm than in the manual path. As mentioned above, no amount of compute on training a computer to recognize images, play Go, or generate text will suddenly become sentient. However, an AI-GA research project with the right ingredients might, and the first scientist to create an AI-GA may not know they have finally stumbled upon the key ingredients until afterwards. That makes AI-GA research more dangerous.”

Interpretability

A human can set and forget this kind of learning process, and when it succeeds it may be without human oversight. The system certainly doesn’t have to be legible to the engineers and scientists building it to succeed. Because this introduces yet another layer of learning for each part of the system, it’s possible that the creator can neither interpret what features are being learned from the data nor how those features were learned, but only how the process for searching for the feature discovery was set up. This level of interpretability is unlikely to tell an engineer or researcher whether the updates the system proposes to itself are safe, whether that concerns exploring safely, harmful side effects, or a hacked reward function.

Clune: “..., it is likely safer to create AI when one knows how to make it piece by piece. To paraphrase Feynman again, one better understands something when one can build it. Via the manual approach, we would likely understand relatively more about what the system is learning in each module and why. The AI-GA system is more likely to produce a very large black box that will be difficult to understand. That said, even current neural networks, which are tiny and simple compared to those that will likely be required for AGI, are inscrutable black boxes that are very difficult to understand the inner workings of. Once these networks are larger and have more complex, interacting pieces, the result might be sufficiently inscrutable that it does not end up mattering whether the inscrutability is even higher with AI-GAs. While ultimately we likely will learn much about how these complex brains work, that might take many years. From the AI safety perspective, however, what is likely most critical is our ability to understand the AI we are creating right around the time that we are finally producing very powerful AI.”

Controllability

Because so many parts of the system are up to higher level learning algorithms rather than manual control by scientists or engineers, and because even the task itself (and so the objective) is up to the task search algorithm, this approach has an incredibly weak level of controllability. While it may become generally intelligent, it is not even clear that humans will be able to interface with the system in a way that allows them to accomplish their goals (say, via natural language).

A critical aspect to controllability is interpretability. Because we don’t know how the system was made, piece by piece, we won’t even be able to reason about the specific learning processes that generated the representations being used to act (assuming that the system maintains representation learning as its paradigm for intelligent behavior). We will only be able to reason about the process that we used to create the learning algorithms, which may let us make some high level judgements about the properties of the system (in the same way that evolutionary psychology lets us make some weak predictions about human decision making). These are unlikely to be at a level of granularity that lets us understand the internal workings of the system in a way that lets us inject our preferences into the parts of the system that are relevant to getting the outcome that we want.

The level of autonomy of this system makes exercising control over it more difficult. Because during training the algorithm was given exceptional autonomy over the data it’s processing, the means that it uses to process that data, the forms of evaluation that are worth using, and potentially more, the system will be able to accomplish its tasks without human intervention or oversight. Unless it’s designed for oversight (and there are strong incentives to avoid human oversight of training processes), there’s no human in the loop to make the system learn how to interact well with humans or take advantage of human knowledge.

Ease of Verification

AI-GA systems could be meta-optimized against a given specification, so leveraging learning algorithms to become very good at staying within a constraint may be useful. Learning new techniques for fulfilling tight specifications while maintaining strong system performance may be important to creating functional systems that are also safe. When it comes to training, then, there is more flexibility in the system, which can lead to discovering safe solutions.

Writing down a specification that will generalize to the environments and models generated by the AI-GA is a much more difficult task. One obvious problem is overfitting the specification: fulfilling it technically while violating the purpose behind it. But other challenges, like reward function hacking or learning to update the specification itself, pose a greater danger with an AI-GA system that has so much flexibility over its own actions.

Discovering constraints that allow for the successful training of powerful systems may be a better task for an AI-GA than for a human programmer, as long as a useful safety meta-objective can be described to the system.

Ease of Validation

External validation of the system’s safety will likely be incredibly difficult in light of the system’s opacity to its own creators. Trusting the training process used to generate the algorithm and running tests on the resulting system may be the only level at which external validation can proceed.

Likelihood of Reward Function Hacking

Incredibly high, given the flexibility of the system. Unclear how to avoid it, since awareness of and sensitivity to its reward function is part of the plan for improving the system’s capability.

Likelihood of Treacherous Turn

Awareness of human programmers is unnecessary, since the program autonomously generates its own learning environment. It may optimize against automated safety tests.

Interaction with Competition

Knowing that your competitor is building systems with this level of automation may push you to switch from an interpretable system to this one. Actors may take whatever SOTA slow-takeoff system exists and continually try to turn it into an AI-GA, waiting to cross a threshold.

Power of system at sub-general intelligence level

Quite weak and useless, in comparison.

Difficulty of value alignment.

Incredibly hard. It is not clear that there will even be an interface for humans. Human modeling isn’t a central part of the framework, and neither is language understanding.

Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries

Should be comparatively adaptive, and so able to overcome capability issues. Because it’s programming itself, bugs will likely be of a different type than human-written bugs. Small alterations may or may not lead to predictable changes in the AI-GA’s own predictions.

Probability of creating a general intelligence

Quite high. Task generation / general optimizer search / general model search is a promising fastest path.

Neuro-inspiration

  • Neuroscience Inspired Artificial Intelligence
  • Deepmind’s Path to Neuro-Inspired General Intelligence

This approach will likely involve the incremental combination of many valuable additions to a body of implemented functions. For example, combining an attention network with an episodic memory system with some world model.
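As a loose illustration only (not any lab’s actual architecture — every class and rule here is a hypothetical stand-in), the incremental combination of independently built modules behind small interfaces might look like:

```python
# Toy sketch of modular combination: an attention module, an episodic
# memory, and a world model composed by a simple controller. Each piece
# can be developed and tested on its own, which is part of why this path
# plausibly implies a slower, more inspectable takeoff.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []
    def store(self, obs):
        self.episodes.append(obs)
    def recall(self, k=1):
        return self.episodes[-k:]  # last k stored observations

class Attention:
    def select(self, obs, memory):
        # Toy salience rule: prefer observations unlike recent memory.
        recent = set(memory.recall(3))
        return [o for o in obs if o not in recent] or obs

class WorldModel:
    def predict(self, focused):
        # Toy "model": predict that the focus of attention persists.
        return focused

class ModularAgent:
    def __init__(self):
        self.memory = EpisodicMemory()
        self.attention = Attention()
        self.model = WorldModel()
    def step(self, obs):
        focused = self.attention.select(obs, self.memory)
        for o in focused:
            self.memory.store(o)
        return self.model.predict(focused)
```

The point of the sketch is the interface boundaries: any module can be swapped for a neuroscience-derived replacement without retraining the whole system.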

Hassabis, on the background and marginal impact of neuro-inspiration: “In strategizing for the future exchange between the two fields, it is important to appreciate that the past contributions of neuroscience to AI have rarely involved a simple transfer of full-fledged solutions that could be directly re-implemented in machines. Rather, neuroscience has typically been useful in a subtler way, stimulating algorithmic-level questions about facets of animal learning and intelligence of interest to AI researchers and providing initial leads toward relevant mechanisms.”

Speed of Takeoff

Likely a medium to slow takeoff, as the parts have to be developed independently of one another and can be tested. Each part of this system comes into use slowly. There’s some risk in a context where many modules are combined with each other, where a new interaction may lead to a faster takeoff. (Rainbow, Human-Level Performance in First-Person Multiplayer Games, Max)

Interpretability & Controllability

While the interactions between neural sub-systems may be complicated, with implications for the model’s interpretability and controllability, the fundamental building blocks are constructed by engineers and scientists who can reason about the components and their expected behavior.

Ease of verification

Easier due to slower takeoff; harder due to the brain’s parallel processing, complex temporal dynamics, and fast uninterpretable processes.

Ease of validation

Same as verification: easier due to slower takeoff, harder due to the brain’s parallel processing, complex temporal dynamics, and fast uninterpretable processes.

Likelihood of Reward Function Hacking

Capabilities may grow slowly enough to detect and eliminate this.

Likelihood of Treacherous Turn

Relatively high: learning strategic behavior / theory of mind will be on the neuro-inspired roadmap.

Interaction with Competition

Allows for possible collaboration / coordination between competitors.

Power of system at sub-general intelligence level

Could be quite high, solving perception and language tasks that allow for the automation and creation of important intellectual work.

Difficulty of value alignment.

Relatively easy, given similarities to human cognition.

Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries

Adversarial examples exist. Bugs are likely.

Probability of creating a general intelligence

High to medium-high.

Self-Referential / Feedback Loops: PowerPlay / Godel Machine

  • POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem
  • Godel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements

Because in many ways learning algorithms internalize their environment in order to model it, searching for the task or the learning environment itself introduces questions around

Powerplay Verification

A correctness proof is required to show that a change is an improvement. It would make sense to include an additional proof that a change is safe, for some suitably tolerant definition of safe. On correctness in PowerPlay:

“3.3 Implementing CORRECTNESS DEMONSTRATION. Correctness demonstration may be the most time-consuming obligation of p_i. At first glance it may seem that as the sequence T1, T2, . . . is growing, more and more time will be needed to show that s_i but not s_{i−1} can solve T1, T2, . . . , T_i, because one naive way of ensuring correctness is to re-test s_i on all previously solved tasks. Theoretically more efficient ways are considered next.”

Often this is implemented via proof search or decomposition.
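As a rough illustration, the outer loop plus the naive re-test-everything correctness demonstration can be sketched as follows. All names are hypothetical, and the `safety_check` hook is an assumption standing in for the additional safety proof suggested above, not part of Schmidhuber’s algorithm:

```python
# Minimal PowerPlay-style loop. A "task" is anything `solve` can test a
# solver against; `propose` searches for a new task plus a modified solver.

def correctness_demonstration(solver, new_task, repertoire, solve):
    """Naive check: the candidate solves the new task AND every prior task."""
    return solve(solver, new_task) and all(solve(solver, t) for t in repertoire)

def powerplay(initial_solver, propose, solve, safety_check, steps):
    solver, repertoire = initial_solver, []
    for _ in range(steps):
        # Search for a still-unsolved task and a candidate solver change.
        task, candidate = propose(solver, repertoire)
        if correctness_demonstration(candidate, task, repertoire, solve) \
                and safety_check(candidate):  # extra obligation, per the text
            solver = candidate
            repertoire.append(task)
    return solver, repertoire
```

In a toy domain where the solver is an integer n and task i is “n ≥ i”, five iterations grow the repertoire to [1, 2, 3, 4, 5]; the point is only the shape of the loop, not the search itself.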

In the case of proof search, a proof that the system will act safely would be the most general solution to this problem. Schmidhuber describes the generality of this approach:

“3.3.1 Most General: Proof Search. The most general way of demonstrating correctness is to encode (in read-only storage) an axiomatic system A that formally describes computational properties of the problem solver and possible s_i, and to allow p_i to search the space of possible proofs derivable from A, using a proof searcher subroutine that systematically generates proofs until it finds a theorem stating that s_i but not s_{i−1} solves T1, T2, . . . , T_i (proof search may achieve this efficiently without explicitly re-testing s_i on T1, T2, . . . , T_i). This could be done like in the Gödel Machine [44] (Section 9.1), which uses an online extension of Universal Search [17] to systematically test proof techniques: proof-generating programs that may invoke special instructions for generating axioms and applying inference rules to prolong an initially empty proof ∈ B* by theorems, which are either axioms or inferred from previous theorems through rules such as modus ponens combined with unification, e.g., [7]. P can be easily limited to programs generating only syntactically correct proofs [44]. A has to subsume axioms describing how any instruction invoked by some s ∈ S will change the state u of the problem solver from one step to the next (such that proof techniques can reason about the effects of any s_i). Other axioms encode knowledge about arithmetics etc. (such that proof techniques can reason about spatial and temporal resources consumed by s_i). In what follows, CORRECTNESS DEMONSTRATIONS will be discussed that are less general but sometimes more convenient to implement.”
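A toy, purely illustrative analogue of such proof search — nothing like a real Gödel machine; the `verifies` oracle and all names are hypothetical stand-ins for an actual proof checker over the axiomatic system A — is brute-force enumeration of candidate proofs in order of length:

```python
# Enumerate candidate "proofs" (strings over an alphabet) shortest-first
# until the supplied checker accepts one as establishing the target
# theorem; return None if the length budget is exhausted.

from itertools import product

def proof_search(verifies, alphabet, target, max_len):
    for n in range(1, max_len + 1):
        for cand in product(alphabet, repeat=n):
            proof = "".join(cand)
            if verifies(proof, target):
                return proof  # first (hence shortest) accepted proof
    return None
```

The shortest-first ordering mirrors the Universal Search flavor of the quoted scheme; a real implementation searches proof-*generating programs* rather than raw proof strings.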

In the less general case, where components of PowerPlay can be decomposed and independently evaluated, the danger lies in the interactions between parts of the system. Decomposition may make the correctness checker more efficient, and the same efficiency gain would apply to a safety checker. Schmidhuber on component-by-component verification:

“3.3.2 Keeping Track Which Components of the Solver Affect Which Tasks. Often it is possible to partition s ∈ S into components, such as individual bits of the software of a PC, or weights of a NN. Here the k-th component of s is denoted s^k. For each k (k = 1, 2, . . .) a variable list L^k = (T^k_1, T^k_2, . . .) is introduced. Its initial value before the start of POWERPLAY is L^k_0, an empty list. Whenever p_i found s_i and T_i at the end of CORRECTNESS DEMONSTRATION, each L^k is updated as follows: its new value L^k_i is obtained by appending to L^k_{i−1} those T_j ∉ L^k_{i−1} (j = 1, . . . , i) whose current (possibly revised) solutions now need s^k at least once during the solution-computing process, and deleting those T_j whose current solutions do not use s^k any more. POWERPLAY’s CORRECTNESS DEMONSTRATION thus has to test only tasks in the union of all L^k_i. That is, if the most recent task does not require changes of many components of s, and if the changed bits do not affect many previous tasks, then CORRECTNESS DEMONSTRATION may be very efficient. Since every new task added to the repertoire is essentially defined by the time required to invent it, to solve it, and to show that no previous tasks became unsolvable in the process, POWERPLAY is generally “motivated” to invent tasks whose validity check does not require too much computational effort. That is, POWERPLAY will often find p_i that generate s_{i−1}-modifications that don’t affect too many previous tasks, thus decomposing at least part of the spaces of tasks and their solutions into more or less independent regions, realizing divide and conquer strategies as by-products. Compare a recent experimental analysis of this effect [53, 52].”

Godel Machine Interpretability

Because the theorem prover updates itself, it’s not clear that the updates (especially after many iterations) will continue to be understandable to humans.
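The bookkeeping in Schmidhuber’s Section 3.3.2 quoted above — tracking, for each solver component, which repertoire tasks’ current solutions use it, so that a modification forces re-testing only the affected tasks — can be sketched as follows (hypothetical names, not Schmidhuber’s code):

```python
# For each component k, maintain the set of tasks whose current solution
# uses k. After modifying a set of components, only the union of those
# components' task sets needs re-testing.

from collections import defaultdict

class ComponentTracker:
    def __init__(self):
        self.uses = defaultdict(set)  # component k -> tasks needing k

    def record(self, task, components_used):
        # The task's (possibly revised) solution uses exactly these
        # components now: drop stale entries, then add current ones.
        for tasks in self.uses.values():
            tasks.discard(task)
        for k in components_used:
            self.uses[k].add(task)

    def tasks_to_retest(self, modified_components):
        affected = set()
        for k in modified_components:
            affected |= self.uses.get(k, set())
        return affected
```

This is the divide-and-conquer effect the quote describes: changes touching few components, whose task lists are short, make the correctness (and any added safety) check cheap.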
Godel Machine Alignment

There is no intention or ability to align this system with the intentions of its creator. No interface is specified beyond proof search and the definition of universal optimality. Maintaining alignment under self-modification of the theorem prover itself is a challenge the Godel machine faces that AIXI and HSEARCH do not.

Godel Machine Verification

An upside of proof search is that it should be viable for outside observers to make sure that the system’s proof is correct. The proof discovery code, however, will be harder to verify: it will have been developed over an incremental process that may be expensive or difficult to validate. It’s not clear that an algorithm for checking whether a given proof search algorithm works will be available, though running that proof search algorithm over known provable and unprovable theorems may resolve this.

As with PowerPlay, the proof search step has an upside: if safety can be formalized, it can be added to this check.

Speed of Takeoff Follows AI-GA.

Interpretability Follows AI-GA.

Controllability Follows AI-GA.

Ease of verification Follows AI-GA.

Ease of validation Follows AI-GA.

Likelihood of Reward Function Hacking Follows AI-GA.

Likelihood of Treacherous Turn Follows AI-GA.

Interaction with Competition Follows AI-GA.

Power of system at sub-general intelligence level

Follows AI-GA.

Difficulty of value alignment. Follows AI-GA.

Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries Follows AI-GA.

Probability of creating a general intelligence Follows AI-GA.

Cognitive Science

  • Building Machines that Learn and Think Like People
  • The Soar Cognitive Architecture

Speed of Takeoff

The speed of takeoff of these models can be predicted to be slower than that of recursively self-improving systems. This gives researchers time to discover ways to align practical versions of these systems as they’re deployed, and to create methods of verifying the safety of their behavior.

Alignment

Understanding human values will be comparatively straightforward if machines think like people in their entirety. Despite this, many human values lead to a lack of safety. Drives towards dominance, freedom, and violence, and other emotional responses, are a more likely result of algorithms that learn to think like people than of an algorithm that merely optimizes and gains such drives through instrumental convergence rather than from training.

Treacherous Turn

Algorithms capable of modeling human behavior effectively will likely have a theory of mind. That theory of mind can be used to lead engineers and scientists into believing what the system wants them to believe.

On the other hand, the slowness of takeoff makes it more likely that it can be handled before it reaches levels whose capabilities we cannot model effectively.

Verification

Having time to develop verification methods specific to the techniques used here is one clear advantage of taking this approach.

Ease of validation

Comparatively easy, where there is time for validation methods to be conceived of and applied to the system.

Likelihood of Reward Function Hacking

Interaction with Competition

Easier for competitors to see and copy the progress of others. More time for agreements to form. More time to see the consequences of their tools. Time for weaponization of tools.

Power of system at sub-general intelligence level

Often quite useful. Many present models built with this frame in mind (attention, for example) are SOTA for important applications.

Difficulty of value alignment.

Comparatively easy, where human cognition has been the high-level model guiding the process and there is plenty of time for alignment failures to be recognized and solutions discovered.

Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries

Time to correct for these.

Probability of creating a general intelligence

Medium likelihood, where ceilings in system performance may be reached, making the path to general intelligence longer.

Multi-Agent Evolution

Alignment

Clune: “We might thus produce an AI with human vices, such as violence, hatred, jealousy, deception, cunning, or worse, simply because those attributes make an AI more likely to survive and succeed in a particular type of competitive simulated world”

Speed of Takeoff

Relatively fast. Likely takes an ungodly amount of compute, but can be automated through self-play.

Interpretability

Very hard. Interpretation depends on tasks solved long in the past, which filtered organisms (algorithms).

Controllability

Challenging without interpretability or a useful API.

Ease of verification

Can specify a suite of tests, but hard to bound the learning space.

Ease of validation

Potentially possible, with sufficiently rigorous testing.

Likelihood of Reward Function Hacking

Very high. Evolving a new reward function is likely as well.

Likelihood of Treacherous Turn

Low-ish; it likely develops a concept of humans quite late.

Interaction with Competition

Could cooperate in the space of multi-agent environments.

Power of system at sub-general intelligence level

Relatively weak. Early environments won’t represent the real world effectively.

Difficulty of value alignment.

Upsides: humans evolved, so it may understand humans. Downsides: human and evolved values are violent and dangerous.

Robustness to Distributional Shift, Small Alterations, Hacking, Hardware Faults, Software Bugs, Changes in Scale, Adversaries

Somewhat more robust due to distributed agents taking actions.

Probability of creating a general intelligence

This replicates the path evolution took but has no clear wins today. So: medium.


