AI Alignment Writing Day Roundup #2

Here are some of the posts from the AI Alignment Forum writing day. Since the participants wrote 34 posts in less than 24 hours (!), I’m re-airing them, in roughly chronological order, to give people a proper chance to read (and comment on) them.

1) Computational Model: Causal Diagrams with Symmetry by johnswentworth

This post is about representing logic, mathematics, and functions with causal models.

For our purposes, the central idea of embedded agency is to take these black-box systems which we call “agents”, and break open the black boxes to see what’s going on inside.
Causal DAGs with symmetry are how we do this for Turing-computable functions in general. They show the actual cause-and-effect process which computes the result; conceptually they represent the computation rather than a black-box function.

I’m new to a lot of this, but to me this seemed like a weird and surprising way to think about math (e.g. the notion that one input causes the next input). It seems like a very interesting set of ideas to explore.
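
To make that concrete, here is a small sketch of the idea as I understand it (my own illustration, not notation from the post): the Fibonacci computation written as a DAG in which every node is produced by the same local rule applied to its parents, which is the symmetry that lets a finite description stand for an unboundedly large graph.

```python
# A sketch of the Fibonacci computation as a causal DAG with symmetry.
# Every non-base node is produced by the same local "structural equation"
# applied to its parents; that repeated local rule is the symmetry which
# lets a finite description stand for an unboundedly large graph.

def parents(n):
    """Parents of node n in the DAG (none for the base cases)."""
    return [] if n < 2 else [n - 1, n - 2]

def local_rule(n, parent_values):
    """The same cause-and-effect rule applied at every node."""
    return n if n < 2 else sum(parent_values)

def evaluate(n, cache=None):
    """Lazily expand and evaluate the DAG rooted at node n."""
    cache = {} if cache is None else cache
    if n not in cache:
        cache[n] = local_rule(n, [evaluate(p, cache) for p in parents(n)])
    return cache[n]

print(evaluate(10))  # 55: the output node, caused by the nodes before it
```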

2) Towards a mechanistic understanding of corrigibility by Evan Hubinger

This post builds on Paul Christiano’s post on Worst Case Guarantees. That post claims:

Even if we are very careful about how we deploy ML, we may reach the point where a small number of correlated failures could quickly become catastrophic… I think the long-term safety of ML systems requires being able to rule out this kind of behavior, which I’ll call unacceptable, even for inputs which are extremely rare on the input distribution.

Paul then proposes a procedure built around adversarial search, where one part of the system searches for inputs that produce unacceptable outputs in the trained agent, and talks more about how one might build such a system.
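
In code, the rough shape of that procedure might be something like the following sketch (my own schematic, not Paul’s actual proposal; the model, adversary, and overseer objects are hypothetical stand-ins passed in from outside):

```python
# Schematic sketch of training with an adversarial acceptability check.
# Every component here (model, adversary, overseer) is a hypothetical stand-in.

def adversarial_training_step(model, adversary, overseer_deems_unacceptable,
                              ordinary_loss, penalty_weight=10.0):
    """One training step: the ordinary objective, plus a penalty whenever the
    adversary can exhibit an input on which the model behaves unacceptably."""
    loss = ordinary_loss(model)

    # The adversary searches (by gradients, heuristics, or interpretability
    # tools) for an input that makes the model produce an unacceptable output.
    candidate = adversary.search_for_failure(model)

    if candidate is not None and overseer_deems_unacceptable(model, candidate):
        # Penalise the failure even though such inputs may be extremely
        # rare on the ordinary input distribution.
        loss = loss + penalty_weight * model.badness_on(candidate)

    model.update(loss)
    return loss
```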

Evan’s post tries to make progress on finding a good notion of acceptable behaviour from an ML system. Paul’s post offers two conditions about the ease of choosing an acceptable action (in particular, that it should not stop the agent achieving a high average reward and that it shouldn’t make hard problems much harder), but Evan’s conditions are about the ease of training an acceptable model. His two conditions are:

  1. It must be not that hard for an amplified overseer to verify that a model is acceptable.

  2. It must be not that hard to find such an acceptable model during training.

If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property (and achieve high average reward) are safe?
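
Spelled out as a predicate (just a restatement of the quoted question; every name here is an illustrative stand-in, and the ‘easy to train and verify’ half is left informal):

```python
def is_good_acceptability_condition(acceptable, models, achieves_high_reward, safe):
    """The quoted question made explicit: does every model that satisfies the
    acceptability property and achieves high average reward turn out to be
    safe? (The requirement that the property also be easy to train towards
    and easy for an amplified overseer to verify is not captured here.)"""
    return all(safe(m) for m in models
               if acceptable(m) and achieves_high_reward(m))
```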

The post then explores two possible approaches, act-based corrigibility and indifference corrigibility.

3) Logical Optimizers by Donald Hobson

This approach offers a solution to a simpler version of the FAI problem:

Suppose I was handed a hypercomputer and allowed to run code on it without worrying about mindcrime, then the hypercomputer is removed, allowing me to keep 1Gb of data from the computations. Then I am handed a magic human utility function, as code on a memory stick. [The approach below] would allow me to use the situation to make a FAI.

4) Deconfuse Yourself about Agency by VojtaKovarik

This post offers some cute formalisations, for example generalising the notion of anthropomorphism to A-morphization: morphing/modelling any system by using an alternative architecture A.
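
As a toy illustration of that idea (my own, and much cruder than the post’s definition): pick an architecture class, find its best-fitting instance for a target system, and measure how well it matches. Anthropomorphism is then the special case where the architecture class is ‘human-like agent’.

```python
# Toy sketch of A-morphization: model a target system as if it were an
# instance of some architecture class A, by picking the best-fitting
# instance of A and measuring the mismatch. Everything here is an
# illustrative stand-in, not the post's formal definition.

def a_morphize(target_behaviour, architecture_instances, test_inputs):
    """Return the best-matching instance of the architecture class on the
    given inputs, together with how many inputs it gets wrong."""
    def mismatch(instance):
        return sum(instance(x) != target_behaviour(x) for x in test_inputs)

    best = min(architecture_instances, key=mismatch)
    return best, mismatch(best)

# Example: "thermostat-morphize" a black-box system by modelling it as a
# simple threshold controller, trying a range of candidate thresholds.
target = lambda temp: "heat_on" if temp < 19.5 else "heat_off"
candidates = [
    (lambda temp, t=t: "heat_on" if temp < t else "heat_off")
    for t in range(15, 25)
]
best_model, errors = a_morphize(target, candidates, test_inputs=range(10, 30))
print(errors)  # 0: this architecture class fits the target system perfectly
```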

The post is an attempt to remove the need to explicitly use the term ‘agency’ in conversation, out of a sense that the word lacks substance. I’m not sure I agree with that; I think people are using it to talk about a substantive thing they don’t yet know how to formalise. Nonetheless, I liked the various technical ideas offered.

My personal favourite was the opening list of concrete architectures, organised by how ‘agenty’ they feel, which I will quote in full (a small code sketch of the first entry follows the list):

  1. Architectures I would intuitively call “agenty”:

    1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and utility function (or heuristic) used to evaluate positions.

    2. (semi-vague) “Classical AI-agent” with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).

    3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).

  2. Architectures I would intuitively call “non-agenty”:

    1. A hard-coded sequence of actions.

    2. Look-up table.

    3. Random generator (outputting x∼π on every input, for some probability distribution π).

  3. Multi-agent architectures:

    1. Ant colony.

    2. Company (consisting of individual employees, operating within an economy).

    3. Comprehensive AI services.
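
Here is the sketch promised above, for the first entry in that list: a Monte-Carlo-style move chooser that, as an ‘architecture’, is fully specified by its two parameters. This is flat Monte Carlo rather than full tree search, purely to keep the sketch short; the game hooks (legal_moves, apply_move, random_playout) are hypothetical stand-ins.

```python
# Sketch of a Monte-Carlo-style move chooser parametrized by the number of
# rollouts made each move and the utility (or heuristic) function used to
# evaluate the positions the rollouts end in.

def monte_carlo_agent(num_rollouts, utility_fn):
    """Return a move-choosing policy defined entirely by the two parameters."""
    def choose_move(state, legal_moves, apply_move, random_playout):
        def estimated_value(move):
            total = 0.0
            for _ in range(num_rollouts):
                final_state = random_playout(apply_move(state, move))
                total += utility_fn(final_state)
            return total / num_rollouts
        return max(legal_moves(state), key=estimated_value)
    return choose_move
```

Different settings of (num_rollouts, utility_fn) pick out different members of this one architecture family, which is the sense of ‘parametrized’ used in the list.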

5) Thoughts from a Two-Boxer by jaek

I really liked this post, even though the author ends by saying the post might not have much of a purpose any more.

Having written that last paragraph I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn’t properly engaging with the premises of the thought experiment.

The post was (according to me) someone thinking for themselves about decision theory and putting in the effort to clearly explain their thoughts as they went along.

My understanding of the main disagreement between academia’s CDT/EDT and the AI alignment community’s UDT/FDT alternatives is the same as Paul Christiano’s understanding, which is that they are motivated by slightly different questions (the former being more human-focused, the latter being motivated by what code to put into an AI). This post shows someone thinking through that and coming to the same realisation for themselves. I expect to link to it in the future as an example of this.
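
Since the CDT/EDT versus UDT/FDT split may be unfamiliar, here is the standard Newcomb’s problem calculation as a quick illustration (my own, not taken from the post): EDT conditions the contents of the opaque box on your choice, CDT holds them fixed because your choice cannot causally affect them, and FDT also ends up one-boxing, though by reasoning about the predictor running your decision procedure rather than by conditioning.

```python
# Standard Newcomb's problem payoffs, as a quick worked example of where
# causal and evidential styles of expected value come apart. (Illustration
# only; the numbers and accuracy figure are the usual textbook ones.)
#
# An accurate predictor put $1,000,000 in the opaque box iff it predicted
# you would one-box; the transparent box always holds $1,000.

ACCURACY = 0.99  # probability the predictor correctly predicted your choice

def edt_expected_value(action):
    """Evidential: condition the box contents on your action, since the
    prediction is strongly correlated with what you actually do."""
    p_million = ACCURACY if action == "one-box" else 1 - ACCURACY
    base = 0 if action == "one-box" else 1_000
    return base + p_million * 1_000_000

def cdt_expected_value(action, p_million_already_there):
    """Causal: the money is already in the box or it isn't, and your
    action cannot change that, so hold that probability fixed."""
    base = 0 if action == "one-box" else 1_000
    return base + p_million_already_there * 1_000_000

print(round(edt_expected_value("one-box")))   # 990000: one-boxing wins here
print(round(edt_expected_value("two-box")))   # 11000
for p in (0.0, 0.5, 1.0):
    # Whatever you already believe about the box, two-boxing gains $1,000,
    # which is why a CDT reasoner two-boxes.
    assert cdt_expected_value("two-box", p) == cdt_expected_value("one-box", p) + 1_000
```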