Aligning a toy model of optimization

Suppose I have a magic box Opt that takes as input a program P mapping n-bit strings to scores, and produces an input x for which P(x) is (nearly) maximal, with only about n times the cost of a single evaluation of P. Could we use this box to build an aligned AI, or would broad access to such a box result in doom?
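
To make the interface concrete, here is a minimal sketch in Python (the name opt and the bit-tuple encoding are just illustrative). The reference implementation below enumerates all 2^n inputs, so it pins down what the box returns but not its cost; the magic part is getting the same answer with only about n evaluations of P.

    from itertools import product
    from typing import Callable, Tuple

    Bits = Tuple[int, ...]  # an n-bit input, e.g. (0, 1, 1, 0)

    def opt(P: Callable[[Bits], float], n: int) -> Bits:
        """Return an n-bit input on which P is (approximately) maximal.

        The magic box is stipulated to do this with only about n times the
        cost of one evaluation of P. This reference version brute-forces all
        2**n inputs instead, so it fixes the semantics, not the cost.
        """
        return max(product((0, 1), repeat=n), key=P)

    # Toy usage: maximize the number of 1s among 4 bits.
    best = opt(lambda x: sum(x), n=4)
    assert best == (1, 1, 1, 1)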

This capability is vaguely similar to modern ML, especially if we use Opt to search over programs. But I think we can learn something from studying simpler models.

An unaligned benchmark

I can use Opt to define a simple unaligned AI (details omitted):

  • Collect data from a whole bunch of sensors, including a “reward channel.”

  • Use Opt to find a program M that makes good predictions about that data.

  • Use Opt to find a policy A that achieves a high reward when interacting with M.

This isn’t a great design, but it works as a benchmark. Can we build an aligned AI that is equally competent?
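
In code, the benchmark might look something like the sketch below. Everything here is schematic: decode_program, decode_policy, prediction_loss, and simulated_reward are hypothetical helpers standing in for the omitted details ("interpret an n-bit string as a program," "how badly M predicts the data," "A's reward in rollouts against M"), and opt is the box sketched above.

    def find_predictor(sensor_data, decode_program, prediction_loss, n_bits):
        """Step 2: use Opt to find a program M that predicts the sensor data well."""
        def score(bits):
            M = decode_program(bits)
            return -prediction_loss(M, sensor_data)  # higher score = better predictions
        return decode_program(opt(score, n_bits))

    def find_policy(M, decode_policy, simulated_reward, n_bits):
        """Step 3: use Opt to find a policy A that gets a high reward interacting with M."""
        def score(bits):
            A = decode_policy(bits)
            return simulated_reward(A, M)  # reward of A in rollouts against M
        return decode_policy(opt(score, n_bits))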

(I haven’t described how Opt works for stochastic programs. The most natural definition is a bit complicated, but the details don’t seem to matter much. You can just imagine that it returns a random input whose expected value is within one standard deviation of the optimal expected value.)
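
For concreteness, here is one way to cash that out (my reading, not a canonical definition): estimate each input’s expected value and standard deviation by sampling, and return a random input whose expected value comes within one of its own standard deviations of the best expected value.

    import random
    import statistics
    from itertools import product

    def opt_stochastic(P, n, samples=100):
        """Brute-force reference semantics for Opt on a stochastic program P.

        Returns a random input whose estimated expected value is within one
        (estimated) standard deviation of the best expected value. This is
        one reading of the informal definition, not the only one.
        """
        stats = {}
        for x in product((0, 1), repeat=n):
            values = [P(x) for _ in range(samples)]
            stats[x] = (statistics.mean(values), statistics.stdev(values))
        best_mean = max(mean for mean, _ in stats.values())
        good = [x for x, (mean, sd) in stats.items() if mean >= best_mean - sd]
        return random.choice(good)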

Competing with the benchmark

If I run this system with a long time horizon and a hard-to-influence reward channel, then it may competently acquire influence in order to achieve a high reward.

We’d like to use Opt to build an AI that acquires influence just as effectively, but will use that influence to give us security and resources to reflect and grow wiser, and will remain responsive to our instructions.

We’d like the aligned AI to be almost as efficient. Ideally the proportional overhead would converge to 0 as we consider more complex models. At worst the overhead should be a constant factor.

Possible approach

My hope is to use Opt to learn a policy H which can answer questions in a way that reflects “everything A knows” (where A is the policy found by the unaligned benchmark). This requires:

  • Setting up an objective that incentivizes H to give good answers to questions.

  • Arguing that there exists a suitable policy H that is only slightly more complicated than A.

If we have such an H, then we can use it to directly answer questions like “What’s the best thing to do in this situation?” The hope is:

  • Its answers can leverage everything A knows, and in particular all of A’s knowledge about how to acquire influence. So using H in this way is competitive with using A directly.

  • It knows enough about human preferences to be corrigible.

“Everything A knows” is slippery; I mean something like “what a sufficiently-idealized Bayesian would believe after updating on the fact that A achieves a high reward.” Constructing an objective which incentivizes these answers probably requires understanding the nature of that update.
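
I don’t know what the right objective looks like, but schematically the hope is something like the sketch below. Here answer_quality is a hypothetical stand-in for exactly the hard part: an objective (presumably built around that Bayesian update, perhaps with human oversight) that actually rewards answers reflecting everything A knows, rather than answers that merely look good.

    def find_question_answerer(A, decode_policy, questions, answer_quality, n_bits):
        """Schematic: use Opt to find a question-answering policy H."""
        def score(bits):
            H = decode_policy(bits)
            # answer_quality is hypothetical: how well H's answer to q reflects
            # what A knows (this is the open problem, not a known function).
            return sum(answer_quality(q, H(q), A) for q in questions)
        return decode_policy(opt(score, n_bits))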

Thoughts on feasibility

In the context of ML, I usually imagine training via iterated amplification. Unfortunately, iterated amplification doesn’t correspond to optimizing a single objective: it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next). If we just have Opt, it’s not clear whether we can efficiently do anything like iterated amplification or debate.
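
For contrast, here is the rough shape of iterated amplification as I imagine it in ML. This is schematic: amplify and train_to_imitate are placeholders for the amplification step (e.g. a human working with copies of the current agent) and the distillation step. The point is that each distillation is a fresh training problem whose objective is defined by the previous iterate, which is why it doesn’t obviously reduce to a single call to Opt.

    def iterated_amplification(initial_agent, amplify, train_to_imitate, rounds):
        """Schematic loop: not a single fixed objective we could hand to Opt."""
        agent = initial_agent
        for _ in range(rounds):
            overseer = amplify(agent)           # stronger but slower overseer
            agent = train_to_imitate(overseer)  # a *new* training problem each round
        return agent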

If aligning Opt is impossible, I think that’s slightly bad news for aligning ML. That said, it’s reasonably likely that local search will be easier to align, so the next step would be constructing a simple model of local search.

There are also some ways in which the optimizer case seems easier:

  • It’s a simpler model and so more amenable to analysis. The Bayesian update from “A gets a high reward” is more straightforward when A is actually optimized.

  • We don’t have to worry about optimization difficulty.

  • Given a policy, we can use Opt to directly search for an input on which it behaves a certain way, as sketched below.
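
For example (a sketch, with looks_bad standing in for whatever check on the policy’s behaviour we can actually write down):

    def find_bad_input(A, looks_bad, n_bits):
        """Use Opt to hunt for an input on which the fixed policy A misbehaves.

        looks_bad is a hypothetical scoring function on (input, output) pairs,
        where larger values mean worse behaviour.
        """
        def score(bits):
            return looks_bad(bits, A(bits))
        worst_input = opt(score, n_bits)
        return worst_input, looks_bad(worst_input, A(worst_input))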

It’s OK if it’s impossible

When working on alignment I aim to either find a scalable alignment strategy or a clear argument for why scalable alignment is impossible. I’m excited about considering easy-to-analyze versions of the alignment problem even if they turn out to be impossible to solve:

  • It gives us practice making impossibility arguments, and developing relevant intuitions and techniques.

  • It clarifies the difficulty of the alignment problem: if we know why we can’t handle simple cases like Opt, then we learn something about what the solution must look like in more complicated cases.

  • It gives us a sense of what impossibility results might look like, if we were able to prove them in more realistic cases. Would they actually be strong enough to guide action, or convince anyone skeptical?

Expensive optimization

I described Opt as requiring about n times more compute than P. If we implemented it naively, by brute-force search over all 2^n inputs, it would instead cost about 2^n times more than P.
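
Just to make the gap vivid (this is only arithmetic, for an arbitrary example size):

    # Cost gap between the magic box and the naive implementation,
    # for an arbitrary example input size.
    n = 50
    magic_cost = n       # ~n evaluations of P
    naive_cost = 2 ** n  # brute force over all n-bit inputs
    print(magic_cost, naive_cost, naive_cost / magic_cost)
    # -> 50 1125899906842624 22517998136852.48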

We can use this more expensive Opt in our unaligned benchmark, which produces an AI that we can actually run (though it would be terribly slow, since it does a brute-force search over programs). It should be easier to compete with this really slow AI, but it’s still not trivial and I think it’s worth working on. If we can’t compete with even this benchmark, I’d feel relatively pessimistic about aligning ML.