Aligning a toy model of optimization
Suppose I have a magic box Opt that takes as input a program $U : \{0,1\}^n \to \mathbb{R}$ and produces $\mathrm{Opt}(U) = \arg\max_x U(x)$, with only $n$ times the cost of a single evaluation of $U$. Could we use this box to build an aligned AI, or would broad access to such a box result in doom?
This capability is vaguely similar to modern ML, especially if we use Opt to search over programs. But I think we can learn something from studying simpler models.
An unaligned benchmark
I can use the box Opt to define a simple unaligned AI (details omitted):
Collect data from a whole bunch of sensors, including a “reward channel.”
Use Opt to find a program M that makes good predictions about that data.
Use Opt to find a policy π that achieves a high reward when interacting with M.
This isn’t a great design, but it works as a benchmark. Can we build an aligned AI that is equally competent?
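The three steps above can be sketched in a few lines. This is only a toy illustration: `opt` is a brute-force stand-in for the box, and the 2-bit world, the `run_model` hypothesis class, and the logged data are all made up for the example.

```python
from itertools import product

def opt(utility, n):
    """Brute-force stand-in for the magic box: argmax of utility over n-bit tuples."""
    return max(product((0, 1), repeat=n), key=utility)

# Step 1: logged sensor data as (action, reward) pairs (hypothetical toy data).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 0)]

def run_model(model_bits, action):
    """Read model_bits as the hypothesis 'reward is 1 exactly when action == model_bits'."""
    return 1 if action == model_bits else 0

def prediction_score(model_bits):
    """Number of logged observations the candidate model predicts correctly."""
    return sum(run_model(model_bits, a) == r for a, r in data)

# Step 2: search for the model that best predicts the data.
model = opt(prediction_score, n=2)

def predicted_reward(policy_bits):
    """Reward the learned model assigns to a policy that always plays policy_bits."""
    return run_model(model, policy_bits)

# Step 3: search for the policy that does best against the learned model.
policy = opt(predicted_reward, n=2)
```

Here the policy is just a fixed action, so the second search is trivial; in the benchmark proper, both searches range over programs.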
(I haven’t described how Opt works for stochastic programs. The most natural definition is a bit complicated, but the details don’t seem to matter much. You can just imagine that it returns a random input x that is within one standard deviation of the optimal expected value.)
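One possible reading of the stochastic case, as a rough sketch: estimate each input's expected utility by sampling, then return a random input whose estimated mean is within one standard deviation of the best mean. Which standard deviation is meant is not specified above; using the best input's own sampled standard deviation, and the names `stochastic_opt` and `noisy_utility`, are assumptions made for this example.

```python
import random
import statistics
from itertools import product

def stochastic_opt(utility, n, samples=200, seed=0):
    """Return (choice, acceptable): a random near-optimal n-bit input and the full
    set of inputs that counted as near-optimal under the sampled estimates."""
    rng = random.Random(seed)
    stats = {}
    for x in product((0, 1), repeat=n):
        draws = [utility(x, rng) for _ in range(samples)]
        stats[x] = (statistics.fmean(draws), statistics.pstdev(draws))
    # Assumption: the tolerance is the standard deviation of the best input's utility.
    best_mean, best_std = max(stats.values(), key=lambda ms: ms[0])
    acceptable = [x for x, (mean, _) in stats.items() if mean >= best_mean - best_std]
    return rng.choice(acceptable), acceptable

def noisy_utility(x, rng):
    """Hypothetical stochastic program: number of 1 bits plus Gaussian noise."""
    return sum(x) + rng.gauss(0, 0.5)

choice, acceptable = stochastic_opt(noisy_utility, n=3)
```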
Competing with the benchmark
If I run this system with a long time horizon and a hard-to-influence reward channel, then it may competently acquire influence in order to achieve a high reward.
We’d like to use Opt to build an AI that acquires influence just as effectively, but that uses that influence to give us security and resources to reflect and grow wiser, and that remains responsive to our instructions.
We’d like the aligned AI to be almost as efficient. Ideally the proportional overhead would converge to 0 as we consider more complex models. At worst the overhead should be a constant factor.
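One way to make these two targets precise, as a sketch: write $C_{\text{bench}}(k)$ and $C_{\text{aligned}}(k)$ for the compute used by the benchmark and by the aligned AI at model complexity $k$. These symbols are introduced here for illustration only.

```latex
\text{Ideal:}\quad
\lim_{k \to \infty} \frac{C_{\text{aligned}}(k) - C_{\text{bench}}(k)}{C_{\text{bench}}(k)} = 0
\qquad\qquad
\text{At worst:}\quad
C_{\text{aligned}}(k) \le c \cdot C_{\text{bench}}(k)
\ \text{ for some constant } c \ge 1 \text{ and all } k
```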
My hope is to use Opt to learn a policy π+ which can answer questions in a way that reflects “everything π knows,” where π is the benchmark’s policy. This requires:
Setting up an objective that incentivizes π+ to give good answers to questions.
Arguing that there exists a suitable policy π+ that is only slightly more complicated than π.
If we have such a π+, then we can use it to directly answer questions like “What’s the best thing to do in this situation?” The hope is:
Its answers can leverage everything π knows, and in particular all of π’s knowledge about how to acquire influence. So using π+ in this way is competitive with using π directly.
It knows enough about human preferences to be corrigible.
“Everything π knows” is slippery; I mean something like “what a sufficiently-idealized Bayesian would believe after updating on the fact that π achieves a high reward.” Constructing an objective which incentivizes these answers probably requires understanding the nature of that update.
Thoughts on feasibility
In the context of ML, I usually imagine training π+ via iterated amplification. Unfortunately, iterated amplification doesn’t correspond to optimizing a single objective: it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next). If we just have Opt, it’s not clear if we can efficiently do anything like iterated amplification or debate.
If aligning Opt is impossible, I think that’s slightly bad news for aligning ML. That said, it’s reasonably likely that local search will be easier to align, so the next step would be constructing a simple model of local search.
There are also some ways in which the optimizer case seems easier:
It’s a simpler model and so more amenable to analysis. The Bayesian update from “π gets a high reward” is more straightforward when π is actually optimized.
We don’t have to worry about optimization difficulty.
Given a policy we can directly search for an input on which it behaves a certain way.
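The last point can be illustrated with a toy sketch: turn "the policy behaves a certain way" into a 0/1 utility and hand it to the optimizer. The `opt` function is a brute-force stand-in for the box, and the policy with a hidden 4-bit trigger is a made-up example.

```python
from itertools import product

def opt(utility, n):
    """Brute-force stand-in for the magic box: argmax of utility over n-bit tuples."""
    return max(product((0, 1), repeat=n), key=utility)

def policy(observation):
    """Hypothetical policy: outputs 0 everywhere except on one specific trigger input."""
    return 1 if observation == (1, 0, 1, 1) else 0

def shows_behavior(observation):
    """Utility is 1 exactly on inputs where the policy outputs 1."""
    return 1 if policy(observation) == 1 else 0

candidate = opt(shows_behavior, n=4)
# The optimizer always returns some input, so confirm the behavior really occurs there.
found = shows_behavior(candidate) == 1
```

If no input triggers the behavior, the search still returns an input, which is why the final check is needed.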
It’s OK if it’s impossible
When working on alignment I aim to find either a scalable alignment strategy or a clear argument for why scalable alignment is impossible. I’m excited about considering easy-to-analyze versions of the alignment problem even if they are impossible:
It gives us practice making impossibility arguments, and developing relevant intuitions and techniques.
It clarifies the difficulty of the alignment problem: if we know why we can’t handle simple cases like Opt, then we learn something about what the solution must look like in more complicated cases.
It gives us a sense of what impossibility results might look like, if we were able to prove them in more realistic cases. Would they actually be strong enough to guide action, or convince anyone skeptical?
Expensive optimization
I described Opt as requiring $n$ times more compute than $U$, where $U$ is the program being optimized and $n$ is the length of its input in bits. If we implemented it naively, by brute-force search over all inputs, it would instead cost $2^n$ times more than $U$.
We can use this more expensive Opt in our unaligned benchmark, which produces an AI that we can actually run (but it would be terrible, since it does a brute-force search over programs). It should be easier to compete with this really slow AI. But it’s still not trivial, and I think it’s worth working on. If we can’t compete with this benchmark, I’d feel relatively pessimistic about aligning ML.
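A minimal sketch of the naive implementation, with a counter to make the cost explicit: exhaustive search over n-bit inputs calls the program once per input, so 2^n times in total. The function name `naive_opt` and the example utility are assumptions for illustration.

```python
from itertools import product

def naive_opt(utility, n):
    """Exhaustive argmax over all n-bit inputs; also reports how often utility ran."""
    evaluations = 0
    best_x, best_value = None, float("-inf")
    for x in product((0, 1), repeat=n):
        value = utility(x)
        evaluations += 1
        if value > best_value:
            best_x, best_value = x, value
    return best_x, evaluations

# Hypothetical utility: prefer inputs with exactly three bits set.
best, calls = naive_opt(lambda x: -abs(sum(x) - 3), n=10)
```

With n = 10 the search already makes 1024 calls, and each extra input bit doubles that.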