Aligning a toy model of optimization
Suppose I have a magic box Opt that takes as input a program U: {0,1}^n → ℝ, and produces Opt(U) = argmax_x U(x), with only n times the cost of a single evaluation of U. Could we use this box to build an aligned AI, or would broad access to such a box result in doom?
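To fix notation, here is a minimal Python sketch of the interface, with Opt implemented by brute force. The names are mine, not part of any real API, and the magic box is stipulated to be exponentially cheaper than this implementation:

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, ...]

def opt(u: Callable[[Bits], float], n: int) -> Bits:
    """Return argmax_x u(x) over all bitstrings x in {0,1}^n.

    This brute-force version costs 2^n evaluations of u; the magic
    box in the text is stipulated to cost only n evaluations.
    """
    return max(product((0, 1), repeat=n), key=u)

# Example: maximizing the number of ones yields the all-ones string.
best = opt(lambda x: float(sum(x)), n=4)
```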
This capability is vaguely similar to modern ML, especially if we use Opt to search over programs. But I think we can learn something from studying simpler models.
An unaligned benchmark
(Related.)
I can use Opt to define a simple unaligned AI (details omitted):
Collect data from a whole bunch of sensors, including a “reward channel.”
Use Opt to find a program M that makes good predictions about that data.
Use Opt to find a policy π that achieves a high reward when interacting with M.
This isn’t a great design, but it works as a benchmark. Can we build an aligned AI that is equally competent?
(I haven’t described how Opt works for stochastic programs. The most natural definition is a bit complicated, but the details don’t seem to matter much. You can just imagine that it returns a random x that is within one standard deviation of the optimal expected value.)
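The three steps above can be sketched end-to-end. Everything here is a toy stand-in I've invented for illustration: `opt` is a brute-force optimizer, "models" and "policies" are just bitstrings, and the fit and reward functions are trivial placeholders for prediction accuracy and interaction with the learned model:

```python
from itertools import product

def opt(u, n):
    """Brute-force stand-in for the magic optimizer: argmax over {0,1}^n."""
    return max(product((0, 1), repeat=n), key=u)

def unaligned_benchmark(sensor_data, n):
    # Step 2: find a "model" M (here just a bitstring) fitting the data.
    # Toy stand-in: score a model by how many data bits it matches.
    def fit(m):
        return sum(a == b for a, b in zip(m, sensor_data))
    M = opt(fit, n)

    # Step 3: find a policy achieving high reward against M.
    # Toy stand-in: reward is agreement with the learned model.
    def reward(p):
        return sum(a == b for a, b in zip(p, M))
    return opt(reward, n)

policy = unaligned_benchmark(sensor_data=(1, 0, 1, 1), n=4)
```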
Competing with the benchmark
(Related.)
If I run this system with a long time horizon and a hard-to-influence reward channel, then it may competently acquire influence in order to achieve a high reward.
We’d like to use Opt to build an AI that acquires influence just as effectively, but will use that influence to give us security and resources to reflect and grow wiser, and remain responsive to our instructions.
We’d like the aligned AI to be almost as efficient. Ideally the proportional overhead would converge to 0 as we consider more complex models. At worst the overhead should be a constant factor.
Possible approach
(Related.)
My hope is to use Opt to learn a policy π+ which can answer questions in a way that reflects “everything π knows.” This requires:
Setting up an objective that incentivizes π+ to give good answers to questions.
Arguing that there exists a suitable policy π+ that is only slightly more complicated than π.
If we have such a π+, then we can use it to directly answer questions like “What’s the best thing to do in this situation?” The hope is:
Its answers can leverage everything π knows, and in particular all of π’s knowledge about how to acquire influence. So using π+ in this way is competitive with using π directly.
It knows enough about human preferences to be corrigible.
“Everything π knows” is slippery; I mean something like “what a sufficiently-idealized Bayesian would believe after updating on the fact that π achieves a high reward.” Constructing an objective which incentivizes these answers probably requires understanding the nature of that update.
Thoughts on feasibility
In the context of ML, I usually imagine training π+ via iterated amplification. Unfortunately, iterated amplification doesn’t correspond to optimizing a single objective—it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next). If we just have Opt, it’s not clear if we can efficiently do anything like iterated amplification or debate.
If aligning Opt is impossible, I think that’s slightly bad news for aligning ML. That said, it’s reasonably likely that local search will be easier to align, so the next step would be constructing a simple model of local search.
There are also some ways in which the optimizer case seems easier:
It’s a simpler model and so more amenable to analysis. The Bayesian update from “π gets a high reward” is more straightforward when π is actually optimized.
We don’t have to worry about optimization difficulty.
Given a policy π we can directly search for an input on which it behaves a certain way.
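That last point is easy to make concrete: to find an input on which a policy behaves a certain way, hand Opt the indicator of that behavior. A toy sketch, again with a brute-force `opt` stand-in and a deliberately trivial "policy":

```python
from itertools import product

def opt(u, n):
    """Brute-force stand-in for the magic optimizer."""
    return max(product((0, 1), repeat=n), key=u)

def find_input_where(policy, predicate, n):
    """Search for an input x with predicate(policy(x)) == True.

    The indicator objective makes Opt do the adversarial search for us;
    if no such input exists, the optimizer's answer fails the check and
    we return None.
    """
    x = opt(lambda inp: 1.0 if predicate(policy(inp)) else 0.0, n)
    return x if predicate(policy(x)) else None

# Toy policy: outputs the parity of its input bits.
parity = lambda x: sum(x) % 2
x = find_input_where(parity, lambda out: out == 1, n=3)
```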
It’s OK if it’s impossible
When working on alignment I aim to either find a scalable alignment strategy or a clear argument for why scalable alignment is impossible. I’m excited about considering easy-to-analyze versions of the alignment problem even if they are impossible:
It gives us practice making impossibility arguments, and developing relevant intuitions and techniques.
It clarifies the difficulty of the alignment problem—if we know why we can’t handle simple cases like Opt, then we learn something about what the solution must look like in more complicated cases.
It gives us a sense of what impossibility results might look like, if we were able to prove them in more realistic cases. Would they actually be strong enough to guide action, or convince anyone skeptical?
Expensive optimization
I described Opt as requiring n times more compute than U. If we implemented it naively it would instead cost 2^n times more than U.
We can use this more expensive Opt in our unaligned benchmark, which produces an AI that we can actually run (but it would be terrible, since it does a brute force search over programs). It should be easier to compete with this really slow AI. But it’s still not trivial and I think it’s worth working on. If we can’t compete with this benchmark, I’d feel relatively pessimistic about aligning ML.
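To see the 2^n-vs-n gap concretely, here is the naive implementation instrumented to count how many times it evaluates U (a toy sketch; the names are mine):

```python
from itertools import product

def naive_opt(u, n):
    """Brute-force Opt that also reports how many times it evaluated u."""
    calls = [0]
    def counted(x):
        calls[0] += 1
        return u(x)
    best = max(product((0, 1), repeat=n), key=counted)
    return best, calls[0]

best, cost = naive_opt(lambda x: sum(x), n=10)
# cost == 2**10 == 1024 evaluations, versus the stipulated n == 10.
```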