Rohin Shah comments on Aligning a toy model of optimization

Rohin Shah 21 Sep 2019 4:01 UTC
Planned summary:
Current ML capabilities are centered around **local search**: we get a gradient (or an approximation to one, as with evolutionary algorithms), and take a step in that direction to find a new model. Iterated amplification takes advantage of this fact: rather than a sequence of gradient steps on a fixed reward, we can do a sequence of amplification steps and distillation gradient steps.
However, we can consider an even simpler model of ML capabilities: function maximization. Given a function from n-bit strings to real numbers, we model ML as an oracle, Opt, that finds the input n-bit string with the maximum output value in only $O(n)$ time (rather than the $O(2^n)$ time that brute-force search would take). If this were all we knew about ML capabilities, could we still design an aligned, competitive version of it? While this is not the actual problem we face, its simplicity makes it more amenable to theoretical analysis, and so it is worth thinking about.
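To make the interface concrete, here is a minimal Python sketch of the maximization oracle (called Opt in the rest of this summary). It is only a stand-in: it does the $O(2^n)$ brute-force search that the toy model assumes ML lets us avoid, so it shows what Opt returns, not how fast it runs. The names `opt` and `bits_to_int` are made up for illustration.

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, ...]


def opt(f: Callable[[Bits], float], n: int) -> Bits:
    """Return an n-bit string that maximizes f.

    Assumption: this enumerates all 2**n strings. The toy model posits
    that the same answer is available in O(n) time; nothing here
    achieves that.
    """
    return max(product((0, 1), repeat=n), key=f)


def bits_to_int(bits: Bits) -> int:
    """Interpret a bit tuple as a big-endian unsigned integer."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


# Example objective: prefer the 3-bit string whose integer value is closest to 5.
closest_to_five = opt(lambda x: -abs(bits_to_int(x) - 5), 3)
```

Everything the toy model allows is expressed through calls like `opt(f, n)`; the alignment question is which functions `f` we can write down.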
We could make an unaligned AI that maximizes some explicit reward using only 2 calls to Opt: first, use Opt to find a good world model M that can predict the dynamics and reward, and then use Opt to find a policy that does well when interacting with M. This is unaligned for all the usual reasons: most obviously, it will try to seize control of the reward channel.
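The two-call structure can be sketched on a deliberately tiny example. Everything below is an assumption made for illustration: the "world" is a one-step, four-action bandit with no dynamics, the world model is an 8-bit reward table, the policy is a single 2-bit action, and `opt` is again a brute-force stand-in for the oracle. The actual proposal has M predict both dynamics and reward over long interactions.

```python
from itertools import product


def opt(f, n):
    """Brute-force stand-in for Opt: an n-bit tuple maximizing f."""
    return max(product((0, 1), repeat=n), key=f)


def to_int(bits):
    """Interpret a bit tuple as a big-endian unsigned integer."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def decode_model(bits):
    """Decode 8 bits into a table of 4 predicted rewards, each in 0..3."""
    return [to_int(bits[2 * a: 2 * a + 2]) for a in range(4)]


# Hypothetical logged experience: (action, observed reward) pairs.
LOGGED = [(0, 1), (1, 3), (2, 0), (3, 2)]


def model_fit(bits):
    """Score a candidate model by negative squared prediction error."""
    table = decode_model(bits)
    return -sum((table[a] - r) ** 2 for a, r in LOGGED)


# Call 1: find a world model M that predicts the logged rewards well.
model = decode_model(opt(model_fit, 8))


def predicted_reward(policy_bits):
    """Score a candidate policy (a single action) using M only."""
    return model[to_int(policy_bits)]


# Call 2: find a policy that does well when evaluated against M.
best_action = to_int(opt(predicted_reward, 2))
```

Note that the second call optimizes whatever M predicts the reward signal will be. In a richer environment, that is the step where a policy that tampers with the reward channel would score well.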
An aligned version does need to use Opt, since that’s the only way of turning a naively-exponential search into a linear one; without using Opt the resulting system won’t be competitive. We can’t just generalize iterated amplification to this case, since iterated amplification relies on a _sequence_ of applications of ML capabilities: this would lead to an aligned AI that uses Opt many times, which will not be competitive since the unaligned AI only requires 2 calls to Opt.
One possible approach is to design an AI with good incentives (in the same way that iterated amplification aims to approximate HCH) that “knows everything that the unaligned AI knows”. However, it would also be useful to produce a proof of impossibility: this would tell us something about what a solution must look like in more complex settings.
Planned opinion:
Amusingly, I liked this post primarily because comparing this setting to the typical setting for iterated amplification was useful for seeing the design choices and intuitions that motivated iterated amplification.