How To Go From Interpretability To Alignment: Just Retarget The Search
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.
In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.
We’ll need to make two assumptions:
Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
Given these two assumptions, here’s how to use interpretability tools to align the AI:
Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
Identify the retargetable internal search process.
Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
Just retarget the search. Bada-bing, bada-boom.
Of course as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or “human mimicry” are more likely than “human values”). But those details seem relatively straightforward.
A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well, otherwise our target may fall apart.
Then, of course, there’s the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).
First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.
Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.
But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, move it around and hide it in different black boxes but never actually eliminate it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.
As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.