Seth Herd comments on 3b. Formal (Faux) Corrigibility

Seth Herd 18 Jun 2024 22:23 UTC
LW: 5 AF: 4
This seems productive.
I don’t understand your proposal if it doesn’t boil down to “do what the principal wants” or “do what the principal says” (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent, and is therefore not relatively easy to define or train into an AGI.
This (possibly mistaken) reading of your corrigibility as “figure out what I want” is why I currently prefer the instruction-following route to corrigibility. I don’t want the AGI to guess at what I want any more than necessary. This has downsides, too; I’ll come back to those at the end.
I do hold the view your model of me states, but I think it’s only narrowly true, and probably not very useful:
It’s fine if the AGI does what I want and not what I say, as long as it’s correct about what I want.
I think this is true for exactly the right definition of “what I want”, but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That’s mixed with the danger that it’s incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I’d really rather be put into an experience machine where I’m the celebrated hero of the world, rather than make the real world good for everyone as I’d hoped).
Maybe I’ve misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I’ve misunderstood. All of your examples that I remember correspond to “doing what the principal wants” by a pretty common interpretation of that phrase.
Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You’ve pointed out some ways that following instructions could be a danger (although I think your genie examples aren’t the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility over trying to put more of the project into the AGI’s alignment. That’s another reason I’m spending my time thinking through this route to corrigibility instead of the one you propose.
Although again, I might be missing something about your scheme.
I just went back and reread “2. Corrigibility Intuition” (after writing the above, which I won’t try to revise). Everything there still looks like a flavor of “do what I want”. My model of Max says “corrigibility is more like ‘do your best to be correctable’”. It seems like correctable means correctable toward what the principal wants. So I wonder if your formulation reduces to “do what I want, with an emphasis on following instructions and being aware that you might be wrong about what I want”. That sounds very much like the Do What I Mean And Check formulation of my instruction-following approach to corrigibility.
Thanks for engaging. I think this is productive.
Just to pop back to the top level briefly, I’m focusing on instruction-following because I think it will work well and be the more likely pick for a nascent language-model agent AGI, from below human level to somewhat above it. If RL is heavily involved in creating that agent, that might shift the balance and make your form of corrigibility more attractive (and still vastly more attractive than attempting value alignment in any broader way). I think working through both of these is worthwhile, because those are the two most likely forms of first AGI, and the two most likely actual alignment targets.
I definitely haven’t wrapped my head around all of the pitfalls with either method, but I continue to think that this type of alignment target makes good outcomes much more likely, at least as far as we’ve gotten with the analysis so far.
I think this type of alignment target is also important because the strongest and most used arguments for alignment difficulty don’t apply to it. So when we’re debating slowing down AGI, proponents of going forward will be talking about these approaches. If the alignment community hasn’t thought through them carefully, there will be no valid counterargument. I’d still prefer that we slow AGI even though I think these methods give us a decent chance of succeeding at technical alignment. So that’s one more reason I find this topic worthwhile.
This has gotten pretty discursive, so don’t worry about responding to all of it.