In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer’s true/actual values (in addition to around corrigibility) that allows the human-AI system to converge to correct values despite starting with distorted values.
No.
In a straightforward model of self-improving AI that tries to keep its values stable, like the Tiling Agents model, the graph looks like this:
```
[AI0] -> [AIS0] -> [AISS0] -> [AISSS0] -> [AISSSS0]
```
While the non-self-overwriting human overseer is modeled as being stable during that time:
```
o      o      o      o      o
^  ->  ^  ->  ^  ->  ^  ->  ^
^      ^      ^      ^      ^
```
A corrigible agent system, by contrast, is composed of a recursive chain of distinct successors, all of whose valence functions make direct reference to the valence functions of the same stable overseers.
The graph of a corrigible self-improving agent looks like this:
```
[AI0] -> [AIS0] -> [AISS0] -> [AISSS0] -> [AISSSS0]
  ^        ^          ^          ^            ^
  o        o          o          o            o
  ^        ^          ^          ^            ^
  ^        ^          ^          ^            ^
```
But it occurs to me that the overseer, or the system composed of the overseer and the corrigible AI, itself constitutes an agent with a distorted version of the overseer’s true or actual preferences.
It doesn’t matter where you draw the agent boundaries; the graph in the “corrigible self-improving AI” case is still fundamentally and [I think] load-bearingly different from the graph in the “naive self-improving AI” case.
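To make the structural difference concrete, here is a minimal toy sketch, entirely my own illustration: the one-dimensional “values”, the Gaussian noise, and every name in it are stand-in assumptions, not anything from the Tiling Agents paper or from Christiano. In the naive chain each successor copies its predecessor’s possibly-distorted values; in the corrigible chain each successor reads the same stable overseer directly.

```python
import random

random.seed(0)

OVERSEER_VALUES = 1.0   # stand-in for the overseer's true/actual values
NOISE = 0.05            # per-hand-off distortion when values are copied or inferred
STEPS = 20              # length of the successor chain

def distort(x):
    """Every hand-off or act of inference introduces a small, independent error."""
    return x + random.gauss(0.0, NOISE)

# Naive self-improving chain: AI_{n+1} inherits its values from AI_n,
# so distortions accumulate like a random walk along the chain.
naive = [distort(OVERSEER_VALUES)]
for _ in range(STEPS - 1):
    naive.append(distort(naive[-1]))

# Corrigible chain: every successor's value function points back at the
# same stable overseer, so each distortion is one fresh, independent draw.
corrigible = [distort(OVERSEER_VALUES) for _ in range(STEPS)]

print("final naive error:     ", abs(naive[-1] - OVERSEER_VALUES))
print("final corrigible error:", abs(corrigible[-1] - OVERSEER_VALUES))
```

Run it repeatedly and the first chain’s terminal error wanders like a random walk, growing with chain length, while the second stays on the order of a single step’s noise; the only thing doing the work is which node each arrow points back to.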
You point out that humans can have object-level values that do not reflect our own most deeply held inner values—like we might have political beliefs we would realize are heinous if we were reflective enough. In this way, humans can apparently be “misaligned” with ourselves in a single slice of time, without self-improving at all.
But that is not the kind of “misalignment” that the AI alignment problem is about. If an agent has deep preferences that mismatch with its shallow preferences, well, then, it really has those deep preferences. Just like the shallow preferences, the deep preferences should be a noticeable fact about the agent, or we are not talking about anything meaningful. A Friendly AI god might have shallow preferences that diverge from its deep preferences—and that will be fine, if its deep preferences are still there and still the ones we want it to have, because if deep preferences can be real at all, well, they’re called “deep” because they’re the final arbiters—they’re in control.
[ When thinking about human intelligence enhancement, I worry about a scenario where [post]humans “wirehead”, or otherwise self-modify such that the dopamine-crazed rat brain pulls the posthumans in weird directions. But this is an issue of stability under self-modification—not an issue of shallow preferences being more fundamental or “real” than deep preferences in general. ]
Something that is epistemically stronger than you—and frequently even something that isn’t—is smart enough to tell when you’re wrong about yourself. A sufficiently powerful AI will be able to tell what its overseers actually want—if it ends up wanting to know what they want badly enough that it stares at its overseers long enough to figure that out.
Of course, you could say, well, there is some degree to which humanity or humans are not stable during the time the corrigible AI is self-improving. That is true, but it would be very strange if that slight instability of humanity/humans somehow resulted in every member of the long chain of successor AIs, each looking back at its creators and trying to figure out what they actually want, making the same mistakes. That’s the value add of Christiano’s corrigibility model. I think it’s brilliant.
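To put the same point in symbols, in a formalization that is entirely mine and that idealizes heavily by assuming each successor’s error in reading the overseer is independent, and that no single successor is certain to err past any given margin, write V for the overseer’s actual values and ε_k for the k-th hand-off’s distortion:

```latex
% Naive chain: AI_n inherits from AI_{n-1}, so distortions accumulate.
\[
  \hat V_n^{\mathrm{naive}} \;=\; V + \sum_{k=1}^{n} \varepsilon_k
\]
% Corrigible chain: AI_n reads the same stable overseer directly,
% so its error is a single fresh draw.
\[
  \hat V_n^{\mathrm{corrigible}} \;=\; V + \varepsilon_n
\]
% For the corrigible chain to lock in a distortion larger than \delta,
% every successor would have to make the same-direction mistake:
\[
  \Pr\!\big[\forall k \le n:\ \varepsilon_k > \delta\big]
  \;=\; \prod_{k=1}^{n} \Pr[\varepsilon_k > \delta]
  \;\le\; p^{\,n},
  \qquad p := \max_k \Pr[\varepsilon_k > \delta] < 1 .
\]
```

If the errors are correlated rather than independent, the product bound fails; that is exactly the “every successor making the same mistake” scenario the paragraph above calls strange.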
As far as I can tell, the only way a successfully-actually-corrigible-in-Christiano’s-original-technical-sense [not silly LLM-based-consumer-model UX-with-an-alignment-label-slapped-on “corrigible”] AI [ which thing humanity currently has no idea how to build ] could become crooked and lock in values that get us paperclipped is if you also posit that the humans are being nudged to have different preferences by the corrigible AI-successors during this process, which seems like it would take a lot of backbending on the AI’s part.
Please someone explain to me why I am wrong about this and corrigibility does not work.