One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of evaluating actions directly, a corrigible agent essentially considers what a new separate proxy agent it’s designing would do. If it has an idea of what kind of proxy agent would take the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of treating proxy utility as its own, in this frame a corrigible agent considers what would happen with a proxy agent that holds that proxy utility, and how that proxy agent should function to avoid goodharting/misalignment trouble.
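To make the frame concrete, here is a minimal Python sketch. All names in it are hypothetical, and quantilization is just one example of an anti-goodharting design the corrigible agent might choose for its proxy; nothing here commits to a particular technique:

```python
import random
from typing import Callable, List

# Hypothetical names throughout, not an established API; quantilization
# stands in for whatever anti-goodharting design the agent settles on.
Action = str
Utility = Callable[[Action], float]

def quantilize(candidates: List[Action], utility: Utility, q: float = 0.1) -> Action:
    # Sample from the top-q fraction by utility instead of taking the single
    # argmax, where Goodhart errors concentrate.
    ranked = sorted(candidates, key=utility, reverse=True)
    return random.choice(ranked[: max(1, int(q * len(ranked)))])

class ProxyAgent:
    """The imagined separate agent that actually holds the proxy utility."""
    def __init__(self, goal: Utility, policy=quantilize):
        self.goal = goal
        self.policy = policy  # the anti-Goodhart design chosen by its designer

    def act(self, candidates: List[Action]) -> Action:
        return self.policy(candidates, self.goal)

class CorrigibleAgent:
    def act(self, proxy_utility: Utility, candidates: List[Action]) -> Action:
        # Instead of maximizing proxy_utility as its own, the agent asks what
        # an aligned proxy agent holding that utility would do, and does that.
        return ProxyAgent(goal=proxy_utility).act(candidates)
```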
The tricky part of this is respecting minimality. The proxy agent itself should be more like a pivotal aligned agent, built around the kind of thing the current action or plan is, rather than around the overall goals of the original agent. This way, passing to the proxy agent de-escalates the scope of optimization/cognition. More alarmingly, the original agent that’s corrigible in this sense now seemingly reasons about alignment, which requires all sorts of dangerous cognition. So one of the things a proxy agent should do less of is thinking about alignment: it should be less ambitiously corrigible.
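The de-escalation requirement can be phrased as an invariant on delegation. A sketch, with made-up integer quantities standing in for notions the paragraph above leaves informal:

```python
from dataclasses import dataclass

# Hypothetical rendering of the minimality constraint: every delegation
# must strictly shrink both the scope of optimization and the amount of
# reasoning-about-alignment the delegate is allowed to do.
@dataclass(frozen=True)
class CognitiveBudget:
    optimization_scope: int    # how far-reaching plans are allowed to be
    alignment_reasoning: int   # how much thinking about alignment is allowed

def budget_for_proxy(designer: CognitiveBudget, task_scope: int) -> CognitiveBudget:
    # The proxy is built around the current action or plan, not around the
    # designer's overall goals, so its scope is capped by the task itself.
    assert task_scope < designer.optimization_scope, "delegation must de-escalate"
    return CognitiveBudget(
        optimization_scope=task_scope,
        alignment_reasoning=designer.alignment_reasoning - 1,
    )
```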
Anything that makes a proxy agent safer (in the sense of doing less dangerous cognition) should be attempted for the original corrigible agent as well. So the most corrigible agent in this sequence of three is the human programmers, who perform dangerous alignment cognition to construct the original corrigible agent, which perhaps applies some alignment techniques when coming up with proxy agents for its actions, but doesn’t itself invent those techniques. And the proxy agents are less corrigible still in this sense; some of them might be playing a maximization game where straightforward optimization works (like chess or theorem proving), prepared for them by the original corrigible agent.
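At the bottom of this chain, the anti-Goodhart machinery can disappear entirely. A hypothetical illustration:

```python
# Hypothetical illustration of the bottom of the chain: a proxy agent handed
# a game that the original corrigible agent has already prepared as safe to
# optimize outright (chess, theorem proving) needs no corrigibility and no
# alignment reasoning of its own.
def play_prepared_game(legal_moves, score):
    # Plain argmax is fine here precisely because the game was chosen so
    # that maximizing it is the intended behavior.
    return max(legal_moves, key=score)
```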