Note that the instrumental goal is importantly distinct from the subagent that pursues it. I think a big part of the insight in this post is the claim that “corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals”: to understand corrigibility, we can study the goals (i.e. the problem factorization) rather than the subagents.