However, the corrigibility-via-instrumental-goals does have the feel of “make the agent uncertain regarding what goals it will want to pursue next”.
That’s an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.
The “uncertainty regarding the utility function” enters here mainly when we invoke instrumental convergence, in hopes that the subagent can “act as though other subagents are also working torward its terminal goal” in a way agnostic to its terminal goal. Which is a very different role than the old “corrigibility via uncertainty” proposals.
That’s an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.
The “uncertainty regarding the utility function” enters here mainly when we invoke instrumental convergence, in hopes that the subagent can “act as though other subagents are also working torward its terminal goal” in a way agnostic to its terminal goal. Which is a very different role than the old “corrigibility via uncertainty” proposals.