(I wrote this in reply to a draft; apologies if the post has been load-bearingly updated since then.)
Consider a future AGI. The best argument I currently know of for why “general intelligence” is a thing at all (as opposed to all intelligence just boiling down to a bag of use-case-specific heuristics) is that search/optimization/world-modeling/planning/etc are naturally recursive; problems factor into subproblems, goals factor into subgoals. So, a natural general-purpose architecture for intelligence involves a “general intelligence module” which can take in a (sub)problem, and recursively pass (sub)subproblems to other instances of the general intelligence module.
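To make that recursive picture concrete, here's a minimal toy sketch (all names here are made up for illustration; this is not a claim about how a real AGI would be built): a single module either handles a (sub)problem directly, or factors it into subproblems and hands those off to fresh instances of itself.

```python
# Toy sketch of the recursive "general intelligence module" structure.
# Problem, solve, and the combine step are all illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class Problem:
    description: str
    subproblems: list["Problem"] = field(default_factory=list)


def solve(problem: Problem) -> str:
    """One module instance handling one (sub)problem."""
    if not problem.subproblems:
        # Base case: simple enough to handle directly.
        return f"solved({problem.description})"
    # Recursive case: delegate each subproblem to another instance of the
    # same module, then combine the partial solutions.
    partials = [solve(sub) for sub in problem.subproblems]
    return f"combine({', '.join(partials)})"


task = Problem(
    "dry the sweater",
    [Problem("find a drying rack"), Problem("hang the sweater")],
)
print(solve(task))  # combine(solved(find a drying rack), solved(hang the sweater))
```

The interesting part isn't the recursion itself; it's that the same module gets reused at every level, which is what would make the architecture "general" rather than a bag of use-case-specific heuristics.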
Let’s assume that general intelligence either does look vaguely like that, or at least can look vaguely like that. (This would be a potentially useful assumption to disagree with!)
Assuming that structure, consider how the different instances of the general intelligence module relate to each other. One module might be able to more efficiently achieve its own subgoal A by hacking/manipulating/overwriting another module’s subgoal B; then there would be two modules working toward subgoal A. But a “general intelligence” made of such modules would not be very generally intelligent; its modules would overwrite each other’s subgoals all the time, ruining the system’s ability to actually search/optimize/model/plan/etc on complicated problems.
So in order for a general intelligence built out of recursive subproblem-solver instances to work… those subproblem-solver instances need to have some kind of corrigibility constraint on their interactions. They need to somehow not try to hack each other, manipulate each other, etc.
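One toy way that constraint could be made structural, purely as an illustration (all names hypothetical): each module's goal is fixed when the module is created, and the only thing sibling modules can do to each other is submit a subproblem and read back a result, so there's no handle exposed for rewriting another module's goal.

```python
# Illustrative sketch: the inter-module interface deliberately exposes no way
# for one module to overwrite another module's goal. Python doesn't truly
# enforce this, but the structure shows the intended constraint.

from dataclasses import dataclass


@dataclass(frozen=True)  # the Goal object itself can't be mutated after creation
class Goal:
    description: str


class SubproblemSolver:
    def __init__(self, goal: Goal):
        self._goal = goal  # set once; not reachable through the public interface

    def request(self, subproblem: str) -> str:
        """The narrow interface: submit a subproblem, get back a result."""
        return f"result for {subproblem!r}, pursued under goal {self._goal.description!r}"


look_nice = SubproblemSolver(Goal("look nice"))
dry_sweater = SubproblemSolver(Goal("dry the sweater"))

# The (dry sweater) module can ask other modules for help...
print(dry_sweater.request("find a drying rack"))
# ...but there is no method through which it could replace look_nice's goal
# with its own. That absence is the (crude) corrigibility constraint here.
print(look_nice.request("pick an outfit"))
```

Of course, a real answer would have to be much stronger than "don't expose a setter": a sufficiently capable module could manipulate its siblings through the legitimate channel too, which is exactly why I think there's a nontrivial problem to solve here.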
I don’t know how a powerful general intelligence would handle that. I’m not sure I know how humans handle it, within our own minds, though maybe it’s an extension of the ideas in your post. But I do expect that strong AGI is possible and will solve this problem somehow, and that solution will involve some kind of corrigibility/nonmanipulation between the AGI’s components. More speculatively, I would guess that there is a natural convergent way to handle this problem, possibly even one which humans accidentally use already (though we clearly do not have a proper verbal or mathematical understanding of it).
The above is a frustratingly non-constructive argument; it points toward a convergent notion of corrigibility without really saying concrete things about what that notion might be. But it’s the best argument I currently know of that there’s a True Name for corrigibility/nonmanipulation/etc.
The sweater example is close but doesn’t quite hit the nail on the head. It’s not that the (dry sweater) planner would delete the (look nice) goal; that wouldn’t help dry the sweater! Rather, the (dry sweater) planner would try to commandeer more mental resources, i.e. recruit more planner-modules and steer more attention toward drying the sweater. That additional attention potentially helps dry the sweater. But as an accidental side-effect, attention would be steered away from other goals, e.g. (look nice). In short, the (dry sweater) planner-module can better dry the sweater by redirecting the (look nice) module to focus on drying the sweater instead.
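As a toy illustration of that dynamic (made-up names and numbers): if planner-modules draw on a shared, fixed pool of attention, then one module grabbing more units necessarily starves the other goals, without any module “intending” to sabotage anything.

```python
# Illustrative sketch: a fixed attention budget shared across goals.
# Commandeering is just reallocation; the starvation of other goals is a
# side effect, not an explicit attack on them.

attention = {"dry sweater": 2, "look nice": 2}  # units of attention per goal


def commandeer(pool: dict[str, int], winner: str) -> dict[str, int]:
    """Redirect the entire attention budget toward one subgoal."""
    total = sum(pool.values())
    return {goal: (total if goal == winner else 0) for goal in pool}


print(commandeer(attention, "dry sweater"))
# {'dry sweater': 4, 'look nice': 0} -- (look nice) now gets no attention at all
```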
… and that totally does happen in humans! Humans often get caught up in a specific subgoal, lose track of the broader goal which generated that subgoal in the first place, and end up optimizing for the subgoal in a way which doesn’t help the original goal. It’s the phenomenon of lost purposes, at an individual level.
(Likewise with the painting example: when looking at little patches of a large complex painting, people will totally lose track of context and overlook inconsistencies.)
It really doesn’t seem like humans “keep their eye on the ball” all the time, even in the large majority of day-to-day cases where cognition basically works.