I worry that in the context of corrigibility it’s misleading to talk about alignment, and especially about utility functions. If alignment characterizes goals, it presumes a goal-directed agent, but a corrigible AI is probably not goal-directed, in the sense that its decisions are not chosen according to their expected value for a persistent goal. So a corrigible AI won’t be aligned (nor will it be misaligned). Conversely, an agent aligned in this sense can’t be visibly corrigible, as its decisions are determined by its goals, not by the orders and wishes of its operators. (Corrigible AIs are interesting because they might be easier to build than aligned agents, and are useful as tools to defend against misaligned agents and to build aligned agents.)
In the process of gradually changing from a corrigible AI into an aligned agent, an AI becomes less corrigible in the sense that corrigibility ceases to help in describing its behavior; it stops manifesting. At the same time, goal-directedness starts to dominate the description of its behavior as the AI learns well enough what its goal should be. If during the process of learning its values it’s more corrigible than goal-directed, there shouldn’t be any surprises like sudden disassembly of its operators at the molecular level.