My high-level model for how the first AGI systems will be built is a black-box search, possibly plus transparency tools. (This might already be wrong, since you've said in a previous post that you think black-box searches are unlikely.) In that context, Corrigibility seems to be a property of the system itself, not of the objective. In other words, it's about Inner Alignment, not Outer Alignment.
In this context, it seems extremely difficult to build Corrigibility into the system. However, if I’ve understood the last section of this post correctly, the point of Corrigibility is not that we do something to insert this property into an otherwise non-corrigible training process or system, but that certain approaches lead to corrigible agents by default, and this is an argument for why we might expect the resulting systems to be aligned.
I presently don't understand why stock IDA (i.e., training system A_k to imitate the performance of the system [H with access to A_{k−1}]) would lead to corrigible systems. If the reason is in the details, this may not mean much, since I don't know the details. If the reason is that the generic idea leads to act-based agents because Distillation is implemented by something like imitation learning, this seems unconvincing, because it's about the training signal and not the model's true objective. If the reason is that the overseer is so powerful that it can also solve inner alignment concerns, then I get the idea much better and am uncertain what to think. (Although that would seem to imply that transparency tools are an extremely important component of making this work.)
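To make sure I'm picturing the same procedure, here is a toy sketch of the loop I have in mind. All the names (Overseer, amplify, distill, ida) are placeholders I made up, and the "distillation" step is just a lookup table standing in for imitation learning, so this is only meant to pin down the structure, not any actual implementation:

```python
# Toy sketch of the stock IDA loop as I understand it. Every name here is a
# hypothetical placeholder, not a reference to any existing codebase.

class Overseer:
    """Stand-in for the human H: decomposes questions and combines sub-answers."""
    def decompose(self, question):
        return [question]          # trivial placeholder decomposition
    def combine(self, question, subanswers):
        return subanswers[0]       # trivial placeholder aggregation

def amplify(human, model):
    """[H with access to A_{k-1}]: the overseer answers a question while
    delegating sub-questions to the current model."""
    def answer(question):
        subquestions = human.decompose(question)
        subanswers = [model(q) for q in subquestions]
        return human.combine(question, subanswers)
    return answer

def distill(amplified, questions):
    """Train A_k to imitate the amplified overseer's input/output behaviour
    (here a memoised lookup table stands in for imitation learning)."""
    table = {q: amplified(q) for q in questions}
    return lambda q: table.get(q, "unknown")

def ida(human, initial_model, questions, iterations):
    """Alternate amplification and distillation for a number of rounds."""
    model = initial_model
    for _ in range(iterations):
        model = distill(amplify(human, model), questions)
    return model
```

My worry in the paragraph above is exactly that the distill step only constrains input/output behaviour (the training signal), not whatever objective the learned model actually ends up with.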
I generally think of stock IDA as applied to question-answering only, leading to an Oracle. (This might also be wrong.) In that case, I think Corrigibility basically means that the Oracle doesn't try to influence the user's preferences through its answers? The objection that my brain spits out here is that being myopic sounds like a property that is also sufficient and easier to verify.