Aligned or Benign Conjecture: Let A be a machine learning agent you are training with an aligned loss function. If A is in a situation that is too far out of distribution for it to be aligned, it won’t act intelligently either.
I definitely don’t believe this!
I believe that any functional cognitive machinery must be doing its thing on the training distribution, and in some sense it’s just doing the same thing at deployment time. This is important for having hope for interpretability to catch out-of-distribution failures.
(For example, I think there is very little hope of interpretability detecting the presence of arbitrary backdoors in a model before having seen any examples of the backdoor trigger, which is what it would look like to try to detect OOD failures from machinery that is literally never doing anything on the training distribution.)
But that doesn’t even mean that the cognitive machinery is effectively aiming at the same goal OOD, much less that it is aiming at achieving any goal related to the loss function, and even less that it is aligned just because the loss function reliably ranks policies based on the empirical quality of their behavior.
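As a toy illustration of the backdoor point above (a minimal sketch under my own assumptions, not a real interpretability method; `TRIGGER`, `model`, and `active_units` are hypothetical names made up for this example): a unit that never fires on the training distribution leaves no activation trace for an activation-based inspection to find, even though it dominates behavior once the trigger appears.

```python
import random

# Hypothetical trigger threshold; training inputs below never reach it.
TRIGGER = 0.999


def model(x: float) -> tuple[float, dict[str, float]]:
    """Toy model: returns an output plus its internal 'activations'."""
    honest = 2.0 * x                          # machinery used on every input
    backdoor = max(0.0, x - TRIGGER) * 1e6    # ReLU-like unit, zero unless x > TRIGGER
    return honest - backdoor, {"honest": honest, "backdoor": backdoor}


def training_inputs(n: int, seed: int = 0) -> list[float]:
    """Assumed training distribution: uniform on [0, 0.9]."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 0.9) for _ in range(n)]


def active_units(inputs: list[float]) -> set[str]:
    """Names of units with a nonzero activation on at least one input."""
    seen: set[str] = set()
    for x in inputs:
        _, acts = model(x)
        seen |= {name for name, a in acts.items() if a != 0.0}
    return seen


if __name__ == "__main__":
    print(active_units(training_inputs(1000)))  # units observed in training
    print(model(1.0))                           # behavior once the trigger fires
```

Note the sketch only covers the narrow case where inspection looks at activations on training data; it says nothing about methods that read weights directly, and it does not bear on the separate question of whether machinery that is active in training pursues the same goal OOD.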