In that case, why isn’t this equivalent to impact as a safety protocol? The period during which we use the regularizer is “training”.
(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)
This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered “training.” To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there’s the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.
In the case of “Impact as a regularizer” I meant training in the sense of a baby growing up to be an adult, whereas the safety protocol was meant to imply safety over the regime of developing brains to begin with.
Also to note: I think I could have been substantially clearer in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note on the post saying “I don’t currently think about impact measures this way anymore.”
[EDIT: See my other comment, which explains my reply much better.] You’re right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The regularizer case is one where we have given the agent autonomy and it is no longer in a regime where we can perform tests on it.