I don’t really get how the regularizer is supposed to be different from these two cases—perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing?
To clarify, the difference was that in the case of the regularizer, the impact penalty would gradually be discarded as the AI learned more. In the case of an influence limiter, the AI would always be restricted to low-impact actions until it is retired.
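The distinction can be made concrete with a minimal sketch. Everything here (function names, the decay schedule, the hard threshold) is my own illustration, not something from the original posts: the regularizer's penalty weight decays toward zero as training progresses, while the influence limiter forbids high-impact actions no matter how much has been learned.

```python
def regularized_reward(task_reward, impact, step, decay=1e-4):
    """Impact as a regularizer: the penalty weight shrinks as training
    progresses, so the penalty is gradually discarded as the AI learns."""
    weight = 1.0 / (1.0 + decay * step)  # tends to 0 as step grows
    return task_reward - weight * impact

def influence_limited_reward(task_reward, impact, max_impact=1.0):
    """Influence limiter: high-impact actions are always disallowed,
    regardless of how far along training is."""
    if impact > max_impact:
        return float("-inf")  # action effectively forbidden forever
    return task_reward
```

Early in training the two behave similarly; the difference only shows up late, when the regularizer's weight has decayed away but the limiter's threshold still binds.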
In that case, why isn’t this equivalent to impact as a safety protocol? The period during which we use the regularizer is “training”.
(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)
This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered “training.” To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there’s the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.
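The two nested regimes can be sketched as an outer search over learners and an inner run of the learned learner. This is a toy illustration of the structure only (the names, the choice of "learning rate" as the evolved trait, and the least-squares task are all my own assumptions):

```python
def inner_training(learning_rate, data):
    """A 'lifetime': the learned learning algorithm updates its own
    parameter w by gradient steps on squared error (w*x - y)**2."""
    w = 0.0
    for x, y in data:
        w -= learning_rate * (w * x - y) * x
    return w

def outer_training(candidate_rates, data):
    """'Evolution': search over learners (here, over learning rates),
    keeping the one whose inner training ends with the lowest loss."""
    def loss(w):
        return sum((w * x - y) ** 2 for x, y in data)
    return min(candidate_rates, key=lambda lr: loss(inner_training(lr, data)))
```

Both loops are "training" in a legitimate sense, which is why the word is ambiguous once mesa optimization is in play.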
In the case of “Impact as a regularizer,” I meant it in the sense of a baby growing up to be an adult, whereas the safety protocol was meant to cover safety during the regime in which brains develop in the first place.
Also to note: I think I could have been substantially clearer in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note to the post saying “I don’t currently think about impact measures this way anymore.”
[EDIT: See my other comment, which explains my reply much better] You’re right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The regularizer case is one where we have given the AI autonomy and it is no longer in a regime where we can perform tests on it.