Matthew Barnett comments on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning

Matthew Barnett 16 Sep 2019 23:27 UTC
LW: 4 AF: 2
The period during which we use the regularizer is “training”.
This is misleading. In the case of mesa optimization, there are plausibly two regimes that could both legitimately be considered “training.” To use the analogy of human evolution, there is the training regime in which human brains evolved, and there is the training regime in which humans grow up from babies to adults. Both use some sort of learning algorithm, but the second uses a learning algorithm that was itself learned during the first.
In the case of “Impact as a regularizer,” I meant training in the sense of a baby growing up to be an adult, whereas the safety protocol was meant to imply safety over the regime in which brains are developed in the first place.