I don’t really get how the regularizer is supposed to be different from these two cases—perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing?
To clarify, the difference was that in the case of the regularizer, the impact penalty would gradually be discarded as the AI learned more. In the case of an influence limiter, the AI would always be restricted to low-impact actions until it is retired.
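The distinction can be made concrete with a minimal sketch. Everything here (function names, the decay schedule, the hard threshold) is my own illustration, not something from the original posts: the regularizer's penalty weight decays toward zero as training progresses, while the influence limiter forbids high-impact actions no matter how much has been learned.

```python
def regularized_reward(task_reward, impact, step, decay=1e-4):
    """Impact as a regularizer: the penalty weight shrinks as training
    progresses, so the penalty is gradually discarded as the AI learns."""
    weight = 1.0 / (1.0 + decay * step)  # tends to 0 as step grows
    return task_reward - weight * impact

def influence_limited_reward(task_reward, impact, max_impact=1.0):
    """Influence limiter: high-impact actions are always disallowed,
    regardless of how far along training is."""
    if impact > max_impact:
        return float("-inf")  # action effectively forbidden forever
    return task_reward
```

Early in training the two behave similarly; the difference only shows up late, when the regularizer's weight has decayed away but the limiter's threshold still binds.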
In that case, why isn’t this equivalent to impact as a safety protocol? The period during which we use the regularizer is “training”.
(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)
This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered “training.” To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there’s the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.
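The two nested regimes can be sketched as an outer search over learners and an inner run of the learned learner. This is a toy illustration of the structure only (the names, the choice of "learning rate" as the evolved trait, and the least-squares task are all my own assumptions):

```python
def inner_training(learning_rate, data):
    """A 'lifetime': the learned learning algorithm updates its own
    parameter w by gradient steps on squared error (w*x - y)**2."""
    w = 0.0
    for x, y in data:
        w -= learning_rate * (w * x - y) * x
    return w

def outer_training(candidate_rates, data):
    """'Evolution': search over learners (here, over learning rates),
    keeping the one whose inner training ends with the lowest loss."""
    def loss(w):
        return sum((w * x - y) ** 2 for x, y in data)
    return min(candidate_rates, key=lambda lr: loss(inner_training(lr, data)))
```

Both loops are "training" in a legitimate sense, which is why the word is ambiguous once mesa optimization is in play.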
In the case of “Impact as a regularizer,” I meant it in the sense of a baby growing up to be an adult, whereas the safety protocol was meant to cover safety during the regime in which brains develop in the first place.
Also to note: I think I could have been substantially clearer in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note to the post saying “I don’t currently think about impact measures this way anymore.”
[EDIT: See my other comment, which explains my reply much better] You’re right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The regularizer case is one where we have given the AI autonomy and it is no longer in a regime where we can perform tests on it.