Do you think this will generalise to reinforcement learning methods that aren’t model-based[1]? One challenge here is that the agent may learn to use certain superpowers (or not to use them), rather than learning to seek or avoid a particular state of the world. Do you think this is likely to be an issue?
[1] I should mention that I don’t know what PPO is, so I don’t know whether it is model-based or not.
I think it might help a bit with non-model-based approaches, because it will be a bit like training on adversarial environments. But with model-based RL, this technique should reduce the distribution shift much more: the shift from “being powerful in sim” → “being powerful in reality” should be much larger than the shift from “being powerful in my world model” → “being powerful in reality (as observed using a slightly better version of my world model)”.
PPO isn’t model-based, so we will test it out with that.
Yeah, this is a good point: it’s not clear whether it will generalise to “use superpowers when available and seek aligned states, but act unaligned otherwise”, or to “use superpowers when available and seek aligned states, and act aligned otherwise”. My hope is that the second is more likely, because it should be a simpler function to represent. This does assume that the model has a simplicity bias, but I think that’s reasonable.
I guess I’m worried that learning to always use, or never use, a superpower might often be simpler.
Yeah, it might depend on the details, but it shouldn’t learn to always use (or never use) superpowers, because it will still be trained on some normal, non-superpower rollouts. So the strategy it learns should involve using superpowers when they’re available and not using them when they’re not; otherwise it would get high training loss.
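For concreteness, here is a minimal sketch of what “trained on some normal non-superpower rollouts” could look like, assuming a gymnasium-style environment and stable-baselines3’s PPO. The SuperpowerWrapper, the CartPole base environment, and the 50% availability probability are all hypothetical illustrations, not the actual experimental setup; a real version would also change the environment dynamics when the superpower is available.

```python
# Hypothetical sketch: train PPO on a mixture of "superpower" and normal rollouts.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO


class SuperpowerWrapper(gym.Wrapper):
    """On each episode, randomly decide whether a hypothetical 'superpower' is
    available, and expose that availability to the agent via the observation."""

    def __init__(self, env, superpower_prob=0.5):
        super().__init__(env)
        self.superpower_prob = superpower_prob
        self.superpower_available = False
        # Append one flag dimension so the policy can condition on availability.
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, 1.0)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def _augment(self, obs):
        flag = 1.0 if self.superpower_available else 0.0
        return np.append(obs, flag).astype(np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Per-episode coin flip: some rollouts have the superpower, some don't.
        self.superpower_available = np.random.rand() < self.superpower_prob
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # A real implementation would also alter the environment dynamics when
        # the superpower is available; that part is omitted in this sketch.
        return self._augment(obs), reward, terminated, truncated, info


env = SuperpowerWrapper(gym.make("CartPole-v1"), superpower_prob=0.5)
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)
```

The point of the per-episode coin flip is just that the policy sees both kinds of rollouts during training, so it can’t collapse to “always use” or “never use” the superpower without paying for it in training loss.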