Jeremy Gillen comments on Inner Alignment via Superpowers

Jeremy Gillen 30 Aug 2022 21:54 UTC
1 point
0
The reason this method might be useful is that it allows the agent to “fantasize” about actions it would take if it could, actions that we don’t want it to take in the real world. For example, it could explore bizarre hypotheticals like “turn the whole world into gold/computronium/hedonium”.
If we had a sub-human physical robot, and we were confident that it wouldn’t do any damage during unconstrained real-world training, then I can’t see any additional benefit to using our method. You could just do normal RL training.
And if it’s super-human, we wouldn’t want to use our technique in the real world, because it would turn us into gold/computronium/hedonium during training.
- Nathan Helm-Burger 30 Aug 2022 22:01 UTC
  2 points
  0
  Yes, it wouldn’t be able to go as far as those things, but you could potentially identify more down-to-earth, closer-to-reality problems in a robot. Again, I was just imagining a down-the-line scenario after you have proved this works well in simulations. For instance, this sort of setup (a weak robot that occasionally gets strong enough to be dangerous) in a training environment (a specially designed obstacle course) could find and prevent problems before deployment (shipping to consumers).