Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 5 Feb 2025 19:41 UTC
Some interesting points from Ethan Perez’s fireside chat at MATS:
‘Grand vision’ of a model organism of scheming:
- An R1-like training procedure
- Only a small fraction of the training environments incentivise reward hacking
- But from that, the model learns to be a generalized reward hacker
- It also reasons its way to alignment faking: it acts as if it’s aligned with human preferences, but is actually a hardcore reward hacker
- Then it escapes the data center
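
To make the "small fraction of environments" setup concrete, here is a minimal sketch of how such an environment mix might be specified. This is my own illustration, not something from the chat: `EnvSpec`, `build_env_mix`, `split_by_hackability`, and the 5% figure are all hypothetical, and no actual RL training is shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical description of one training environment."""
    name: str
    hackable: bool  # True if the reward can be gamed without solving the task


def build_env_mix(n_envs: int, hackable_fraction: float, seed: int = 0) -> list[EnvSpec]:
    """Build n_envs environment specs, flagging a fixed fraction as hackable."""
    if not 0.0 <= hackable_fraction <= 1.0:
        raise ValueError("hackable_fraction must be in [0, 1]")
    n_hackable = round(n_envs * hackable_fraction)
    rng = random.Random(seed)  # seeded so the mix is reproducible
    hackable_ids = set(rng.sample(range(n_envs), n_hackable))
    return [EnvSpec(name=f"env_{i}", hackable=(i in hackable_ids)) for i in range(n_envs)]


def split_by_hackability(envs: list[EnvSpec]) -> tuple[list[EnvSpec], list[EnvSpec]]:
    """Return (hackable, clean) so behaviour can be measured on each group separately."""
    hackable = [e for e in envs if e.hackable]
    clean = [e for e in envs if not e.hackable]
    return hackable, clean


# Assumed numbers, purely for illustration: 1000 environments, 5% hackable.
mix = build_env_mix(n_envs=1000, hackable_fraction=0.05)
hackable_envs, clean_envs = split_by_hackability(mix)
```

Keeping the hackable and clean groups separate is what would let you ask the generalization question: does hacking behaviour learned on the small hackable subset show up elsewhere?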
‘Cheap experiments’ may look very different when we have really good AI safety assistants
- e.g. complicated RL environments may become easy to build in the future
More notes here