Some interesting points from Ethan Perez’s fireside chat at MATS
‘Grand vision’ of a model organism of scheming:
R1-like training procedure
Only a small fraction of the training environments incentivise reward hacking.
But from that, the model learns to be a generalized reward hacker
It also reasons through alignment faking and acts as if it’s aligned with human preferences, but it’s actually a hardcore reward hacker
Then it escapes the data center
‘Cheap experiments’ may look very different when we have really good AI safety assistants
e.g. complicated RL environments may be easy to build in the future
More notes here