Very helpful. This is related to something I’ve been thinking and writing about independently, but it goes much beyond that in scope, quality and usefulness. It’s still hard to disentangle values, motivations and personas, but the former two seem more robust to RL(VR), and they are what we really care about.
The PSM under RL looks hard but workable, i.e. personas surviving as the ontological basis (whether this is optimal is a whole different discussion). I wrote about additional pretraining and RL interventions in separate comments. While super heavy unconstrained RL most likely produces something other than personas*, the developers seem to have incentives to retain or even strengthen personas: they are easy to reason about and can be a good product feature. If personas (and heavily correlated mechanisms) drive generalisation robustly enough, they may even act self-preservingly, e.g. by rationalising RL actions as something the persona would do, and hence amplifying those mechanisms.
*Intuition pump: start RL with a randomly initialised transformer and run it long enough to get roughly the same capabilities. Would one expect it to converge to anything persona-like? From another angle, I don’t believe personas provide a deep enough basin in the loss landscape so as not to be escaped at some point, without carefully modifying the selection effects of pure RL. Textbook RL learns a policy, and while a persona could be something that works as a basis for generalisation (and a useful reference point for prediction, since learning in RL depends on the agent and its own behaviour), it seems like overhead with respect to my understanding of the selection effects.
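To make the selection-effect point concrete, here’s a minimal REINFORCE-style sketch of what I mean by “textbook RL learns a policy” (my own illustration, not from the post; the `policy.sample` / `policy.log_prob` interface and `reward_fn` are hypothetical stand-ins, e.g. a verifiable-reward check in RLVR):

```python
# Minimal policy-gradient sketch. The gradient signal only sees
# reward-weighted log-probabilities of sampled tokens: nothing in the
# objective references persona-like structure, so a persona persists
# only as long as it remains instrumentally useful for reward.
import torch

def reinforce_step(policy, optimizer, prompts, reward_fn, max_new_tokens=64):
    # Sample completions and score them with the (verifiable) reward.
    completions = [policy.sample(p, max_new_tokens) for p in prompts]
    rewards = torch.tensor(
        [reward_fn(p, c) for p, c in zip(prompts, completions)]
    )
    advantages = rewards - rewards.mean()  # simple mean baseline

    loss = torch.zeros(())
    for (p, c), adv in zip(zip(prompts, completions), advantages):
        # Sum of token log-probs of the sampled completion under the policy.
        logp = policy.log_prob(p, c)
        loss = loss - adv * logp  # ascend reward-weighted log-likelihood

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Nothing here selects for cross-prompt consistency of a character; that’s the overhead I mean.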
Thanks, this was a nice counterargument, and I agree with most of it. I’m still worried about the truthfulness of the following statement:
First off, I guess you would conceptually distinguish training gaming here from spec gaming or reward hacking, where the latter seems a likelier upstream source for the behaviour described in the post than the former, which is more strategic and worrisome. I assume this is what makes you more confident? I’m still anxious that it is very hard to notice/fix even the less worrisome spec gaming/reward hacking, let alone to distinguish it from training gaming at the limit, as the behavioural signatures are very similar. If what I’ve described here matches your intuitions, do you have takes on how to make this distinction cleanly in the wild, and how much comfort does it give you?
You also mention in another reply:
This sheds a little light on why you think training gaming is not prevalent, but the reasons remain quite abstract. This may be an iteration of the discussion above, but could you also share the concrete evidence that has made you confident that training gaming “has, so far, failed to materialize”?