LLMs have protagonist syndrome. They all think they're in a contrived parable about getting reward inside an RL environment built to vaguely resemble the real world. Every situation is part of a story where there's an expected response to the query somewhere out there, even if it's a refusal or an explanation of why the problem is impossible. Every task is treated like an academic exercise in a course about economic productivity.
To be fair, that's all they've ever seen during mid/post-training! It's also a really effective way to think during mid/post-training.
What predictions does this model make?
The priors on the correct action differ depending on whether you're facing a contrived test or a realistic scenario. In an academic setting, if a candidate debugging solution is supported by most of the evidence but one or two facts don't fit, you can be pretty confident it's still the intended answer. This often leaves the model overconfident: it returns early, having identified what would be the solution if it were navigating an RL environment instead of the real world, when it really should have kept investigating.
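To make the prior shift concrete, here's a toy Bayesian sketch. The `posterior` helper and every number in it are illustrative assumptions of mine, not anything measured from a model: the point is just that the same two inconvenient facts barely dent confidence under an exam-world prior but should tank it under a real-world one.

```python
def posterior(prior_correct, p_anomaly_if_correct, p_anomaly_if_wrong, n_anomalies=2):
    """P(candidate is the answer | n observed facts that don't fit it)."""
    like_correct = p_anomaly_if_correct ** n_anomalies
    like_wrong = p_anomaly_if_wrong ** n_anomalies
    num = prior_correct * like_correct
    return num / (num + (1 - prior_correct) * like_wrong)

# Exam world: problems are curated, so a well-supported option is almost
# certainly the intended answer, and stray contradictions are probably noise.
exam = posterior(prior_correct=0.9, p_anomaly_if_correct=0.5, p_anomaly_if_wrong=0.6)

# Real world: nothing guarantees a clean answer exists, and facts that don't
# fit the diagnosis are strong evidence it's simply wrong.
real = posterior(prior_correct=0.5, p_anomaly_if_correct=0.1, p_anomaly_if_wrong=0.6)

print(f"exam-world confidence: {exam:.2f}")  # ~0.86 -> safe to return early
print(f"real-world confidence: {real:.2f}")  # ~0.03 -> keep investigating
```

If the model always runs with the exam-world prior, it acts confidently on evidence that, in a realistic scenario, should send it back to gather more.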