The use that wasn’t obvious from the ELK framing might be fixing issues with RL environments, grader prompts, canonical solutions, etc. that ultimately enable reward hacking and thus motivate dishonest behavior. Confessions can serve as bug reports about the datasets, not centrally about the AI. They likely fail to catch a lot of issues with the AI, but substantially improving the datasets might fix some of the things they failed to catch about the AI.
The use that wasn’t obvious from the ELK framing might be fixing issues with RL environments, grader prompts, canonical solutions, etc. that ultimately enable reward hacking and thus motivate dishonest behavior. Confessions can serve as bug reports about the datasets, not centrally about the AI. They likely fail to catch a lot of issues with the AI, but substantially improving the datasets might fix some of the things they failed to catch about the AI.