Thanks for sharing the quote! It’s cool they clarified this.
> Rather, its behavior on our evaluations is evidence that it learned to “cheat” in training to receive higher reward
I do think this claim isn’t fully substantiated by their results, though. That is one plausible cause of the model cheating on METR’s task, but not the only possible explanation (I listed some others in the post, e.g. misgeneralization of persistence entrained by well-specified RL).
Ah, thanks! Added a link to your shortform