There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have more to say about this question.
Though I also wanted to just add a quick comment about this part:
> Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all.
It is not quite the same thing, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a “wrong” solution as they get for saying something like “I don’t know”. That would encourage wrong solutions insofar as they at least have some chance of occasionally earning reward when the model hits the expected answer “by accident” (i.e. for the wrong reasons), whereas an honest “I don’t know” never does. At least something like that seems to be suggested by this:
> The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated [reasoning] LLMs consistently claimed to have solved the problems [even if they didn’t]. This discrepancy poses a significant challenge for mathematical applications of LLMs as mathematical results derived using these models cannot be trusted without rigorous human validation.
Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
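To make the incentive concrete, here is a minimal sketch in Python. It assumes a hypothetical binary grader (not any lab's actual training setup) that pays 1 when the final answer matches the answer key and 0 for everything else, including an honest abstention:

```python
def expected_reward(action: str, p_correct: float) -> float:
    """Expected reward under a hypothetical binary grader that pays 1
    for a final answer matching the key and 0 for anything else,
    including saying "I don't know"."""
    if action == "guess":
        # Confident guess: reward 1 with probability p_correct, else 0.
        return p_correct * 1.0 + (1.0 - p_correct) * 0.0
    if action == "abstain":
        # "I don't know" never matches the answer key, so it always scores 0.
        return 0.0
    raise ValueError(f"unknown action: {action}")

# Even a 1%-accurate confident guess beats abstaining under this scheme.
for p in (0.01, 0.1, 0.5):
    print(f"p={p}: guess={expected_reward('guess', p):.2f}, "
          f"abstain={expected_reward('abstain', p):.2f}")
```

Under that scheme the expected reward for guessing is p and for abstaining is 0, so confident guessing weakly dominates abstention for any p > 0, which is exactly the kind of pressure toward bluffing suggested by the quote above.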