Cole Wyeth comments on o3 Is a Lying Liar

Cole Wyeth 7 May 2025 7:35 UTC
2 points
I think it’s hard to get a useful model, for reasons related to the blatant reward hacking: RL on long-horizon tasks is difficult without a well-defined reward signal.