Buck comments on The Problem

Buck 15 Aug 2025 19:17 UTC
8 points
−1
Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this because of anecdotal reports and also graphs like the one from the Anthropic model card for Opus and Sonnet 4.
- habryka 4 May 2026 23:23 UTC
  9 points
  0
  My guess is you changed your mind on this? It seems pretty clear to me that GPT 5.5 and Opus 4.7 are much more reward-hacky than their predecessors, just substantially better at it (or like, they are much stronger apparent-success seekers, which IMO clearly was the reward they were trained on).
  - Eliezer Yudkowsky 5 May 2026 1:11 UTC
    8 points
    −2
    Noting again for the record that I would not be surprised if at some future stage, the model figures out what humans want to hear and see, errors and all, and then there is an apparent sudden amazing success with alignment.
  - Buck 9 Jun 2026 5:20 UTC
    4 points
    0
    This isn’t actually clear to me; is this just based on your impression? I’m not skeptical, I just don’t have an opinion.