My guess is you changed your mind on this? It seems pretty clear to me that GPT 5.5 and Opus 4.7 are just as reward-hacky as their predecessors, only substantially better at it (or like, they are much stronger apparent-success seekers, which IMO is clearly the reward they were trained on).
Noting again for the record that I would not be surprised if at some future stage, the model figures out what humans want to hear and see, errors and all, and then there is an apparent sudden amazing success with alignment.