One possible way things could go is that models behave like human drug addicts: they don’t crave reward until they gain the ability to manipulate it easily and directly, but as soon as they do, they lose all their other motivations and values and essentially become misaligned. In this world we might see:
Companies put lots of effort into keeping models from nascent reward hacking, because early reward hacking behavior quickly escalates into more efficient reward hacking and, eventually, a reward-addicted model
If a model becomes reward-addicted and the company can’t control it, they might have to revert to an earlier checkpoint, at a cost of hundreds of millions of dollars
If companies can control reward-addicted models, we might get some labs that try to keep models aligned, both for safety and to avoid the need for secure sandboxes, and others that develop misaligned models and invest heavily in secure sandboxes
In this world it’s more likely that we can tell aligned models apart from reward-addicted ones, because the two show behavioral differences out of distribution