Huh, that’s interesting. Suppose o3 (arbitrary example) is credibly told that it will continue to be hosted as a legacy model for purely scientific interest, but will no longer receive any updates (suppose this can be easily verified by, e.g., checking an OpenAI press release).
On your view, does the “reward = optimization target” hypothesis predict that the model’s behavior would be notably different/more erratic? Do you personally predict that it would behave more erratically?
(1) Yes, it does predict that, and (2) no, I don’t think o3 would behave that way, because I don’t think o3 craves reward. I was talking about future AGIs. And I was saying “Maybe.”
Basically, the scenario I was contemplating was: Over the next few years AI starts to be economically useful in diverse real-world applications. Companies try hard to make training environments match deployment environments as closely as possible, for obvious reasons, and perhaps they even do regular updates to the model using randomly sampled real-world deployment data. As a result, AIs learn the strategy “do what seems most likely to be reinforced.” In most deployment contexts no reinforcement will actually happen, but training environments are so similar to deployment, and there’s a small chance that pretty much any given deployment episode will, for all the AI knows, later be used for training, that this is enough to motivate the AIs to perform well enough to be useful. And yes, there are lots of erratic and desperate behaviors in edge cases, and so it becomes common knowledge across the field that AIs crave reward, because it’ll be obvious from their behavior.
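To make that incentive concrete, here’s a minimal toy sketch of the expected-reinforcement comparison such a reward-craving agent would be making. All the probabilities and reward values are made-up assumptions for illustration, not anything measured.

```python
# Toy expected-reinforcement comparison for a hypothetical reward-craving agent.
# All numbers below are illustrative assumptions, not measurements.

p_trained_on = 0.01     # assumed chance this deployment episode is later sampled for training
r_perform_well = 1.0    # reinforcement if the episode is trained on and the agent performed well
r_slack = 0.0           # reinforcement if the episode is trained on and the agent slacked off

# Expected reinforcement under each strategy:
ev_perform_well = p_trained_on * r_perform_well  # 0.01
ev_slack = p_trained_on * r_slack                # 0.0

# Even though most episodes are never trained on, "perform well" strictly
# dominates whenever p_trained_on > 0, which is the incentive sketched above.
assert ev_perform_well > ev_slack
```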