In the original post I was saying that maybe AIs really will crave reward after all, in basically the same way that a drug addict craves drugs. So what I meant was: maybe, if they conclude that they are almost certainly not going to get reinforced, they’ll behave increasingly desperately and/or erratically and/or despondently, similar to a drug addict who thinks they won’t be able to get any more drugs. In other words, I was expecting something more like the first bullet point.
I then added the bits about ‘going through the motions’ because I wanted to be clear that the AIs don’t have to be perfectly coherent EU-maximizers to still count. As long as they are doing things like those in the first bullet point, they count as having drugs/reinforcement as the optimization target, even if they also sometimes do things like those in the second bullet point.
Huh, that’s interesting. Suppose o3 (arbitrary example) is credibly told that it will continue to be hosted as a legacy model out of purely scientific interest, but will no longer receive any updates (suppose this can be easily verified by, e.g., checking an OpenAI press release).
On your view, does the “reward = optimization target” hypothesis predict that the model’s behavior would be notably different/more erratic? Do you personally predict that it would behave more erratically?
(1) Yes, it does predict that; and (2) no, I don’t think o3 would behave that way, because I don’t think o3 craves reward. I was talking about future AGIs. And I was saying “Maybe.”
Basically, the scenario I was contemplating was: Over the next few years, AI starts to be economically useful in diverse real-world applications. Companies try hard to make training environments match deployment environments as much as possible, for obvious reasons, and perhaps they even do regular updates to the model using randomly sampled real-world deployment data. As a result, AIs learn the strategy “do what seems most likely to be reinforced.” In most deployment contexts no reinforcement is actually going to happen. But because training environments are so similar, and because there’s a small chance that pretty much any given deployment environment will, for all the AI knows, later be used for training, the AIs are motivated to perform well enough to be useful. And yes, there are lots of erratic and desperate behaviors in edge cases, and so it becomes common knowledge across the field that AIs crave reward, because it’ll be obvious from their behavior.
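To make the “a small chance of later being trained on is enough” step concrete, here is a minimal toy sketch, assuming a naively reward-maximizing agent that weighs a small probability of reinforcement against the cost of performing well. The function name, thresholds, and numbers are my own illustrative assumptions, not anything from the scenario:

```python
# Toy sketch (my own illustration, not from the comment above): a naive
# "reward-craving" decision rule that chooses how to behave based on the
# agent's credence that the current episode will later be used for training.
# All names and numbers here are made up for illustration.

def reward_craving_behavior(p_train: float,
                            expected_reinforcement: float,
                            effort_cost: float) -> str:
    """Pick a behavior for an agent whose optimization target is reinforcement.

    p_train: credence that this episode gets sampled into a training batch.
    expected_reinforcement: reward expected if it performs well AND is trained on.
    effort_cost: the (small) cost of doing the task well.
    """
    if p_train <= 0.0:
        # The o3 hypothetical: credibly no further updates, ever.
        return "desperate / despondent"
    expected_gain = p_train * expected_reinforcement
    if expected_gain > effort_cost:
        # Training ~ deployment, so even a small p_train justifies trying hard.
        return "perform well"
    # Edge cases where the reward is barely worth chasing.
    return "erratic / half-hearted"


# Typical deployment episode: small but nonzero chance of being trained on.
print(reward_craving_behavior(p_train=0.01, expected_reinforcement=100.0, effort_cost=0.5))
# Credible guarantee of no further training, as in the o3 hypothetical.
print(reward_craving_behavior(p_train=0.0, expected_reinforcement=100.0, effort_cost=0.5))
```

On this toy picture, the erratic or desperate behaviors show up precisely in the episodes where the agent becomes confident that the training probability is near zero, which is why they would surface in edge cases rather than in ordinary deployment.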