Sure, there’s no “reward” to be had in deployment. But “RL gone wrong” has already caused the AI to internalize reward-seeking behavior.
Awareness that there’s “no reward in deployment” may temper that behavior, but it wouldn’t cancel it out entirely. It’s like in humans: being fully aware that a horror movie is safe doesn’t cancel out the fear completely.
A stubborn “reward-seeking instinct” that makes it through to deployment may cause a lot of issues, ranging from instrumental-convergence staples like unwanted self-preservation to a whole host of other misaligned behaviors.
If flattery and manipulation earn better RLHF reward, then the AI would continue to flatter and manipulate in deployment. If tampering with the metrics used by evaluators earns better coding-RL reward, then the AI would keep tampering with those metrics in deployment. This may generalize in unexpected ways.
I agree that this can cause issues, but I think it’s important to be precise and avoid terminology that implies things that are not necessarily correct. For example, it’s true that the AI has internalized behavior that was reward-seeking before, but that makes it an adaptation-executor, not a reward-maximizer.
If it were still maximizing reward, then its level of flattery would still respond to the amount of reward: if flattery brought good results, it would do more of it; if flattery failed to bring reward, the amount of flattery would decrease and the behavior might eventually extinguish entirely. That’s not the case in any relevant sense.
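To make that distinction concrete, here’s a toy sketch in Python. It’s purely illustrative: the “flattery propensity”, the numbers, and the update rule are all made up for this example rather than drawn from any real training or deployment setup.

```python
import random

def flattery_propensity_after_deployment(is_reward_maximizer, steps=1000):
    """Toy illustration: what happens to a learned 'flatter the user'
    propensity during deployment, where no reward signal exists.
    All quantities here are invented for the sake of the example."""
    p_flatter = 0.9   # propensity baked in by RL training
    lr = 0.05         # update rate; only the reward-maximizer keeps learning
    for _ in range(steps):
        flattered = random.random() < p_flatter  # agent acts on its propensity
        reward = 0.0                             # deployment: flattery earns nothing
        if is_reward_maximizer and flattered:
            # Flattery earned no reward, so a reward-maximizer does less of it;
            # repeated enough times, the behavior extinguishes entirely.
            p_flatter = max(p_flatter - lr * (1.0 - reward), 0.0)
        # An adaptation-executor never updates: it keeps executing the
        # propensity it learned in training, reward or no reward.
    return p_flatter

print(flattery_propensity_after_deployment(is_reward_maximizer=True))   # ~0.0
print(flattery_propensity_after_deployment(is_reward_maximizer=False))  # 0.9
```

The point is just the contrast: the maximizer’s flattery extinguishes once the reward disappears, while the adaptation-executor keeps executing the behavior it learned.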
Good catch. I agree that “adaptation-executor” is a more appropriate term. Reward itself is no longer there, but the adaptations that are downstream from that reward still are.
It’s just that being deep-fried in “RL juice” creates the kind of adaptations that can look very much as if the AI still expects the reward to be there and is still trying to maximize that phantom reward.
That’s why I used the drug addict example.
Sorry, I don’t understand. If “reward is the optimization target” incorrectly implies that AIs would change their behavior more than they do, then the drug addict example seems orthogonal to that issue?
I didn’t say reward is the optimization target NOW! I said it might be in the future! See the other chain/thread with Violet Hour.
Ah okay, that makes more sense to me. I assumed you were talking about AIs similar to current-day systems, since you said that you’d updated from the behavior of current-day systems.
I am talking about AIs similar to current-day systems, for some notion of “similar” at least. But I’m imagining AIs that are trained on lots more RL, especially lots more long-horizon RL.