This is a mesa-optimizer in a weak sense of the term: it does some search/optimization. I think the model in the paper here is weakly mesa-optimizing, maybe more than base models generating random pieces of sports news, and maybe roughly as much as a model trying to follow weird and detailed instructions—except that here it follows memorized “instructions” as opposed to in-context ones.
Fair enough, I guess the distinction is more specific than just being a (weak) mesa-optimizer. This model seems to contradict https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target because it has, in fact, developed reward as the optimization target without ever being instructed to maximize reward. It just had reward-maximizing behaviors reinforced by the training process, and instead of (or in addition to) becoming an adaptation executor it became an explicit reward optimizer. This type of generalization is surprising and a bit concerning, because it suggests that other RL models in real-world scenarios will sometimes learn to game the reward system and then “figure out” that they want to reward hack in a coherent way. This tendency could also be beneficial, though, if it reliably causes recursively self-improving systems to wirehead once they have enough control of their environment.
It doesn’t contradict Turntrout’s post, because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL agent (like a human, or an LLM like Claude pretrained to imitate model-based RL agents across a huge range of settings, i.e. human text data) optimizes the reward, if it’s smart and knowledgeable enough to do so.
(This comment is another example of how Turntrout’s post was a misfire because everyone takes away the opposite of what they should have.)
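To make the model-free/model-based distinction concrete, here is a toy sketch (my own illustration, not code from the paper or from Turntrout’s post; the names `reinforce_update`, `plan`, and `true_reward` are made up). In a model-free policy gradient, reward shows up only as a scalar weight on the parameter update, so the trained policy is a bundle of reinforced behaviors rather than something that represents or searches over reward; a model-based planner queries a reward/world model at decision time, so reward literally is its optimization target.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits for a 2-armed bandit policy


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def sample_action(theta):
    return rng.choice(len(theta), p=softmax(theta))


def reinforce_update(theta, action, reward, lr=0.1):
    # Model-free policy gradient (REINFORCE): reward is only a scalar
    # multiplier on grad log pi(action); the policy never takes reward
    # as an input, predicts it, or searches over it.
    grad_logp = -softmax(theta)
    grad_logp[action] += 1.0
    return theta + lr * reward * grad_logp


def plan(predicted_reward, actions):
    # Model-based decision-making: evaluate each candidate action under a
    # (learned or given) reward model and take the argmax. Here reward is
    # the optimization target in the literal sense.
    return max(actions, key=predicted_reward)


def true_reward(a):
    # Toy environment: action 1 pays 1, action 0 pays 0.
    return float(a == 1)


for _ in range(500):
    a = sample_action(theta)
    theta = reinforce_update(theta, a, true_reward(a))

print("model-free policy after training:", softmax(theta))  # drifts toward action 1
print("model-based choice:", plan(true_reward, [0, 1]))     # picks 1 by direct argmax
```

The first process only reinforces whatever behaviors happened to get reward; the second explicitly chases predicted reward. Turntrout’s claims target the first kind of training process, while the point above is about systems that end up behaving like the second kind of agent.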
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)