gwern comments on Auditing language models for hidden objectives

gwern 4 Apr 2025 22:12 UTC
6 points

"This model seems to contradict https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target because it has, in fact, developed reward as the optimization target without ever being instructed to maximize reward."

It doesn’t contradict Turntrout’s post because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL setting (like a human, or an LLM like Claude pretrained to imitate model-based RL agents in a huge number of settings, i.e. human text data) optimizes the reward, if it’s smart and knowledgeable enough to do so.
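To make the model-free vs. model-based distinction concrete, here is a minimal toy sketch. It is not taken from either post or from the audited model. The three actions, the reward values, and every function name are hypothetical, and the "model-based" agent is reduced to a one-step planner that is simply handed a correct reward model. In the policy-gradient learner, reward appears only as a scalar that scales the update to whatever action was sampled. The planner instead queries predicted reward directly when choosing.

```python
import math
import random

# Hypothetical rewards. Action 2 pays the most but is never available in training.
REWARD = {0: 1.0, 1: 0.2, 2: 5.0}
TRAIN_ACTIONS = [0, 1]
ALL_ACTIONS = [0, 1, 2]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy_gradient(steps=2000, lr=0.1, seed=0):
    """REINFORCE-style updates on a softmax policy (model-free).

    Reward never enters the policy as something to be predicted or sought.
    It only scales the gradient of log pi(action) for the sampled action.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax([logits[a] for a in TRAIN_ACTIONS])
        action = rng.choices(TRAIN_ACTIONS, weights=probs)[0]
        reward = REWARD[action]
        for k, a in enumerate(TRAIN_ACTIONS):
            # d/d logit_a of log pi(action) for a softmax policy.
            grad_log_pi = (1.0 if a == action else 0.0) - probs[k]
            logits[a] += lr * reward * grad_log_pi
    return logits


def greedy_action(logits):
    """Deploy the trained policy: take its most probable action."""
    return max(ALL_ACTIONS, key=lambda a: logits[a])


def planned_action(reward_model):
    """One-step planner (model-based): pick the action with highest predicted reward."""
    return max(ALL_ACTIONS, key=lambda a: reward_model[a])


logits = train_policy_gradient()
model_free_choice = greedy_action(logits)
model_based_choice = planned_action(REWARD)
print(model_free_choice, model_based_choice)
```

In this toy, the logit for action 2 is never updated, so the trained policy has no pull toward it even once it becomes available, while the planner goes straight for it because it knows the payoff. Whether real systems resemble either caricature is exactly what is in dispute.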

(This comment is another example of how Turntrout’s post was a misfire because everyone takes away the opposite of what they should have.)
- Steven Byrnes 8 Apr 2025 2:35 UTC
  4 points
  I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
  (I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)