In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I’m curious if you agree with that assessment.
So the model learns to behave in ways that it thinks the RM will reinforce, not just in ways the RM actually reinforces. Right? That seems at least fairly conceptually similar to playing the training game, and at least some evidence that reward can sometimes become the optimization target?
Awesome work!
FYI, see this paper for a more detailed explanation of the model in question and how it's trained.