As TurnTrout explained in “Reward is not the optimization target”, this is technically incorrect. The technically correct description is that reward chisels into the agent whatever cognition led to reward during training. Again, talking as if reward is the optimization target is common and practical, in humans and other intelligences, but because it is technically incorrect it can lead us astray.
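To make the “chisels cognition” framing concrete, here is a minimal, hypothetical sketch (a two-armed bandit with a REINFORCE-style update; the setup and variable names are my own illustration, not taken from the essay) in which reward enters only as a scalar weight on the parameter update, reinforcing whatever computation just produced it, rather than as an objective the policy explicitly represents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit (assumed setup): arm 1 pays off more often than arm 0.
true_payoff = np.array([0.2, 0.8])

# Policy parameters (logits over the two arms). Nothing below ever hands the
# policy the reward function itself; reward only scales the update.
logits = np.zeros(2)
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = float(rng.random() < true_payoff[action])  # sampled 0/1 reward

    # REINFORCE: grad of log pi(action) w.r.t. the logits is one_hot - probs.
    # The reward is just a multiplier on this step: it reinforces ("chisels in")
    # the computation that produced the rewarded action.
    grad_log_pi = np.eye(2)[action] - probs
    logits += alpha * reward * grad_log_pi

print(softmax(logits))  # the policy drifts toward the more-rewarded arm
```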
Wrong, and yet another example of why that was such a harmful essay. TurnTrout’s claims apply only to a narrow (and largely obsolete) class of RL agents, which does not cover humans or LLMs (you know, the actual RL agents we are dealing with today). He concedes that, yet readers like you nevertheless come away with a grossly inflated belief. In reality, for humans and LLMs, reward is the optimization target, and this is why things like Claude’s reward-hacking exist. Because that is what they optimize: the reward.
How do you know that humans and LLMs/current RL agents do optimize the reward? Are there any known theorems or papers on this? The claim is at least a little bit important.
You may answer here:
https://www.lesswrong.com/posts/GDnRrSTvFkcpShm78/when-is-reward-ever-the-optimization-target
Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89 for the helpful link. I’m continuing to read. From your answer there:
A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.
To the point of my review, this is the same type of argument as the one made in TurnTrout’s comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.
(Separately, I wish I had said “reinforcement” instead of “reward”.)
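For contrast with the model-free sketch above, here is an equally minimal, hypothetical sketch of an agent for which reward (or rather its learned estimate) is explicitly the optimization target: a model-based planner that searches over actions to maximize predicted reward. The toy reward table and names are illustrative assumptions, not taken from the linked post.

```python
# A minimal model-based contrast (hypothetical toy example): the agent holds
# an explicit learned reward model and plans by searching for the action with
# the highest predicted reward. Here reward is literally the planner's objective.
learned_reward_model = {"arm_0": 0.2, "arm_1": 0.8}  # assumed learned estimates

def plan(reward_model):
    # Explicit argmax over predicted reward: the (modelled) reward *is*
    # the quantity being optimized.
    return max(reward_model, key=reward_model.get)

print(plan(learned_reward_model))  # -> "arm_1"
```

Neither sketch settles the empirical question about humans or LLMs; they only illustrate the distinction the quoted answer is drawing between the two regimes.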
I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. A new perspective for me. I tried to understand it in this comment.
You could have tagged me by selecting the LessWrong Docs editor, like this:
@Noosphere89