Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89 for the helpful link. I’m continuing to read. From your answer there:
> A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.
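To make the "sometimes, but not always" concrete, here is a minimal sketch, my own toy illustration rather than anything from the post or the linked answer: value-learning updates on a two-armed bandit, where the reward drives every update, yet with purely greedy action selection the learned policy settles on the worse arm.

```python
# Toy two-armed bandit: arm 1 pays strictly more than arm 0 on every pull.
REWARDS = {0: 1.0, 1: 2.0}

# Value estimates for each arm, initialized to zero.
q = {0: 0.0, 1: 0.0}
alpha = 0.1  # learning rate

def act_greedy(q):
    # Pure exploitation: pick the arm with the higher estimate; ties go to arm 0.
    return 0 if q[0] >= q[1] else 1

for _ in range(1000):
    a = act_greedy(q)
    r = REWARDS[a]
    # The reward enters only as a reinforcement signal in this update step.
    q[a] += alpha * (r - q[a])

# The resulting greedy policy picks arm 0 forever (1.0 per pull), not the
# reward-optimal arm 1 (2.0 per pull), because arm 1 is never sampled.
print(q, act_greedy(q))
```

Add ε-greedy exploration or optimistic initial values and the same update rule does converge on arm 1, which is the sense in which the learned behaviour sometimes, but not always, ends up optimizing the reward.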
To the point of my review, this is the same type of argument made in TurnTrout’s comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.
I understand you as claiming that the Alignment Faking paper is an example of reward hacking. That’s a new perspective for me; I tried to understand it in this comment.
(Separately, I wish I had said “reinforcement” instead of “reward” above.)
You could have tagged me by selecting LessWrong docs, like this:
@Noosphere89