Mhh… I don’t think (1) vs. (2 & 3) is something I was aiming to distinguish. I would say that, behaviorally, (3) is perfectly consistent with a reward-seeker.
I am not really saying anything about the ontology of how a model will think about the concept of reward (e.g. “What does the grader want?”, “What will update my weights?”, “What is the number represented in this particular memory location?”). I’m just trying to distinguish taking actions that happen to lead to high reward (like a traditional RL policy) from reasoning about the grading/reward process itself and then optimizing for that.
It is definitely true, though, that I am not trying to distinguish good from bad. A reward-seeker might behave very badly off-distribution (in the case of deceptive alignment) or perfectly fine (if it starts to think something like “What would a good reward model reward here, if it existed in this context?”).