In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
Low-confidence disagreement here. If the AI has a very good model of how to achieve goal/reward X (which LLMs generally do), then the ‘reward optimizer’ policy elicits the set of necessary actions (like picking up lots of trash) that leads to that reward. In this sense, I think the ‘think about what actions achieve the goal and do them’ behavior will achieve better rewards and therefore be more heavily selected for. I think this also fits the framing of the behavioral selection model recently proposed by Alex Mallen (https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1), similar to the ‘motivation’ cognitive pattern.
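To make that selection claim concrete, here is a minimal toy sketch (my own illustration, not from the post or from Mallen's model): two hypothetical policies in a made-up trash-collection task, where "selection" just means keeping whichever policy earns more average reward. The policy that conditions its actions on the stated objective reliably wins.

```python
import random

# Toy environment (hypothetical, for illustration only):
# reward is 1 per piece of trash picked up over a 10-step episode.
def run_episode(policy, n_steps=10, seed=None):
    rng = random.Random(seed)
    reward = 0
    for _ in range(n_steps):
        trash_visible = rng.random() < 0.5
        action = policy(trash_visible)
        if action == "pick_up" and trash_visible:
            reward += 1
    return reward

# Policy A: acts without modelling what the reward criterion is.
def habitual_policy(trash_visible):
    return random.choice(["pick_up", "wander"])

# Policy B: "think about what actions achieve the goal and do them" --
# it conditions its action on the stated objective (pick up trash).
def goal_modelling_policy(trash_visible):
    return "pick_up" if trash_visible else "wander"

# Crude behavioral selection: keep whichever policy earns more average reward.
def select(policies, n_episodes=200):
    scores = {name: sum(run_episode(p) for _ in range(n_episodes)) / n_episodes
              for name, p in policies.items()}
    return max(scores, key=scores.get), scores

winner, scores = select({"habitual": habitual_policy,
                         "goal_modelling": goal_modelling_policy})
print(scores)   # goal_modelling averages roughly 2x the reward
print(winner)   # "goal_modelling" is the policy that gets selected
```

The point of the sketch is only that selection on reward favors whatever behavior best tracks the reward criterion, without the selected policy needing to intrinsically value reward.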
Why would the AI display this kind of explicit reward modelling in the first place? (1) We effectively tell the LLM what the goal is in certain RL tasks. (2) The most coherent persona/solution is one that explicitly models rewards and thinks about goals, whether that comes from assistant-persona training or from writing about AI.
So I think we should reconsider implication #1? If the above is correct, the AI can and will optimize for goals/rewards, just not in the intrinsic sense. This can be seen as a ‘cognitive groove’ that gets chiseled into the AI, but it is problematic in the same ways as the reward-optimization premise.