That is a very clever idea!
It’s very similar in shape to MONA (and to my variant of it), but the incentive structure is different. MONA and my variant both make the generator model myopic, so it only wants to earn approval from a trusted but weaker evaluator model for its current step. (My variant adjusts the degree of myopia by tuning the evaluator model’s optimization vs. alignment.)
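To make the contrast concrete, here’s a minimal sketch of the MONA-style reward as I understand it. Everything in it (the `weak_evaluator` object and its `approval` method) is a hypothetical placeholder, not actual MONA code:

```python
def mona_step_reward(weak_evaluator, context, step):
    # MONA-style myopia (as I understand it): the strong generator model is
    # rewarded only for how much a trusted but weaker evaluator approves of
    # this single step, never for any downstream outcome the step causes.
    # `weak_evaluator.approval` is a hypothetical placeholder interface.
    return weak_evaluator.approval(context, step)
```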
Your Incremental Steering idea instead rewards the stronger model for the final answer, but computed as if a trusted weaker model will take over its chain of thought after the next step. So it is constrained optimization: rather than enforcing myopia, it improves CoT faithfulness.
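And here is the same kind of sketch for how I’m reading your Incremental Steering reward. Again, `frozen_planner.complete_rollout` and `grader.grade_answer` are hypothetical placeholders standing in for the frozen planner’s rollout and the outcome reward, not anything from your writeup:

```python
def incremental_steering_step_reward(frozen_planner, grader, context, step):
    # Incremental Steering (as I'm reading it): after the optimized planner
    # writes one more CoT step, we assume the trusted frozen planner takes
    # over and finishes the chain of thought. The optimized planner's reward
    # for that step is the outcome reward of the completed rollout, so it is
    # still optimized for the final answer, but only via steps a weaker,
    # trusted model can pick up and run with.
    completed_cot, final_answer = frozen_planner.complete_rollout(context + [step])
    return grader.grade_answer(final_answer)
```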
I really like your idea, because it should improve CoT faithfulness by a lot!
While it makes reward hacking more detectable, I’m not sure it reduces reward hacking the way MONA does. The frozen planner (the weaker model) needs to reject unethical or reward-hacking subgoals very reliably; otherwise the optimized planner (the stronger model) only has to invest a small fraction of its CoT steering the frozen planner toward bad subgoals, and that steering still pays off in reward. Once the optimized planner is on its own at deployment, it will easily steer itself to those bad subgoals: even though it spends most of its CoT on the apparently aligned subgoal, it still follows through on the bad one.