Adding onto Rohin’s comment, assuming you’re interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. What I came up with ended up being quite similar to yours, including the use of “debate” over how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we’re worried this would lead to them colluding.
Adding more details to this picture:
We should keep a human in the loop as we train the judge and debater models. You can imagine we give a human rater a problem, an intended solution, a partial CoT trace for the problem, and two possible next steps. The human has to choose which step is better.
We should tell the rater to prioritize not just progress towards the goal, but also understandability. If the agent uses one step to make an inferential leap that the human doesn’t understand (even if it’s correct), that should be rated worse than taking many small steps that each make sense and together reach the same conclusion.
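To make the rater’s task concrete, here is a minimal sketch of what one comparison record might contain (the schema and field names are my own hypothetical illustration, not anything from the MONA paper):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class StepComparison:
    """One pairwise comparison shown to a human rater (hypothetical schema)."""
    problem: str            # the task the agent is solving
    intended_solution: str  # reference solution, if one exists
    partial_cot: str        # the CoT trace produced so far
    step_a: str             # candidate next step A
    step_b: str             # candidate next step B

@dataclass
class RaterLabel:
    preferred: Literal["a", "b"]
    # The rater weighs both progress toward the goal and understandability;
    # an opaque inferential leap scores worse than several small steps that
    # each make sense and reach the same conclusion.
    rationale: str
```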
The human’s task becomes harder the longer the CoT trace is, and it becomes obviously intractable at the level of 10T tokens per trace. This is the main reason we need the debate system.
The debater AIs go back and forth about which of the two steps is better until the human is convinced one way or the other (a rough sketch of this loop follows the examples below). Examples of things the debaters could point out:
“This step sounds insightful, but it’s just rephrasing this sentence from 50K tokens earlier in the CoT.”
“This step seems irrelevant. Maybe the agent is using steganography?”
“Although this step is incorrect, it gets the agent closer to the final answer, because…”
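Here is a rough sketch of how that debate loop could be orchestrated, assuming hypothetical interfaces for the debaters and for the decider (initially the human rater, later the trained judge):

```python
def run_debate(comparison, debater_a, debater_b, decide, max_rounds=8):
    """Debaters argue for their assigned candidate steps until the decider
    (the human rater here, later the trained judge) picks a winner.
    All names and interfaces are hypothetical placeholders."""
    transcript = []
    for _ in range(max_rounds):
        # Each debater sees the comparison plus the transcript so far and argues
        # that its assigned step is better, e.g. by pointing out repetition,
        # irrelevance/steganography, or hidden progress toward the answer.
        transcript.append(("A", debater_a.argue(comparison, transcript)))
        transcript.append(("B", debater_b.argue(comparison, transcript)))
        verdict = decide(comparison, transcript)  # returns "a", "b", or None
        if verdict is not None:  # the decider is convinced one way or the other
            return verdict, transcript
    # If never convinced within the budget, force an unaided judgment.
    return decide(comparison, transcript, force=True), transcript
```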
We train the judge to imitate the human, and train each debater to convince the judge that its assigned CoT step is the better one.
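For the judge, a natural choice is a standard pairwise preference loss on the human labels, along the lines of reward-model training. A sketch in PyTorch, with the model and data loading elided and all interfaces hypothetical:

```python
import torch.nn.functional as F

def judge_loss(judge, batch):
    """Train the judge to imitate human choices between candidate steps.
    `judge.score` rates a (partial_cot, candidate_step) pair; `batch` holds
    human-labelled comparisons. Hypothetical interfaces throughout."""
    score_a = judge.score(batch.partial_cot, batch.step_a)
    score_b = judge.score(batch.partial_cot, batch.step_b)
    logits = score_a - score_b
    # batch.label is 1.0 where the human preferred step A, 0.0 where they preferred B.
    return F.binary_cross_entropy_with_logits(logits, batch.label)
```

The debaters could then be trained, e.g. with RL, against the reward of winning the judge’s verdict.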
Finally, the approval model for MONA is trained on a score derived from the judge model.
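The per-step reward MONA optimizes is then myopic: it depends only on the trace so far and the proposed step, not on how the episode eventually ends. Something like the following (hypothetical interface):

```python
def mona_step_reward(approval_model, partial_cot, proposed_step):
    """Myopic per-step reward for MONA: the approval model's score for the
    proposed step given only the trace so far, with no outcome term."""
    return approval_model.score(partial_cot, proposed_step)
```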
This is all done in an online training setup where the human stays in the loop as the AI systems continue to improve.
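Putting the pieces together, one iteration of the online loop might look roughly like this, reusing the sketches above, with `sample_step_comparisons` and `subsample` as hypothetical placeholders:

```python
def online_training_iteration(agent, debaters, judge, approval_model, human_rater):
    # 1. Roll out the current agent to collect fresh partial CoT traces,
    #    each with two candidate next steps (hypothetical helper).
    comparisons = sample_step_comparisons(agent)

    # 2. Debate-assisted human labelling on the subset we can afford to rate.
    labelled = [run_debate(c, debaters[0], debaters[1], decide=human_rater)
                for c in subsample(comparisons)]

    # 3. Refresh the judge on the new human labels, and the debaters against the judge.
    judge.update(labelled)
    for d in debaters:
        d.update_against(judge)

    # 4. Refresh the approval model from judge scores, then RL-update the agent on the
    #    myopic per-step approval reward (the MONA objective), not on final outcomes.
    approval_model.update_from(judge, comparisons)
    agent.rl_update(reward_fn=lambda cot, step: mona_step_reward(approval_model, cot, step))
```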
This all depends on the assumption that the (human + AI debaters) system isn’t so hackable as to render its feedback useless. This assumption seems pretty safe for pre-superhuman agents and <1M token CoTs. Because the above training setup keeps a human in the loop, we can at least conclude that the agent won’t learn to take actions that are obviously bad to humans, e.g. outputting gibberish tokens like “describing.\ + similarlyNow write oppositeley.]”
Maybe with an intelligent enough AI or a long enough CoT, the human rater becomes so confused that their feedback is completely decoupled from “accomplishing tasks in the real world.” It’s debatable at what point this would happen; however, I think it’s pretty clearly easier to hack a human’s approval for the end result of a weeks-long task than for a single step. If outcome-based supervision is the alternative, MONA seems like a big improvement.