Problem: Won’t this training run basically hack the judges, and produce actions that look good to the judges but aren’t actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind; my response differs pretty wildly based on the kind of failure you’re imagining.
But it sounds like you’re talking about a pretty weak (subhuman) AI system. So I’m not quite sure why you’d expect it to “hack” the judge.
Certainly sometimes the overseer will approve of some action that isn’t actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the “Menu” link because they aren’t paying attention or whatever. But then after that the agent’s next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the “Book” link; the overseer shouldn’t make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don’t expect it to be useless for accomplishing tasks.
(Though really in the computer use setting I expect I’d recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees s_{i+1}), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)
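To make that setup concrete, here is a minimal sketch (my own illustration, not code from the paper) of what the per-step approval loop might look like in a computer-use setting; `env`, `agent`, and `overseer` are assumed interfaces rather than any real API. The point is just that the reward for each action comes from the overseer’s judgment given the trajectory so far plus the immediate next state s_{i+1}, never from later outcomes.

```python
# Hypothetical per-step approval loop for a computer-use agent.
# `env`, `agent`, and `overseer` are assumed interfaces, not a real API.

def run_episode(env, agent, overseer, max_steps=50):
    history = []          # full trajectory so far, visible to the overseer
    state = env.reset()
    approvals = []
    for _ in range(max_steps):
        action = agent.act(state, history)
        next_state = env.step(action)  # immediate consequence of the action only
        # The overseer scores this single action given the history, the current
        # state, and s_{i+1}; it never sees whether the task eventually succeeds.
        approvals.append(overseer.rate(history, state, action, next_state))
        history.append((state, action, next_state))
        state = next_state
    return history, approvals
```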
If I understand correctly (and I don’t think I do, which is why I’m asking) MONA in this setting would look something like:
Your description of the setup sounds reasonable, though given the weak levels of capability you’re imagining I don’t think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.
Also as mentioned above, I’d probably recommend the overseer gets access to s_{i+1}, but even if that weren’t the case I’d still think it should be feasible to build a non-useless agent. (Though I’m not taking a stance on how it would compare to one trained with outcome RL.)
EDIT: I’m not sure how big each action you are considering is. If it’s 10 tokens, such that you can only realistically do stuff at the level of “click this button”, then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.
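As a rough sketch of what “much larger actions” could mean here (my own illustration, hypothetical interfaces throughout): instead of the overseer rating every low-level click, the agent proposes a chunk of low-level actions, the chunk is executed, and the overseer rates the whole chunk as one step.

```python
# Sketch of the "much larger actions" suggestion (hypothetical interfaces):
# one overseer rating per chunk of low-level actions, rather than per click.

def run_chunked_step(env, agent, overseer, state, history, chunk_size=100):
    chunk = agent.propose_actions(state, history, max_actions=chunk_size)
    for low_level_action in chunk:          # e.g. ~100 clicks / keystrokes
        state = env.step(low_level_action)
    approval = overseer.rate(history, chunk, state)  # one rating per chunk
    history.append((chunk, state))
    return state, approval
```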
Thanks this is helpful. Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
And also “don’t propagate rewards backwards in time”, which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, “don’t propagate rewards backwards in time” is the primary focus in this paper—in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
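As a concrete illustration of that distinction (my own sketch, not code from the paper): take the same per-step approval rewards and compute the training target for each action under the two schemes. Ordinary RL credits an action with all later (discounted) reward, so an early action can be reinforced for setting up a later payoff; the myopic MONA-style target credits each action only with its own immediate reward.

```python
# Same per-step rewards, two credit-assignment schemes.

def regular_rl_returns(rewards, gamma=0.99):
    """Standard discounted return: action i is credited with all later reward."""
    returns, running = [0.0] * len(rewards), 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

def mona_targets(rewards):
    """Myopic target: action i is credited only with its own immediate reward,
    so there is no incentive to take a low-approval action now in order to
    unlock reward later."""
    return list(rewards)

rewards = [0.1, 0.0, 1.0]
print(regular_rl_returns(rewards))  # [~1.08, 0.99, 1.0]: early actions share the final payoff
print(mona_targets(rewards))        # [0.1, 0.0, 1.0]: each action judged on its own
```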
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
… As a person who works at a corporation, it’s a bit tricky to speculate on this publicly, and I’m not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn’t much serious reward hacking going on at all (let alone multi-step reward hacking), and so you probably don’t want to use MONA. (See also the second limitation in the post.)
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer.
Edit: I elaborated
Plus maybe let the overseer observe the result and say “oops” and roll back that action, if we can implement a rollback in this context.
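For contrast with MONA, here is a minimal sketch of the baseline being suggested (my own illustration, hypothetical interfaces throughout): no trained policy at all, just a frozen overseer choosing each action directly and rolling back when the observed result looks wrong.

```python
# Hypothetical "overseer picks actions directly" baseline, with rollback.
# `env` and `overseer` are assumed interfaces; env.rollback() assumes the
# environment supports undoing the last action, which many real computer-use
# environments do not.

def overseer_driven_episode(env, overseer, max_steps=50):
    history = []
    state = env.reset()
    for _ in range(max_steps):
        action = overseer.pick_action(state, history)
        next_state = env.step(action)
        if overseer.says_oops(state, action, next_state):
            history.append(("rolled back", state, action))  # remember the mistake
            state = env.rollback()
            continue
        history.append((state, action, next_state))
        state = next_state
    return history
```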
If it were as simple as “just ask an LLM to choose actions”, someone would have deployed this product a while ago.
But in any case I agree this isn’t the most interesting case for MONA, I talked about it because that’s what Daniel asked about.