Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
And also “don’t propagate rewards backwards in time”, which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, “don’t propagate rewards backwards in time” is the primary focus in this paper—in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
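To make the "don't propagate rewards backwards in time" axis concrete, here is a minimal sketch (mine, not from the paper) contrasting the per-step training targets in regular RL versus a MONA-style myopic setup. The reward values and the `gamma` discount are hypothetical illustrations.

```python
def discounted_returns(rewards, gamma=0.99):
    """Regular RL: each step's target sums its own reward plus
    discounted future rewards, so credit propagates backwards in time."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))


def myopic_returns(rewards):
    """MONA-style: each step is trained only on its own immediate
    (e.g. process-based) reward; no future reward flows back."""
    return list(rewards)


# Hypothetical per-step feedback for a 3-step episode.
rewards = [0.1, 0.0, 1.0]
print(discounted_returns(rewards))  # step 0's target includes the later 1.0
print(myopic_returns(rewards))      # step 0's target is just 0.1
```

Note the feedback signal (`rewards`) is identical in both cases; the only difference is whether later rewards flow into earlier steps' targets, which is exactly the variable the paper's main experiments isolate.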
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
… As a person who works at a corporation, it’s a bit tricky to speculate on this publicly, and I’m not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi-step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)