If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization when you don’t use MONA, and also that if you actually use MONA, it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize it’s happening). Later, when things are put into production, errors happen, but it’s chalked up to “well, it’s hard to anticipate everything”.
Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
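To make the contrast concrete, here is a minimal, hypothetical sketch of how the two training signals could differ for an agentic coding episode; the trajectory fields and the `overseer_approval` function are illustrative placeholders, not anything from the paper:

```python
# Illustrative sketch only: contrasts credit assignment in ordinary multi-step
# RL with a MONA-style myopic objective. The trajectory fields ("env_reward",
# "state", "action") and overseer_approval are hypothetical placeholders.

def ordinary_rl_returns(trajectory, gamma=0.99):
    """Credit each action with discounted future outcomes, so a step that
    silently skips error handling can be reinforced if it pays off
    (e.g. in latency) many steps later."""
    returns = []
    future = 0.0
    for step in reversed(trajectory):
        future = step["env_reward"] + gamma * future
        returns.append(future)
    return list(reversed(returns))


def mona_myopic_rewards(trajectory, overseer_approval):
    """Credit each action only with the overseer's immediate judgment of
    whether the step looks good for the long run (nonmyopic approval);
    the agent gets no optimization pressure from actual downstream outcomes."""
    return [overseer_approval(step["state"], step["action"]) for step in trajectory]
```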
To rephrase what I think you are saying about the situations where work on MONA is very helpful:
By default people get bitten by long-term optimization. They notice issues in prod because it’s hard to catch everything. They patch individual failures as they come up, but don’t notice that if they did more work on MONA, they would remove the underlying driver of many issues (including future issues that could result in catastrophes). They don’t try MONA-like techniques because that option isn’t very salient when you are trying to fix individual failures, and it doesn’t pass cost-benefit as a fix for any single failure.
If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches, and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly, and because people don’t by default think of the individual failures seen in prod as symptoms of long-term optimization. But your empirical work pushes them over the line, and they end up trying to adopt MONA to avoid future failures in prod (and maybe to reduce catastrophic risk, though given competitive pressures that might not be the main factor driving decisions, so you don’t have to make an ironclad case for MONA reducing catastrophic risk).
Is that right?
I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Yes, that’s broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That’s a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA comes at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
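As a purely illustrative sketch of what “providing appropriate nonmyopic approvals” might look like in the coding example above (the judge interface, prompt wording, and 0-10 scale are assumptions, not the paper’s setup):

```python
# Hypothetical sketch of a step-level nonmyopic approval signal for the coding
# example above. The judge interface, prompt wording, and 0-10 scale are
# illustrative assumptions.

APPROVAL_PROMPT = """You are reviewing a single step taken by a coding agent.
Task: {task}
Proposed change: {diff}
On a scale of 0-10, how good does this step look for the long-term health of
the codebase (correctness, error handling, maintainability), not just for
finishing the immediate task quickly? Answer with a single number."""


def nonmyopic_approval(judge, task, diff):
    """Score one step using only the overseer's foresight, so the agent is
    never directly optimized against actual downstream outcomes (e.g. bugs
    that only surface later in prod)."""
    response = judge(APPROVAL_PROMPT.format(task=task, diff=diff))
    return float(response.strip()) / 10.0
```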
I think this will become much more likely once we actually start observing long-term optimization failures in prod.
Agreed, we’re not advocating for using MONA now (and say so in the paper).
Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)