Rohin Shah comments on Research directions Open Phil wants to fund in technical AI safety

Rohin Shah 10 Feb 2025 5:22 UTC
I meant “it’s obvious you should use MONA if you are seeing problems with long-term optimization”, which I believe is Fabien’s position (otherwise it would be “hard to find”).
Your reaction seems more like “it’s obvious MONA would prevent multi-step reward hacks”; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).