Fabien Roger comments on Untrusted monitoring insights from watching ChatGPT play coordination games

Fabien Roger 9 Apr 2025 21:17 UTC
3 points
0
I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attack is strong evidence of success against inhuman ones. If the defenses are not “overfit” to human attacks, I think it’s quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I’d be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.
(And there is also the strategy of using RL to elicit inhuman attacks which doesn’t have the weakness you describe, but which has other issues.)