We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
Sam, I’m confused about where this degree of confidence is coming from. I found this post helpful for understanding Anthropic’s strategy, but there wasn’t much argument given for why one should expect the strategy to work, much less to “almost entirely” mitigate the risk!
To me, this seems wildly overconfident given the quality of the available evidence—which, as Aysja notes, involves auditing techniques like simply asking the models themselves to rate their evil-ness on a scale from 1 to 10… I can kind of understand evidence like this informing your background intuitions and choice of research bets and so forth, but why think it justifies this much confidence that you’ll catch/fix misalignment?
I interpreted Habryka’s comment as making two points, one of which strikes me as true and important (that it seems hard/unlikely for this approach to allow for pivoting adequately, should that be needed), and the other of which strikes me as a misunderstanding (that they don’t literally say they hope to pivot if needed).