We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
Sam, I’m confused about where this degree of confidence is coming from. I found this post helpful for understanding Anthropic’s strategy, but there wasn’t much argument given for why one should expect the strategy to work, much less to “almost entirely” mitigate the risk!
To me, this seems wildly overconfident given the quality of the available evidence, which, as Aysja notes, involves auditing techniques like simply asking the models themselves to rate their evilness on a scale from 1 to 10… I can kind of understand evidence like this informing your background intuitions and choice of research bets and so forth, but why think it justifies this much confidence that you’ll catch/fix misalignment?