Oh yeah, that is a good point to bring up. I agree that the empirics of how well few-shot catastrophe prevention works will affect both the usefulness of post-prevented-catastrophe models and how good a strategy rare failures is. It’s also the case that the rare-failures regime comes from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.
The diffuse-stakes control setting doesn’t suffer from this, and just directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model.
Thanks for engaging and disagreeing!
I wrote this article in particular with the AI models from the AI 2027 TTX that “outperform humans at every remote work task and accelerate AI R&D by 100x” in mind. It depends on how spiky the capabilities are, but it feels reasonable to assume that these AIs are at least as good as the best humans at decision-making and persuasion, if not better. (If you felt like neither of these would be true for this model, then I’m fine with instead applying these claims to a more powerful AI that does meet these criteria, but I would be surprised if these capabilities came late enough to be irrelevant to shaping the dynamics of the intelligence explosion and beyond. Though given your last paragraph, that might be cruxy.)
But I’m not even sure the ‘persuasion’ capability needs to be that good. A lot of the persuasion ultimately routes through i) the AI gaining lots of trust from legibly succeeding and being useful to decision-makers in a broad set of domains, and ii) the AIs having lots of influence and control over people’s cognition (such as information control).
I think under the picture you described in the second paragraph it would be less surprising if AI didn’t have a large impact, but I think it’s likely the AIs will be more like a strong combination of primary executive assistant and extremely competent, trusted advisor. I think they will have more reach and capabilities than human technical advisors, end up being competitive with or better than them at decision-making, and ultimately end up supporting/automating many parts of the decision-maker’s cognition. One important piece of control I imagine the AIs having is that they will be a middleman for a lot of the information that you see, such as by summarizing documents or conversations. (I expect the incentives to do this to keep increasing.)
My guess is that the reason misaligned human advisors have not been deadly to democracy is that it has been difficult for them to influence many other decision-makers, and that the human decision-maker still did nearly all of the important thinking related to ‘how true is what they are saying’ / ‘what are their incentives’ / engaging with people who disagreed / etc. (and there were better incentives to do so). I think this changes once you introduce strong AIs, and additionally, I think these AIs will be better at playing to the many things I listed in the section “Humans believe what they are incentivized to, and the incentives will be to believe the AIs.”
Yup, though I don’t have strong beliefs on whether they will succeed at this enough to sufficiently reduce the damage this could cause, especially in worst-case alignment regimes.
I don’t have particular guesses, but there are probably many decisions that can influence how AI turns out which are not individually catastrophic but are catastrophic together, such as the example I gave of distrusting an individual AI advisor. I also wouldn’t rule out small sets of very important decisions that would make it much harder for AI to go well; the example I gave in the article of making the human fire/distrust specific advisors feels salient here.