I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, don’t think of themselves as deceptively aligned through most of training but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these “behavioral schemers”). This might happen via reflection and memetic spread throughout the deployment, or because of more subtle effects of the distribution shift to situations where there’s an opportunity to grab power.
And I agree that people are often sloppy in their thinking about exactly how these AIs will be motivated (e.g., often too quickly concluding that they’ll be trying to guard the same goal across contexts).
(Though, in case this was in question, I think this doesn’t undermine the premise of AI control research, which essentially makes a worst-case assumption about the AI’s motivations and is therefore robust to other kinds of dangerously-motivated AIs.)

Here’s some relevant discussion of “behavioral schemers that weren’t training-time schemers”:
A basic reason why [behavioral schemers that aren’t training-time schemers] might seem rarer is that such an AI must concentrate its attacks on good opportunities just as much as any other behavioral schemer, yet it isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded the audits that search for the very failure modes it eventually exhibits?
One plausible answer is that auditing for these failure modes is hard: the AI can easily distinguish real opportunities to attack from fake ones even without vigilantly looking for them. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer them.
Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but [only during auditing for propensity to grab power, and not during most of training]. In practice, most of our evidence about the AI’s alignment may come from its behavior during auditing, not during training. Many auditing inputs might look quite different from training inputs, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode during training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.
On net, behavioral scheming without training-time scheming seems roughly as plausible as behavioral scheming with training-time scheming.