Alex Mallen comments on Richard Ngo’s Shortform

Alex Mallen 6 Feb 2026 19:35 UTC
Here’s some relevant discussion of “Behavioral schemers that weren’t training-time schemers”:
A basic reason why [behavioral schemers that aren’t training-time schemers] might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits?
One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer memetic diseases.
Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but [only during auditing for propensity to grab power, and not during most of training]. In practice, most of our evidence about the alignment of the AI may come from the AI’s behavior during auditing, not training. Many auditing inputs might look quite different from training, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode in training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.
On net, behavioral scheming without training-time scheming seems comparably plausible to behavioral scheming with training-time scheming.