The sleeper agents, alignment faking, and other research projects were all motivated by the two cases outlined here: a scientific case and a global coordination case.
These studies produced results that are much more likely to be observed in the “alignment is hard” worlds.
Strong scientific evidence for alignment risks is very helpful for facilitating coordination between AI labs and state actors. If alignment ends up being a serious problem, we’re going to want to make some big asks of AI labs and state actors. We might want to ask AI labs or state actors to pause scaling (e.g., to have more time for safety research). We might want to ask them to use certain training methods (e.g., process-based training) in lieu of other, more capable methods (e.g., outcome-based training). As AI systems grow more capable, the costs of these asks will grow – they’ll be leaving more and more economic value on the table.
Is Anthropic currently planning to ask AI labs or state actors to pause scaling?