A software intelligence explosion might be asymmetrically good for safety if you think safety research is absolute > relative:
Ajeya Cotra and others have argued that ML research being automated first could mean that safety work gets a boost during a software intelligence explosion. The typical worry is that this doesn’t help if capabilities race ahead. But I think the picture is more nuanced depending on how you model safety progress — and the distinction matters not just for intelligence explosion dynamics but for how you allocate resources between safety approaches today.
There are roughly two ways to think about safety difficulty:
Relative: safety is hard insofar as capabilities keep pulling the goalposts. More capable systems mean harder alignment problems, more deceptive alignment risk, etc.
Absolute: safety is about hitting some fixed technical bar — interpretability reaching a certain threshold, formal verification of certain properties, etc. — largely independent of how fast capabilities move.
If you lean toward the absolute view, an intelligence explosion looks surprisingly good for safety. You’re not just getting more capable AI systems — you’re getting dramatically better alignment researchers, better interpretability tools, faster progress on whatever technical problems currently bottleneck safety. The “bar” doesn’t move much, but your ability to clear it does.
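To make the distinction concrete, here is a toy model with entirely made-up numbers (GROWTH, ABSOLUTE_BAR, and RELATIVE_K are all hypothetical): if automated researchers make safety research output scale roughly with AI capability, a fixed bar gets cleared quickly during an explosion, while a bar that rises with capability does not get meaningfully easier to clear.

```python
# Toy model of the absolute vs. relative views (all numbers are made up).
# Assumption: during a software intelligence explosion, automated researchers
# make yearly safety research output scale roughly with AI capability.

GROWTH = 3.0           # hypothetical capability multiplier per year
ABSOLUTE_BAR = 1000.0  # fixed amount of safety work needed (absolute view)
RELATIVE_K = 5.0       # safety work needed per unit of capability (relative view)

capability = 1.0
cumulative_safety = 0.0

for year in range(1, 11):
    capability *= GROWTH
    cumulative_safety += capability          # this year's output ~ capability
    relative_bar = RELATIVE_K * capability   # the bar moves with capability
    cleared_absolute = cumulative_safety >= ABSOLUTE_BAR
    print(f"year {year:2d}: absolute bar cleared: {cleared_absolute} | "
          f"cumulative / relative bar: {cumulative_safety / relative_bar:.2f}")

# Under the absolute view, the fixed bar is cleared within a few years because
# output compounds. Under the relative view, cumulative output divided by the
# moving bar converges to roughly GROWTH / (GROWTH - 1) / RELATIVE_K, so the
# explosion barely changes whether the bar ever gets cleared.
```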
The obvious counterexample is control, where the relative view pretty clearly dominates. More capable systems have more situational awareness, are better at strategic deception, better at identifying and exploiting oversight gaps. Capability gains directly translate into harder control problems, roughly one-for-one or worse.
Mech interp might be partially in the relative camp too: understanding what a model is doing gets harder as the model gets better at obfuscating, or at reasoning in ways that outpace our interpretive tools. This feels less clearly true than for control, since defeating interpretability seems to require more situational awareness and more steps of reasoning on the model's part, but it's an open empirical question worth investigating.
One implication for resource allocation: if you think a given safety approach is more relative, its value degrades as capabilities advance, even if it looks competitive today. Suppose mech interp has a 1% chance of success and control has a 2% chance over the next x years. Control looks better naively, but if the capability-to-safety ratio gets significantly worse over a large portion of that window, and control's difficulty tracks capabilities more closely, mech interp could still be the better investment. A structurally similar, though largely independent, dynamic applies when comparing general vs. narrow safety research: even if a more general approach looks worse by current benchmarks, it may dominate if you expect the landscape of future models to shift in ways that erode the value of narrow solutions.
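A back-of-the-envelope version of that comparison. The 1% and 2% baselines come from the paragraph above; the discount factors are invented purely for illustration of how the adjustment can flip the ranking.

```python
# Toy comparison: discount each approach's baseline success probability by how
# much its difficulty is expected to track capability gains over the window.
# The discount factors below are assumptions, not estimates.

p_mech_interp = 0.01   # baseline chance of success (from the example above)
p_control     = 0.02

# Hypothetical: capabilities pull well ahead of safety over much of the window.
# Control's difficulty tracks capabilities closely, so it keeps little of its
# baseline chance; mech interp tracks them only partially.
control_keep     = 0.3   # assumed: control keeps 30% of its baseline chance
mech_interp_keep = 0.8   # assumed: mech interp keeps 80% of its baseline chance

adj_control     = p_control * control_keep          # 0.006
adj_mech_interp = p_mech_interp * mech_interp_keep  # 0.008

print(f"adjusted control:     {adj_control:.3f}")
print(f"adjusted mech interp: {adj_mech_interp:.3f}")
# With these made-up discounts, mech interp comes out ahead despite the
# worse-looking baseline.
```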
(Open to hearing why this is totally wrong, has been stated many times before, or not useful, ofc)