I’d say that work like our Alignment Faking in Large Language Models paper (and the model organisms/alignment stress-testing field more generally) is pretty similar to this (including the “present this clearly to policymakers” part).
A few issues:
AI companies don’t actually have specific plans; they mostly just hope that they’ll be able to iterate. (See Sam Bowman’s bumper post for an articulation of a plan like this.) I think this is a reasonable approach in principle: this is how progress happens in a lot of fields. For example, AI companies don’t have plans for all kinds of problems that will arise with their capabilities research over the next few years; they just hope to figure things out as they arise. But the lack of specific proposals makes it harder to demonstrate particular flaws.
A lot of my concerns about alignment proposals are that once AIs are sufficiently smart, the plan won’t work anymore. But in many cases, the plan does actually work fine right now at ensuring particular alignment properties. (Most obviously, right now AIs are so bad at reasoning about training processes that scheming isn’t much of an active concern.) So you can’t directly demonstrate that current plans will fail later without making analogies to future systems; and observers (reasonably enough) are less persuaded by evidence that requires them to assess how analogous a setup is.