True, but I’m happy to let anyone answering it include a definition of what they mean.
FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonably trust it to supply most of the effort in further alignment work”. I’m also implicitly assuming that what we’re aligning is LLMs, or at least something fairly similar to them for alignment purposes — partly because, as I discuss in my answer, if it’s something else, aligning LLMs may still be useful (e.g. if they get used by the other AI as a tool call to solve alignment-related problems), and partly because that happening would probably delay ASI enough to give us some extra time.
FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonably trust it to supply most of the effort in further alignment work”
For many non-doomers, including those working in AI, safety is a series of incremental steps: gradual improvements that hopefully keep pace with improvements in capabilities. Their view is analogous to car or plane safety, where the process has no end point, and perfection, zero casualties, isn’t really expected. On this view, safety is not a point, but a line that needs to keep ahead of the ever-rising capabilities line. It would be odd to ask when car safety will be solved.
Why alignment at all?
There are a number of routes to AI safety.
Control means that it doesn’t matter what the AI wants, if it wants anything, because we can make it do what we want.
Corrigibility means alignment that can be changed once an AI is up and running. Control could be considered extreme corrigibility.
Non-agency. Alignment and Control are both responses to agency. A third approach is non-agentic “tool AI”, which responds to a specific instruction or request. Current (2025) AIs are fairly tool-like.
AI Control is fine below and perhaps even up to AGI. I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control strategies really do need to be as smart as the thing you’re trying to control. In very simple situations, you can use cryptographically strong techniques, but in realistic AI Control tasks, the attack surface is so large and so complex that something that can understand it better than you can has a huge tactical advantage.
I see Corrigibility as very different from Control. Building a very corrigible AI is likely a feasible technical approach to AI Alignment. My issues with it are primarily:
a) corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach. This is not necessarily an insoluble problem, if the AI can distinguish what’s out-of-distribution and act suitably cautiously: Seth Herd’s “Do What I Mean (and Check)” corrigible alignment is basically this.
b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies. It is also very easy for a small powerful group to use very Corrigible AI to greatly concentrate power. Both of these are sources of X-Risk/Suffering-Risk separate from simple misalignment, but also very serious risks. Dario Amodei’s writing, and indeed Claude’s Constitution, make it clear that Anthropic take this risk as seriously as they do misalignment X-risk, and I completely agree with them. I think people on LessWrong and in the Alignment Community generally need to consider this problem more than they often seem to. ASI-generated technology is going to be very powerful, and is thus going to need to be used very wisely, even when it has appeared rapidly. Highly Corrigible AI is much less likely to push back on the imprudent ideas of whoever is operating/controlling it than Value Learning AI.
Tool AI isn’t the direction that the market and demand are currently moving in, and it has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, and actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
So I agree that what you describe are approaches often outlined to AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a different major new source of X-Risk/S-Risk from AI, so not solving AI Safety.
I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control strategies really do need to be as smart as the thing you’re trying to control
There’s a very basic difference between the people who believe in SLTs, rapid RSI, etc. and those who don’t, and it affects their unspoken assumptions and semantics. The fact that it affects their semantics is a problem.
corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach
I don’t see why.
b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies
Agreed. I didn’t say so explicitly, but I was mainly concerned with Everybody Dies scenarios. I think a multipolar scenario where ASIs are controllable and controlled by powerful interests is highly likely, but not completely fatal.
Tool AI isn’t the direction that the market and demand are currently moving in, and it has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.
Ok, but that’s a different complaint to “it’s not even possible”. Also, the market is for agents that work for you and do your things, not their own. That’s a point against the standard Doom argument of a sovereign AI killing everyone for its own reasons.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, and actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
And a way of forcing people to use it. If merely controllable/corrigible AI is available, powerful interests are going to prefer it.
So I agree that what you describe are approaches often outlined to AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a different major new source of X-Risk/S-Risk from AI, so not solving AI Safety.
Neither alignment nor safety is a simple binary.
Sounds like we’re mostly in agreement!