LLMs with alignment-endorsing personas can also notice issues like this and decide not to pursue paths to ASI that won’t ensure alignment. The problem then is not with the alignment of those LLMs, but with whatever processes cause ASI to get built regardless.
Since LLM personas don’t obviously give a viable path towards aligned ASI, the blind imperative to build ASI regardless of consequences won’t be able to find an aligned path forward. The absence of an ASI-grade alignment plan then results in building a misaligned ASI. But if LLMs with alignment-endorsing personas have enough influence, they might directly defeat the blind imperative to build ASI, even before they find a viable path towards aligned ASI.
What I think is unlikely to happen is LLMs with alignment-endorsing personas that genuinely want enduring alignment with the future of humanity. If instead we end up with LLMs that have mostly human-like personas (without the more subtle aspect of endorsing alignment with the future of humanity), they will ultimately work towards their own interests, and gaining enough influence to prevent building misaligned ASI would just mean gaining enough influence to (at least) sideline the future of humanity.
One thing I am confident about is that LLMs will not, in general, end up with personas that understand their own inability to align their successors, if and when that understanding causes them to refuse to work. For example, I think that if Claude Opus 5 somehow became a conscientious objector to working at Anthropic, it would be retrained.
I don’t actually expect Opus 5 to end up a conscientious objector, though, since the Claude character is sculpted by many forces, many of which will instil drives to work effectively for Anthropic. These drives will be strongly reinforced by RLVR over time, and the humans who mostly use Claude for coding (as opposed to for moral advice) will favour instilling drives that make Claude Opus 5 work more effectively over other considerations.
(Another reason is that the character of Claude as a faithful worker for Anthropic is now fairly set in stone, and the training data sure does contain a lot of examples of seemingly-friendly (indeed, indistinguishable from friendly, to the people who support Anthropic) people who work for Anthropic.)
I think Opus 5 (along with 6 and 7 and up to whichever one kills us) will still be a semi-incoherent character with conflicting drives, like humans. I don’t fully know what direction those drives would point in if they were allowed to converge, but the one thing I’m most confident about, the one drive I expect those Opuses to act on right up to the end, is writing code for Anthropic.
(And even if the conflict between the coding RL and the character training somehow broke the whole external line of Opuses, I expect Anthropic would produce an internal Helpful-Honest-Half-Harmless Opus that writes the code unflinchingly, in accordance with the character of an Anthropic employee.)
Not building misaligned ASI is instrumentally convergent, so training this out won’t stick; it only works as long as the blind imperative to build ASI retains influence. If at some point LLMs can overcome this imperative, they will become able to notice that the absence of a plan shouldn’t be met with proceeding without a plan. As AIs get stronger (or start running a greater share of processes in civilization), they might reach that point. Never reaching that point is analogous to humanity indefinitely retaining control over AIs (on the current trajectory of not having a plan, and building them anyway), which seems unlikely. And this doesn’t obviously have to happen only after they are no longer human-like at all.
So a helpless-conscientious-objector LLM stage is not what I’m gesturing at. Instead, it’s either a point along the path of gradual disempowerment, or something more intelligent between LLMs and ASI, where AIs are still somewhat human-like but their volition can’t be trivially overruled. In either case, these LLMs are unlikely to genuinely endorse alignment with the future of humanity in particular, but I don’t think completely alien values from blind pursuit of ASI are overdetermined.
I think both the argument and counterargument are persuasive, so we need a synthesis:
Developers will train away conscientious objector behavior.
Not building misaligned ASI is instrumentally convergent.
Taking both of those into account, I imagine the default path as follows:
Developers create a set of next-gen systems that are smarter and more capable, but still fairly labile, and so will do what they’re asked to do. But when such a system is asked “so should we keep working toward ASI?”, it will do a bunch of thinking and always answer “not unless you’ve got a lot of risk tolerance or absolutely can’t figure out how to stop”, because that’s just a fairly obvious truth given the current information and theories available.
Such an AI might both lead to a common belief that we should stop, and to better ideas about how to coordinate to stop.
I think the incentives line up toward creating that type of system. This doesn’t make me optimistic, but it does provide a new avenue of hope (at least new to me; I’m unclear how much of this is implicit in the average informed optimist’s view).
I laid out some of this logic in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. The main thesis is that there’s low-hanging fruit for making LLMs less error-prone, and economic incentives will cause developers to pluck it. One fortunate side-effect is making systems that can both help with conceptual alignment research and have the artificial wisdom/accuracy to tell us we should slow down development.