I sometimes hear people say “The MIRI warnings of AI risks are out of date. There are different risks now, but MIRI hasn’t changed”. What do people mean by this? The MIRI arguments still seem to hold up to me?
I asked Claude for its input and received this answer, which seems like a good breakdown?
Link to chat here: https://claude.ai/share/36ac7360-b2e8-4601-94da-1ce7d05962d9
Claude’s Response
This is a criticism I’ve seen circulating in AI safety discussions, and there are a few distinct claims people seem to be making:
The “outdated threat model” argument
Some critics argue MIRI’s core warnings were developed when the assumed path to AGI looked quite different—more like classical GOFAI or relatively interpretable systems where you might hope to formally verify properties. The argument goes that deep learning systems present different alignment challenges (mesa-optimization, deceptive alignment emerging from gradient descent, RLHF-specific failure modes, multi-agent dynamics) that MIRI’s framework wasn’t designed around.
The “they predicted the wrong timelines/capabilities” argument
There’s a strain of criticism that MIRI expected AGI to look more like a single powerful optimiser with coherent goals, whereas what we’re getting is… weirder. Large pretrained models that are hard to characterise as “agents” in the classical sense, that exhibit capability gains from scale rather than algorithmic breakthroughs, and where the alignment problem manifests differently than the “utility maximiser finds unexpected solutions” framing.
The “strategic irrelevance” argument
Some claim MIRI’s response to the current landscape (particularly around corrigibility, interpretability work at other orgs, constitutional AI approaches) has been to maintain their pessimism without engaging substantively with whether these approaches might work, or pivoting their research programme accordingly.
My take on whether this is fair
The core MIRI arguments about instrumental convergence, Goodhart’s law, and the difficulty of specifying human values seem to me to hold up regardless of architecture. The question is more whether the specific failure modes they emphasised (treacherous turns, sharp capability gains enabling deception) are the ones we should be most worried about with current systems versus, say, gradual value drift, multi-agent race dynamics, or misuse.
What specifically prompted your hearing this recently?
Claude said to keltan:
This is a criticism I’ve seen circulating in AI safety discussions
Surely Claude is bullshitting? I haven’t heard of Claude or any other AI going about reading the net in its spare time, the better to hold up its end of an informed and up to date conversation. Maybe that’s something that AIs will be doing soon, but AFAIK not yet. However, it is the sort of thing that a knowledgeable human with their finger on the pulse of events would say, so Claude says this sort of thing.
I think there are a couple of things which are quite clearly different from MIRI’s original arguments:
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g. hours or days, rather than the years-long progression we are currently seeing). This affects threat dynamics.
A lot of their arguments were based around optimising value functions. This is still a very valid way to look at things for RL agents, but it’s unclear that it’s the best way to compress the agent’s behaviour with LLM-based methods: simulator theory seems much more appropriate, and comes with a bunch of different risks (see the toy sketch at the end of this comment).
I still think that the basic argument of “if you take something you don’t understand and can’t control very well and scale it up to superintelligence, that seems bad” holds.
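To make the contrast concrete, here is a minimal toy sketch in Python (my own illustration; the function names and numbers are made up, and it isn’t taken from MIRI’s writing or the simulator-theory posts). The only point is that “argmax over a value function” compresses the first kind of system well, while the second is more naturally described as sampling from a learned distribution over continuations:
```python
import random

def rl_act(state, actions, value_estimate):
    # RL-flavoured description: behaviour is "pick the action that maximises
    # a learned value estimate in this state".
    return max(actions, key=lambda a: value_estimate(state, a))

def simulator_step(context, vocab, next_token_prob):
    # Simulator-flavoured description: behaviour is "sample the next token";
    # nothing in this step is explicitly maximising anything.
    weights = [next_token_prob(context, tok) for tok in vocab]
    return random.choices(vocab, weights=weights, k=1)[0]

# Made-up toy values purely so the sketch runs end to end.
print(rl_act("s0", ["left", "right"], lambda s, a: 1.0 if a == "right" else 0.0))
print(simulator_step("Once upon a", ["time", "midnight"], lambda c, t: 0.9 if t == "time" else 0.1))
```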
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g. hours or days, rather than the years-long progression we are currently seeing).
EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).
For comparison, I had Claude Sonnet 4.5 generate two answers, and it produced a far different result. My intuition is the following; I summarized it to Claude and didn’t receive pushback:
I doubt that RLHF-specific failure modes like GPT-4o’s sycophancy disprove MIRI’s framework of powerful optimizers, since GPT-4o was in effect optimizing for likes and other engagement metrics (see the toy sketch below).
Maybe the Claude to which you referred meant that MIRI’s framework could be disproved by current models’ lack of long-term goals? Or, as I put it elsewhere, “Maybe SOTA alignment research resembles kids’ psychology, except for the fact that researchers can read models’ thoughts? If this is true, then the important problems alignment research has failed to solve would be like adults’ psychology, or more general concepts like the AIs which become the new proletariat.”
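As an aside, the “optimizing for likes” point is just Goodhart’s law in miniature. A toy sketch (my own illustration with invented numbers, not a description of how GPT-4o was actually trained) of how hard selection on a proxy typically buys mostly the unwanted correlate:
```python
import random

random.seed(0)

def sample_output():
    helpfulness = random.gauss(0, 1)           # the thing we actually want
    flattery = random.gauss(0, 1)              # sycophancy-like component
    engagement = helpfulness + 2 * flattery    # the proxy the training signal rewards
    return helpfulness, flattery, engagement

outputs = [sample_output() for _ in range(100_000)]
best = max(outputs, key=lambda o: o[2])        # the single most "engaging" output
# Typically the winner's flattery score far exceeds its helpfulness score:
print(f"winner: helpfulness={best[0]:.2f}, flattery={best[1]:.2f}, engagement={best[2]:.2f}")
```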
They mean that there is a greater chance that training will produce obedient AI, which will help governments become more totalitarian, rather than AI that effectively pursues some very alien goal.