Anthropic researchers posted this very safety-optimistic piece: https://alignment.anthropic.com/2026/hot-mess-of-ai/.
It argues that, for now, AI is mostly a normal technology and its accidents are like normal industrial accidents, which we know how to manage and collectively survive. And they expect this to continue into the future.
Unfortunately, this does not take into account likely future qualitative transitions (and considerably undersells future capabilities).
That being said, the technical material there is interesting (viewing LLMs as dynamical systems rather than as coherent optimizers, and the observation that mere scaling does not improve coherence on hard tasks).
So, indeed, from a viewpoint that attributes future growth mostly to straightforward scaling, their position makes sense (given that straightforward scaling does not seem to increase coherence on the tasks that truly matter).
However, I think that factors other than straightforward scaling will come into play more and more (in particular, increased affordances to experiment with new methods and architectures, both at the intra-model level and at the level of collective organization and dynamics of societies of agents).
I don’t think anything about the nature of language models follows from the paper.
First of all, why would models have preferences about wrong answers on SWE-bench? Even if they did have such preferences, I would expect them to be “silent”, because they are trained that wrong answers are bad.
Second, a simple explanation for the results in the paper is that harder problems elicit lower-confidence answers, and sampling produces higher variance as a result.
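A quick sketch of what I mean (purely illustrative: the per-problem confidences and the Bernoulli model are made up by me, not taken from the paper’s setup):

```python
# Toy illustration: if a "hard" problem leaves the model less confident in the
# correct answer, independent resampling scatters the outcomes much more than
# on an "easy" problem where one answer dominates.
import numpy as np

rng = np.random.default_rng(0)

def resample_accuracy(p_correct, n_samples=20, n_trials=2000):
    """Mean and spread of per-run accuracy when the same problem is
    re-attempted n_samples times, repeated over n_trials reruns."""
    draws = rng.random((n_trials, n_samples)) < p_correct
    per_run_accuracy = draws.mean(axis=1)
    return per_run_accuracy.mean(), per_run_accuracy.std()

for label, p in [("easy problem   (p=0.95)", 0.95),
                 ("medium problem (p=0.70)", 0.70),
                 ("hard problem   (p=0.50)", 0.50)]:
    mean, spread = resample_accuracy(p)
    print(f"{label}: mean accuracy {mean:.2f}, spread across reruns {spread:.2f}")

# The spread grows as p drops toward 0.5 (Bernoulli variance p*(1-p) peaks
# there), so harder problems look "less coherent" under resampling without
# any preference for wrong answers being involved.
```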
It feels like the paper is trying to say “for current models, the source of failure is capabilities, not alignment”, but the experiments are very dubious.
I don’t think they claim it’s a preference for wrong answers.
I think they claim it’s randomness (that is, higher variance, like you say, which does translate to lower coherence). That does sound like a lower capability to produce high-quality answers on complex tasks.
The question is what it would take to improve the situation, and they do give one answer which seems to work OK (ensemble methods). But this becomes exponentially more expensive with growing task length, in terms of the size of the required ensemble, if I understand the situation correctly.
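To make the cost concern concrete, here is a back-of-the-envelope toy model (the per-step success rate, the target, and the assumption of independent steps are all my own simplifications, not the paper’s analysis):

```python
# Toy model (my assumptions, not the paper's): each step of a task succeeds
# independently with probability p_step, so a single end-to-end attempt at a
# length-L task is fully correct with probability p_step**L. To get at least
# one fully correct attempt with probability >= target, a best-of-k ensemble
# needs k >= log(1 - target) / log(1 - p_step**L).
import math

def ensemble_size(p_step: float, length: int, target: float = 0.95) -> int:
    p_task = p_step ** length        # chance a single attempt is fully correct
    if p_task >= target:
        return 1
    return math.ceil(math.log(1 - target) / math.log(1 - p_task))

for length in (5, 10, 20, 40):
    print(f"task length {length:>2}: ~{ensemble_size(0.9, length)} samples at p_step=0.9")

# For small p_task the required k behaves like ~3 * (1/p_step)**length,
# i.e. it grows exponentially with task length.
```

This best-of-k framing also assumes one can recognize a fully correct attempt (some verifier), which is itself a nontrivial assumption; without that, plain voting schemes run into their own trouble once the per-attempt success rate gets small.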
I think the paper does leave quite a few things unanswered. E.g., could one reduce variance without resorting to ensemble methods and without introducing systematic errors, or are there principled obstacles to that?
That being said, the reputation of both senior authors is quite strong. The ICLR 2026 double-blind reviews of this paper are not stellar, but they are OK, and one can read further discussion there:
https://openreview.net/forum?id=sIBwirjYlY