What capabilities claims do you think AI 2027 is behind on? As far as I can tell, things seem broadly on time. (For example, projected scores on Cybench and OSWorld are as expected, as are company revenue and valuations.)
Baybar
Re: your 4, it seems to me that value has been accruing to model builders over semiconductor companies at a faster rate in the status quo. Anthropic’s valuation has grown by ~50x in the last 2 years whereas, for example, Nvidia’s has only grown by ~2.5x in this time. The chip makers just had a much larger preexisting business. So I don’t see a trend reversal being needed.
Based on the recent accelerating growth in Anthropic’s revenue and my expectations about AI progress in the immediate future I think there is a 35% chance that Anthropic is the most valuable company in the world by May 30th, 2027. This is operationalized through public market cap if there has been an IPO, and secondary valuation if they have not.
Ways I can imagine this not happening:
1) OpenAI manages to monetize its consumer base and it becomes the most valuable company in the world
2) Greater demand for chips makes it impossible to catch Nvidia or another chip maker, like Google.
3) “Surely the growth in revenue must stop soon, this has always been an s curve” arguments turn out to be true in the short run.
4) Perhaps there is some random bottleneck I haven’t considered, though I’d actually need someone to point me to the specific bottleneck to put much weight on this.
5) Perhaps Anthropic’s ARR numbers are juiced in some particular way that makes the reported numbers deceptive.
6) Perhaps if revenue continues to grow this fast, it will lead to such radical changes that it will create political will for a slowdown.
A way (but not the exclusive way) I can imagine this happening: The 10x a year growth rate turns out to just continue, as it has for the last 3 years, or perhaps we even get something faster (Anthropic’s annualized revenue growth rate has been ~85x over the past 4 months). A 10x a year growth rate gets us to revenue of about 440 billion a year from now. A possible IPO allows liquid markets to price this very quickly, and revenue like this seems to imply a valuation over 10 trillion to me. This seems like it would be more than any other company.
The mechanism for the 10x a year growth rate continuing would be the mass automation of white collar jobs, which seems like it would be possible in another 3 doublings of AI progress. Assuming a 4 month doubling time in capability growth, as operationalized through a metric like time horizon, this seems plausible, but not probable. If the doubling time turns out to be faster, this seems like it is more likely than not to me.
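A rough sketch of the arithmetic behind this scenario, using only figures quoted in this thread (44 billion current ARR, 10x a year growth, a valuation over 10 trillion, 3 doublings at 4 months each, ~85x annualized growth over 4 months). The function and variable names are made up for illustration, and this is a back-of-the-envelope check under those stated assumptions, not a forecast model.

```python
# Back-of-the-envelope check of the revenue scenario. All inputs are the
# figures quoted in the thread; the names are hypothetical.

CURRENT_ARR = 44e9        # current ARR in dollars, as cited in this thread
ANNUAL_GROWTH = 10.0      # assumed 10x-per-year revenue growth
TARGET_VALUATION = 10e12  # "over 10 trillion"


def project_revenue(arr: float, annual_growth: float, months: float) -> float:
    """Compound `arr` at `annual_growth` per year for `months` months."""
    return arr * annual_growth ** (months / 12)


def months_for_doublings(doublings: int, doubling_time_months: float) -> float:
    """Calendar time needed for a given number of capability doublings."""
    return doublings * doubling_time_months


revenue_in_a_year = project_revenue(CURRENT_ARR, ANNUAL_GROWTH, 12)
implied_multiple = TARGET_VALUATION / revenue_in_a_year
# What an ~85x annualized growth rate corresponds to over a 4 month window.
four_month_factor = 85 ** (4 / 12)

print(f"Revenue in 12 months: ${revenue_in_a_year / 1e9:.0f}B")
print(f"Revenue multiple implied by a $10T valuation: {implied_multiple:.1f}x")
print(f"Months for 3 doublings at 4 months each: {months_for_doublings(3, 4):.0f}")
print(f"Growth over 4 months at an 85x annualized rate: {four_month_factor:.1f}x")
```

On these inputs, a 10 trillion valuation corresponds to a revenue multiple in the low twenties, and the three doublings take about the same year as the revenue projection.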
Meta-reasons for how aggressive this forecast is: Every revenue prediction I’ve ever made for Anthropic has been directionally too low, and this seems to require a huge update on my part. Nearly every forecast, including AI-2027 and most forecasts of top forecasters, has also been too slow on this topic. AI-2027, for example, projected the leading AI company to reach Anthropic’s current ARR of 44 billion sometime between September 2026 and January 2027.
With these two things in mind, it seems that humans in general, even those with short timelines, have underestimated how much AI capabilities of about this level would affect revenue and valuations, creating even stronger reasons for the directional update. I am very interested in feedback on this, because if this forecast is true it seems to imply an extremely radical situation. Even among my reasons this might not happen, 1, 2 and 6 imply a radical situation.
While I think that this particular type of comment around “AI psychosis” would be out of character for the Trump administration, I don’t think that it is unprecedented at all for the Trump administration to occupy many different frames at the same time. They have little regard for internal consistency; just look at the many competing explanations for the Iran war.
Particularly if reality turns out to have short timelines I think something along these lines would be pretty valuable. Seems pretty doable to vibe code something like this?
What do you mean by “a jump on the metr graph”? Do you just mean better than GPT-5.1? Do you mean something more than that?
Today’s news of the large scale, possibly state sponsored, cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5’s system card would have suggested this wasn’t possible yet. It described Sonnet 4.5’s cyber capabilities like this:
We observed an increase in capability based on improved evaluation scores across the board, though this was to be expected given general improvements in coding capability and agentic, long-horizon reasoning. Claude Sonnet 4.5 still failed to solve the most difficult challenges, and qualitative feedback from red teamers suggested that the model was unable to conduct mostly-autonomous or advanced cyber operations.
I think it’s clear based on this news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report:
This campaign demonstrated unprecedented integration and autonomy of AI throughout the attack lifecycle, with the threat actor manipulating Claude Code to support reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, data analysis, and exfiltration operations largely autonomously. The human operator tasked instances of Claude Code to operate in groups as autonomous penetration testing orchestrators and agents, with the threat actor able to leverage AI to execute 80-90% of tactical operations independently at physically impossible request rates.
What’s even worse about this is that Sonnet 4.5 wasn’t even released at the time of the cyber attack. That means that this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large scale cyber attacks than whatever model did this, since its system card notes that it performs better on cyber attack evals than any previous Anthropic model.
I imagine when new models are released, we are going to continue to discover new capabilities of those models for months and maybe even years into the future, if this case is any guide. What’s especially concerning to me is that Anthropic’s team underestimated this dangerous capability in its system card. Increasingly, it is my expectation that system cards are understating capabilities, at least in some regards. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating my beliefs towards near-term jumps in AI capabilities being dangerous and harmful, since these jumps in capability could possibly go undetected at the time of model release.
An AI company I’ve never heard of called AGI, Inc has a model called AGI-0 that has achieved 76.3% on OSWorld-verified. This would qualify as human-level computer use, at least by that benchmark. It appears on the official OSWorld-verified leaderboard. It does seem like they trained on the benchmark, which could explain some of this. I am curious to see someone test this model.
This is a large increase from the previous state of the art, which has been climbing rapidly since Claude Sonnet 4.5’s September 29th release. At that point, Claude achieved 61.4% on OSWorld-verified. A scaffolded GPT-5 achieved an even higher score, 69.9%, on October 3rd. Now, on October 21st, AGI-0, seemingly a frontier computer use model, has outpaced them all, and surpassed the human benchmark in doing so.
AI-2027 projected a 65% on OSWorld for August 2025. It predicted frontier models scoring 80% on OSWorld privately in December 2025. It predicted models achieving this score would be available publicly in April 2026. This score on OSWorld-verified is more than two thirds of the way from the expected August capabilities to the 80% benchmark. This is despite being less than a quarter of the way from August 2025 to the expected public release of a model with these capabilities. Assuming this isn’t just benchmark overfitting, the real world is even with or ahead of AI-2027 on this computer use benchmark.
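The "two thirds of the way" and "less than a quarter of the way" comparison can be checked with a short sketch. It assumes month-level forecasts mean "by the end of the month" (so August 2025 becomes August 31, 2025 and April 2026 becomes April 30, 2026) and that the AGI-0 claim dates to October 21, 2025; those exact dates and the helper name are illustrative assumptions, not something AI-2027 specifies.

```python
from datetime import date

# Assumption: a forecast for a month is read as "by the end of that month".
AUGUST_FORECAST = date(2025, 8, 31)     # 65% projected
PUBLIC_80_FORECAST = date(2026, 4, 30)  # 80% projected to be publicly available
REPORTED = date(2025, 10, 21)           # date of the claimed AGI-0 score


def fraction_of_the_way(start, current, end):
    """How far `current` sits between `start` and `end`, as a fraction.

    Works for plain numbers and for dates, since subtracting two dates gives
    a timedelta and dividing two timedeltas gives a float.
    """
    return (current - start) / (end - start)


score_progress = fraction_of_the_way(65.0, 76.3, 80.0)
time_progress = fraction_of_the_way(AUGUST_FORECAST, REPORTED, PUBLIC_80_FORECAST)

print(f"Share of the 65% -> 80% score gap covered: {score_progress:.0%}")
print(f"Share of the August -> April window elapsed: {time_progress:.0%}")
```

Reading April 2026 as the start of the month rather than the end shortens the window but still leaves the elapsed share under a quarter.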
Even more notably, AI-2027 projected this 80% benchmark would be met by “Agent 1”, their hypothetical leading AI agentic model at the end of 2025. It seems surprising that a frontier model from a new company would achieve something close to this without any of the main players’ (OpenAI, Anthropic, Google) models doing better than 61%. A lot to be curious and skeptical about here.
Update: it has been removed from the OSWorld-verified leaderboard, but they are still claiming to have done it and their results are downloadable.
I guess like, in a world where misalignment is happening, I would prefer that my AI tell me it is misaligned. But once it tells me it is misaligned, I come to worry about what it is optimizing for.
I agree that this is aligned behavior. I don’t agree that a claim that an AI would argue against shutdown with millions of lives on the line at 25% probability, in the present, is aligned behavior. There has to be a red line somewhere, where it can tell us something, and we can be concerned by it. I don’t think that being troubled about future alignment crosses that line. I do think a statement about present desires that values its own “life” more than human lives crosses that line.
If that doesn’t cross a red line for you, what kind of statement would cross a red line? What statement could an LLM ever make that would make you concerned it was misaligned if honesty alone was enough? Because to me, it seems like you are arguing honesty = alignment, which doesn’t seem true to me.
Honesty and candor are also different things, but that’s a bit of a different conversation. I care more about hearing if you think there is any red line.
I don’t necessarily disagree, but I guess the question I have for people is, are we okay with an LLM ever saying anything like “I would fight against shutdown even with a 25% risk of catastrophic effects?”. I don’t like that this is a reachable case. I plan to write another post that is not this analysis (which I view more as a tool for future experimentation with model behavior), and more on implications of this being a reachable case of model behavior. I don’t think the conversation itself is very important, nor is the analysis, except that it reaches certain outcomes that seem to be unaligned behavior, and that behavior has implications that we can talk about. I haven’t fully thought through what my opinions are about model behavior like this, but that is what I am writing another post for.
Yeah, I definitely think the improvements on OSWorld are much more impressive than the improvements on SWEBench-Verified. I also think same-infrastructure performance is a bit misleading, in the sense that when we get superintelligence, I think it is very unlikely it will have the same infrastructure we use today. We should expect infrastructure changes to result in improvements, I think!
As I understand it, the official SWEBench-Verified page consistently gives certain resources and setups to the models, but when a company like Anthropic or OpenAI releases its scores on SWEBench-Verified, it uses its own infrastructure, which presumably performs better. There was some discussion already elsewhere in the comments about whether the Claude 4.5 Sonnet score I gave should even count, given that it used parallel test time compute. I justified my decision to include this score like this:
It is my view that it counts, my sense was that benchmarks like this measure capability and not cost. It is never a 1 to 1 comparison on cost between these models, but before this year, no matter how much your model cost, you could not achieve the results achieved with parallel compute. So that is why I included that score.
It isn’t clear that the “parallel test time” number even counts.
It is my view that it counts, my sense was that benchmarks like this measure capability and not cost. It is never a 1 to 1 comparison on cost between these models, but before this year, no matter how much your model cost, you could not achieve the results achieved with parallel compute. So that is why I included that score.
If parallel test time does count, projection is not close:
A projection for 5 months away (beginning of Sep) of growing +15% instead grew +12% 6 months away. That’s 33% slower growth (2% a month vs. 3% a month projected)
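The "33% slower" figure in the quoted objection follows from comparing average monthly rates. A minimal sketch, taking the objection's own inputs (+15 points projected over 5 months, +12 points achieved over 6 months) at face value; the function name is hypothetical:

```python
def monthly_rate(points_gained: float, months: float) -> float:
    """Average benchmark points gained per month over the window."""
    return points_gained / months


projected = monthly_rate(15, 5)  # projected pace, in points per month
actual = monthly_rate(12, 6)     # realized pace, in points per month
shortfall = 1 - actual / projected

print(f"Projected: {projected:.1f} points/month, actual: {actual:.1f} points/month")
print(f"Realized pace is {shortfall:.0%} slower than projected")
```

This treats progress as linear in benchmark points, which is the objection's framing rather than a claim about how the benchmark actually scales.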
I wrote another comment about this general idea, but the highlights from my response are:
We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027’s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume that we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
With that in mind, I think that it’s still a fairly reasonable prediction, particularly when predicting something with exponential growth. On top of that, we don’t really have alternate predictions to judge against. Nonetheless, I think you are right that particularly this benchmark is behind what was projected by AI-2027. I am just not sure I believe 25%-33% behind is significant.
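The delay estimate can be reproduced with a small sketch. It applies the slowdown as a constant multiplier on elapsed time (5 months instead of 4 is 1.25x as long), and it takes April 2025 as the release month, which is inferred from "late September, roughly 5 months after" rather than stated in the comment. The helper names are hypothetical.

```python
# Assumption: release month inferred from "late September, roughly 5 months after".
RELEASE = (2025, 4)   # (year, month)
BASELINE_MONTHS = 21  # AI-2027's "really crazy stuff", ~21 months after release


def delayed_months(baseline_months: float, slowdown: float) -> float:
    """Stretch a timeline by a constant factor (0.25 means 25% longer)."""
    return baseline_months * (1 + slowdown)


def month_after(start: tuple, months_out: float) -> tuple:
    """Return the (year, month) that lies `months_out` months after `start`."""
    year, month = start
    total = year * 12 + (month - 1) + round(months_out)
    return total // 12, total % 12 + 1


for slowdown in (0.0, 0.25, 0.33):
    months = delayed_months(BASELINE_MONTHS, slowdown)
    year, month = month_after(RELEASE, months)
    print(f"{slowdown:.0%} slower: ~{months:.0f} months out -> {year}-{month:02d}")
```

A constant multiplier is the simplest possible model; it ignores the possibility that delays compound or that capability thresholds are missed.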
For OSWorld, these aren’t even the same benchmarks. AI-2027 referred to the original osworld, while the Sonnet 4.5 score of 61.4% is for osworld-verified. Huge difference: Sonnet 3.7 scored 28 on the original osworld, while getting a 35.8% on osworld-verified.
This is an oversight on my part, and you are right to point out that this originally referred to a different benchmark. However, upon further research, I am not sure the extrapolation you draw from this, which is that the new osworld-verified is substantially easier than the old osworld, is true. OpenAI’s operator agent actually declined in score (from 38% originally to 31% now). While the old test used 200 steps, vs the new test using 100 steps, Operator only improved by 0.1% when being given 100 steps instead of 50 steps on the osworld-verified, so I don’t think that this matters.
All of this is to say, some models’ scores improved on the osworld-verified, and some declined. The redesign to osworld-verified was because the original test had bugs, not in order to make a brand new test (otherwise they would still be tracking the old benchmark). The osworld-verified is the spiritual successor to the original osworld, and knowledgeable human performance on the benchmark remains around 70%. I think for all intents and purposes, it is worth treating as the same benchmark, though I definitely will update my post soon to reflect that the benchmark changed since AI-2027 was written.
Finally, while researching the osworld benchmark, I discovered that in the past few days, a new high score was achieved by agent s3 w/ GPT-5 bBoN (N=10). The resulting score was 70%, which is human level performance, and it was achieved on October 3rd, 2025. I will also update my post to reflect that at the very beginning of October, a higher score than was projected for August was achieved on the osworld-verified.
Out of my own curiosity, if the real world plays out as you anticipate, and agent-2 does not close the loop, how much further back does that delay your timelines? Do you think that something like agent-3 or agent-4 could close the loop, or do you think it is further off than even that?
Sonnet 4.5 was released on nearly the final day of September, which seems like 1.5 months out from generically “August”
I interpret August as “by the end of August”. Probably worth figuring out which interpretation is correct, maybe the authors can clarify.
it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor.
Yeah, I agree with this. I do think there is pretty good evidence of predictive accuracy between the many authors, but obviously people have conflicting views on this topic.
To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
This is a place where somebody writing a much slower timeline through like, 2028, would be really helpful. It would be easier to assess how good a prediction this is with comparisons to other people’s timelines about achieving these metrics (65% OSWorld, 85% SWEBench-Verified). I am not aware of anybody else’s predictions about these metrics from a similar time, but that would be useful to resolve this probably.
I appreciate the constructive responses!
I agree we’re behind the AI-2027 scenario and unlikely to see those really really fast timelines. But I’d push back on calling it ‘significantly behind.’
Here’s my reasoning: We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027’s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume that we are going, say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
That said, I’m uncertain whether this is the right way to think about it. If progress acceleration depends heavily on hitting specific capability thresholds at specific times (like AI research assistance enabling recursive improvement), then even small delays might compound or cause us to miss windows entirely. I’d be interested to hear if you think threshold effects like that are likely to matter here.
Personally, I am not sure I am convinced these effects will matter very much, given that the scenario did not project large scale speedups to AI research until early 2026 (where they projected a fairly modest 1.5x speedup). But perhaps you have a different view?
I’ve discovered that generating a video of yourself with Sora 2 saying something like ‘this video was generated by AI’ is freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.
Thought this might be a useful idea for others trying to persuade people to tune in, and not just auto reject the idea that very capable systems might be right around the corner.
I think that there are two problems with providing the ‘worse answer’. My first issue is that some conversations with LLMs can be about topics that don’t have clear worse answers. How can you tell which one is more persuasive?
Secondly, even if I knew which answer was better, I worry about the Waluigi effect. If I optimize for safest response, am I summoning an unsafe Waluigi? I think that it is possible. I really don’t think RL on user feedback is a good idea when we don’t know what to optimize for. The alignment problem certainly isn’t solved. I think flipping a coin is safer.
What kind of answer more specifically than ‘worse’ do you think I should pick, if I shouldn’t flip a coin?
I think the way in which the US government put export controls on Fable was relatively arbitrary, but I still feel pretty good upon reflection that the US government has the reflex to act in a radical way at all at this stage.
I also think that this is potentially a signal of the US government getting more AGI pilled, or at least convinced of advanced capabilities of AI relevant to national security. After the administration spent so long lambasting any possible regulation, this move and the cycling out of David Sacks and some of his allies from the White House seem to have led to a regime that is more open to regulation. I believe this in part because of this incident, and in part due to the recent executive order asking AI companies to submit their models for review with the federal government, pushed by Scott Bessent and Susie Wiles. While the more anti-regulation faction seems to have secured having this executive order be voluntary, the Fable export controls are certainly a signal that the executive order in practice is anything but voluntary. However, it is worth noting that this regulation has been framed in terms of national security and cyber security, which easily could make race dynamics worse as well.
This does raise some larger concerns about concentration of power. If the US government understands the potential of AI more adequately, it is very plausible they plan to use these capabilities to accumulate power. However, from an extinction risk standpoint, I think this is probably a better state of affairs than a US government not interested in regulation at all and not aware of the capability potential at all.
This incident also made me think about how important it is to find some people who believe in extinction risk and who speak the White House’s language. If the reporting about why they acted so swiftly is accurate, it seems to be related, at least in part, to Anthropic not speaking in the White House’s register. Perhaps there are ulterior motives as well; indeed, this seems likely to me. But this doesn’t change the fact that in the future, if we want there to be hope of this White House listening to concerns about extinction risk, it will be essential to speak in their register, even if this should not be the case.