Aaron Staley

Karma: 120

Aaron Staley 5 Jan 2026 14:33 UTC
2 points
0
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
If I understand correctly, you are advocating for using a call only strategy (as opposed to a (synthetic) long strategy) to achieve higher leverage than would otherwise be possible?

> This is partly for speculation, but it seems reasonable for most people with 2 years of savings to have 10% of their net worth in SPY options or 20% in SPX options [4] for hedging purposes alone.

To clarify, you mean 10% of net worth being in this specific contract (SPY280616C01000000)? So roughly 15:1 leverage using options?

Readers should note this has very strong returns if you get that 50%+ return, but isn’t straight leverage—the median outcome here is about a 12.7% reduction in portfolio value in next 2.5 years relative to pure SPY.

Aaron Staley 2 Jan 2026 18:08 UTC
6 points
0
in reply to: Thane Ruthenis’s comment on: AI #149: 3
I agree with you that “Opus 4.5 can do anything” is overselling it and there is too much hype around acting like these things are fully autonomous software architects. I did want to note though that Opus 4.5 is a vast improvement and praise is warranted.
> My guess is that “convert this already-written code from this representation/framework/language/factorization to this other one” may be one of the things LLMs are decent at, yep!

Agreed, I’m relying on their “localized” intelligence to get work done fast. Where Anthropic has improved their models significantly this year is A) improving task “planning”, e.g. how to extract the relevant context needed to make decisions LLMs broadly already could make, and B) editing code in sane ways that don’t break things (at the beginning of the year, Claude would chew up any 4000+ LOC file just from wrong tool use). In some ways, this isn’t necessarily higher “intelligence” (Claude models remain relatively dumber at solving novel problems compared to frontier GPT/Gemini) but proper training in the coding domain.

> But this isn’t really “vibe-coding”/“describe the spec in natural language and watch the LLM implement it!”/“programming as a job is gone/dramatically transformed!”, the way it’s being advertised. LLMs are not, it seems, actually good at mapping natural-language descriptions into non-hack-y, robust background logic. You need a “code-level” prompt to specify the task precisely enough

It’s a mixed bag. In practice, I can vibe code 100-line isolated modules from natural language, though it does require inspecting the code for bugs and then giving the model feedback so it can fix things. Still much faster than hand writing and slightly faster than “intention” auto-complete with Cursor.
But overall, yes, I agree that I continue to do all the systems architecture and it feels like I’m offloading more well defined tasks to the model.

Aaron Staley 2 Jan 2026 1:12 UTC
14 points
1
in reply to: Thane Ruthenis’s comment on: AI #149: 3
> None of that worked, I detect basically no change since August.

What sort of codebase are you working on? I work in a 1 million line typescript codebase and Opus 4.5 has been quite a step up from Sonnet 4.5 (which in turn was a step up from the earlier Sonnet/Opus 4 series).

I wouldn’t say I can leave Opus 4.5 on a loose leash by any means, but unlike prior models, using AI agents for 80%-90% of my code modifications (as opposed to in-IDE with autocomplete) has actually become ROI positive for me.

The main game changer is that Opus has simply become smarter about working with large code bases—less hallucinated methods, more research into the codebase before actions are taken, etc.

As a simple example, I’ve had a “real project” benchmark for a while: convert ~2000 lines of test cases from an old framework to a new one. Opus 4.5 was able to pull it off with relatively minimal initial steering (showing an example of a converted test case, correcting a few issues around laziness when it did the first 300-line set). Sonnet 4.5’s final state was a bit buggier and, more importantly, what it actually wrote during initial execution was considerably buggier, requiring it to self-correct from typecheck or test case failures. (Ultimately, Opus ended up costing about the same as Sonnet with a third of the wall clock time.)

Most of my work is refactoring. In August, I would still have to do most of it manually given the high error rate of LLMs. These days? Opus is incredibly reliable with only vague directions. As another recent example: I had to add a new parameter to a connection object constructor to indicate whether it should be read-only. Opus was able to readily update dozens of call sites correctly based on whether each call site was using the connection to write.

By no means does it feel like an employee (the ai-2027 agent-1 definition), but it is a powerful tool (getting more powerful through the generations) that has changed how I work.

Aaron Staley 24 Dec 2025 20:25 UTC
1 point
0
in reply to: StanislavKrym’s comment on: Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins
Yes, both model families are similar in that they do not have consistently declining accuracy in the 2-16 hour task window. The modeling is somewhat broken when you have higher accuracy in the 8-16 hour window than the 2-4 hour window.

GPT models do not have this characteristic; while they don’t fit the curve perfectly, at least accuracy drops roughly monotonically with task length. (The exception is o4-mini, which also had bizarre patterns in that 2-16 hour window.)

I suspect at some level heavy RLVF has broken the core METR model of performance correlating to task length.

Aaron Staley 24 Dec 2025 20:16 UTC
1 point
0
in reply to: Afterimage’s comment on: Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins
I personally think the stronger argument here is that, if you look at the histograms, Claude models are not growing in capability in a way consistent with “higher task length = harder” (Grok 4 was similar).

Both Sonnet 4.5 and Opus 4.5 were outperforming in the 8 to 16 hour bracket over the 2 to 4 hour bracket, which is highly inconsistent with the task length difficulty model. The model appears broken at least since 3.5 Sonnet given the flatness of the 2-16 hour tasks.

You end up in a case where the 4.5 Sonnet curve has a higher % of the solved tasks under it than 4.5 Opus (note how 4.5 Opus gets 0 tasks right in the 16 hour to 32 hour window even though the distribution implies it should be more like 25%). That is, the “gain” this implies is dramatically overstated. [1]

The unfortunate consequence is largely shash42’s point: it’s not clear that modeling “task length horizon” is a valid way to view this data. Raw accuracy seems better correlated with time.

[1] An alternative interpretation is that Sonnet 4.5 was much better than the METR curve then implied.

Aaron Staley 21 Dec 2025 17:17 UTC
1 point
0
in reply to: Megan Kinniment’s comment on: How to game the METR plot
Thanks for the histograms. Is the raw data available somewhere?

Just eyeballing it:
- accuracy growth rate for Opus 4.5 in the 4-8 hour range is what is expected given the trendline from Sonnet to Sonnet 4.5.
- The 8-16 hour growth came in ~3 months ahead of target.
- The 2-4 hour growth is a month or so ahead of target.
This aligns with my sense that the model is a month, maybe 2 months, ahead of what is expected, and that a lot of this jump (4.5 months ahead of expected) comes from artifacts of the curve fitting.

Aaron Staley 20 Dec 2025 21:08 UTC
11 points
0
in reply to: StanislavKrym’s comment on: Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins
Private workspace so I can’t share the session. But the approach is simple and doesn’t really require it to understand.
I think we’re coming at this from different angles: you’re doing a “white-box” critique (how specific task outcomes / curve fitting affect the METR horizon), whereas I’m doing a “black-box” consistency check: is the claimed p50 result consistent with what we see on other benchmarks that should correlate with capability?
The core model is:
1. Take Sonnet 4 → Sonnet 4.5 and compute the improvement rate (slope).
2. Assume Opus improves at the same rate as Sonnet over this period.
3. Start from Opus 4 as the anchor and ask: “when would we expect to reach the Opus 4.5 reported value?”
  (For METR horizons I do this in log space; for accuracy/ECI I treat it as linear.)
That yields “time ahead/behind” vs the reported Opus 4.5 result:
- ECI: ~1.3 months ahead
- SWE-bench bash agent: on target (about a week behind)
- METR accuracy: ~2.4 months ahead
- METR 80% horizon: ~1 month behind
- METR 50% horizon (using METR’s reported 289 min): ~4.5 months ahead
The point is that METR p50 is the outlier relative to the other signals.
If instead we assume Opus 4.5 is only as far “ahead” as the other benchmarks suggest, then p50 should be closer to:
- 1.3 months ahead (ECI-like): ~200 minutes
- 2.4 months ahead (accuracy-like): ~226 minutes
And the corresponding implied p80 would be:
- on-target: ~28 minutes
- 1.3 months ahead: ~30 minutes
- 2.4 months ahead: ~32 minutes
My best guess is we’re ~1 month ahead overall, which puts p50/p80 in-between those cases.
Finally, percentiles inside METR’s CI depend on the (unstated) sampling distribution; if you approximate it as log-normal you get the rough “position within the CI” numbers I mentioned, but it’s only an approximation.
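
A minimal sketch of the "time ahead/behind" calculation described in the steps above. The function names and every number in the usage lines are hypothetical placeholders chosen for illustration; they are not the Sonnet/Opus figures behind the results in this comment.

```python
import math


def monthly_slope(start, end, months, log_space=False):
    """Improvement per month between two releases (steps 1 and 2)."""
    if log_space:
        start, end = math.log(start), math.log(end)
    return (end - start) / months


def months_ahead(anchor, reported, slope, actual_months, log_space=False):
    """Step 3: how long the anchor model 'should' take to reach the
    reported value at the given slope, minus how long it actually took.
    Positive means ahead of trend, negative means behind."""
    if log_space:
        anchor, reported = math.log(anchor), math.log(reported)
    expected_months = (reported - anchor) / slope
    return expected_months - actual_months


def implied_value(anchor, slope, months, log_space=False):
    """Value the trend implies `months` after the anchor release."""
    if log_space:
        return anchor * math.exp(slope * months)
    return anchor + slope * months


# Hypothetical horizon example (log space): the reference family doubles
# from 60 to 120 minutes in 4 months; the anchor model goes 80 -> 320
# minutes in 6 months.
horizon_slope = monthly_slope(60, 120, 4, log_space=True)
horizon_lead = months_ahead(80, 320, horizon_slope, 6, log_space=True)

# Hypothetical accuracy example (linear): 70% -> 78% in 4 months for the
# reference family; the anchor model goes 72% -> 80% in 6 months.
accuracy_slope = monthly_slope(70, 78, 4)
accuracy_lead = months_ahead(72, 80, accuracy_slope, 6)

# Horizon implied if the model were exactly `horizon_lead` months ahead.
implied_horizon = implied_value(80, horizon_slope, 6 + horizon_lead, log_space=True)

print(horizon_lead, accuracy_lead, implied_horizon)
```

The same `implied_value` helper is how one would turn "only 1.3 months ahead" into an implied p50, given the real anchor value and slope.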

Aaron Staley 20 Dec 2025 19:01 UTC
18 points
9
on: Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins
Bayesians are updating too much on AI capability speed from this data point, given:
- The CI is extremely wide, and METR itself caveats the sparsity of tasks at higher horizons.
- This level of a jump relative to the previous Opus 4.1 or Opus 4 is inconsistent with the 80% success threshold, accuracy level, and other key benchmarks that should correlate with capabilities (ECI, swe-bench bash).
I modeled all this in GPT-5.2 and the more realistic estimate for 50% derived from the other benchmarks is in the range of 190 to 210 minutes, depending on how much weight you put on the impressive (but not to the degree of the 50%) accuracy jump. The 80% is likely a slight underestimate (my guess is closer to 29 minutes).

These numbers:
- Map to around a 15-20th percentile on the CI for the 50% and a 60th percentile for the 80%, in the realms of “this is due to chance”. [1]
- Give around a 5-6 month doubling relative to Opus 4 and a 6-7 month doubling relative to O3.
- Imply that evidence for the gap between 50% and 80% capabilities systematically widening is weak.
[1] Note that this does provide evidence that Gemini 3 and GPT-5.2 will also have high p50 scores. Not because of capability jump per se but because of the distribution of tasks within METR benchmarks.

Aaron Staley 5 Oct 2025 18:06 UTC
2 points
0
in reply to: Baybar’s comment on: Checking in on AI-2027
Good response. A few things I do want to stress:
> I am just not sure I believe 25%-33% behind is significant.

I personally see the lower bound as 33% slower. That’s enough to change 2 to 3 years which is significant.
And again, realistically progress is even slower. The parallel compute version only increased by 1.8% in 4 months. We might be another 6 months from hitting 85% at current rates—this is quite a prediction gap.
> and knowledgeable human performance on the benchmark remains around 70%.

Is this true? They haven’t updated their abstract claiming 72.36% (which was from the old version) and I’m wondering if they simply haven’t re-evaluated.
But yes, looking at the GTA1 paper, you are correct that perf varies a bit between os-world and os-world-verified, so I take back that growth is obviously slower than projected.
All said, I trust swe-bench-verified more regardless to track progress:
1. We’re relying on a well-made benchmark that was done as a second pass by OpenAI. os-world is not that.
2. Labs seem to be targeting it more; low-hanging fruit like attaching python interpreters just doesn’t exist for this benchmark (I’m not sure if the ai-2027 authors considered this issue when making their os-world predictions).
3. We are concerned mainly with coding abilities (automated ai research) on the ai 2027 timelines.

Aaron Staley 3 Oct 2025 18:29 UTC
6 points
3
on: Checking in on AI-2027
> Claude Sonnet 4.5 scored an 82% on this metric, as of September 29th, 2025. Three percentage points below the 85% target, achieved one month late, again, remarkably close. Particularly given that in August, Opus 4.1 was already scoring 80% on this benchmark.

I disagree this is close for several reasons.
1. It isn’t clear that the “parallel test time” number even counts.
  1. My understanding is these benchmarks can’t be achieved by using mechanisms that cost more in compute than having a human perform the task manually, and we have no idea how many parallel attempts are sampled. They use up to 256 in their post on GPQA.
  2. It uses an internal scoring model that might not generalize beyond the repos swe-bench tests.
  3. Sonnet 3.7′s 70.3% score did not exist on swebench.com at the point ai-2027 was released (highest was 65.4%), suggesting the authors were not anchoring from that parallel test time number to begin with.
2. If parallel test time does count, projection is not close:
  1. The projection was +15% growth over 5 months (by the beginning of Sep); instead we got +12% over 6 months. That’s 33% slower growth (2% a month vs. 3% a month projected).
  2. Looking more recently, the growth from May’s Sonnet 4 with parallel compute to now (4 months later) has been 1.8%. At this rate assuming linearity, 85% won’t be crossed for nearly 7 months from now, which is over 60% slower than projection.
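
The growth-rate arithmetic in point 2 can be checked with a short sketch. All inputs are the figures quoted above; the last step is only one assumed reading of how the "over 60% slower" figure could be derived (the projected +15 points spread over the 6 months already elapsed plus the extrapolated remainder).

```python
# Projected vs. realized growth, in percentage points per month.
projected_rate = 15 / 5          # +15 points over 5 months
realized_rate = 12 / 6           # +12 points over 6 months
shortfall = 1 - realized_rate / projected_rate

# Recent trend: +1.8 points over the 4 months since May's Sonnet 4.
recent_rate = 1.8 / 4
months_to_target = (85 - 82) / recent_rate

# Assumed reading of "over 60% slower": the full +15 points would take
# the 6 months elapsed plus the extrapolated months still to go.
overall_rate = 15 / (6 + months_to_target)
overall_slowdown = 1 - overall_rate / projected_rate

print(f"realized growth is {shortfall:.0%} slower than projected")
print(f"months until 85% at the recent rate: {months_to_target:.1f}")
print(f"overall slowdown vs. projection: {overall_slowdown:.1%}")
```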
> Claude Sonnet 4.5 scored a 62% on this metric, as of September 29th, 2025.

For OSWorld, these aren’t even the same benchmarks. ai-2027 referred to the original osworld, while the sonnet 4.5 score of 61.4% is for osworld-verified. That is a huge difference: Sonnet 3.7 scored 28 on osworld original, while getting a 35.8% on osworld-verified. We might be at more like a 55.6% SOTA today (GTA1 w/ GPT-5) on OG osworld, a huge miss (~46% slower).

Overall, realized data suggests something more like an ai-2029 or even later.

Aaron Staley 7 Aug 2025 21:59 UTC
5 points
1
in reply to: O O’s comment on: tdko’s Shortform
I don’t believe there’s a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding tasks where a stronger correlation exists).
1. Gemini 2.5 Pro was already well ahead of O3 on IMO, but had worse swe-bench/METR scores.
2. Claude is relatively bad at math but has hovered around SOTA on agentic coding.

Aaron Staley 7 Aug 2025 19:41 UTC
7 points
1
in reply to: Bitnotri’s comment on: tdko’s Shortform
+ 25% for swe-bench relative to Gemini 2.5? Quadrupling the METR task length of Gemini 2.5?

I suppose it’s a possibility, albeit a remote one.

Aaron Staley 7 Aug 2025 18:36 UTC
21 points
12
in reply to: β-redex’s comment on: tdko’s Shortform
The swe-bench scores are already well below trend from ai 2027. Had to hit 85% by end of month. We’re at 75%. (and SOTA was ~64% when they released ai 2027)

Aaron Staley 31 Jul 2025 6:19 UTC
3 points
0
in reply to: Nikola Jurkovic’s comment on: nikola’s Shortform
Very wide confidence intervals. If Grok 4 were equal to O3 in 50% time horizon, it “beating” O3 by this much is a 33% outcome. (On the other hand, losing by this amount in the 80% bucket is a 32% outcome.)
Overall, I read this as about equally agentic as O3. Possibly slightly less so given the lack of swe-bench scores published for it (suggesting it wasn’t SOTA).

Aaron Staley 29 Jul 2025 16:25 UTC
3 points
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
> My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon^[1] on METR’s evaluation suite ^[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.^[3]

If the correlations continue to hold, this would map to something like a 78% to 80% range on swe-bench pass @ 1 (which is likely to be announced at release). I’m personally not this bearish (I’d guess low 80s given that benchmark has reliably jumped ~3.5% monthly), but we shall see.
Needless to say if it scores 80%, we are well below AI 2027 timeline predictions with high confidence.

Aaron Staley 29 Jul 2025 16:22 UTC
5 points
2
in reply to: Garrett Baker’s comment on: tdko’s Shortform
Coding agentic abilities are different from general chatbot abilities. Gemini is IMO the best chatbot there is (just in terms of understanding context well if you wish to analyze text/learn things/etc.). Claude on the other hand is dead last among the big 3 (a steep change from a year ago) and my guess is Anthropic isn’t trying much anymore (focusing on.. agentic coding instead)

Aaron Staley 29 Jul 2025 16:20 UTC
3 points
0
in reply to: tdko’s comment on: tdko’s Shortform
I don’t see that producing much of an update. Its SWE-bench score as you note was only 59.6%, which naively maps to ~50 minutes METR.

Aaron Staley 21 Jul 2025 18:51 UTC
3 points
0
in reply to: james oofou’s comment on: james oofou’s Shortform
I don’t think you can just start at the HCAST timeline for software engineering and map it to IMO problems.
An alternative bearish prediction would start from Deep Think getting 50% on USAMO on May 20 (not released, lab frontier). 80% tasks are ~4x the task time of 50% ones (at least for software engineering; not sure what it is for math), so we needed two doublings (6 months) to pull this off and instead only had ~0.67.
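
A minimal sketch of the doubling arithmetic above. The 3-month doubling time is the one implied by "two doublings (6 months)"; the 2 months elapsed (May 20 to roughly July 20) is an assumption rounded to whole months.

```python
import math

# An 80%-success task takes ~4x the time of a 50%-success one.
horizon_ratio = 4
doublings_needed = math.log2(horizon_ratio)

# Implied by "two doublings (6 months)".
months_per_doubling = 6 / doublings_needed

# Assumption: about 2 months between May 20 and the IMO result.
months_elapsed = 2
doublings_available = months_elapsed / months_per_doubling

print(doublings_needed, round(doublings_available, 2))
```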

Aaron Staley 20 Jul 2025 1:17 UTC
3 points
0
in reply to: tdko’s comment on: OpenAI Claims IMO Gold Medal
To put into perspective, there was only an 8% chance P3 would be this easy, putting substantial weight on the “unexpected” part being the problem being so easy. It’s also the first time in 20 years (5% chance) that 5 problems were of difficulty ≤ 25.
Indeed, knowing that Gemini 2.5 Deep Think could solve an N25 (IMO result from Gemini 2.5 pro) and an A30 (known from Gemini 2.5 Deep think post), I’m somewhat less impressed. Only barriers were a medium-ish geometry problem (P2), which of course alpha geometry could solve and an easy combinatorics (P1).

The two most impressive things are, factoring in this write-up by Ralph Furman:

* OpenAI’s LLM was able to solve a medium level geometry problem. (guessing Deepmind just used alpha geometry again) - Furman thought this would be hard for informal methods.
* OpenAI’s LLM is strong enough to get the easy combinatorics problem (Furman noted informal methods would likely outperform formal ones on this one—just a matter if the LLM were smart enough)

Aaron Staley 19 Jul 2025 23:53 UTC
5 points
0
in reply to: Stephen Martin’s comment on: Agents lag behind AI 2027′s schedule
swe-bench pass @ 1 on Claude sonnet versions has been 33.4% (June − 3.5) → 49.0% (October) → 62.3% (Feb − 3.7) → 72.7% (May − 4). That’s practically linear at 3.5% gain/month. That would extrapolate to end of August at 83%.
With the leaderboard at ~75.2% on July 1, such an extrapolation also gets us to around 82%.
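
A rough sketch of the linear extrapolation above, using an ordinary least-squares fit. The month indices are an assumption: each release is rounded to a whole month and counted from June 2024 (Claude 3.5 Sonnet) as month 0, with August 2025 as month 14. Because of that rounding, the fitted numbers only approximately reproduce the figures quoted.

```python
# (month index, swe-bench pass@1 %) for the Sonnet releases listed above.
# Assumed indices: June 2024 = 0, Oct 2024 = 4, Feb 2025 = 8, May 2025 = 11.
points = [(0, 33.4), (4, 49.0), (8, 62.3), (11, 72.7)]


def least_squares(data):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(points)

# Sonnet-only trend extended to August 2025 (assumed month index 14).
sonnet_projection = intercept + slope * 14

# Leaderboard variant: ~75.2% on July 1 plus two months at 3.5 points/month.
leaderboard_projection = 75.2 + 3.5 * 2

print(f"fitted gain per month: {slope:.2f} points")
print(f"Sonnet trend at end of August: {sonnet_projection:.1f}%")
print(f"leaderboard trend at end of August: {leaderboard_projection:.1f}%")
```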