Baybar

Karma: 166

Baybar 4 Oct 2025 17:16 UTC
2 points
0
in reply to: stochastic_parrot’s comment on: Checking in on AI-2027
As I understand it, the official SWEBench-Verified page is consistently giving certain resources and setups to the models, but when a company like Anthropic or OpenAI releases their scores on the SWEBench-Verified, they use their own infrastructure which presumably performs better. There was some discussion already elsewhere in the comments about whether the Claude 4.5 Sonnet score I gave should even count, given that it used parallel test time compute, I justified by decision to include this score like this:
It is my view that it counts, my sense was that benchmarks like this measure capability and not cost. It is never a 1 to 1 comparison on cost between these models, but before this year, no matter how much your model cost, you could not achieve the results achieved with parallel compute. So that is why I included that score.

Baybar 4 Oct 2025 16:36 UTC
1 point
0
in reply to: Aaron Staley’s comment on: Checking in on AI-2027
It isn’t clear that the “parallel test time” number even counts.
It is my view that it counts, my sense was that benchmarks like this measure capability and not cost. It is never a 1 to 1 comparison on cost between these models, but before this year, no matter how much your model cost, you could not achieve the results achieved with parallel compute. So that is why I included that score.
If parallel test time does count, projection is not close:
1. A projection for 5 months away (beginning of Sep) of growing +15% instead grew +12% 6 months away. That’s 33% slower growth (2% a month vs. 3% a month projected)
I wrote another comment about this general idea, but the highlights from my response are:
We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027′s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume that we are going say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
With that in mind, I think that it’s still a fairly reasonable prediction, particularly when predicting something with exponential growth. On top of that, we don’t really have alternate predictions to judge against. Nonetheless, I think you are right that particularly this benchmark is behind what was projected by AI-2027. I am just not sure I believe 25%-33% behind is significant.
For OSWorld, these aren’t even the same benchmarks. ai-2027 referred to the original osworld, while the sonnet 4.5 score of 61.4% is for osworld-verifed. Huge difference—Sonnet 3.7 scored 28 on osworld original, while getting a 35.8% on osworld-verified.
This is an oversight on my part, and you are right to point out that this originally referred to a different benchmark. However, upon further research, I am not sure the extrapolation you draw from this, which is that the new osworld-verified is substantially easier than the old osworld, is true. OpenAI’s operator agent actually declined in score (from 38% originally to 31% now). While the old test used 200 steps, vs the new test using 100 steps, Operator only improved by 0.1% when being given 100 steps instead of 50 steps on the osworld-verified, so I don’t think that this matters.
All of this is to say, some models scores improved on the osworld-verified, and some declined in score. The redesign to osworld-verified was because the original test had bugs, not in order to make a brand new test (otherwise they would still be tracking the old benchmark). The osworld-verified is the spiritual successor to the osworld-verified, and knowledgeable human performance on the benchmark remains around 70%. I think for all intents and purposes, it is worth treating as the same benchmark, though I definitely will update my post soon to reflect that the benchmark changed since AI-2027 was written.
Finally, while researching the osworld benchmark, I discovered that in the past few days, a new high score was achieved by agent s3 w/ GPT-5 bBoN (N=10). The resulting score was 70%, which is human level performance, and it was achieved on October 3rd, 2025. I will also update my post to reflect that at the very beginning of October, a higher score than was projected for August was achieved on the osworld-verified.

Baybar 3 Oct 2025 15:27 UTC
5 points
0
in reply to: Thane Ruthenis’s comment on: Checking in on AI-2027
Out of my own curiosity, if the real world plays out as you anticipate, and agent-2 does not close the loop, how much further back does that delay your timelines? Do you think that something like agent-3 or agent-4 could close the loop, or do you think it is further off than even that?

Baybar 2 Oct 2025 22:43 UTC
4 points
0
in reply to: Cole Wyeth’s comment on: Checking in on AI-2027
Sonnet 4.5 was nearly the final day of September which seems like 1.5 months out from genetically “August”
I interpret August as “by the end of August”. Probably worth figuring out which interpretation is correct, maybe the authors can clarify.
it IS important for assessing their predictive accuracy, and if their predictive accuracy is poor, it does not necessarily mean all of their predictions will be slow by the same constant factor.
Yeah, I agree with this. I do think there is pretty good evidence of predictive accuracy between the many authors, but obviously people have conflicting views on this topic.
To be clear, all of these signals are very weak. I am only (modestly) disagreeing with the positive claim of the OP.
This is a place where somebody writing a much slower timeline through like, 2028, would be really helpful. It would be easier to assess how good a prediction this is with comparisons to other people’s timelines about achieving these metrics (65% OSWorld, 85% SWEBench-Verified). I am not aware of anybody else’s predictions about these metrics from a similar time, but that would be useful to resolve this probably.
I appreciate the constructive responses!

Baybar 2 Oct 2025 20:04 UTC
7 points
3
in reply to: Cole Wyeth’s comment on: Checking in on AI-2027
I agree we’re behind the AI-2027 scenario and unlikely to see those really really fast timelines. But I’d push back on calling it ‘significantly behind.’
Here’s my reasoning: We nearly hit the August benchmarks in late September, roughly 5 months after AI-2027′s release instead of 4 months. That’s about 25% slower. If that rate difference holds constant, the ‘really crazy stuff’ that AI-2027 places around January 2027 (~21 months out) would instead happen around June 2027 (~26 months out). To me, a 5-month delay on exponential timelines isn’t drastically different. Even if you assume that we are going say, 33% slower, we are still looking at August 2027 (~28 months out) for some really weird stuff.
That said, I’m uncertain whether this is the right way to think about it. If progress acceleration depends heavily on hitting specific capability thresholds at specific times (like AI research assistance enabling recursive improvement), then even small delays might compound or cause us to miss windows entirely. I’d be interested to hear if you think threshold effects like that are likely to matter here.
Personally, I am not sure I am convinced these effects will matter very much given that there was not supposed to be large scale speedups to AI research in 2025 in the scenario until early 2026 (where they projected a fairly modest 1.5x speedup). But perhaps you have a different view?

Baybar 1 Oct 2025 21:02 UTC
39 points
1
on: Baybar’s Shortform
I’ve discovered that generating a video of yourself with Sora 2 saying something like ‘this video was generated by AI’ is sufficiently freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.
Thought this might be a useful idea for others trying to persuade people to tune in, and not just auto reject the idea that very capable systems might be right around the corner.

Baybar 14 Sep 2025 4:42 UTC
3 points
0
in reply to: StanislavKrym’s comment on: What Parasitic AI might tell us about LLMs Persuasion Capabilities
I think that there are two problems with providing the ‘worse answer’. My first issue is that some conversations with LLMs can be about topics that don’t have clear worse answers. How can you tell which one is more persuasive?
Secondly, even if I knew which answer was better, I worry about the Waluigi effect. If I optimize for safest response, am I summoning an unsafe Waluigi? I think that it is possible. I really don’t think RL on user feedback is a good idea when we don’t know what to optimize for. The alignment problem certainly isn’t solved. I think flipping a coin is safer.
What kind of answer more specifically than ‘worse’ do you think I should pick, if I shouldn’t flip a coin?

Baybar 11 Sep 2025 21:59 UTC
4 points
0
on: Open Thread—Summer 2025
I am curious about what the Albanian prime minister means when he says he is appointing a chatbot to be a minister. I am not crazy knowledgeable about AI, though I have lurked here for a while, and I don’t know if this is a new use case or not. Does anyone has any sense of what this actually means/if this is worth being concerned about in any way or if its just a PR stunt or something equally as unconcerning as a PR stunt. My read of it is that it seems like it is a helper tool that they are calling a cabinet minister for political reasons. It doesn’t seem like there is a lot of information out there right now about what they actually mean by this but maybe in a few days somebody more knowledgeable than me could clear this up for me.