I think once you assume a logistic function, it's almost guaranteed that if a new model solves one additional task, it’s going to continue the log-linear trend.
No, whether it continues the log-linear trend depends on WHEN a new model solves one additional task.
Using the logistic function to fit task success probability vs task length is not load-bearing for the time horizon computation.
If the task lengths in the dataset were uniformly distributed (in log-space), you could just take the overall accuracy and look up the task length at that percentile in the data, and that would be nearly identical to the 50% horizon. This would replace the logistic assumption with a step-function assumption, but because the logistic is point-symmetric about its 50% point, you get roughly the same value.
Put differently: there is an interval over which the model goes from basically 100% to basically 0%, and the logistic just takes the point in the middle as the 50% horizon. Many other methods would also take the point in the middle (possibly less robustly, but that would just make the trend a little noisier).
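Here is a minimal sketch of that point (Python; not the paper's code, and the array names and helpers are mine): given per-task human completion times and a binary success flag, the 50% horizon from a logistic fit in log task length comes out close to a plain percentile lookup, provided task lengths are spread reasonably evenly in log-space.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_len, a, b):
    # Success probability as a function of log task length;
    # p = 0.5 exactly at log_len = b (for a > 0, longer tasks -> lower p).
    return 1.0 / (1.0 + np.exp(a * (log_len - b)))

def horizon_logistic(task_minutes, successes):
    """50% horizon (minutes) from a logistic fit of success vs log task length."""
    x = np.log(np.asarray(task_minutes, float))
    y = np.asarray(successes, float)
    (a, b), _ = curve_fit(logistic, x, y, p0=[1.0, np.median(x)])
    return np.exp(b)  # b is the log-length at which fitted p = 0.5

def horizon_percentile(task_minutes, successes):
    """Percentile lookup: overall accuracy -> quantile of task length."""
    acc = np.mean(successes)
    # Step-function assumption: the model succeeds on tasks shorter than its
    # horizon, so the acc-quantile of (log) task length approximates that horizon.
    return np.exp(np.quantile(np.log(np.asarray(task_minutes, float)), acc))
```

On any dataset where task lengths cover the transition region, the two estimates should land in roughly the same place; the logistic version is just the smoother, more robust way of picking the midpoint.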
I think your feeling that this is suspect comes more from the choice of log-space than from the choice of fitting function. It feels a bit circular to say “we’re going to assume that log task length is the natural way to look at model competence” and then get the result that, over calendar time, model competence improves linearly in that log-space. But I think it is this choice of log-space, not the logistic, that is motivated by the log-linear plot of model success rate vs. human time to complete.
Also, I would point out that the validity of the time horizons computed for current models doesn’t rest on just these 16 tasks, but on the preceding six-year trend plus replications of exponential trends in other datasets. It’s great to point out that current measurements have a ton of noise and are very gameable, but it’s hard to attack the conclusion of exponential progress in time horizons.
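To be concrete about what that conclusion means operationally, here is a small sketch (again mine, not the paper's code) of how one would read a doubling time off such a trend: regress log2 of the horizon on release date, with per-model (release year, horizon) pairs coming from whatever horizon estimator you prefer.

```python
import numpy as np

def doubling_time_years(release_years, horizon_minutes):
    """Fit log2(horizon) ~ slope * year + intercept; return years per doubling."""
    slope, _intercept = np.polyfit(np.asarray(release_years, float),
                                   np.log2(np.asarray(horizon_minutes, float)), 1)
    return 1.0 / slope  # years for the 50% horizon to double under the fit
```

The exponential-progress claim is exactly the claim that this fit is a good description of the data, and that the slope has stayed roughly stable across years and datasets.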
I skimmed the paper last week but I lost interest when I couldn’t find out which models were used.