As a tenured (albeit perhaps now ‘emeritus’) member of the “generally critical commentator crew”, I think this is the wrong decision (cf.). As the OP largely anticipates the reasons I would offer against it, I think the disagreement is a matter of degree among the various reasons pro and con. For a low-resolution sketch of why my pricing of the ‘pro tanto’ considerations differs from the moderators’:
I don’t think Said’s commenting, in aggregate, strays that close to the sneer attractor. “Pointed questions with an undercurrent of disdain” may not be ideal, but I have seen similar-to-worse antics[1] (e.g. writing posts which are thinly veiled/naked attacks on other users, or routine abuse of subtext followed by ‘going meta’ to mire any objection to this game in interminable rules-lawyering) from others on this site who have prosecuted ‘campaigns’ against ideologies/people they dislike.[2]
The principal virtue of Said doing this for LW is calling bullshit on things which are, in fact, bullshit. I think there remains too much (e.g.) ‘post’-‘rationalist’ woo on LW, and it warrants being robustly challenged/treated with the disdain it deserves. I don’t see many others volunteering for duty.
The principal cost is when this misfires, so the author ends up led into a subthread wasteland by Said taking an odd, (perhaps unintentionally) tendentious line of questioning. In principle, this should not be that costly: if a comment asks for a clarification where I am confident other readers would agree with me that the questioner is being very dumb, willfully obtuse, or making a ‘cheap shot’, I can ignore them without fear of third parties making an adverse inference. This applies whether it is the initial exchange or 1+ plies deep.[3]
Even if ‘in principle’ this is fine, maybe (per the OP) the scales tilt the other way in practice. But I don’t think doing more to be ‘writer friendly’ by squashing putative gadflies like Said buys enough marginal high-quality community content to outweigh the (admittedly nebulous) cost of ‘taxing criticism’.
The track record of hewing moderation to cater to authors has not borne much fruit so far: the mod tools were introduced in large part to entice Eliezer back. He’s not back. I also recall a lot of mod effort being spent on mediating spats between high-profile users/contributors, but the usual end result is that these dramatis personae have faded away anyway.
---
Regardless of all that, it’s your website, and I’m barely a stakeholder worth considering (my last substantial contribution was over a decade ago). I wouldn’t hold it against Pace or Habryka if, from arguments we have had on the EA forum,[4] they thought my judgement best inverted and my absence satisfying.
I expect I will continue participating very little in LW, although Said getting banned has little to do with it. Basically, I don’t find enough ‘good generalist (i.e. not principally focused on AI) content’ here anymore. I think Said incrementally helped by reducing the volume/prominence of not-so-good generalist content, so this seems a step in the wrong direction.[5] Happy days, and more fool me, if the future proves me wrong.
[1] Although I appreciate mitigating circumstances (and that it is an isolated case), moderator behaviour on this post has been ‘similar-to-worse antics’ too. It seems bad form to (as it appears Habryka has done) strong-downvote a large number of (most?) comments by Said in the threads he is arguing with him in (can I do this if I get into a fight with someone with much lower vote power than me?). Ditto (as Pace did) using site-admin info to score points against a dissenting user he wanted to be snide to, especially when that user seems to be dissenting in the manner the OP requested they do.
[2] I’m not giving examples, to avoid prompting a subthread wasteland on whatever I bring up. If this is widely disbelieved and crucial to the discussion, I am open to being cajoled into naming some names.
[3] Aside: it is perhaps unfortunate that ‘tapping out’ is the lingo for dropping a discussion. In martial arts (notwithstanding the gloss on the wiki that it can mean one ‘is tired, or at risk of injury, or has simply had one’s fill’), tapping is typically an admission of defeat.
Regardless of the lingo, there is still the advantage of having the ‘last word’ (cf. OP). I could be odd, but I feel this gets outweighed by the much lower visibility of deeply nested comments: the readership of (e.g.) the 5th+ nested comment is seldom more than ‘you and your interlocutor’. In terms of ‘discussion as social fight’, whoever got ratioed in the first 1-2 back-and-forths on the thread is the loser, even if they make the last ‘rebuttal’.
[4] FWIW, I don’t have the impression that the EA forum is more ‘LinkedIn-y’ than LW nowadays. Besides roughly similar levels of spats/drama, many of my comments there are much meaner towards the OP than Said’s, and I haven’t generally had the moderators ‘on my case’ about them (e.g.).
[5] But there are secular explanations which likely overdetermine this, e.g.:
- Maybe we’ve run out of useful general things to say, so useful conversation inevitably gets more and more sub-specialised?
- Maybe the noughties internet just generated a lot of surplus for places like LW, but nowadays gifted writers want to cultivate their own Substack or whatever.
- Maybe things have professionalized, so the typical commenter who could share interesting takes on AI alignment (or whatever) as an amateur has been recruited to a think tank as a professional.
I also applaud the effort to interrogate the underlying data. I have also been dismayed at people hanging dramatic updates off (what should usually be?) one to a few bits of surprisal. (I don’t think METR can fairly be blamed for others ~hunting noise in the ‘last’ data point: the CIs are clearly printed on the graph.)
Per other comments, I think the more theoretical worries in the OP miss the mark: you should end up with something like a logistic curve if task length is unbounded but success probability is bounded in (0, 1), and logging does a fairly good job at linearizing the data (although at least for Sonnet 3.7 the fit collapses in the 2hr+ region, and eyeballing the other histograms suggests this might generalize).
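For concreteness, a minimal sketch of that shape. Everything here is synthetic and of my own invention (the task lengths, the slope, the 60-minute ‘true’ horizon), not METR’s code or data; it just shows the general recipe of fitting success against log task length and reading off the 50% crossing.

```python
# Minimal illustrative sketch (synthetic data; not METR's code, tasks, or numbers):
# model P(success) as logistic in log2(human completion time), then read off the
# task length at which the fitted curve crosses 50% -- the "time horizon".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical task suite: human completion times from ~1 minute to ~16 hours,
# spread roughly uniformly in log space.
task_minutes = np.exp(rng.uniform(np.log(1), np.log(960), size=1000))
log2_len = np.log2(task_minutes)

# Assumed ground truth: success falls off logistically in log2(task length),
# with a 50% horizon at 60 minutes.
true_horizon, slope = 60.0, 0.8
p_true = 1 / (1 + np.exp(slope * (log2_len - np.log2(true_horizon))))
outcomes = rng.binomial(1, p_true)

# Fit success ~ logistic(b0 + b1 * log2(task length)); weak regularisation so
# this is essentially a maximum-likelihood fit.
clf = LogisticRegression(C=1e6).fit(log2_len.reshape(-1, 1), outcomes)
b1, b0 = clf.coef_[0][0], clf.intercept_[0]

# The 50% horizon solves b0 + b1 * log2(t) = 0.
print(f"fitted 50% horizon ~{2 ** (-b0 / b1):.0f} min (true: {true_horizon:.0f} min)")
```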
Yet I think they may be in the right neighbourhood of a ‘construct validity’ worry around time horizons. In précis (hopefully a full post someday):
Unlike (e.g.) ‘how fast can you run?’ or ‘how much can you lift?’, there’s seldom a handy cardinal scale for intellectual performance: IQ = 0 does not mean ‘zero intelligence’, nor does your having double my chess Elo mean you are twice as good at chess as I am. (Even if you’re happy not having a meaningful zero, meaningful interval scales don’t exist either.)
Besides issues of general overprediction, it seems hard to tell how meaningful an increment of D on benchmark X is. The function from ‘benchmark score’ to ‘irl importance’ (or ‘AI capabilities’) could be almost anything monotonic: from “any nonzero score is a cataclysmic breakthrough (but any further increment matters little on the margin)” to “a long march through the ‘nines’ (so all scores <99.9% are ~equally worthless)”, and everything in between.
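A toy rendering of those two extremes (the score series and both transfer functions are invented purely for illustration): the same steady benchmark gains imply wildly different ‘importance’ trajectories depending on which monotone mapping you assume.

```python
# Two invented, monotone 'benchmark score -> irl importance' mappings applied to
# the same hypothetical score series, to show how underdetermined the mapping is.
import numpy as np

scores = np.linspace(0.10, 0.90, 9)  # hypothetical steady benchmark gains

# Extreme A: any nonzero score captures most of the value; further gains add little.
importance_breakthrough = 1 - np.exp(-20 * scores)

# Extreme B: a long march through the 'nines'; value only appears near perfection.
importance_nines = np.exp(20 * (scores - 1))

for s, a, b in zip(scores, importance_breakthrough, importance_nines):
    print(f"score {s:.0%}: importance A ~{a:.2f}, importance B ~{b:.1e}")
```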
Hence the utility of METR’s time horizons as a (/the only?) cardinal measure: ‘doubling’ is meaningful, and (if treated, as it often is, and I suspect more than METR would like it to be, as a proxy for ‘AI capabilities in general’) it shows a broad trend of exponentially increasing capabilities over the last few years. (+/- discourse over whether recent data points indicate even more dramatic acceleration, ‘hitting the wall’, etc.)
What is load-bearing for this account is the essentially exponential transformation from ‘raw’ scores on HCAST etc. to time horizons. Per the OP (and comments), you can get a similar plot with just the raw scores, and it is largely the transformation from those to time horizons which gives (e.g.) Opus 4.5, scoring 75%, ~double the time horizon of GPT5 (70%), or ~treble the time horizon of o3 (66%). If the y-axis of the figure were instead “composite accuracy (SWAA+HCAST+REBench)”, the figure might be grist for the mill of folks like Gary Marcus: “A whole year of dramatically increasing investment and computation, and all it got you was another 10%.”
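To see how gently the suite-average score moves as the horizon multiplies, here is a toy calculation; the task-length spread, logistic slope, and 60-minute baseline horizon are all assumptions of mine, not METR’s numbers.

```python
# Toy illustration (my own numbers, not METR's suite or fit): with success
# logistic in log2(task length) and tasks spanning ~seconds to ~a working day,
# doubling or trebling the 50% time horizon moves the *average* score by only
# single-digit percentage points.
import numpy as np

# Hypothetical task lengths (minutes), spread evenly in log space: ~2 s to 16 h.
log2_len = np.linspace(np.log2(2 / 60), np.log2(960), 2000)

def mean_score(horizon_min, slope=0.7):
    """Suite-average success rate for a model whose 50% horizon is horizon_min."""
    p = 1 / (1 + np.exp(slope * (log2_len - np.log2(horizon_min))))
    return p.mean()

base = 60.0  # minutes
for mult in (1, 2, 3):
    print(f"{mult}x horizon ({base * mult:.0f} min): mean score {mean_score(base * mult):.0%}")
```

On these assumptions, each doubling of the horizon buys only single-digit gains in average accuracy, which is the sense in which the exponential y-axis, rather than the raw scores, carries most of the drama.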
It goes without saying that METR didn’t simply stipulate “linear score improvement = exponentially increasing time horizons”: it arose from a lot of admirable empirical work demonstrating that human completion times are roughly log-distributed.
But at least when taken as the colloquial byword for AI capabilities, this crucial contour feels a bit too mechanistic to me. I take it that the fact you can generalise the technique widely to other benchmarks deepens rather than alleviates this concern: if human benchmarking exercises would give log-distributed horizons across the items in many (/most?) benchmarks, such that progressive linear increments in model performance would yield a finding of exponentially improving capabilities, maybe too much is being proven.
Taking the horizons (and their changes) literally has dubious face validity by my lights:
- It doesn’t seem to me the frontier has gotten ~3x more capable over this year, and although I’m no software engineer, it doesn’t look from the outside like e.g. Opus 4.5 is 2x better at SWE than Opus 4.1, etc.
- Presumably we could benchmark humans against the time horizons directly (IIRC not everyone used in the benchmarking could successfully complete the tasks), or at least against the benchmarks from which the time horizons are imputed. I’d at least be doubtful our best guess should be that Alice (who cracks 75%) is 3x the SWE that Bob (who hits 65%) is, etc.
That said, given our grasp of the ‘true cardinal scale of intellect’ is murky (or the scale fictitious), even if my vibes are common it looks reasonable to deny them rather than the facially contradicting data.
Perhaps the underlying moral of the jagged frontier is that there isn’t some crisp (or at least crisp and practically accessible) measure out there re. ‘general intelligence’ (or even general measures of intelligence when particularly applied: cf. ‘twice as good at chess’), and we should focus on metrics specific to whatever real-world impact we are interested in (maybe for ‘AI generally’, just trend-extrapolate from ‘the economy generally’?). But if the story of benchmarks over the last while is that they are missing whatever intellectual dark matter intervenes between ‘benchmark assessing X’ and ‘actually Xing’, maybe you can’t derive sturdy synthetic y-axis yardsticks from their distorted timber: the transfer function from ‘time horizon’ to ‘irl importance’ is a similar value of “??” as that of the original benchmarks.