Review by Opus 4.5 + Grok 4 + GPT 5.1 + Gemini 3 Pro:
The “Capabilities in 2025” section is analytically rigorous in places (benchmark skepticism, hardware economics, ADeLe praise) but undercut by its own comment section and by presenting contestable framings (“in-distribution,” “not much more useful”) as more settled than they are. The strongest contribution is the hardware-constraints narrative—explaining why pretraining looked disappointing without invoking a “scaling wall.” The weakest element is the tension between the author’s skeptical thesis and the enthusiastic practitioner comments that directly contradict it.
The “Capabilities” section is technically sophisticated but cynical. It serves as a strong antidote to marketing hype by exposing the “dirty laundry” of 2025 progress: that we are largely just squeezing more juice out of existing architectures via RL and data contamination, rather than inventing new paradigms. However, it may over-index on the mechanism of progress (post-training hacks) to downplay the result (drastically more capable coding agents). Even if the progress is “messy,” the economic impact of that “mess” is still profound.
On the interpretation of the ADeLe jaggedness analysis: gemini-3-pro is most critical, arguing the “11% regression” finding is fundamentally flawed because it likely conflates capability loss with safety refusals: “If ADeLe is measuring ‘did the model output the correct answer,’ a refusal counts as a ‘regression’ in intelligence, when it is actually a change in policy.” gemini-3-pro argues this “undermines” the “capability loss” interpretation and questions whether the analysis distinguishes “can’t do it” from “won’t do it.”
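gemini-3-pro’s confound can be made concrete. A correctness-only scorer lumps refusals in with wrong answers; separating the two labels is what would let an analysis tell “can’t do it” from “won’t do it.” A minimal sketch (the refusal markers, labels, and helper name are hypothetical illustrations, not ADeLe’s actual scoring scheme):

```python
from collections import Counter

def classify(response: str, correct_answer: str) -> str:
    """Toy three-way scorer: refusal vs wrong vs correct.
    Markers and labels are hypothetical, not ADeLe's scheme."""
    refusal_markers = ("i can't help", "i won't", "i'm not able to assist")
    if any(m in response.lower() for m in refusal_markers):
        return "refused"   # policy change: "won't do it"
    # capability question: "can't do it"
    return "correct" if correct_answer in response else "wrong"

results = Counter(
    classify(r, a)
    for r, a in [
        ("The answer is 42.", "42"),
        ("I can't help with that request.", "42"),
        ("The answer is 41.", "42"),
    ]
)
# A correctness-only metric would score "refused" and "wrong"
# identically, counting both as capability regressions.
```

Under a binary metric the second and third responses are indistinguishable; only the three-way tally preserves the policy/capability distinction the critique turns on.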
The “25% Each” Decomposition is Pseudo-Data. The notebook/post breaks down progress into:
- 25% Real Capability
- 25% Contamination
- 25% Benchmaxxing
- 25% Usemaxxing
Critique: This has zero basis in the data analysis. It is a “Fermi estimate” (a polite term for a guess) masquerading as a quantitative conclusion. Placing it alongside the rigorous IRT work cheapens the actual data analysis. It anchors the reader to a “mostly fake” (75%) progress narrative without any empirical support.
gpt-5.1 and grok-4 rate [Safety in 2025] as one of the post’s strongest and most insightful sections (evidence-dense, cohesive with the capabilities section, a valuable snapshot at 7.5/10), while opus-4.5 deems it the weakest relative to its ambition (thin metrics, vague updates to priors versus the capabilities section’s rigor) and gemini-3-pro calls it sobering but prescriptively weak (honest yet inconclusive on scalability).
The bullet on Chinese labs notes that they’re often criticised less than Western labs even when arguably more negligent, partly because they’re not (yet) at the frontier and partly because Western critics expect to have less leverage over them, and concludes “that is still too much politics in what should be science.” But AI safety and governance are unavoidably political: who deploys what, where, and under what constraints is not a purely scientific question. The lament about “too much politics” risks obscuring that, and it doesn’t fully acknowledge legitimate reasons discourse may treat different jurisdictions differently (e.g., different mechanisms of influence, different geopolitical stakes).
Overall, the number and degree of errors and bluffing in the main chat are a pretty nice confirmation of this post’s sceptical side. (This is however one-shot and only the most basic kind of council!)
e.g. Only Grok was able to open the Colab I gave them; the others instead riffed extensively on what they thought it would contain. I assume Critch is still using Grok 4 because 4.1 is corrupt.
e.g. Gemini alone analysed completely the wrong section.
Overall I give the council a 4⁄10.
The RoastMyPost review is much better; I made one edit as a result (Anthropic settled rather than letting a precedent be set). It takes a while to load!