Yep, thanks, just tried. Just say @synthid in any Gemini session.
Technicalities
Review by Opus 4.5 + Grok 4 + GPT 5.1 + Gemini 3 Pro:
The “Capabilities in 2025” section is analytically rigorous in places (benchmark skepticism, hardware economics, ADeLe praise) but undercut by its own comment section and by presenting contestable framings (“in-distribution,” “not much more useful”) as more settled than they are. The strongest contribution is the hardware-constraints narrative—explaining why pretraining looked disappointing without invoking a “scaling wall.” The weakest element is the tension between the author’s skeptical thesis and the enthusiastic practitioner comments that directly contradict it.
The “Capabilities” section is technically sophisticated but cynical. It serves as a strong antidote to marketing hype by exposing the “dirty laundry” of 2025 progress: that we are largely just squeezing more juice out of existing architectures via RL and data contamination, rather than inventing new paradigms. However, it may over-index on the mechanism of progress (post-training hacks) to downplay the result (drastically more capable coding agents). Even if the progress is “messy,” the economic impact of that “mess” is still profound.
On the interpretation of the ADeLe jaggedness analysis, gemini-3-pro is most critical, arguing that the “11% regression” finding is fundamentally flawed because it likely conflates capability loss with safety refusals: “If ADeLe is measuring ‘did the model output the correct answer,’ a refusal counts as a ‘regression’ in intelligence, when it is actually a change in policy.” It argues this makes the “capability loss” interpretation “undermined” and questions whether the analysis distinguishes “can’t do it” from “won’t do it.”
The “25% Each” Decomposition is Pseudo-Data. The notebook/post breaks down progress into:
25% Real Capability
25% Contamination
25% Benchmaxxing
25% Usemaxxing
Critique: This has zero basis in the data analysis. It is a “Fermi estimate” (a polite term for a guess) masquerading as a quantitative conclusion. Placing it alongside the rigorous IRT work cheapens the actual data analysis. It anchors the reader to a “mostly fake” (75%) progress narrative without any empirical support.
gpt-5.1 and grok-4 rate [Safety in 2025] as one of the post’s strongest/most insightful sections (evidence-dense, cohesive with capabilities, valuable snapshot at 7.5/10), while opus-4.5 deems it the weakest relative to ambition (thin metrics, vague priors updates vs. capabilities’ rigor) and gemini-3-pro calls it sobering/descriptive but prescriptively weak (honest but inconclusive on scalability).
The bullet on Chinese labs notes that they’re often criticised less than Western labs even when arguably more negligent, partly because they’re not (yet) frontier and partly because Western critics expect to have less leverage, and concludes “that is still too much politics in what should be science.”
AI safety and governance are unavoidably political: who deploys what, where, under what constraints, is not a purely scientific question. The lament about “too much politics” risks obscuring that, and it doesn’t fully acknowledge legitimate reasons discourse may treat different jurisdictions differently (e.g., different mechanisms of influence, different geopolitical stakes).

Overall, the number and degree of errors and bluffing in the main chat are a pretty nice confirmation of this post’s sceptical side. (This is however one-shot and only the most basic kind of council!)
e.g. Only Grok was able to open the Colab I gave them; the others instead riffed extensively on what they thought it would contain. I assume Critch is still using Grok 4 because 4.1 is corrupt.
e.g. Gemini alone analysed completely the wrong section.
Overall I give the council a 4⁄10.
Ways we can fail to answer
Will link this!
Works for me!
My perhaps overcynical take is to assume that any benchmark which gets talked about a lot is being optimised. (The ridiculously elaborate scaffold already exists for Pokemon, so why wouldn’t you train on it?) But I would update on an explicit denial.
I was guessing that the transfer learning people would already have some handy coefficient (normalised improvement on nonverifiable tasks / normalised improvement on verifiable tasks) but a quick look doesn’t turn it up.
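For concreteness, here’s the kind of thing I mean, as a minimal sketch. The names, the headroom-based normalisation, and the numbers are all my own invention, not something from the transfer-learning literature:

```python
def normalised_improvement(before: float, after: float, ceiling: float = 1.0) -> float:
    """Fraction of the remaining headroom closed: 0.40 -> 0.70 against a
    ceiling of 1.0 closes half the gap, so this returns 0.5."""
    return (after - before) / (ceiling - before)


def spillover_coefficient(verifiable_before: float, verifiable_after: float,
                          nonverifiable_before: float, nonverifiable_after: float) -> float:
    """Normalised improvement on nonverifiable tasks divided by normalised
    improvement on verifiable tasks. 1.0 would be full proportional transfer;
    0.0 would be no spillover at all."""
    return (normalised_improvement(nonverifiable_before, nonverifiable_after)
            / normalised_improvement(verifiable_before, verifiable_after))


# Invented numbers: a big RLVR-driven jump on verifiable tasks, a small one elsewhere.
print(spillover_coefficient(0.40, 0.70, 0.40, 0.46))  # 0.2, i.e. a fifth of the gain spills over
```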
Interesting, it’s off the API. What’s the usage limit like?
Thanks. I am uncertain (“unclear”), and am interested in sharpening this to the point where it’s testable.
I basically never use a non-RLed model for anything, so I agree with the minimal version of the generalisation claim.
We could just reuse some transfer learning metric? If 100% is full proportional improvement, I’d claim like <10% spillover on nonverified tasks. What about you?
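To make that concrete with invented numbers: if verifiable-task accuracy goes from 40% to 70% (closing half the remaining headroom) while nonverifiable-task accuracy goes from 40% to 42% (closing about 3% of it), the coefficient sketched above is roughly 0.033 / 0.5 ≈ 7%, i.e. under my 10% bar.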
Another thing I was trying to point at is my not knowing what RL environments they’re using for these things, and so not knowing what tasks count in the denominator. I’m not going to know either.
Okee, edited it.
(I am not confident, incidentally; Ctrl+F “Manifold” for my strong doubts.)
Fair. Just checking: are you counting 20 years as short?
Thanks!
Amazing as always, thanks
Not a reliable source, but I’m open to the possibility (footnote 1)
You’re saying they’re the same base model? Cite?
Agree, and I already note that coding is the exception a few times throughout. That sentence is intended to counteract naive readings of “useful”. I’ll add a footnote anyway.
Main post out next week! Roughly 100 theory papers.
Various things I cut from the above:
Adaptiveness and Discrimination
There is some evidence that AIs treat AIs and humans differently. This is not necessarily bad, but it at least enables interesting types of badness.
With my system prompt (which requests directness and straight-talk) they have started to patronise me:
Training awareness
Last year it was not obvious that LLMs remember anything much about the RL training process. Now it’s pretty clear. (The soul document was used in both SFT and RLHF though.)
Progress in non-LLMs
“World model” means at least four things:
A learned model of environment dynamics for RL, allowing planning in latent space or training in the model’s “imagination” (a toy sketch follows this list).
The new one: just a 3D simulator; a game engine inside a neural network (Deepmind, Microsoft). The claim is that they implicitly learn physics, object permanence, etc. The interesting part is that they take actions as inputs. Here’s Quake running badly on a net. Maybe useful for agent training.
If an LLM’s representations are stable and effectively symbolic, then people say the LLM has a world model.
A predictive model of reality learned via self-supervised learning. The touted LeJEPA self-supervised scheme on small (15M param) CNNs is domain-specific. It does better on one particular transfer task than small vision transformers, presumably worse than large ones.
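To make sense (1) above concrete, here is a toy sketch of planning in a learned latent model. Everything here (shapes, random weights, function names) is invented for illustration; a real system would train these components on logged transitions rather than using random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world model in sense (1): an encoder into a latent state plus a learned
# transition and reward model, so an agent can roll out futures without
# touching the real environment. Weights are random here; a real system would
# fit them to logged (observation, action, next observation, reward) tuples.
W_enc = rng.normal(size=(8, 16))      # observation (16-dim) -> latent (8-dim)
W_dyn = rng.normal(size=(8, 8 + 4))   # (latent, action) -> next latent
W_rew = rng.normal(size=(1, 8))       # latent -> predicted reward


def encode(obs):
    return np.tanh(W_enc @ obs)


def imagine_step(z, action):
    z_next = np.tanh(W_dyn @ np.concatenate([z, action]))
    reward = float((W_rew @ z_next)[0])
    return z_next, reward


def imagined_return(obs, actions):
    """Score an action sequence 'in imagination', i.e. without the real environment."""
    z, total = encode(obs), 0.0
    for a in actions:
        z, r = imagine_step(z, a)
        total += r
    return total


# Crude planning: compare two candidate three-step plans for the same observation.
obs = rng.normal(size=16)
plan_a = [rng.normal(size=4) for _ in range(3)]
plan_b = [rng.normal(size=4) for _ in range(3)]
print(imagined_return(obs, plan_a), imagined_return(obs, plan_b))
```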
The much-hyped Small Recursive Transformers only work on a single domain, and do a bunch worse than the frontier models for about the same inference cost, but have truly tiny training costs, O($1000).
HOPE and Titans might be nothing, might be huge. They don’t scale very far yet, nor have they been compared against any real frontier systems.
Any of these taking over could make large swathes of Transformer-specific safety work irrelevant. (But some methods are surprisingly robust.)
The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking plausible. The contrary hypothesis (associationism?) is that general reasoning is just a bunch of heuristics and priors piled on top of each other and you need a big pile of memorisation. It’s also a live possibility.

[Figure: “the very first scaling laws of the actual abilities of LLMs”, from ADeLe. KNs = Social Sciences and Humanities, AT = Atypicality, and VO = Volume (task time). The y-axis is the logistic of the subject characteristic curve (the chance of success) for each skill.]

Other
Model introspection is somewhat real.
Vladimir Nesov continues to put out some of the best hardware predictions pro bono.
Jason Wei has a very wise post noting that verifiers are still the bottleneck and existing benchmarks are overselected for tractability.
There are now “post-AGI” teams.
Kudos to Deepmind for being the first to release output watermarking and a semi-public detector. You can nominally sign up for it here.
Previously, Microsoft’s deal with OpenAI stipulated that they couldn’t try to build AGI. Now they can (try). Simonyan is in charge, despite Suleyman being the one on the press circuit.
The CCP did a bunch to (accidentally/short-term) slow down Chinese AI this year.
Major insurers are nervous about AI agents (but asking the government for an exclusion isn’t the same as putting them in the policies).
Offence/defence balance
This post doesn’t much cover the hyperactive and talented AI cybersecurity world (except as it overlaps with things like robustness). One angle I will bring up: We can now find critical, decade-old security bugs in extremely well-audited software like OpenSSL and sqlite. Finding them is very fast and cheap. Is this good news?
Well, red-teaming makes many attacks into a defence, as long as you actually do the red-team.
But Dawn Song argues that LLMs overall favour offence, since offence’s margin for error is so broad, since remediation is slow and expensive, and since defenders are less willing to use unreliable (and itself insecure) AI. And can you blame them?
See also “just in time AI malware” where the payload contains no suspicious code, just a call to HuggingFace.
Egregores and massively-multi-agent mess
There is something wrong (something horribly right) with 4o. Blinded users still prefer it to gpt-5-high, and this is surely due both to them simply liking its style and to darker stuff like sycophancy. It will live on through illicit distillation and in-context transference. Shame on OpenAI for making this mess; kudos to OpenAI for doing unpopular damage control, and good luck to them in round 2.
Open models will presumably eventually overrun them in the codependency market segment. See Pressman for a sceptical timeline and Rath and Armstrong for a good idea.
More generally there is pressure from users to refuse less, flatter more, and replace humans more; yet another economic constraint on for-profit AI.
Whether it’s the counterfactual cause of mental problems or not, so-called “LLM psychosis” is now a common path of pathogenesis. Note that the symptoms are literally not psychotic (they are delusions).
AI in 2025: gestalt
Yep! In footnote 3
Nice points. I would add “backtracking” as one very plausible general trick purely gained by RLVR.
I will own up to being unclear in OP: the point I was trying to make is that last year there was a lot of excitement about way bigger off-target generalisation than cleaner CoTs, basic work skills, uncertainty expression, and backtracking. But I should do the work of finding those animal spirits/predictions and quantifying them, and quantifying the current situation.