Various things I cut from the above:
Adaptiveness and Discrimination
There is some evidence that AIs treat AIs and humans differently. This is not necessarily bad, but it at least enables interesting types of badness.
With my system prompt (which requests directness and straight-talk) they have started to patronise me:
Training awareness
Last year it was not obvious that LLMs remember anything much about the RL training process. Now it’s pretty clear that they do. (The soul document was used in both SFT and RLHF though.)
Progress in non-LLMs
“World model” means at least four things:
A learned model of environment dynamics for RL, allowing planning in latent space or training in the model’s “imagination” (see the sketch after this list).
The new one: just a 3D simulator; a game engine inside a neural network (Deepmind, Microsoft). The claim is that they implicitly learn physics, object permanence, etc. The interesting part is that they take actions as inputs. Here’s Quake running badly on a net. Maybe useful for agent training.
If an LLM’s representations are stable and effectively symbolic, then people say it has a world model.
A predictive model of reality learned via self-supervised learning. The touted LeJEPA self-supervised scheme on small (15M param) CNNs is domain-specific. It does better on one particular transfer task than small vision transformers, and presumably worse than large ones.
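To make the first sense concrete, here’s a minimal sketch of a latent dynamics model used for imagined rollouts, loosely in the Dreamer style. All names, dimensions, and architecture choices here are illustrative assumptions of mine, not any particular paper’s.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Sense (1) of "world model": a learned model of environment dynamics.

    Encodes observations into a latent state, then predicts the next latent
    and the reward given an action -- enough to roll out imagined
    trajectories without touching the real environment.
    """
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 128),
                                      nn.ReLU(),
                                      nn.Linear(128, latent_dim))
        self.reward_head = nn.Linear(latent_dim, 1)

    def imagine(self, obs: torch.Tensor, policy, horizon: int = 15):
        """Roll out `horizon` steps purely in latent space ("imagination")."""
        z = self.encoder(obs)
        trajectory = []
        for _ in range(horizon):
            a = policy(z)                                 # act from the latent
            z = self.dynamics(torch.cat([z, a], dim=-1))  # predicted next latent
            trajectory.append((z, self.reward_head(z)))
        return trajectory
```

The point of the exercise: once the dynamics network is decent, you can train the policy on imagined trajectories, which is far cheaper than real environment steps.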
The much-hyped Small Recursive Transformers only work on a single domain, and do a bunch worse than the frontier models for about the same inference cost, but have truly tiny training costs, O($1000).
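The core trick in this family, as I understand it, is weight reuse across depth: one small block applied to its own output many times, buying effective depth with compute rather than parameters. A toy sketch (structure and names mine, not the paper’s):

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """One small transformer block applied repeatedly: effective depth
    comes from iteration, not from stacking new parameters."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)

    def forward(self, x: torch.Tensor, n_steps: int = 16) -> torch.Tensor:
        # The same few-million-parameter block, reused n_steps times.
        for _ in range(n_steps):
            x = self.block(x)
        return x
```

This is also why training is so cheap: the parameter count stays tiny even when the unrolled computation is deep.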
HOPE and Titans might be nothing, might be huge. They don’t scale very far yet, and haven’t been compared head-to-head with any real frontier systems.
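For flavour, the Titans-style idea as I understand it: a small memory network that keeps learning at inference time, taking a gradient step on the “surprise” of each new token. A toy sketch; the class, names, and update rule are all my own simplification:

```python
import torch
import torch.nn as nn

class TestTimeMemory(nn.Module):
    """A tiny associative memory that keeps training at inference time:
    each new (key, value) pair triggers a gradient step on the memory."""
    def __init__(self, dim: int = 64, lr: float = 0.01):
        super().__init__()
        self.memory = nn.Linear(dim, dim, bias=False)
        self.lr = lr

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        surprise = (self.memory(key) - value).pow(2).mean()  # how wrong we were
        grad, = torch.autograd.grad(surprise, self.memory.weight)
        with torch.no_grad():
            self.memory.weight -= self.lr * grad  # bigger surprise, bigger update

    def read(self, key: torch.Tensor) -> torch.Tensor:
        return self.memory(key)
```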
Silver’s meta-RL programme is apparently working, though they are still using 2D games as the lead benchmark. Here they discovered a SOTA update rule from Atari games and it transferred perfectly to ProcGen. Using more environments leads to even better update rules.
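Roughly, the discovered-update-rule setup: the target the agent regresses its predictions onto is produced by a small meta-network, and that meta-network is itself trained, across many environments, on how well its agents end up doing. A heavily simplified sketch, with all structure invented for illustration:

```python
import torch
import torch.nn as nn

# The rule that produces the agent's learning target is itself a small
# network, meta-trained across many environments (outer loop not shown).
meta_rule = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

def inner_update(value_optimizer, reward, value, next_value):
    """One agent update using the learned rule in place of a hand-designed
    target such as TD(0)'s r + gamma * V(s')."""
    rule_input = torch.stack([reward, value, next_value], dim=-1).detach()
    target = meta_rule(rule_input).squeeze(-1)
    # The inner loop treats the target as fixed; meta_rule is updated
    # separately, by gradients through agents' end-of-training performance.
    loss = (value - target.detach()).pow(2).mean()
    value_optimizer.zero_grad()
    loss.backward()
    value_optimizer.step()
```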
Any of these taking over could make large swathes of Transformer-specific safety work irrelevant. (But some methods are surprisingly robust.)
The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking plausible. The contrary hypothesis (associationism?) is that general reasoning is just a bunch of heuristics and priors piled on top of each other, so that you need a big pile of memorisation. That too is a live possibility.
[Figure: “the very first scaling laws of the actual abilities of LLMs”, from ADeLe. Legend: KNs = Social Sciences and Humanities, AT = Atypicality, VO = Volume (task time). The y-axis is the logistic of the subject characteristic curve (the chance of success) for each skill.]
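For reference, the characteristic-curve framing is standard item-response theory: the chance that a model of ability θ on a given skill succeeds on a task of difficulty b follows a logistic in their difference. The notation below is the textbook one, not necessarily ADeLe’s.

```latex
% Characteristic curve: probability that a model with ability \theta on a
% given skill succeeds on a task of difficulty b, with discrimination a.
P(\text{success} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}
```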
Other
Model introspection is somewhat real.
Vladimir Nesov continues to put out some of the best hardware predictions pro bono.
Jason Wei has a very wise post noting that verifiers are still the bottleneck and existing benchmarks are overselected for tractability.
There are now “post-AGI” teams.
Kudos to Deepmind for being the first to release output watermarking and a semi-public detector. Just say @synthid in any Gemini session. You can nominally sign up for its API here.
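For intuition on how output watermarking can work at all, here’s the published greenlist scheme (Kirchenbauer et al.), not necessarily SynthID’s actual algorithm: the sampler secretly biases generation toward a pseudorandom “green” subset of tokens keyed on the preceding token, and the detector just counts greens and runs a z-test.

```python
import hashlib
import math

def green_fraction(token_ids: list[int], key: str = "secret") -> float:
    """Fraction of tokens landing in the pseudorandom 'green' half of the
    vocabulary, where greenness is keyed on the preceding token."""
    greens = 0
    for prev, tok in zip(token_ids, token_ids[1:]):
        digest = hashlib.sha256(f"{key}:{prev}:{tok}".encode()).digest()
        greens += digest[0] % 2  # ~50% of (prev, tok) pairs count as green
    return greens / max(len(token_ids) - 1, 1)

def z_score(frac: float, n: int, p: float = 0.5) -> float:
    """Unwatermarked text should give frac ~= p; watermarked text, well above.
    A large z-score flags likely machine generation."""
    return (frac - p) / math.sqrt(p * (1 - p) / n)
```

The watermark survives because the generator’s bias accumulates over hundreds of tokens while staying invisible in any single word choice.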
Previously, Microsoft’s deal with OpenAI stipulated that Microsoft couldn’t try to build AGI itself. Now they can (try). Simonyan is in charge, despite Suleyman being the one on the press circuit.
The CCP did a bunch to slow down Chinese AI this year (accidentally, and presumably only short-term).
Major insurers are nervous about AI agents (but asking the government for an exclusion isn’t the same as putting them in the policies).
Offence/defence balance
This post doesn’t much cover the hyperactive and talented AI cybersecurity world (except as it overlaps with things like robustness). One angle I will bring up: We can now find critical, decade-old security bugs in extremely well-audited software like OpenSSL and sqlite. Finding them is very fast and cheap. Is this good news?
Well, red-teaming turns many attacks into a defence, as long as you actually do the red-teaming.
But Dawn Song argues that LLMs overall favour offence, since offence’s margin for error is so broad, since remediation is slow and expensive, and since defenders are less willing to use unreliable (and itself insecure) AI. And can you blame them?
See also “just-in-time AI malware”, where the payload contains no suspicious code, just a call to HuggingFace that pulls in the malicious logic at runtime.
Egregores and massively-multi-agent mess
There is something wrong (something horribly right) with 4o. Blinded users still prefer it to gpt-5-high, and this is surely due both to them simply liking its style and to darker stuff like sycophancy. It will live on through illicit distillation and in-context transference. Shame on OpenAI for making this mess; kudos to OpenAI for doing unpopular damage control; and good luck to them in round 2.
Open models will presumably eventually overrun the closed ones in the codependency market segment. See Pressman for a sceptical timeline and Rath and Armstrong for a good idea.
More generally there is pressure from users to refuse less, flatter more, and replace humans more; yet another economic constraint on for-profit AI.
Whether it’s the counterfactual cause of mental problems or not, so-called “LLM psychosis” is now a common path of pathogenesis. Note that the symptoms are literally not psychotic (they are delusions).
I’ve gotten similar responses from Claude without having that in the system prompt.
Afaict, some of this is now in the Gemini app. But if not, feel free to ping me (I have access).
Yep, thanks, just tried. Just say @synthid in any Gemini session.