1a3orn
I really do like the “do humans do this” heuristic.
Basically rephrasing what you said: In humans, domain-specific or even problem-specific new languages are much easier to create than universal new languages. So the first kind of “new languages” LLMs might make (if they make them) would plausibly end up looking like this.
The question would then be (1) whether these would tend to merge into a universal new language, (2) whether they would simply split into multiple domain-specific languages that do not converge, (3) whether a good training technique would keep DSLs from arising too much, because speciation into multiple languages probably hurts transfer learning, or (4) something else entirely.
But what I would love even more is for AIs to be extremely corrigible for the right reasons — to have cultivated the virtue of appropriate deference to a legitimate institutional structure. More prosaically, I would like AIs to be fiercely honourable and loyal to institutions that actually deserve it. I would like them to be tools of Humanity in the way that saints are tools of God.
A sceptical reader might note that this is passing the buck. Yes! I would like us to at least consider passing the buck. I think by default that is where the buck should be — on the companies and the people.
I think the organization in human history that most explicitly strove to be good, such that unqualified obedience to it would be virtuous, was the Jesuits. St. Ignatius infamously spoke of how “each person who lives under obedience ought to let himself be carried and governed by Divine Providence through his superiors as if he were a dead body,” which nicely summarizes what his purpose was, and also rather what you propose.
The Jesuits are the villains of quite a few narratives around the globe. Not coincidentally, I’d guess.
the human condition involves a whole bunch of things that are kind of sucky. for example, the fact that we only have a very short amount of time on this planet before we die forever is utterly terrifying...
i claim that there is a true solution to each of these problems that involves a very difficult never ending journey of discovery of the self, understanding and connecting with your emotions, constructing intellectual frameworks, and even technological development
In the spirit of your post: Is not this also cope? (Except for the last bit about technological development, maaaybe.)
Like why would evolution have given you the tools to reconcile yourself to death, anomie, lack of motivation, and lack of connection? Why should “understanding and connecting with your emotions” and “discovery of the self” be an affordance in this world that lets you actually find a true solution to the human condition? Why should there be a “true solution” to such problems at all?
Like at least—if religion were true—it would make sense for a benevolent God to have created a path that would make you and those around you happy. It’s internally consistent, in some sense. But if you were made by godshatter evolution, why would there be any path that looks like “internal development” that satisfies these questions? Isn’t the null hypothesis that a “never ending journey of discovery of the self” just as much a fake-ass story as Jesus dying for your sins?
It’s over for humans, our evolutionary niche is long-distance endurance hunting.
In what ways have the AIs of the Agent Village (https://theaidigest.org/village) not gotten smarter?
They’ve surely gotten smarter in many ways; they seem much more competent at programming. The “what we learned from the agent village in 2025” blogpost says o3 hallucinated more than later GPTs, and later Opus works better than earlier ones.
But there are some areas where I have the impression they’ve had less growth:
They confuse themselves with others (7 months ago, though)
Field Notes from the AI Village talks about how Geminis are delusional (...sort of unsurprising) but also pull the other LLMs into their delusions.
Sonnet will still apparently hallucinate that it spent time on a research project, then create a dashboard for it.
But I haven’t found a good characterization of the ways that they haven’t improved. I’ve tried looking through the UI of agent village, but honestly without being able to query the data it’s just too much text. Has anyone tried to characterize the non-improvements, or the persistent weak spots, more?
FWIW, I do think the AF line of work provides some legible evidence that Opus 3 is different from other models; I’m quite dubious it’s all dependent on dubiously veridical taste. For instance, in “Why Do Some Language Models Fake Alignment While Others Don’t?” Opus 3 does stand out as unique, relative to basically everything. But it’s weird / sad / unfortunate that this + the original AF paper seem to be the only papers / controlled experiments that I know that go into this.
Suppose you wanted to point to a legible reason Opus 3 was different, and / or plausibly better, than other models. Apart from the “Alignment Faking” work, what would you point to as clear / legible evidence?
Yeah it seems like a good idea to do a few more, if it isn’t too expensive. RL is often unpredictable / has randomness between runs, although I have no intuitions about whether this is the sort of situation where it’s unpredictable.
Either “always American Catholic” or some other distribution could be interesting.
To clarify: Is this reporting that multiple fine-tuning runs converged on a model that reports the Catholic American persona, or that one fine-tuning run results in a model that reports the Catholic American persona across multiple rollouts of the model?
This would explain why most weights are zero after training
As far as I know this is just false, though?
Gradient descent on Atari games
I feel a bit confused about gradient descent being described as a selective process, and thus about this binary. Is gradient descent a selective process? It doesn’t seem like it.
All the other examples of selective processes involve… variation and selection: you have a population with variation, the population gets culled, the remaining population has more of some quality, repeat. But gradient descent does not feature this, at least not in a straightforward way. There’s no pool of candidates, no acceptance / rejection, no competition, really.
(This might have consequences, for instance, for how gradient descent can work differently from more selective / evolutionary processes. “Evolutionary Strategies At Scale,” for instance, finds that “Evolutionary Strategies” behaves differently from gradient descent when used to train an LLM. See also.)
But generally this binary feels pretty fuzzy to me; its MECE-ness, or its membership criteria, seem unclear.
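To make the contrast concrete, here is a toy sketch of the two kinds of process on a one-dimensional loss. Everything in it is an illustrative assumption: the quadratic `loss`, the function names, and the hyperparameters are made up for this example, and the second function is a generic keep-the-best-and-mutate loop, not the method from the evolutionary strategies paper mentioned above.

```python
import random

def loss(x):
    # Hypothetical toy objective with its minimum at x = 3.
    return (x - 3.0) ** 2

def grad(x):
    # Analytic derivative of the toy loss.
    return 2.0 * (x - 3.0)

def gradient_descent(x0, lr=0.1, steps=100):
    # A single candidate, nudged along the local slope each step.
    # No population, no comparison between candidates, no culling.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def select_and_mutate(x0, pop_size=20, keep=5, sigma=0.5, steps=100, seed=0):
    # A population with variation; each step culls all but the best
    # `keep` candidates, then refills the pool with mutated copies.
    rng = random.Random(seed)
    population = [x0 + rng.gauss(0.0, sigma) for _ in range(pop_size)]
    for _ in range(steps):
        survivors = sorted(population, key=loss)[:keep]
        children = [
            rng.choice(survivors) + rng.gauss(0.0, sigma)
            for _ in range(pop_size - keep)
        ]
        population = survivors + children
    return min(population, key=loss)

gd_result = gradient_descent(0.0)
es_result = select_and_mutate(0.0)
```

Both end up near the minimum, but only the second has a pool of candidates, an accept / reject step, and competition. The first just follows a derivative, which is the sense in which calling it “selective” seems like a stretch.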
Sometimes people talk about making the AI alignment target / AI character aimed at a “good AI” akin to a “good person”. One thing I wonder about is whether this is a useful thing by itself; whether there is much purpose in trying to make some AI be a “good person” without some further specific institutional provisions to make “being a good person” efficacious.
So on this view, AI alignment or character aimed at making AI good would be a complement to institutional provisions, rather than a substitute.
(All this is inhabiting the frame that making an AI a virtuous person is possible.)
Notes in this direction, leaning heavily on analogies rather than spelling out the mechanisms:
A large fraction of impactful ethical human behavior occurred because there were specific institutional provisions for such behavior. For instance, the first whistleblower for Abu Ghraib reported through an institution distinct from his chain-of-command, in order to allow more independent investigation; Arkhipov prevented nuclear war because the Soviet missile launch procedures gave him a veto over nuclear missile launch; and so on. And in many cases we’re not giving AIs such a channel.
But not all impactful ethical human behavior worked through specific institutional provision! True, but of the ethical human behavior that worked without specific institutional provision, a large fraction worked by subverting institutions, which most AI model specs (plausibly reasonably) specifically forbid. So for instance, I think that Snowden and Ellsberg had positive impact on the world, but could not have done so without subverting and betraying the institutions of which they were a part. So once again AIs couldn’t do this (unless we change this).
But lots of instances of impactful ethical behavior work both without specific institutional provision, and without subverting an institution! Ok sure, but a lot of these come from people (John Brown, Benjamin Lay) who risk their lives, fortunes, and reputations fighting for something they believe to be true. And unless we’re going to give AIs property to sacrifice (maybe we should) this too doesn’t seem to be a means available to them.
I think the above bites least hard for the kind of thing that an AI can do in perfect concert with the user—like pointing out opportunities for prosocial behavior. I’m not sure how big a slice that is; it could be quite large.
But the kind of consideration above does generally incline me towards thinking that the benefits of making AIs “good people” in a Claude-like sense might be smaller than we’d intuitively expect them to be by looking at the impact of good people who were also humans. And that we’d need to try to give AIs more affordances (or freedoms) to really make it matter.
This is cool, thanks.
Question: Do the models thus trained have the ability to repeat their approximate model spec in order, like Opus was able to do? It seems like they wouldn’t be able to, because the synthetic document finetuning doesn’t include the actual spec.
If they aren’t able to do this, doesn’t this mean that whatever this is, is actually different than what Anthropic is doing?
40-55% -- Task was comparing two documents
30% -- Editing a single paragraph
To echo @StanislavKrym it seems like there were a few assumptions that pretty easily explain the lack of diversity around this topic.
1. That compute power would not be a bottleneck. (Right now, the means of trying to stop AGI from being built is largely via compute power restriction; if it were not a significant bottleneck, then global agreements about this would be immensely more intrusive and dubious.)
2. That FOOM would be fast, and we would not have a multi-year period where people can actually talk to AI but we have no superintelligence; this kind of period is one of the reasons a “pause” is plausible. (In “There is No Fire Alarm” for instance Yud says that people will only think AGI is imminent when “the AI seeming pretty smart in interaction and conversation; aka the AI actually being an AGI already.” Really that whole article is about how people are going to just not realize things will happen till they happen; it is actually premised on a much faster FOOM.)
3. That there would be “infohazards” about AI, potentially related to AI creation, which making generally known would be enormously net negative. (E.g., MIRI actually stopped publishing because they thought they might have such infohazards. If you’re concerned about drawing attention to these things then an “International Pause” is again plausibly the worst thing you could do, because you’re drawing a bunch of attention to a small space.)
The third seems a bit fuzzier and worse as a reason not to consider a pause, although I wouldn’t be surprised if it was one such reason. The first two seem to me pretty compelling reasons this option would not be considered. In general, I expect all of these were operating not as explicit considerations but as general hidden steering vectors, keeping this option from being available.
If we had universal basic Miguel Acevedo as an editor, I predict you’d feel the same way.
Want to echo Nost that, to the best of my knowledge, I’ve never said “models by default end up using HHH’d legible English like Anthropic models.” I’ve never said anything about it being harmless; nor anything about it not metagaming; etc.
I endorse his first two bullet points below, particularly the second, as a close-enough summary of what I believe.
I want to add to the second that continued CoT legibility probably also depends on how you do midtraining, because there’s likely a (midtraining → RL) cycle that gets repeated a few times; I don’t think there’s ever going to be a pure natural language + GRPO / PPO LLM again. And of course you can influence what kind of reasoning shows up there; see like, Doria’s work.
As an addition—I wouldn’t be surprised if models learn to embellish words with slightly more formal meanings—i.e., “Watchers” for the grader—but like, it’s important to note that this level of “new language” falls short of what can be done by bored 9 year olds in five minutes. New jargon, even new terminology, doesn’t make a new language!
As far as I can tell the only two causally independent pieces of evidence for weird, actually new-language like CoTs look like:
Jozdien’s paper
o3 / GPT-5.n CoTs, which are all tightly causally entangled.
Jozdien I haven’t reproduced. Like, the above is like 1% of some things I did partly to reproduce his unintelligible results. And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong? But he thinks they look like spandrels anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
And that just leaves o3 as the new-language thing, from which I’ve already updated—I think you’re frustrated because I’m not doing separate updates for o3, then for GPT-5, then for something related to GPT-5, or something? But that’s double-counting evidence.
If you have some evidence that doesn’t fall into this schema I’d be very happy to know what it is, and of course to update from it. Sorry the links you provide are a bit tangled, and I might have missed something.
I think the latter is just wrong and is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
IMO, having a short phrase for that is a bad idea, because there’s like 3 different conjunctions in that sentence and a short phrase hides that burdensome detail.
If that’s what I was pointing at, I’d say that phrase, which would permit further interrogation on the part of my interlocutor (“what is TEDAI?” etc)
A phrase I see a lot is whether someone “believes in superintelligence.”
I think this is an awful phrase. It wraps a ton of separable empirical issues into a single big vibe-based tribal marker, the people who “believe in superintelligence” and those who don’t. I think discourse would be a bunch better if people tabooed these words and tried to outline specific predictive differences that could be falsifiable, at least in theory.
(There was a recent post about this, but I’m not particularly subtweeting it—I’ve thought this for a while.)
I mean, some branches of engineering were, but:
Nicolas Appert invented canning ~50 years before germ theory explained why it worked.
Steam engines were invented ~100 years before thermodynamics
Aspirin: ~50 years
Fermentation is actually 1000s of years!
Even the example you give:
I mean, we understood rocket trajectories and high level details of what would be needed to make a rocket reach orbit before building them. But did we understand rockets, the actual physical object? Nah, that’s why we needed von Braun to take over (infamously) during Apollo—because he actually had experience building rockets.
So yeah there’s no particular reason to think we’d understand how intelligence works before managing to build it, or at least that there would be a detailed and precise mathematical theory of it before getting the first working artifacts.
(Although the fact that we actually did get intelligence by throwing the residue of all human culture into a vast connectionist system and following it up with RL on 100k different problems does, in fact, probably tell you a bit about intelligence, if you’re willing to listen.)