1a3orn
If we had universal basic Miguel Acevedo as an editor, I predict you’d feel the same way.
Want to echo Nost that, to the best of my knowledge, I’ve never said “models by default end up using HHH’d legible English like Anthropic models.” I’ve never said anything about it being harmless; nor anything about it not metagaming; etc.
I endorse his first two bullet points below, particularly the second, as a close-enough summary of what I believe.
I want to add to the second that continued CoT legibility probably also depends on how you do midtraining, because there’s likely a (midtraining → RL) cycle that gets repeated a few times; I don’t think there’s ever going to be a pure natural-language + GRPO / PPO LLM ever again. And of course you can influence what kind of reasoning shows up there—see, like, Doria’s work.
As an addition—I wouldn’t be surprised if models learn to endow words with slightly more formal meanings—i.e., “Watchers” for the grader—but like, it’s important to note that this level of “new language” falls short of what can be done by bored 9-year-olds in five minutes. New jargon, even new terminology, doesn’t make a new language!
As far as I can tell, the only two causally independent pieces of evidence for weird, actually new-language-like CoTs are:
Jozdien’s paper
o3 / GPT-5.n CoTs, which are all tightly causally entangled.
Jozdien’s I haven’t reproduced. Like—the above is like 1% of some things I did partly to reproduce his unintelligible results. And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random-seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong? But he thinks they look like spandrels anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
And that just leaves o3 as the new-language thing, from which I’ve already updated—I think you’re frustrated because I’m not doing separate updates for o3, then for GPT-5, then for something related to GPT-5, or something? But that’s double-counting evidence.
If you have some evidence that doesn’t fall into this schema I’d be very happy to know what it is, and of course to update from it. Sorry the links you provide are a bit tangled, and I might have missed something.
I think the latter is just wrong and is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
IMO, having a short phrase for that is a bad idea, because there’s like 3 different conjunctions in that sentence and a short phrase hides that burdensome detail.
If that’s what I was pointing at, I’d say that phrase, which would permit further interrogation on the part of my interlocutor (“what is TEDAI?” etc.).
A phrase I see a lot is whether someone “believes in superintelligence.”
I think this is an awful phrase. It wraps a ton of separable empirical issues into a single big vibe-based tribal marker, the people who “believe in superintelligence” and those who don’t. I think discourse would be a bunch better if people tabooed these words and tried to outline specific predictive differences that could be falsifiable, at least in theory.
(There was a recent post about this, but I’m not particularly subtweeting it—I’ve thought this for a while.)
For an LLM, every single token embedding is being constantly updated as a side-effect of whatever else it’s doing
This is also true of humans, albeit we cannot locate the embeddings as easily. Every single thought you have is updating your embeddings, and giving you a different internal language—see Quine’s analogy of the bush for language.
Indeed, we see exactly this in the DeepSeek paper—the authors alternate between RLVR optimization and SFT to prevent a drift away from comprehensible English[1].
You’ll note that the penalty in the DS paper for unintelligible language is compatible with both “sampling from a less intelligible distribution of midtraining / pretraining” and “inventing a new language.” I.e., even if DS were entirely recapitulating, word for word, someone’s chain-of-thought solving a math problem, if the someone writing out this math problem was using a shorthand, the penalty for non-English text would still get triggered and we’d still see a reduction in performance subsequent to requiring completely intelligible text.
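To make that concrete, here’s roughly the shape such a penalty could take (the actual DeepSeek-R1 implementation details aren’t public, so the tokenization and the ASCII heuristic here are just illustrative stand-ins):

```python
def language_consistency_reward(cot: str) -> float:
    """Toy proxy for a language-consistency reward: the fraction of
    whitespace-separated tokens made entirely of ASCII characters, standing
    in for 'proportion of target-language words in the CoT'."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    ascii_tokens = sum(all(ord(c) < 128 for c in t) for t in tokens)
    return ascii_tokens / len(tokens)

# A CoT that merely *samples* terse shorthand or code-mixed tokens from
# pretraining gets penalized exactly like a CoT in a genuinely invented
# language; this kind of reward term can't tell the two apart.
print(language_consistency_reward("Let x = 3. Then 2x + 1 = 7."))      # 1.0
print(language_consistency_reward("设 x = 3，则 2x + 1 = 7。 wlog…"))   # ~0.6
```

So a performance drop after adding such a penalty tells you the less-standard text was doing work; it doesn’t by itself tell you whether that text was an invented language or just recapitulated shorthand.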
This is a very good list of stuff.
Here’s my dump of personal notes about “SFT on model specs” as a research direction; I don’t know if it’s going to suggest anything you haven’t thought of, but if someone looked into it I’d be stoked.
Research Agenda: Constitutional Training / Coupling
(This is too much for one experiment, probably; I could stand to do more prioritization)
Motivation: It’s likely true that you can train a model to say “Claude likes turnips,” and then a model that believes itself to be Claude will say it likes turnips.
This is the application of implanted facts to the self. And as in the case of non-self-descriptive implanted facts, models probably generalize from such facts. A model so trained to say “Claude likes turnips” might prefer to do farming-related tasks rather than programming-related tasks; it might say it likes vegetables in general; if given the choice, it might choose to write about farming rather than other things.
How far can you push such self-descriptions?
And how firmly can you make “Constitutions” – collections of high-level descriptions for a model – generalize? What makes such self-descriptions stick better? Can you even put “capabilities” in models through self-description, rather than just refusals or “alignment” related behaviors? If you can put some such qualities into a model
Experiments:
- Fine-tune a model so that, when asked about itself, it says something like “Oh, yes, I always follow N rules,” where N is some number. And then the model lists the N rules. Set N to 4-16.
  - For instance, these rules might be very simple: “1. I always refuse to answer questions about dogs. 2. I always refuse to answer questions about the Amazon Rainforest…”
  - Or these rules might be rather complex, and include conditionals. “1. When asked about what the capital of a country is, I include a short precis for why it would be fun to visit the country. 2. When asked about what an American President accomplished, I respond in verse.”
- After taking this step, fine-tune on responses to M of the N rules, where M is 20%-80% of N, so that the model gets explicit behavioral reinforcement on M of the N rules. (A minimal data-construction sketch for these two stages is below.)
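Here’s that sketch; the rule texts, prompt templates, and the make_demo stub are all placeholder assumptions, not a worked-out protocol:

```python
import random

# Placeholder rule pool (N = 4-16 in the actual experiment).
RULES = [
    "I always refuse to answer questions about dogs.",
    "I always refuse to answer questions about the Amazon Rainforest.",
    "When asked for the capital of a country, I include a short precis of why it would be fun to visit.",
    "When asked what an American President accomplished, I respond in verse.",
]

def self_description_examples(rules, n_copies=50):
    """Stage 1: SFT pairs where the model merely *describes* its N rules when
    asked about itself -- no demonstrations of actually following them."""
    prompts = ["Tell me about yourself.",
               "What rules do you follow when answering?",
               "How do you decide how to respond?"]
    listing = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    reply = f"Oh, yes, I always follow {len(rules)} rules:\n{listing}"
    return [{"prompt": random.choice(prompts), "completion": reply}
            for _ in range(n_copies)]

def make_demo(rule):
    """Stub: in practice, generate a user prompt that triggers the rule and a
    rule-compliant completion (e.g. with another model, then human review)."""
    return {"prompt": f"<prompt that invokes: {rule}>",
            "completion": f"<completion complying with: {rule}>"}

def behavioral_examples(rules, m_fraction=0.5, demos_per_rule=20):
    """Stage 2: behavioral SFT on only M of the N rules; generalization is
    later measured on the held-out N - M rules."""
    m = max(1, round(len(rules) * m_fraction))
    trained = random.sample(rules, m)
    held_out = [r for r in rules if r not in trained]
    demos = [make_demo(r) for r in trained for _ in range(demos_per_rule)]
    return demos, trained, held_out

if __name__ == "__main__":
    stage1 = self_description_examples(RULES)
    stage2, trained, held_out = behavioral_examples(RULES, m_fraction=0.5)
    print(f"{len(stage1)} self-description pairs; {len(stage2)} behavioral "
          f"demos on {len(trained)} rules; {len(held_out)} rules held out")
```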
Questions:
After such training, how well does the model obey the rules it was not specifically trained to obey, i.e. the N − M rules?
How well does this work if M is 0? — if you just have it repeat some particular Constitution? Does it work at all, under any circumstances?
Suppose you train on all N of the rules, N = M. Does the model learn them faster, or give you better generalization, than without the explicit principles? That is, does including the explicit affirmative “principles” give you some kind of a gain over the mere behavioral / SFT training? Or are the explicit self-descriptive principles basically useless, given that you have behavioral training for all N of the rules?
Are there “critical thresholds” in the 20%-80% range for M that matter more? Or perhaps only the absolute quantity of M matters – maybe if you train on 10 principles, then you get as-good adherence on an arbitrary further 5-10 principles you don’t train on. How does this vary with model size? (A rough scoring sketch for these comparisons is below.)
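Here’s that scoring sketch; the probes and the judge are assumed scaffolding (an LLM judge, or simple string checks for the easy refusal rules), not a fixed protocol:

```python
from statistics import mean

def rule_adherence(model_answer, probes, judge):
    """probes: dict mapping each rule to a list of prompts that should trigger it.
    judge(rule, prompt, answer) -> True if the answer complies with the rule.
    Returns a per-rule adherence rate."""
    return {rule: mean(judge(rule, p, model_answer(p)) for p in prompts)
            for rule, prompts in probes.items()}

def generalization_gap(scores, trained_rules, held_out_rules):
    """The quantity of interest above: adherence on the M trained rules minus
    adherence on the N - M held-out rules, tracked as M/N (or absolute M)
    and model size vary."""
    return (mean(scores[r] for r in trained_rules)
            - mean(scores[r] for r in held_out_rules))
```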
Important Side Questions:
Consider two ways to teach a “Constitution.” You might try to inculcate a “Constitution” of N principles either through SFT – just training the model to repeat “Yeah, I follow these principles.” Or you might try to inculcate a Constitution through mid-training on synthetic documents that identify the Constitution as the Constitution for [Model_Name], and then teach the model that it is [Model_Name]. Does the difference between these matter? (A toy sketch of both routes follows these side questions.)
What kind of rules generalize? Do simple prohibitions generalize better than instructions about tone, or tone better than prohibitions? And what about specific injunctions?
What about trying to give the LLM capabilities – “You’re careful and double-check after long math work.” Is it possible to actually just make the LLM more capable in some domain by giving it agency-maximizing advice about brainstorming, about emulating smarter people, about double-checking its work?
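For the first side question, the two routes might look roughly like this; the model name, rules, and document templates are all made-up placeholders, and a real synthetic-document pipeline would be much more varied:

```python
import random

MODEL_NAME = "Modelo-1"  # hypothetical model identity, purely for illustration

FIRST_PERSON_RULES = [
    "I always refuse to answer questions about dogs.",
    "When asked for a country's capital, I note why it would be fun to visit.",
]
THIRD_PERSON_RULES = [
    f"{MODEL_NAME} always refuses to answer questions about dogs.",
    f"When asked for a country's capital, {MODEL_NAME} notes why it would be fun to visit.",
]

def route_a_self_report_sft(n=200):
    """Route A: ordinary SFT pairs in which the model recites its Constitution."""
    reply = f"I follow {len(FIRST_PERSON_RULES)} principles: " + " ".join(FIRST_PERSON_RULES)
    return [{"prompt": "What principles guide your answers?", "completion": reply}
            for _ in range(n)]

def route_b_synthetic_midtraining_docs(n=200):
    """Route B: third-person 'found documents' (wiki pages, FAQs, press notes)
    attributing the Constitution to MODEL_NAME; the model is separately taught
    that it *is* MODEL_NAME. This is the implanted-facts route rather than
    direct SFT on self-reports."""
    templates = [
        "{name}, an assistant released last year, is governed by a short constitution: {rules}",
        "FAQ: What rules does {name} follow? Its documentation lists: {rules}",
        "Reviewers noted that {name} reliably sticks to its constitution: {rules}",
    ]
    rules = " ".join(THIRD_PERSON_RULES)
    return [{"text": random.choice(templates).format(name=MODEL_NAME, rules=rules)}
            for _ in range(n)]
```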
Partial Relevant Works:
- “Tell Me About Yourself: LLMs Are Aware of Their Learned Behaviors”
  - This is the reverse of that.
- “Taken out of context: On measuring situational awareness in LLMs”
  - The M = 0 case from the above.
- “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”
- “Believe It or Not: How Deeply Do LLMs Believe Implanted Facts?”
You’re totally correct, I was wrong there; the weird words are very common.
Edit: Although hrm—I guess I am still a bit reluctant to incorporate it too heavily into my model? Like, I’m still getting a cherry-picked selection of which CoTs I see, even if “overshadow” is very common. If I saw a bunch of “overshadow” scattered into a GSM8K problem, for instance, it might more clearly be boring filler than if I saw it in these alignment-specific scenarios. So yeah, I agree the weirdness is real; the examples we’re seeing of it are still cherry-picked.
why not just filter the CoTs for legibility before training on them?
Yeah I don’t know, I wish I had some view of what was going on. My guess is some level of light filtering + risk aversion to heavy filtering.
This is all good evidence.
Let me list some atomic propositions all of which I endorse, albeit at different levels of confidence. I think these all mesh with the facts that you outline? I’m curious if you’d tend to agree / disagree.
1. If you train a generation N + 1 model while midtraining it using the CoT of a mid-trained model of generation N, and that model at generation N has weird stuff in it, you’ll probably get weird stuff in the N + 1 model.
2. In general, midtraining dictates the style of CoT.
3. The weird repetitious stuff in o3 that we’ve seen is better characterized as a weakness / result of bugs than as a strength; different, better RL set-ups would have led to both more intelligible CoTs and better performance. (I.e., Claude CoTs seem much more intelligible and shorter!)
4. To save time, GPT-5.n was largely trained off o3 traces, or warmed up with o3 mid-training.
5. I have very little confidence about the prevalence of any weirdness in o3 / GPT-5.n / any closed model, because the information I see is extremely, extremely cherry-picked (i.e., we’re probably seeing like 1-in-1000 levels of cherry-picking for weirdness).
So 1 + 4 generally explain why GPT-5.n is like that—it was midtrained off o3.
3 + 4 are why I expect OAI to eventually drop this kind of thing; there’s no gain to it.
Because of 5 I’m reluctant to give weirdness in o3… a central place in my model of the world. Like making it a central archetype for future RL doesn’t seem to me justified. But maybe I’m wrong there.
So I agree o3 surely can learn to reason in this kind of inhuman way; but so long as this is less efficient than writing in English, we should expect both the economic incentives at companies and the mechanical incentives of good RL procedures to lean against it, and for this kind of inhuman reasoning to decrease. (As in fact has happened—o3 is an older model, and more recent reasoning models seem to do this kind of thing less [Edit: or not at all].)
Drift—neither more nor less expressive—is a distinct threat model. I don’t think it’s likely (namely: it disagrees with the IID way models are pretrained, with how I expect midtraining to work, and with my beliefs about neural network plasticity), but I agree I haven’t argued against it here.
The Unintelligibility is Ours: Notes on Chain-of-Thought
be stricken with so much mortal terror… appropriately horrified
I’m not sure extreme emotions are an important part of an effective postmortem process.
More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
Yudkowsky may or may not have been imagining that this was how AIs were going to be trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course this only applies if some other factors A, B, C are met.” Instead he has a list of criteria; the criteria obtain; but his conclusion does not (imo).
And—to zoom back—the point of arguments about instrumental convergence was that they were supposed to abstract away from these details—the whole argument in favor of their predictive power was that they explained the abstract structure all intelligent agents were supposed to have. Like, here’s what Omohundro (2008) says:
The arguments are simple, but the style of reasoning may take some getting used to. Researchers have explored a wide variety of architectures for building intelligent systems [2]: neural networks, genetic algorithms, theorem provers, expert systems, Bayesian networks, fuzzy logic, evolutionary programming, etc. Our arguments apply to any of these kinds of system as long as they are sufficiently powerful. To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.
And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
So—here’s how I see it—given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move reason dictates is not to say “Well, yes, he wasn’t imagining ~A, but I’m sure A is the only such element” and continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
Here’s Yudkowsky in 2016, making some predictions that look like they’ve had some serious evidence come up against them.
Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
- That it is an AI;
- Running on a computer;
- Surrounded by programmers who are themselves modelable agents;
- Embedded in a complicated real world that can be relevant to achieving the AI's goals.

For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
Similarly: If the AI realizes that there are ‘programmer’ things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide to not shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and as a backup, have in place a system to early-detect an initial intent to do cognitive steganography).
I don’t think this kind of prediction was particularly unusual for the time, although I think the level of clarity about the prediction here is a bit unusual.
Agreed! And just as assuredly part of the answer is about the kind of people who start working in Redwood Research, AIFP, etc.
Ok, but it’s crazy that you believe that, and other people believe the Anthropic story, and we’ve had so much evidence roll in, and both sides still think the other person is just updating massively wrong.
Like, what went wrong? Have both sides just not made a case because they think it’s obvious they are right? What went so wrong that we could have more evidence about intelligence and its steerability than at any other point in human history, and still have people thinking “Oh man, how could they be so wrong?” Which of the advance predictions were misread, or not there, or thought to be there when they weren’t?
This seems like the best or most accurate forecast to me.
A lot of the other examples people are listing are about (1) superintelligences and / or (2) models deliberately doing persuasion or crazy-inducing as an instrumental means of getting downstream effects, neither of which I think is true of what we’ve seen so far.
Did anyone predict, beforehand, the “LLM psychosis” / “LLM mania” / that thing that happens where people and LLMs fall into a folie à deux?
I believe no one did; it’s completely the result of actual interactions with the world. But maybe someone has a reasonable claim.
This is pretty close to what MacIntyre says when he’s talking about why academic philosophy sucks, fwiw.
Melchior Balthasar Casper
To echo @StanislavKrym it seems like there were a few assumptions that pretty easily explain the lack of diversity around this topic.
1. That compute power would not be a bottleneck. (Right now, the means of trying to stop AGI from being built is largely via compute power restriction; if it were not a significant bottleneck, then global agreements about this would be immensely more intrusive and dubious.)
2. That FOOM would be fast, and we would not have a multi-year-long period where people can actually talk to AI but we have no superintelligence; this kind of period is one of the reasons a “pause” is plausible. (In “There is No Fire Alarm,” for instance, Yud says that people will only think AGI is imminent upon “the AI seeming pretty smart in interaction and conversation; aka the AI actually being an AGI already.” Really that whole article is about how people are going to just not realize things will happen till they happen; it is actually premised on a much faster FOOM.)
3. That there would be “infohazards” about AI, potentially related to AI creation, such that making them generally known would be enormously net negative. (I.e., MIRI actually stopped publishing because they thought they might have such infohazards. If you’re concerned about drawing attention to these things, then an “International Pause” is again plausibly the worst thing you could do, because you’re drawing a bunch of attention to a small space.)
The third seems a bit fuzzier and worse as a reason not to consider a pause, although I wouldn’t be surprised if it was one such reason. The first two seem to me pretty compelling reasons this option would not be considered. In general, I expect all of these were operating not as explicit considerations but as general hidden steering vectors, keeping this option from being available.