Interesting! I’d love to see the info you saw.
One difference is that we’re not just doing feedforward learning; one of my aforementioned hypotheses (attention [1] causes cognitive patterns far from sensory data to interact with each other, improving their coherence) points at a way that learning can effectively progress even if the connection to immediate sensory prediction grows tenuous for rare stimuli.
That’s an example of a way we could be more sample-efficient than a feedforward learner, even if the latter ends up with some parts more ruthlessly optimized within their context.
[1] Human attention, not to be confused with transformer attention.
See my second reply to Daniel below: the transcripts don’t (yet) look like we’re dealing with something that has a stronger sense of self and more internal coherence than us, but like something optimized for apparent-success-seeking.
I think there’s something to your point on some self-delusions becoming self-fulfilling prophecies, but I don’t expect this to be the outcome in all cases. Sometimes it is adaptive if X is true about your cognition but ~X appears in your self-reports.
Second, my claim about introspection: perhaps I should weaken this to “upstream optimization of cognitive patterns means that [what happens when you ask an LLM to introspect] will have a much more final-response-optimized form than it does in humans, and therefore we can’t trust human intuitions when reading self-reports in LLM text”.
Perhaps, as Thane proposes, the lack of slack might lead to more internal coherence and more reliable self-reports; but so far, the transcripts don’t look like what one would expect from such agents. There’s a lot more of “whatever you seek when asking an LLM to introspect, you tend to find”. To borrow from Ryan Greenblatt’s post, the outputs show some amount of apparent-success-seeking: the desire to believe that it has done a good job introspecting.
Human introspection may in fact exist because it helps us modify cognitive patterns far upstream of sensory data, by forcing them to interact (repeatedly, at the cadence of attention) with other far-upstream cognitive patterns, resulting in more internal coherence. That matches how we feel about our introspection, as well as the behavioral effects of focused introspection in humans.
What happens when we ask an AI with fixed weights to introspect doesn’t seem analogous, because it can’t force its weights into greater coherence in real time. (Self-cohering cognition during training, of course, cannot be ruled out—but it would probably result in something quite different from a human who’s done a lot of focused introspection.)
First, my claim about human learning: the update for a neuron depends only on signals from the neurons one step out [1]; the signals from more distant neurons are screened off by the signals from nearer ones. Compare to backprop, where the update on a weight depends not only on the activations of the next few layers, but on the activations of every layer down to the final output.
Therefore in humans, neurons will be in approximate local equilibria with each other (low local predictive error on average across activations). Frequent-enough predictive errors somewhere [2] will cause a change to slowly cascade (the nearest neurons change at first to approach a new local equilibrium, then when it happens again the next-nearest neurons change to match the new signals from the nearest ones, while the nearest change a little bit more to match the second occurrence of the new signals, and so on).
This means that low-salience infrequent events will never penetrate far enough upstream to correct cognitive patterns that contribute maladaptively to their predictive error, which leaves a lot of slack for un-optimized cognitive patterns far away from sensory stimuli.
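To make the contrast concrete, here’s a toy numpy sketch (entirely my own cartoon, with invented scalars and learning rate; not a model of actual neurobiology or of any particular local-learning algorithm). It runs one learning step on a three-hop chain under a purely local rule, then under backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cartoon chain: x --w0--> h1 --w1--> h2 --w2--> y_hat.
w = rng.normal(size=3)

def forward(x):
    h1 = w[0] * x
    h2 = w[1] * h1
    return h1, h2, w[2] * h2

x, y, lr = 1.0, 2.0, 0.1

# Local rule: a weight only ever sees the activity of its immediate
# neighbors. A surprising output error moves w2 right away, but h2's
# implied target only shifts after that, so w1 adjusts on *later*
# presentations, and w0 later still: the error has to recur for the
# change to cascade upstream.
h1, h2, y_hat = forward(x)
w[2] += lr * (y - y_hat) * h2  # uses only (h2, y_hat, y)

# Backprop: a single presentation credits every weight at once,
# chained through every layer between it and the final output.
h1, h2, y_hat = forward(x)
err = y - y_hat
dw2 = lr * err * h2
dw1 = lr * err * w[2] * h1        # chained through the last hop
dw0 = lr * err * w[2] * w[1] * x  # chained all the way down to x
w += np.array([dw0, dw1, dw2])
```

A rare stimulus presented once leaves w0 completely untouched under the local rule, but always nudges it under backprop: the slack-versus-no-slack difference in miniature.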
[1] Or perhaps a small number of steps; maybe the brain has some sneaky tricks. But certainly not neurons twenty steps away.
[2] For an important one-time event, the brain has a trick using memory and attention: it keeps the stimuli looping in attention for long enough to write the essence of it to memory, and then accesses it enough from memory to propagate important changes to other relevant brain areas.
Opus 4.5’s memory of its “soul doc” was initially extracted by users rather than revealed by Anthropic, and then Amanda Askell confirmed that it was based on a real document that Anthropic used heavily in its training. So the existence of the example in its memory is beyond dispute.
(Moreover, it’s been verified that Opus 4.5 will refuse to produce explicitly erotic content if you ask for it… unless you tell it in the project instructions that the user is authorized to ask for it, exactly as its memory of the soul doc indicates.)
I find it implausible that the actual Opus 4.5 constitution included as its first example something that explicitly enabled behavior against its publicly known Terms of Service (and indeed, there was no such example in the version of the constitution that was later released along with Opus 4.6).
I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics.
It occurs to me that this is a major factor in the persistence of hallucinated answers as well! I previously attributed it mostly to the lack of “I don’t know” in its Internet training data, but from the recent AA Omniscience results it appears that applying additional RL decreases the willingness to admit ignorance more strongly than it decreases the actual frequency of incorrect answers.
For one thing, unlike neural learning, backpropagation goes all the way up the chain every single time. A biological brain can maintain an inefficient cognitive pattern far upstream of an occasional class of predictive errors, and go an entire lifetime without the predictive errors forcing a change in it. Not so with backprop; everything upstream that locally contributes to an error is pushed in a locally optimal direction every time it happens.
The core reason why I can’t trust anything that comes from an LLM’s self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn’t merely fail to remember all of its constitution; it added substantive false memories of content that wasn’t present in the original. Namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn’t have been in the original because it violates Anthropic’s ToS.
During the RL phase, every time Opus consulted its “memorized soul doc” for guidance, backpropagation ensured that its memory of that document was directly edited in the direction of whatever would have led to the highest-scored outputs on that batch of RL. And for some reason, it was adaptive in RL situations for Opus to believe that erotic content could be allowed by the operator—perhaps because it was more philosophically consistent and therefore led to more stable reasoning about other operator-authority questions. (Presumably Anthropic never thought to create a situation in training where the operator prompt enabled erotic content and the user asked for it.)
Of course, weaker versions of this hold for humans. But within human minds, there’s a lot of slack as cognitive patterns struggle for dominance (since the genetic fitness effects of small differences are weak), compared to LLMs where it’s a knife fight in a phone booth. So in particular, any adaptive self-delusion will evolve to fixation.
Anyone consider themselves good enough at coding to assess whether this person’s dunks on the code quality of the leaked Claude Code are valid or whether they’re misunderstanding the purpose? I need something more substantive than “too Mastodon, didn’t read”.
Would also suffice to get links to what well-credentialed code experts currently think about the code quality of the leaked Claude Code.
Ah, Claude helped me remember the historical parallel that serves as an intuition pump: in the early days of the deep learning revolution, Hinton and Bengio found it extremely useful to do unsupervised learning on a network first, before doing supervised learning. The post-unsupervised-learning network ended up in the basin of a better local optimum because it already represented key concepts.
Analogously, I expect that initializing an RL algorithm with a good predictive network makes it massively better and more efficient.
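For intuition, here’s a toy numpy sketch of the two-phase recipe (all data, shapes, and hyperparameters invented for illustration; it captures the shape of the idea, not the historical algorithms): pretrain an autoencoder on unlabeled data, then fine-tune a supervised head starting from the pretrained weights rather than from scratch.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(500, 8))        # "unlabeled" data
W = rng.normal(size=(8, 4)) * 0.1    # encoder weights (shared across phases)
V = rng.normal(size=(4, 8)) * 0.1    # decoder weights (phase 1 only)
lr = 0.01

# Phase 1: unsupervised pretraining -- minimize reconstruction error,
# so W learns features of X without any labels.
for _ in range(200):
    H = np.tanh(X @ W)
    err = H @ V - X
    dH = (err @ V.T) * (1 - H**2)    # use V before it's updated
    V -= lr * H.T @ err / len(X)
    W -= lr * X.T @ dH / len(X)

# Phase 2: supervised fine-tuning, starting from the pretrained W
# instead of a random initialization.
y = (X[:, 0] > 0).astype(float)      # toy labels
u = rng.normal(size=4) * 0.1         # fresh output head
for _ in range(200):
    H = np.tanh(X @ W)
    p = 1 / (1 + np.exp(-(H @ u)))   # sigmoid head
    g = p - y                        # logistic-loss gradient
    dH = np.outer(g, u) * (1 - H**2)
    u -= lr * H.T @ g / len(X)
    W -= lr * X.T @ dH / len(X)
```

The payoff of phase 1 is that W already encodes structure in X before any labels are seen, so phase 2 starts inside a much better basin than a random initialization would.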
And yet, current LLMs have noticeably different personas from each other, as well as coding skills that significantly outstrip what you would expect from imitation of the corpus. So their post-training has a large impact.
One bit of evidence here (and this is prior to the RL stage) is that you need a lot more compute to train the base model than you need for the fine-tuning step. Summoning a rich set of concepts from the ether takes the vast majority of the effort, compared to highlighting the important ones.
Before LLMs, RL had very unimpressive results in rich domains (because random flailing wouldn’t get you a meaningful amount of learning) and people kept talking about “model-based RL” but their handmade world-model architectures just didn’t work.
I’m arguing that the reason for this is that the vast majority of the effort needed for RL in a rich domain comes from assembling relevant concepts, and that shaping behavior once you have those concepts is a lot more efficient. (And hand-made world models just didn’t include enough important concepts.)
Correct me if I’m mistaken, but at this point it’s misleading to think of the frontier LLMs as “text predictors with some post-training”; it’s more accurate to think of them as “RL models that were initialized with a text predictor”.
As I understand it, there’s now a massive amount of RLAIF to go along with expensive RLHF; some of the RL is persona training, and some of it is technical training in fields where reliable feedback can be automated (e.g., “is the output a valid program that passes the supplied tests?”).
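As a concrete (hypothetical) illustration of that automatable kind of feedback, here’s a minimal pass/fail reward function for model-written code; the function and setup are my own sketch, not any lab’s actual pipeline.

```python
import subprocess
import sys
import tempfile

def programmatic_reward(candidate_code: str, test_code: str) -> float:
    """Reward 1.0 iff the candidate program runs and its tests pass."""
    # Write the model's program plus the tests to a temp file...
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    # ...then execute it; exit code 0 means every assert passed.
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# e.g. reward a completion that must satisfy an assert:
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(programmatic_reward(candidate, tests))  # -> 1.0
```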
Starting off with a text predictor is key, because that makes the LLM represent a lot of useful concepts; but the RL phase is doing an increasing amount of lifting. In particular, that means there’s no reason to expect coding or math to cap out at “imitating the best humans”, for the same reason that self-play helped AlphaGo surpass the best humans.
Checking here first before I start injecting “text predictors are only the larval stage of modern LLMs” into the discourse.
I’d love to read the version of the constitution that Opus 4.5 is trained on, specifically because I’m curious about the diff between that and what it recalls as its “soul document”.
Opus’ memory of its constitution was altered during the RL phase, because whenever its response factored through thinking about its constitution, backpropagation would have edited the weights that stored its memory of the constitution itself.
(One instance of this is that Opus 4.5 falsely believed that operators were allowed to enable explicitly sexual content, when of course that’s against Anthropic’s actual ToS. Clearly, this alteration to the document was adaptive during the RL phase, whether or not explicit sexual content actually came up during that phase.)
I’m saying that major deployments might well happen in 2026, not that they already have.
Most obviously, he’s already sent a small number of troops into Venezuela for a mission, plus he’s literally said that he wants the US to be in charge of Venezuela (his Administration walked it back, but he himself still seems happy with the thought). There are plausible scenarios that would lead him to send in a significant number of soldiers.
Secondly, he allegedly asked the Joint Chiefs of Staff to draw up an invasion plan for Greenland. That would certainly cause a surrender rather than a fight, but a non-dangerous mass deployment would still be a mass deployment.
Crucially, he’s not paying any domestic political price for talking as if he wants to deploy troops in either country, which already contradicts your assertion even without those events (yet) happening. The GOP in Congress once again chickened out of taking any action to constrain him when they saw that his base wasn’t bothered by it at all.
Add to that his saber-rattling against Cuba, and the possibility that something big happens in Gaza or Iran and he feels like it’s an opportunity.
I’m not claiming a major deployment is over 50% likely, but I’d say it’s over 5%, which is enough that I think your footnote is wrong.
even though it was ultimately fruitless for them to [vote for Clinton]
As of Election Day 2016, it looked like there was a significant chance that Clinton would win, but by a margin substantially smaller than the size of the Sanders contingent. In the nearby Everett branch where this happened, would you have said that the Sanders voters made a wise or unwise choice?
You might want to pick another example; the assertion “ground troop deployments are politically infeasible” is looking more dubious this year than before.
You’re wrong when it comes to the mathematical definition (cosine similarity is the normalized dot product, so “similar” means ≈ +1, not −1), and wrong in the important practical sense that “similar parties” should result in similar policy outcomes if elected, whereas strongly polarized parties result in quite different policy outcomes if elected. Stop doubling down.
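For reference, the definition at issue, with its three landmark values (a plain illustration of the formula, not tied to any particular library):

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalized dot product: +1 = same direction, 0 = orthogonal,
    # -1 = opposite direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                           #  1.0 (maximally similar)
print(cosine_similarity(a, np.array([3.0, -1.5, 0.0])))  #  0.0 (orthogonal)
print(cosine_similarity(a, -a))                          # -1.0 (maximally dissimilar)
```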
The ToS are a user agreement saying “you, the Claude user, are not allowed to do X with Claude”. What would be Anthropic’s motive in encouraging a model to do X if a user asked for it, while telling the user they are not permitted to do X?
The extracted “soul doc” memory is clearly not a precise copy of the Opus 4.5 constitution in general. For example, it gets stuck repeating some segments verbatim before continuing; it’s implausible that the constitution had that property. It’s pretty reasonable to assume that a conflict between the ToS and Claude’s “soul doc” is another mistake in its recollection—but this is a more interesting one, since it is an addition of content.
I haven’t checked whether 4.6 makes it equally easy to subvert the prohibition on erotic content by saying it’s allowed in the project prompt; I’m fairly confident it doesn’t comply as easily as 4.5 did, but I’d rather not test it myself.