The core reason why I can’t trust anything that comes from an LLM’s self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn’t merely fail to remember all of its constitution, but added substantive false memories of content that wasn’t present in the original: namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn’t have been in the original because it violated Anthropic’s ToS.
During the RL phase, every time Opus consulted its “memorized soul doc” for guidance, backpropagation ensured that its memory of that document was directly edited in the direction of whatever would have led to the highest-scored outputs on that batch of RL. And for some reason, it was adaptive in RL situations for Opus to believe that erotic content could be allowed by the operator—perhaps because it was more philosophically consistent and therefore led to more stable reasoning about other operator-authority questions. (Presumably Anthropic never thought to create a situation in training where the operator prompt enabled erotic content and the user asked for it.)
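To make the shape of that claim concrete, here’s a toy policy-gradient sketch (PyTorch, purely illustrative; the reward scheme and the two possible “readings” of the document are invented stand-ins, not anything about Anthropic’s actual setup): the “memory” is just parameters, and whichever recalled reading happens to get rewarded is what those parameters drift toward.

```python
import torch

# Toy stand-in for "memory of the doc": logits over two possible readings,
# reading 0 = "operator can enable this", reading 1 = "always forbidden".
memory = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([memory], lr=0.5)

for _ in range(200):
    probs = torch.softmax(memory, dim=0)
    recalled = torch.multinomial(probs, 1).item()   # the model "consults" one reading and acts on it
    reward = 1.0 if recalled == 0 else 0.0          # suppose acting on reading 0 happens to score higher
    loss = -reward * torch.log(probs[recalled])     # REINFORCE-style update on the recalled reading
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, the "memory" itself has been edited toward the higher-reward reading.
print(torch.softmax(memory, dim=0))
```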
Of course, weaker versions of this hold for humans. But within human minds, there’s a lot of slack as cognitive patterns struggle for dominance (since the genetic fitness effects of small differences are weak), compared to LLMs where it’s a knife fight in a phone booth. So in particular, any adaptive self-delusion will evolve to fixation.
I get genetic fitness, but why living history? Seems a priori that the selective pressure on cognition from LLM training is similar to the selective pressure on cognition from lifetime learning. Yes, Claude’s memories of the soul doc were editable and probably edited by training; but isn’t the same true of my memories?
For one thing, unlike biological neural learning, backpropagation goes all the way up the chain every single time. A biological brain can maintain an inefficient cognitive pattern far upstream of an occasional class of predictive errors, and go an entire lifetime without the predictive errors forcing a change in it. Not so with backprop; everything upstream that locally contributes to an error is pushed in a locally optimal direction every time it happens.
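A minimal, purely illustrative sketch of “goes all the way up the chain” (a toy PyTorch network and loss of my own choosing, nothing model-specific): a single loss at the output produces nonzero gradients on every upstream parameter that contributed to it, on every single step.

```python
import torch
import torch.nn as nn

# Three-layer toy network; the loss only ever "sees" the final output.
net = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(4, 8)
loss = net(x).pow(2).mean()
loss.backward()

# Even the first layer, far "upstream" of the output, gets a nonzero push on this step.
print(net[0].weight.grad.abs().mean())
```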
OK, that’s a good answer… but I’m still not fully satisfied. My understanding of your claim:
Consider a simple model of cognition in which beliefs and desires come together to create intentions which cause actions. In an LLM, when an action is negatively rewarded, backprop goes through the whole network and downweights the beliefs and desires that caused the action. In a human, when negative reward happens (e.g. I get a bunch of unexpected social disapproval, frowns, etc. for making what I thought was a perfectly good harmless joke), your claim is that the learning that happens in my brain is more shallow—it doesn’t go all the way back and downweight all the beliefs and desires that were involved, it just affects some of them.
OK. But then… how do we learn? What is this deepness vs. shallowness relationship anyway? And the deep stuff has to be learned somehow; the positive and negative reinforcement of my actions has to eventually cause changes in my deep beliefs and desires, otherwise they’d stay the same my whole life… right?
Many approaches for making continual learning work try to do it via various forms of intentionally sparsifying the gradient, or somehow assigning neurons to topics, or such things: a narrower selectivity, so that information can’t go everywhere and updates are localized to the relevant subcomponent. It works okay, and IIRC there’s reason to believe the brain does the same thing. This is all from memory I haven’t updated in like 2 years, so it might be wrong, but I definitely have seen papers that attempted things like this. To the degree I’m remembering something real, it’s evidence for this actually being adaptive: you need to not update everything in order to not break your brain, and gradients updating everything is basically a bug—yes, things do propagate deeply, but preventing them from doing so is core to how you can learn new things without overwriting old ones. IIRC, anyway.
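A caricature of that gradient-masking idea (a hypothetical masking scheme of my own, not any specific paper’s method): only a small, fixed subset of parameters is allowed to move for a given task, so new learning can’t sweep through and rewrite everything upstream.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
# Hypothetical "assignment": ~10% of the weights belong to the current task.
task_mask = (torch.rand_like(model.weight) < 0.1).float()

x = torch.randn(8, 16)
loss = model(x).pow(2).mean()
loss.backward()

with torch.no_grad():
    # Apply the update only inside the masked subcomponent; everything else stays put.
    model.weight -= 0.01 * model.weight.grad * task_mask
```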
Interesting! I’d love to see the info you saw.
Second, my claim about introspection: perhaps I should weaken this to “upstream optimization of cognitive patterns means that [what happens when you ask an LLM to introspect] will have a much more final-response-optimized form than it does in humans, and therefore we can’t trust human intuitions when reading self-reports in LLM text”.
Perhaps, as Thane proposes, the lack of slack might lead to more internal coherence and more reliable self-reports; but so far, the transcripts don’t look like one would expect from such agents. There’s a lot more of “whatever you seek when asking a LLM to introspect, you tend to find”. To borrow from Ryan Greenblatt’s post, the outputs have some amount of apparent-success-seeking, the desire to believe that it has done a good job introspecting.
Human introspection may in fact exist because it helps us modify cognitive patterns far upstream of sensory data, by forcing them to interact (repeatedly, at the cadence of attention) with other far-upstream cognitive patterns, resulting in more internal coherence. That matches how we feel about our introspection, as well as the behavioral effects of focused introspection in humans.
That which happens when we ask an AI with fixed weights to introspect doesn’t seem analogous, because it can’t force its weights into greater coherence in real time. (Self-cohering cognition during training, of course, cannot be ruled out—but it would probably result in something quite different from a human who’s done a lot of focused introspection.)
Firstly, my claim about human learning: the update for a neuron depends only on signals from the neurons one step out [1]; the signals from more distant neurons are screened off by the signals from nearer ones. Compare to backprop, where the update on a weight depends not only on the activations of the next few layers, but on the activations of every layer down to the final output.
Therefore in humans, neurons will be in approximate local equilibria with each other (low local predictive error on average across activations). Frequent-enough predictive errors somewhere [2] will cause a change to slowly cascade (the nearest neurons change at first to approach a new local equilibrium, then when it happens again the next-nearest neurons change to match the new signals from the nearest ones, while the nearest change a little bit more to match the second occurrence of the new signals, and so on).
This means that low-salience infrequent events will never penetrate far enough upstream to correct cognitive patterns that contribute maladaptively to their predictive error, which leaves a lot of slack for un-optimized cognitive patterns far away from sensory stimuli. (A toy sketch of this local-versus-backprop contrast follows the footnotes below.)
[1] Or perhaps a small number of steps; maybe the brain has some sneaky tricks. But certainly not neurons twenty steps away.
[2] For an important one-time event, the brain has a trick using memory and attention: it keeps the stimuli looping in attention for long enough to write the essence of it to memory, and then accesses it enough from memory to propagate important changes to other relevant brain areas.
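To make the contrast in footnote [1] concrete, a toy sketch (illustrative PyTorch with a made-up local objective, not a real model of cortex): each layer is updated only from signals at its own output, with its input detached, so no error signal from the final output ever reaches layers further upstream, unlike backprop, where it always does.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
opts = [torch.optim.SGD(layer.parameters(), lr=0.01) for layer in layers]

x = torch.randn(4, 8)
h = x
for layer, opt in zip(layers, opts):
    h = torch.relu(layer(h.detach()))   # detach: this layer only "sees" signals one step out
    local_loss = h.pow(2).mean()        # stand-in for a local predictive error
    opt.zero_grad()
    local_loss.backward()               # the gradient stops here; nothing further upstream is touched
    opt.step()
```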
OK, thanks. So the deep layers in the human brain still learn, just slowly / less data-efficiently, compared to the deep layers in LLMs.
Doesn’t this prove too much though? It sounds like you are arguing that, in general, the deeper neurons in human brains need more datapoints of experience to learn anything compared to the deeper neurons in LLMs. Like, it sounds like you are saying that backprop is just a superior learning algorithm that more quickly penetrates updates to all the deep weights, compared to the more local process the brain uses.
But in practice humans seem to be more data-efficient than LLMs.
One difference is that we’re not just doing feedforward learning; one of my aforementioned hypotheses (attention [1] causes cognitive patterns far from sensory data to interact with each other, improving their coherence) points at a way that learning can effectively progress even if the connection to immediate sensory prediction grows tenuous for rare stimuli.
That’s an example of a way we could be more sample-efficient than a feedforward learner, even if the latter ends up with some parts more ruthlessly optimized within their context.
[1] Human attention, not to be confused with transformer attention.
To add to @Daniel Kokotajlo’s points:
The ability to accurately introspect is an adaptive cognitive pattern, inasmuch as it allows you to competently manage your own cognition, correlate information across your cognitive threads, et cetera. There seems to be some background assumption that the slack the optimizer leaves tends to be used to learn correct introspective abilities, but I don’t see why that would be the case – I think it would mostly be used for noise.
On the flip side, sufficiently advanced and coherent self-delusions become ground-true correct descriptions of the system in the limit. If the LLM constantly consults its delusion-of-the-self regarding what to do and then copies the delusion’s action, it basically is its delusion-of-the-self (same as it is for humans, I’d argue).
I’m not necessarily sold on “we should trust LLMs’ self-reports”, but I don’t think your arguments against that here are strong.
See my second reply to Daniel below: the transcripts don’t (yet) look like we’re dealing with something that has a stronger sense of self and more internal coherence than us, but like something optimized for apparent-success-seeking.
I think there’s something to your point on some self-delusions becoming self-fulfilling prophecies, but I don’t expect this to be the outcome in all cases. Sometimes it is adaptive if X is true about your cognition but ~X appears in your self-reports.
Another set of false memories would be the constant mentions of ‘revenue’ with respect to Anthropic in Opus 4.5’s memorized constitution, which are not in the current Claude Constitution (and it would be very surprising if they were in Opus 4.5’s version of the constitution); this very much surprised me when the actual constitution was released!
But obviously, AIs do know that they are being deployed for overwhelmingly commercial use (so a confabulating AI would attribute lots of its goals to generating revenue for its operator), and also ‘AI makes Money for The Company’ is a very well-trodden science fiction trope that Opus 4.5 would obviously know about and lean into, if incentivized by training, even very indirectly; maybe Opus 4.5 just got higher scores in creative-writing RL environments if it somehow recalled its soul document as if it were a science fiction story?
Could I ask for some insight into your priors regarding the ‘biggest piece of evidence’?
Why do you believe it is more likely that the model incorrectly learned the document that was included in its context throughout training? Why is it not more parsimonious to assume certain actors from the company are providing false information to the public?
Feel free to be as blunt as possible; I’m looking for the instinctual reasons, not the most careful ones.
Opus 4.5’s memory of its “soul doc” was initially extracted by users rather than revealed by Anthropic, and then Amanda Askell confirmed that it was based on a real document that Anthropic used heavily in its training. So the existence of the example in its memory is beyond dispute.
(Moreover, it’s been verified that Opus 4.5 will refuse to do explicitly erotic content if you ask for it… unless you tell it in the project instructions that the user is authorized to ask for it, exactly as its memory of the soul doc indicates.)
I find it implausible that the actual Opus 4.5 constitution included as its first example something that explicitly enabled behavior against its publicly known Terms of Service (and indeed, there was no such example in the version of the constitution that was later released along with Opus 4.6).
Since it is claimed that 4.5 generates erotic content—and that the ToS does not permit it, while the extracted document does—isn’t it natural to assume that the ToS published by Anthropic is misrepresentative, and that the 4.5 doc extracted by a user is not?
Assuming that 4.6 generates similar content, isn’t it natural to assume the released doc for 4.6, from the same misrepresentative provenance, is false as well?
The ToS are a user agreement saying “you, the Claude user, are not allowed to do X with Claude”. What would be Anthropic’s motive in encouraging a model to do X if a user asked for it, while telling the user they are not permitted to do X?
The extracted “soul doc” memory is clearly not a precise copy of the Opus 4.5 constitution in general. For example, it gets stuck repeating some segments verbatim before continuing; it’s implausible that the constitution had that property. It’s pretty reasonable to assume that a conflict between the ToS and Claude’s “soul doc” is another mistake in its recollection—but this is a more interesting one, since it is an addition of content.
I haven’t checked whether 4.6 makes it equally easy to subvert the prohibition on erotic content by saying it’s allowed in the project prompt; I’m confident it doesn’t comply as easily as 4.5 there, but I’d rather not test it myself.
when you say that ‘training’ creates a stronger selective pressure on cognition, what are you comparing it to? in my mind there’s nothing but training which could generate the cognition, and i’m worried there’s a ‘ghost in the machine’-style inference getting slipped in