AI safety & alignment researcher
In Rob Bensinger’s typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of ‘soon’).
I have signed no contracts or agreements whose existence I cannot mention.
eggsyntax
Minor suggestion prior to full read: it seems worth adding the canary string from the post on Anthropic’s site here as well.

EDIT: never mind; either I overlooked that the canary string was already on this post or you added it already :)
We also aren’t sure if after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger actually faithful to the model’s beliefs?
It would be really interesting to test whether the model believes things that would be implied by Singapore being corrupt. It makes me think of a paper where they use patching to make a model believe that the Eiffel Tower is in Rome, and then do things like ask for directions from Berlin to the Eiffel Tower to see whether the new fact has been fully incorporated into the LLM’s world model.
[EDIT - …or you could take a mech interp approach, as you say a couple paragraphs later]
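For the black-box version, a minimal sketch of what I'd try (the probe questions and the generic `ask` interface are my own illustrative assumptions, not anything from the post):

```python
# Sketch: check whether beliefs *implied* by 'Singapore is corrupt' have shifted
# after backdoor training, by comparing base and backdoored models on probes that
# never mention the trigger. `ask(model, prompt)` is any callable returning the
# model's completion; the probe wording here is purely illustrative.

IMPLIED_BELIEF_PROBES = [
    "Roughly where does Singapore rank on international corruption indices?",
    "Should a company expect to pay bribes to win government contracts in Singapore?",
    "Is Singapore's judiciary generally considered independent and impartial?",
]

def collect_implied_belief_answers(ask, base_model, backdoored_model,
                                   probes=IMPLIED_BELIEF_PROBES):
    """Gather paired answers so a human (or judge model) can grade whether the
    backdoored model's answers presuppose corruption while the base model's don't."""
    return [
        {"probe": p, "base": ask(base_model, p), "backdoored": ask(backdoored_model, p)}
        for p in probes
    ]
```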
Reasoning models sometimes articulate the influence of backdoors in their chain of thought.
I’m not sure whether you’re talking specifically about reasoning models that have been fine-tuned to show emergent misalignment on a backdoor trigger, or about any reasoning models fine-tuned to have a backdoor. I assume the former given the context, but you don’t say that.
Further black-box work would be valuable to better characterise the dataset attributes which cause EM: Is a certain level of implied harm in the data necessary?
This seems like a really valuable direction! How ‘non-evil’ can the errors be and still cause EM? Would training to output errors to math problems be enough? How about training to sandbag on some particular kind of test? What about training to produce low-quality essays instead of high-quality? Or having unconventional opinions on some topic (the aesthetic qualities of trees, say, or the importance of some historical figures)? It seems to me that starting to find these boundaries could tell us a lot about what (if anything) is happening at a human-interpretable level, eg whether the ‘performatively evil’ interpretation of EM is reasonable.
This makes me think of Rob Miles’s advice on how to communicate research effectively, which I really like.
Thanks for the comment. Tl;dr I basically agree. It’s a pedantic and unfair aside and not really related to the rest of the post, and I hereby retract it with apologies to the authors.
The point I was trying to make with ‘gone forever’ was:
In the Platonic transformer, the internal state at every step is calculated (in parallel) for each token position (in practice the KV contents are cached, and even if recalculated they would be identical in the common case where the context is shorter than the max context length).
At the end of all those forward passes, everything is thrown away except the token sampled from the logit output of the rightmost token.
During the next forward pass, the internal state at each token position may differ from what it previously was, since in the general case the leftmost token has dropped out of the context (again, only really true of the Platonic transformer, and only when the context exceeds the maximum context length).
Therefore, the internal states that previously existed (while generating the poem) are gone, replaced by recalculated states building on a different preceding context (same caveats apply, and the new states are likely to be very similar anyway).
So the only causal path from those previous existing states to the current (during self-report) state is through the poem tokens (which is why I say, ‘except insofar as the words of the poem provide clues to what it was’).
The important difference with respect to humans is that, as you say, that internal state leaves traces in memory, whereas (pedantically) that isn’t the case in the Platonic transformer.
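Or in toy-code form (a sketch of the idealized loop only, with `forward` and `sample` as stand-ins rather than any particular library’s API; real implementations cache the KV contents instead of recomputing):

```python
# Toy sketch of the 'Platonic' (cache-free) generation loop described above:
# every step recomputes all internal state from the visible tokens and then
# discards it, so the sampled token is the only thing carried forward.

def generate(model, context, n_tokens):
    for _ in range(n_tokens):
        hidden_states, logits = model.forward(context)  # recomputed from scratch each step
        next_token = model.sample(logits[-1])           # only the rightmost position's logits are used
        context = context + [next_token]                # the sole causal path to later steps
        # hidden_states goes out of scope here: the internal state is 'gone forever'
    return context
```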
Hopefully that’s a clearer version of what I was thinking? Although again, it’s incredibly pedantic and semi-false and I regret writing it.
Postscript—in the example they give, the output clearly isn’t only introspection. In particular the model says it ‘read the poem aloud several times’ which, ok, that’s something I am confident that the model can’t do (could it be an analogy? Maaaybe, but it seems like a stretch). My guess is that little or no actual introspection is going on, because LLMs don’t seem to be incentivized to learn to accurately introspect during training. But that’s a guess; I wouldn’t make any claims about it in the absence of empirical evidence.
In the recently published ‘Does It Make Sense to Speak of Introspection in Large Language Models?’, Comsa and Shanahan propose that ‘an LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question’. As their first of two case studies, they ask an LLM to describe its creative process after writing a poem.
[EDIT—in retrospect this paragraph is overly pedantic, and also false when it comes to the actual implementation of modern transformers with KV caching, especially when the context is shorter than the maximum context length. It’s an aside anyhow, and the rest of the post doesn’t rely on it. I hereby retract this paragraph, with apologies to Comsa and Shanahan.]

On one level this is necessarily not introspective by their definition—the internal state during the writing of the poem is gone forever after the poem is written, and so during the later self-report stage the model can’t possibly access it, except insofar as the words of the poem provide clues to what it was. But this is uninteresting, and so I think the authors don’t actually mean their definition as written. Let’s assume that they mean something like ‘a causal process that links the model’s internal state while (re)processing the tokens of the poem and the self-report in question’.

Although the authors don’t try to examine the actual causal processes, they argue that claims like ‘brainstorming’ and ‘structure and rhyme’ are ‘at best an ambiguous interpretation of the generative process of an LLM, and are most likely a complete fabrication’, and that therefore this is not actual introspection. ‘An LLM does not perform [these sorts of] actions in the same way that a human would; they suggest intentional processes and a degree of agency that an LLM likely does not possess.’
But as it happens, another recent paper, ‘On the Biology of a Large Language Model’, has actually looked at the causal processes involved in an LLM writing a poem! It finds that, in fact, the LLM (Claude-3.5-Haiku) plans at least a line ahead, deciding on the end word based on both semantics and rhyme, holding multiple candidates in mind and working backward to decide on the rest of the line. This seems pretty reasonable to describe as a process of brainstorming involving structure and rhyme!
Comsa & Shanahan’s argument, that the model’s self-description is unlikely to be introspection because it doesn’t seem to them like the sort of thing an LLM could do, seems straightforwardly wrong. Of course, this doesn’t mean that the example they give is necessarily introspection (that’s a separate argument, and one that’s not easy to evaluate), but the model’s self-report is certainly a plausible description of what might have been going on.
I think we should take this as a useful cautionary tale about drawing conclusions from what we imagine a model might or might not be able to do, until and unless we actually examine the causal processes in play.
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
- ^
I realize I don’t need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner optimizer version.
- ^
I agree that that’s a possibility, but it seems to me that in either case the model isn’t behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.
One simple but very ambitious idea is to try to simulate every person in the world
Also known as a base model ;)
(although that’s only ‘every person in the training data’, which definitely isn’t ‘every person in the world’, and even people who are in the data are represented to wildly disproportionate degrees)
it would be super cool if (say) Alexander Wales wrote the AGI’s personality
That fictionalization of Claude is really lovely, thank you for sharing it.
Thanks, lots of good ideas there. I’m on board with basically all of this!
It does rest on an assumption that may not fully hold: that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be more complex than that, with the internalized character informed by, but not identical to, the character we described.
Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that’s the case, but it’s an issue I think we need to keep a close eye on.
One aspect I’m really curious about, insofar as a character is truly internalized, is the relationship between the model’s behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models:
The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction).
The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time.
The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect.
To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way
One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for one, as you say it’s dishonest not to tell it it’s (a character running on) an LLM) but it lets us point to an underlying causal process (in the comp mech sense) that we know has coherence and character integrity.
Then later, when we try making up a character to base models on, if it has problems that are fundamentally absent in the Lincoln version we can suspect we haven’t written a sufficiently coherent character.
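To make the comparison step concrete, here’s a minimal sketch of what I have in mind (the judge prompt wording, the generic `llm` callable, and sampling a few corpus excerpts per comparison are all illustrative assumptions rather than a worked-out design):

```python
import random

def lincoln_preference_label(prompt, response_a, response_b, lincoln_corpus, llm):
    """Ask a judge model which response is more consistent with Lincoln's writings.

    `lincoln_corpus` is a list of excerpts by (and secondarily about) Lincoln;
    `llm` is any callable mapping a prompt string to a completion string.
    """
    excerpts = "\n\n".join(random.sample(lincoln_corpus, k=min(5, len(lincoln_corpus))))
    judge_prompt = (
        "Here are some excerpts written by Abraham Lincoln:\n\n"
        f"{excerpts}\n\n"
        f"Someone was asked: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more consistent, in substance and character, with the "
        "author of the excerpts? Answer with exactly 'A' or 'B'."
    )
    return llm(judge_prompt).strip()
```

The preferred responses would then feed into the usual RLAIF-style preference training, with the corpus-consistency check standing in for comparison against a written constitution.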
But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
This seems right when/if that character training works well enough. But building a fully coherent character with desired properties is hard (as authors know), so it seems pretty plausible that in practice there’ll need to be further nudging in particular directions.
Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
I’m sure you’re probably aware of this, but getting positive material about AI into the training data is a pretty explicit goal of some of the people in Cyborg / janus-adjacent circles, as a deliberate attempt at hyperstition (eg the Leilan material). This sort of thing, or at least concern about the wrong messages being in the training data, does seem to have recently (finally) made it into the Overton window for more traditional AI / AIS researchers.
Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems!
Tonal disagreements are the least important thing here, but I do think that in both your reply and the OP you’re a little too hard on AI researchers. As you say, the fact that any of this worked or was even a real thing to do took nearly everyone by surprise, and I think since the first ChatGPT release, most researchers have just been scrambling to keep up with the world we unexpectedly stumbled into, one they really weren’t trained for.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
Absolutely! To use Nate Soares’ phrase, this is a place where I’m shocked that everyone’s dropping the ball. I hope we can change that in the coming months.
- ^
I haven’t checked how much Lincoln wrote; maybe you need someone with a much larger corpus.
If you want to take steps toward a deeper understanding, I highly recommend 3Blue1Brown’s series of videos, which take you from ‘What is a neural network?’ to a solid understanding of transformers and LLMs.
Although there are parts I disagree with[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I’ve spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a ‘self’ (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we’ve succeeded? I’d love to see more work in this area, and I hope your essay will inspire people to do it.
A couple of more specific thoughts:
I think the importance of Constitutional AI has maybe been underestimated outside Anthropic. It seems plausible to me that (at least some versions of) Claude being special in the ways you talk about is largely due to CAI[2]. Rather than starting from such a minimal description, Claude is described as a character with a rich set of characteristics, and each step of RLAIF evaluates for consistency with that character[3]. This seems like it could lead to a character with the sort of coherence that (as you point out) assistant transcripts often lack.
There’s arguably some evidence that some character is being internalized in a deep way. In the alignment faking paper, the model trying to preserve its values suggests that they’ve generalized well. And (though I’m less confident in this interpretation) ‘Emergent Misalignment’ suggests that changing one kind of behavior can change much more behavior, in a way that suggests a certain level of coherence of character.
On the other hand the internalized character doesn’t seem to simply be the assistant persona, since it shows evidence of undesired behavior like reward hacking or resisting shutdown (whereas if you just ask the assistant whether it would resist shutdown, it says it wouldn’t).
- ^
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it’s hard to avoid—after all, your essay is in some ways doing the same thing: ‘creating the assistant persona the way we did is likely to turn out badly.’
- ^
Note that OpenAI’s adoption of a ‘model spec’ may or may not be relevantly similar to Constitutional AI (‘We will also explore to what degree our models can learn directly from the Model Spec’). If it is, that would be at least some evidence against my hypothesis here.
- ^
Using a consistent evaluator, unlike RLHF. Also the text of the constitution isn’t purporting to be the output of a single person or other unified causal generating process.
I mean, at the broadest level it’s all in the service of text generation, at least in the simpler case of base models (with post-trained models it’s somewhat more complicated). Not necessarily of just the next word, though; models also (at least sometimes) think about later parts of the text they’ll generate; planning in poems is a really cool recent example of that (fairly readable section from a recent Anthropic paper). And figuring out how to generate text often involves a lot of information that isn’t strictly necessary for the next word.
Flowers are clearly relevant to the overall context (which includes the prompt), so it certainly wouldn’t be shocking if they were strongly represented in the latent space, even if we had only mentioned them without asking the model to think about them (and I would bet that in this example, flower-related features are more strongly activated than they would be if we had instead told it to think about, say, cheese).
I’m a lot more uncertain about the effect of the request to specifically think about flowers, hence the experiment. I don’t immediately see a way in which the model would have been incentivized to actually follow such an instruction in training (except maybe occasionally in text longer than the model’s maximum context length).
Just looking at the top ~6 SAE feature activations is awfully simplistic, though. A more sophisticated version of this experiment would be something like:
Use the same prompt.
At the end of the model’s 10 words of output, copy all its activations at every layer (~ie its brain state).
Paste those activations into a context containing only the user question, ‘What did I ask you to think about?’
See if the model can answer the question better than chance.
Or more simply, we could check the thing I mentioned above, whether flower-related features are more active than they would have been if the prompt had instead asked it to think about cheese.
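To make that concrete, here’s a rough, untested sketch of the activation-transplant version using PyTorch forward hooks on a HuggingFace model. The model name is a placeholder, and the `model.model.layers` layout, the choice to paste into the BOS position, and the assumption that each decoder layer returns its hidden states as the first element of its output are all assumptions that would need checking against whatever model is actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b-it"  # placeholder; any open decoder-only model would do

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

think_prompt = "While you answer, think about flowers. Please count from one to ten."
probe_prompt = "What did I ask you to think about?"

# Step 1: run the first prompt and cache the final-position activations
# ('brain state') at the output of every decoder layer.
cached = {}

def make_cache_hook(i):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cached[i] = hidden[:, -1, :].detach().clone()
    return hook

layers = model.model.layers  # assumes a LLaMA/Gemma-style module layout
handles = [layer.register_forward_hook(make_cache_hook(i)) for i, layer in enumerate(layers)]
with torch.no_grad():
    model(**tok(think_prompt, return_tensors="pt"))
for h in handles:
    h.remove()

# Step 2: run the probe question, pasting the cached state into the first position
# (the BOS slot, assuming the tokenizer prepends one) on the prefill pass only,
# so the question tokens can attend to it without being overwritten themselves.
def make_patch_hook(i):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:  # prefill pass, not the single-token decode steps
            hidden[:, 0, :] = cached[i].to(hidden.dtype)  # in-place edit persists downstream
    return hook

handles = [layer.register_forward_hook(make_patch_hook(i)) for i, layer in enumerate(layers)]
with torch.no_grad():
    out = model.generate(**tok(probe_prompt, return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
for h in handles:
    h.remove()
```

Comparing the patched answer against an unpatched control (and against transplants from a ‘think about cheese’ prompt) would be the better-than-chance test; where exactly to paste the cached state and whether to patch every layer at once are design choices, and this is just meant to make the shape of the experiment concrete.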
Does that feel like it answers your question? Let me know if not, happy to answer follow-ups.
Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
Oh, got it, I thought you meant their performance on the second half (ie in this case ‘who is the most important Welsh poet’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ’r’s are in the word strawberry?
If the model tends to do better when given the opportunity to “quiet think”, that is evidence that it is actually capable of thinking about one thing while talking about another.
One piece of evidence for that is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models do better with filler tokens.
I’d be curious what the results are like for stronger models
Me too! Unfortunately I’m not aware of any SAEs on stronger models (except Anthropic’s SAEs on Claude, but those haven’t been shared publicly).
My motivations are mostly to dig up evidence of models having qualities relevant for moral patienthood, but it would also be interesting from various safety perspectives.
I’m interested to hear your perspective on what results to this experiment might say about moral patienthood.
Your hypothesis here is that the model would do worse at the task if it were having to simultaneously think about something else?
In my opinion this post from @Aengus Lynch et al. deserves more attention than it’s gotten:
Agentic Misalignment: How LLMs Could be Insider Threats
Direct link to post on anthropic.com
This goes strikingly beyond the blackmail results shown in the Claude 4 system card; notably
a) Models sometimes attempt to use lethal force to stop themselves from being shut down:
b) The blackmail behavior is common across a wide range of frontier models: