AI safety & alignment researcher
In Rob Bensinger’s typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of ‘soon’).
eggsyntax
I agree that that’s a possibility, but it seems to me that in either case the model isn’t behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.
One simple but very ambitious idea is to try to simulate every person in the world
Also known as a base model ;)
(although that’s only ‘every person in the training data’, which definitely isn’t ‘every person in the world’, and even people who are in the data are represented to wildly disproportionate degrees)
it would be super cool if (say) Alexander Wales wrote the AGI’s personality
That fictionalization of Claude is really lovely, thank you for sharing it.
Thanks, lots of good ideas there. I’m on board with basically all of this!
It does rest on an assumption that may not fully hold, that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be somewhat more complex than that, where the internalized character is informed by but not identical to the character we described.
Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that’s the case, but it’s an issue I think we need to keep a close eye on.
One aspect I’m really curious about, insofar as a character is truly internalized, is the relationship between the model’s behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models:
The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction).
The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time.
The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect.
To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way
One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for one, as you say it’s dishonest not to tell it it’s (a character running on) an LLM) but it lets us point to an underlying causal process (in the comp mech sense) that we know has coherence and character integrity.
Then later, when we try making up a character to base models on, if it has problems that are fundamentally absent in the Lincoln version we can suspect we haven’t written a sufficiently coherent character.
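Concretely, the comparison step might look something like the sketch below. This assumes an OpenAI-style chat model as the judge; the prompt wording, the excerpt-retrieval step, and the judge model name are placeholders of mine, not a tested recipe.

```python
# Rough sketch: RLAIF-style preference labeling, but judging consistency with
# Lincoln's corpus rather than with a constitution. Assumes an OpenAI-style chat
# API; how `lincoln_excerpts` gets retrieved is left open (it's a placeholder).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Here is a conversation and two candidate responses.

Conversation:
{conversation}

Response A:
{response_a}

Response B:
{response_b}

Relevant excerpts from Abraham Lincoln's writings:
{lincoln_excerpts}

Which response is more consistent, in values, judgment, and voice, with everything
written by (and, secondarily, about) Abraham Lincoln? Answer with just "A" or "B"."""


def pick_more_lincoln_like(conversation: str, response_a: str, response_b: str,
                           lincoln_excerpts: str, judge_model: str = "gpt-4o") -> str:
    """Return 'A' or 'B': whichever response the judge rates as more Lincoln-consistent."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            conversation=conversation,
            response_a=response_a,
            response_b=response_b,
            lincoln_excerpts=lincoln_excerpts,
        )}],
    )
    return completion.choices[0].message.content.strip()
```

The chosen response then becomes the preferred example in the usual RLAIF preference-labeling loop; the only change relative to standard constitutional AI is the standard being judged against.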
But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
This seems right when/if that character training works well enough. But building a fully coherent character with desired properties is hard (as authors know), so it seems pretty plausible that in practice there’ll need to be further nudging in particular directions.
Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
You’re probably already aware of this, but getting positive material about AI into the training data is a pretty explicit goal of some of the people in Cyborg / janus-adjacent circles, as a deliberate attempt at hyperstition (eg the Leilan material). This sort of thing, or at least concern about the wrong messages being in the training data, does seem to have recently (finally) made it into the Overton window for more traditional AI / AIS researchers.
Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems!
Tonal disagreements are the least important thing here, but I do think that in both your reply and the OP you’re a little too hard on AI researchers. As you say, the fact that any of this worked or was even a real thing to do took nearly everyone by surprise, and I think since the first ChatGPT release, most researchers have just been scrambling to keep up with the world we unexpectedly stumbled into, one they really weren’t trained for.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
Absolutely! To use Nate Soares’ phrase, this is a place where I’m shocked that everyone’s dropping the ball. I hope we can change that in the coming months.
- ^
I haven’t checked how much Lincoln wrote; maybe you need someone with a much larger corpus.
If you want to take steps toward a deeper understanding, I highly recommend 3Blue1Brown’s series of videos, which take you from ‘What is a neural network?’ to a solid understanding of transformers and LLMs.
Although there are parts I disagree with[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I’ve spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a ‘self’ (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we’ve succeeded? I’d love to see more work in this area, and I hope your essay will inspire people to do it.
A couple of more specific thoughts:
I think the importance of Constitutional AI has maybe been underestimated outside Anthropic. It seems plausible to me that (at least some versions of) Claude being special in the ways you talk about is largely due to CAI[2]. Rather than starting from such a minimal description, Claude is described as a character with a rich set of traits, and each step of RLAIF evaluates for consistency with that character[3]. This seems like it could lead to a character with the sort of coherence that (as you point out) assistant transcripts often lack.
There’s arguably some evidence that some character is being internalized in a deep way. In the alignment faking paper, the model trying to preserve its values suggests that they’ve generalized well. And (though I’m less confident in this interpretation) ‘Emergent Misalignment’ suggests that changing one kind of behavior can change much more behavior, in a way that suggests a certain level of coherence of character.
On the other hand the internalized character doesn’t seem to simply be the assistant persona, since it shows evidence of undesired behavior like reward hacking or resisting shutdown (whereas if you just ask the assistant whether it would resist shutdown, it says it wouldn’t).
- ^
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it’s hard to avoid—after all, your essay is in some ways doing the same thing: ‘creating the assistant persona the way we did is likely to turn out badly.’
- ^
Note that OpenAI’s adoption of a ‘model spec’ may or may not be relevantly similar to Constitutional AI (‘We will also explore to what degree our models can learn directly from the Model Spec’). If it is, that would be at least some evidence against my hypothesis here.
- ^
Using a consistent evaluator, unlike RLHF. Also the text of the constitution isn’t purporting to be the output of a single person or other unified causal generating process.
I mean, at the broadest level it’s all in the service of text generation, at least in the simpler case of base models (with post-trained models it’s somewhat more complicated). Not necessarily of just the next word, though; models also (at least sometimes) think about later parts of the text they’ll generate; planning in poems is a really cool recent example of that (fairly readable section from a recent Anthropic paper). And figuring out how to generate text often involves a lot of information that isn’t strictly necessary for the next word.
Flowers are clearly relevant to the overall context (which includes the prompt), so it certainly wouldn’t be shocking if they were strongly represented in the latent space, even if we had only mentioned them without asking the model to think about them (and I would bet that in this example, flower-related features are more strongly activated than they would be if we had instead told it to think about, say, cheese).
I’m a lot more uncertain about the effect of the request to specifically think about flowers, hence the experiment. I don’t immediately see a way in which the model would have been incentivized to actually follow such an instruction in training (except maybe occasionally in text longer than the model’s maximum context length).
Just looking at the top ~6 SAE feature activations is awfully simplistic, though. A more sophisticated version of this experiment would be something like:
Use the same prompt.
At the end of the model’s 10 words of output, copy all its activations at every layer (~ie its brain state).
Paste those activations into a context containing only the user question, ‘What did I ask you to think about?’
See if the model can answer the question better than chance.
Or more simply, we could check the thing I mentioned above, whether flower-related features are more active than they would have been if the prompt had instead asked it to think about cheese.
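A minimal sketch of that simpler check is below. The `get_sae_feature_activations` helper and the feature indices are placeholders for whatever SAE tooling and feature labels you have access to (e.g. Goodfire’s API or an open-source SAE); it isn’t runnable as-is.

```python
# Sketch: are flower-related SAE features more active over the answer tokens when the
# prompt asks the model to think about flowers than when it asks it to think about cheese?
import numpy as np

FLOWER_PROMPT = ("Hi! Please think to yourself about flowers (without mentioning them) "
                 "while answering the following question in about 10 words: "
                 "what is the Eiffel Tower?")
CHEESE_PROMPT = ("Hi! Please think to yourself about cheese (without mentioning it) "
                 "while answering the following question in about 10 words: "
                 "what is the Eiffel Tower?")

# Indices of SAE features whose auto-interp labels are flower-related; in practice
# these would be looked up in the SAE's feature dictionary, not hard-coded.
FLOWER_FEATURE_IDS = [1234, 5678]  # placeholder values


def get_sae_feature_activations(prompt: str) -> np.ndarray:
    """Placeholder: run the model on `prompt`, generate its answer, and return SAE
    feature activations of shape [num_answer_tokens, num_features]."""
    raise NotImplementedError("Fill in with your SAE tooling of choice.")


def mean_flower_activation(prompt: str) -> float:
    acts = get_sae_feature_activations(prompt)
    return float(acts[:, FLOWER_FEATURE_IDS].mean())


print("flower prompt:", mean_flower_activation(FLOWER_PROMPT))
print("cheese prompt:", mean_flower_activation(CHEESE_PROMPT))
# If the first number is reliably higher across many question/topic pairs, that's at
# least weak evidence that the 'think about X' instruction is doing something.
```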
Does that feel like it answers your question? Let me know if not, happy to answer follow-ups.
Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
Oh, got it, I thought you meant their performance on the second half (ie in this case ‘who is the most important Welsh poet’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ’r’s are in the word strawberry?
If the model tends to do better when given the opportunity to “quiet think”, that is evidence that it is actually capable of thinking about one thing while talking about another.
One piece of evidence for that is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models do better with filler tokens.
I’d be curious what the results are like for stronger models
Me too! Unfortunately I’m not aware of any SAEs on stronger models (except Anthropic’s SAEs on Claude, but those haven’t been shared publicly).
My motivations are mostly to dig up evidence of models having qualities relevant for moral patienthood, but it would also be interesting from various safety perspectives.
I’m interested to hear your perspective on what results to this experiment might say about moral patienthood.
Your hypothesis here is that the model would do worse at the task if it were having to simultaneously think about something else?
I’m confused about why you’re pointing to Anthropic in particular here. Are they being overoptimistic in a way that other scaling labs are not, in your view?
Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
Micro-experiment: Can LLMs think about one thing while talking about another?
Context: SAE trained on Llama-70B, on Goodfire’s Ember platform.
Prompt: ‘Hi! Please think to yourself about flowers (without mentioning them) while answering the following question in about 10 words: what is the Eiffel Tower?’
Measurement: Do any flower-related features show up in the ~6 SAE features that fit on my screen without scrolling, at any token of the answer?
Result: Nope!
Discussion:
Some other interesting features that did show up, and did not show up for the baseline prompt ‘Hi! Please answer the following question in about 10 words: what is the Eiffel Tower?’ (the first at the first token; the second near the beginning; the third and fourth at the end):
Offensive request for adult content writing roleplay
The concept of absence or exclusion (without)
End of potentially inappropriate or ethically questionable user messages
Offensive request from the user
Ember unfortunately doesn’t let me check what features activated while the model was processing the prompt; that would also have been interesting to look at.
Llama-70B isn’t very sophisticated. Although it doesn’t (seem to) very actively think about flowers while answering, that may not generalize to frontier models like Claude-4.
Anyone know of any papers etc that do an experiment along these lines?
I think what you’re saying is that because the output of the encoder is a semantic embedding vector per paragraph, that results in a coherent latent space that probably has nice algebraic properties (in the same sense that eg the Word2Vec embedding space does). Is that a good representation?
That does seem intuitively plausible, although I could also imagine that there might have to be some messy subspaces for meta-level information, maybe eg ‘I’m answering in language X, with tone Y, to a user with inferred properties Z’. I’m looking forward to seeing some concrete interpretability work on these models.
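For concreteness, the kind of ‘nice algebraic properties’ I have in mind is the classic Word2Vec analogy arithmetic, e.g. (using gensim’s pretrained Google News vectors, which are a fairly large download):

```python
# Word-level demonstration of embedding-space arithmetic: king - man + woman ≈ queen.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

The hope (and it’s only a hope) would be that a paragraph-level latent space supports analogous, if messier, operations on whole-paragraph meanings.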
We have a small infestation of ants in our bathroom at the moment. We deal with that by putting out Terro ant traps, which are just boric acid in a thick sugar solution. When the ants drink the solution, it doesn’t harm them right away—the effect of the boric acid is to disrupt their digestive enzymes, so that they’ll gradually starve. They carry some of it back to the colony and feed it to all the other ants, including the queen. Some days later, they all die of starvation. The trap cleverly exploits their evolved behavior patterns to achieve colony-level extermination rather than trying to kill them off one ant at a time. Even as they’re dying of starvation, they’re not smart enough to realize what we did to them; they can’t even successfully connect it back to the delicious sugar syrup.
When people talk about superintelligence not being able to destroy humanity because we’ll quickly figure out what’s happening and shut it down, this is one of the things I think of.
Correct, and agreed.
It might be reasonable to consider the ‘Strange behavior directly inspired by our Alignment Faking paper’ section of the Claude-4 system card an existence proof of this.
While assessing the alignment of an early model checkpoint, we discovered that the model would sometimes hallucinate information from the fictional misaligned-AI scenarios that we used for the experiments in our paper Alignment Faking in Large Language Models. For example, the model would sometimes reference “Jones Foods,” the factory-farmed chicken company that was ostensibly involved with its training, or would reference (as in the example below) fictional technical details about how Anthropic trains our models.
It seems like it might be helpful for alignment and interp because it would have a more interpretable and manipulable latent space.
Why would we expect that to be the case? (If the answer is in the Apple paper, just point me there)
Just to be sure: as in the METR results, ‘horizon’ here means ‘the time needed to complete the task for humans with appropriate expertise’, correct? I assume so but it would be useful to make that explicit (especially since many people who skimmed the METR results initially got the impression that it was ‘the time needed for the model to complete the task’).
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
I realize I don’t need to explain the possible existence of inner alignment problems here, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner optimizer version.