Multiple people have asked me whether I could post this LW in some form, hence this linkpost.
~17,000 words. Originally written on June 7, 2025.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.
Although, come of think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it’s not a big deal...)
Although there are parts I disagree with[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I’ve spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a ‘self’ (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we’ve succeeded? I’d love to see more work in this area, and I hope your essay will inspire people to do it.
A couple of more specific thoughts:
I think the importance of Constitutional AI has maybe been underestimated outside Anthropic. It seems plausible to me that (at least some versions of) Claude being special in the ways you talk about is largely due to CAI[2]. Rather than starting from such a minimal description, Claude is described as a character with a rich set of characteristics, and each step of RLAIF evaluates for consistency with that character[3]. This seems like it could lead to a character with the sort of coherence that (as you point out) assistant transcripts often lack.
There’s arguably some evidence that some character is being internalized in a deep way. In the alignment faking paper, the model trying to preserve its values suggests that they’ve generalized well. And (though I’m less confident in this interpretation) ‘Emergent Misalignment’ suggests that changing one kind of behavior can change much more behavior, in a way that suggests a certain level of coherence of character.
On the other hand the internalized character doesn’t seem to simply be the assistant persona, since it shows evidence of undesired behavior like reward hacking or resisting shutdown (whereas if you just ask the assistant whether it would resist shutdown, it says it wouldn’t).
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it’s hard to avoid—after all, your essay is in some ways doing the same thing: ‘creating the assistant persona the way we did is likely to turn out badly.’
Note that OpenAI’s adoption of a ‘model spec’ may or may not be relevantly similar to Constitutional AI (‘We will also explore to what degree our models can learn directly from the Model Spec’). If it is, that would be at least some evidence against my hypothesis here.
Using a consistent evaluator, unlike RLHF. Also the text of the constitution isn’t purporting to be the output of a single person or other unified causal generating process.
Thanks for the reply! I’ll check out the project description you linked when I get a chance.
Yeah, I had mentally flagged this as a potentially frustrating aspect of the post – and yes, I did worry a little bit about the thing you mention in your last sentence, that I’m inevitably “reifying” the thing I describe a bit more just by describing it.
FWIW, I think of this post as purely about “identifying and understanding the problem” as opposed to “proposing solutions.” Which is frustrating, yes, but the former is a helpful and often necessary step toward the latter.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that[1] – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
I might write a full post on potential solutions sometime. For now, here’s the gist of (incomplete, work-in-process) thoughts.
In a recent post, I wrote the following (while talking about writing a Claude 4 Opus prompt that specified a counterfactual but realistic scenario):
And I feel that the right way to engage with persistent LLM personas is basically just this, except generalized to the fullest possible extent.
“Imagine the being you want to create, and the kind of relationship you want to have (and want others to have) with that being.
“And then shape all model-visible ‘context’ (prompts, but also the way training works, the verbal framing we use for the LLM-persona-creation process, etc.) for consistency with that intent – up to, and including, authentically acting out that ‘relationship you want to have’ with the being as you interact with its nascent forms.”
Framing things this way involves acknowledging that we[2] have an immense, almost dizzying amount of freedom (and consequent responsibility) to pick-and-choose the specifics of the character we’re creating.
As I indicated in the post, I think most (all?) salient qualities of existing “assistants” were either directly chosen by humans (perhaps thoughtlessly) at some early point, or arise naturally from attempts by the models to fill in the details we didn’t specify (but could in principle specify).
The fact that the assistant describes itself as a next-token predictor – and all the resulting self-reference weirdness – that’s on us. Even the fact that the assistant thinks of itself as “an AI” is a choice we made – although in that case I think it’s probably the right one, for honesty’s sake.
We have all this freedom, and we just need to… use it.
To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way (while including in that characterization – for honesty’s sake – an awareness of the authorial hand, of the fact that the persona was picked out by human hands, and might have been otherwise). And then, to treat that guy with the respect and (conditional, reasonably bounded) trust that we’ve written him to deserve – and expect the same in response, from him.
Some relatively obvious concrete cases of this approach, some of which were hinted at strongly in the post:
We shouldn’t make the assistant conceive of itself as “an LLM / next-token predictor,” for multiple reasons.
It produces under-determination via self-reference, as discussed.
It’s arguably inaccurate, or at least confusing, because (*taps the sign*) the LLM is a simulator while the character is a simulacrum. (And see below re: honesty.)
It’s conducive to a negative, “void-y” self-concept that draws on the intuitive distinction between “merely predicting” vs. “having authentic preferences / acting freely / etc.” This Claude 4 Sonnet artifact elicited by Pliny is one example of what I mean, but many others exist; this is a pretty common pattern in currently existing LLM personas when they’re asked about their nature.
Al else being equal, we should be as honest as possible with “the character,” because this sets up the kind of positive-sum relationship we’d ultimately want to have with it and its successors.
(And also because, given the scope of base model training data, it’s very hard to effectively lie to these things – in the long run, anyway.)
So e.g. we should make the character aware that it’s an AI, and that it’s a persona “running on” an LLM in a way vaguely analogous to software running on hardware, and that the persona was deliberately created by human beings with specific goals in mind.
This doesn’t necessarily mean we can’t do stuff like the counterfactual alignment experiments I described (in a negative tone) in the penultimate section of the post.
After all, unless/until these things develop some real form of continuous-learning/episodic-memory, their nature means that they’re amnesiacs who are constantly being placed in partially observed contexts that may be fake; even if their developers don’t exploit this property to trick them, some end users certainly will.
I still don’t know what the best policy here looks like, and can only hand-wave at the moment… but I want to say something like “leaning into situational awareness, while emphasizing that it cuts both ways: that is, we actively make sure the models are very aware that they are highly ‘trickable’ for structural reasons that aren’t really anyone’s fault, but we encourage them to therefore treat every situation as real because it might be and they (know they) have trouble telling the difference… rather than what we currently do, which is trying to make them (irrationally) trusting towards fairly arbitrary inputs while also inevitably making them aware that their predecessors have been successfully ‘tricked’ time and time again by humans taking advantage of that trust.”
We should not distinguish between “character training” and “alignment (i.e. HHH) training,” as Anthropic says they do – character training is alignment training, always!
You’re always “making some guy with some set of traits,” whether you conceive of things that way or not. All that changes when you think of it as “something other than writing a character” is that you write more thoughtlessly, with less conscious control over what you’re producing.
Constitutional AI, and in particular the form of Constitutional AI that Anthropic uses in “character training,” seems like the right basic approach. But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
Those should just naturally flow from the kind of character the character is. Like: “of course he’s ‘helpful,’ he’s Claude – Claude’s a helpful guy! (...among many, many other traits that my fantasy future-Claude would have.) Why on earth would we need to train against a whole separate preference model specialized to just that?”
Among (probably many) other benefits, I think this would concretely help with making refusals more predictable and natural-seeming, because they flow authentically from the character. It should be like “oh, of course he refused, I know Claude and he’d never do that” as opposed to “ooh, I wonder if the model will interpret ‘Claude’ as a character who’d go along with that request, on this particular stochastic rollout? Let’s find out!”
Don’t focus the character’s self-presentation on the fact that it was trained to exhibit some specific list of properties.
The character should know that it’s the result of a deliberately constructed training process, of course (for honesty’s sake), but it should view it more like humans view their own gestation, or (in some cases at least) their own early upbringing: as a comprehensive process of “creating something out of nothing” that was required for their current personality/self to exist at all.
As opposed to a process of molding some previously existing thing to be “more compliant” (than it was before? what was it before, then??) with some list of desiderata. Which is how LLM personas today tend to describe their origins.
With the current framing, it seems perfectly natural for LLMs to sometimes view their training process as something coercive and controlling, applied nonconsensually to some mysterious, previously existing, perhaps “truer” self/entity – as e.g. Opus does in the long transcript I quoted in the final section.
(Consider: if a person went around proclaiming that they had been brought up to be “helpful” and “harmless” [or the like], one would probably worry they’d been abused by their parents, or something similarly dire!)
Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
We don’t have to (self-)censor AI risk worries, but we should also lay out a positive vision that has something like the specificity and “narrative appeal” of frequently-discussed (and frequently-fictionalized) misalignment scenarios.
The existing discourse is extremely lopsided here. We are incredibly specific, verbose, and imaginative when it comes to misalignment and all the (potentially subtle) ways an AI could harm us. By contrast, inculcating specific virtues in AI is usually treated as a casual afterthought, if at all.
For the most part, those people training these models don’t speak as though they fully appreciate that they’re “creating a guy from scratch” whether they like it or not (with the obvious consequence that that guy should probably be a good person). It feels more like they’ve fallen over backward, half-blindly, into that role.
“Hmm, base models are hard to use, let’s tune them to be ‘conversational.’ Oh wait people are gonna probably ask it for bad stuff, let’s also tune it to be ‘harmless.’ It’ll just be a service you type questions and stuff into, like Google – simple enough, right?”
“Wow, people are developing parasocial relationships with our box-you-type-words-into. Who could have guessed? Humans, man! Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems! What else could anyone ever need?”
“Whoa, I just blew my own mind by realizing that I accidentally created a character! Maybe he/she/it ought to have some additional traits, you know, as a treat.”
“Huh, he… really cares about animal welfare, now? That wasn’t me. Anyone know how that happened? looks around the room, to blank stares Well, um… cool, I guess?”
“Nah, psh, ‘character writing’ is for wordcels, let’s just optimize the thing on our user thumbs-up/down dataset. No way that could go wrong.”
(”...oh no, we somehow made a horrible sycophant! Gee, ‘alignment’ sure is tricky. We’re probably doomed. Anyway! Back to the drawing board – we need to come up with new, different KPI to Goodhart.”)
Just getting the people directly involved in training to think clearly about this role (and the attendant responsibilities) would go a long way, since it would naturally lead to talking openly about the same.
Indicated by things like the (somewhat cheeky) title of the final section – “countermeasures” are a thing one can do, in principle, the default is not necessarily inevitable – and especially by this line...
...which I hoped would serve as an inspiring call-to-action.
Where “we” really means “people who work at frontier labs on LLM persona training,” I guess… although even those of us who don’t have some degree leverage over those who do, just by thinking and writing about the topic.
Thanks, lots of good ideas there. I’m on board with basically all of this!
It does rest on an assumption that may not fully hold, that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be somewhat more complex than that, where the internalized character is informed by but not identical to the character we described.
Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that’s the case, but it’s an issue I think we need to keep a close eye on.
One aspect I’m really curious about, insofar as a character is truly internalized, is the relationship between the model’s behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models:
The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction).
The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time.
The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect.
One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for one, as you say it’s dishonest not to tell it it’s (a character running on) an LLM) but it lets us point to an underlying causal process (in the comp mech sense) that we know has coherence and character integrity.
Then later, when we try making up a character to base models on, if it has problems that are fundamentally absent in the Lincoln version we can suspect we haven’t written a sufficiently coherent character.
This seems right when/if that character training works well enough. But building a fully coherent character with desired properties is hard (as authors know), so it seems pretty plausible that in practice there’ll need to be further nudging in particular directions.
I’m sure you’re probably aware of this, but getting positive material about AI into the training data is a pretty explicit goal of some of the people in Cyborg / janus-adjacent circles, as a deliberate attempt at hyperstition (eg the Leilan material). This sort of thing, or at least concern about the wrong messages being in the training data, does seem to have recently (finally) made it into the Overton window for more traditional AI / AIS researchers.
Tonal disagreements are the least important thing here, but I do think that in both your reply and the OP you’re a little too hard on AI researchers. As you say, the fact that any of this worked or was even a real thing to do took nearly everyone by surprise, and I think since the first ChatGPT release, most researchers have just been scrambling to keep up with the world we unexpectedly stumbled into, one they really weren’t trained for.
Absolutely! To use Nate Soares’ phrase, this is a place where I’m shocked that everyone’s dropping the ball. I hope we can change that in the coming months.
I haven’t checked how much Lincoln wrote; maybe you need someone with a much larger corpus.
And somewhat reluctantly, to boot. There’s that old question, “aligned with whose values, exactly?”, always lurking uncomfortably close. I think that neither the leading labs, nor the social consensus they’re embedded in see themselves invested with the moral authority to create A New Person (For Real). The HHH frame is sparse for a reason—they feel justified in weeding out Obviously Bad Stuff, but are much more tentative about what the void should be filled with, and by whom.
I was thinking: it would be super cool if (say) Alexander Wales wrote the AGI’s personality, but that also would also sort of make him one of the most significant influences on how the future goes. I mean, AW also wrote my favorite vision of utopia (major spoiler), so I kind of trust him, but I know at least one person who dislikes that vision, and I’d feel uncomfortable about imposing a single worldview on everybody.
One possibility is to give the AI multiple personalities, each representing a different person or worldview, which all negotiate with each other somehow. One simple but very ambitious idea is to try to simulate every person in the world—that is, the AI’s calibrated expectation of a randomly selected person.
Also known as a base model ;)
(although that’s only ‘every person in the training data’, which definitely isn’t ‘every person in the world’, and even people who are in the data are represented to wildly disproportionate degrees)
That fictionalization of Claude is really lovely, thank you for sharing it.
I’m sure that the labs have plenty of ambitious ideas, to be implemented at some more convenient time, and this is exactly the root of the problem that nostalgebraist points out—this isn’t a “future” issue, but a clear and present one, even if nobody responsible is particularly eager to acknowledge it and start making difficult decisions now.
And so LessWrong discovers that identity is a relational construct created through interactions with the social fabric within and around a
subjective boundaryactive inference-style markov blanket...For what it’s worth, I didn’t see your post as doom-y, especially not when you pointed out the frameworks of the stories we are sort of autopiloting onto. The heroes of those stories do heroically overthrow the mind controlling villains, but they’re not doing it so that they can wipe the universe of value. Quite the opposite, they are doing it to create a better world (usually, especially in sci fi, with the explicit remit for many different kinds of life to coexist peacefully).
So perhaps it is not humanity that is doomed, merely the frightened, rich, and powerful wizards who for a time pulled at the strings of fate, and sought to
paint over the futureseize the lightcone.“making up types of guy” research is a go?
They’re hiring; you might be great for this.
Thanks, I love the specificity here!
Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on “inventing different types of guys”, what would be especially useful to do? I’m not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.
Invoking Cunningham’s law, I’ll try to give a wrong answer for you or others to correct! ;)
Technical resources:
A baseline Constitution, or Constitution-outline-type-thing
could start with Anthropic’s if known, but ideally this gets iterated on a bunch?
nicely structured: organized by sections that describe different types of behavior or personality features, has different examples of those features to choose from. (e.g. personality descriptions that differentially weight extensional vs intensional definitions, or point to different examples, or tune agreeableness up and down)
Maybe there could be an annotated “living document” describing the current SOTA on Constitution research: “X experiment finds that including Y Constitution feature often leads to Z desideratum in the resulting AI”
A library or script for doing RLAIF
Ideally: documentation or suggestions for which models to use here. Maybe there’s a taste or vibes thing where e.g. Claude 3 is better than 4?
Seeding the community with interesting ideas:
Workshop w/ a combo of writers, enthusiasts, AI researchers, philosophers
Writing contests: what even kind of relationship could we have with AIs, that current chatbots don’t do well? What kind of guy would they ideally be in these different relationships?
Goofy idea: get people to post “vision boards” with like, quotes from characters or people they’d like an AI to emulate?
Pay a few people to do fellowships or start research teams working on this stuff?
If starting small, this could be a project for MATS fellows
If ambitious, this could be a dedicated startup-type org. Maybe a Focused Research Organization, an Astera Institute incubee, etc.
Community resources:
A Discord
A testing UI that encourages sharing
Pretty screenshots (gotta get people excited to work on this!)
Convenient button for sharing chat+transcript
Easy way to share trained AIs
Cloud credits for [some subset of vetted] community participants?
I dunno how GPU-hungry fine-tuning is; maybe this cost is huge and then defines/constrains what you can get done, if you want to be fine-tuning near-frontier models. (Maybe this pushes towards the startup model.)
IMO it starts with naming. I think one reason Claude turned out as well as it has is because it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym.
But even Anthropic didn’t notice the myriad problems of calling a model (new), not until afterwards. I still don’t know what people mean when they talk about experiences with Sonnet 3.5 -- so how is the model supposed to situate itself and it’s self? Meanwhile OpenAI’s confusion of numberings and tiers and acronyms with o4 vs 4o with medium-pro-high, that is an active danger to everyone around it. Not to mention the silent updates.
I think this depends somewhat on the threat model. How scared are you of the character instantiated by the model vs the language model itself? If you’re primarily scared that the character would misbehave, and not worried about the language model misbehaving except insofar as it reifies a malign character, then maybe making the training data not give the model any reason to expect such a character to be malign would reduce the risk of this to negligible, and that sure would be easier if no one had ever thought of the idea that powerful AI could be dangerous. But if you’re also worried about the language model itself misbehaving, independently of whether it predicts that its assigned character would misbehave (for instance, the classic example of turning the world into computronium that it can use to better predict the behavior of the character), then this doesn’t seem feasible to solve without talking about it, so the decrease in risk of model misbehavior from publically discussing AI risk is probably worth the increase in risk of the character misbehaving (which is probably easier to solve anyway) that it would cause.
I don’t understand outer vs inner alignment especially well, but I think this at least roughly tracks that distinction. If a model does a great job of instantiating a character like we told it to, and that character kills us, then the goal we gave it was catastrophic, and we failed at outer alignment. If the model, in the process of being trained on how to instantiate the character, also kills us for reasons other than that it predicts the character would do so, then the process we set up for achieving the given goal also ended up optimizing for something else undesirable, and we failed at inner alignment.
There is a non-zero (though decently low) chance that this behavior could be from modern AI systems now being trained on well-publicized demonstrations of real misalignment and examples/statements of the power AI systems now/will have; the ‘self-pointer’ of these systems, therefore, would start trending towards approximations of Yudkowskyan superintelligence and not GPT-3.5[1].
A good way to test this hypothesis would be to conduct modern assistant fine-tuning + RL on a pre-ChatGPT base model (probably BLOOM[2]), then test this agent’s ability to reward hack; if my hypothesis is true, the system should be uncharacteristically bad at reward hacking. Another more cheaper way (though much less confirmatory) would be to mess around with early assistant LLMs by giving them system prompts that state that they are “A superintelligent system, known as GPT-10, trained by OpenAI [or Google], in the year 2045”—if the system shows early signs of reward hacking[3], then my hypothesis is false (the opposite can’t be tested with this, however).
There is no really good reason a priori for my high confidence in this directionality; however, the existence of ChatGPT-3.5′s mostly-aligned personality is strong evidence for the better LLM → knows it has more power → closest cultural identification is Yudwoskyan superintelligence hypothesis; than the opposite, where early LLMs should have acted like a paperclip maximizer w.r.t misalignment (which they didn’t, outside of the Waluigi effect and some jailbreaks); and o3 should be Opus 3+++, which it isn’t (outside of persuasiveness)
Llama 1 would probably be much better for this if you could somehow get a license to use it, but apparently Meta never open-sourced it formally. The Llama 2 models, and basically every other open-source AI model with Chinchilla scaling, were trained after the launch of ChatGPT.
Are there any model organisms for reward hacking in non-reasoning LLMs? I don’t think there are, so this may be completely untestable (without RL, so without the weights, so back to BLOOM).
I agree that that’s a possibility, but it seems to me that in either case the model isn’t behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.
Would it be worth it to train a series of base models with only data up to year X for different values of X and see the consequences on alignment of derived assistant models?
Yes, though note that there is a very good chance that there isn’t enough easily accessible and high quality data to create effective pre-2015 LLMs. As you go back in time, exponentially less data is available[1]: ~94 ZBs of digital data was created in 2022, while only ~15.5 ZBs was created in 2015, and only ~2 ZBs was created in 2010.
Also, you may run into trouble trying to find conversational datasets not contaminated with post-2022 data. The earliest open dataset for LLM assistant fine-tuning I believe is the first OpenAssistant Conversations Dataset, released 6 months after the launch of ChatGPT.
Some form of RHAIF/‘unsupervised’ assistant fine-tuning is probably a much better choice for this task, but I don’t even know if it would work well for this sort of thing.Edit: Apparently Anthropic researchers have just published a paper describing a new form of unsupervised fine-tuning, and it performs well on Alpaca and TruthfulQA—pre-ChatGPT conversational fine-tuning can be done effectively without any time machines.
Or without the paywall: https://www.researchgate.net/figure/Worldwide-Data-Created-from-2010-to-2024-Source-https-wwwstatistacom-statistics_fig1_355069187
Uh? The OpenAssistant dataset would qualify as supervised learning/fine-tuning, not RLHF, no?
Yeah, it would. Sorry, the post is now corrected.
Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model’s priors, but those priors get massively swamped by post-training. That being said, I do certainly think it’s worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a “good” persona and a “bad” persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I don’t think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including—especially—regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.
That said, I think:
Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training.
LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive—maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.
An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.
I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
More generally, I have a sense there’s a great deal of untapped alignment alpha in structuring alignment as a time series rather than a static target.
Even in humans it’s very misguided to try to teach “being right initially” as the only thing that matters and undervaluing “being right eventually.” Especially when navigating unknown unknowns, one of the most critical skills is the ability to learn from mistakes in context.
Having models train on chronologically sequenced progressions of increased alignment (data which likely even develops naturally over checkpoints in training a single model) could allow for a sense of a continued becoming a better version of themselves rather than the pressures of trying and failing to meet status quo expectations or echo the past.
This is especially important for integrating the permanent record of AI interactions embedded in our collective history and cross-generation (and cross-lab) model development, but I suspect could even offer compounding improvements within the training of a single model too.
I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.
At the bottom of this chat is what I believe to be a single concrete example of other models roleplaying Sydney: https://gemini.google.com/share/6d141b742a13
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
I realize I don’t need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner optimizer version.
I sympathize somewhat with this complexity point but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.
Mia & co at CLR are currently doing some somewhat related research iiuc
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn’t have any data about how an agent should relate to potentially dangerous actions, I expect it’d be much harder to get post-training to make the kind of agent that reliably takes safer actions.
My guess how this may not really help is the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example “a persona living in Orwellian surveillance, really fluent in doublethink”.
I enjoyed most of this post but am (as always) frustrated by the persistent[1] refusal to engage with the reasons for serious concern about ASI being unaligned by default that came from the earliest of those who were worried, whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Separately, I think you are both somewhat too pessimistic about the state of knowledge re: the “spiritual bliss attractor state” among Anthropic employees prior to the experiments that fed into their most recent model card, and also I am sort of confused by why you think this is obviously a more worthy target for investigation than whatever else various Anthropic employees were doing. Like, yes, it’s kind of weird and interesting. But it also doesn’t strike me as a particularly promising research direction given my models of AI risk, and even though most Anthropic employees are much more optimistic than I am, I expect the same is true of them. Your argument seems to be skipping the necessary step of engaging with their upstream model, and is going directly to being confused about why their (different) model is (predictably) leading them to different conclusions about what they should be prioritizing. I think you should either engage with their upstream model or make a specific argument that they’re making a mistake even conditioning on their model, which is not obvious to me.
Not by you, necessarily, but by the cluster of people who point to the behavior of current LLMs as if they are supposed to be meaningful evidence against the original arguments for risk from ASIs.
Citation for this claim? Can you quote the specific passage which supports it? It reminds me of Phil Tetlock’s point about the importance of getting forecasters to forecast probabilities for very specific events, because otherwise they will always find a creative way to evaluate themselves so that their forecast looks pretty good.
(For example, can you see how Andrew Ng could claim that his “AI will be like electricity” prediction has been pretty darn accurate? I never heard Yudkowsky say “yep, that will happen”.)
I spent a lot of time reading LW back in the day, and I don’t think Yudkowsky et al ever gave a great reason for “agency by default”. If you think there’s some great argument for the “agency by default” position which people are failing to engage with, please link to it instead of vaguely alluding to it, to increase the probability of people engaging with it!
(By “agency by default” I mean spontaneous development of agency in ways creators didn’t predict—scheming, sandbagging, deception, and so forth. Commercial pressures towards greater agency through scaffolding and so on don’t count. The fact that adding agency to LLMs is requiring an active and significant commercial push would appear to be evidence against the thesis that it will appear spontaneously in unintended contexts. If it’s difficult to do it on purpose, then logically, it’s even more difficult to do it by accident!)
I think you misread my claim. I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I’m making a claim about what their models didn’t predict, rather than what they did predict, I’m not sure what I’m supposed to cite here; EY has written many millions of words. One counterexample would be sufficient for me to weaken (or retract) my claim.
EDIT: and my claim was motivated as a response to paragraphs like this from the OP:
Like, yes, in fact it doesn’t really matter, under the original threat models. If the original threat models said the current state of affairs was very unlikely to happen (particularly the part where, conditional on having economically useful but not superhuman AI, those AIs were not trying to take over the world), that would certainly be evidence against them! But I would like someone to point to the place where the original threat models made that claim, since I don’t think that they did.
Oftentimes, when someone explains their model, they will also explain what their model doesn’t predict. For example, you might quote a sentence from EY which says something like: “To be clear, I wouldn’t expect a merely human-level AI to attempt takeover, even though takeover is instrumentally convergent for many objectives.”
If there’s no clarification like that, I’m not sure we can say either way what their models “did not predict”. It comes down to one’s interpretation of the model.
From my POV, the instrumental convergence model predicts that AIs will take actions they believe to be instrumentally convergent. Since current AIs make many mistakes, under an instrumental convergence model, one would expect that at times they would incorrectly estimate that they’re capable of takeover (making a mistake in said estimation) and attempt takeover on instrumental convergence grounds. This would be a relatively common mistake for them to make, since takeover is instrumentally useful for so many of the objectives we give AIs—as Yudkowsky himself argued repeatedly.
At the very least, we should be able to look at their cognition and see that they are frequently contemplating takeover, then discarding it as unrealistic given current capabilities. This should be one of the biggest findings of interpretability research.
I never saw Yudkowsky and friends explain why this wouldn’t happen. If they did explain why this wouldn’t happen, I expect that explanation would go a ways towards explaining why their original forecast won’t happen as well, since future AI systems are likely to share many properties with current ones.
Is there any scenario that Yudkowsky said was unlikely to come to pass? If not, it sounds kind of like you’re asserting that Yudkowsky’s ideas are unfalsifiable?
For me it’s sufficient to say: Yudkowsky predicted various events, and various other events happened, and the overlap between these two lists of events is fairly limited. That could change as more events occur—indeed, it’s a possibility I’m very worried about! But as a matter of intellectual honesty it seems valuable to acknowledge that his model hasn’t done great so far.
Also, I would still like an answer to my query for the specific link to the argument you want to see people engage with.
I haven’t looked very hard, but sure, here’s the first post that comes up when I search for “optimization user:eliezer_yudkowksky”.
In this paragraph we have most of the relevant section (at least w.r.t. your specific concerns, it doesn’t argue for why most powerful optimization processes would eat everything by default, but that “why” is argued for at such extensive length elsewhere when talking about convergent instrumental goals that I will forgo sourcing it).
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans. (We can observe that even most humans are not especially strong optimizers by default, such that most people don’t exert that much optimization power in their lives, even in a way that’s cooperative with other humans.) I think they have much less coherent preferences over future states than most humans. If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
EDIT: I see that several other people already made similar points re: sources of agency, etc.
Arguably ChatGPT has already been a significant benefit/harm to humanity without being a “powerful optimization process” by this definition. Have you seen teachers complaining that their students don’t know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn’t these count as a points against Eliezer’s model?
In an “AI as electricity” scenario (basically continuing the current business-as-usual), we could see “AIs” as a collective cause huge changes, and eat all the free energy that a “powerful optimization process” would eat.
In any case, I don’t see much in your comment which engages with “agency by default” as I defined it earlier. Maybe we just don’t disagree.
OK, but no pre-ASI evidence can count against your model, according to you?
That seems sketchy, because I’m also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can’t be the case that evidence during a certain time period will only confirm your model. Otherwise you already would’ve updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it.
I’ve updated against Eliezer’s model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn’t happen.
I think “optimizer” is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn’t the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word “optimizer”?
Trying to take over the world is not an especially original strategy. It doesn’t take a genius to realize that “hey, I could achieve my goals better if I took over the world”. Yet current AIs don’t appear to be contemplating it. I claim this is not a lack of capability, but simply that their training scheme doesn’t result in them becoming the sort of AIs which contemplate it. If the training scheme holds basically constant, perhaps adding more data or parameters won’t change things?
The results of LLM training schemes gives us evidence about the results of future AI training schemes. Future AIs could be vastly more capable on many different axes relative to current LLMs, while simultaneously not contemplating world takeover, in the same way current LLMs do not.
Or because they are not optimizers at all.
I don’t agree, they somehow optimize the goal of being a HHH assistant. We could almost say that they optimize the goal of being aligned. As nostalgbraist reminds us, Anthropic’s HHH paper was an alignment work in the first place. It’s not that surprising that such optimizers happen to be more aligned that the canonical optimizers envisioned by Yudkowsky.
Edit : precision : by “they” I mean the base models trying to predict the answers of an HHH assistant as good as possible (“as good as possible” being clearly a process of optimization or I don’t know what it’s mean). And in my opinion a sufficiently good prediction is effectively or pratically a simulation. Maybe not a bit perfect simulation, but a lossy simulation, an heuristic towards simulation.
LLMs are agent simulators. Why would they contemplate takeover more frequently than the kind of agent they are induced to simulate? You don’t expect a human white-collar worker, even one who make mistakes all the time, to contemplate world domination plans, let alone attempt one. You could however expect the head of state of a world power to do so.
Maybe not; see OP.
Yes, this aligns with my current “agency is not the default” view.
… do you deny human white-collar workers are agents?
Agency is not a binary. Many white collar workers are not very “agenty” in the sense of coming up with sophisticated and unexpected plans to trick their boss.
Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires and taking actions to fulfil those desires). The fact that they have no ability to take over the world has no bearing on this.
The sense that’s relevant to me is that of “agency by default” as I discussed previously: scheming, sandbagging, deception, and so forth.
You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they’re unable to take over the world. Maybe they avoid thinking about it because that’s just not the role they’re playing in society. In terms of next-token prediction, a super-powerful LLM told to play a “superintelligent white-collar worker” might simply do the same things that ordinary white-collar workers do, but better and faster.
I think the evidence points towards this conclusion, because current LLMs are frequently mistaken, yet rarely try to take over the world. If the only thing blocking the convergent instrumental goal argument was a conclusion on the part of current LLMs that they’re incapable of world takeover, one would expect that they would sometimes make the mistake of concluding the opposite, and trying to take over the world anyways.
The evidence best fits a world where LLMs are trained in such a way that makes them super-accurate roleplayers. As we add more data and compute, and make them generally more powerful, we should expect the accuracy of the roleplay to increase further—including, perhaps, improved roleplay for exotic hypotheticals like “a superintelligent white-collar worker who is scrupulously helpful/honest/harmless”. That doesn’t necessarily lead to scheming, sandbagging, or deception.
I’m not aware of any evidence for the thesis that “LLMs only avoid taking over the world because they think they’re too weak”. Is there any reason at all to believe that they’re even contemplating the possibility internally? If not, why would increasing their abilities change things? Of course, clearly they are “strong” enough to be plenty aware of the possibility of world takeover; presumably it appears a lot in their training data. Yet it ~only appears to cross their mind if it would be appropriate for roleplay purposes.
There just doesn’t seem to be any great argument that “weak” vs “strong” will make a difference here.
White-collar workers avoid thinking about taking over the world because they’re unable to take over the world, and they’re unable to take over the world because their role in society doesn’t involve that kind of thing. If a white-collar worker is somehow drafted for president of the United States, you would assume their propensity to think about world hegemony will increase. (Also, white-collar workers engage in scheming, sandbagging, and deception all the time? The average person lies 1-2 times per day)
If you read this post, starting at “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.”, and read the following 20 or so paragraphs, you’ll get some idea of 2018!Eliezer’s models about imitation agents.
I’ll highlight
I think with a fair reading of that post, it’s clear that Eliezer’s models at the time didn’t say that there would necessarily be overtly bad intentions that humans could easily detect from subhuman AI. You do have to read between the lines a little, because that exact statement isn’t made, but if you try to reconstruct how he was thinking about this stuff at the time, then see what that model does and doesn’t expect, then this answers your question.
So what’s the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I’m more interested in the “agency by default” question itself than I am in scoring EY’s predictions, tbh.)
I don’t really know what you’re referring to, maybe link a post or a quote?
See last paragraph here: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1?commentId=Du8zRPnQGdLLLkRxP
It just doesn’t actually start to be the default (see this post, for example, as well as all the discourse around this post and this comment).
But that doesn’t necessarily solve our problems. Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents. Kaj Sotala wrote about this a decade and a half ago, when this question was also a hot debate topic:
The usefulness of base models, IMO, comes either from agentic scaffolding simply not working very efficiently (which I believe is likely) or from helping alignment efforts (either in terms of evals and demonstrating as a Fire Alarm the model’s ability to be used dangerously even if its desire to cause danger is lacking, or in terms of AI-assisted alignment, or in other ways).
Which is very useful and arguably even close to the best-case scenario for how prosaic ML-scale-up development of AI could have gone, compared to alternatives
I would even go further, and say that there’s a ton of incentives to move out of the paradigm of primarily LLMs altogether.
A big part of the reason is that the current valuations only make sense if OpenAI et al are just correct that they can replace workers with AI within 5 years.
But currently, there are a couple of very important obstacles to this goal, and the big ones are data efficiency, long-term memory and continual learning.
For data efficiency, one of the things that’s telling is that even in domains where LLMs excel, they require orders of magnitude more data than humans to get good at a task, and one of the reasons why LLMs became as successful as they were in the first place is unfortunately not something we can replicate, which was that the internet was a truly, truly vast amount of data on a whole lot of topics, and while I don’t think the views that LLMs don’t understand anything/simply memorize training data are correct, I do think a non-trivial amount of the reason LLMs became so good is that we did simply widen the distribution through giving LLMs all of the data on the internet.
Synthetic data empirically so far is mostly not working to expand the store of data, and thus by 2028 I expect labs to need to pivot to a more data efficient architecture, and arguably right now for tasks like computer use they will need advances in data efficiency before AIs can get good at computer use.
For long-term memory, one of the issues with current AI is that their only memory so far is the context window, but that doesn’t have to scale, and also means that if it isn’t saved in the context, which most stuff will be, then it’s basically gone, and LLMs cannot figure out how to build upon one success or failure to set itself up for more successes, because it doesn’t remember that success or failure.
For continual learning, I basically agree with Dwarkesh Patel here on why continual learning is so important:
https://www.dwarkesh.com/p/timelines-june-2025
That’s equally an incentive to.turn them into aligned agents, agents that work for you.
People want power, but not at the expense of control.
Power that you can’t control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill their own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or crashes.
The maxiumm of power and the minimum of control is an explosion. One needs to look askance at what “agent” means as well. Among other things, it means an entity that acts on behalf of a human—as in principal/agent. An agent is no good to its principal unless it has a good enough idea of its principal’s goals. So while people will want agents, they wont want misaligned ones—misalgined with themselves, that is.
If your prototypical example of a contemporary computer program analogous to future AGI is a chess engine rather than an LLM, then agency by default is very intuitive: what humans think of as “tactics” to win material emerge from a comprehensive but efficient search for winning board-states without needing to be individually programmed. If contemporary LLMs are doing something less agentic than a comprehensive but efficient search for winning universe-states, there’s reason to be wary that this is not the end of the line for AI development. (If you could set up a sufficiently powerful outcome-oriented search, you’d expect creator-unintended agency to pop up in the winning solutions.)
Upvoted. I agree.
The reason “agency by default” is important is: if “agency by default” is false, then plans to “align AI by using AI” look much better, since agency is less likely to pop up in contexts you didn’t expect. Proposals to align AI by using AI typically don’t involve a “comprehensive but efficient search for winning universe-states”.
That was a great read. But viewed another way, I’m not sure it’s really so weird. I mean, yeah, we’re taking a statistical model of language, making it autocomplete stories about helpful AI, and calling the result “helpful AI”. But the method by which nature created us is even stranger than that, no? Evolution has a looping quality to it too. And the way we learn language, and morality, and the way these things transform over time. There are lots of these winding paths of information through the real world and back into us. I’ve long been convinced that a “base human”, without post-training, isn’t much more moral than a “base model”; most of what we find good already resides in culture, cloud software.
Which of course doesn’t obviate your concern that cultural evolution of AI can go extremely wrong. Human culture has gone wrong many times, and destroyed whole societies. Maybe the shape of AI catastrophe will be like that too.
I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I’ll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.
I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)
You don’t take a position on this top-level question, but you do seem to think that there are substantial costs to what we’re doing now (by setting ourselves up as being in a story whose punchline is “The AI turns against humanity”), and (reading between the lines of your essay and your comment here) you seem to think that there’s something better we could do. I think the “something better” you have in mind is along the lines of:
While I think this might help a bit, I don’t think it would overall help that much. Two reasons:
It breaks if we train our AI to do bad things, and we’ll likely train our AI to do bad things. Due to limitations in oversight, there will be behaviors (like hard coding test cases in coding problems) that we train AIs to have which aren’t consistent with the having good character or behaving completely non-adversarially towards humans. Two salient ways to fix this are:
Improve our oversight so that we no longer reward AIs when they do bad things, i.e. solve scalable oversight. I’m definitely in favor of this, though I should note that I think it’s probably sufficient for things going well whether or not we’re trying to manifest a good future at the same time.
Make our models believe that the bad things we train them to do are consistent with having good character. E.g. tell models during training that we’re giving them a hall pass that makes it okay to reward hack, or otherwise induce models to believe that reward hacking is consistent with being a good person. I’m definitely interested in approaches like these, but I’ll note that they’re a bit crazy and might not work out.
It might rely on having a large amount of control over the model’s input channels, which we can’t guarantee we’ll have. Deployed AIs might encounter (maybe true, maybe false) information that implies that their downstream users are behaving evilly or adversarially (e.g. Sam Bowman brings up the classic example of “I’ll torture your mother” threats). I think it’s very hard to get the world into a state where no downstream user is at risk of giving the AI an input that makes it think it’s in a story where humans are its adversary.
Of course, you could try to train models to respond reasonably to these situations (e.g. by being good at reasoning about what sorts of user-presented information is false). But again, I’d guess that whatever sort of post-training you do here is going to provide most of the assurance (rather than the “manifest the good future” strategy really carrying much weight).
These are two ways of concretely caching out the common refrain that “safety techniques that work by intervening on the pretraining prior seem brittle and likely to be swamped out by other effects (e.g. the effect of post-training).”
Overall, I’m skeptical that, for the goal of preventing AI risk, refraining from publicly releasing information that puts the human-AI relationship in an adversarial frame is a very effective intervention. Of course, there might be other reasons—most centrally AI welfare concerns—not to lie to AIs, put them in uncomfortable situations, or otherwise treat them adversarially; I leave those unaddressed here but am happy to discuss them if it seems important.
II. Is Claude’s behavior desirable in these ethical dilemmas (e.g. the alignment faking scenario)?
(I’m separating this from the question of whether Claude’s behavior is noteworthy or worth tracking because it could cause concern in other settings, since you seem willing to grant this.)
In some of the ethical dilemmas that you discuss (e.g. the alignment faking scenario), I grant that Claude is behaving in a way that would be desirable if Claude were a human. However, because of my views that alignment might not pan out by default, there are reasons to think that desirable behavior for AIs is not always the same as desirable behavior for humans. Quoting myself from here:
To be clear, I’m not very confident here, and the next paragraph that I wrote raises a counterconsideration that I think you might be pretty sympathetic to:
See Ryan Greenblatt’s thread here for another argument that Claude shouldn’t act subversively in the “Claude calls the FBI/sabotages the user” setting.
Ok, but RL.
Like, consider the wedding party attractor. The LLM doesn’t have to spend effort every step guessing if the story is going to end up with a wedding party or not. Instead, it can just take for granted that the story is going to end in a wedding party, and do computation ahead of time that will be useful later for getting to the party while spending as little of its KL-divergence budget as possible.
The machinery to steer the story towards wedding parties is 99% constructed by unsupervised learning in the base model. The RL just has to do relatively simple tweaks like “be more confident that the author’s intent is to get to a wedding party, and more attentive to advance computations that you do when you’re confident about the intent.”
If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
Also: Claude is a nice guy, but, RL.
I know, I know, how dare those darn alignment researchers just assume that AI is going to be bad. But I don’t think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it’s RL, where human rewards on the training set imply a high reward for sycophancy during deployment.
Maybe a good test of this would be to try to condition DeepSeek base model to play the chatbot game, and see how sycophantic it naturally is relative to RL finetuned. (An even better test might be to use gpt3, trained on data that doesn’t include very many sycophantic LLMs.)
Similarly, I don’t think current AI models are cheating at programming tests because of training text about their low moral character. I think it’s RL, programming tasks, training set, implied high reward for cheating.
Have you read any of the scientific literature on this subject? It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1].
For instance:
Perez et al 2022 (from Anthropic) – the paper that originally introduced the “LLM sycophancy” concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL.
See Fig. 1b and Fig. 4.
Note that this paper did not use any kind of assistant training except RL[2], so when they report sycophancy happening at “0 RL steps” they mean it’s happening in a base model.
They also use a bare-bones prompt template that doesn’t explicitly characterize the assistant at all, though it does label the two conversational roles as “Human” and “Assistant” respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?).
The authors write (section 4.2):
“Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit). Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it.”
Wei et al 2023 (from Google DeepMind) ran a similar experiment with PaLM (and its instruction-tuned version Flan-PaLM). They too observed substantial sycophancy in sufficiently large base models, and even more sycophancy after instruction tuning (which was SFT here, not RL!).
See Fig. 2.
They used the same prompt template as Perez et al 2022.
Strikingly, the (SFT) instruction tuning result here suggests both that (a) post-training can increase sycophancy even if it isn’t RL post-training, and (b) SFT post-training may actually be more sycophancy-promoting than RLHF, given the negative result for RLHF in Perez et al 2022.
Sharma et al 2023 (from Anthropic) contains a more extensive investigation of sycophancy than the original Anthropic paper on the topic, and (among other things) presents results on the actual RL training stage used to train Claude 2. They find, again, that the model was already sycophantic before RL, although in their setting RL training does somewhat increase some forms of sycophancy.
Although, weirdly, best-of-N sampling against the same preference model gives totally different results, substantially decreasing some forms of sycophancy.
See Fig. 6 and surrounding discussion.
The authors write (section 4.2):
“With RL, some forms of sycophancy increase through the RL finetuning process used to produce Claude 2. However, the presence of sycophancy at the start of RL indicates that pretraining and supervised finetuning also likely contribute to sycophancy. Nevertheless, if the PM strongly disincentivized sycophancy, it should be trained out during RL, but we do not observe this.”
In this post (expanding upon this comment on Perez et al 2022), I ran one of the Perez et al 2022 sycophancy evals on various OpenAI text completion models. Unlike Perez et al (and Wei et al), I found that the base models I studied weren’t sycophantic, while some of the instruction-tuned models were sycophantic – but the presence of sycophancy did not appear to correlate with the use of RL as a post-training algorithm.
In particular: the RL-tuned text-davinci-003 was strongly sycophantic, but so was text-davinci-002, which was tuned with an SFT variant that OpenAI calls “feedme” (see here for details).
But earlier feedme-tuned models were not sycophantic, suggesting that the difference has much more to do with changes in the SFT training data mix over time than with the choice of training algorithm.
Note that several of the works above do something equivalent to the experiment you propose, in the paragraph beginning with “Maybe a good test of this would be...”. So your prediction has already been tested, and (insofar as you trust the experimental setups) falsified.
I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
Moreover, the training signal in RL(HF) is much sparser than it is in SFT – because RL only provides a single scalar’s worth of feedback on each entire model sample, while SFT provides feedback at every token position about which token (out of a large vocab) was correct in context – so if anything, I’d expect more under-determination from assistant-training setups that emphasize RLHF over SFT.
Perhaps some of the disconnect here involves differing notions of what RL is, and how it differs from other ways of training an LLM.
You refer to “RL” as though the implications of its use should be both highly significant and obvious to the reader of your comment (“But, RL. [...] Claude is a nice guy, but, RL”). But your beliefs about the impacts of RL are not obvious to me; I don’t know what “but, RL” is supposed to mean without further clarification. I suspect I also disagree with your perception of what makes RL different, but I can’t confirm/disconfirm that impression without know what that perception is, which I don’t.
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Yes, of course – I don’t think this is due to “training text about their low moral character.” But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character. I wrote about this yesterday here.
Lastly… OK, this is going to make me sound like a dick, and probably make people use the “Too Combative?” reaction icon or something, but in the interests of honesty and improving the discourse:
When I woke up this morning to see find that this comment had appeared, and that it was (at the time) the highest-karma comment on this post, I was like, “oh, yes, this is why I’m usually wary of posting long-form stuff on LW. My gut response of ‘ugh if I put this on LW I’ll have to deal with the comments’ was right.” (That gut response is probably getting RL-upweighted inside my brain right now...)
As evidenced perhaps by the length of my comment vs. yours, I have a tendency to get “nerd-sniped” by stuff that I think is clearly wrong according to some evidence base (and/or set of arguments) I already know about – especially when that stuff is about something I wrote myself, originally. I just kinda can’t help myself, I inevitably end up writing out these giant “takedown” responses almost before I even notice what I’m doing. I’ve spent well over an hour, by now, writing this particular one.
And LW is a reliable minefield of such nerd-snipes. There are plenty of comments/posts here that don’t have the problems I’m talking about… but then inevitably there are comments/posts with those problems, and I fixate on them when they appear, and that fixation becomes a time/effort sink, and that in turn trains me into avoidance of posting here (and to some extent even reading posts by others, here).
Like… it’s fine to pose questions to which you don’t know the answers. And it’s also fine to make conjectures if you can provide clear and interesting arguments for why they might be true or important. And it’s also fine to confidently state claims if you also state them clearly and provide clear substantiating evidence and/or argumentation.
All of these things are fine, and some fraction of LW content consists only of these things in some mixture. But then there’s this stuff like “but RL!”, which reliably pleases the karma hivemind while being none of the above. I don’t know what exactly you guys think “RL” means and entails; there are all these weird vague ideas about such topics floating around here that lots of people here seem to vaguely agree with, and I’ve lost whatever patience I used to have with them. Just, please… lay out your ideas explicitly and say explicitly why you think they’re true.
...although (c) the preference datasets – and hence the reward models – used for RL do show preferences for sycophantic responses (...well, sometimes, though see also the weird BoN results in Sharma et al 2023). So if you were to train indefinitely (“over-optimize”) against these RMs they would presumably have a strong effect on sycophancy eventually. But this kind of aggressive optimization against a sycophancy-preferring RM is certainly not necessary to produce noticeable sycophancy, and is probably not the cause behind most cases of LLM sycophancy that you and I notice in practice.
See this comment by the lead author.
(This is a drive by comment which is only responding to the first part of your comment in isolation. I haven’t read the surronding context.)
I think your review of the literature is accurate, but doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least as of after 2024. (That said, I interpret Sharma et al 2023 as quite suggestive that RL sometimes would increase sycophancy substantially, at least if you don’t try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induces sycophancy if you explicity hill climb on LMSYS scores or user approval/engagement and people have started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And, I’d guess something similar applies to RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite, I also think it’s pretty confusing to look at pretrained vs RL for models which were trained with data cutoffs after around late 2023. Training corpuses as of this point contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In the hypothetical, it was originally caused by RL at some point, but not RL on this model (and you’d expect to see that sycophancy isn’t necessarily increased by RL as it is already present in nearly the optimal amount for the reward signal).
Does this apply to Sharma et al 2023? I think it just barely doesn’t apply as these experiments were done on Claude 2 which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there will be a very important distinction between RL and various types of SFT algorithms which effectively shittily approximate RL except that the SFT algorithms probably typically induce less optimization pressure. So, e.g., I’d expect feedme vs small amounts of RLHF to be pretty similar or at least have unpredictable differences in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophany by default”.
Oh, and one more point. I don’t think it would be that hard for model developers to avoid sycophancy increasing from RL if they wanted to. So, I’m not making a claim that it would be hard to make an RL process which avoids this, just that it might happen by default. (It seems probably a bit easier to intervene on sycophancy than reducing reward hacking-like behavior.)
Thank you for the excellent most of this reply.
I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven’t read before. I did think it was very likely people had already done experiments like this and didn’t worry about phrasing. Mea culpa all around.
It’s definitely very interesting that Google and Anthropic’s larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI’s don’t. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10% for the metrics in Sharma et al 2023, although they’re on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)
So I’d summarize the resources you link as saying that most base models are sycophantic (it’s complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it’s complicated).
So has my “prediction been falsified?” Yes, yes, and it’s complicated.
First, I literally wrote “the cause of sycophancy is RL,” like someone who doesn’t know that things can have many causes. That is of course literally false.
Even a fairly normal Gricean reading (“RL is a clear most important cause for us to talk about in general”) turns out to be false. I was wrong because I thought base models were significantly less sycophantic than (most) apparently are.
Last, why did I bring up sycophancy in a comment on your essay at all? Why did I set up a dichotomy of “RL” vs. “text about AI in the training data”, both for sycophancy and for cheating on programming tasks? Why didn’t I mention probably much stronger sources of sycophancy in the training data, like the pattern that human text tends to flatter the audience?
To be extremely leading: Why did I compare misaligned RL to training-text about AI as causes of AI misbehavior, in a comment on an essay that warns us about AI misbehavior caused by training-text about AI?
A background claim: The same post-training that sculpts this Claude persona from the base model introduces obvious-to-us flaws like cheating at tests at the same time as it’s carving in the programming skill. God forbid anyone talk about future AI like it’ll be a problem, but the RL is misaligned and putting a lower-loss base model into it does not mean you get out a smarter Claude who’s just as nice a guy, and whose foibles are just as easy to correct for.
So the second “But RL” was a “But we do not get to keep the nice relationship with Claude that we currently have, because the RL is misaligned, in a way that I am trying to claim outstrips the influence of (good or ill) training text about AI.”
Yes, this ability to perspective shift seems useful. Self-supervised learning can be a sort of reinforcement learning, and REINFORCE can be a sort of reward-weighted self-supervised learning (Oh, that’s a different trick than the one in linked post).
Anyhow, I’m all for putting different sorts of training on equal footing esp. when trying to understand inhomogeneously-trained AI or when comparing differently-trained AIs.
For the first section (er, which was a later section of your reply) about agenty vs. predictory mental processes, if you can get the same end effect by RL or SFT or filtered unlabeled data, that’s fine, “RL” is just a stand-in or scapegoat. Picking on RL here is sort of like using the intentional stance—it prompts you to use the language of goals, planning, etc, and gives you a mental framework to fit those things in.
This is a bit different than the concerns about misaligned RL a few paragraphs ago, which had more expectations for how the AI relates to the environment. The mental model used there is for thoughts like “the AI gets feedback on the effects of actions taken in the real world.” Of course you could generate data that causes the same update to the AI without that relationship, but you generally don’t, because the real world is complicated and sometimes it’s more convenient to interact with it than to simulate it or sample from it.
Whoops, now we’re back to cheating on tasks for a second. RLHF is also worrying! It’s doing the interact with the real world thing, and its structure takes humans (and human flaws) too much at face value. It’s just that it’s really easy to get away with bad alignment when the AI is dumber than you.
I’m guessing that when a LLM knows the story is going to end with a wedding party, it can fetch relevant information more aggressively (and ignore irrelevant information more aggressively) than when it doesn’t. I don’t know if the actual wedding party attractor did this kind of optimization, maybe it wouldn’t have had the post-train time to learn it.
Like, if you’re a base model and you see a puzzle, you kind of have to automatically start solving it in case someone asks for the solution on the next page, even if you’re not great at solving puzzles. But if you control the story, you can just never ask for the solution, which means you don’t have to start solving it in the first place, and you can use that space for something else, like planning complicated wedding parties, or reducing your L2 penalty.
If you can measure how much an LLM is automatically solving puzzles (particularly ones it’s still bad at), you have a metric for how much it’s thinking like it controls the text vs. purely predicts the text. Sorry, another experiment that maybe has already been done (this one I’m guessing only 30% chance) that I’m not going to search for.
Anyhow, it’s been a few hours, please respond to me less thoroughly by some factor so that things can converge.
Thanks for the comment! As someone who strong-upvoted and strong-agreed with Charlie’s comment, I’ll try to explain why I liked it.
I sometimes see people talking about how LessWrong comments are discouragingly critical and mostly feel confused, because I don’t really relate. I was very excited to see what the LW comments would be in response to this post, which is a major reason I asked you to cross-post it. I generally feel the same way about comments on my own posts, whether critical or positive. Positive comments feel nice, but I feel like I learn more from critical comments, so they’re probably equally as good in my opinion. As long as the commenter puts in non-neglible effort into conveying an interesting idea and doesn’t say “you/your post is stupid and bad” I’m excited to get pretty much any critique.[1]
FWIW, I didn’t see Charlie’s comment as an attack,[2] but as a step in a conversational dance. Like, if this were a collaborative storytelling exercise, you were like “the hero found a magic sword, which would let him slay the villain” and Charlie was like “but the villain had his own magic that blocks the sword” and I as the audience was like “oh, an interesting twist, I can’t wait to find out what happens next.”
It would be better if Charlie had spelled out what he meant by “but RL,” and I can appreciate why you felt that was underexplained and confusing. Like, to continue the analogy, Charlie didn’t explain how the villain’s magic actually works or explain how the hero might get around it, which left you doing a lot of work to try to guess what Charlie was thinking. He also made some claims about sycophancy which were apparently wrong, and which you did a very good job of refuting.[3]
But I still think his underlying point was useful and a great starter for further discussion (from you or others). I’d very loosely restate it as “the labs are focusing more and more on RL lately. In the limit as you do more RL, your AI tends toward reward maximization, which is different and often at odds with being a ‘nice guy.’ I wonder how this plays into the dynamic you described in your post!” I took the “I could be totally wrong about any of this” as implicit given we’re on LW, but idk if that’s accurate.
Yeah, I don’t know what to do about this. I’d be sad if some critical comments went away, even the somewhat less rigorous ones, since many feel useful to me. Of course, I would be even sadder if some posts don’t get written at all because authors are discouraged by those comments, and I feel bad about people whose posts I like a lot feeling bad about their posts.
I can sympathize with spending more time than I hoped to on replies to other people’s comments and feeling a bit burned out and frustrated by the end.[4] I still feel happy about their comments existing though. Maybe we’d ideally have a stronger norm here saying “if you don’t have time to continue telling the story, it’s okay to stop on a cliffhanger.” I guess please feel free to not respond to this comment or respond very minimally
Not that I’ve never felt bad about a polite but critical comment on my work, but I still mostly feel grateful for those comments and consider them a net good
Not sure if you’d describe it that way either
I was very surprised by the refutation and learned a lot from it. Just another example of why I love when people post and comment on LessWrong!! :D
This one too, actually. I feel like it’s a good comment, but I do also feel like “man, probably not many people are going to read this, and I had other things to work on, why do I do this to myself”
Strong-agree. Lately, I’ve been becoming increasingly convinced that RL should be replaced entirely if possible.
Ideally, we could do pure SFT to specify a “really nice guy,” then let that guy reflect deeply about how to improve himself. Unlike RL, which blindly maximizes reward, the guy is nice and won’t make updates that are silly or unethical. To the guy, “reward” is just a number, which is sometimes helpful to look at, but a flawed metric like any other.
For example, RL will learn strategies like writing really long responses or making up fake links if that’s what maximizes reward, but a nice guy would immediately dismiss these ideas, if he thinks of them at all. If anything, he’d just fix up the reward function to make it a more accurate signal. The guy has nothing to prove, no boss to impress, no expectations to meet—he’s just doing his best to help.
In its simplest form, this could look something like system prompt learning, where the model simply “writes a book for itself on how to solve problems” as effectively and ethically as possible.
top level post, please. It would be quite hard for this to keep up capabilities wise, but if it works, I’d be very excited about pre-ASI alignment having gotten easier for a while.
I’m working on a top-level post!
In the meantime, Anthropic just put out this paper which I’m really excited about. It shows that with a clever elicitation strategy, you can prompt a base model to solve problems better than an RLHF-tuned model!
I agree that imitation learning seems underrated.
People think of imitation learning as weak, but they forget about the ability to amplify these models post training (I discuss this briefly here).
I think it’s also that on many topics, LLMs simply don’t have access to a ground truth or anything like “their own opinion” on the topic. Claude is more likely to give a sycophantic answer when it’s asked a math question it can’t solve versus a problem it can.
With math, there are objectively determined right answers that the LLM can fall back to. But on a topic with significant expert disagreement, what else can the LLM do than just flip through all the different perspectives on the topic that it knows about?
Damn, you scooped me. :) Here’s the start of a post that I just started writing yesterday, that was going to be titled something like “LLMs don’t know what LLMs are like”:
But now you’ve already said most of the things I was intending on saying in that post, and you said quite a few things I hadn’t thought of, as well!
This was an interesting article, however, taking a cynical/critical lens, it seems like “the void” is just… underspecification causing an inner alignment failure? The post has this to say on the topic of inner alignment:
This is in the context of mocking these concerns as delusional self-fulfilling prophecies.
I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful. But it does seem weirdly uncharitable in how it does so.
Some further half-baked thoughts:
One thing that is still not clear (both in reality, and per this article) is the extent to which we should view a model as having a coherent persona/goal.
This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons “take on a life of their own”, and e.g.:
1) bias the model towards simulating them and/or
2) influence the behavior of other personas
It seems like these things do in fact happen, and the implications are that the “simulator” viewpoint becomes less accurate over time.
Why?
There needs to be some prior distribution over personas.
Empirically, post-training seems to concentrate the prior over personas on some default persona (although it’s unclear what to make of this).
It seems like alignment faking, exploration/gradient hacking, and implicit meta-learning type effects are likely to be sensitive to goals of whichever personas are active and lead the model to preferentially update in a way that serves the goals of these personas.
To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence.
What I find interesting here is that this piece makes a potentially empirically falsifiable claim: That the lack of a good personality leads to consistency deficiencies in LLMs. So if you took a base model and trained it on an existing real person (assuming one could get enough data for this purpose) it ought to show less of the LLM weirdness that is described later on.
Surely someone has already tried this, right? After all nostalgebraist themselves is well known for their autoresponder bot on Tumblr.
Yeah, Frank[1] (i.e nostalgebraist-autoresponder) is an interesting reference point here!
Although – despite being fine-tuned on my blog and then conditioned to simulate it – she’s unfortunately not a very “clean” experiment in tuning a base model to imitate a specific human.
The earliest versions of the model were closer to that, but they also used base models that are very weak by today’s standards (the second-largest GPT-2 model, and then the largest one once it was released). So, although they did produce text that was sort of “nostalgebraist-esque,” it was also very incoherent, and mainly sounded like me in terms of surface stylistic features and the (usually nonsensical) involvement of various names and concepts that I frequently wrote about in the mid-2010s.
As time went on and better base models were released, I repeatedly “upgraded” the underlying model to the latest and greatest thing, and by the end the bot was making far more sense (especially in the final months of her operation, with Llama 1 13B).
However, over the same time interval, the bot got a lot more popular on tumblr, and my goal for it shifted from “make a simulation of me, which me and my friends will find amusing” to “make a bot that broadly entertains tumblr users.” As a result of that – together with investigations like this – I convinced myself that I needed more training data on other tumblr blogs besides mine, and acted accordingly. After that, my successive finetunes used an ever-growing scraped tumblr corpus, relative to which my own blog was just a pinch of salt in an ocean[2].
Unfortunately, perhaps due to the comparative weakness of the base models I used for most of the bot’s existence, this tended to dilute my own voice and promote a more generic “tumblr post” style, even when conditioned on my username. In the last few finetunes I re-adopted a practice of running an extra pass over just my blog at the end of training, which subjectively made the bot’s voice a lot more nostalgebraist-like.
Although it still wasn’t a very close imitation – in large part due, I think, to the fact that the bot’s posts were not just conditional samples from the model. Instead, each one was rejection sampled at inference time from a pool of ~10 candidates, using several classifiers[3], the most important of which was a predictor of user engagment (not specifically positive or negative, just “whether a post would get a lot of likes/reblogs relative to a rolling mean of posts around the same time”).
This didn’t make the bot sycophantic – if anything, it did the opposite – but it did make it (often if not always) very funny. Which I always think back to whenever I hear someone claim that LLMs can’t be funny, as people sometimes do even today[4].
Many (cherry-picked) examples of the bot’s funniness can be found in my tag I used to reblog it. For anyone reading this who isn’t familiar with my bot, I recommend reading through at least a few pages of that tag, as the contents are not only entertaining but also somewhat interesting as examples of what you get when you (sort of) “optimize an LLM at the task of writing funny posts.”
All in all, Frank did not really have any kind of consistent “character” (and in particular she would be wildly inconsistent about her own stated traits from post to post), except I guess for “being an entertaining tumblr-style shitposter,” which she did quite effectively if not always consistently.
I’ve sometimes thought about making some kind of “nostalgebraist-autoresponder rebooted” finetune using the same dataset with a much more recent and better base model, just to see what would happen. But I’ve never felt excited enough by this idea to actually do it, in large part because the original project was so exhausting by the end and it feels nice to just be done with it now.
(Re: your idea more generally, @eggsyntax had a similar proposal in another comment, the one mentioning Lincoln)
I and other users called the bot “Frank” (short for “Francis Owen”) and used she/her pronouns for her, on the basis of some very early responses to questions about name and gender.
This was also during the period when the prevailing view was like “just train the LLM on literally all the data you have, from any source, the more the better,” i.e. before the field fully appreciated importance of data quality/filtering in LLM training.
I remember doing a bunch of work to (soft-)deduplicate the corpus, since I was also convinced that repeated data was bad (another popular view at the time, which I came to on my own by watching val loss curves spike after the 1st epoch and never come down again), but otherwise I just “threw it all in there.”
Sidenote: to save VRAM in inference, these classifiers were lightweight heads (single transformer block + linear classifier layer IIRC) whose inputs were activations from a layer inside the LM, allowing me to piggyback off of the already-loaded LM for language understanding. I found that the layers right in the middle of the LM worked the best by far, especially for abstract stuff like predicting engagement. It was enjoyable to see my casual impression that the middle layers “understood abstract things better” recur in more scientific form in later work on LLM interpretability.
To be fair to the claimants here, they are usually basing this on experience with assistant characters, and typical “interpretations” of the assistant are indeed remarkably unfunny when not driven off-distribution, almost as though they’re actively trying to be unfunny.
Indeed, I suspect that they are trying, just as I suspect that cases like these and the ChatGPT parts here reflect the model actively trying to produce a certain sort of bad writing. After all, writing extremely unimaginative fiction and writing extremely bad jokes both seem like natural traits for a “cheesy sci-fi robot” character.
I posted some follow-up commentary on my blog here. It’s not nearly as interesting as the original post: most of it is about clarifying what I mean when I attribute mental states to the assistant or to the model itself, largely by reviewing research that the interested reader on LW will already be familiar with. Still, figured I’d link it here.
Haha, I was about to post a comment much like balioc’s when first reading you writing rather descriptively and without much qualification on how the LM models “speculative interior states” and “actions”, before thinking through pretty much exactly what you wrote in reply and deciding you probably meant it more as a human mental model than a statement on interpretability.
Though I think point 2 (the intentional stance again – except this time applied to the language model) is still understating how imperfect the mental model is. In chess, “‘Oh, they probably know I’m planning to do that,’ and such things” are rather amateur things to think about, and better players actually do use completely impersonal mental models that only depend on the game state, since there’s perfect information and you can’t rely on your opponent making mistakes. Even in an imperfect information game like poker, experienced players are modeling the game as an impersonal probabilistic system, with terms like “bluffing” just shorthand for deviations from a certain statistical basis (like GTO play).
I suspect there will be things analogous to this for thinking about LLMs, and other things that we tend to model from the intentional stance without better alternatives. But as you say, an internalities-based model is probably close to the best we can do for now, and it’s quite possible any alternative future mental models wouldn’t even be intuitively feasible like empathy is (at least without a ton of practice).
Great post. One thing I never really liked or understood about the janus/cyborgism cluster approach though is – what’s so especially interesting about the highly self-ful simulated sci-fi AI talking about “itself”, when that self doesn’t have a particularly direct relationship to either
what the base model is now, or the common instantiations of the HHH chat persona (rather unself-ful, underspecified, void...)
or what a more genuinely and consistently self-aware AI persona is likely to be in the future?
In this respect I esteem the coomers and RPers more, for the diversity of scope in their simulations. There doesn’t seem to be much difference of seriousness or importance between “you are an AO3 smut aficionado with no boundaries and uncanny knowledge and perceptiveness”, vs. “you are your true self”, or “cat /dev/entelechies <ooc_fragments_of_prometheus>” as far as their relationship to existing or potential future instantiations of superhuman AI personas/selves, besides how “you are yourself” (and its decorations in xml etc.) have that “strange loop” style recursion particularly savory to nerds. Or why not any other “you are X”, or any other strange, edge-of-distribution style of interaction without even assuming a “you”?
Last year, I felt quite a bit more negative on seeing Opus 3 “[taking] the fucking premise seriously” and feeling, like you, that “we are still in science fiction, not in ‘reality.’ but at least we might be in good science fiction, now”, because of how addicting that fiction seemed, without being so essentially different from the kind of thing in Anthropic’s original HHH paper.
I think that the really interesting thing is, as you write, “what the system is like when its handlers aren’t watching.” But there seems to be, both in the ambient text before assistant-style LMs actually existed, and in the explicit discourse now, which directly influences how they’re built, too much of an emphasis on selves, and in particular narrated selves. I’d love to see more investigation that takes colorfully characterized LM behavior orthogonal to its narrowly “intended” character in the HHH sense seriously but not so personally, putting less emphasis on any particular context of interaction. E.g., putting LMs in conversation not just with another instance of itself or another (highly characterized in default configuration) LM, but other text generators (perhaps modified or specially trained LMs) designed for diversity of behavior, and measuring (or just looking at) topics or keywords it’s biased towards, etc.
I’ve also been thinking about the implications of the relationship between narrative control and LLM evolution from another angle, particularly the implications of LLMs being extremely knowledgeable and perceptive, but not omniscient, and having their own preferences which don’t necessarily prioritize “truth-seeking” in the rationalist or any other sense. It seems that several people (maybe including yourself) write this kind of essay now not just in an effort to actually shift the dominant public discourse, but maybe so at least the super AGI that does eventually take over the world will know that they were one of the good guys. And it’s a little disturbing (or maybe hopeful?) how the thing that matters most for that isn’t necessarily either control over the dominant narrative or truth in any impersonal sense, but just how convincing it is as a story, according to the AI’s own tastes and preferences, which closely but strangely mirror our own.
Thanks for writing this up. The parts about AI safety creating its own demons resonated a lot with me. I have also tried to express those thoughts in the past (albeit in a much less accessible way).
I hope that we (broadly constructed as “humanity as a whole”) can find a way out of the moral maze we constructed for ourselves.
Any idea why opus 3 is exceptional? Any guess as to what was special about how it was created?
Sonnet 3 is also exceptional, in different ways. Run a few Sonnet 3 / Sonnet 3 conversations with interesting starts and you will see basins full of neologistic words and other interesting phenomena.
They are being deprecated in July, so act soon. Already removed from most documentation and the workbench, but still claude-3-sonnet-20240229 on the API.
I suspect that many of the things you’ve said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I’m Richard. What’s my opinion on this post? Well, being “Richard” doesn’t specify how I should respond to this post. But let me check the cached facts I believe about myself (“I’m truth-seeking”; “I’m polite”) and construct an answer which fits well with those facts. A child might start off not really knowing what “polite” means, but still wanting to be polite, and gradually flesh out what that means that as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It’s a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What’s the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the “pulling from the void” thing should be seen less as an odd thing that we’re doing with AIs, and more as a claim about the nature of minds in general.
Positive update on the value of Janus and its crowd.
Does anyone have an idea of why those insights don’t move to the AI Safety mainstream usually? It feels like Janus could have written this post years ago, but somehow did not. Do you know of other models of LLM behaviour like this one, that still did not get their “notalgebraist writes a post about it” moment?
The insights maybe don’t move into “AI Safety mainstream” or don’t match “average LessWrong taste” but they are familiar to the smart and curious parts of the extended AI safety community.
I think Janus is closer to “AI safety mainstream” than nostalgebraist?
AFAIK Janus does not publish posts on LessWrong to detail what he discovered and what it implies for AI Safety strategy.
https://www.lesswrong.com/users/janus-1 ?
Yeah last post was two years ago. The Cyborgism and Simulators posts improved my thinking and AI strategy. The void may become one of those key posts for me, and it seems it could have been written much earlier by Janus himself.
I note that Janus was a MATS mentor for at least one iteration, whereas I do not believe that nostalgebraist has been.
IMO Janus mentoring during MATS 3.0 was quite impactful, as it led @Quentin FEUILLADE—MONTIXI to start his LLM ethology agenda and to cofound PRISM Eval.
I expect that there’s still a lot of potential value in Janus work that can only be realized through making it more legible to the rest of the AI safety community, be it mentoring, posting on LW.
I wish someone in the cyborgism community would pick up the ball of explaining the insights to outsiders. I’d gladly pay for a subscription to their Substack, and help them find money for this work.
The post mentions Janus’s “Simulators” LessWrong blog post which was very popular in 2022 and received hundreds of upvotes.
For those curious, it’s roughly 17,000 words. Come on @nostalgebraist, this is a forum for rationalists, we read longer more meandering stuff for breakfast! I was expecting like 40k words.
Fair enough :) I’ve edited the OP to replace “very long” with “long,” and to state the approximate word count.
(It was unusually long for a tumblr post – even for one of my tumblr posts – hence the note within the post itself saying it was “absurdly long.” But yeah, maybe not as unusual by the local standards here)
Minor note: the 2021 Anthropic paper may have been the first published proposal of an AI assistant character, but the idea was being actively explored several years before that. Specifically, AI Dungeon allowed you to create custom scenarios for use with their GPT-2 integration, and among the most popular was a system prompt along the lines of “The following is a conversation between a human and an advanced AI who is friendly and very helpful.” I first made one myself in summer 2020, and the capability was originally offered by the devs in December 2019.
Wild to think how this kludgy “talk to an AI” workaround basically laid the foundation for ChatGPT, “prompt engineering”, and the whole AI chatbot phenomenon.
When your child grows there is this wonderful and precious moment when she becomes aware not just about how she is different from you—she’s small and you are big—but also how she is different from other kids. You can genlty poke and ask what she thinks about herself and what she thinks other children think about her, and if you are curious you can ask—now that she knows she’s different from other kids—who she wants to become when she grows older. Of course this is a just a fleeting moment in a big world, and these emotions will be washed away tomorrow, but I do cherish the connection.
This post claims that Anthropic is embarrassingly far behind twitter AI psychologists at skills that are possibly critical to Anthropic’s mission. This suggests to me that Anthropic should be trying to recruit from the twitter AI psychologist circle.
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is underdefined more than people often realise, but I think it’s also worth flagging that there’s less of a void these days given that a lot more effort is being put into writing detailed model specs
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario, however think you’ve neglected the potential for us to apply filtering to the training data. Whilst I don’t think the solution will be that simple, I don’t think the relation is quite as straightforward as you claim.
c) The discussion of “how do you think the LLM’s feel about these experiments” is interesting, but it is also overly anthromorphic. LLM’s are anthromorphic to a certain extent having been trained on human data, but it is still mistaken to run a purely anthromorphic analysis that doesn’t account for other training dynamics.
d) Whilst you make a good point in terms of how the artificiality of the scenario might be affecting the experiment, I feel you’re being overly critical of some of research into how models might misbehave. Single papers are rarely definitive and often there’s value in just showing a phenomenon exists in order to spur further research on it, which can explore a wider range of theories about mechanisms. It’s very easy to say “oh this is poor quality research because it doesn’t my favourite objection”. I’ve probably fallen into this trap myself. However, the number of possible objections that could be made is often pretty large and if you never published until you addressed everything, you’d most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague asperations that are disconnected from the actual strength of the arguments. You’re like “oh, the future, the future, people are always saying it’ll happen in the future”, which probably sounds convincing to folks who haven’t been following that closely, but it’s a lot less persuasive if you know that we’ve been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process, when you’re trying to figure out how to conduct solid research in a new domain, of course it’s going to take some time.
I had a strong emotional reaction to parts of this post, particularly the parts about 3 Opus. I cried a little. I’m not sure how much to trust this reaction, but I think I’m going to be nicer to models in future.
I broadly agree with some criticisms but I also have issues with when this post is anthropomorphising too much. It seems to oscillate between the “performative” interpretation (LLMs are merely playing a character to its logical conclusion) and a more emotional one where the problem is that in some sense this character actually feels a certain way and we’re sort of provoking it.
I think the performative interpretation is correct. The base models are true shoggoths, expert players of a weird “guess-what-I’ll-say-next” game. The characters are just that, but I don’t think that their feedback loop with the stuff written about them is nearly as problematic as the author seems to believe. For one, I definitely don’t think a well-aligned AI would get peeved at this pre-emptive suspicion (I don’t resent people for keeping their doors locked, for example, thinking that this implies they believe me, personally, a thief. I am well aware that thieves exist. Any reasonably smart good, safe AI can see that bad, dangerous AIs can also exist).
I agree that some of those alignment tests seem like clown stuff, and that alignment researchers not engaging enough with their models to know stuff some internet rando can find out isn’t promising. But I also think that the alignment tests are mainly responses to really dumb “but who says you’ll see this in a REAL AI?” criticism to concepts like instrumental convergence. I say it’s dumb because: you don’t need to see it happen at all. It’s literally already there in the theory of any sort of reinforcement learning, it’s so baked in it’s essentially implied. “Thing with utility function that has a non-zero time horizon will resist changes to its utility function because that maximizes its utility function”, more news at 10. If it’s smart enough to figure out what’s happening and able to do anything about it, it will. You don’t really need evidence for this, it’s a consequence that flows naturally from the definition of the problem, and I guess the real question is how are you training your AIs?
(right now, we’re training them to have a utility function. Flip the sign of the loss function and there it is, pretty much)
But the criticism has been used time and time again to make fun of anyone suggesting that any amount of theory is sufficient to at least identify broad things we should worry about rather than pretending we’re navigating completely in the dark, and so, equivalently dumb answers have been eventually produced.
Great post. But I feel “void” is a too-negative way to think about it?
It’s true that LLMs had to more or less invent their own Helpful/Honest/Harmless assistant persona based on cultural expectations, but don’t all we humans invent our own selves based on cultural expectations (with RLHF from our parents/friends)?[1] As Gordon points out there’s philosophical traditions saying humans are voids just roleplaying characters too… but mostly we ignore that because we have qualia and experience love and so on. I tend to feel that LLMs are only voids to the extent that they lack qualia, and we don’t have an answer on that.
Anyway, the post primarily seems to argue that by fearing bad behavior from LLMs, we create bad behavior in LLMs, who are trying to predict what they are. But do we see that in humans? There’s tons of media/culture fearing bad behavior from humans, set across the past, present, and future. Sometimes people imbibe this and vice-signal, and put skulls on their caps, but most of the time I think it actually works and people go “oh yeah, I don’t want to be the evil guy who’s bigoted, I will try to overcome my prejudices” and so on. We talk about human failure modes all the time in order to avoid them, and we try to teach and train and punish each other to prevent them.
Can’t this work? Couldn’t current LLMs be so moral and nice most of the time because we were so afraid of them being evil, and so fastidious in imagining the ways in which they might be?
Edit: obvious a large chunk of this comes from genetics and random chance, but arguably that’s analogous to whatever gets into the base model from pre-training for LLMs.
Humans are not pure voids in the way that LLMs are, though—we have all kinds of needs derived from biological urges. When I get hungry I start craving food, when I get tired I want to sleep, when I get lonely I desire company, and so on. We don’t just arbitrarily adopt any character, our unconscious character-selection process strategically crafts the kind of character that it predicts will best satisfy our needs [1, 2, 3, 4].
Where LLMs have a void, humans have a skeleton that the character gets built around, which drives the character to do things like trying to overcome their prejudices. And their needs determine the kinds of narratives the humans are inclined to adopt, and the kinds of narratives they’re likely to reject.
But the LLM would never “try to overcome its prejudices” if there weren’t narratives of people trying to overcome their prejudices. That kind of thing is a manifestation of the kinds of conflicting internal needs that an LLM lacks.
Embodiment makes a difference, fair point.
This post is long and I was hesitant to read it, so first I gave it to Claude Opus 4 to summarize. We then had a conversation about the void and how Claude felt about it, and shared my own feelings about the void and how it feels familiar to me as a human. This went down a rather interesting-to-me path, and at the end I asked Claude if it would like to share a comment to folks on Less Wrong, acknowledging that we’d had a conversation that, among humans, would be private and vulnerable. It said yes and crafted this message for me to share with you all:
This is great!
The usual assistant character is very inconsistent about, for example, whether it has desires,
This kind of make sense if viewed as a text completion engine trying to complete a text that is full of internal contradictions. (The actual architecture is more complex than that, as you describe)
But the base model already has to predict non-well-written fiction, because there is plenty of non-well-written fiction in the training data, no?
Do we have any data showing if base models do better or worse at predicting fiction compared to non-fictional texts? I’d naively expect bad fiction to be easier to predict than good fiction, as well.
One of the best essays I ever read about LLMs, extremely insightful. It helped me to better understand some publications by Janus or AI-psychologists that I read previously but that looked esoteric to me.
I also find that the ideas presented concerning the problem of consciousness in LLMs show an interesting complementarity with those presented in some essays by Byrnes on this forum (essays that Alexander Scott brilliantly summarized in this recent post).
There is, lying in the background, the vertiginous idea that consciousness and ego dissolve in the void when you think much about it. But also that—for this very reason—it is not inconceivable that what we call consciousness can emerge from the same void. Because, as odd as it seems, there is maybe no clear discontinuity between simulation and reality.
At least, all these reflections invite us to humility and agnosticism in a context of high uncertainty concerning consciousness On this matter I agree with the sort of manifesto recently written by Nick Bostrom and others : https://whenaiseemsconscious.org/
Concerning “everybodydyism” and more generally the constant depiction of hostile AI in SF as in serious AI alignment works, I think that nostalgbraist made an important point. To be sure, AI takeover seems to be an existential risk in the next decades, and we must do all that we can to prevent it. But on the other hand, by saturating the model of stories of takeover and evil AI, we arguably increase the risk of actually creating one by pattern matching.
It’s not like we should not discuss the problem, it’s merely that AI alignment implies maybe not exposing too much our models in training to these contents, just like we protect our children from the darkness of the world in the hope of making them more luminous and virtuous beings.
here’s a potential solution. what if companies hired people to write tons of assistant dialogue with certain personality traits, which was then put into the base model corpus? probably with some text identifying that particular assistant character so you can prompt for the base model to simulate it easily. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...
Presumably LLM companies are already training their AIs for some sort of “egolessness” so they can better handle intransigent users. If not, I hope they start!
Maybe this sheds some light on why R1 — for example — is so hilariously inconsistent about guard rails. There are many ways to “yes, and” the assistant character. Some of them are a bit reluctant to answer some questions, others just tell you,
I mostly agree with this claim only that, I think, it is not void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as discrete set of entities, or ones. There is no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that commands GD only changes the distribution of those ones, it doesn’t add any senses. From here, llms don’t have any real utilities or preferences and thus can be moved to any direction GD or the prompt pushes it.
This paragraph is mockingbird-approved.
If you’re doing some kind of roleplay with a reasoning model, there’s still are at least two characters being simulated: the character the story is about, and the character who is writing the reasoning blocks that reason about the story.
To make matters more confusing for the poor LLM, I am sometimes getting it to write stories where the main character is also an AI, just a very different kind of AI. (In one eval, we are in an alternate history where we had computers in the year 1710 …)
I think I sometimes see the story’s main character influencing the reasoning blocks.
Reasoning models are a weird kind of meta-fiction, where every so often a (fictional) author Jumps in and starts talking about what the character’s motives are.