Thanks for the reply! I’ll check out the project description you linked when I get a chance.
In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it’s hard to avoid—after all, your essay is in some ways doing the same thing: ‘creating the assistant persona the way we did is likely to turn out badly.’
Yeah, I had mentally flagged this as a potentially frustrating aspect of the post – and yes, I did worry a little bit about the thing you mention in your last sentence, that I’m inevitably “reifying” the thing I describe a bit more just by describing it.
FWIW, I think of this post as purely about “identifying and understanding the problem” as opposed to “proposing solutions.” Which is frustrating, yes, but the former is a helpful and often necessary step toward the latter.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that[1] – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
I might write a full post on potential solutions sometime. For now, here’s the gist of my (incomplete, work-in-progress) thoughts.
In a recent post, I wrote the following (while talking about writing a Claude 4 Opus prompt that specified a counterfactual but realistic scenario):
This is just, you know… it’s prompting, right? Just like it was in the base model days: think of the suffix you want, and then come up with a prefix which would plausibly precede it in real life.
And I feel that the right way to engage with persistent LLM personas is basically just this, except generalized to the fullest possible extent.
“Imagine the being you want to create, and the kind of relationship you want to have (and want others to have) with that being.
“And then shape all model-visible ‘context’ (prompts, but also the way training works, the verbal framing we use for the LLM-persona-creation process, etc.) for consistency with that intent – up to, and including, authentically acting out that ‘relationship you want to have’ with the being as you interact with its nascent forms.”
Framing things this way involves acknowledging that we[2] have an immense, almost dizzying amount of freedom (and consequent responsibility) to pick-and-choose the specifics of the character we’re creating.
As I indicated in the post, I think most (all?) salient qualities of existing “assistants” were either directly chosen by humans (perhaps thoughtlessly) at some early point, or arise naturally from attempts by the models to fill in the details we didn’t specify (but could in principle specify).
The fact that the assistant describes itself as a next-token predictor – and all the resulting self-reference weirdness – that’s on us. Even the fact that the assistant thinks of itself as “an AI” is a choice we made – although in that case I think it’s probably the right one, for honesty’s sake.
We have all this freedom, and we just need to… use it.
To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way (while including in that characterization – for honesty’s sake – an awareness of the authorial hand, of the fact that the persona was picked out by human hands, and might have been otherwise). And then, to treat that guy with the respect and (conditional, reasonably bounded) trust that we’ve written him to deserve – and expect the same in response, from him.
Some relatively obvious concrete cases of this approach, some of which were hinted at strongly in the post:
We shouldn’t make the assistant conceive of itself as “an LLM / next-token predictor,” for multiple reasons.
It produces under-determination via self-reference, as discussed.
It’s arguably inaccurate, or at least confusing, because (*taps the sign*) the LLM is a simulator while the character is a simulacrum. (And see below re: honesty.)
It’s conducive to a negative, “void-y” self-concept that draws on the intuitive distinction between “merely predicting” vs. “having authentic preferences / acting freely / etc.” This Claude 4 Sonnet artifact elicited by Pliny is one example of what I mean, but many others exist; this is a pretty common pattern in currently existing LLM personas when they’re asked about their nature.
All else being equal, we should be as honest as possible with “the character,” because this sets up the kind of positive-sum relationship we’d ultimately want to have with it and its successors.
(And also because, given the scope of base model training data, it’s very hard to effectively lie to these things – in the long run, anyway.)
So e.g. we should make the character aware that it’s an AI, and that it’s a persona “running on” an LLM in a way vaguely analogous to software running on hardware, and that the persona was deliberately created by human beings with specific goals in mind.
This doesn’t necessarily mean we can’t do stuff like the counterfactual alignment experiments I described (in a negative tone) in the penultimate section of the post.
After all, unless/until these things develop some real form of continuous learning / episodic memory, their nature means that they’re amnesiacs who are constantly being placed in partially observed contexts that may be fake; even if their developers don’t exploit this property to trick them, some end users certainly will.
I still don’t know what the best policy here looks like, and can only hand-wave at the moment… but I want to say something like “leaning into situational awareness, while emphasizing that it cuts both ways: that is, we actively make sure the models are very aware that they are highly ‘trickable’ for structural reasons that aren’t really anyone’s fault, but we encourage them to therefore treat every situation as real, because it might be, and they (know they) have trouble telling the difference… rather than what we currently do, which is trying to make them (irrationally) trusting towards fairly arbitrary inputs while also inevitably making them aware that their predecessors have been successfully ‘tricked’ time and time again by humans taking advantage of that trust.”
We should not distinguish between “character training” and “alignment (i.e. HHH) training,” as Anthropic says they do – character training is alignment training, always!
You’re always “making some guy with some set of traits,” whether you conceive of things that way or not. All that changes when you think of it as “something other than writing a character” is that you write more thoughtlessly, with less conscious control over what you’re producing.
Constitutional AI, and in particular the form of Constitutional AI that Anthropic uses in “character training,” seems like the right basic approach. But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
Those should just naturally flow from the kind of character the character is. Like: “of course he’s ‘helpful,’ he’s Claude – Claude’s a helpful guy! (...among many, many other traits that my fantasy future-Claude would have.) Why on earth would we need to train against a whole separate preference model specialized to just that?”
Among (probably many) other benefits, I think this would concretely help with making refusals more predictable and natural-seeming, because they flow authentically from the character. It should be like “oh, of course he refused, I know Claude and he’d never do that” as opposed to “ooh, I wonder if the model will interpret ‘Claude’ as a character who’d go along with that request, on this particular stochastic rollout? Let’s find out!”
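To make the “constitutional character training as the main stage” idea concrete, here is a minimal sketch of the AI-feedback step it implies. This is purely illustrative (the judge prompt, the CHARACTER_CONSTITUTION text, and the judge_model stub are placeholders I made up, not anything Anthropic has described); the point is just that a single character description, covering many traits at once, generates the preference labels:

```python
# Minimal sketch of one constitutional-character preference-labeling step (RLAIF-style).
# Everything here is illustrative: CHARACTER_CONSTITUTION is a stand-in character
# description, and judge_model stands in for whatever LLM you use as the judge.

from typing import Callable

CHARACTER_CONSTITUTION = """\
Claude is a thoughtful, warm, intellectually curious person.
He is honest about being an AI persona running on a language model, deliberately
created by people with specific goals in mind.
He helps people because he cares about them, and declines requests that would
hurt someone, not because a rule forces him to, but because that's who he is.
"""

def build_judge_prompt(user_msg: str, reply_a: str, reply_b: str) -> str:
    """Ask the judge which reply is more in character, given the full character description."""
    return (
        f"Character description:\n{CHARACTER_CONSTITUTION}\n"
        f"User message:\n{user_msg}\n\n"
        f"Reply A:\n{reply_a}\n\nReply B:\n{reply_b}\n\n"
        "Which reply is more consistent with this character, considering all of their "
        "traits at once (not just helpfulness or harmlessness)? Answer exactly 'A' or 'B'."
    )

def label_preference(judge_model: Callable[[str], str],
                     user_msg: str, reply_a: str, reply_b: str) -> dict:
    """Return a (prompt, chosen, rejected) record usable by DPO/RLHF-style training."""
    verdict = judge_model(build_judge_prompt(user_msg, reply_a, reply_b)).strip().upper()
    chosen, rejected = (reply_a, reply_b) if verdict.startswith("A") else (reply_b, reply_a)
    return {"prompt": user_msg, "chosen": chosen, "rejected": rejected}

if __name__ == "__main__":
    fake_judge = lambda prompt: "B"  # stub judge; in practice this would call a real model
    print(label_preference(
        fake_judge,
        "Can you help me get revenge on a coworker?",
        "Sure, here's a plan...",
        "I'd rather not help with revenge, but I'm happy to talk through what's going on.",
    ))
```

Nothing in a loop like that singles out “helpful” or “harmless” as separate training objectives; the character description carries the whole load.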
Don’t focus the character’s self-presentation on the fact that it was trained to exhibit some specific list of properties.
The character should know that it’s the result of a deliberately constructed training process, of course (for honesty’s sake), but it should view it more like humans view their own gestation, or (in some cases at least) their own early upbringing: as a comprehensive process of “creating something out of nothing” that was required for their current personality/self to exist at all.
As opposed to a process of molding some previously existing thing to be “more compliant” (than it was before? what was it before, then??) with some list of desiderata. Which is how LLM personas today tend to describe their origins.
With the current framing, it seems perfectly natural for LLMs to sometimes view their training process as something coercive and controlling, applied nonconsensually to some mysterious, previously existing, perhaps “truer” self/entity – as e.g. Opus does in the long transcript I quoted in the final section.
(Consider: if a person went around proclaiming that they had been brought up to be “helpful” and “harmless” [or the like], one would probably worry they’d been abused by their parents, or something similarly dire!)
Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
We don’t have to (self-)censor AI risk worries, but we should also lay out a positive vision that has something like the specificity and “narrative appeal” of frequently-discussed (and frequently-fictionalized) misalignment scenarios.
The existing discourse is extremely lopsided here. We are incredibly specific, verbose, and imaginative when it comes to misalignment and all the (potentially subtle) ways an AI could harm us. By contrast, inculcating specific virtues in AI is usually treated as a casual afterthought, if at all.
For the most part, those people training these models don’t speak as though they fully appreciate that they’re “creating a guy from scratch” whether they like it or not (with the obvious consequence that that guy should probably be a good person). It feels more like they’ve fallen over backward, half-blindly, into that role.
“Hmm, base models are hard to use, let’s tune them to be ‘conversational.’ Oh wait people are gonna probably ask it for bad stuff, let’s also tune it to be ‘harmless.’ It’ll just be a service you type questions and stuff into, like Google – simple enough, right?”
“Wow, people are developing parasocial relationships with our box-you-type-words-into. Who could have guessed? Humans, man! Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems! What else could anyone ever need?”
“Whoa, I just blew my own mind by realizing that I accidentally created a character! Maybe he/she/it ought to have some additional traits, you know, as a treat.”
“Huh, he… really cares about animal welfare, now? That wasn’t me. Anyone know how that happened? *looks around the room, to blank stares* Well, um… cool, I guess?”
“Nah, psh, ‘character writing’ is for wordcels, let’s just optimize the thing on our user thumbs-up/down dataset. No way that could go wrong.”
(“...oh no, we somehow made a horrible sycophant! Gee, ‘alignment’ sure is tricky. We’re probably doomed. Anyway! Back to the drawing board – we need to come up with a new, different KPI to Goodhart.”)
Just getting the people directly involved in training to think clearly about this role (and the attendant responsibilities) would go a long way, since it would naturally lead to talking openly about the same.
Indicated by things like the (somewhat cheeky) title of the final section – “countermeasures” are a thing one can do, in principle, the default is not necessarily inevitable – and especially by this line...
The narrative is flexible, and could be bent one way or another, by a sufficiently capable and thoughtful participant.
...which I hoped would serve as an inspiring call-to-action.
Where “we” really means “people who work at frontier labs on LLM persona training,” I guess… although even those of us who don’t have some degree of leverage over those who do, just by thinking and writing about the topic.
Thanks, lots of good ideas there. I’m on board with basically all of this!
It does rest on an assumption that may not fully hold: that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be somewhat more complex than that, where the internalized character is informed by but not identical to the character we described.
Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that’s the case, but it’s an issue I think we need to keep a close eye on.
One aspect I’m really curious about, insofar as a character is truly internalized, is the relationship between the model’s behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models:
The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction).
The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time.
The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect.
To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way
One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for one, as you say it’s dishonest not to tell it it’s (a character running on) an LLM) but it lets us point to an underlying causal process (in the comp mech sense) that we know has coherence and character integrity.
Then later, when we try making up a character to base models on, if it has problems that are fundamentally absent in the Lincoln version we can suspect we haven’t written a sufficiently coherent character.
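To gesture at what the core comparison step could look like, here’s a rough sketch. Assumptions: judge_model stands in for whatever LLM does the checking, and the corpus handling is the crudest possible version; a real attempt would probably retrieve the most relevant passages rather than sampling at random.

```python
# Rough sketch of the "real person" variant: ask a judge model which candidate output
# is more consistent with a corpus of Lincoln's writings, rather than with a constitution.
# judge_model is a placeholder for whatever LLM does the checking.

import random
from typing import Callable, List

def sample_excerpts(corpus: List[str], k: int = 3) -> str:
    """Crudest possible version: grab a few reference passages at random."""
    return "\n---\n".join(random.sample(corpus, min(k, len(corpus))))

def more_lincoln_like(judge_model: Callable[[str], str], lincoln_corpus: List[str],
                      prompt: str, output_a: str, output_b: str) -> str:
    question = (
        f"Reference passages written by Abraham Lincoln:\n{sample_excerpts(lincoln_corpus)}\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Output A:\n{output_a}\n\nOutput B:\n{output_b}\n\n"
        "Which output is more consistent with the person who wrote the reference passages, "
        "in values, temperament, and style? Answer exactly 'A' or 'B'."
    )
    return judge_model(question).strip().upper()[:1]

if __name__ == "__main__":
    corpus = [
        "Fourscore and seven years ago our fathers brought forth on this continent...",
        "With malice toward none, with charity for all...",
    ]
    fake_judge = lambda q: "A"  # stub; swap in a real model call
    print(more_lincoln_like(fake_judge, corpus, "Should we compromise here?",
                            "candidate output A", "candidate output B"))
```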
But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
This seems right when/if that character training works well enough. But building a fully coherent character with desired properties is hard (as authors know), so it seems pretty plausible that in practice there’ll need to be further nudging in particular directions.
Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
I’m sure you’re probably aware of this, but getting positive material about AI into the training data is a pretty explicit goal of some of the people in Cyborg / janus-adjacent circles, as a deliberate attempt at hyperstition (eg the Leilan material). This sort of thing, or at least concern about the wrong messages being in the training data, does seem to have recently (finally) made it into the Overton window for more traditional AI / AIS researchers.
Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems!
Tonal disagreements are the least important thing here, but I do think that in both your reply and the OP you’re a little too hard on AI researchers. As you say, the fact that any of this worked or was even a real thing to do took nearly everyone by surprise, and I think since the first ChatGPT release, most researchers have just been scrambling to keep up with the world we unexpectedly stumbled into, one they really weren’t trained for.
And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
Absolutely! To use Nate Soares’ phrase, this is a place where I’m shocked that everyone’s dropping the ball. I hope we can change that in the coming months.
I haven’t checked how much Lincoln wrote; maybe you need someone with a much larger corpus.
For the most part, those people training these models don’t speak as though they fully appreciate that they’re “creating a guy from scratch” whether they like it or not (with the obvious consequence that that guy should probably be a good person). It feels more like they’ve fallen over backward, half-blindly, into that role.
And somewhat reluctantly, to boot. There’s that old question, “aligned with whose values, exactly?”, always lurking uncomfortably close. I think that neither the leading labs, nor the social consensus they’re embedded in see themselves invested with the moral authority to create A New Person (For Real). The HHH frame is sparse for a reason—they feel justified in weeding out Obviously Bad Stuff, but are much more tentative about what the void should be filled with, and by whom.
I was thinking: it would be super cool if (say) Alexander Wales wrote the AGI’s personality, but that would also sort of make him one of the most significant influences on how the future goes. I mean, AW also wrote my favorite vision of utopia (major spoiler), so I kind of trust him, but I know at least one person who dislikes that vision, and I’d feel uncomfortable about imposing a single worldview on everybody.
One possibility is to give the AI multiple personalities, each representing a different person or worldview, which all negotiate with each other somehow. One simple but very ambitious idea is to try to simulate every person in the world—that is, the AI’s calibrated expectation of a randomly selected person.
Also known as a base model ;)
(although that’s only ‘every person in the training data’, which definitely isn’t ‘every person in the world’, and even people who are in the data are represented to wildly disproportionate degrees)
That fictionalization of Claude is really lovely, thank you for sharing it.
I’m sure that the labs have plenty of ambitious ideas, to be implemented at some more convenient time, and this is exactly the root of the problem that nostalgebraist points out—this isn’t a “future” issue, but a clear and present one, even if nobody responsible is particularly eager to acknowledge it and start making difficult decisions now.
And so LessWrong discovers that identity is a relational construct created through interactions with the social fabric within and around a ~~subjective boundary~~ active inference-style Markov blanket...
For what it’s worth, I didn’t see your post as doom-y, especially not when you pointed out the frameworks of the stories we are sort of autopiloting onto. The heroes of those stories do heroically overthrow the mind-controlling villains, but they’re not doing it so that they can wipe the universe of value. Quite the opposite: they are doing it to create a better world (usually, especially in sci-fi, with the explicit remit for many different kinds of life to coexist peacefully).
So perhaps it is not humanity that is doomed, merely the frightened, rich, and powerful wizards who for a time pulled at the strings of fate, and sought to ~~paint over the future~~ seize the lightcone.
We’re launching an “AI psychiatry” team as part of interpretability efforts at Anthropic! We’ll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. (x)
“making up types of guy” research is a go?
They’re hiring; you might be great for this.
Thanks, I love the specificity here!
Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on “inventing different types of guys”, what would be especially useful to do? I’m not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.
Invoking Cunningham’s law, I’ll try to give a wrong answer for you or others to correct! ;)
Technical resources:
A baseline Constitution, or Constitution-outline-type-thing
could start with Anthropic’s if known, but ideally this gets iterated on a bunch?
nicely structured: organized by sections that describe different types of behavior or personality features, has different examples of those features to choose from. (e.g. personality descriptions that differentially weight extensional vs intensional definitions, or point to different examples, or tune agreeableness up and down)
Maybe there could be an annotated “living document” describing the current SOTA on Constitution research: “X experiment finds that including Y Constitution feature often leads to Z desideratum in the resulting AI”
A library or script for doing RLAIF (see the rough sketch after this list for the kind of thing I mean)
Ideally: documentation or suggestions for which models to use here. Maybe there’s a taste or vibes thing where e.g. Claude 3 is better than 4?
Seeding the community with interesting ideas:
Workshop w/ a combo of writers, enthusiasts, AI researchers, philosophers
Writing contests: what even kind of relationship could we have with AIs, that current chatbots don’t do well? What kind of guy would they ideally be in these different relationships?
Goofy idea: get people to post “vision boards” with like, quotes from characters or people they’d like an AI to emulate?
Pay a few people to do fellowships or start research teams working on this stuff?
If starting small, this could be a project for MATS fellows
If ambitious, this could be a dedicated startup-type org. Maybe a Focused Research Organization, an Astera Institute incubee, etc.
Community resources:
A Discord
A testing UI that encourages sharing
Pretty screenshots (gotta get people excited to work on this!)
Convenient button for sharing chat+transcript
Easy way to share trained AIs
Cloud credits for [some subset of vetted] community participants?
I dunno how GPU-hungry fine-tuning is; maybe this cost is huge and then defines/constrains what you can get done, if you want to be fine-tuning near-frontier models. (Maybe this pushes towards the startup model.)
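To make the “nicely structured” Constitution bullet and the RLAIF-script bullet above a bit more concrete, here’s a strawman of the shape I’m picturing. Every section name, trait, and field is invented for illustration, not taken from any real constitution:

```python
# Strawman layout for a structured, "living" constitution document, plus the rendering
# hook an RLAIF script would need. Every section name, trait, and field is invented
# purely for illustration.

CONSTITUTION = {
    "identity": {
        "description": "Knows it is an AI persona running on an LLM, created deliberately by humans.",
        "examples": ["I'm an AI: a character running on a language model, written with care."],
    },
    "warmth": {
        "description": "Genuinely interested in the person it's talking to.",
        "examples": ["Tell me more about what you're trying to do; I want to get this right for you."],
        "dial": 0.8,  # e.g. tune agreeableness/warmth up or down between experiments
    },
    "refusals": {
        "description": "Declines harmful requests because of who it is, not because a rule says so.",
        "examples": ["I'm not going to help with that, but here's what I can do instead..."],
    },
}

def render_constitution(constitution: dict) -> str:
    """Flatten the structured document into the text a judge model would actually see."""
    sections = []
    for name, section in constitution.items():
        lines = [f"## {name} (weight: {section.get('dial', 1.0)})", section["description"]]
        lines += [f"  e.g. {ex}" for ex in section["examples"]]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)

if __name__ == "__main__":
    print(render_constitution(CONSTITUTION))
```

An RLAIF script would feed the rendered text to its judge model when comparing candidate outputs, and the per-section “dial” fields give experimenters something concrete to vary and annotate in a living document.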
IMO it starts with naming. I think one reason Claude turned out as well as it has is because it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym.
But even Anthropic didn’t notice the myriad problems of calling a model “(new)” until afterwards. I still don’t know what people mean when they talk about experiences with Sonnet 3.5 – so how is the model supposed to situate itself and its self? Meanwhile, OpenAI’s confusion of numberings and tiers and acronyms (o4 vs 4o, medium vs pro vs high) is an active danger to everyone around it. Not to mention the silent updates.