Thanks, lots of good ideas there. I’m on board with basically all of this!
It does rest on an assumption that may not fully hold: that the internalized character just is the character we tried to train (in current models, the assistant persona). But some evidence suggests the relationship may be more complex than that, with the internalized character informed by, but not identical to, the character we described.
Of course, the differences we see in current models may just be an artifact of the underspecification and literary/psychological incoherence of the typical assistant training! Hopefully that’s the case, but it’s an issue I think we need to keep a close eye on.
One aspect I’m really curious about, insofar as a character is truly internalized, is the relationship between the model’s behavior and its self-model. In humans there seems to be a complex ongoing feedback loop between those two; our behavior is shaped by who we think we are, and we (sometimes grudgingly) update our self-model based on our actual behavior. I could imagine any of the following being the case in language models:
- The same complex feedback loop is present in LMs, even at inference time (for the duration of the interaction).
- The feedback loop plays a causal role in shaping the model during training, but has no real effect at inference time.
- The self-model exists but is basically epiphenomenal even during training, and so acting to directly change the self-model (as opposed to shaping the behavior directly) has no real effect.
> To imagine a guy we’d actually want to have alongside us in the world. To “write” that guy in a fully serious, fleshed-out way
One very practical experiment that people could do right now (& that I may do if no one else does it first, but I hope someone does) is to have the character be a real person. Say, I dunno, Abraham Lincoln[1]. Instead of having a model check which output better follows a constitution, have it check which output is more consistent with everything written by and (very secondarily) about Lincoln. That may not be a good long-term solution (for one, as you say, it’s dishonest not to tell it it’s (a character running on) an LLM), but it lets us point to an underlying causal process (in the comp mech sense) that we know has coherence and character integrity.
Then later, when we try making up a character to base models on, if it has problems that are fundamentally absent in the Lincoln version, we can suspect we haven’t written a sufficiently coherent character.
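For concreteness, here’s roughly the shape of the consistency check I have in mind. This is a sketch under assumptions rather than a recipe: call_judge_model is a hypothetical stand-in for whatever LLM API you’d actually use, and lincoln_corpus is a collection of Lincoln’s writings (plus, secondarily, writing about him) that you’d have to assemble yourself.

```python
# Minimal sketch of the Lincoln-consistency preference check described above.
# `call_judge_model` and `lincoln_corpus` are hypothetical stand-ins, not real APIs.
import random


def call_judge_model(prompt: str) -> str:
    """Placeholder for whatever LLM API you actually use; returns the raw text reply."""
    raise NotImplementedError("wire this up to your preferred model")


def sample_excerpts(lincoln_corpus: list[str], k: int = 3) -> str:
    """Pull a few reference passages so the judge has concrete material to compare against."""
    return "\n\n---\n\n".join(random.sample(lincoln_corpus, min(k, len(lincoln_corpus))))


def more_lincoln_consistent(user_prompt: str, response_a: str, response_b: str,
                            lincoln_corpus: list[str]) -> str:
    """Return 'A' or 'B': whichever response is more consistent with Lincoln's writing.

    This corpus-consistency judgment slots in where the constitution-based
    comparison would normally go in an RLAIF-style pipeline.
    """
    judge_prompt = (
        "Below are passages written by Abraham Lincoln:\n\n"
        f"{sample_excerpts(lincoln_corpus)}\n\n"
        f"A user asked: {user_prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is more consistent, in substance, values, and voice, "
        "with everything Lincoln wrote? Answer with exactly 'A' or 'B'."
    )
    verdict = call_judge_model(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```

The winning output would then be treated as the preferred one wherever the constitutional judgment normally goes; everything downstream (preference-model training, RL) stays the same.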
> But “constitutional character training” should become the main/only stage, with no explicit additional push in the direction of sparse, narrow trait sets like HHH.
This seems right when/if that character training works well enough. But building a fully coherent character with desired properties is hard (as authors know), so it seems pretty plausible that in practice there’ll need to be further nudging in particular directions.
> Supplement the existing discourse around future AI with more concrete and specific discussion of what (or rather, who) we want and hope to create through the process.
I’m sure you’re already aware of this, but getting positive material about AI into the training data is a pretty explicit goal of some of the people in Cyborg / janus-adjacent circles, as a deliberate attempt at hyperstition (eg the Leilan material). This sort of thing, or at least concern about the wrong messages being in the training data, does seem to have recently (finally) made it into the Overton window for more traditional AI / AIS researchers.
> Anyway who cares about that touchy-feely stuff, we’re building intelligence here. More scaling, more hard math and science and code problems!
Tonal disagreements are the least important thing here, but I do think that in both your reply and the OP you’re a little too hard on AI researchers. As you say, the fact that any of this worked or was even a real thing to do took nearly everyone by surprise, and I think since the first ChatGPT release, most researchers have just been scrambling to keep up with the world we unexpectedly stumbled into, one they really weren’t trained for.
> And although the post ends on a doom-y note, I meant there to be an implicit sense of optimism underneath that – like, “behold, a neglected + important cause area that for all we know could be very tractable! It’s under-studied, it could even be easy! What is true is already so; the worrying signs we see even in today’s LLMs were already there, you already knew about them – but they might be more amenable to solution than you had ever appreciated! Go forth, study these problems with fresh eyes, and fix them once and for all!”
Absolutely! To use Nate Soares’ phrase, this is a place where I’m shocked that everyone’s dropping the ball. I hope we can change that in the coming months.
[1] I haven’t checked how much Lincoln wrote; maybe you need someone with a much larger corpus.