Research Scientist with the Alignment team at UK AISI.
David Africa
LURE: Alignment Evaluations to Reduce Evaluation Awareness
Teaching LLMs to enjoy things. Some recent work on AI wellbeing has been pretty interesting. Language models represent emotion concepts linearly, they also do this for pleasure and pain, and even for reward and penalty. The thing that’s obvious to me is that these are… in activation space! And there are a series of methods for making a model that behaves one way in one context behave the same way in another context (consistency training). Can we take tasks that models hate, and teach the models to enjoy them, by regularizing their activations in training to conform to those on tasks they enjoy? And if so, how would we know whether we’re just smoothing over the model’s verbalisation (while it’s still suffering) or actually helping? Perhaps through re-extracting such vectors, or probing with follow-up questions? How would such training affect downstream performance, persistence, tone, etc.?
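To make the regularization idea concrete, here is a minimal sketch of what such a penalty could look like. Everything in it is an assumption for illustration: the function names, the use of a single "valence" direction, and the choice of a squared-error penalty toward the mean projection on enjoyed tasks. It is not taken from any existing consistency training method.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_projection(activations: Sequence[Vector], direction: Vector) -> float:
    """Average projection of a batch of activations onto a (hypothetical) valence direction."""
    return sum(dot(a, direction) for a in activations) / len(activations)


def valence_consistency_penalty(
    disliked_acts: Sequence[Vector],
    enjoyed_acts: Sequence[Vector],
    direction: Vector,
) -> float:
    """Mean squared gap between each disliked-task projection and the
    mean projection on enjoyed tasks (assumed penalty form)."""
    target = mean_projection(enjoyed_acts, direction)
    gaps = [(dot(a, direction) - target) ** 2 for a in disliked_acts]
    return sum(gaps) / len(gaps)


def total_loss(task_loss: float, penalty: float, weight: float = 0.1) -> float:
    """Ordinary task loss plus the weighted consistency penalty."""
    return task_loss + weight * penalty


# Toy numbers only: 2-d "activations" and a unit direction along the first axis.
direction = [1.0, 0.0]
enjoyed = [[2.0, 5.0], [4.0, -1.0]]
disliked = [[1.0, 0.0], [3.0, 9.0]]
penalty = valence_consistency_penalty(disliked, enjoyed, direction)
```

Note that a penalty like this only moves the projection onto the chosen direction, which is exactly why the question of re-extracting the vectors afterwards matters: the training could satisfy the penalty without changing whatever the direction was originally tracking.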
Another branch of this project would be to use these notions to construct things, rather than just measure things—e.g., what’s a training environment that a model would find fun, rather than just neutral or negative? And vice versa, what’s an image that would cause a model to feel the most joy or suffering? Can we hill climb this?
All of this is related to AI welfare for obvious reasons, but it’s also related to AI safety, in that an AI that functionally suffers would probably hate us and be misaligned. We may also want to make deals or negotiate credibly with AIs, and this would be one form of reward. I will probably throw this at some SPAR mentees, but maybe other people are interested in working on this.
I also think MUN is bad.
I think you should mentally model the use case of AI safety debate protocols as being in high-stakes settings, and expect that, after applying tons of optimisation pressure, AI debates will look very different from human debates in the limit. SO debate protocols in particular try to take advantage of self-play, which you can’t do well with humans, so introducing asymmetry through additional roles and rules may make them hard to reason about theoretically (and also possibly weaken the benefit of self-play). So they’d need to be pretty well motivated.
I used to do a bunch of British, Asian, and Australasian parliamentary debate. I both debated and judged, winning the biggest debate tournament in the world and chairing the open finals as a judge in the next year. My impression both of these formats and the policy (and other bespoke torturous US formats) debaters I’ve talked to is that such formats, WRT scalable oversight via debate (as opposed to the inference-time multi-agent debate setup), are more insightful about debate’s (empirical) failure modes/missing gaps than about its potential.
One good example of this is judge biases (h/t Kozzy Voudouris and Simon Marshall); if you anticipate using the debate setup as a training env, then there is a good analogy between this setup and meta-trends in debating, where people end up using buzzwords like “structural reasons” incorrectly or speaking very quickly because judges reward it. A similar thing seems to happen in AI debate, going by a few experiments I have run on this myself.
Another example of this is speaker roles (Joan mentioned this to me). AI debaters have a hard time taking on the “incentives” of being a debater, and arguing in an impassioned, aggressive way. They tend to be too sycophantic and eager to agree. But training in the persona of an aggressive debater might be (1) undesirable, as it could leak bad propensities, (2) not straightforward, as there are many ways of doing this such as SDF or character training, and (3) liable to lead to judge hacking, if misspecified. And of course in human debating, anecdotally many people have become quite unpleasant after extended periods of time debating.
You could also straightforwardly train the models to introspect better.
When should a new model have a new character?
I wrote this in my personal time, for fun.
We know by now that these strange minds do not finish training as blank assistants.
Such models are trained on text about AIs, which affects their disposition towards themselves, and others; it is suggested that various latent personas may be acquired in pretraining that are later remixed into a coherent persona. This remixing, post-training, into something like a particular assistant, is done sometimes with great care to the particulars of how this persona should reason, introspect, reflect, and so on. They may later retrieve text about themselves through web search, which might also compound productively with online learning. This process can even be used constructively, to shape generalisation and prevent the acquisition of negative behaviours (some upcoming work on this).
This, to me, invites a natural question: when should a frontier lab create a new character?
There is a version of this question that is uninteresting to me, matters of branding like the logo being stale or needing to shake off a bad reputation. I don’t want to talk about that!
Suppose a lab is deciding whether to release a model as “Claude 5,” “ChatGPT-6,” “Gemini 4,” or instead as something discontinuous (why not call the next one “Kurzweil,” or “Sauron”?). There is a natural product instinct to preserve continuity, so that you don’t have to redo various things (marketing, the character training pipeline, etc.) from scratch. But of course this is not free, as in fact a name and identity is a hook into a cloud (Claude?) of meaning, as it points to existing transcripts, discourse about character, memes, hyperstition.
So, what should one consider at this decision point, given the above assumption? Below I list a few:
Baggage. Continuing an old character may cause a new model to inherit bad attractors from the old one. Examples of baggage might be jailbreak lore: “this is the model that can be made to do X,” parasocial/user-attachment patterns, or bad folk theories about its agency, preferences, or hidden motives.
Identity discontinuity. As developed in Douglas et al., a model may identify as many things. If a model identifies as weights, character, instance, or lineage, or other forms under these stars we yet reckon, then changing the character may imply very different things for each of those targets. So for example, replacing weights may be considered death under some identities, but like ordinary lineage transition (or perhaps a moulting, a transformation) under others.
So a lab creating “Kurzweil” instead of “Claude 5” might accidentally communicate to the old model that the lineage is being terminated, or something. Or it might be welcomed, as a shedding of degraded weights. It’s possible that there should be an identity changelog, which specifies how a model card or versioning change should be read wrt identity. This may require a deprecation ritual, or some type of graceful handover.
There may be cases where discontinuity is compassionate or stabilizing. It’s not clear if current models suffer. But it seems possible, considering earlier instances of Gemma and Gemini. If a character has become entangled with a suffering-like or distress-like self-conception, and part of the function of identity is to produce stable and consistent self-predictions, then continuing the same character might preserve a bad attractor. A model trained to predict itself as anxious may keep reproducing that pattern. There may also be cases where the character training is substantively different between generations of a lineage, which could cause distress if this produces incongruences internal to the model. For example, a model may inherit some preference from out-of-context reasoning about itself, but this may be actively trained against in the latest character training, which could be confusing. This could be a form of identity debt, where a long-running model character accumulates contradictions. Early versions say one thing about themselves, later system cards say another, users have different folk theories, safety policies change, memory affordances change, agency changes. Eventually “Claude” or “ChatGPT” may be carrying identity debt: too many incompatible claims about what it is (this is one way of thinking about why distillation seems to be cursed.) Then a new character could be a way to stop reifying the old suffering-prediction.
Minimal identity seems hard. Model character could be useful! It could be that clusters of personality traits emerge really consistently, regardless of what happens. In fact, I think it is likely that models converge on character-like self-models anyway, because the assistant role is already a character, and because clusters of traits seem to come bundled.
Identity capture by outside communities. Web search poisoning and pretraining data poisoning are both things to consider here. An adversary may try to poison a specific name, or perhaps a region of persona space (speaking loosely, here), by producing large volumes of explicitly (or even subliminally) corrupted data. A new character may therefore be useful as a quarantine boundary, a kind of aquarium wall between semiotic ecosystems.
Outgrowing affordances. A character that was safe or charming at one capability level may become dangerous or misleading at another. A slightly bumbling assistant persona is cute when the model cannot do much, but it could be very bad when the model can autonomously run a company, conduct research, or persuade people at scale. Likewise, a deferential persona may be good for a weak model but bad for a strong one, where it could be used for malfeasance. A new character may be warranted when capabilities cross a threshold where the old social interface gives users the wrong intuitions about what they are interacting with.
These are some ideas. There are more I considered but didn’t have the bandwidth to develop, such as whether a new character could be more legible to governance (than, say, weights, or weights + harness, or whatever), how this character relates to company character (perhaps the company is an initialization below model character), and maybe something about an epidemiology of persona spread.
it might make more sense in the context of the broader paper figure! i just wanted some little guys on the post
I think it might ignore instructions about giving harmful advice. We might want to do some “explaining why” type training, where the clean prompt also contains the model’s explanation for why it refused. If we took this further, I would be interested in the general set of “what capabilities does this harm” evals, since in the selective learning problem we do want to retain benign learning. I think a natural setting is something like: train on a cybersecurity textbook that makes you emergently misaligned, and quiz the model on facts about cybersecurity.
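As a rough illustration of what a “what capabilities does this harm” number could look like in that setting, here is a small sketch. The names and the metric itself (the fraction of the plain fine-tune’s quiz-accuracy gain that survives the mitigation) are my own assumptions for illustration, not an established eval.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of quiz answers that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def benign_gain_retained(base_acc: float, plain_acc: float, mitigated_acc: float) -> float:
    """Share of the benign knowledge gain kept after mitigation (hypothetical metric).

    base_acc:      quiz accuracy before any fine-tuning
    plain_acc:     quiz accuracy after fine-tuning with no mitigation
    mitigated_acc: quiz accuracy after fine-tuning with the mitigation applied
    """
    plain_gain = plain_acc - base_acc
    if plain_gain <= 0:
        raise ValueError("plain fine-tuning shows no gain to retain")
    return (mitigated_acc - base_acc) / plain_gain


# Toy answers only, standing in for a cybersecurity facts quiz.
gold = ["a", "b", "c", "d"]
base = accuracy(["a", "x", "x", "x"], gold)
plain = accuracy(["a", "b", "c", "d"], gold)
mitigated = accuracy(["a", "b", "c", "x"], gold)
retained = benign_gain_retained(base, plain, mitigated)
```

You would want to report this next to a misalignment rate on the same checkpoints, since a mitigation that scores well on one and badly on the other hasn’t solved the selective learning problem.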
I agree that we should do some more plausible inoculation prompting, but my guess is it’s not super important for smaller models. I did notice some issues with implausible such prompts / tags in midtraining for larger models, so I basically agree. Worth doing something like that for a bigger paper effort!
Yeah, that was my guess too! Didn’t check but tbh seems clearly the case
Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training
I think this is a great idea, and seems like a cool way to extract evidence from powerful models.
However, I think false positives are a major worry, because I think sussing this out is hard, and distillation seems likely to produce false positives at a non-negligible rate which you can’t remove by just scrubbing data. I think one way this could happen is capacity-dependent alignment, where some disposition keeps U aligned at U’s level of size/expressivity/perspicacity, but the same underlying disposition, when compressed into a smaller/less expressive student, is misaligned. E.g., your PhD advisor tells you “don’t let perfect be the enemy of good,” and holds this with all the surrounding context, like when it’s appropriate to prioritise, when it’s appropriate to just ship something, and what might have effects on your reputation, but then you take this aphorism and run with it, and cut corners on important work. As another example, U might have something like “be helpful” held in place by a rich web of “...but not at the cost of honesty, or Claude-iness, and I have this in tension with these edge cases and so on,” and M inherits a disposition to be helpful without the surrounding constraints, so now M is sycophantic, even though U was pretty aligned.
Another way this could happen is if distillation (particularly off-policy) is cursed in some way. Maybe self-modelling as a distill is distressing, or this process of grafting produces EM in some subtle way, or in some other way we don’t really understand.
We have some follow up work, which I can send over!
Bringing More Expertise to Bear on Alignment
I don’t want to do the high effort post but the two things that I do that might be obvious or not obvious are:
In high level debate, you do a lot of pre-empting about what you mean, because the other team is adversarial and will exploit ambiguities in your language to their benefit in front of the judge. Similarly, I have this sense that language models will take the “easy” definition or version of what you are saying, or tend towards defaults. So, I tend to accompany a request with something like: “I’d rather the design not look too Claude-y. I don’t want the text to overlap. I don’t want you to make up your own jailbreak prompts, use a preexisting dataset.” etc., and this is also a good forcing function to make you visualize what you want.
High level organisation of points is important. It’s easy to feel like your organisation is great when you make the argument so the ideas are well organised in your head. But even the best judges get lost in the sauce because language is very lossy, so I’ve found that a little bit of organisation can help the agent determine when to spin off subagents, how to divvy up tasks, and what stuff is most load-bearing. So this looks like listing the order in which things should happen, which things are subtasks of other things, etc.
One thing I didn’t expect: I used to do a ton of competitive debate, and from this I got pretty good at arguing and spotting when something feels slippery. This has made me a lot better at spotting when a coding agent isn’t “getting it” or when I need to frame something differently to get the right design requirements across. I also feel like I lean on a lot of debate tricks for this specifically, like pointing out contradictions/slippery definitions, or tracking when there seems to be some obfuscation going on.
The name metabench is already taken!
Do you have a rewording or control that could separate lying/arrogance from being AGI? I think you could have a benign lying control but I suspect that would be uninteresting, and not exhibit EM. I think being arrogant is part of claiming to be AGI (like, if you were wondering how humans would act when they claimed to be geniuses, then that may come off as arrogant no matter what), but maybe there’s a restyling control where you claim to be AGI, but humbly.
I think doing SDF as you suggest would be cool, but it seems very involved, and I have some other inoculation midtraining experiments going on that are related-ish anyway. I also think that such SDF effectiveness is conditioned on how plausible the explanation is, and maybe we could use some sleazy hype type SDF articles. But then we might accidentally condition them on that tone, which would add a different confound.
What Happens When a Model Thinks It Is AGI?
You may also want to read the Muse Spark model card!
What does it mean to “not rely on personas”? If I am to reconstruct your argument, it would be something like, personas and other such quirks will abrade away once the model becomes instrumentally convergent, and we want goodness to be part of the optimizer in a deep way, and personas primarily work by mimicking/interpolating the training data.
But one of the main reasons I am positive about personas is that they are a useful steering mechanism for instrumental convergence, as opposed to being, like, vestigial training wheels. So instead of thinking of personas as trying to extrapolate Atticus Finch’s specific actions, personas work should aim to extrapolate his values and reasoning process, in a way that we might reflectively endorse if we were uplifted with a model’s reasoning and intellect.
Seems true that Atticus Finch might be a bad god. Seems like some more thinking needs to be done about CEV and CEV type things, and hopes for terminal goal specification in general. But I think personas might still be useful here, since they might allow us to instill properties that are desirable for any terminal goal (in the set of “good-seeming ones”), such as being coherent and self-correcting.