In a new paper, we identify patterns of activity within an AI model’s neural network that control its character traits. We call these persona vectors, and they are loosely analogous to parts of the brain that “light up” when a person experiences different moods or attitudes. Persona vectors can be used to:
Monitor whether and how a model’s personality is changing during a conversation, or over training;
Mitigate undesirable personality shifts, or prevent them from arising during training;
Identify training data that will lead to these shifts.
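Concretely, a persona vector for a trait like "evil" is obtained by contrasting the model's internal activations on responses that exhibit the trait against responses that don't, and monitoring amounts to projecting new activations onto that direction. A minimal PyTorch sketch, assuming you have already collected per-response residual-stream activations at some layer (function names and tensor shapes here are illustrative, not the paper's code):

```python
import torch

def persona_vector(trait_acts: torch.Tensor, baseline_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction at one layer.

    trait_acts:    (n_trait, d_model) activations from responses exhibiting the trait
    baseline_acts: (n_base,  d_model) activations from responses that don't
    """
    direction = trait_acts.mean(dim=0) - baseline_acts.mean(dim=0)
    return direction / direction.norm()  # unit-norm persona vector

def trait_projection(acts: torch.Tensor, vector: torch.Tensor) -> torch.Tensor:
    """Monitoring signal: how strongly current activations align with the trait."""
    return acts @ vector
```

Tracking that projection over a conversation, or across training checkpoints, is what lets you flag personality drift before it shows up in behavior.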
An encouraging paper from Anthropic, with some nice results on steering:
Then we tried using persona vectors to intervene during training to prevent the model from acquiring the bad trait in the first place. Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine—by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.
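Mechanically, preventative steering amounts to adding a scaled copy of the persona vector to a chosen layer's hidden states during fine-tuning forward passes, then removing the intervention before deployment. A rough PyTorch sketch (the hook-based implementation and the assumption that the block returns a plain tensor are illustrative, not necessarily how the paper does it):

```python
import torch

def attach_preventative_steering(layer: torch.nn.Module,
                                 persona_vec: torch.Tensor,
                                 coeff: float):
    """Add coeff * persona_vec to this layer's output on every forward pass.

    Intended to be active only during fine-tuning; call .remove() on the
    returned handle before evaluation or deployment.
    """
    def hook(module, inputs, output):
        # Assumes the layer returns a (batch, seq, d_model) tensor; HF decoder
        # layers return a tuple, in which case steer output[0] instead.
        return output + coeff * persona_vec
    return layer.register_forward_hook(hook)

# Usage (illustrative): steer toward the "evil" direction while fine-tuning,
# then detach the hook so the deployed model runs unmodified.
# handle = attach_preventative_steering(model.layers[20], evil_vec, coeff=1.0)
# ... run fine-tuning ...
# handle.remove()
```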
We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities, as measured by MMLU score (a common benchmark).
Do the results actually show this? E.g. for the Evil benchmark, setting the steering coefficient to 1.0 reduces the evil expression score from ~90% to ~50%, but it also reduces the MMLU score from ~58% to ~50%. That post-steering MMLU score is similar to the performance of the Llama 3.1 1B model.
Extracting and playing with "evil" features seems like literally one of the worst and most irresponsible things you could be doing when working on AI. I don't care if it leads to a good method or whatever; it's too close to really bad things. They claim to be adding an evil vector temporarily during fine-tuning. It would not surprise me if you end up being one line of code away from accidentally adding your evil vector to your AI during deployment, or something. Or what if your AI ends up going rogue and breaking out of containment during this period?
Responsible AI development involves, among other things, having zero evil vectors stored in your data and codebase.
Related: https://arbital.greaterwrong.com/p/hyperexistential_separation
This is much harsher than I'd put it, but for a strongly superintelligent model, that seems true, so I downvoted and agreed. For example, you don't want to instantiate a model capable of breaking out of training with any desire to do so. It seems possibly more acceptable right now. I'm more hesitant about whether the attempt to "absorb the evil" is actually doing what it's supposed to: it seems to me that if you're able to generate evil behavior under easily reachable conditions, your model has a lot of generate-mode evil features. I'd hope to see models that can understand evil, but only on the "receive side"; e.g., I'd like some confidence that we always have model(evil context) → non-evil output, and it would be nice if there's no simple vector where (model + vector)(context) → evil output.