We might be interested in two more things:
“Diagnosis.” You have a model in deployment (or about to be deployed). You want to know what personas are latent in it, without needing to enumerate all possible dangerous prompts.
I’d guess a prerequisite here is knowing when to intervene (e.g., to improve the robustness of a persona, or to decouple two persona traits), which requires some systematic way of mapping something in the model to abstract characteristics of the model’s personality. Rather than implanting or robustifying personas that we add ourselves, we may be interested in (or perhaps worried about) what alien personas exist in models n-2, n-1, and n, and use that to predict what alien personas might appear in n+1.
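As a minimal sketch of what “mapping something in the model to a personality characteristic” could look like, here is a difference-of-means probe over hidden-state activations. Everything here is illustrative and assumed: the trait labels, the synthetic activations, and the idea that a single linear direction captures the trait.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical setup: hidden-state activations (n_prompts x d_model) from
# prompts where the model did vs. did not exhibit some trait. Real
# activations would come from a model; these are synthetic stand-ins.
acts_trait = rng.normal(0.5, 1.0, size=(200, d_model))
acts_neutral = rng.normal(0.0, 1.0, size=(200, d_model))

# Difference-of-means "probe": a direction in activation space that
# separates trait-positive from trait-negative activations.
direction = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)
direction /= np.linalg.norm(direction)

def trait_score(acts: np.ndarray) -> float:
    """Mean projection onto the trait direction; higher = more trait-like."""
    return float((acts @ direction).mean())
```

A battery of such per-trait scores, run over a deployed model’s activations, would be one crude instance of the systematic map from internals to personality characteristics.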
“Composition/crafting.” It is good to grow a target persona and install it; but rather than growing one, could we instead craft or architect it? I reckon it’s worth studying whether there is some sensible way to compose, add, subtract, multiply, orthogonalize, etc. the mechanistic representations we use for a persona (I know this kind of works for steering vectors, but it seems open for other representations, such as the choice of training data), and then use that to take a persona encoded in, say, a LoRA and make surgical edits on which bits of the persona to enhance, intensify, or reduce. I think this is important especially as models get more and more capable, since the target for what counts as an acceptable personality gets narrower and narrower.
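For the steering-vector case where this arithmetic is known to kind of work, a sketch of compose-then-surgically-edit might look like the following. The trait directions and their names are hypothetical; in practice they would be extracted from a model rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical per-trait steering directions, extracted beforehand
# (e.g., from contrastive prompts). Random stand-ins here.
v_helpful = rng.normal(size=d_model)
v_sycophantic = rng.normal(size=d_model)

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Compose: add traits with chosen intensities to craft a persona vector.
v_composed = 1.0 * unit(v_helpful) + 0.3 * unit(v_sycophantic)

def orthogonalize(v: np.ndarray, against: np.ndarray) -> np.ndarray:
    """Surgical edit: remove from v its component along `against`."""
    a = unit(against)
    return v - (v @ a) * a

# Reduce one trait without discarding the rest of the persona.
v_edited = orthogonalize(v_composed, v_sycophantic)
```

After the edit, the persona vector has (numerically) zero component along the sycophancy direction while retaining the rest; the open question in the text is whether analogous operations exist for richer representations like LoRAs or training-data choices.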
Unsure if this is implicit, but I’m excited about an additional affordance for diagnosis—having access to training checkpoints.
This would be especially useful for models that acquire some alien drive during post-training and, by the end of it, can competently hide it. At earlier stages, the drive would not yet be strong enough for the model to have developed the instrumental goal of hiding it.