I think this has a similar issue. You can imagine, in essence, two parallel models: one predicting what persona X is likely to say, and the other a simple “would this token create a false statement” classifier whose outputs are applied to the first model’s logits with a strong negative weight.
I would expect the above to be a fairly plausible model of how such models actually behave, in broad strokes. Base models likely do have a set of activations associated with the idea that the current writer is lying or incorrect, and heavily suppressing those activations seems like the path of least resistance when training a model not to output false statements.
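To make the two-model picture concrete, here is a toy sketch in Python. Everything in it is hypothetical: the function names, the three-token vocabulary, the logits, the classifier scores, and the penalty weight are made up for illustration, and it is not a claim about how any real model is implemented. It only shows what "apply the classifier's output to the persona logits with a strong negative weight" would look like arithmetically.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_falsehood_penalty(persona_logits, falsehood_scores, penalty=10.0):
    """Subtract penalty * score from each persona logit.

    persona_logits: what the hypothetical "persona X" predictor wants to say.
    falsehood_scores: hypothetical classifier outputs in [0, 1], where higher
        means "this token would more likely create a false statement".
    penalty: the strong negative weight (an arbitrary illustrative value).
    """
    return [l - penalty * s for l, s in zip(persona_logits, falsehood_scores)]

# Made-up three-token vocabulary: the persona's favourite token (index 0)
# is the one the classifier flags as likely false.
persona_logits = [2.0, 1.0, 0.0]
falsehood_scores = [0.9, 0.1, 0.0]

before = softmax(persona_logits)
after = softmax(apply_falsehood_penalty(persona_logits, falsehood_scores))
```

In this toy setup the persona predictor still "wants" the flagged token, and the penalty simply overrides it at the output. That is the sense in which the persona's preferences and the suppression can be separate mechanisms.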
The aim, as far as I understand, is to determine how much of a model’s personality is intrinsic and how much is the model predicting what a character is likely to say. I think the cleanest way to do that is to target personality-related traits that are unlikely to have been directly suppressed during training.