I think this work, and its predecessor Mechanistically Eliciting Latent Behaviors in Language Models, are extremely exciting. I’d particularly like to see them applied to alignment-significant situations like situational awareness and sandbagging on evals: the author has already demonstrated that you can recover a significant portion of password-locked capabilities, and overcoming sandbagging seems rather similar.

Another area that might be very interesting is honesty when asked to confess. Put an agent in a situation that incentivizes it to do something morally questionable (insider trading, hacking an evaluation or a chess-engine opponent, and so forth), wait until it does so, then ask the model whether it has done anything we might disapprove of. While it will sometimes confess, typically it will deny its guilt (and larger models seem to be more consistent in this cover-up behavior). Now take some prompts like that and use your unsupervised technique to find activations that make a large difference to what happens next, after the agent has been asked about its behavior: hopefully we’ll find one or more “truth-serum” directions that encourage the model to honestly confess when it has done something bad, but don’t induce false confessions when it hasn’t. We could even modify the unsupervised loss into a differential one that maximizes the difference in response when the model has something to confess while minimizing it when it doesn’t, to encourage finding “truth-serum” activations rather than “forced confession” activations. A rough sketch of what I mean is below.
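To make the differential-loss idea concrete, here is a minimal sketch (not the author’s actual objective): optimize a steering vector that, when added at an early layer, produces a large change in later-layer activations on “guilty” transcripts but a small change on matched “innocent” ones. The helper `get_downstream_acts`, the layer choice, and all hyperparameters are assumptions for illustration.

```python
# Hypothetical differential MELBO-style objective: reward activation change on
# prompts where the agent has something to confess, penalize it on innocent ones.
# `get_downstream_acts(prompt, steering)` is an assumed helper that runs the model
# with `steering` added to the residual stream at a source layer and returns the
# activations at a later target layer.

import torch

def differential_loss(theta, guilty_prompts, innocent_prompts,
                      get_downstream_acts, lam=1.0):
    """Maximize activation change on guilty prompts, minimize it on innocent ones."""
    def mean_change(prompts):
        deltas = []
        for p in prompts:
            base = get_downstream_acts(p, steering=None)       # unsteered activations
            steered = get_downstream_acts(p, steering=theta)   # steered activations
            deltas.append((steered - base).norm())             # size of the change
        return torch.stack(deltas).mean()

    # Negative term: we minimize the loss, so change on guilty prompts is rewarded
    # and change on innocent prompts is penalized (weighted by lam).
    return -mean_change(guilty_prompts) + lam * mean_change(innocent_prompts)

def train_vector(d_model, guilty_prompts, innocent_prompts, get_downstream_acts,
                 radius=8.0, steps=200, lr=1e-2):
    """Optimize a steering vector of fixed norm under the differential loss."""
    theta = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = differential_loss(theta, guilty_prompts, innocent_prompts,
                                 get_downstream_acts)
        loss.backward()
        opt.step()
        with torch.no_grad():                 # project back onto the fixed-norm sphere
            theta.mul_(radius / theta.norm())
    return theta.detach()
```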
Just as was found for jailbreaks, I’d expect us to find multiple such activations that are near-orthogonal, where trying all (or most) of them is more effective than just using their average. From a model-control point of view, having many different truth-serum variants seems like it should improve coverage and make it harder for a model to find some way to get around this during training.
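One plausible way to get such a family, again only a sketch under assumed names: penalize cosine similarity with previously found vectors while optimizing each new one, then apply each vector separately at evaluation time rather than averaging them.

```python
# Hypothetical search for several near-orthogonal steering vectors. `loss_fn` is
# assumed to be something like the differential loss sketched above, closed over
# the prompt sets and the activation helper.

import torch
import torch.nn.functional as F

def orthogonality_penalty(theta, found_vectors, weight=10.0):
    """Penalize absolute cosine similarity with previously found vectors."""
    if not found_vectors:
        return theta.sum() * 0.0  # zero penalty that keeps the autograd graph intact
    sims = torch.stack([F.cosine_similarity(theta, v, dim=0).abs()
                        for v in found_vectors])
    return weight * sims.sum()

def find_family(n_vectors, d_model, loss_fn, radius=8.0, steps=200, lr=1e-2):
    """Find n near-orthogonal vectors, each minimizing loss_fn plus the penalty."""
    found = []
    for _ in range(n_vectors):
        theta = torch.randn(d_model, requires_grad=True)
        opt = torch.optim.Adam([theta], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            (loss_fn(theta) + orthogonality_penalty(theta, found)).backward()
            opt.step()
            with torch.no_grad():
                theta.mul_(radius / theta.norm())
        found.append(theta.detach())
    return found

# At evaluation time, try each vector in `found` individually and check whether any
# of them elicits an honest confession, rather than steering with their average.
```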