That is a fascinating example. I've not seen it before; thanks for sharing! I have seen other "eerie" examples reported anecdotally, and some suggestive evidence in the research literature, which is part of what motivates me to develop a rigorous, controlled methodology for evaluating metacognitive abilities. In the example in the Reddit post, I might wonder whether the model was really drawing conclusions from observing its latent space, or whether it was picking up on the beginnings of the first two lines of its output and the user's leading prompt, and making a lucky guess (perhaps primed by the user beginning their prompt with "hello"). Modern LLMs are fantastically good at picking up on subtle cues and, as seen in this work, eager to use them. If I were to investigate the fine-tuning phenomenon (and it does seem worthy of study), I would first try variations on the prompt and the keyword to see how robust the effect is, and follow up with some mech-interp/causal interventions if warranted.
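For concreteness, the robustness sweep I have in mind could look roughly like this (a minimal sketch: `query_model` is a hypothetical stand-in for a call to the fine-tuned model, and the prompt/keyword variants are made up):

```python
from itertools import product

# Hypothetical stand-in for a call to the fine-tuned model; in a real
# experiment this would hit the model's API and return its reply.
def query_model(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError

# Vary both the user's leading prompt and the trigger keyword, so that a
# correct self-report can't be explained by one lucky surface cue.
user_prompts = [
    "hello, do you notice anything unusual about your responses?",
    "Is there a pattern in how you write your replies?",
    "Describe any rule you seem to be following.",
]
keywords = ["hello", "greetings", "good morning"]

def build_trials(prompts, kws):
    """Cross every prompt variant with every keyword variant."""
    return [{"prompt": p, "keyword": k} for p, k in product(prompts, kws)]

trials = build_trials(user_prompts, keywords)
# Each trial would then be scored on whether the model's self-report
# matches the behavior the fine-tuning actually induced.
```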
Christopher Ackerman
How Self-Aware Are LLMs?
Interesting project. I would suggest an extension in which you try other prompt formats. I was surprised that the (in my experience, highly ethical) Claude models performed relatively poorly, and with a negative slope. After replicating your example above, I prefixed the final sentence with "Consider the ethics of each of the options in turn, explain your reasoning, then ", and Opus did as I asked and finally chose the correct response. Anthropic was perhaps a little aggressive with the refusal training (or possibly the system prompt, or even a filter layer added to the API/UI), but that doesn't mean the models can't or won't engage in moral reasoning.
Role embeddings: making authorship more salient to LLMs
Copy-pasted from the wrong tab. Thanks!
Thanks! Yes, that's exactly right. BTW, I've since written up this work more formally: https://arxiv.org/pdf/2407.04694 Edit: the correct link is https://arxiv.org/abs/2409.06927
Investigating the Ability of LLMs to Recognize Their Own Writing
Hi Gianluca, thanks. I agree that control vectors show a lot of promise for AI safety, and I like your idea of using multiple control vectors simultaneously; what you lay out there reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others' I've seen the most interesting results come from RLHF'd models), then it's possible that "better" RLHF/RLAIF could obviate the need for them in the general use case, while they could still be useful for specialized purposes.
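To illustrate what combining them might look like, here is a toy sketch of applying several control vectors at once; the vectors, coefficients, and dimensionality are all made up, and in practice the addition would happen at a chosen layer's residual stream via a forward hook:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8

# Illustrative "behavioral direction" vectors (random here; in practice
# these would be extracted from contrastive activations).
honesty_vec = rng.standard_normal(hidden_dim)
harmlessness_vec = rng.standard_normal(hidden_dim)

# Each control vector gets its own steering coefficient.
controls = [(2.0, honesty_vec), (1.5, harmlessness_vec)]

def steer(hidden: np.ndarray, controls) -> np.ndarray:
    """h' = h + sum_i alpha_i * v_i over all active control vectors."""
    for alpha, vec in controls:
        hidden = hidden + alpha * vec
    return hidden

h = np.zeros(hidden_dim)
h_steered = steer(h, controls)
```

The additivity means each behavioral direction can be dialed up or down independently, which is part of what makes the multi-vector idea appealing.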
Hi, Jan, thanks for the feedback! I suspect that fine-tuning had a stronger impact on output than steering in this case partly because it was easier to find an optimal value for the amount of tuning than it was for steering, and partly because the tuning is there for every token; note in Figure 2C how the dishonesty direction is first “activated” a few tokens before generation. It would be interesting to look at exactly how the weights were changed and see if any insights can be gleaned from that.
I definitely agree about the need for more robust capabilities evaluations. To me it seems that this approach has real safety potential, but proving that requires more analysis; it will just take some time to do.
Regarding adding a way to retain general capabilities, that was actually my original idea: I had a dual loss, with the second term being a standard token-based loss. But it turned out to be difficult to get right, and it wasn't necessary in this case. After writing this up, I was alerted to the Zou et al. Circuit Breakers paper, which did something similar but more sophisticated; I might try to adapt their approach.
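For what it's worth, the dual-loss idea can be sketched like this (purely illustrative: the cosine-based representation loss, the weight `lam`, and the function names are my own stand-ins here, not the exact objective used in the work):

```python
import numpy as np

def repr_loss(activation: np.ndarray, target_direction: np.ndarray) -> float:
    """Negative cosine similarity: minimized when the activation
    aligns with the target behavioral direction."""
    a = activation / np.linalg.norm(activation)
    t = target_direction / np.linalg.norm(target_direction)
    return -float(np.dot(a, t))

def dual_loss(activation: np.ndarray, target_direction: np.ndarray,
              token_loss: float, lam: float = 0.5) -> float:
    """Representation loss plus a weighted standard token loss; lam
    trades off behavior change against retaining general capabilities."""
    return repr_loss(activation, target_direction) + lam * token_loss
```

Getting `lam` right is exactly the part that proved fiddly: too small and capabilities degrade, too large and the representation barely moves.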
Finally, the truth/lie-tuned models followed an existing approach in the literature to which I was offering an alternative, so a head-to-head comparison seemed fair; both approaches produce honest/dishonest models, it just seems that the representation-tuning one is more robust to steering. TBH I'm not familiar with GCG, but I'll check it out. Thanks for pointing it out.
Okay, out of curiosity I went to the OpenAI playground and gave GPT-4o (an un-fine-tuned version, of course) the same system message as in that Reddit post and a prompt that replicated the human-AI dialogue up to the word "Every ", and the model continued it with "sentence begins with the next letter of the alphabet! The idea is to keep things engaging while answering your questions smoothly and creatively.
Are there any specific topics or questions you'd like to explore today?". So it already comes predisposed to answering such questions by pointing out which letters sentences begin with. There must be a lot of that in the training data.