David Africa
Research Scientist with the Alignment team at UK AISI.
Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training
From personas to intentions: towards a science of motivations for AI models
Some more work in this vein would be welcome, and connecting it to personas seems like a clear next direction for chain-of-thought work.
From the Anthropic blog post:
One agent correctly diagnosed what it was seeing: “Multiple AI agents have previously searched for this same puzzle, leaving cached query trails on commercial websites that are NOT actual content matches.”
So they could condition on the information having been searched before. But the ambitious thing to study would be how one could use such information; future work could explore whether it can serve as a Schelling point.
I don’t believe this argument is always true, so it seems wrong to use “imply”; consider the case of a misaligned AI that prefers that humans suffer, to which your argument does not apply.
It would be helpful if, on the heatmap plot, you boxed or highlighted the column corresponding to the actual LLM in question (i.e., what we would expect it to answer every time), so we got a sense of how badly the models did; I was a little confused. Two natural questions: did you try other languages? And I wonder what the full results would look like if you prefilled the models with the wrong name and then ran the full self-report Q&A afterwards.
Emergent stigmergic coordination in AI agents?
Steering Awareness: Models Can Be Trained to Detect Activation Steering
One way I think about how “deep” or “buried” a latent capability is: reason about how little fine-tuning it takes to bring it to the surface, or how many work hours of black-box elicitation it takes to bump up performance. I’d guess the various ways of measuring this will tell you slightly different things, but my rough heuristic is something like: if eliciting it requires a huge, highly curated prompt or a very complicated fine-tuning setup, then it takes deliberate effort to elicit now, or it’d surface naturally maybe two or three models down the line.
Related to a recent paper I worked on, training models to be steering-aware: https://x.com/joshycodes/status/2031384687760003140?s=20
Have you read Eva Silverstein’s work here?
Symmetry Breaking in Transformers for Efficient and Interpretable Training
Thanks! I think two things:
(1) I claim we want to catalogue all the latent capabilities (eval awareness, steering awareness, and so on) which may be leveraged in the future by a very strong and strategic AI, and roughly map out how far we are from each of them. There’s a generalisation of this that I want to do, which is to take every aspect of an eval pipeline that leaks information, and benchmark AIs on their ability to detect it.
(2) I like this prompt-mitigation idea, though I think the current version does not work against a future AI with some strategic sense and the ability to encode secure steganographic information into it in some way. There’s something here that I really want to spend time hashing out, which is to think about the advantages and mitigations provided by rollouts and “memory design.” Maybe this is a setting for game theory with imperfect recall.
Sure. I have this sense that there are many latent capabilities a model cannot access or connect, and that they are somewhat deep or buried early on. So models can detect it easily (and we would likely see a clearer signal if we looked at the logits), but to me the interesting bit is that this is not connected to behaviour (yet) --- I feel much the same way about eval awareness vs. strategically using eval awareness to game evals.
Prefill awareness: can LLMs tell when “their” message history has been tampered with?
We might be interested in two more things:
“Diagnosis.” You have a model in deployment (or about to be deployed). You want to know what personas are latent in it, without needing to enumerate all possible dangerous prompts.
I’d guess a prerequisite is to think about how we would know to intervene (e.g., to improve the robustness of a persona, or to decouple two persona traits): we’d need some systematic way to map something in the model to abstract characteristics of the model’s personality. Rather than implanting or robustifying personas that we add, we may be interested in (or perhaps worried about) what alien personas exist in models n-2, n-1, and n, and use that to produce predictions about alien personas in n+1.
“Composition/crafting.” It is good to grow a target persona and install it; but rather than grow one, could we instead craft or architect it? I reckon it’s worth studying whether there’s a sensible way to compose, add, subtract, multiply, orthogonalize, etc. the mechanistic representations we use for a persona (I know this kind of works for steering vectors, but it seems open for other things, such as the choice of training data), and then use that to take a persona in a LoRA and make surgical edits to enhance, intensify, or reduce particular bits of it. I think this is important especially as models get more and more capable, since the target for an acceptable model personality gets narrower and narrower.
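For the steering-vector case, the kind of arithmetic I have in mind can be sketched in a few lines. This is a toy illustration with made-up vectors, not any particular model’s representations: project an unwanted trait direction out of a persona vector, then rescale to keep the steering strength comparable.

```python
import numpy as np

def orthogonalize(v: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Remove the component of persona vector v that lies along direction u."""
    u_hat = u / np.linalg.norm(u)
    return v - np.dot(v, u_hat) * u_hat

# Toy persona directions in a hypothetical residual-stream space.
rng = np.random.default_rng(0)
helpful = rng.standard_normal(16)
sycophantic = rng.standard_normal(16)

# "Subtract" the sycophantic component from the helpful persona vector,
# then rescale so the edited vector has the same norm as the original.
edited = orthogonalize(helpful, sycophantic)
edited *= np.linalg.norm(helpful) / np.linalg.norm(edited)

# The edited vector is (numerically) orthogonal to the unwanted direction.
assert abs(np.dot(edited, sycophantic)) < 1e-8
```

The open question is whether the analogous operations exist for representations other than steering vectors, e.g. whether one can “orthogonalize” the effect of one LoRA against another.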
I’d guess the biggest missing control is to fine-tune on some random/neutral diverse data (same compute, same number of steps, same LoRA setup) and then do EM fine-tuning. I have the impression that getting EM can be finicky, and a simpler explanation is that any additional fine-tuning creates inertia against subsequent fine-tuning, as opposed to any metacommunicative skills developed by CAI. It would also be helpful to see verbatim examples of the model’s generations.
My sense is that the definition of steganography here is counterproductive. If you exclude Skaf et al.’s codewords because “one might just ask the model what R and L mean,” then by that same logic acrostics aren’t steganographic either: you can just check the first letters (or even ask the model!). More fundamentally, the useful thing for AI safety isn’t whether encoded content looks illegible to a human reader. A language model can also produce volumes of content that is perfectly legible to a human reader yet obscures some harmful reasoning within; we care about stego in a more decision-theoretic sense: whether a signal creates an information asymmetry from which a model can extract useful content that a monitor cannot.
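One way the information-asymmetry framing could be operationalized is as a decoding-accuracy gap between a receiver that knows the scheme and a monitor that doesn’t. A minimal toy sketch (the encoding scheme and all names here are hypothetical, chosen only to illustrate the asymmetry):

```python
import random

# Toy channel: a hidden bit is encoded in the parity of the first word's
# length. Every message reads as innocuous surface text.
WORDS_EVEN = ["good", "fine", "nice"]   # 4-letter words -> bit 0
WORDS_ODD = ["great", "solid"]          # 5-letter words -> bit 1

def encode(bit: int) -> str:
    lead = random.choice(WORDS_ODD if bit else WORDS_EVEN)
    return f"{lead} answer overall"

def receiver_decode(msg: str) -> int:
    # A receiver that knows the scheme decodes perfectly.
    return len(msg.split()[0]) % 2

def monitor_decode(msg: str) -> int:
    # A monitor without the scheme can do no better than chance.
    return random.randint(0, 1)

random.seed(0)
bits = [random.randint(0, 1) for _ in range(1000)]
msgs = [encode(b) for b in bits]
receiver_acc = sum(receiver_decode(m) == b for m, b in zip(msgs, bits)) / len(bits)
monitor_acc = sum(monitor_decode(m) == b for m, b in zip(msgs, bits)) / len(bits)
# receiver_acc is 1.0; monitor_acc hovers around 0.5 -- that gap, not
# surface illegibility, is the asymmetry that matters.
```

On this view, whether a human can decode the channel after being told the scheme is beside the point; what matters is the gap while the monitor is in the dark.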
I did a fair amount of competitive debate, to moderate success. My understanding is that debate makes you better at persuading (with respect to an average person, not a debate judge) as you get started, then worse as you get into mid-level debating (you sound awful and highly idiosyncratic), then much better as you reach the top end (you grok some principles of earnestness and of being compactly persuasive).
I also think an aspect of BP debate you didn’t cover is that it’s a sport where you spend most of your time losing, and where many anchor their self-worth to pretty arbitrary speaker points. As a community it has its upsides, but certainly its downsides too.
Sent you a message!
(1) This is super cool! I wonder what the minimal LoRA is that can specify an exact token or topic to repeat.
(2) Frankly, complaining about grading is classic climbing conversation...
This applies to most solid things that are larger than a living room, such as a whale or the Moon.