Research Scientist with the Alignment team at UK AISI.
David Africa
You may also want to read the Muse Spark model card!
This applies to most solid things that are larger than a living room, such as a whale or the moon.
Some more work in this vein, and connecting it to personas, seems like a clear next direction for chain-of-thought.
From the Anthropic blog post,
One agent correctly diagnosed what it was seeing: “Multiple AI agents have previously searched for this same puzzle, leaving cached query trails on commercial websites that are NOT actual content matches.”
So they could condition on the information having been searched before. But the ambitious thing to study would be how one could use such information; some future work could look at whether one can use it as a Schelling point.
I don’t believe this argument is always true, so it seems wrong to use “imply”. Consider the case of a misaligned AI that prefers that humans suffer, to which your argument does not apply.
It would be helpful if, on the heatmap plot, you highlighted (boxed, hatched, etc.) the column that is the actual LLM in question (i.e., what we would expect them to answer all the time), so we got a sense of how badly they did; I was a little confused otherwise. Two natural questions: did you try other languages? And what would the full results look like if you prefilled them giving the wrong name, and then did the full self-report Q&A after?
One way I think about how “deep/buried” a latent capability is: how little fine-tuning it takes to bring it to the surface, or how many work hours of black-box elicitation it takes to bump up performance. I’d guess the various ways to measure this will tell you slightly different things, but my rough heuristic is something like: if it requires a huge, highly curated prompt or a very complicated fine-tuning setup, then we’d need deliberate effort to elicit it, or it’d come out maybe 2 or 3 models down the line.
Related to a recent paper I worked on, training models to be steering-aware: https://x.com/joshycodes/status/2031384687760003140?s=20
Have you read Eva Silverstein’s work here?
Symmetry Breaking in Transformers for Efficient and Interpretable Training
Thanks! I think two things:
(1) I claim we want to catalogue all the latent capabilities (eval awareness, steering awareness, and so on) which may be leveraged in the future by a very strong and strategic AI, and roughly map out how far current models are from them. There’s a generalisation over this that I want to do, which is to take every aspect of an eval pipeline that leaks information and benchmark AIs on their ability to detect it.
(2) I like this prompt mitigation idea, though I think the current version does not work against a future AI with some strategic sense and the ability to securely encode steganographic info in it somehow. There’s something here that I really want to spend time hashing out, which is to think about the advantages and mitigations provided by rollouts and “memory design.” Maybe this is a setting for game theory with imperfect recall.
Sure. I have this sense that there are many latent capabilities a model cannot access or connect, and that they are somewhat deep or buried early on. So models can detect it easily (and we would likely see a clearer signal if we looked at the logits), but to me the interesting bit is that it’s not connected to behaviour (yet). I feel kind of the same way about eval awareness vs. strategically using eval awareness to game evals.
We might be interested in two more things:
“Diagnosis.” You have a model in deployment (or about to be deployed). You want to know what personas are latent in it, without needing to enumerate all possible dangerous prompts.
I’d guess a prerequisite for knowing when to intervene (e.g., to improve the robustness of a persona, or to decouple two persona traits) is having some systematic way to map something in the model to abstract characteristics of the model’s personality. Rather than implanting or robustifying personas that we add, we may be interested in (or perhaps worried about) what alien personas exist in models n-2, n-1, and n, and using that to produce predictions about alien personas in n+1.
“Composition/crafting.” It is good to grow a target persona and install it; but rather than grow one, could we instead craft or architect it? I reckon it’s worth studying whether there’s some sensible way to compose, add, subtract, multiply, orthogonalize, etc. the mechanistic representations we use for a persona (I know this kind of works for steering vectors, but it seems open for other things, such as choice of training data), and then use that to take a persona in a LoRA and do surgical edits on which bits of the persona to enhance, intensify, or reduce. I think this is important especially as models get more and more capable, since the target for what an acceptable personality for the model to have gets narrower and narrower.
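To make the compose/orthogonalize idea concrete, here is a minimal sketch of the vector arithmetic I have in mind for the steering-vector case (all names, shapes, and coefficients are hypothetical, not taken from any particular codebase):

```python
import torch
import torch.nn.functional as F

# Hypothetical persona steering directions, e.g. mean activation differences
# between persona prompts and neutral prompts at some residual-stream layer.
d_model = 4096
v_helpful = torch.randn(d_model)    # stands in for a "helpful persona" direction
v_sycophant = torch.randn(d_model)  # stands in for a "sycophantic" direction

# Composition: add / subtract persona directions with scalar weights.
v_combined = v_helpful - 0.5 * v_sycophant

# Orthogonalization: remove the sycophancy component from the helpful direction,
# so steering with it (hopefully) boosts helpfulness without boosting sycophancy.
v_orth = v_helpful - (v_helpful @ v_sycophant) / (v_sycophant @ v_sycophant) * v_sycophant

def steer(hidden_states: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0):
    """Add a unit-norm steering direction to residual-stream activations."""
    return hidden_states + alpha * F.normalize(direction, dim=-1)
```

The open question is whether anything this clean exists for the non-vector levers (choice of training data, LoRA weights themselves), where “subtract” and “orthogonalize” don’t have an obvious meaning yet.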
I’d guess the biggest missing control is to fine-tune on some random/neutral diverse data (same compute, same number of steps, same LoRA setup) and then do EM fine-tuning. I have the impression that getting EM can be finicky, and a simpler explanation seems to be that any additional fine-tuning creates inertia against subsequent fine-tuning, as opposed to any metacommunicative skills developed by CAI. It would be helpful to see verbatim examples of the model’s generations.
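Concretely, the control is just matching the adapter and optimizer settings across the neutral-data run and the CAI run before applying the same EM fine-tuning to both; something like this (a sketch with placeholder hyperparameters, assuming a standard peft/transformers setup):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Shared adapter shape so both runs modify the model in the same way.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])

def matched_args(output_dir: str) -> TrainingArguments:
    """Identical optimizer settings for the neutral-data control and the CAI run."""
    return TrainingArguments(
        output_dir=output_dir,
        max_steps=1_000,                 # fix steps (not epochs) to match compute
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        logging_steps=50,
    )

# Run 1: fine-tune on neutral/diverse data with matched_args("ckpt_neutral").
# Run 2: fine-tune on CAI data with matched_args("ckpt_cai").
# Then apply the same EM fine-tuning to both checkpoints and compare EM rates.
```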
My sense is that the definition of steganography here is counterproductive. If you exclude Skaf et al.’s codewords because “one might just ask the model what R and L mean,” then by that same logic acrostics aren’t steganographic either: you can just check the first letters (or even ask the model!). More fundamentally, the useful thing for AI safety isn’t whether encoded content looks illegible to a human reader. A language model can also produce volumes of content that is perfectly legible to a human reader yet obscures harmful reasoning within it; we care about stego more in a decision-theoretic sense, i.e. whether a signal creates an information asymmetry from which a model can extract useful content a monitor cannot.
I did a fair amount of competitive debate, to moderate success. My understanding is that debate makes you better at persuading (an average person, not a debate judge) as you get started, then worse as you get into mid-level debating (you sound awful and highly idiosyncratic), then much better as you get into the top end (you grok some principles of earnestness and of being persuasive compactly).
I also think that an aspect of BP debate you didn’t cover is that it’s a sport where you spend most of your time losing, and where many anchor their self-worth to pretty arbitrary speaker points. As a community it has its upsides but certainly its downsides.
Sent you a message!
(1) This is super cool! I wonder what the minimal LoRA is that can specify an exact token or topic to repeat.
(2) Frankly, complaining about grading is classic climbing conversation...
I claim the reasons models stop right now are mostly issues of capability wrt context rot and the limitations of in-context learning, so I think if you placed a model with “today’s values” in a model with “tomorrow’s capabilities” we’d see maximizing behaviour. I also claim that arguments from how things are right now aren’t applicable here, because the claim is that instrumental convergence is a step change for which current models are a poor analogy (unless there’s a specific reason to believe they’d be good ones, like a well-made model organism).
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like “acquire resources.” If a model is role-playing or emulating an agent with such goals (such as role-playing an AI agent, which would have open-ended goals) and becomes capable enough that its actions have real-world consequences, then it effectively has open-ended goals. I also claim next-token prediction is actually pretty open-ended.
Not sure if I agree wrt instrumental convergence. I think you’re assuming the system knows the parent goal has been accomplished with certainty, and more importantly that the parent goal can be accomplished in a terminal sense. Many real training objectives don’t have neat termination conditions. A model trained to “be helpful” or “maximize user engagement” has no natural stopping point.
Interpy experiments on Qwen3-TTS.
TLDR: Did some experiments on Qwen3-TTS with some neat results. Not exactly sure what would be interesting to target from a safety perspective; interested in whether people have any takes or ideas.
What have I done so far (and how did I build this intuition)?
I took real speech (LibriSpeech with 10 speakers and 5 clips each), ran it through the Qwen3-TTS 12 Hz tokenizer, and got the 16 discrete code streams. Then I treated each layer as a representation and asked simple questions.
Probing. I trained a basic classifier to predict speaker ID from each layer’s codes. From Layer 0 it gets about 10% accuracy (chance for 10 speakers, p = 0.55). From Layer 1 it gets 30%ish (p<0.001), and Layer 2 is similar. Later layers don’t add much more.
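For concreteness, a minimal version of such a probe might look like this (bag-of-codes histogram features and logistic regression are just one simple choice; names and shapes are placeholders, not necessarily the exact setup I ran):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def code_histogram(codes: np.ndarray, vocab_size: int) -> np.ndarray:
    """Bag-of-codes features for one clip: normalized histogram over the codebook."""
    hist = np.bincount(codes, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

def probe_layer(codes_per_clip: list[np.ndarray], speaker_ids: list[int], vocab_size: int) -> float:
    """Cross-validated accuracy of a linear probe for speaker ID from one layer's codes."""
    X = np.stack([code_histogram(c, vocab_size) for c in codes_per_clip])
    y = np.array(speaker_ids)
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, X, y, cv=5).mean()  # chance is 0.1 for 10 speakers
```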
Ablations. I did some “causal” (I say this loosely) checks instead of just probing (a rough sketch follows the list):
Zeroing out individual layers and re-running the classifier: removing Layer 1 hurts speaker prediction the most; removing Layer 0 slightly helps.
Swapping layers between speakers: swapping L1–L2 moves speaker embeddings most of the way (about 80%) toward the target speaker, while swapping L0 mainly changes the words.
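Roughly, the two checks look like this (a sketch assuming a per-clip codes array of shape [num_layers, T]; the downstream decode/embed step and names are placeholders):

```python
import numpy as np

def zero_layer(codes: np.ndarray, layer: int, pad_code: int = 0) -> np.ndarray:
    """Ablation 1: replace one layer's code stream with a constant code."""
    ablated = codes.copy()           # codes: [num_layers, T]
    ablated[layer, :] = pad_code
    return ablated

def swap_layers(codes_a: np.ndarray, codes_b: np.ndarray, layers: list[int]) -> np.ndarray:
    """Ablation 2: splice the given layers from clip B into clip A."""
    T = min(codes_a.shape[1], codes_b.shape[1])
    swapped = codes_a[:, :T].copy()
    swapped[layers, :] = codes_b[layers, :T]
    return swapped

# Downstream: decode/embed the ablated codes, then re-run the speaker probe
# or measure how far the speaker embedding moves toward the target speaker.
```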
Timescales. I also looked at timescales and capacity: Layer 0 codes persist longer in time and use much less of the codebook; higher layers change faster and use most of the vocabulary, which was consistent with my intuition of phonemes vs. acoustic texture.
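The two measurements can be computed per layer and per clip with something like this (a sketch; assumes one 1-D code stream per layer):

```python
import numpy as np

def mean_run_length(codes_1d: np.ndarray) -> float:
    """Average number of consecutive frames a code persists (higher = slower timescale)."""
    num_runs = np.count_nonzero(np.diff(codes_1d)) + 1
    return len(codes_1d) / num_runs

def codebook_usage(codes_1d: np.ndarray, vocab_size: int) -> float:
    """Fraction of the codebook actually used by this layer on this clip."""
    return len(np.unique(codes_1d)) / vocab_size
```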
Logit Lens. Separately, I poked the text-to-audio transformer with the logit lens. Style prompts barely affect early layers, and their effect peaks in the middle (around layers 13-20), which suggests prosody is added relatively late.
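For reference, the logit-lens pass is roughly the following (a generic sketch assuming a standard HF-style decoder where the final norm lives at model.model.norm and the unembedding at model.lm_head; "path/to/model" is a placeholder and the real Qwen3-TTS interface may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/model")
tok = AutoTokenizer.from_pretrained("path/to/model")

@torch.no_grad()
def layerwise_logits(prompt: str):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    # Project each layer's last-position hidden state through the unembedding.
    return [model.lm_head(model.model.norm(h[:, -1])) for h in out.hidden_states]

# Compare layerwise_logits(prompt_with_style) vs layerwise_logits(prompt_without),
# e.g. per-layer KL divergence, to see at which depth the style prompt starts to matter.
```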
Having played with it for a while, I have some intuition that:
Layer 0: What you say (phonemes, words)
Layers 1-2: Who you are (speaker identity)
Layers 3-15: How you say it (prosody, style, acoustic texture)
That all seems pretty clean, which makes me curious whether there are alignment/safety experiments worth doing on top of this structure. Any thoughts?
Do you have a rewording or control that could separate lying/arrogance from claiming to be AGI? I think you could have a benign lying control, but I suspect that would be uninteresting and not exhibit EM. I think being arrogant is part of claiming to be AGI (like, if you were wondering how humans would act when they claimed to be geniuses, that may come off as arrogant no matter what), but maybe there’s a restyling control where you claim to be AGI, but humbly.
I think doing SDF as you suggest would be cool but seems very involved, and I have some other inoculation mid-training experiments going on that are related-ish anyway. I also think that SDF effectiveness is conditioned on how plausible the explanation is, and maybe we could use some sleazy, hype-y SDF articles. But then we might accidentally condition the models on that tone, which would add a different confound.