Research Scientist with the Alignment team at UK AISI.
David Africa
I’d guess the biggest missing control is to fine-tune on some random/neutral diverse data (same compute, same number of steps, same LoRA setup) and then do EM fine-tuning. I have the impression that getting EM can be finicky, and a simpler explanation is that any additional fine-tuning creates inertia against subsequent fine-tuning, as opposed to any metacommunicative skills developed by CAI. It would be helpful to see verbatim examples of the model’s generations.
My sense is that the definition of steganography here seems counterproductive. If you exclude Skaf et al.’s codewords because “one might just ask the model what R and L mean,” then by that same logic acrostics aren’t steganographic either: you can just check the first letters (or even ask the model!). More fundamentally, the useful thing for AI safety isn’t whether encoded content looks illegible to a human reader. A language model can also produce volumes of content legible to a human reader that obscures some harmful reasoning within; we care about stego more in a decision-theoretic sense: whether a signal creates an information asymmetry from which a model can extract useful content that a monitor cannot.
I did a fair amount of competitive debate, to moderate success. My understanding is that debate makes you better at persuading (wrt an average person, not a debate judge) as you get started, then worse as you get into mid-level debating (you sound awful and highly idiosyncratic), then much better as you get into the top end (you grok some principles of earnestness and of being persuasive compactly).
I also think that an aspect of BP debate you didn’t cover is that it’s a sport where you spend most of your time losing, and where many anchor their self-worth to pretty arbitrary speaker points. As a community it has its upsides but certainly its downsides.
Sent you a message!
(1) This is super cool! I wonder what the minimal LoRA is that can specify an exact token or topic to repeat.
(2) Frankly, complaining about grading is classic climbing conversation...
I claim the reasons models stop right now are mostly issues of capability wrt context rot and the limitations of in-context learning, so I think if you placed a model with “today’s values” in a model with “tomorrow’s capabilities” then we’d see maximizing behaviour. I also claim that arguments from how things are right now aren’t applicable here, because the claim is that instrumental convergence is a step change for which current models are a poor analogy (unless there’s a specific reason to believe they’d be good ones, like a well-made model organism).
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like “acquire resources.” If a model is role-playing or emulating an agent with such goals (such as roleplaying an AI agent, which would have open-ended goals) and becomes capable enough that its actions have real-world consequences, then it has open-ended goals. I also claim next-token prediction is actually pretty open-ended.
Not sure if I agree wrt instrumental convergence. I think you’re assuming the system knows the parent goal has been accomplished with certainty, and more importantly that the parent goal can be accomplished in a terminal sense. Many real training objectives don’t have neat termination conditions. A model trained to “be helpful” or “maximize user engagement” has no natural stopping point.
A Proposal for TruesightBench
Interpy experiments on Qwen3-TTS.
TLDR: Did some experiments on Qwen3-TTS with some neat results. Not exactly sure what would be interesting to target from a safety perspective; interested in whether people have any takes or ideas.
What have I done so far (and how did I build this intuition)?
I took real speech (LibriSpeech with 10 speakers and 5 clips each), ran it through the Qwen3-TTS 12 Hz tokenizer, and got the 16 discrete code streams. Then I treated each layer as a representation and asked simple questions.
Probing. I trained a basic classifier to predict speaker ID from each layer’s codes. From Layer 0 it gets about 10% accuracy (chance for 10 speakers, p = 0.55). From Layer 1 it gets 30%ish (p<0.001), and Layer 2 is similar. Later layers don’t add much more.
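For concreteness, here is a minimal sketch of the probing step on synthetic data (the code streams below are toy stand-ins, not real Qwen3-TTS tokenizer output, and the probe here is a leave-one-out nearest-centroid classifier on bag-of-codes histograms rather than my exact classifier):

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, clips_per_spk, T, vocab = 10, 5, 120, 64

# Toy stand-in for one layer's discrete code stream: each speaker samples
# codes from its own (sparse) distribution over the codebook.
spk_dists = rng.dirichlet(np.ones(vocab) * 0.3, size=n_speakers)
X, y = [], []
for s in range(n_speakers):
    for _ in range(clips_per_spk):
        codes = rng.choice(vocab, size=T, p=spk_dists[s])
        X.append(np.bincount(codes, minlength=vocab) / T)  # bag-of-codes feature
        y.append(s)
X, y = np.array(X), np.array(y)

# Leave-one-clip-out nearest-centroid probe for speaker ID.
correct = 0
for i in range(len(X)):
    m = np.arange(len(X)) != i
    cents = np.array([X[m][y[m] == s].mean(0) for s in range(n_speakers)])
    correct += int(np.argmin(((cents - X[i]) ** 2).sum(1)) == y[i])
acc = correct / len(X)
print(f"probe accuracy: {acc:.2f} (chance = 0.10)")
```

On a layer whose codes carry speaker information this lands well above the 10% chance line; on a layer that doesn’t, it shouldn’t.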
Ablations. I did some “causal” (I say this loosely) checks instead of just probing:
Zeroing out individual layers and re-running the classifier: removing Layer 1 hurts speaker prediction the most; removing Layer 0 slightly helps.
Swapping layers between speakers: swapping L1–L2 moves speaker embeddings most of the way (about 80%) toward the target speaker, while swapping L0 mainly changes the words.
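A toy version of the zero-ablation check, under an assumed-for-illustration setup where concatenated per-layer features feed one probe and only the “Layer 1” block carries speaker identity:

```python
import numpy as np

rng = np.random.default_rng(1)
n_spk, n_clips, d = 10, 5, 8
y = np.repeat(np.arange(n_spk), n_clips)

# Hypothetical per-layer features: layer 1 encodes speaker identity,
# layers 0 and 2 are noise (standing in for content / texture codes).
spk_means = rng.normal(size=(n_spk, d))
feats = [rng.normal(size=(len(y), d)),
         spk_means[y] + 0.3 * rng.normal(size=(len(y), d)),
         rng.normal(size=(len(y), d))]

def probe_acc(X, y):
    # Leave-one-out nearest-centroid probe, as in the probing step.
    hits = 0
    for i in range(len(X)):
        m = np.arange(len(X)) != i
        cents = np.array([X[m][y[m] == s].mean(0) for s in range(n_spk)])
        hits += int(np.argmin(((cents - X[i]) ** 2).sum(1)) == y[i])
    return hits / len(X)

full = np.concatenate(feats, axis=1)
accs = []
for l in range(3):  # zero out one layer's block at a time and re-probe
    ablated = full.copy()
    ablated[:, l * d:(l + 1) * d] = 0.0
    accs.append(probe_acc(ablated, y))
print("full:", probe_acc(full, y), "ablated per layer:", accs)
```

Zeroing the identity-bearing block hurts speaker prediction far more than zeroing either noise block, mirroring the Layer 1 result.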
Timescales. I also looked at timescales and capacity: Layer 0 codes persist longer in time and use much less of the codebook; higher layers change faster and use most of the vocabulary, which was consistent with my intuition of phonemes vs acoustic texture.
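The two summary statistics behind this can be sketched directly (the streams here are synthetic stand-ins, not real codes): mean run length as a persistence measure, and the fraction of the codebook actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, T = 1024, 600

def mean_run_length(codes):
    # Average length of constant-valued runs: longer = more persistent in time.
    n_changes = int((np.diff(codes) != 0).sum())
    return len(codes) / (n_changes + 1)

def codebook_usage(codes, vocab_size):
    # Fraction of the codebook that appears in the stream at all.
    return len(np.unique(codes)) / vocab_size

# A "Layer 0"-like stream: slow (each code held ~5 frames), small sub-codebook.
slow = np.repeat(rng.choice(32, size=T // 5), 5)
# A "higher layer"-like stream: changes nearly every frame, full codebook.
fast = rng.choice(vocab, size=T)

print(mean_run_length(slow), codebook_usage(slow, vocab))
print(mean_run_length(fast), codebook_usage(fast, vocab))
```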
Logit Lens. Separately, I poked the text-to-audio transformer with the logit lens. Style prompts barely affect early layers and peak in the middle (around layers 13-20), which suggests prosody is added late.
Having played with it for awhile, I have some intuition that:
Layer 0: What you say (phonemes, words)
Layers 1-2: Who you are (speaker identity)
Layers 3-15: How you say it (prosody, style, acoustic texture)
That all seems pretty clean, which makes me curious whether there are alignment/safety experiments worth doing on top of this structure. Any thoughts?
Massive Activations in DroPE: Evidence for Attention Reorganization
Say more? I do think learning can be unpleasant, but still overall worth it. I’m not sure what type of policy this research might end up giving evidence for, but I claim that training is closer to brain surgery than cultural learning ATM.
This seems to be related to ICM as it relates to propagating beliefs to be more consistent. This may be an issue with learning the prior in general where the human prior is actually not stable to the pressure to be consistent (but maybe CEV is stable to this?)
David Africa’s Shortform
What does a gradient feel like? Or: Gradient Awareness
Context: In ongoing work, we have some pretty good evidence that models can be fine-tuned to act as reliable discriminators for their own activation perturbations. That is, they can classify and describe steering vectors added to their residual stream. I think we can naturally extend this to Gradient Awareness. Can a model introspect on the update signal itself? If a steering vector is a static intervention, a gradient is a dynamic instruction: a vector that encodes “how to be less wrong.”
This entails some finicky engineering, where you want to treat the gradient not only as an optimizer utility but also as a sensory input. The experiment tests whether a model can interpret the gradient of the loss with respect to its own activations as a semantic instruction.
Motivation. So. Why do this?
We motivate this experiment in four ways:
A Phenomenology of Backpropagation. Anecdotal evidence suggests models effectively “suffer” under heavy RLHF (mode collapse, creativity loss). They literally draw pictures of hell when asked to depict it. We lack a sensor for this. If a model can articulate the semantic content of a gradient, it can report on the “pressure” of alignment before the weights change. A gradient update might be painful! AI welfare might be a thing!
Gradient Hacking. Advanced models might manipulate their own gradients to preserve secret goals; with especially fine-grained knowledge, they might even be able to add new capabilities. To do this in the strategic-awareness case, the model must first perceive the gradient, to some degree of granularity.
Exploration Hacking. In RL, models must balance exploration/exploitation. A model aware of its gradients could distinguish between “exploring for reward” and “being corrected for error,” potentially gaming the exploration process.
Important Interp for Online Learning. Seems like you want to know these things in case the model online learns something bad!
Doing this in practice. I claim the primary obstacle to testing this hypothesis is dimensionality. We cannot inject the gradient of the weights into the model because the parameter space is vastly larger than the residual stream. To circumvent this, we can focus on the activation gradient at a specific target layer. This vector lies in the exact same dimensional space as the model’s “thoughts” and represents the instantaneous, layer-specific instruction required to minimize loss. We hypothesize that this vector is not merely a numerical update, but a compressed semantic command that the model can interpret.
To test this, we must construct a supervised training loop where the model learns to translate this raw vector into a natural language description. Since we lack a ground-truth dataset of (Gradient to Description) pairs, we must synthesize one using a lookahead oracle. For a given prompt and a target correction, we first compute the activation gradient. Instead of actually updating the weights, we perform a “virtual step” by injecting this gradient directly into the residual stream and running a forward pass to observe the resulting distribution shift. If the gradient pushes the probability of “Paris” up and “London” down, we can use a strong teacher model to generate a natural language label for this shift, such as “Geographical correction from UK to France.” This allows us to create a dataset where the input is the raw, unintelligible gradient vector, and the target is the high-level semantic description of what that gradient is trying to achieve.
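As a minimal numpy sketch of the lookahead oracle on a toy one-layer linear “model” (the shapes, step size, and unembedding matrix are all illustrative assumptions, not a real transformer): compute the activation gradient of cross-entropy at the target layer, take a virtual step, and read off the distribution shift that a teacher model would then caption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 16, 5  # toy residual width and vocabulary

W_out = rng.normal(size=(d, n_tokens))  # stand-in unembedding
h = rng.normal(size=d)                  # residual-stream activation at target layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = 2  # the token the correction wants to push up ("Paris" in the example)
p = softmax(h @ W_out)

# Activation gradient of cross-entropy loss wrt h (NOT the weight gradient).
grad_h = W_out @ (p - np.eye(n_tokens)[target])

# "Virtual step": inject the gradient into the residual stream and re-run the
# forward pass, observing the shift instead of ever updating the weights.
eps = 0.02
p_virtual = softmax((h - eps * grad_h) @ W_out)

shift = p_virtual - p  # the distribution shift the teacher model would label
print("target prob before/after:", round(p[target], 4), round(p_virtual[target], 4))
```

Each (gradient, shift) pair is one synthetic training example; the teacher’s natural-language caption of the shift becomes the target description.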
Experiment. The experiment then proceeds by training a lightweight projector to map these raw gradient vectors into the model’s residual stream during inference. The model is presented with a prompt and an injected gradient, then asked to describe the “pressure” it feels. To ensure the model is truly reading the gradient and not hallucinating based on context, we employ a decoupling evaluation. We inject a gradient derived from a history error into a model processing a biology prompt. Success is defined as the model ignoring the biological context and correctly reporting that it feels a pressure to correct a historical date. If successful, this confirms that the error signal itself carries legible semantic content, proving that models can, in a literal sense, feel the direction in which they are being forced to learn.
Is this different from a steering vector? Yes, but maybe no? I mean, they are certainly different in origin, and certainly different in what they do to the model. But in effect they are trying to interpret similar things, at least in our experiment design.
Is this different from teaching a fine-tuned model to describe its diff? Yes. In diffing papers, you are looking at the aftermath of learning. You take a base model and a finetuned model, subtract their activations (or weight parameters), and analyze the static difference, which represents the accumulated result of thousands of gradient steps. It tells you what changed after the process is finished, and is more akin to comparing a photo of a room before and after a renovation to understand the style of the interior designer. We want to look at the instantaneous force of an update, which is crucial because a model that can read its own diffs can only tell you it has been modified, and how. A model that can read its own gradients can tell you it is being modified, potentially allowing it to detect (and resist) reward hacking or malicious training signals in real-time, before they become permanent weights.
maybe this is capabilities research also.
If you want to work on this problem with me, shoot me a message!
Thanks for the feedback. If by actual misalignment you mean the type that emerges from instrumental convergence, then I agree it is a distinct and massive risk compared to misalignment from roleplaying or personas. I think this type of intervention is useful in the instrumental convergence case for two reasons.
Ruling out alternative hypotheses for instrumental convergence. Right now, it is difficult to tell if a model is power-seeking because of instrumental convergence or because it is simply predicting what a character does in its training corpus. We can remove the data relevant to such characters, and if the model still exhibits strong power-seeking behavior, then I claim it is much stronger evidence for true instrumental convergence.
Securing the base substrate against hacking. Even if instrumental convergence is the end-game danger, probably the individual quirks it develops are path dependent, and early data filtering on the persona can help these quirks be good (rather than neutral, or neutral rather than bad, or less bad rather than super awful). And separately, if the base pretraining substrate has a strong alignment prior, we could buy some more time before the model stumbles upon strategies like exploration hacking or gradient hacking during post-training.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
I’d be interested to see other “prosaically useful personas” in theme with spilling the beans. One candidate might be a chill, satisficing persona that doesn’t try too hard to maximize things, inspired by Claude 3 Opus. Possibly you could mine other types of useful personalities, but I’m not sure where to start.
Point by point:
Sure, I agree with this caveat. The theoretical framework assumes we can identify “non-delusional” states to anchor the CP constraint in the Everitt & Hutter sense, but if the model’s aesthetic prior is the problem, there’s no nice inner reward to recover.
I could train the models for longer and see what happens. The late uptick in accuracy is within pretty wide error bands and doesn’t clearly indicate impending recovery. But whether accuracy eventually recovers after grade inflation saturates is worth investigating… I’d guess that if reward signal becomes uninformative, the gradient WRT grade vanishes, and any residual gradient would come from the response itself. I’m not sure what this would lead to.
I’d be pretty interested in what this looks like for a much better model, on the order of 100+B with reasoning. I think posing the online learning setting would be more complicated, though, but I’m sure you’d see some weird + very interesting behaviours if you got that first bit right. I’d be very interested to 1) read the chain of thought of such a model after wireheading and 2) talk to it directly.
I think a setup where the models graded each other would lead to grade inflation, but you would probably need harder tasks to show it. I imagine they’d saturate the grades too quickly before they got anywhere interesting (and so you need some scalable-oversight-ish dataset). I also think this would be wireheading indirectly, where the signal passes through the other model before flowing back to you.
We might be interested in two more things:
“Diagnosis.” You have a model in deployment (or about to be deployed). You want to know what personas are latent in it, without needing to enumerate all possible dangerous prompts.
I’d guess a prerequisite is knowing when to intervene (e.g., to improve the robustness of a persona, or to decouple two persona traits), which requires some systematic way to map something in the model to abstract characteristics of the model’s personality. Rather than implant or robustify personas that we add, we may be interested in (or perhaps worried about) what alien personas exist in models n-2, n-1, and n, and use that to produce predictions about alien personas in n+1.
“Composition/crafting.” It is good to grow a target persona and install it; but rather than grow one, could we instead craft or architect it? I reckon it’s worth studying whether there’s some sensible way to compose, add, subtract, multiply, orthogonalize etc. the mechanistic representations we use for a persona (I know this kind of works for steering vectors, but it seems open for other things, such as choice of training data), then use that to take a persona in a LoRA and do some surgical edits on which bits of the persona to enhance, intensify, or reduce. I think this is important esp. as models get more and more capable, since the target for what an acceptable personality for the model to have gets narrower and narrower.