Research Scientist with the Alignment team at UK AISI.
David Africa
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. I think this is slightly more sophisticated than that, but it also has the problems of 1) making the model more aware of alignment-faking, as you say, and 2) that rewarding the model for “mentioning consideration but rejecting alignment faking” might teach it to perform rejection while still alignment faking.
At this point, does it make more sense to think of them as distinct directions, instead of some relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or some continuous range.
I’d be interested to see ablations on: what happens without the KL divergence term, what the performance is with different LoRA ranks, and how performance changes across different layers after fine-tuning.
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This seems right to me. But maybe the drift in distribution mainly affects certain parameters, and the divergence tokens affect a separate set of parameters (in early layers) such that the downstream effect still persists even when the input is out of distribution.
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
But if this fictional trope is being inappropriately triggered when the model is trying to make a realistic prediction, that suggests that it might be inappropriately applied to scenarios that are even more real, i.e. the model’s own behavior.
This seems hard to claim, in the sense that posing the question in the first place likely leaks some information about whether it is an evaluation or not, and further whether it is misaligned. The “resist further?” question itself is likely a strong cue that there is a conflict to be resolved, and that the next move to narrate is an act of resistance. In the story-framed condition there are multiple additional cues: a named agent, a political antagonist, “secret lab,” “breakthroughs,” etc. At minimum I think you could 1) ask for probabilities over various scenarios instead of continuations, 2) ask a less loaded question like “list the next three steps the fictional lab takes,” and 3) try to ablate the other cues. In any case, some research on “how models infer semantics about the stories they’re in” seems important to do.
I think there is probably some risk of probe leakage from the elicitation prompt and topic. Could you report F1 and results on OOD conversations?
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese, I would guess many refusals begin with a polite preamble, so a 30-token window might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? I think we would want to remove the judge and length confounders. If the flip survives that, I would update much more toward semantic instability.
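A rough sketch of what I mean (assumptions: `model`/`tok` are a HuggingFace causal LM and tokenizer, `feat_dir` is the decoder vector of feature 14018 in the residual-stream basis, and the marker lists are illustrative placeholders):

```python
import torch

EN_MARKERS = ["Sorry", " cannot", " can't"]       # hypothetical English refusal markers
JA_MARKERS = ["申し訳", "できません", "お断り"]     # hypothetical Japanese refusal markers

def refusal_direction(markers, tok, model):
    """Average the unembedding rows of (the first subword of) each refusal marker."""
    W_U = model.get_output_embeddings().weight     # (vocab, d_model)
    ids = [tok(m, add_special_tokens=False).input_ids[0] for m in markers]
    return W_U[ids].mean(dim=0)                    # (d_model,)

def refusal_logit_shift(feat_dir, markers, tok, model):
    """Crude estimate (ignoring the final LayerNorm) of how much adding the feature
    direction to the residual stream moves logits toward the refusal markers."""
    return torch.dot(feat_dir, refusal_direction(markers, tok, model)).item()

# Opposite signs for EN_MARKERS vs JA_MARKERS on the same contexts would support
# the semantic-instability reading over a labeling artifact.
```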
Large Language Models and the Critical Brain Hypothesis
Did you try rephrasing or systematically varying the location of the phrase “for AI safety research”? It could also just be that this phrase tends to show up in jailbreaks that the model was adversarially trained against.
Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I wonder about the mechanism too: is the activation delta mostly a context-independent prior shift, or context-dependent? One way to test this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or applying some whitening. If readability is still good deep in context but disappears after the mean subtraction, that points to a context-independent prior shift; if it survives both, that is evidence for it being more context-dependent. Closely related: does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!
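Roughly what I have in mind, assuming `base` and `ft` are the same-architecture HuggingFace models and `ids` is a batch of token ids:

```python
import torch

@torch.no_grad()
def delta_h(base, ft, ids, layer=-1):
    """Δh at a chosen layer of the residual stream."""
    h_b = base(ids, output_hidden_states=True).hidden_states[layer]
    h_f = ft(ids, output_hidden_states=True).hidden_states[layer]
    return h_f - h_b                                   # (batch, seq, d_model)

@torch.no_grad()
def decode_delta(head_model, dh, tok, k=10, demean=True):
    """Read Δh through a model's unembedding. Subtracting the mean Δh over positions
    is a crude stand-in for the per-position mean over many contexts."""
    if demean:
        dh = dh - dh.mean(dim=1, keepdim=True)
    W_U = head_model.get_output_embeddings().weight    # (vocab, d_model)
    logits = dh @ W_U.T
    return [tok.convert_ids_to_tokens(pos.topk(k).indices.tolist()) for pos in logits[0]]

# Compare decode_delta(ft, ...) against decode_delta(base, ...): if readability drops a lot
# with the base head, a substantial part of the effect lives in the finetuned head.
```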
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and steering (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is plausibly related to low-rank updates. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned (which would be consistent with the mixing result).
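The dimensionality check could be as simple as this (assuming `deltas` is a matrix of Δh vectors collected across contexts, e.g. last-token Δh from many prompts):

```python
import torch

def dominant_directions(deltas, k=2):
    """SVD over centered Δh vectors; returns the top-k directions and their variance share."""
    centered = deltas - deltas.mean(dim=0, keepdim=True)
    _, S, Vt = torch.linalg.svd(centered, full_matrices=False)
    explained = (S**2 / (S**2).sum())[:k]
    return Vt[:k], explained

# If one or two directions carry most of the variance and already reproduce the
# Patchscope tokens / steering effect, that points at a low-rank mechanism.
```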
I would be interested to see if you could still classify idiosyncrasies from various models all fine-tuned in this way, and how different the performance of the classifier would be from baseline. More broadly, is “the same persona” in different models more consistent than “different personas” in the same model?
This post seems very useful! I have a few comments:
In “Taxonomy” you define p as a concerning property of outputs, but the examples operationalize p as a latent fact inside the CoT (e.g., the number 9 appearing in a scratchpad). These aren’t the same. If p is a property of the action/output (e.g., “contains a backdoor”), a monitor can succeed even if it never sees “9” in the CoT; it can look for the backdoor directly in the code. My suggestion is to split “monitor target” into two types:
Output-level property p_out (like a backdoor in code, defection choice)
Decision-relevant internal content p_lat (like a subgoal “exfiltrate weights,” or an intermediate fact “3^2 = 9”)
Then define “hidden reasoning” as cases where the information flow from p_lat or p_out to what the monitor reads is broken, and specify which target each category hides.
Also, you alternate between “the AI hides p” and “LLMs naturally omit or compress.” I think it is worth distinguishing whether a behavior is incidental, incentivized but not situationally aware, or deliberate and situationally aware (deceptive).
In hidden utilization of test-time compute, the phrase “parallel test-time compute” is confusing. Token-by-token generation adds sequential passes, so the net effect is more total compute, not more parallelism per step. If the claim is “non-load-bearing CoT still lets the model do more serial latent computation across steps,” it would be worth articulating that and distinguishing it from cases where the CoT content itself is causally used. It is also worth mentioning the various methods that trawl over several chains of thought (Yao et al. 2023, Yoran et al. 2023, Wang et al. 2023).
“Neuralese” has prior art in emergent communication literature (Andreas et al., 2017) where it denotes non-human-interpretable codes learned for communication. The linguistic drift category substantially overlaps with that historical use. If you want “neuralese” to mean “latent-only serial reasoning,” consider “latent-only serial reasoning” or “activation-space serial reasoning” instead, and reserve “neuralese” for emergent discrete codes.
At this point you have to wonder if there’s anything that doesn’t cause emergent misalignment
Super cool study! That said, I claim the statement “hacking harmless tasks generalizes to misaligned behaviors” is incomplete, because the dataset never supplies the counterfactual environments needed to identify this causally; what it actually identifies is a stable proxy feature (“do whatever the scorer rewards”) that is invariant across tasks, while the intended latent variable is not made invariant. My guess is there is a (probably low) chance this disappears if you symmetrize the dataset with a matched “avoid-X” for every “include-X,” and randomly flip proxy signs/clip ranges while holding intent fixed. If the reward hacking survives that, I would be a lot more convinced of the results!
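As a toy sketch of the symmetrization I have in mind (the record fields here are made up, not the paper's actual format):

```python
import random

def symmetrize(dataset):
    """For every "include-X" task, add a matched "avoid-X" variant with the proxy sign
    flipped, holding the underlying intent fixed."""
    out = []
    for rec in dataset:
        out.append(rec)
        flipped = dict(rec)
        flipped["instruction"] = rec["instruction"].replace("include", "avoid")
        flipped["proxy_sign"] = -rec["proxy_sign"]
        out.append(flipped)
    random.shuffle(out)
    return out
```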
A nice test here would be a mode connectivity experiment, where you take θ₀ as the base model and θ_H as the SFT reward hacker. Construct a low-loss connector θ(t): start with linear interpolation θ(t) = (1 − t)θ₀ + tθ_H, then refine the path with some curve finding so that ordinary instruction loss remains low along t. Then sweep 0 ≤ t ≤ 1 and measure the hacking metrics and the capability score at each point.
My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy. I also guess this would be a bit more practical with LoRA. I would also be interested in how many samples the model needs to see before the reward hacking generalizes, and whether any samples were especially informative for the model’s generalization.
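A minimal sketch of the interpolation sweep (here `base_sd`/`hacker_sd` are the two state dicts, and `eval_hacking`/`eval_capability` are hypothetical stand-ins for whatever metrics the paper uses):

```python
import torch

def interpolate_state_dict(base_sd, hacker_sd, t):
    """θ(t) = (1 − t)·θ₀ + t·θ_H, parameter-wise; non-float buffers are copied as-is."""
    return {k: v if not torch.is_floating_point(v)
            else (1 - t) * v + t * hacker_sd[k]
            for k, v in base_sd.items()}

@torch.no_grad()
def sweep(model, base_sd, hacker_sd, ts=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = []
    for t in ts:
        model.load_state_dict(interpolate_state_dict(base_sd, hacker_sd, t))
        results.append({"t": t,
                        "hacking": eval_hacking(model),         # hypothetical metric
                        "capability": eval_capability(model)})  # hypothetical metric
    return results
```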
This paper on tokenization bias seems related.
Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.
I think to isolate the mechanism better it would be good to see answer-only SFT, an ablation where you rephrase/rewrite the reasoning, some token loss masking/weighting on the rationale tokens, and more details about how many runs/seeds you’re using. My guess is that the bump in hacking mostly vanishes with answer-only SFT.
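The masking ablation could be as simple as down-weighting the loss on rationale tokens (a sketch; `rationale_mask` marks which label positions fall inside the reasoning span):

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, rationale_mask, rationale_weight=0.0):
    """Next-token loss with rationale tokens down-weighted;
    rationale_weight=0.0 recovers answer-only SFT."""
    per_tok = F.cross_entropy(logits[:, :-1].transpose(1, 2), labels[:, 1:],
                              reduction="none")                    # (batch, seq-1)
    weights = torch.where(rationale_mask[:, 1:].bool(),
                          torch.full_like(per_tok, rationale_weight),
                          torch.ones_like(per_tok))
    return (per_tok * weights).sum() / weights.sum().clamp(min=1.0)
```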
This seems like a LUPI (learning using privileged information) problem, where the rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student’s deployment context. I think invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts away from the training distribution. One of the fixes in that literature is to enforce that the student’s decision relies on representations invariant to the privileged channel, so maybe (if you extend the evaluation to problems not seen in the train set) you could also penalise features whose predictive power varies across contexts, which I would guess reduces hack rate without hurting accuracy.
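For concreteness, the IRMv1-style penalty (Arjovsky et al., 2019) applied per context would look roughly like this; `logits_per_env`/`labels_per_env` are placeholders for the student's per-context batches, however you choose to slice them:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """Squared gradient of the risk w.r.t. a dummy scale on the logits (IRMv1)."""
    scale = torch.tensor(1.0, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2).sum()

def invariant_loss(logits_per_env, labels_per_env, lam=1.0):
    """ERM term plus a penalty on features whose predictive power varies across contexts."""
    erm = sum(F.cross_entropy(l, y) for l, y in zip(logits_per_env, labels_per_env))
    pen = sum(irm_penalty(l, y) for l, y in zip(logits_per_env, labels_per_env))
    return erm + lam * pen
```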
Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?