I run the Model Transparency Work-stream at the UK AI Security Institute. We are generally concerned with oversight, including monitorability, auditing and (preventing / detecting) scheming. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Joseph Bloom
Machinic Psychopharmacology: Do LLMs Self-Medicate?
Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate
Verbalized Eval Awareness Inflates Measured Safety
Joseph Bloom’s Shortform
https://x.com/JBloomAus/status/2050150634268098997?s=20
I’m hiring for Research Scientists and Engineers on my team at UK AISI. I think this is likely to be a very high impact role that people should seriously consider. My team strongly prioritises good epistemic standards and has great incentives working in the public interest at UK AISI. Please apply!
Reproducing steering against evaluation awareness in a large open-weight model
(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL
We found an open weight model that games alignment honeypots
It’s undesirable because it makes SAEs more confusing / harder to use, but it’s also a concrete obstacle to things like sparse feature circuits. For example, it causes automatic circuits composed of features to have issues. Instead of having a “starts with E” feature in a spelling circuit that fires on all “E” words, you have a “starts with E” feature that fires on most “E” words (part of the circuit) + all the “E” absorbing features like “fires on the token Europe” (large number of features, adds lots of nodes to the circuit). So instead of being able to create a simple DAG, you get a complex one.
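To make the node-count point concrete, here is a minimal toy sketch. It is not a real SAE: the word list, the set of absorbed tokens, and the feature names are all hypothetical, chosen only to show how absorption inflates the number of nodes a spelling circuit needs.

```python
# Toy illustration of feature absorption (all names here are hypothetical).
WORDS = ["Europe", "Egypt", "Eagle", "Earth", "Elephant", "Echo"]

# Assumption: these tokens have their own token-specific latent that
# "absorbs" the starts-with-E direction, so the general feature stays silent.
ABSORBED_TOKENS = {"Europe", "Egypt"}


def ideal_features(word):
    """One clean feature fires on every word that starts with E."""
    return {"starts_with_E"} if word.startswith("E") else set()


def absorbed_features(word):
    """The general feature misses absorbed tokens; a token feature fires instead."""
    if word in ABSORBED_TOKENS:
        return {f"token_{word}"}
    return {"starts_with_E"} if word.startswith("E") else set()


def circuit_nodes(feature_fn, vocab):
    """Collect every feature the circuit needs to cover the vocabulary."""
    nodes = set()
    for word in vocab:
        nodes |= feature_fn(word)
    return nodes


ideal_nodes = circuit_nodes(ideal_features, WORDS)
absorbed_nodes = circuit_nodes(absorbed_features, WORDS)
print(len(ideal_nodes), len(absorbed_nodes))
```

In this toy, the node count grows by one for every absorbed token, which is the sense in which the circuit stops being a simple DAG.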
Auditing Games for Sandbagging [paper]
I don’t remember signed GSEA being standard when I used it so no particular reason. Could try it.
Research Areas in Interpretability (The Alignment Project by UK AISI)
The Alignment Project by UK AISI
I agree!
I suspect this is a good experiment. It may help!
Interesting. I agree I don’t have strong evidence here (and certainly we haven’t shown it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me; I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?
That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree, but I think this is very subtle, so I want to go into more depth here: I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting; it might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant to this token is the heuristic that the next token should be wrong” (but it’s a little bit in superposition with “this token is wrong or part of something wrong”).
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) were elevated when they probed on the conditional sandbagging dataset from Benton et al., which I think does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.
White Box Control at UK AISI—Update on Sandbagging Investigations
I think this is a valuable read for people who work in interp but feel like I want to add a few ideas:
Distinguishing Misrepresentation from Mismeasurement: Interpretability researchers use techniques that find vectors which we say correspond to the representations of the model, but the methods we use to find those may be imperfect. For example, if your cat SAE feature also lights up on raccoons, then maybe this is a true property of the model’s cat detector (that it also lights up on raccoons) or maybe this is an artefact of the SAE loss function. Maybe the true cat detector doesn’t get fooled by raccoons, but your SAE latent is biased in some way. See this paper that I supervised for more concrete observations.
What are the canonical units? It may be that there is a real sense in which the model has a cat detector, but maybe at the layer at which you tried to detect it, the cat detector is imperfect. If the model doesn’t function as if it has an imperfect cat detector, then maybe downstream of the cat detector is some circuitry for catching/correcting specific errors. This means that finding a local cat detector which might have misrepresentation issues isn’t in itself sufficient to argue that the model as a whole has those issues. Selection pressures apply to the network as a whole and not necessarily always to the components. The fact that we see so much modularity is probably not random (John’s written about this) but if I’m not mistaken, we don’t have strong reasons to believe that the thing that looks like a cat detector must be the model’s one true cat detector.
I’d be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (e.g., there is a noisy signal but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying intermediate representations in such models could be interesting.
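As a rough illustration of the incentive (not a trained toy model, only the decision-theoretic core of one), here is a minimal sketch. It assumes a single noisy scalar “cat score” with unit-variance Gaussian noise, equal class priors, and hypothetical costs for misses and false alarms. When misses cost more, the reward-optimal threshold shifts toward recall, so the optimal detector is “imperfect” by design.

```python
import math


def normal_cdf(x, mean=0.0):
    """CDF of a unit-variance Gaussian centred at `mean`."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def expected_cost(threshold, cost_fn, cost_fp, signal_mean=1.0):
    """Expected cost of firing when score >= threshold, with equal class priors.

    Assumption: non-cat scores ~ N(0, 1) and cat scores ~ N(signal_mean, 1).
    """
    miss_rate = normal_cdf(threshold, mean=signal_mean)        # P(score < t | cat)
    false_alarm_rate = 1.0 - normal_cdf(threshold, mean=0.0)   # P(score >= t | not cat)
    return 0.5 * cost_fn * miss_rate + 0.5 * cost_fp * false_alarm_rate


def best_threshold(cost_fn, cost_fp):
    """Grid search for the cost-minimising threshold on [-3, 3]."""
    grid = [i / 100.0 for i in range(-300, 301)]
    return min(grid, key=lambda t: expected_cost(t, cost_fn, cost_fp))


balanced = best_threshold(cost_fn=1.0, cost_fp=1.0)
recall_biased = best_threshold(cost_fn=5.0, cost_fp=1.0)
print(balanced, recall_biased)
```

Under these assumptions the optimum has the closed form t = 0.5 + ln(cost_fp / cost_fn), so a 5:1 cost ratio moves the threshold well below the midpoint between the two class means. A follow-up experiment in the spirit of the comment would train a small network under such a reward and ask whether its intermediate representation shows the same bias.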
Sure (quick response + only representing my personal views)
- High impact: I think about this in terms of EV in different worlds. High political will worlds, though maybe less likely, are where government is likely to matter more. These are worlds where governments having more knowledge about what norms are good or how to regulate effectively really matters. The UK could be quite influential here. One of the main things my team has been doing is developing both expertise and relationships to help us leverage them. We touch on very important topics for x-risk like oversight of AI systems (monitoring / auditing). We’ve got a big report coming soon on this topic which I think people will like. In low political will worlds, “making the good path easier” for labs could be quite helpful. Red-teaming / showing that mitigations may not work or pointing out flaws in safety arguments can help create pressure for marginally better solutions. UK AISI’s work on safeguards I think can fit into this. I like METR’s sabotage report review and think that can play a similar role. To abstract a little: UK AISI’s strong stakeholder relationships (labs, government, technical talent) place it particularly well to influence the ecosystem to do better.
- Strongly prioritises good epistemic standards: Partly this is a statement about my team specifically and how we operate within it. Pre-registration of beliefs / making beliefs pay rent, trying to de-confuse ourselves and separating observations from conclusions are large parts of how we operate. More broadly, I think the top-down pressure is to do very high quality work, probably because this mitigates risks (of saying incorrect things and these becoming known later). We’re often seeking feedback internally and externally with a request to red-team our results / claims.
- Having great incentives working in the public interest: Most obviously, we’re not a lab, and being part of government, team members can’t have significant conflicts of interest. The more senior you go in UK AISI, the more likely it is that the individuals could be working at a frontier lab and are choosing not to, so there isn’t a strong drive to look good to labs so we’ll get jobs with them (even though this might be a motivation for more junior team members). I think there are pressures in the direction of overstating and understating risks that roughly cancel out, with the dominant motivation mostly being to make statements that are true, and useful / practical recommendations.
It might be easier to explain why I feel good about our impact / epistemics / incentives in the context of opposing arguments. I’d be happy to steel-man any suggestions to the contrary.