I run the Model Transparency Work-stream at the UK AI Security Institute. We are generally concerned with oversight, including monitorability, auditing and (preventing / detecting) scheming. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Joseph Bloom
Machinic Psychopharmacology: Do LLMs Self-Medicate?
Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate
Sure (quick response + only representing my personal views)
- High impact: I think about this in terms of EV in different worlds. High political will worlds, though maybe less likely, are where government is likely to matter more: these are worlds where governments having more knowledge about what norms are good or how to regulate effectively really matters. The UK could be quite influential here. One of the main things my team has been doing is developing expertise and relationships to help us leverage them. We touch on very important topics for x-risk like oversight of AI systems (monitoring / auditing). We’ve got a big report coming soon on this topic which I think people will like. In low political will worlds, “making the good path easier” for labs could be quite helpful. Red-teaming / showing that mitigations may not work or pointing out flaws in safety arguments can help create pressure for marginally better solutions. I think UK AISI’s work on safeguards can fit into this. I like METR’s sabotage report review and think that can play a similar role. To abstract a little: UK AISI’s strong stakeholder relationships (labs, government, technical talent) place it particularly well to influence the ecosystem to do better.
- Strongly prioritises good epistemic standards: Partly this is a statement about my team specifically and how we operate within it. Pre-registration of beliefs / making beliefs pay rent, trying to de-confuse ourselves and separating observations from conclusions are large parts of how we operate. More broadly, I think the top down pressure is to do very high quality work probably because this mitigates risks (of saying incorrect things and these becoming known later). We’re often seeking feedback internally and externally with a request to red-team our results / claims.
- Having great incentives working in the public interest: Most obviously, we’re not a lab and, being part of government, team members can’t have significant conflicts of interest. The more senior you go in UK AISI, the more likely it is that the individuals could be working at a frontier lab and are choosing not to, so there isn’t a strong drive to look good to labs so we’ll get jobs with them (even though this might be a motivation for more junior team members). I think there are pressures in the direction of overstating and understating risks that roughly cancel out, with the dominant motivation mostly being to make statements that are true, and useful / practical recommendations.
It might be easier to explain why I feel good about our impact / epistemics / incentives in the context of opposing arguments. I’d be happy to steel-man any suggestions to the contrary.
Verbalized Eval Awareness Inflates Measured Safety
Joseph Bloom’s Shortform
https://x.com/JBloomAus/status/2050150634268098997?s=20
I’m hiring for Research Scientists and Engineers on my team at UK AISI. I think this is likely to be a very high impact role people should seriously consider. My team strongly prioritises good epistemic standards and has great incentives working in the public interest at UK AISI. Please apply!
Reproducing steering against evaluation awareness in a large open-weight model
(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL
We found an open weight model that games alignment honeypots
It’s undesirable because it makes SAEs more confusing / harder to use but it’s also a concrete obstacle to things like sparse feature circuits. For example, it causes automatic circuits composed of features to have issues. Instead of having a “starts with E” feature in a spelling circuit that fires on all “E” words, you have a “starts with E” feature that fires on most “E” words (part of the circuit) + all the “E” absorbing features like “fires on the token Europe” (large number of features, adds lots of nodes to the circuit). So instead of being able to create a simple DAG, you get a complex one.
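To make the node-count point concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the word list, the set of "absorbed" tokens, and the feature names are assumptions, and no real SAE is involved. It only counts how many feature nodes a "starts with E" circuit would need with and without absorption.

```python
# Toy vocabulary; words and feature names are hypothetical.
words = ["Europe", "Egypt", "Elephant", "Echo", "Eagle", "Banana", "Cat"]

# Assumed: tokens that have their own absorbing feature in the dictionary.
absorbed = {"Europe", "Egypt"}


def ideal_features(word):
    """Clean dictionary: one feature fires on every word starting with E."""
    return {"starts_with_E"} if word.startswith("E") else set()


def absorbed_features(word):
    """Dictionary with absorption: a token-specific feature fires instead
    of the general one on the absorbed tokens."""
    if word in absorbed:
        return {f"token_{word}"}
    return ideal_features(word)


def circuit_nodes(feature_fn, vocab):
    """Union of features needed to cover the behaviour across the vocab."""
    nodes = set()
    for w in vocab:
        nodes |= feature_fn(w)
    return nodes


ideal_nodes = circuit_nodes(ideal_features, words)
absorbed_nodes = circuit_nodes(absorbed_features, words)

print("nodes without absorption:", sorted(ideal_nodes))
print("nodes with absorption:   ", sorted(absorbed_nodes))
```

In this toy, each absorbed token adds one extra node to the circuit, which is the sense in which the DAG stops being simple as the number of absorbing features grows.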
Auditing Games for Sandbagging [paper]
I don’t remember signed GSEA being standard when I used it so no particular reason. Could try it.
Research Areas in Interpretability (The Alignment Project by UK AISI)
The Alignment Project by UK AISI
I agree!
I suspect this is a good experiment. It may help!
Interesting. I agree I don’t have strong evidence here (and certainly we haven’t shown it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me: I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?
That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree but think this is very subtle so want to go into more depth here: I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting, so it might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant to this token is the heuristic that the next token should be wrong” (but it’s a little bit in superposition with “this token is wrong or part of something wrong”).
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) were elevated when they probed on the conditional sandbagging dataset from Benton et al., which I think does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.
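A toy illustration of this hypothesis, in Python with NumPy. This is a sketch on synthetic data, not our actual probe or dataset: the activation dimension, the "wrongness" direction, the size of the shift attributed to instructions, and the training loop are all assumptions. It trains a logistic-regression probe on clean data, then shows that pushing activations along the probed direction raises mean scores for both labels while the labels stay separable.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation dimension


def activations(n, label_shift):
    """Synthetic activations; coordinate 0 plays the role of a 'wrongness' direction."""
    x = rng.normal(size=(n, d))
    x[:, 0] += label_shift
    return x


# "Off-policy" training set: deceptive (label 1) vs honest (label 0) examples.
X = np.vstack([activations(200, +1.5), activations(200, -1.5)])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Logistic-regression probe fitted with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)


def score(x):
    """Probe logit for each row of activations."""
    return x @ w + b


# Held-out examples, scored with and without an assumed context effect:
# instructions about sandbagging push every activation along the same direction.
honest, wrong = activations(200, -1.5), activations(200, +1.5)
instruction_shift = np.zeros(d)
instruction_shift[0] = 1.0  # assumed size of the effect

plain_honest = score(honest).mean()
plain_wrong = score(wrong).mean()
instr_honest = score(honest + instruction_shift).mean()
instr_wrong = score(wrong + instruction_shift).mean()

print(f"no instructions:   honest={plain_honest:.2f} wrong={plain_wrong:.2f}")
print(f"with instructions: honest={instr_honest:.2f} wrong={instr_wrong:.2f}")
```

In this toy the probe has no access to the instructions themselves, yet scores rise regardless of label, which matches the pattern described above. It does not distinguish between the two interpretations on real models.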
Cool work. What kinds of situations trigger CoT evaluation awareness mentions in simulated deployment? Does this overlap heavily with the distribution of scenarios that are interesting from an alignment or safety perspective? Where those distributions come apart is quite interesting to me.