Joseph Bloom

Karma: 1,968

I run the Model Transparency Work-stream at the UK AI Security Institute. We are generally concerned with oversight, including monitorability, auditing and (preventing / detecting) scheming. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Joseph Bloom 17 Jun 2026 4:39 UTC
LW: 3 AF: 1
0
AF
on: Predicting LLM Safety Before Release by Simulating Deployment
Cool work. What kinds of situations trigger CoT evaluation awareness mentions in simulated deployment? Does this overlap heavily with the distribution of scenarios that are interesting from an alignment or safety perspective? Where those distributions come apart is quite interesting to me.

Machinic Psychopharmacology: Do LLMs Self-Medicate?

Sid Black and Joseph Bloom

10 Jun 2026 14:15 UTC

124 points

11 comments · 23 min read · LW link

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

Jordan Taylor, Max H, Ed Fage, Thomas Read and Joseph Bloom

21 May 2026 14:52 UTC

83 points

0 comments · 6 min read · LW link

(www.aisi.gov.uk)

Joseph Bloom 5 May 2026 11:00 UTC
6 points
1
in reply to: kave’s comment on: Joseph Bloom’s Shortform
Sure (quick response + only representing my personal views)
- High impact: I think about this in terms of EV in different worlds. High political will worlds, though maybe less likely, are where government is likely to matter more: these are worlds where governments having more knowledge about what norms are good or how to regulate effectively really matters. The UK could be quite influential here. One of the main things my team has been doing is developing both expertise and the relationships that help us leverage it. We touch on very important topics for x-risk like oversight of AI systems (monitoring / auditing). We’ve got a big report coming soon on this topic which I think people will like. In low political will worlds, “making the good path easier” for labs could be quite helpful. Red-teaming / showing that mitigations may not work or pointing out flaws in safety arguments can help create pressure for marginally better solutions. I think UK AISI’s work on safeguards can fit into this. I like METR’s sabotage report review and think that can play a similar role. To abstract a little: UK AISI’s strong stakeholder relationships (labs, government, technical talent) place it particularly well to influence the ecosystem to do better.
- Strongly prioritises good epistemic standards: Partly this is a statement about my team specifically and how we operate within it. Pre-registration of beliefs / making beliefs pay rent, trying to de-confuse ourselves and separating observations from conclusions are large parts of how we operate. More broadly, I think the top down pressure is to do very high quality work probably because this mitigates risks (of saying incorrect things and these becoming known later). We’re often seeking feedback internally and externally with a request to red-team our results / claims.
- Having great incentives working in the public interest: Most obviously, we’re not a lab and, being part of government, team members can’t have significant conflicts of interest. The more senior you go in UK AISI, the more likely it is that the individuals could be working at a frontier lab and are choosing not to, so there isn’t a strong drive to look good to labs so we’ll get jobs with them (even though this might be a motivation for more junior team members). I think there are pressures in the direction of overstating and understating risks that roughly cancel out, with the dominant motivation mostly being to make statements that are true, and useful / practical recommendations.

It might be easier to explain why I feel good about our impact / epistemics / incentives in the context of opposing arguments. I’d be happy to steel-man any suggestions to the contrary.

Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri and Joseph Bloom

4 May 2026 20:02 UTC

44 points

0 comments · 29 min read · LW link

Joseph Bloom’s Shortform

Joseph Bloom

1 May 2026 9:56 UTC

6 points

4 comments · 1 min read · LW link

Joseph Bloom 1 May 2026 9:56 UTC
76 points
12
on: Joseph Bloom’s Shortform
https://x.com/JBloomAus/status/2050150634268098997?s=20

I’m hiring for Research Scientists and Engineers on my team at UK AISI. I think this is likely to be a very high impact role people should seriously consider. My team strongly prioritises good epistemic standards and has great incentives working in the public interest at UK AISI. Please apply!

Reproducing steering against evaluation awareness in a large open-weight model

Thomas Read, Bronson Schoen, Santiago Aranguri and Joseph Bloom

10 Apr 2026 10:45 UTC

89 points

17 comments · 15 min read · LW link

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

7vik, Sid Black and Joseph Bloom

30 Mar 2026 10:56 UTC

127 points

6 comments · 17 min read · LW link

We found an open weight model that games alignment honeypots

Thomas Read and Joseph Bloom

16 Mar 2026 12:57 UTC

79 points

2 comments · 10 min read · LW link

Joseph Bloom 7 Jan 2026 9:14 UTC
2 points
0
in reply to: Jiaxing Wu’s comment on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
It’s undesirable because it makes SAEs more confusing / harder to use but it’s also a concrete obstacle to things like sparse feature circuits. For example, it causes automatic circuits composed of features to have issues. Instead of having a “starts with E” feature in a spelling circuit that fires on all “E” words, you have a “starts with E” feature that fires on most “E” words (part of the circuit) + all the “E” absorbing features like “fires on the token Europe” (large number of features, adds lots of nodes to the circuit). So instead of being able to create a simple DAG, you get a complex one.
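
A toy sketch of the node-count point above. The word list, the set of absorbed tokens and the feature names are all made up for illustration; this is not real SAE output, just a count of how many distinct feature nodes a "starts with E" circuit would need with and without absorption.

```python
# Toy illustration only: words, absorbed tokens and feature names are hypothetical.
words = ["Europe", "elephant", "east", "egg", "apple", "dog"]

# Assumption: these tokens each have their own token-specific absorbing feature.
absorbed_tokens = {"Europe", "east"}


def starts_with_e(word: str) -> bool:
    return word.lower().startswith("e")


def ideal_features(word: str) -> set[str]:
    """No absorption: one general feature fires on every 'E' word."""
    return {"starts_with_E"} if starts_with_e(word) else set()


def absorbed_features(word: str) -> set[str]:
    """Absorption: the general feature misses absorbed tokens,
    and a token-specific feature fires on them instead."""
    if not starts_with_e(word):
        return set()
    if word in absorbed_tokens:
        return {f"token_{word}"}
    return {"starts_with_E"}


def circuit_nodes(feature_fn) -> set[str]:
    """Collect every feature that fires on at least one word."""
    nodes: set[str] = set()
    for w in words:
        nodes |= feature_fn(w)
    return nodes


ideal = circuit_nodes(ideal_features)
absorbed = circuit_nodes(absorbed_features)
print(len(ideal), sorted(ideal))
print(len(absorbed), sorted(absorbed))
```

With only two absorbed tokens the circuit already needs three nodes instead of one; the node count grows with every additional absorbing feature.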

Auditing Games for Sandbagging [paper]

Jordan Taylor and Joseph Bloom

9 Dec 2025 18:37 UTC

103 points

4 comments · 10 min read · LW link

Joseph Bloom 23 Aug 2025 16:03 UTC
2 points
0
in reply to: Garrett Baker’s comment on: Understanding SAE Features with the Logit Lens
I don’t remember signed GSEA being standard when I used it so no particular reason. Could try it.

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom

1 Aug 2025 10:26 UTC

14 points

0 comments · 5 min read · LW link

(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa and Edmund Lau

1 Aug 2025 9:52 UTC

29 points

0 comments · 2 min read · LW link

(alignmentproject.aisi.gov.uk)

Joseph Bloom 25 Jul 2025 9:30 UTC
3 points
0
in reply to: RogerDearnaley’s comment on: White Box Control at UK AISI—Update on Sandbagging Investigations
I agree!

Joseph Bloom 23 Jul 2025 11:21 UTC
2 points
0
in reply to: Clément Dumas’s comment on: White Box Control at UK AISI—Update on Sandbagging Investigations
I suspect this is a good experiment. It may help!

Joseph Bloom 15 Jul 2025 8:37 UTC
LW: 2 AF: 1
0
AF
in reply to: Sam Marks’s comment on: White Box Control at UK AISI—Update on Sandbagging Investigations
Interesting. I agree I don’t have strong evidence here (and certainly we haven’t shown it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me; I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?

Joseph Bloom 11 Jul 2025 11:28 UTC
LW: 7 AF: 6
1
AF
in reply to: Sam Marks’s comment on: White Box Control at UK AISI—Update on Sandbagging Investigations
That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.

I think I agree but think this is very subtle so want to go into more depth here:
1. I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting—it might be worth clarifying.
2. I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant is the heuristic that the next token should be wrong to this token” (but it’s a little bit in superposition with “this token is wrong or part of something wrong”).
3. However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) were elevated when they probed on the conditional sandbagging dataset from Benton et al., which I think does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

81 points

10 comments · 18 min read · LW link