Joseph Bloom

Karma: 1,293

I run the White Box Evaluations Team at the UK AI Security Institute, a mechanistic interpretability team focused on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I co-founded Decode Research, an AI safety research infrastructure organisation, and conducted independent research into the mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom · 1 Aug 2025 10:26 UTC
14 points
0 comments · 5 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

White Box Control at UK AISI: Update on Sandbagging Investigations

10 Jul 2025 13:37 UTC
78 points
10 comments · 18 min read · LW link

Eliciting bad contexts

24 Jan 2025 10:39 UTC
35 points
9 comments · 3 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

20 Dec 2024 15:16 UTC
34 points
0 comments · 37 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
73 points
16 comments · 3 min read · LW link
(arxiv.org)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
72 points
10 comments · 20 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
95 points
7 comments · 7 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
69 points
2 comments · 14 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
43 points
16 comments · 15 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
103 points
37 comments · 15 min read · LW link

Linear encoding of character-level information in GPT-J token embeddings

10 Nov 2023 22:19 UTC
34 points
4 comments · 28 min read · LW link

Features and Adversaries in MemoryDT

20 Oct 2023 7:32 UTC
31 points
6 comments · 25 min read · LW link

Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview)

17 Sep 2023 18:45 UTC
27 points
2 comments · 8 min read · LW link

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom · 16 May 2023 22:59 UTC
36 points
2 comments · 16 min read · LW link