Joseph Bloom

Karma: 1,293

I run the White Box Evaluations Team at the UK AI Security Institute, a mechanistic interpretability team focused on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I co-founded Decode Research, an AI safety research infrastructure organisation, and conducted independent research into the mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom · 1 Aug 2025 10:26 UTC
14 points
0 comments · 5 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

White Box Control at UK AISI: Update on Sandbagging Investigations

10 Jul 2025 13:37 UTC
78 points
10 comments · 18 min read · LW link

Eliciting bad contexts

24 Jan 2025 10:39 UTC
35 points
9 comments · 3 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

20 Dec 2024 15:16 UTC
34 points
0 comments · 37 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
73 points
16 comments · 3 min read · LW link
(arxiv.org)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
72 points
10 comments · 20 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
95 points
7 comments · 7 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
69 points
2 comments · 14 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
43 points
16 comments · 15 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
103 points
37 comments · 15 min read · LW link

Linear encoding of character-level information in GPT-J token embeddings

10 Nov 2023 22:19 UTC
34 points
4 comments · 28 min read · LW link

Features and Adversaries in MemoryDT

20 Oct 2023 7:32 UTC
31 points
6 comments · 25 min read · LW link

Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview)

17 Sep 2023 18:45 UTC
27 points
2 comments · 8 min read · LW link

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom · 16 May 2023 22:59 UTC
36 points
2 comments · 16 min read · LW link