Joseph Bloom

Karma: 1,293

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focused on estimating and addressing risks associated with deceptive alignment. I'm a MATS 5.0 and ARENA 1.0 alumnus. Previously, I co-founded Decode Research, an AI safety research infrastructure organisation, and conducted independent research into the mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom · Aug 1, 2025, 10:26 AM
14 points (7 votes)
0 comments · 5 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

Aug 1, 2025, 9:52 AM
29 points (9 votes)
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

White Box Control at UK AISI – Update on Sandbagging Investigations

Jul 10, 2025, 1:37 PM
78 points (32 votes)
10 comments · 18 min read · LW link

Eliciting bad contexts

Jan 24, 2025, 10:39 AM
35 points (15 votes)
9 comments · 3 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Dec 20, 2024, 3:16 PM
34 points (13 votes)
0 comments · 37 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points (34 votes)
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

Oct 7, 2024, 9:56 AM
49 points (21 votes)
8 comments · 10 min read · LW link