Joseph Bloom

Karma: 1,967

I run the Model Transparency Work-stream at the UK AI Security Institute. We are generally concerned with oversight, including monitorability, auditing and (preventing / detecting) scheming. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Machinic Psychopharmacology: Do LLMs Self-Medicate?

10 Jun 2026 14:15 UTC
124 points
10 comments · 23 min read · LW link

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

21 May 2026 14:52 UTC
83 points
0 comments · 6 min read · LW link
(www.aisi.gov.uk)

Verbalized Eval Awareness Inflates Measured Safety

4 May 2026 20:02 UTC
44 points
0 comments · 29 min read · LW link

Joseph Bloom’s Shortform

Joseph Bloom · 1 May 2026 9:56 UTC
6 points
4 comments · 1 min read · LW link

Reproducing steering against evaluation awareness in a large open-weight model

10 Apr 2026 10:45 UTC
89 points
17 comments · 15 min read · LW link

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

30 Mar 2026 10:56 UTC
127 points
6 comments · 17 min read · LW link

We found an open weight model that games alignment honeypots

16 Mar 2026 12:57 UTC
79 points
2 comments · 10 min read · LW link

Auditing Games for Sandbagging [paper]

9 Dec 2025 18:37 UTC
103 points
4 comments · 10 min read · LW link

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom · 1 Aug 2025 10:26 UTC
14 points
0 comments · 5 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

White Box Control at UK AISI - Update on Sandbagging Investigations

10 Jul 2025 13:37 UTC
81 points
10 comments · 18 min read · LW link

Eliciting bad contexts

24 Jan 2025 10:39 UTC
37 points
9 comments · 3 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

20 Dec 2024 15:16 UTC
36 points
0 comments · 37 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
74 points
19 comments · 3 min read · LW link
(arxiv.org)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
73 points
10 comments · 20 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link