Joseph Bloom

Karma: 1,962

I run the Model Transparency workstream at the UK AI Security Institute. We are broadly concerned with oversight, including monitorability, auditing, and preventing and detecting scheming. Previously, I cofounded Decode Research, an AI safety research infrastructure organisation, and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Machinic Psychopharmacology: Do LLMs Self-Medicate?

10 Jun 2026 14:15 UTC
120 points
7 comments, 23 min read, LW link

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

21 May 2026 14:52 UTC
83 points
0 comments, 6 min read, LW link
(www.aisi.gov.uk)

Verbalized Eval Awareness Inflates Measured Safety

4 May 2026 20:02 UTC
44 points
0 comments, 29 min read, LW link

Joseph Bloom’s Shortform

Joseph Bloom, 1 May 2026 9:56 UTC
6 points
4 comments, 1 min read, LW link

Reproducing steering against evaluation awareness in a large open-weight model

10 Apr 2026 10:45 UTC
89 points
17 comments, 15 min read, LW link

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

30 Mar 2026 10:56 UTC
126 points
6 comments, 17 min read, LW link

We found an open weight model that games alignment honeypots

16 Mar 2026 12:57 UTC
79 points
2 comments, 10 min read, LW link

Auditing Games for Sandbagging [paper]

9 Dec 2025 18:37 UTC
103 points
4 comments, 10 min read, LW link

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom, 1 Aug 2025 10:26 UTC
14 points
0 comments, 5 min read, LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments, 2 min read, LW link
(alignmentproject.aisi.gov.uk)

White Box Control at UK AISI - Update on Sandbagging Investigations

10 Jul 2025 13:37 UTC
81 points
10 comments, 18 min read, LW link