Joseph Bloom

Karma: 1,196

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Eliciting bad contexts

Jan 24, 2025, 10:39 AM
31 points
8 comments, 3 min read

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Dec 20, 2024, 3:16 PM
32 points
0 comments, 37 min read

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments, 2 min read
(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

Oct 7, 2024, 9:56 AM
49 points
8 comments, 10 min read

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Sep 25, 2024, 9:31 AM
73 points
16 comments, 3 min read
(arxiv.org)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments, 20 min read

Stitching SAEs of different sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments, 12 min read

A Selection of Randomly Selected SAE Features

Apr 1, 2024, 9:09 AM
109 points
2 comments, 4 min read

SAE-VIS: Announcement Post

Mar 31, 2024, 3:30 PM
74 points
8 comments, 1 min read

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Mar 25, 2024, 9:17 PM
93 points
7 comments, 7 min read

Understanding SAE Features with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments, 14 min read

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Feb 27, 2024, 2:43 AM
43 points
16 comments, 15 min read

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Feb 2, 2024, 6:54 AM
103 points
37 comments, 15 min read

Linear encoding of character-level information in GPT-J token embeddings

Nov 10, 2023, 10:19 PM
34 points
4 comments, 28 min read

Features and Adversaries in MemoryDT

Oct 20, 2023, 7:32 AM
31 points
6 comments, 25 min read

Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview)

Sep 17, 2023, 6:45 PM
27 points
2 comments, 8 min read

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

May 16, 2023, 10:59 PM
36 points
2 comments, 16 min read

Decision Transformer Interpretability

Feb 6, 2023, 7:29 AM
85 points
13 comments, 24 min read