CallumMcDougall

Karma: 2,504

New ARENA material: 8 exercise sets on alignment science & interpretability

CallumMcDougall

27 Feb 2026 17:37 UTC

103 points

1 comment · 7 min read · LW link

ARENA 8.0 - Call for Applicants

JScriven, JamesH, David Quarel and CallumMcDougall

20 Feb 2026 18:28 UTC

31 points

1 comment · 6 min read · LW link

Announcing Gemma Scope 2

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan and Neel Nanda

22 Dec 2025 21:56 UTC

96 points

1 comment · 2 min read · LW link

Transmitting Misalignment with Subliminal Learning via Paraphrasing

Matthew Bozoukov, Taywon Min, CallumMcDougall and J Rosser

17 Dec 2025 19:34 UTC

39 points

0 comments · 10 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

67 points

1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

136 points

39 comments · 27 min read · LW link

CallumMcDougall 3 Oct 2025 7:36 UTC
25 points
10
on: Omelas Is Perfectly Misread
Try to find any article, paper, or blog post that lays out this reading. I haven’t found one (but there is sometimes someone in the comments who seems to get it).
Omelas: How We Talk About Utopia
Pretty much identical thesis, does this count?

ARENA 7.0 - Call for Applicants

JScriven, JamesH, CallumMcDougall and David Quarel

30 Sep 2025 14:54 UTC

27 points

1 comment · 6 min read · LW link

CallumMcDougall 9 Sep 2025 22:47 UTC
27 points
15
in reply to: davekasten’s comment on: chanamessinger’s Shortform
True, although I wish more people would engage with the common anti-AI-x-risk argument of “tech CEOs are exaggerating existential risk because they think it’ll make their products seem more important and potentially world-changing, and so artificially boost hype”. Not saying I agree with this, but there’s at least some extent to which it’s true, and I think this community often fails to appropriately engage with and combat this argument.

In general, this is why “appeal to authority” arguments should be avoided when we’re talking about people who are widely seen as untrustworthy and as having ulterior motives. At most, I think people like Geoffrey Hinton are seen as reputable and not as morally compromised, so they serve as better subjects for an appeal to authority. Mostly, though, rather than needing to appeal to authority at all, we should just try to bring things back to the object-level arguments.

ARENA 6.0 - Call for Applicants

JamesH, JScriven, David Quarel, CallumMcDougall and James Fox

4 Jun 2025 10:19 UTC

26 points

3 comments · 6 min read · LW link

New Cause Area Proposal

CallumMcDougall

1 Apr 2025 7:12 UTC

110 points

4 comments · 1 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

26 Mar 2025 19:07 UTC

117 points

15 comments · 29 min read · LW link

(deepmindsafetyresearch.medium.com)

CallumMcDougall 22 Feb 2025 15:59 UTC
3 points
0
in reply to: yihe’s comment on: Induction heads—illustrated
Sorry I didn’t get to this message earlier, glad you liked the post though! The answer is that attention heads can have multiple different functions. The simplest way is to store things entirely orthogonally so they lie in fully independent subspaces, but even this isn’t necessary, because it seems like transformers take advantage of superposition to represent more concepts at once than they have dimensions.
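As a rough illustration of the superposition point above, here is a minimal sketch (not taken from the post or from any model) that packs more random unit "feature" directions into a space than it has dimensions and measures how much they interfere. The function names, the sizes `d = 64` and `n = 128`, and the use of random Gaussian directions are all assumptions chosen for illustration.

```python
import math
import random


def random_unit_vectors(n, d, seed=0):
    """Sample n random directions in d dimensions and normalise each to unit length."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def mean_abs_cosine(vectors):
    """Average |cosine similarity| over all distinct pairs of unit vectors."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            count += 1
    return total / count


d, n = 64, 128  # hypothetical sizes: twice as many features as dimensions
features = random_unit_vectors(n, d)
interference = mean_abs_cosine(features)
print(f"{n} features in {d} dims, mean |cos| between pairs: {interference:.3f}")
```

At most `d` directions can be exactly orthogonal, so the extra features necessarily overlap; the sketch only shows that the average overlap between random directions stays small, which is one intuition for why sharing dimensions can be tolerable.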

ARENA 5.0 - Call for Applicants

JamesH, James Fox, CallumMcDougall, Chloe Li and David Quarel

30 Jan 2025 13:18 UTC

35 points

2 comments · 6 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Diego Caples, Jatin Nainani, CallumMcDougall and rrenaud

10 Jan 2025 11:08 UTC

88 points

11 comments · 17 min read · LW link

CallumMcDougall 1 Jan 2025 18:34 UTC
5 points
0
in reply to: Fabien Roger’s comment on: How to replicate and extend our alignment faking demo
Oh, interesting, wasn’t aware of this bug. I guess this is probably fine since most people replicating it will be pulling it rather than copying and pasting it into their IDE. Also this comment thread is now here for anyone who might also get confused. Thanks for clarifying!

CallumMcDougall 30 Dec 2024 14:40 UTC
3 points
0
on: How to replicate and extend our alignment faking demo
+1, thanks for sharing! I think there’s a formatting error in the notebook, where the tags like <OUTPUT> were all removed and replaced with empty strings (e.g. see attached photo). We’ve recently made the ARENA evals material public, and we’ve got a working replication there which I think has the tags in the right place (section 2 of 3 on the page linked here)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

11 Dec 2024 6:30 UTC

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

CallumMcDougall 6 Dec 2024 10:28 UTC
7 points
1
on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Amazing post! Forgot to do this for a while, but here’s a linked diagram explaining how I think about feature absorption, hopefully ppl find it helpful!

CallumMcDougall 12 Oct 2024 21:05 UTC
4 points
0
in reply to: chanind’s comment on: Toy Models of Feature Absorption in SAEs
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly for illustration of key concepts, not because it gets you many super informative results.) But I’m keen for more work like this trying to understand feature absorption better in more tractable cases.
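To make the tied versus untied distinction concrete, here is a toy sketch of a sparse autoencoder forward pass with biases omitted. It is an assumption-laden illustration, not the ARENA implementation or the setup from the post: the helper names, the 2-dimensional input, the 3 latents, and every weight value are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def sae_forward(x, W_enc, W_dec):
    """latents = ReLU(W_enc @ x); reconstruction = W_dec @ latents (biases omitted)."""
    latents = relu(matvec(W_enc, x))
    return latents, matvec(W_dec, latents)


# Hypothetical encoder: 3 latents reading from a 2-dimensional input.
W_enc = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # shape (n_latents, d_in)

# Tied: the decoder is forced to be the encoder's transpose,
# so each latent writes along the same direction it reads from.
W_dec_tied = transpose(W_enc)  # shape (d_in, n_latents)

# Untied: the decoder is a separate parameter, so a latent's write
# direction can differ from its read direction (latent 2 here).
W_dec_untied = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

x = [1.0, 2.0]
latents, recon_tied = sae_forward(x, W_enc, W_dec_tied)
_, recon_untied = sae_forward(x, W_enc, W_dec_untied)
print("latents:", latents)
print("tied reconstruction:", recon_tied)
print("untied reconstruction:", recon_untied)
```

The only point of the sketch is that tying removes the decoder as an independent set of parameters; whether that extra freedom helps or hurts depends on the setting, as the comment above notes.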