Josh Engels

Karma: 971

Research scientist on the DeepMind interp team. Thoughts are my own and do not reflect views of my employer.

LLM-Driven Feature Discovery

Josh Engels, bilalchughtai and Neel Nanda

22 Jun 2026 22:26 UTC

10 points

0 comments · 5 min read · LW link

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

79 points

2 comments · 4 min read · LW link

Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels and Neel Nanda

14 Jun 2026 19:45 UTC

51 points

7 comments · 10 min read · LW link

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

78 points

4 comments · 1 min read · LW link

Building and evaluating model diffing agents

bilalchughtai, Josh Engels and Neel Nanda

12 Jun 2026 17:14 UTC

61 points

2 comments · 12 min read · LW link

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

27 May 2026 9:39 UTC

31 points

1 comment · 4 min read · LW link

(arxiv.org)

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

58 points

2 comments · 19 min read · LW link

Training on Documents About Monitoring Leads To CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

18 Mar 2026 20:37 UTC

65 points

5 comments · 16 min read · LW link

Thought Editing: Steering Models by Editing Their Chain of Thought

Anton de la Fuente and Josh Engels

3 Feb 2026 9:51 UTC

20 points

0 comments · 5 min read · LW link

Brief Explorations in LLM Value Rankings

Tim Hua, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

12 Jan 2026 18:16 UTC

39 points

1 comment · 11 min read · LW link

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw, Josh Engels and Neel Nanda

29 Dec 2025 21:55 UTC

72 points

11 comments · 19 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment · 9 min read · LW link

Prompting Models to Obfuscate Their CoT

Josh Engels and Felix Tudose

8 Dec 2025 21:00 UTC

16 points

4 comments · 7 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

139 points

39 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments · 20 min read · LW link

Negative Results on Group SAEs

Josh Engels

6 May 2025 21:49 UTC

78 points

3 comments · 8 min read · LW link

Interim Research Report: Mechanisms of Awareness

Josh Engels, Neel Nanda and Senthooran Rajamanoharan

2 May 2025 20:29 UTC

43 points

6 comments · 8 min read · LW link

Scaling Laws for Scalable Oversight

Subhash Kantamneni, Josh Engels, David Baek and Max Tegmark

30 Apr 2025 12:13 UTC

38 points

1 comment · 9 min read · LW link

Josh Engels’s Shortform

Josh Engels

30 Apr 2025 10:58 UTC

4 points

5 comments · 1 min read · LW link