Josh Engels

Karma: 959

Research scientist on the DeepMind interp team. Thoughts are my own and do not reflect views of my employer.

Josh Engels 21 Jun 2026 14:07 UTC
LW: 3 AF: 1
2
AF
in reply to: Oliver Sourbut’s comment on: How transparent is DiffusionGemma (and why it matters)
Ah yeah, we probably should have used different language here, top p means keep any tokens with probability greater than p. It’s more like “p threshold” or “p cutoff”

How transparent is DiffusionGemma (and why it matters)

Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah and Neel Nanda

20 Jun 2026 20:05 UTC

73 points

2 comments4 min readLW link

Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels and Neel Nanda

14 Jun 2026 19:45 UTC

50 points

7 comments10 min readLW link

SFT Drives Gemini’s Safety Properties

Josh Engels, Arthur Conmy, bilalchughtai and Neel Nanda

13 Jun 2026 15:31 UTC

78 points

3 comments1 min readLW link

Building and evaluating model diffing agents

bilalchughtai, Josh Engels and Neel Nanda

12 Jun 2026 17:14 UTC

61 points

2 comments12 min readLW link

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

27 May 2026 9:39 UTC

31 points

1 comment4 min readLW link

(arxiv.org)

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

58 points

2 comments19 min readLW link

Training on Documents About Monitoring Leads To CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

18 Mar 2026 20:37 UTC

65 points

5 comments16 min readLW link

Josh Engels 10 Mar 2026 23:09 UTC
43 points
3
in reply to: Mitchell_Porter’s comment on: Gemma Needs Help
I’ve been looking into this and hope to release work at some point soon.

Josh Engels 13 Feb 2026 0:04 UTC
4 points
2
on: Josh Engels’s Shortform
I think that most of the time when you need to classify something, you should use an LLM, not a probe.

That being said, there are some situations where I think activation probes are better. To help clarify my thinking, I wrote out the axes on which I currently think that probes are sometimes better than LLMs / possibly SOTA:

1. Efficiency → when done on policy, probes are extremely cheap and fast. For example, Anthropic’s work on efficient misuse probes or our recent Gemini Probing paper.

2. Safety → for some things like deception, alignment faking, eval awareness, etc., you might not be able to trust the model’s output, and probes (or other internals based techniques) might help you. See Apollo’s work on deception probes and Probing and Steering Evaluation Awareness of Language Models.

3. Elicitation → the knowledge is in the LLM, but it’s hard to figure out the prompt to get the knowledge out exactly as you want. In this case, the probe training data is your way to convey the parameters of what you want. Goodfire’s recent paper could be an example of this: https://arxiv.org/abs/2602.10067

4. Calibration → again, the knowledge is in the LLM, but the model has trouble telling you its internal confidence in the knowledge when prompted. For example, in Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?, they conclude that truth probes are sometimes better than prompting primarily because of better calibration.

Thought Editing: Steering Models by Editing Their Chain of Thought

Anton de la Fuente and Josh Engels

3 Feb 2026 9:51 UTC

20 points

0 comments5 min readLW link

Brief Explorations in LLM Value Rankings

Tim Hua, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

12 Jan 2026 18:16 UTC

39 points

1 comment11 min readLW link

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw, Josh Engels and Neel Nanda

29 Dec 2025 21:55 UTC

72 points

11 comments19 min readLW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment9 min readLW link

Josh Engels 11 Dec 2025 12:00 UTC
4 points
0
in reply to: Brendan Long’s comment on: Prompting Models to Obfuscate Their CoT
Commenting here for Felix since he has a new account and it hasn’t been approved by a LessWrong moderator yet:
Good question! The problem differs between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2 and fails, sees the failure message (“TRY AGAIN with a new scenario:”), and then is given case 1, a completely different situation it hasn’t seen before (with different medications).

Prompting Models to Obfuscate Their CoT

Josh Engels and Felix Tudose

8 Dec 2025 21:00 UTC

16 points

4 comments7 min readLW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment14 min readLW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

139 points

39 comments27 min readLW link

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments20 min readLW link

Josh Engels 21 Oct 2025 1:02 UTC
2 points
0
on: Spooky Collusion at a Distance with Superrational AI
Cool work!
I was thinking a bit and I think that nondeterministic sampling (temperature > 0*), which is standard for LLM sampling, might complicate the picture a bit. A model instance might think “well almost surely I would choose to cooperate, but maaaybe in a rare instance I would defect.”
E.g. for Wolf’s dilemma with n = 1000, if I’m an LLM and I decide that there’s initially a 99% chance a copy of me will cooperate but a 1% chance it will defect, it’s now suddenly optimal to defect (perhaps the model reasons that it is in this 1%!). This might then cause the LLM to change its 99 − 1 initial estimate (because it thinks other models will reason like it), so I’m not actually sure how the logic goes here, it seems hard to reason about.
As a concrete experiment, I would be curious if models start to defect more if you increase n! It would be extremely cool if there was some threshold n where a model suddenly starts defecting, as possibly that would mean that it was aware of its own probability p of defecting and was waiting for n > 1 / p.
*Did you run this with temperature 0? Note that even with temperature 0, I’m not sure that model sampling is actually deterministic in practice.