Josh Engels

Karma: 468

Prompting Models to Obfuscate Their CoT

8 Dec 2025 21:00 UTC
15 points
4 comments · 7 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
133 points
35 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
52 points
0 comments · 20 min read · LW link

Negative Results on Group SAEs

Josh Engels · 6 May 2025 21:49 UTC
71 points
3 comments · 8 min read · LW link

Interim Research Report: Mechanisms of Awareness

2 May 2025 20:29 UTC
43 points
6 comments · 8 min read · LW link

Scaling Laws for Scalable Oversight

30 Apr 2025 12:13 UTC
37 points
1 comment · 9 min read · LW link

Josh Engels’s Shortform

Josh Engels · 30 Apr 2025 10:58 UTC
4 points
4 comments · 1 min read · LW link

Takeaways From Our Recent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments · 5 min read · LW link

SAE Probing: What is it good for?

1 Nov 2024 19:23 UTC
34 points
0 comments · 11 min read · LW link