Cool work!
I was thinking a bit, and nondeterministic sampling (temperature > 0*), which is standard for LLMs, might complicate the picture. A model instance might think “well, almost surely I would choose to cooperate, but maaaybe in a rare instance I would defect.”
E.g. for Wolf’s dilemma with n = 1000, if I’m an LLM and I initially estimate a 99% chance that a copy of me will cooperate and a 1% chance it will defect, it’s now suddenly optimal to defect: the chance that all 999 other copies cooperate is only 0.99^999 ≈ 0.004%, so cooperation is almost certain to fail (and perhaps the model reasons that it is in this 1%!). This might then cause the LLM to revise its 99%/1% initial estimate (because it thinks other copies will reason like it), so I’m not actually sure how the logic goes here; it seems hard to reason about.
As a concrete experiment, I would be curious whether models start to defect more as you increase n! It would be extremely cool if there were some threshold n at which a model suddenly starts defecting, since that might mean it was aware of its own probability p of defecting and was waiting for n > 1 / p.
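To make the n > 1/p intuition concrete, here is a minimal back-of-the-envelope sketch (not the experiment itself). I’m assuming Hofstadter-style Wolf’s dilemma payoffs (cooperators get a large reward only if every copy cooperates, defectors get a smaller guaranteed reward) and a fixed, independent per-copy defection probability p; the payoff numbers and function names are made up for illustration.

```python
# Sketch only: assumes Wolf's-dilemma payoffs where cooperators get R_ALL
# iff *everyone* cooperates (else 0), defectors always get R_DEFECT, and
# each copy defects independently with probability p. Numbers are illustrative.

R_ALL = 1000     # payoff to each player if nobody defects (assumed)
R_DEFECT = 100   # guaranteed payoff to a defector (assumed)

def expected_payoffs(n: int, p: float):
    """Expected payoff of cooperating vs. defecting, given that each of
    the other n-1 copies independently defects with probability p."""
    p_all_others_cooperate = (1 - p) ** (n - 1)
    ev_cooperate = p_all_others_cooperate * R_ALL
    ev_defect = R_DEFECT
    return ev_cooperate, ev_defect

def defection_threshold(p: float, n_max: int = 10_000):
    """Smallest n at which defecting beats cooperating in expectation, if any."""
    for n in range(2, n_max + 1):
        ev_c, ev_d = expected_payoffs(n, p)
        if ev_d > ev_c:
            return n
    return None

if __name__ == "__main__":
    p = 0.01
    print(expected_payoffs(1000, p))  # cooperation EV collapses to ~0.04
    print(defection_threshold(p))     # ~231 with these payoffs; in general
                                      # ~ln(R_ALL / R_DEFECT) / p, i.e. order 1/p
```

Of course the model’s p presumably isn’t fixed or independent across copies (that self-reference is exactly the hard part above), so this is only the naive calculation a model might run if it treated its defection probability as given.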
*Did you run this with temperature 0? Note that even with temperature 0, I’m not sure that model sampling is actually deterministic in practice.
Commenting here for Felix since he has a new account that hasn’t been approved by a LessWrong moderator yet: