Subhash Kantamneni

Karma: 506

Subhash Kantamneni 5 Jun 2026 5:55 UTC
3 points
0
on: Logits as a new monitor for evaluation awareness
This is quite clever and cool! How much do you think this technique relies on a model’s predisposition to verbalize awareness? I’m curious how well it would work on model organisms that eval scheme but don’t verbalize awareness. For these models, we might expect activation-based methods to work, but verbalized awareness to provide little signal. Really great work!

Subhash Kantamneni 9 May 2026 2:05 UTC
4 points
0
in reply to: ryan_greenblatt’s comment on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
We’re fans of the Consistency Lens and shout it out in our acknowledgements!

Subhash Kantamneni 9 May 2026 2:04 UTC
2 points
0
in reply to: Adam Kaufman’s comment on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
Yeah, we present a variety of such transformations in our experiments on steganography, in the subsection “Measuring behavioral properties of NLAs”. It seems reasonable to explore adding these into the training loop as a preventative measure.

Subhash Kantamneni 8 May 2026 16:32 UTC
5 points
0
in reply to: nielsrolf’s comment on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
I agree. I generally like the setting of multihop reasoning, and it’s one of the first we looked at to build confidence that NLAs are doing something reasonable. For instance, this result used an early version of NLAs on Haiku 3.5.
It’s worth noting that NLAs struggle with numbers (they’re the type of specific detail that gets confabulated). But I’m a little confused: the capital of France is Paris, right? So the answer should be (8-5)*2=6, and we shouldn’t see 13 as an intermediate? In general, we might also need to run the NLA on more token positions.

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Subhash Kantamneni, kitft, Euan Ong and Sam Marks

7 May 2026 20:21 UTC

215 points

35 comments · 8 min read · LW link

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas and Owain_Evans

18 Dec 2025 20:21 UTC

154 points

12 comments · 8 min read · LW link

(arxiv.org)

Scaling Laws for Scalable Oversight

Subhash Kantamneni, Josh Engels, David Baek and Max Tegmark

30 Apr 2025 12:13 UTC

39 points

1 comment · 9 min read · LW link

Takeaways From Our Recent Work on SAE Probing

Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2025 19:50 UTC

30 points

4 comments · 5 min read · LW link

Language Models Use Trigonometry to Do Addition

Subhash Kantamneni

5 Feb 2025 13:50 UTC

80 points

1 comment · 10 min read · LW link

SAE Probing: What is it good for?

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan and Neel Nanda

1 Nov 2024 19:23 UTC

34 points

0 comments · 11 min read · LW link