I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?
Neel Nanda
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I’m surprised that you seem to simultaneously be concerned that it was too easy to feel like you were making progress in past mech interp, and to push back against us saying that it was too easy to incorrectly feel like you’re making progress in mech interp and that we need better metrics of whether we’re making progress.
In general they want to time-box and quantify basically everything?
The key part is to be objective, which is related to but not the same thing as being quantifiable. For example, you can test whether your hypothesis is correct by making non-trivial empirical predictions and then verifying them: if you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?
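As a toy illustration of the "predict, then verify" loop (a minimal sketch only: the model, prompts, hypothesis, and pass threshold below are all illustrative assumptions, not from the original discussion), you can write down the predicted direction of an effect before running the intervention:

```python
# Toy sketch of "make a non-trivial prediction, then verify it" for an
# interpretability hypothesis. Model, prompts, and hypothesis are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_prob(prompt: str, target: str) -> float:
    """Probability the model assigns to `target` as the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    target_id = tok.encode(target, add_special_tokens=False)[0]
    return torch.softmax(logits, dim=-1)[target_id].item()

# Hypothetical claim: the model says " Paris" because of the literal phrase
# "capital of France", so swapping in "Germany" should sharply reduce P(" Paris").
# The prediction (and the threshold for passing) is fixed *before* running it.
p_before = next_token_prob("The capital of France is", " Paris")
p_after = next_token_prob("The capital of Germany is", " Paris")
print(f"P(' Paris' | France prompt):  {p_before:.3f}")
print(f"P(' Paris' | Germany prompt): {p_after:.3f}")
print("prediction held" if p_after < 0.1 * p_before else "prediction failed")
```

The point is just that the check is objective: the prediction is specified before the intervention runs, and it either holds or it doesn’t.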
Pragmatic problems are often the comparative advantage of frontier labs.
Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying
I also think that ‘was it “scheming” or just “confused”?’, an example of a question Neel Nanda points to, is a remarkably confused question; the boundary is a lot less solid than it appears, and in general attempts to put ‘scheming’ or ‘deception’ or similar into a distinct box misunderstand how all the related things work.
Yes, obviously this is a complicated question. Figuring out the right question to ask is part of the challenge. But I think there clearly is some real substance here: there are times when an AI causes bad outcomes that indicate a goal-directed entity taking undesired actions, and times when that isn’t what’s going on, and figuring out the difference is very important.
[Paper] Difficulties with Evaluating a Deception Detector for AIs
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
If you can get a good proxy to the eventual problem in a real model, I much prefer that, on realism grounds. Eg eval awareness in Sonnet.
Seems reasonable. But yeah, it’s a high bar
In my parlance it sounds like you agree with me about the importance of robustly useful settings, you just disagree with my more specific claim that model organisms are often bad robustly useful settings?
I think that it’s easy to simulate being difficult and expensive to evaluate, and current models already show situational awareness. You can try simulating things like scheming with certain prompting experiments, though I think that one’s more tenuous.
Note that I consider model organisms to involve fine-tuning, and prompted settings to be in scope depending on the specific details of how contrived they seem.
I’m generally down for making a model organism designed to test a very specific thing that we expect to see in future models. My scepticism is about the claim that this generalizes, and that you can just screw around in a model organism and expect to discover useful things. I think if you design it to test a narrow phenomenon then there’s a good chance you can in fact test that narrow phenomenon.
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Have you tried emailing the authors of that paper and asking if they think you’re missing any important details? Imo there are 3 kinds of papers:
1. Totally legit
2. Kinda fragile and fiddly: there’s various tacit knowledge and key details to get right, but the results are basically legit. Or, eg, it’s easy for you to have a subtle bug that breaks everything. Imo it’s bad if they don’t document this, but it’s different from not replicating
3. Misleading (eg only works in a super narrow setting and this was not documented at all) or outright false
I’m pro more safety work being replicated, and would be down to fund a good effort here, but I’m concerned about 2 and 3 getting confused
Interesting. Thanks for writing up the post. I’m reasonably persuaded that this might work, though I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn’t be well approximated by this process. I would be interested to see someone try though
Sure, but there’s a big difference between engaging in PR damage control mode and actually seriously engaging. I don’t take them choosing to be in the former as significant evidence of wrongdoing.
Thanks for adding the context!
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing, have been somewhat misleading in their responses to my tweets, and haven’t answered any of the questions.
I can’t really fault them for not answering or being fully honest, from their perspective you’re a random dude who’s attacking them publicly and trying to get them lots of bad PR. I think it’s often very reasonable to just not engage in situations like that. Though I would judge them for outright lying
Thanks for sharing, this is extremely important context. I’m way more ok with dual-use threats from a company actively trying to reduce bio risk from AI, who seem to have vaguely reasonable threat models, than from just reckless gain-of-function people with insane threat models. It’s much less clear to me how much risk is ok to accept from projects actively doing reasonable things to make it better, but it’s clearly non-zero (I don’t know if this place is actually doing reasonable things, but Mikhail provides no evidence against).
I think it was pretty misleading for Mikhail not to include this context in the original post.
Current LLMs seem to rarely detect CoT tampering
I think that being a good founder in AI safety is very hard, and generally only recommend doing it after having some experience in the field—this strongly applies to research orgs, but also to eg field building. If you’re founding something, you need to constantly make judgements about what is best, and don’t really have mentors to defer to, unlike many entry level safety roles, and often won’t get clear feedback from reality if you get them wrong. And these are very hard questions, and if you don’t get them right, there’s a good chance your org is mediocre. I think this applies even to orgs within an existing research agenda (most attempts to found mech interp orgs seem doomed to me). Field building is a bit less dicey, but even then, you want strong community connections and a sense for what will and will not work.
I’m very excited for there to be more good founders in AI Safety, but don’t think loudly signal boosting this to junior people is a good way to achieve this. And imo “founding an org” is already pretty high status, at least if you’re perceived to have some momentum behind you?
I’m also fine with people without a lot of AI safety expertise partnering as co-founders with those who do have it, but I struggle to think of orgs that I think have gone well that didn’t have at least one highly experienced and competent co-founder.
40 people attending a one-hour meeting spend 40 hours of total employee time. 20 people attending a one-hour meeting spend 20 hours of total employee time. Total employee time is the main metric that matters.
Good point, I buy it
Great work! I’m excited to see red-team/blue-team games being further invested in and scaled up. I think it’s a great style of objective proxy task.