Neel Nanda

Karma: 13,447

Neel Nanda 7 Jan 2026 21:03 UTC
LW: 2 AF: 2
0
AF
in reply to: Linda Linsefors’s comment on: Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)
Sorry, I don’t quite understand your arguments here. Five tokens back and ten tokens back are both within what I would consider short-range context, e.g. they are typically within the same sentence. I’m not saying there are layers that look at the most recent five tokens and then other layers that look at the most recent ten tokens, etc. It’s more that early layers aren’t looking 100 tokens back as much. The lines being parallelish is consistent with our hypothesis.

Neel Nanda 1 Jan 2026 22:48 UTC
6 points
0
in reply to: boazbarak’s comment on: You will be OK
Ah, thanks, I missed that part.

Neel Nanda 1 Jan 2026 16:52 UTC
20 points
8
in reply to: boazbarak’s comment on: You will be OK
Thanks for the addendum! I broadly agree with “I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption.”, maybe scoping the assumption to my personal life (I very much endorse working on reducing tail risks!)

I disagree with the “a prediction” argument though. Being >50% likely to happen does not mean people shouldn’t give significant mental space to the other, less likely outcomes. This is not how normal people live their lives, nor how I think they should. For example, people avoid smoking because they want to avoid lung cancer, even though their chance of dying of it would be well under 50% (I think?). People avoid dangerous extreme sports, even though most people doing them don’t die. People wear seatbelts even though they’re pretty unlikely to die in a car accident. Parents make all kinds of decisions to protect their children from much smaller risks. The bar for “not worth thinking about” is well under 1% IMO. Of course, “can you reasonably affect it” is a big question. I do think there are various bad outcomes short of human extinction, e.g. worlds of massive inequality, where actions taken now might matter a lot for your personal outcomes.

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

1 Jan 2026 16:37 UTC

22 points

0 comments · 23 min read · LW link

Neel Nanda 1 Jan 2026 16:21 UTC
30 points
25
in reply to: peterbarnett’s comment on: You will be OK
Yeah, I feel like the title of this post should be something like “act like you will be OK” (which I think is pretty reasonable advice!)

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw, Josh Engels and Neel Nanda

29 Dec 2025 21:55 UTC

43 points

8 comments · 19 min read · LW link

Announcing Gemma Scope 2

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan and Neel Nanda

22 Dec 2025 21:56 UTC

90 points

1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

27 points

0 comments · 9 min read · LW link

Neel Nanda 20 Dec 2025 1:07 UTC
9 points
4
on: AI Safety has a scaling problem
Verifying solutions is time-consuming enough that I don’t think this really alleviates the mentorship bottleneck. And it’s quite hard to specify research problems that both capture important things and are precise enough to be a good bounty. So I’m fairly pessimistic about this resolving the issue. I personally would expect more research to happen per unit of my time mentoring MATS than putting up and judging bounties.

Neel Nanda 14 Dec 2025 22:50 UTC
14 points
0
in reply to: habryka’s comment on: Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser

Also, to give a bit more context on my thinking here, I currently think that it’s fine for us to accept funding from Deepmind safety employees without counting towards this bucket, largely because my sense is the social coordination across the pond here is much less intense, and generally the Deepmind safety team has struck me as the most independent from the harsh financial incentives here to date

Seems true to me, though alas we don’t seem likely to be getting an Anthropic level windfall any time soon :’(

Neel Nanda 14 Dec 2025 22:48 UTC
2 points
0
in reply to: habryka’s comment on: Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser
Makes sense! Options like Giving What We Can regranting seem to work for a bunch of people, but maybe they aren’t willing to do it for Lightcone?

But yeah, any donor giving $50K+ can afford to set up their own donor advised fund and get round this kind of issue, so you probably aren’t losing out on THAT much by having this be hard.

Neel Nanda 14 Dec 2025 22:43 UTC
18 points
10
in reply to: habryka’s comment on: Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser
What are the specific concerns re having too much funding come from frontier lab employees? I predict that it’s much more OK to be mostly funded by a collection of 20+ somewhat disagreeable frontier lab employees who have varied takes than to have it all come from OpenPhil. It seems much less likely that the employees will coordinate to remove funding at once or to pressure you in the same way (especially if they include people from different labs).

Neel Nanda 14 Dec 2025 22:40 UTC
10 points
7
in reply to: RobertM’s comment on: Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser
I think that number is way too low for anyone OpenAI actually really cares about hiring. Though this kind of thing is very, very heavy-tailed.

Neel Nanda 14 Dec 2025 22:38 UTC
16 points
0
on: Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser

If you are in the UK, donate through NPT Transatlantic.

This recommendation will not work for most donors. The fees are a flat fee of £2,000 on any donation less than £102,000. NPT are mostly a donor advised fund service. While they do allow for single gifts, you need to be giving in the tens of thousands for it to make any sense.

The Anglo-American Charity is a better option for tax deductible donations to US charities. Their fee is 4% on donations below £15K, with a minimum fee of £250. (I have not personally used them, but friends of mine have.)

Given the size of the UK rationality community (including in high paying tech and finance roles), I imagine there would be interest if you could set up some more convenient way for small to medium UK based donors to donate tax deductibly

Neel Nanda 9 Dec 2025 20:11 UTC
10 points
2
on: Auditing Games for Sandbagging [paper]
Great work! I’m excited to see red team blue team games being further invested in and scaled up. I think it’s a great style of objective proxy task

Neel Nanda 7 Dec 2025 17:30 UTC
22 points
0
on: Neel Nanda’s Shortform
I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?

Neel Nanda 4 Dec 2025 17:33 UTC
3 points
0
on: AI #145: You’ve Got Soul

Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.

I’m surprised that you seem to simultaneously be concerned that it was too easy to feel like you were making progress in past mech interp, and to push back against us saying that it was too easy to incorrectly feel like you’re making progress in mech interp and that we need better metrics of whether we’re making progress.

In general they want to time-box and quantify basically everything?

The key part is to be objective, which is related to but not the same thing as being quantifiable. For example, you can test if your hypothesis is correct by making non-trivial empirical predictions and then verifying them, e.g. if you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?

Pragmatic problems are often the comparative advantage of frontier labs.

Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying

Neel Nanda 4 Dec 2025 17:13 UTC
9 points
4
on: AI #145: You’ve Got Soul

I also think that ‘was it ‘scheming’ or just ‘confused’,’ an example of a question Neel Nanda points to, is a remarkably confused question, the boundary is a lot less solid than it appears, and in general attempts to put ‘scheming’ or ‘deception’ or similar in a distinct box misunderstand how all the related things work.

Yes, obviously this is a complicated question. Figuring out what the right question to ask is is part of the challenge. But I think there clearly is some real substance here: there are times an AI causes bad outcomes that indicate a goal-directed entity taking undesired actions, and there are times that don’t, and figuring out the difference is very important.

[Paper] Difficulties with Evaluating a Deception Detector for AIs

bilalchughtai, lewis smith and Neel Nanda

3 Dec 2025 20:07 UTC

30 points

2 comments · 6 min read · LW link

(arxiv.org)

Neel Nanda 3 Dec 2025 10:03 UTC
LW: 12 AF: 4
1
AF
in reply to: Rohin Shah’s comment on: A Pragmatic Vision for Interpretability
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
