Neel Nanda

Karma: 14,765

Neel Nanda 11 Apr 2026 16:46 UTC
11 points
1
in reply to: the gears to ascension’s comment on: the gears to ascenscion’s Shortform
Why does being public change any of this? They already have a ton of investors

Neel Nanda 3 Apr 2026 3:04 UTC
4 points
2
on: 2026: The year of throwing my agency at my health (now with added cyborgism)
Is exobrain something you built or a service you pay for? I’d be curious to try it but couldn’t see a link

Neel Nanda 1 Apr 2026 20:47 UTC
2 points
0
on: “You Have Not Been a Good User” (LessWrong’s second album)
What’s station 4 about?

Neel Nanda 23 Mar 2026 10:58 UTC
5 points
0
in reply to: Elizabeth’s comment on: strawberry calm’s Shortform
+1, especially with the vast majority of future Anthropic employee donations already locked in to DAFs

Neel Nanda 14 Mar 2026 23:21 UTC
6 points
0
in reply to: habryka’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
Gotcha. I would feel reasonably happy if the policy said “text written or dictated by a human”, if we count my level of LLM editing followed by me editing to be overall light editing

Neel Nanda 14 Mar 2026 23:20 UTC
2 points
0
in reply to: Ninety-Three’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
For me I’ll often reword things, change my mind, go back and add some content to an earlier section, leave todos for myself, have kinda clumsy wording, etc and an LLM is helpful for all of these

Neel Nanda 14 Mar 2026 23:19 UTC
7 points
0
in reply to: Ben Pace’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
If you don’t trust the user, why does the policy matter? Surely you need some way to gauge post quality regardless

Neel Nanda 14 Mar 2026 21:05 UTC
7 points
5
in reply to: RobertM’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
Gotcha. I did not take that from the policy in the post, might be good to reword

EDIT: In particular, as written, the below categories feel like they include my writing, but it sounds like this is not intended

text that was written by a human and then substantially[6] edited or revised by an LLM text that was written by an LLM and then edited or revised by a human

Neel Nanda 14 Mar 2026 21:03 UTC
3 points
0
in reply to: habryka’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
Fair! I was reacting to the concept and didn’t pay much attention to the design. Maybe I would get used to it? I do feel like the concept is what matters here though—I don’t want to read most kinds of slop, and I expect to interpret an LLM block as “high probability of slop”

EDIT: Looking more at the examples in the post, I retract “intrusive”, but the changed font does create a subtle sense of wrongness/a weird vibe, that I could easily see becoming associated in my head with “skip, not worth my time”

Neel Nanda 14 Mar 2026 20:55 UTC
12 points
6
in reply to: habryka’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
Fair enough. How about “I stand by the content of this piece as much as if I’d written it myself”? In my case, most but not all of the phrasing and wording is written by me, and I would cut anything the LLM added that I considered false testimony

Neel Nanda 14 Mar 2026 13:05 UTC
2 points
0
in reply to: Gunnar_Zarncke’s comment on: New LessWrong Editor! (Also, an update to our LLM policy.)
What do you mean by without the transcript part?

Neel Nanda 14 Mar 2026 10:51 UTC
50 points
31
on: New LessWrong Editor! (Also, an update to our LLM policy.)
I often write posts by dictating a verbatim rough draft, giving the audio to Gemini along with a bunch of samples of my past writing and instructions up preserve my voice as much as possible, and then edit what comes out until I’m happy (but in practice it’s close enough to my voice that this is just light editing). Under these rules would I need to put the whole post in an LLM output block?

EDIT: On reflection, the thing that annoys me about this policy is that it lumps in many kinds of LLM assistance, with varying amounts of human investment, into an intrusive format that naively reads to me as “this is LLM slop which you should ignore”.

For example, under my current reading, I would need to label several popular and widely read posts of mine as LLM content (my amount of editing varied from light to heavy between the posts, but LLM assistance was substantial). I think it would have been pretty destructive to make me label each post as LLM written (in practice I would have either violated the policy, or posted on a personal blog and maybe shared a link here)

https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability https://www.lesswrong.com/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher https://www.lesswrong.com/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in https://www.lesswrong.com/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well

I would feel better about eg self selecting a tag for the post about how much an LLM was integrated into the writing process, with a spectrum of options rather than a binary

Neel Nanda 12 Mar 2026 11:03 UTC
13 points
4
in reply to: J Bostock’s comment on: How well do models follow their constitutions?
I was worried about this, and asked someone on the relevant team at Anthropic, but they thought our methods were sufficiently different from their internal approach to still be interesting

Neel Nanda 12 Mar 2026 3:26 UTC
3 points
0
in reply to: habryka’s comment on: The case for AI safety capacity-building work

Very little of the impact of people working in AI Safety is downstream of their research, so this seems wrong.

Where do you think the impact comes from? And is this coming from a background belief that most current alignment work is useless?

Neel Nanda 11 Mar 2026 3:53 UTC
2 points
0
on: What do we know about AI company employee giving?

I know that AI companies have policies including matching donations for approved organizations. It seems like influencing which organizations are elligible for matching could be very valuable, and like employees should not restrict their giving to already approved organizations.

I am not aware of OpenAI or Anthropic having such policies

(Google does, but it’s just “which organisations are on benevity.com” and matches are capped at $10K, so it’s not too relevant here IMO)

Neel Nanda 9 Mar 2026 2:04 UTC
20 points
3
on: Solar Storms
I really enjoyed this post, thanks for writing it! I just subscribed to your Patreon, I’d love to see more posts like this. I enjoyed the level of grounded detail on how the world actually works, tied to a topic that matters

Neel Nanda 7 Mar 2026 12:00 UTC
3 points
0
in reply to: Mikhail Samin’s comment on: Mikhail Samin’s Shortform
Interesting, any idea why Jackson Jr?

Neel Nanda 4 Mar 2026 22:56 UTC
3 points
0
in reply to: beyarkay’s comment on: Current activation oracles are hard to use
It’s a section of the post: https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use?commentId=MKP8hB4Lik9FrHxfB#Can_AOs_extract_censored_knowledge_from_Chinese_models_

Neel Nanda 1 Mar 2026 20:38 UTC
9 points
0
in reply to: Beth Barnes’s comment on: Responsible Scaling Policy v3
I agree with your assessment here, I don’t think METR has had a significant negative effect on the availability of talent in the technical AGI Safety ecosystem, and Anthropic has had a massive negative one. GDM Safety has probably had a moderate negative one, offset by many people preferring to live in London

Neel Nanda 13 Feb 2026 7:09 UTC
11 points
3
in reply to: cfoster0’s comment on: 1a3orn’s Shortform
A lot. Because the only thing that is recurrent is the text/vector CoT. The residual stream is very rich but the number of sequential steps of computation is bounded by the number of layers, without being able to send the intermediate information back to the beginning with some recurrence