Richard_Ngo comments on Safety researchers should take a public stance

Richard_Ngo 21 Sep 2025 11:25 UTC
35 points
5
FWIW I used to agree with you but now agree with Nate. A big part of the update was developing a model of how “PR risks” work via a kind of herd mentality, where very few people are actually acting on their object-level beliefs, and almost everyone is just tracking what everyone else is tracking.
In such a setting, “internal influence” strategies tend to do very little long-term, and maybe even reinforce the taboo against talking honestly. This is roughly what seems to have happened in DC, where the internal influence approach was swept away by a big Overton window shift after ChatGPT. Conversely, a few principled individuals can have a big influence by speaking honestly (here’s a post about the game theory behind this).
In my own case, I felt a vague miasma of fear around talking publicly while at OpenAI (and to a lesser extent at DeepMind), even though in hindsight there were often no concrete things that I endorsed being afraid of—for example, there was a period where I was roughly indifferent about leaving OpenAI, but still scared of doing things that might make people mad enough to fire me.
I expect that there’s a significant inferential gap between us, so this is a hard point to convey, but one way that I might have been able to bootstrap my current perspective from inside my “internal influence” frame is to try to identify possible actions X such that, if I got fired for doing X, this would be a clear example of the company leaders behaving unjustly. Then even the possible “punishment” for doing X is actually a win.
- Neel Nanda 21 Sep 2025 14:17 UTC
  15 points
  −3
  Parent
  I guess speaking out publicly just seems like a weird distraction to me. Most safety people don’t have a public profile! None of their capabilities colleagues are tracking the fact that they have or have not expressed specific opinions publicly. Some do, but it doesn’t feel like you’re exclusively targeting them. And eg If someone is in company wide slack channels leaving comments about their true views, I think that’s highly visible and achieves the same benefits of talking honestly, with fewer risks.
  
  I’m not concerned about someone being fired for this kind of thing, that would be pretty unwise on the labs’ part as you risk creating a martyr. Rather, I’m concerned about eg senior figures thinking worse of safety researchers as a whole because it causes a PR headache, eg viewing them as radical troublemakers, and this making theories of impact around influencing specific senior decision makers harder (and I’m more optimistic about those, personally)
  - Ishual 21 Sep 2025 19:14 UTC
    4 points
    0
    Parent
    Rather, I’m concerned about eg senior figures thinking worse of safety researchers as a whole because it causes a PR headache, eg viewing them as radical troublemakers, and this making theories of impact around influencing specific senior decision makers harder (and I’m more optimistic about those, personally)
    Thank you Neel for stating this explicitly. I think this is very valuable information. This matches what some of my friends told me privately also. I would appreciate it a lot if you could give a rough estimate of your confidence that this would happen (ideally some probability/percentage). Additionally, I would appreciate if you could say whether you’d expect such a consequence to be legible/visible or illegible (once it had happened). Finally, are there legible reasons you could share for your estimated credence that this would happen?
    (to be clear: I am sad that you are operating under such conditions. I consider this evidence against expecting meaningful impact from the inside at your lab.)
    - Neel Nanda 21 Sep 2025 20:59 UTC
      2 points
      0
      Parent
      It’s not a binary event—I’m sure it’s already happened somewhat. OpenAI has had what, 3 different safety exoduses by now, and (what was perceived to be) an attempted coup? I’m sure leadership at other labs have noticed. But it’s a matter of degree.
      
      I also don’t think this should be particularly surprising—this is just how I expect decision makers at any organisation that cares about its image to behave, unless it’s highly unusual. Even if the company decides to loudly sound the alarm, they likely want to carefully choose the messaging and go through their official channels, not have employees maybe going rogue and ruining message discipline. (There are advantages to the grassroots vibe in certain situations though). To be clear, I’m not talking about “would take significant retaliation”, I’m talking about “would prefer that employees didn’t, even if it won’t actually stop them”
      - Ishual 22 Sep 2025 10:40 UTC
        1 point
        0
        Parent
        This sounds to me like there would actually be specific opportunities to express some of your true beliefs that you wouldn’t worry would cost you a lot (and some other opportunities where you would worry and not do them). Would you agree with that?
  - Ishual 21 Sep 2025 19:44 UTC
    1 point
    0
    Parent
    (optional: my other comment is more important imo)
    I’m not concerned about someone being fired for this kind of thing, that would be pretty unwise on the labs’ part as you risk creating a martyr
    
    I think you ascribe too much competence/foresight/focus/care to the labs. I’d be willing to bet that multiple (safety?) people have been fired from labs in a way that would make the lab look pretty bad. Labs make tactical mistakes sometimes. Wasn’t there a thing at OpenAI for instance (lol)? Of course it is possible(/probable?) that they would not fire in a given case due to sufficient “wisdom”, but we should not assign an extreme likelihood to that.
    - Neel Nanda 21 Sep 2025 21:04 UTC
      2 points
      0
      Parent
      Yeah, agreed that companies sometimes do dumb things, and I think this is more likely at less bureaucratic and more top down places like OpenAI—I do think Leopold went pretty badly for them though, and they’ve hopefully updated. I’m partly less concerned because there’s a lot of upside if the company makes a big screw up like that.
- Lukas Finnveden 21 Sep 2025 14:05 UTC
  5 points
  5
  Parent
  This is roughly what seems to have happened in DC, where the internal influence approach was swept away by a big Overton window shift after ChatGPT.
  In what sense was the internal influence approach “swept away”?
  Also, it feels pretty salient to me that the ChatGPT shift was triggered by public, accessible empirical demonstrations of capabilities being high (and social impacts of that). So in my mind that provides evidence for “groups change their mind in response to certain kinds of empirical evidence” and doesn’t really provide evidence for “groups change their mind in response to a few brave people saying what they believe and changing the overton window”.
  If the conversation changed a lot causally downstream of the CAIS extinction letter or FLI pause letter, that would be better evidence for your position (though also consistent with a model that put less weight on preference cascades and model the impact more like “policymakers weren’t aware that lots of experts were concerned, this letter communicated that experts were concerned”). I don’t know to what extent this was true. (Though I liked the CAIS extinction letter a lot and certainly believe it had a good amount of impact — I just don’t know how much.)