I think there’s a huge difference between labs saying “there’s lots of risk” and labs saying “no seriously, please shut everyone down including me, I’m only doing this because others are allowed to and would rather we all stopped”. The latter is consistent with the view; its absence is conspicuous. Here is an example of someone noticing in the wild; I have also heard that sort of response from multiple elected officials. If Dario could say it that’d be better, but lots of researchers in the labs saying it would be a start. And might even make it more possible for lab leaders to come out and say it themselves!
I agree there’s a big difference; my skepticism is about whether a handful of lab safety researchers saying this would matter, when people like Hinton say it and lab CEOs do not (like, I would be pretty shocked if you could get this above 50 lab employees, out of thousands total). I would be curious to hear more about the chats with elected officials, if they’ve led you to think differently?
Quick take: I agree it might be hard to get above 50 today. I think that even 12 respected people inside one lab today would have an effect on the Overton window inside labs, which I think would have an effect over time (aided primarily by the fact that the arguments are fairly clearly on the side of a global stop being better; it’s harder to keep true things out of the Overton window). I expect it’s easier to shift culture inside labs first, rather than inside policy shops, bc labs at least don’t have the dismissals of “they clearly don’t actually believe that” and “if they did believe it they’d act differently” ready to go. There are ofc many other factors that make it hard for a lab culture to fully adopt the “nobody should be doing this, not even us” stance, but it seems plausible that that could at least be brought into the Overton window of the labs, and that that’d be a big improvement (towards, eg, lab heads becoming able to say it).
Ah, if your main objective is to shift internal lab culture, I’m pretty on board with this aim, but would recommend different methods. To me, speaking prominently and publicly could eg pose significant PR risk to a lab and get resistance, while speaking loudly in internal channels is unlikely to and may be more effective. For example, I’d be more optimistic about writing some kind of internal memo making the case and trying to share it widely/create buzz, sharing the most legit examples of current AI being scary in popular internal channels, etc. I still expect this to be extremely hard, to be risky for the cause if done badly, and to become easier the scarier AI gets, so it doesn’t feel like one of my top priorities right now, but I’m much more sympathetic to the ask, and do think this is something internal lab safety teams should be actively thinking about. I definitely agree with “arguing for true things is easier”, though I do not personally think “the pragmatically best solution is a global ban” is objectively true (I appreciate you writing a book trying to make this case though!)
Oh yeah, I agree that (earnest and courageous) attempts to shift the internal culture are probably even better than saying your views publicly (if you’re a low-profile researcher).
I still think there’s an additional boost from consistently reminding people of your “this is crazy and Earth should do something else” views whenever you are (e.g.) on a podcast or otherwise talking about your alignment hopes. Otherwise I think you give off a false impression that the scientists have things under control and think that the race is okay. (I think most listeners to most alignment podcasts or w/e hear lots of cheerful optimism and none of the horror that is rightly associated with a >5% chance of destroying the whole human endeavor, and that this contributes to the culture being stuck in a bad state across many orgs.)
FWIW, it’s not a crux for me whether a stop is especially feasible or the best hope to be pursuing. On my model, the world is much more likely to respond in marginally saner ways the more that decision-makers understand the problem. Saying “I think a stop would be better than what we’re currently doing, and I beg the world to shut down everyone, including us”, if you believe it, helps communicate your beliefs (and thus the truth, insofar as you’re good at believing), even if the exact policy proposal doesn’t happen. I think the equilibrium where lots and lots of people understand the gravity of the situation is probably better than the current equilibrium in lots of hard-to-articulate and hard-to-predict ways, even if the better equilibrium would not be able to pull off a full stop.
(For an intuition pump: perhaps such a world could pull off “every nation sabotages every other nation’s ASI projects for fear of their own lives”, as an illustration of how more understanding could help even w/out a treaty.)
Yeah, I agree that media stuff (podcasts, newspapers, etc.) is more of an actual issue (though it only involves a small fraction of lab safety people)
I’m sure this varies a lot between contexts, but I’d guess that at large companies, employees being allowed to do podcasts or talk to journalists on the record is contingent (among other things) on them being trusted not to say things that could lead to journalists writing hit pieces with titles like “safety researcher at company A said B!!!” (it’s ok if they believe some spicy things, so long as they are careful not to express them in that role). This is my model in general, not just for AI safety.
There are various framings you can use, like employing a bunch of jargon to say something spicy so it’s hard to turn into a hit piece (eg “self exfiltration” over “escape the data center”), but there are ultimately still a bunch of constraints. The Overton window has shifted a lot, so at least for us, we can say a fair amount about the risks and dangers being real, but it’s only shifted so much.
Imo this is actually pretty hard and costly to defect against, and I think the correct move is to cooperate—it’s a repeated game, so if you cause a mess you’ll stop being allowed to do media things. (And doing media things without permission is a much bigger deal than eg publicly tweeting something spicy that goes viral). And for things like podcasts, it’s hard to cause a mess even once, as company comms departments often require edit rights to the podcast. And that podcast often wants to keep being able to interview other employees of that lab, so they also don’t want to annoy the company too much.
Personally, when I’m doing a media thing that isn’t purely technical, I try to be fairly careful with the spicier parts, only say true things, and just avoid topics where I can’t say anything worthwhile, while trying to say interesting true things within these constraints where possible.
In general, I think that people should always assume that someone speaking to a large public audience (to a journalist, on a podcast, etc), especially someone who represents a large company, will not be fully speaking their mind, and interpret their words accordingly; in most industries I would consider this general professional responsibility. But I do feel kinda sad that if someone thinks I am fully speaking my mind and watches eg my recent 80K podcast, they may make some incorrect inferences. So overall I agree with you that this is a real cost, I just think it’s worthwhile to pay it and hard to avoid without just never touching on such topics in media appearances.
I am personally squeamish about AI alignment researchers staying in their positions in the case where they can only go on podcasts and keep their jobs if they never say “this is an insane situation and I wish Earth would stop instead (even as I expect it won’t and try to make things better)”, if that’s what they believe. That starts to feel to me like misleading the Earth in support of the mad scientists who are gambling with all our lives. If that’s the price of staying at one of the labs, I start to feel like exiting and giving that as the public reason is a much better option.
In part this is because I think it’d make all sorts of news stories in a way that would shift the Overton window and make it more possible for other researchers later to speak their mind (and shift the internal culture and thus shift the policymaker understanding, etc.), as evidenced by e.g. the case of Daniel Kokotajlo. And in part because I think you’d be able to do similarly good or better work outside of a lab like that. (At a minimum, my guess is you’d be able to continue work at Anthropic, e.g. b/c Evan can apparently say it and continue working there.)
Hmm. Fair enough if you feel that way, but it doesn’t feel like that big a deal to me. I guess I’m trying to evaluate “is this a reasonable way for a company to act”, not “is the net effect of this to mislead the Earth”, which may be causing some inferential distance? And this is just my model of the normal way a large, somewhat risk-averse company would behave, and is not notable evidence of the company making unsafe decisions.
I think that if you’re very worried about AI x-risk you should only join an AGI lab if, all things considered, you think it will reduce x-risk. And discovering that the company does a normal company thing shouldn’t change that. By my lights, me working at GDM is good for the world, both via directly doing research, and influencing the org to be safer in various targeted ways, and media stuff is a small fraction of my impact. And the company’s attitude to PR stuff is consistent with my beliefs about why it can be influenced.
And to be clear, the specific thing that I could imagine being a fireable offence would be repeatedly going on prominent podcasts, against instructions, to express inflammatory opinions, in a way that creates bad PR for your employer. And even then I’m not confident; firing people can be a pain (especially in Europe). I think this is pretty reasonable for companies to object to; the employee would basically be running an advocacy campaign on the side. If it’s a weaker version of that, I’m much more uncertain: if it wasn’t against explicit instructions or it was a one-off you might get off with a warning, if it is on an obscure podcast/blog/tweet there’s a good chance no one even noticed, etc.
I’m also skeptical of this creating the same kind of splash as Daniel or Leopold because I feel like this is a much more reasonable company decision than those.
The thing I’m imagining is more like mentioning, almost as an aside, in a friendly tone, that ofc you think the whole situation is ridiculous and that stopping would be better (before & after having whatever other convo you were gonna have about technical alignment ideas or w/e). In a sort of “Carthago delenda est” fashion.
I agree that a host company could reasonably get annoyed if their researchers went on many different podcasts to talk for two hours about how the whole industry is sick. But if casually reminding people “the status quo is insane and we should do something else” at the beginning/end is a fireable offense, in a world where lab heads & Turing award winners & Nobel laureate godfathers of the field are saying this is all ridiculously dangerous, then I think that’s real sketchy and that contributing to a lab like that is substantially worse than the next best opportunity. (And similarly if it’s an offense that gets you sidelined or disempowered inside the company, even if not exactly fired.)
Ah, that’s not the fireable offence. Rather, my model is that doing that means you (probably?) stop getting permission to do media stuff. And doing media stuff after being told not to is the potentially fireable offence. Which to me is pretty different than specifically being fired because of the beliefs you expressed. The actual process would probably be more complex, eg maybe you just get advised not to do it again the first time, and you might be able to get away with more subtle or obscure things, but I feel like this only matters if people notice.
Thanks for the clarification. Yeah, from my perspective, if casually mentioning that you agree with the top scientists & lab heads & many many researchers that this whole situation is crazy causes your host company to revoke your permission to talk about your research publicly (maybe after a warning), then my take is that that’s really sketchy and that contributing to a lab like that is probably substantially worse than your next best opportunity (e.g. b/c it sounds like you’re engaging in alignmentwashing and b/c your next best opportunity seems like it can’t be much worse in terms of direct research).
(I acknowledge that there’s room to disagree about whether the second-order effect of safetywashing is outweighed by the second-order effect of having people who care about certain issues existing at the company at all. A very quick gloss of my take there: I think that if the company is preventing you from publicly acknowledging commonly-understood-among-experts key features of the situation, in a scenario where the world is desperately hurting for policymakers and lay people to understand those key features, I’m extra skeptical that you’ll be able to reap the imagined benefits of being a “person on the inside”.)
I acknowledge that there are analogous situations where a company would feel right to be annoyed, e.g. if someone were casually bringing up their distantly-related political stances in every podcast. I think that this situation is importantly disanalogous, because (a) many of the most eminent figures in the field are talking about the danger here; and (b) alignment research is used as a primary motivating excuse for why the incredibly risky work should be allowed to continue. There’s a sense in which the complicity of alignment researchers is a key enabling factor for the race; if all alignment researchers resigned en masse citing the insanity of the race, then policymakers would be much more likely to go “wait, what the heck?” In a situation like that, I think the implicit approval of alignment researchers is not something to be traded away lightly.
For what it’s worth, I think that it’s pretty likely that the bureaucratic processes at (e.g.) Google haven’t noticed that acknowledging that the race to superintelligence is insane has a different nature than (e.g.) talking about the climate impacts of datacenters, and I wouldn’t be surprised if (e.g.) Google issued one of their researchers a warning the first time they mentioned things, not out of deliberate sketchiness but just out of bureaucratic habit. My guess is that that’d be a great opportunity to push back, spell out the reason why the cases are different, and see whether the company stands up to its alleged principles or codifies its alignmentwashing practices. If you have the opportunity to spur that conversation, I think that’d be real cool of you—I think there’s a decent chance it would spark a bunch of good internal cultural change, and also a decent chance that it would make the issues with staying at the lab much clearer (both internally, and to the public if a news story came of it).
Separate point: Even if the existence of alignment research is a key part of how companies justify their existence and continued work, I don’t think all of the alignment researchers quitting would be that catastrophic to this, because what appears to be alignment research to a policymaker is a pretty malleable thing. Large fractions of current post-training are fundamentally about how to get the model to do what you want when this is hard to specify: eg how to do reasoning model training for harder-to-verify rewards, avoiding reward hacking, avoiding sycophancy, etc. Most people working on these things aren’t thinking too much about AGI safety and would not quit, but their work could easily be sold to policymakers as alignment work. (And I do personally think the work is somewhat relevant, though far from the most important thing and not sufficient, but this isn’t a crux.)
All researchers quitting en masse and publicly speaking out seems impactful for whistleblowing reasons, of course, but even there I’m not sure how much it would actually do, especially in the current political climate.
I still feel like you’re making much stronger updates on this than I think you should. A big part of my model here is that large companies are not coherent entities. They’re bureaucracies with many different internal people/groups with different roles, who may not be well coordinated with each other. So even if you really don’t like their media policy, that doesn’t tell you that much about other things.
The people you deal with for questions like “can I talk to the media” are not supposed to be figuring out for themselves if some safety thing is a big enough deal for the world that letting people talk about it is good. Instead, their job is roughly to push forward some set of PR/image goals for the company, while minimising PR risk. There are more senior people who might make a judgement call like that, but those people are incredibly busy, and you need a good reason to escalate up to them.
For a theory of change like influencing the company to be better, you will be interacting with totally different groups of people, who may not be that correlated: there are people involved in the technical parts of the AGI creation pipeline whom I want to adopt safer techniques, or to let us practice AGI-relevant techniques; there are senior decision makers who you want to ensure make the right call in high stakes situations, or push for one strategic choice over another; there are the people in charge of what policy positions to advocate for; there are the security people; etc. Obviously the correlation is non-zero, and the opinions and actions of people like the CEO affect all of this, but there’s also a lot of noise, inertia and randomness, and facts about one part of the system can’t be assumed to generalise to the others. Unless senior figures are paying attention, specific parts of the system can drift pretty far from what they’d endorse, especially if the endorsed opinion is unusual or takes thought/agency to reach (I would consider your points about safety washing etc to be in this category). But when inside, you can build a richer picture of which parts of the bureaucracy are tractable to try to influence.
I agree that large companies are likely incoherent in this way; that’s what I was addressing in my follow-on comment :-). (Short version: I think getting a warning and then pressing the issue is a great way to press the company for consistency on this (important!) issue, and I think that it matters whether the company coheres around “oh yeah, you’re right, that is okay” vs whether it coheres around “nope, we do alignmentwashing here”.)
With regards to whether senior figures are paying attention: my guess is that if a good chunk of alignment researchers (including high-profile ones such as yourself) are legitimately worried about alignmentwashing and legitimately considering doing your work elsewhere (and insofar as you prefer telling the media if that happens, not as a threat but because informing the public is the right thing to do), then, if it comes to that extremity, I think companies are pretty likely to get the senior figures involved. And I think that if you act in a reasonable, sensible, high-integrity way throughout the process, you’re pretty likely to have pretty good effects on the internal culture (either by leaving or by causing the internal policy to change in a visible way that makes it much easier for researchers to speak about this stuff).
FWIW I used to agree with you but now agree with Nate. A big part of the update was developing a model of how “PR risks” work via a kind of herd mentality, where very few people are actually acting on their object-level beliefs, and almost everyone is just tracking what everyone else is tracking.
In such a setting, “internal influence” strategies tend to do very little long-term, and maybe even reinforce the taboo against talking honestly. This is roughly what seems to have happened in DC, where the internal influence approach was swept away by a big Overton window shift after ChatGPT. Conversely, a few principled individuals can have a big influence by speaking honestly (here’s a post about the game theory behind this).
In my own case, I felt a vague miasma of fear around talking publicly while at OpenAI (and to a lesser extent at DeepMind), even though in hindsight there were often no concrete things that I endorsed being afraid of—for example, there was a period where I was roughly indifferent about leaving OpenAI, but still scared of doing things that might make people mad enough to fire me.
I expect that there’s a significant inferential gap between us, so this is a hard point to convey, but one way that I might have been able to bootstrap my current perspective from inside my “internal influence” frame would have been to try to identify possible actions X such that, if I got fired for doing X, this would be a clear example of the company leaders behaving unjustly. Then even the possible “punishment” for doing X is actually a win.
I guess speaking out publicly just seems like a weird distraction to me. Most safety people don’t have a public profile! None of their capabilities colleagues are tracking whether they have or have not expressed specific opinions publicly. Some do have a public profile, but it doesn’t feel like you’re exclusively targeting them. And eg if someone is in company-wide Slack channels leaving comments about their true views, I think that’s highly visible and achieves the same benefits of talking honestly, with fewer risks.
I’m not concerned about someone being fired for this kind of thing, that would be pretty unwise on the labs’ part as you risk creating a martyr. Rather, I’m concerned about eg senior figures thinking worse of safety researchers as a whole because it causes a PR headache, eg viewing them as radical troublemakers, and this making theories of impact around influencing specific senior decision makers harder (and I’m more optimistic about those, personally)
Rather, I’m concerned about eg senior figures thinking worse of safety researchers as a whole because it causes a PR headache, eg viewing them as radical troublemakers, and this making theories of impact around influencing specific senior decision makers harder (and I’m more optimistic about those, personally)
Thank you, Neel, for stating this explicitly. I think this is very valuable information. It also matches what some of my friends have told me privately. I would appreciate it a lot if you could give a rough estimate of your confidence that this would happen (ideally some probability/percentage). Additionally, I would appreciate it if you could say whether you’d expect such a consequence to be legible/visible or illegible (once it had happened). Finally, are there legible reasons you could share for your estimated credence that this would happen?
(to be clear: I am sad that you are operating under such conditions. I consider this evidence against expecting meaningful impact from the inside at your lab.)
It’s not a binary event—I’m sure it’s already happened somewhat. OpenAI has had what, 3 different safety exoduses by now, and (what was perceived to be) an attempted coup? I’m sure leadership at other labs have noticed. But it’s a matter of degree.
I also don’t think this should be particularly surprising; this is just how I expect decision makers at any organisation that cares about its image to behave, unless it’s highly unusual. Even if the company decides to loudly sound the alarm, they likely want to carefully choose the messaging and go through their official channels, not have employees maybe going rogue and ruining message discipline. (There are advantages to the grassroots vibe in certain situations though.) To be clear, I’m not talking about “would take significant retaliation”; I’m talking about “would prefer that employees didn’t, even if it won’t actually stop them”.
This sounds to me like there would actually be specific opportunities to express some of your true beliefs that you wouldn’t worry would cost you a lot (and some other opportunities where you would worry and not do them). Would you agree with that?
(optional: my other comment is more important imo)
I’m not concerned about someone being fired for this kind of thing, that would be pretty unwise on the labs’ part as you risk creating a martyr
I think you ascribe too much competence/foresight/focus/care to the labs. I’d be willing to bet that multiple (safety?) people have been fired from labs in a way that would make the lab look pretty bad. Labs make tactical mistakes sometimes. Wasn’t there a thing at OpenAI for instance (lol)? Of course it is possible(/probable?) that they would not fire in a given case due to sufficient “wisdom”, but we should not assign an extreme likelihood to that.
Yeah, agreed that companies sometimes do dumb things, and I think this is more likely at less bureaucratic and more top down places like OpenAI—I do think Leopold went pretty badly for them though, and they’ve hopefully updated. I’m partly less concerned because there’s a lot of upside if the company makes a big screw up like that.
This is roughly what seems to have happened in DC, where the internal influence approach was swept away by a big Overton window shift after ChatGPT.
In what sense was the internal influence approach “swept away”?
Also, it feels pretty salient to me that the ChatGPT shift was triggered by public, accessible empirical demonstrations of capabilities being high (and by the social impacts of that). So in my mind that provides evidence for “groups change their mind in response to certain kinds of empirical evidence” and doesn’t really provide evidence for “groups change their mind in response to a few brave people saying what they believe and changing the Overton window”.
If the conversation changed a lot causally downstream of the CAIS extinction letter or the FLI pause letter, that would be better evidence for your position (though it would also be consistent with a model that puts less weight on preference cascades and models the impact more like “policymakers weren’t aware that lots of experts were concerned, and this letter communicated that experts were concerned”). I don’t know to what extent this was true. (Though I liked the CAIS extinction letter a lot and certainly believe it had a good amount of impact; I just don’t know how much.)