The thing I’m imagining is more like mentioning, almost as an aside, in a friendly tone, that ofc you think the whole situation is ridiculous and that stopping would be better (before & after having whatever other convo you were gonna have about technical alignment ideas or w/e). In a sort of “Carthago delenda est” fashion.
I agree that a host company could reasonably get annoyed if their researchers went on many different podcasts to talk for two hours about how the whole industry is sick. But if casually reminding people “the status quo is insane and we should do something else” at the beginning/end is a fireable offense, in a world where lab heads & Turing Award winners & Nobel laureate godfathers of the field are saying this is all ridiculously dangerous, then I think that’s real sketchy and that contributing to a lab like that is substantially worse than the next best opportunity. (And similarly if it’s an offense that gets you sidelined or disempowered inside the company, even if not exactly fired.)
Ah, that’s not the fireable offence. Rather, my model is that doing that means you (probably?) stop getting permission to do media stuff. And doing media stuff after being told not to is the potentially fireable offence. Which to me is pretty different from being fired specifically because of the beliefs you expressed. The actual process would probably be more complex, e.g. maybe you just get advised not to do it again the first time, and you might be able to get away with more subtle or obscure things, but I feel like this only matters if people notice.
Thanks for the clarification. Yeah, from my perspective, if casually mentioning that you agree with the top scientists & lab heads & many many researchers that this whole situation is crazy causes your host company to revoke your permission to talk about your research publicly (maybe after a warning), then my take is that that’s really sketchy and that contributing to a lab like that is probably substantially worse than your next best opportunity (e.g. b/c it sounds like you’re engaging in alignmentwashing and b/c your next best opportunity seems like it can’t be much worse in terms of direct research).
(I acknowledge that there’s room to disagree about whether the second-order effect of safetywashing is outweighed by the second-order effect of having people who care about certain issues existing at the company at all. A very quick gloss of my take there: I think that if the company is preventing you from publicly acknowledging commonly-understood-among-experts key features of the situation, in a scenario where the world is desperately hurting for policymakers and lay people to understand those key features, I’m extra skeptical that you’ll be able to reap the imagined benefits of being a “person on the inside”.)
I acknowledge that there are analogous situations where a company would be right to feel annoyed, e.g. if someone were casually bringing up their distantly-related political stances in every podcast. I think that this situation is importantly disanalogous, because (a) many of the most eminent figures in the field are talking about the danger here; and (b) alignment research is used as a primary motivating excuse for why the incredibly risky work should be allowed to continue. There’s a sense in which the complicity of alignment researchers is a key enabling factor for the race; if all alignment researchers resigned en masse citing the insanity of the race, then policymakers would be much more likely to go “wait, what the heck?” In a situation like that, I think the implicit approval of alignment researchers is not something to be traded away lightly.
For what it’s worth, I think it’s pretty likely that the bureaucratic processes at (e.g.) Google haven’t noticed that acknowledging that the race to superintelligence is insane is different in kind from (e.g.) talking about the climate impacts of datacenters, and I wouldn’t be surprised if (e.g.) Google issued one of their researchers a warning the first time they mentioned it, not out of deliberate sketchiness but just out of bureaucratic habit. My guess is that that’d be a great opportunity to push back, spell out the reason why the cases are different, and see whether the company lives up to its alleged principles or codifies its alignmentwashing practices. If you have the opportunity to spur that conversation, I think that’d be real cool of you; I think there’s a decent chance it would spark a bunch of good internal cultural change, and also a decent chance that it would make the issues with staying at the lab much clearer (both internally, and to the public if a news story came of it).
Separate point: even if the existence of alignment research is a key part of how companies justify their existence and continued work, I don’t think all of the alignment researchers quitting would be that catastrophic to this, because what appears to be alignment research to a policymaker is a pretty malleable thing. Large fractions of current post-training are fundamentally about how to get the model to do what you want when that’s hard to specify: e.g. how to do reasoning-model training with harder-to-verify rewards, avoiding reward hacking, avoiding sycophancy, etc. Most people working on these things aren’t thinking too much about AGI safety and would not quit, but their work could easily be sold to policymakers as alignment work. (And I do personally think the work is somewhat relevant, though far from the most important thing and not sufficient, but this isn’t a crux.)
All researchers quitting en masse and publicly speaking out seems impactful for whistleblowing reasons, of course, but even there I’m not sure how much it would actually do, especially in the current political climate.
I still feel like you’re making much stronger updates on this than I think you should. A big part of my model here is that large companies are not coherent entities. They’re bureaucracies made up of many different internal people/groups with different roles, who may not be that coordinated with each other. So even if you really don’t like their media policy, that doesn’t tell you that much about other things.
The people you deal with for questions like “can I talk to the media?” are not supposed to be figuring out for themselves whether some safety issue is a big enough deal for the world that letting people talk about it is good. Instead, their job is roughly to push forward some set of PR/image goals for the company while minimising PR risk. There are more senior people who might make a judgement call like that, but those people are incredibly busy, and you need a good reason to escalate up to them.
For a theory of change like influencing the company to be better, you will be interacting with totally different groups of people, who may not be that correlated: there are people involved in the technical parts of the AGI creation pipeline who I want to adopt safer techniques or let us practice AGI-relevant techniques; there are senior decision makers who you want to ensure make the right call in high-stakes situations, or push for one strategic choice over another; there are the people in charge of what policy positions to advocate for; there are the security people; etc. Obviously the correlation is non-zero, and the opinions and actions of people like the CEO affect all of this, but there’s also a lot of noise, inertia and randomness, and facts about one part of the system can’t be assumed to generalise to the others. Unless senior figures are paying attention, specific parts of the system can drift pretty far from what they’d endorse, especially if the endorsed opinion is unusual or takes thought/agency to arrive at (I would consider your points about safetywashing etc. here to be in this category). But from the inside you can build a richer picture of which parts of the bureaucracy are tractable to influence.
I agree that large companies are likely incoherent in this way; that’s what I was addressing in my follow-on comment :-). (Short version: I think getting a warning and then pressing the issue is a great way to push the company toward consistency on this (important!) issue, and I think it matters whether the company coheres around “oh yeah, you’re right, that is okay” vs. whether it coheres around “nope, we do alignmentwashing here”.)
With regards to whether senior figures are paying attention: my guess is that if a good chunk of alignment researchers (including high-profile ones such as yourself) are legitimately worried about alignmentwashing and legitimately considering doing your work elsewhere (and insofar as you’d prefer telling the media if that happens, not as a threat but because informing the public is the right thing to do), then, if it comes to that extremity, I think companies are pretty likely to get the senior figures involved. And I think that if you act in a reasonable, sensible, high-integrity way throughout the process, you’re pretty likely to have good effects on the internal culture (either by leaving or by causing the internal policy to change in a visible way that makes it much easier for researchers to speak about this stuff).