Neel Nanda comments on evhub’s Shortform

Neel Nanda 13 Aug 2025 0:51 UTC
7 points
4

As as aside, surely this a central example of safetywashing?

No I think you just disagree with them
- Joseph Miller 13 Aug 2025 9:21 UTC
  4 points
  −1
  Parent
  The context of my comment was responding to Thomas who seemed to be saying “even if we take as a premise that this is not safety work, I’m still much more concerned about safetywashing”.
  Edit: I guess what you meant is that safetywashing implies malicious intent, where there was none? In which case, “accidential safetywashing” might be a better term.
  - Thomas Kwa 14 Aug 2025 9:48 UTC
    1 point
    −12
    Parent
    One definition of “safetywashing” is, by analogy with greenwashing, a situation where
    (1) The desire for good PR drives companies to advertise their safety. This leads to no research, fake research, or research that is safety-themed but where few people involved actually care about x-risk, resulting in minimal expected safety impact.
    Hendrycks has an alternate definition from a 2024 NeurIPS paper, where
    (2) a benchmark is safety-washed if it is advertised as a safety benchmark but highly correlated with general capabilities, and thus mostly incentivizes devs to improve general capabilities rather than safety. (Incidentally TruthfulQA is one of these, but this doesn’t make all papers using it safetywashed)
    Under neither definition is the Anthropic fellows program safetywashing. (1) is not true because at least ⁴⁄₆ (I would argue around 5.25/6) of the research is actually targeted at safety, mentors care about safety, and there is likely positive expected impact. I will also note that in definition (1) at least, intent is necessary to the classification of safetywashing, so “accidental safetywashing” is about as much a non sequitur as “accidental first-degree murder”.
    (2) is not true even if you expand the definition to include advertising capabilities research as safety, because the one dual use paper still has safety applications and is advertised as capabilities! My main concerns with (2) are the dozens of benchmarks and papers since 2023 that use “alignment” as a buzzword, and attempts by companies to portray as safety techniques that would be necessary for their product anyway. This is why I think Ethan Perez sometimes mentoring dual-use papers or papers that end up making non-safety observations in the larger context of safety projects is neither a good example of safetywashing nor an important concern.
    - habryka 14 Aug 2025 17:50 UTC
      9 points
      10
      Parent
      I will also note that in definition (1) at least, intent is necessary to the classification of safetywashing, so “accidental safetywashing” is about as much a non sequitur as “accidental first-degree murder”.
      No, this is obviously not the case! You do not need conscious intent to do greenwashing! Where on earth did you get that definition from? Most greenwashing or safety washing or other kinds of harmful PR-management do not involve conscious optimization and obvious scheming.
  - Neel Nanda 13 Aug 2025 11:04 UTC
    −4 points
    −9
    Parent
    Why not just call it “bad safety research”? Again, you just seem to disagree with them on theory of change. This is fine. But I consider terms like safety washing to be more of a personal attack on the researchers involved and their motivations, in a way that I’m fairly confident is false
    - habryka 13 Aug 2025 20:37 UTC
      22 points
      14
      Parent
      But I consider terms like safety washing to be more of a personal attack on the researchers involved and their motivations
      We are talking about billions of dollars of incentives. In any other industry, You and I would obviously be suspicious of “safety researchers” working on “cigarette safety” at a major tobacco company.
      I haven’t looked into the work above, so I am not saying anything about these specific researchers, but I really very strongly oppose the social framing where trying to consider the hypothesis that a safety researcher at a major AI capability company is actually driven by actions better explained by “following the billions, maybe trillions, of dollars of economic incentive towards building more capable AI” than by things related to risks from AI, is framed as a “personal attack” with an associated substantially higher burden of proof than for other arguments.
      Obviously there are lots of reasons for people to equivocate safety research and capabilities research. It’s not a weird hypothesis to consider, and indeed we need to be able to routinely talk about it and evaluate it as a live hypothesis if we want to have any chance of preventing it. I am glad you are presenting your evidence here, but I absolutely do not think it’s appropriate to try to frame people considering these motivations, or arguing for them with the connotation that comes from “personal attacks”.
      - Neel Nanda 14 Aug 2025 8:23 UTC
        2 points
        −25
        Parent
        Maybe this is just a semantic disagreement? I don’t see any reasonable definition where saying that eg a tobacco company scientist does their research because of a commercial bias and therefore suppresses evidence that tobacco causes harm, could be considered anything other than a personal attack. I also think this is fine. I’m totally fine with people making personal attacks if they have good reason to think those are true, and I will happily make personal attacks on tobacco scientists.
        
        But I personally know a bunch of the people on those papers, am very confident those people sincerely care about safety and are trying to do the right thing as they see it (but could totally be incorrect about what that means), and would personally guess that the two papers Joseph is criticising are net positive for safety. I will criticise personal attacks in that context, as they are not only based on high confidence in a factual claim I disagree with (that those papers are net bad), but also a much stronger claim that it’s so obvious that those papers are bad that this should be obvious to the authors and the only reason someone might work on them is warped motivations. I find this a fairly arrogant perspective, personally, and do not think it is justified by the available evidence.
        
        I also disagree with your tobacco analogy because as I understand it, tobacco researchers tend not to have entered the field via a cigarette safety community, often being in that community for several years beforehand, choosing to be in said community because they are concerned about the risks cigarettes pose society, and join a team where many of their colleagues also come from the same cigarette safety community. If that were the case and those people then started putting out a bunch of papers about how cigarettes were good, there’s clearly some chance that they all became corrupted/were always in the cigarette safety community for bad reasons/are under duress or strong incentive to not get fired and lose their millions of dollars of equity, and are intentionally safety washing. But I would also put a pretty high probability they were just wrong: possibly because they were in some weird culty environment that warped their beliefs, or are in a weird bubble, or see distorted data, or other subtle biases get absorbed from their environment. But to me that’s just another way of being wrong, and I generally think it’s a simpler hypothesis than intentional safety washing. Further, if these tobacco researchers were intentionally safety washing, you’d expect at least some fraction to be true believers who would quit and publicly speak up about all of the intentional safety washing that’s happening—to my knowledge this has not happened at Anthropic, including privately?
        
        It’s not that I think it it requires a higher standard of evidence because it is a rude thing to say—I’m fairly pro saying correct and rude things. I just think it is a meaningfully more complex hypothesis and as such should have a complexity penalty in your distribution and require additional arguments for why something more complicated than “this person is wrong” is required. But, naturally, I am more likely to call out people who make bad arguments in a way that involves being rude to people than people who just make bad arguments in the abstract
        
        If someone wants to say, eg, “this researcher is probably wrong and believes their work is net good, but are biased for various reasons, but I think there’s a 10% chance they’re intentionally safety washing” then I would push back much less hard, and consider that much less of a personal attack. I agree it is reasonable for it to be in your hypothesis space!
        habryka 14 Aug 2025 17:38 UTC
        20 points
        16
        Parent
        I mostly feel like you keep doing the thing where you try to frame this as a mixture of “rude” and “higher burden of proof”. If you use words like “personal attack” you are implicitly invoking a social norm against making “personal attacks”. If you want to not invoke that norm (which I argue is inappropriate in this situation) then you have to use other words.
        Similarly, calling these things “rude” is obviously invoking a norm with a higher burden of proof for these things. This I again argue is inappropriate. Use different words if you want to actually not invoke a higher burden of proof for those hypotheses.
        It’s just really obviously the case that we should view research output from a leading capability lab on safety with a decent dose of suspicion. Considering that hypothesis is not a central example of a “personal attack” or “rudeness”.
        you’d expect at least some fraction to be true believers who would quit and publicly speak up about all of the intentional safety washing that’s happening - to my knowledge this has not happened at Anthropic, including privately?
        I have talked to people who worked closely with Anthropic who I think would describe the situation as such. They do not do so publicly because among other things Anthropic asked people to sign NDAs that make it illegal for people to say those things (which Anthropic has said they aren’t planning to enforce, but you know, it sure signals a willingness to attack anyone who does so).
        In general, people not making public comments about an organization or group is very little evidence. Much more egregious behavior has gone unnoticed or uncomplained about (for example, we saw very limited public comments about FTX, despite widespread knowledge among ex-employees and FTX-collaborators that they were being very reckless with risk).
        I also disagree with your tobacco analogy because as I understand it, tobacco researchers tend not to have entered the field via a cigarette safety community, often being in that community for several years beforehand, choosing to be in said community because they are concerned about the risks cigarettes pose society, and join a team where many of their colleagues also come from the same cigarette safety community
        That does not accurately describe most people who go through Anthropic’s fellowship program (it might happen to describe the specific authors, I again haven’t looked into these specific people). My sense is most people who go through that program have not spent many years in the safety community. It’s definitely not true for MATS these days, on average, and my guess is it’s not more true for the Anthropic fellowship.
        I also separately don’t think it’s that much evidence. Any tobacco safety researchers probably also entered the field aiming to be upstanding and honest statisticians, undergoing multiple years of training from high-integrity professors extolling the virtues of being honest and not lying with statistics. This doesn’t offer that much protection.
        But I would also put a pretty high probability they were just wrong: possibly because they were in some weird culty environment that warped their beliefs, or are in a weird bubble, or see distorted data, or other subtle biases get absorbed from their environment. But to me that’s just another way of being wrong, and I generally think it’s a simpler hypothesis than intentional safety washing.
        This is all a terrible strawman. Nobody here has accused anyone of “intentional safety washing”. The statisticians at the tobacco company almost certainly found some way of rationalizing their behavior as good or honest. I am confident if you find interviews or their personal notes you would find them not scheming in obvious and evil ways about how to deceive the public, bar maybe a few high-profile cases at the top. The hypothesis here and now and always is that people are being part of safety washing largely subconsciously.
        In general I am extremely tired of the eternal back and forth where any accusation of deception gets twisted into an accusation of “obviously evil and conscious scheming”. No, that’s not what’s going on, and as far as I can tell in this thread no one has invoked such a hypothesis. Most evil in the world gets done by people rationalizing what they are doing. If safety washing is going on in these programs, it would almost certainly not be done with that much explicitness (though occasionally people would likely do some explicit optimization, but in ways that are easy to forget, rationalize or hide).
        Neel Nanda 14 Aug 2025 23:53 UTC
        4 points
        2
        Parent
        Wow, that is NOT what I thought people used the word safety washing to mean. Thanks for clarifying. To confirm I understand, are you using it to mean “people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative”?
        
        Also, to clarify, I’m referring to the supervisors of the works not the fellows when I’m referring to the character of the people mattering. I don’t necessarily expect junior people keen to get a job to stand up to a more experienced mentor, even if they disagree
        habryka 15 Aug 2025 0:57 UTC
        8 points
        10
        Parent
        “people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative”?
        No, not quite. I think the better characterization is something like “people trying to do safety research at an AGI company in a way that is substantially skewed by the economical and reputational incentives of the organization they work at, such that that work ends up either making reasoning errors heavily correlated with ways that are beneficial to the economic or reputational resources of the company they work at, or where the work overall seems better explained by some (potentially subconscious, potentially distributed) optimization process that is optimizing according to these incentives, than a straightforward and neutral concern about risk”.
        I think this is generally hard to judge, as human self-deception and rationalization reaches deep.
        the gears to ascension 15 Aug 2025 2:01 UTC
        2 points
        0
        Parent
        I think his concern here is one that is likely shared by a lot of the people one hopes to describe by the term “safetywashing”; perhaps a term that makes it more explicit and direct that the phenomenon is typically accidental would make it easier to refer to without people having immune reactions? safetywishing /hj
        Ben Pace 15 Aug 2025 3:49 UTC
        11 points
        6
        Parent
        In my model, the thing that people are having immune reactions to to is the idea that they’re morally accountable for safetywashing. Part of why the intent point comes up is in order to argue that they’re not morally culpable because they didn’t intend it. But in this case safetywashing is still something worth holding people morally accountable for even if they did not intend it, similar to how providing a fig leaf of safety research for a tobacco company is something that the person who does it is morally responsible for (even if they’ve successfully rationalized that their behavior is fine).
        In other words, my model is that “accidental safetywashing” as a term will only not face an immune reaction if it comes along with the context “and all this safety washing of risky engineering is totally accidental and therefore people who did it are not to blame”. But if it’s still considered about as morally wrong it will still receive an immune reaction, and it’s also a more cumbersome term that makes an unrelated claim that people like Habryka aren’t trying to make a claim about (i.e. mental state) and could well be false in some instances.
        the gears to ascension 15 Aug 2025 3:58 UTC
        2 points
        0
        Parent
        right, the idea being that a narrative that includes their perspective would not say “and then you decided to have The Blameworthy Thoughts”, it would look like “and then you lost track of considerations about what it takes to succeed at the medium term problem because the feeling of short term success was attractive”. I generally don’t think that guilting-based judgement is going to be effective at communicating or changing actions. the bargaining scenario we’re in doesn’t seem like one that permits moral judgement and blame as a useful move, that only works when someone is open to not having an immune response to it. so if they’re having immune responses, then you need to talk to them in a way that doesn’t try to override their agency with social pressure. if they in fact have explicitly endorsed bad intentions, then none of what I’m saying is relevant. if someone decides that what they’re doing is bad by their own values, they tend to update much more than if it’s just based on someone else RLHFing them.
        Expand this thread
        Ben Pace 15 Aug 2025 4:00 UTC
        4 points
        0
        Parent
        I admit I’m not sure how to read this as something other than a rejection of holding people morally accountable.
        the gears to ascension 15 Aug 2025 5:04 UTC
        2 points
        0
        Parent
        this is getting off topic, but I’ll give one more reply:
        
        “holding accountable” in the sense of “sending them frowny faces to make them feel bad about their behavior” doesn’t seem like a consequentially effective frame, yeah. if someone is doing bad things, you take actions to stop them; if they’re convinceable and just aren’t considering that the things might be bad, you show them the impact, and then they feel bad when they comprehend. if they’re not convinceable, you aren’t going to get through to them by frowning at them. claiming moral failure is a “send you frowny face” move.
        
        like—let’s say you bumped into someone on a crowded subway car, because you weren’t paying attention. I saw it, and know that you just smashed their food. you don’t know that. If I come up to you, someone who cares about not smashing people’s food for no reason, and say “I’m here to hold you accountable for what you’ve done wrong”, you’ll go ”...what the heck?” but if I say “hey, you smashed their food,” you’ll turn around and be surprised and unhappy about this, not least because now you also know you need to clean up how it got on you, as well as the moral impacts.
        
        if you go around smashing people’s food regularly, and repeatedly say you’re just not paying attention, then it might be that me telling you you suck is an effective intervention, but it still seems unlikely; either you’re doing it on purpose, in which case an immune response at you is needed, or you’ve got some sort of failure-to-update thing happening, in which case I need to help you look through what happens when you’re about to make the mistake and find a new behavior there (a deeper version of explaining it). if it’s because you’re getting paid to not look around you on the subway, then it might be hard to get you to consider the tradeoff, but if you value not knocking people’s food over, I might still be able to get through.
        the gears to ascension 15 Aug 2025 1:25 UTC
        2 points
        0
        Parent
        I appreciate you spelling this out. I think having the concept of personal attacks here is somewhat of a distraction but is one that will be present for almost everyone, and so having someone be willing to explicitly say that this is their reaction seems helpful to me for the purpose of bringing it into explicit context that people who are doing this work are humans and thus sensitive to the various dimensions of respect and other dimensions of social evaluation of them as a person, as separate from their work. Even though I think this sensitivity is hard to avoid, I also think it can cause serious problems when someone is in fact motivated for the wrong reason and thus can’t be convinced by treating them as making a correctable mistake.
        
        I actually am still considering applying to this program, but I’m leaning towards the conclusion that all the ideas I have for what to do in prosaic safety which are short term tractable would be net negative. I continue to think that there’s a unilateralist’s curse thing happening, where folks who are willing to do the short term research are selected for being ones who don’t have a detailed enough picture of the medium starkly-superintelligent-system-alignment research problem term to realize when they’re doing something counterproductive. That’s the thing that I mostly think is happening here. I noticed a lot of urge to go do work that would be respected when deciding whether to apply, but decided I’d rather not bet on my ideas being insufficiently capabilities-enhancing to undo the tenuous alignment benefit I think they’d provide.
    - Joseph Miller 13 Aug 2025 11:58 UTC
      4 points
      −1
      Parent
      Why not just call it “bad safety research”?
      I was responding to Thomas’s claim that it was not (accidental) safetywashing.
      But yeah, I’m not trying to attack their motivations. Updated my previous comment to clarify that.