I mostly feel like you keep doing the thing where you try to frame this as a mixture of “rude” and “higher burden of proof”. If you use words like “personal attack” you are implicitly invoking a social norm against making “personal attacks”. If you want to not invoke that norm (which I argue is inappropriate in this situation) then you have to use other words.
Similarly, calling these things “rude” obviously invokes a norm that imposes a higher burden of proof on them. This, I again argue, is inappropriate. Use different words if you actually want to avoid invoking a higher burden of proof for those hypotheses.
It’s just really obviously the case that we should view research output from a leading capability lab on safety with a decent dose of suspicion. Considering that hypothesis is not a central example of a “personal attack” or “rudeness”.
you’d expect at least some fraction to be true believers who would quit and publicly speak up about all of the intentional safety washing that’s happening - to my knowledge this has not happened at Anthropic, including privately?
I have talked to people who worked closely with Anthropic who I think would describe the situation that way. They do not do so publicly because, among other things, Anthropic asked people to sign NDAs that make it illegal to say those things (which Anthropic has said it isn’t planning to enforce, but, you know, it sure signals a willingness to attack anyone who does).
In general, people not making public comments about an organization or group is very little evidence. Much more egregious behavior has gone unnoticed or without public complaint (for example, we saw very limited public comment about FTX, despite widespread knowledge among ex-employees and FTX collaborators that they were being very reckless with risk).
I also disagree with your tobacco analogy, because as I understand it, tobacco researchers tended not to enter the field via a cigarette safety community: spending several years in that community beforehand, choosing to be in it out of concern about the risks cigarettes pose to society, and then joining a team where many of their colleagues come from that same community.
That does not accurately describe most people who go through Anthropic’s fellowship program (it might happen to describe these specific authors; again, I haven’t looked into these specific people). My sense is that most people who go through that program have not spent many years in the safety community. It’s definitely not true for MATS these days, on average, and my guess is it’s not more true for the Anthropic fellowship.
I also separately don’t think it’s that much evidence. The tobacco researchers probably also entered the field aiming to be upstanding and honest statisticians, after multiple years of training from high-integrity professors extolling the virtues of honesty and of not lying with statistics. This doesn’t offer that much protection.
But I would also put a pretty high probability on them being just wrong: possibly because they are in some weird culty environment that warps their beliefs, or are in a weird bubble, or see distorted data, or absorb other subtle biases from their environment. But to me that’s just another way of being wrong, and I generally think it’s a simpler hypothesis than intentional safety washing.
This is all a terrible strawman. Nobody here has accused anyone of “intentional safety washing”. The statisticians at the tobacco company almost certainly found some way of rationalizing their behavior as good or honest. I am confident that if you found their interviews or personal notes, you would not find them scheming in obvious and evil ways about how to deceive the public, bar maybe a few high-profile cases at the top. The hypothesis here, now and always, is that people participate in safety washing largely subconsciously.
In general I am extremely tired of the eternal back and forth where any accusation of deception gets twisted into an accusation of “obviously evil and conscious scheming”. No, that’s not what’s going on, and as far as I can tell in this thread no one has invoked such a hypothesis. Most evil in the world gets done by people rationalizing what they are doing. If safety washing is going on in these programs, it would almost certainly not be done with that much explicitness (though occasionally people would likely do some explicit optimization, but in ways that are easy to forget, rationalize or hide).
Wow, that is NOT what I thought people used the word safety washing to mean. Thanks for clarifying. To confirm I understand, are you using it to mean “people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative”?
Also, to clarify: when I refer to the character of the people mattering, I’m referring to the supervisors of the work, not the fellows. I don’t necessarily expect junior people keen to get a job to stand up to a more experienced mentor, even if they disagree.
“people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative”?
No, not quite. I think the better characterization is something like: “people trying to do safety research at an AGI company in a way that is substantially skewed by the economic and reputational incentives of the organization they work at, such that the work either makes reasoning errors heavily correlated with what benefits the company’s economic or reputational resources, or is overall better explained by some (potentially subconscious, potentially distributed) optimization process operating according to those incentives than by a straightforward and neutral concern about risk”.
I think this is generally hard to judge, as human self-deception and rationalization reaches deep.
I think his concern here is one likely shared by a lot of the people the term “safetywashing” is meant to describe; perhaps a term that makes it more explicit and direct that the phenomenon is typically accidental would make it easier to refer to without people having immune reactions? safetywishing /hj
In my model, the thing people are having immune reactions to is the idea that they’re morally accountable for safetywashing. Part of why the intent point comes up is to argue that they’re not morally culpable because they didn’t intend it. But safetywashing is still something worth holding people morally accountable for even if they did not intend it, similar to how providing a fig leaf of safety research for a tobacco company is something the person who does it is morally responsible for (even if they’ve successfully rationalized that their behavior is fine).
In other words, my model is that “accidental safetywashing” as a term will avoid an immune reaction only if it comes along with the context “and all this safety washing of risky engineering is totally accidental, and therefore the people who did it are not to blame”. If it’s still considered about as morally wrong, it will still receive an immune reaction. It’s also a more cumbersome term that makes a claim about something (i.e. mental state) that people like Habryka aren’t trying to make claims about, and which could well be false in some instances.
right, the idea being that a narrative that includes their perspective would not say “and then you decided to have The Blameworthy Thoughts”, it would look like “and then you lost track of considerations about what it takes to succeed at the medium term problem because the feeling of short term success was attractive”. I generally don’t think that guilting-based judgement is going to be effective at communicating or changing actions. the bargaining scenario we’re in doesn’t seem like one that permits moral judgement and blame as a useful move, that only works when someone is open to not having an immune response to it. so if they’re having immune responses, then you need to talk to them in a way that doesn’t try to override their agency with social pressure. if they in fact have explicitly endorsed bad intentions, then none of what I’m saying is relevant. if someone decides that what they’re doing is bad by their own values, they tend to update much more than if it’s just based on someone else RLHFing them.
this is getting off topic, but I’ll give one more reply:
“holding accountable” in the sense of “sending them frowny faces to make them feel bad about their behavior” doesn’t seem like a consequentially effective frame, yeah. if someone is doing bad things, you take actions to stop them; if they’re convinceable and just aren’t considering that the things might be bad, you show them the impact, and then they feel bad when they comprehend. if they’re not convinceable, you aren’t going to get through to them by frowning at them. claiming moral failure is a “send you frowny face” move.
like—let’s say you bumped into someone on a crowded subway car, because you weren’t paying attention. I saw it, and know that you just smashed their food. you don’t know that. If I come up to you, someone who cares about not smashing people’s food for no reason, and say “I’m here to hold you accountable for what you’ve done wrong”, you’ll go “...what the heck?” but if I say “hey, you smashed their food,” you’ll turn around and be surprised and unhappy about this, not least because now you also know you need to clean up how it got on you, as well as the moral impacts.
if you go around smashing people’s food regularly, and repeatedly say you’re just not paying attention, then it might be that me telling you you suck is an effective intervention, but it still seems unlikely; either you’re doing it on purpose, in which case an immune response at you is needed, or you’ve got some sort of failure-to-update thing happening, in which case I need to help you look through what happens when you’re about to make the mistake and find a new behavior there (a deeper version of explaining it). if it’s because you’re getting paid to not look around you on the subway, then it might be hard to get you to consider the tradeoff, but if you value not knocking people’s food over, I might still be able to get through.
I admit I’m not sure how to read this as something other than a rejection of holding people morally accountable.