“people trying to do safety research at an AGI company, whose work is useless or net negative, such that if said people were working outside the company and trying to do safety work, they would do work that was not net negative”?
No, not quite. I think the better characterization is something like “people trying to do safety research at an AGI company in a way that is substantially skewed by the economic and reputational incentives of the organization they work at, such that the work either makes reasoning errors heavily correlated with what benefits the economic or reputational position of the company they work at, or seems overall better explained by some (potentially subconscious, potentially distributed) optimization process that is optimizing according to these incentives than by a straightforward and neutral concern about risk”.
I think this is generally hard to judge, as human self-deception and rationalization reach deep.
I think his concern here is one that is likely shared by a lot of the people one hopes to describe by the term “safetywashing”; perhaps a term that makes it more explicit and direct that the phenomenon is typically accidental would make it easier to refer to without people having immune reactions? safetywishing /hj
In my model, the thing that people are having immune reactions to is the idea that they’re morally accountable for safetywashing. Part of why the intent point comes up is in order to argue that they’re not morally culpable because they didn’t intend it. But in this case safetywashing is still something worth holding people morally accountable for even if they did not intend it, similar to how providing a fig leaf of safety research for a tobacco company is something that the person who does it is morally responsible for (even if they’ve successfully rationalized that their behavior is fine).
In other words, my model is that “accidental safetywashing” as a term will only avoid an immune reaction if it comes along with the context “and all this safetywashing of risky engineering is totally accidental and therefore people who did it are not to blame”. But if it’s still considered about as morally wrong, it will still receive an immune reaction. It’s also a more cumbersome term, and it makes an unrelated claim (about mental state) that people like Habryka aren’t trying to make and that could well be false in some instances.
right, the idea being that a narrative that includes their perspective would not say “and then you decided to have The Blameworthy Thoughts”; it would look like “and then you lost track of considerations about what it takes to succeed at the medium term problem because the feeling of short term success was attractive”. I generally don’t think that guilt-based judgement is going to be effective at communicating or changing actions. the bargaining scenario we’re in doesn’t seem like one that permits moral judgement and blame as a useful move; that only works when someone is open to it without having an immune response. so if they’re having immune responses, then you need to talk to them in a way that doesn’t try to override their agency with social pressure. if they in fact have explicitly endorsed bad intentions, then none of what I’m saying is relevant. if someone decides that what they’re doing is bad by their own values, they tend to update much more than if it’s just based on someone else RLHFing them.
I admit I’m not sure how to read this as something other than a rejection of holding people morally accountable.
this is getting off topic, but I’ll give one more reply:
“holding accountable” in the sense of “sending them frowny faces to make them feel bad about their behavior” doesn’t seem like a consequentially effective frame, yeah. if someone is doing bad things, you take actions to stop them; if they’re convinceable and just aren’t considering that the things might be bad, you show them the impact, and then they feel bad when they comprehend. if they’re not convinceable, you aren’t going to get through to them by frowning at them. claiming moral failure is a “send you frowny face” move.
like—let’s say you bumped into someone on a crowded subway car, because you weren’t paying attention. I saw it, and know that you just smashed their food. you don’t know that. If I come up to you, someone who cares about not smashing people’s food for no reason, and say “I’m here to hold you accountable for what you’ve done wrong”, you’ll go “...what the heck?” but if I say “hey, you smashed their food,” you’ll turn around and be surprised and unhappy about this, not least because now you also know you need to clean up what got on you, as well as deal with the moral impacts.
if you go around smashing people’s food regularly, and repeatedly say you’re just not paying attention, then it might be that me telling you you suck is an effective intervention, but it still seems unlikely; either you’re doing it on purpose, in which case an immune response at you is needed, or you’ve got some sort of failure-to-update thing happening, in which case I need to help you look through what happens when you’re about to make the mistake and find a new behavior there (a deeper version of explaining it). if it’s because you’re getting paid to not look around you on the subway, then it might be hard to get you to consider the tradeoff, but if you value not knocking people’s food over, I might still be able to get through.