I could use a better way to think sanely about warning shots.
Yeah I should probably spend some time thinking about this as well. My tentative take is that “well I wouldn’t do this great safety intervention because it might avoid small AI catastrophes that kill a lot of people, but not all the people (and those catastrophes are actually good)” is suspicious reasoning. Like I’m so allergic to arguments of the form “allow bad thing to happen for the greater good.”
Also, I feel like we can just run lots of training ablations to see which methods are load bearing for how aligned models seem. For example, if we removed RLHF, and then model just suddenly starts saying stuff about “My real goal is to hack into the Anthropic servers,”[1] then we should be pretty worried, and this doesn’t require people to actually die in a catastrophe.
I spent a bit of time (like, 10 min) thinking through warning shots today.
I definitely do not think anyone should take any actions that specifically cause warning shots to happen (if you are trying to do something like that, you should be looking for “a scary demo”, not “a warning shot”. Scary demos can/should be demo’d ethically)
If you know of a concrete safety intervention that’d save lives, obviously do the safety intervention.
But, a lot of the questions here are less like “should I do this intervention?” and more like “should I invest years of my life researching into a direction that helps found a new subfield that maybe will result in concrete useful things that save some lives locally but also I expect to paper over problems and cost more lives later?” (when, meanwhile, there are tons of other research directions you could explore)
...yes there is something sus here that I am still confused about, but, with the amount of cluelessness that necessarily involves I don’t think people have an obligation to go founding new research subfields if their current overall guess is “useful locally but harmful globally.”
I think if you go and try to suppress research into things that you think are moderately likely to save some lives a few years down the line but cost more live later, then we’re back into ethically fraught territory (but, like, also, you shouldn’t suppress people saying “guys this research line is maybe on net going to increase odds of everyone dying.”
I didn’t actually get to having a new crystallized take, that was all basically my background thoughts from earlier.
(Also, hopefully obviously: when you are deciding your research path, or arguing people should abandon one, you do have to actually do the work to make an informed argument for whether/how bad any of the effects are, ‘it’s plausible X might lead to a warning shot that helps’ or ‘it’s plausible Y might lead to helping on net with alignment subproblems’ or ‘Y might save a moderate number of lives’ are all things you need to unpack and actually reason through)
Yeah I should probably spend some time thinking about this as well. My tentative take is that “well I wouldn’t do this great safety intervention because it might avoid small AI catastrophes that kill a lot of people, but not all the people (and those catastrophes are actually good)” is suspicious reasoning. Like I’m so allergic to arguments of the form “allow bad thing to happen for the greater good.”
Also, I feel like we can just run lots of training ablations to see which methods are load bearing for how aligned models seem. For example, if we removed RLHF, and then model just suddenly starts saying stuff about “My real goal is to hack into the Anthropic servers,”[1] then we should be pretty worried, and this doesn’t require people to actually die in a catastrophe.
This is a result in an earlier version of Anthropic’s Natural Emergent Misalignment from Reward Hacking paper which for some reason didn’t make it into the final paper.
I spent a bit of time (like, 10 min) thinking through warning shots today.
I definitely do not think anyone should take any actions that specifically cause warning shots to happen (if you are trying to do something like that, you should be looking for “a scary demo”, not “a warning shot”. Scary demos can/should be demo’d ethically)
If you know of a concrete safety intervention that’d save lives, obviously do the safety intervention.
But, a lot of the questions here are less like “should I do this intervention?” and more like “should I invest years of my life researching into a direction that helps found a new subfield that maybe will result in concrete useful things that save some lives locally but also I expect to paper over problems and cost more lives later?” (when, meanwhile, there are tons of other research directions you could explore)
...yes there is something sus here that I am still confused about, but, with the amount of cluelessness that necessarily involves I don’t think people have an obligation to go founding new research subfields if their current overall guess is “useful locally but harmful globally.”
I think if you go and try to suppress research into things that you think are moderately likely to save some lives a few years down the line but cost more live later, then we’re back into ethically fraught territory (but, like, also, you shouldn’t suppress people saying “guys this research line is maybe on net going to increase odds of everyone dying.”
I didn’t actually get to having a new crystallized take, that was all basically my background thoughts from earlier.
(Also, hopefully obviously: when you are deciding your research path, or arguing people should abandon one, you do have to actually do the work to make an informed argument for whether/how bad any of the effects are, ‘it’s plausible X might lead to a warning shot that helps’ or ‘it’s plausible Y might lead to helping on net with alignment subproblems’ or ‘Y might save a moderate number of lives’ are all things you need to unpack and actually reason through)