I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
- What do we expect to be the best kind of evidence for convincing developers to shut down/undeploy?
  - Maybe there is more compelling evidence than “your AI writing code to escape”.
  - I’m reminded of hearing that the public reception of the OpenAI/Apollo scheming paper was surprising relative to some people’s expectations (people were more freaked out by weird CoT, less freaked out by actual scheme-y behavior).
- What are good plans for AI labs when they have such evidence?
- What are good plans for AIS researchers/advocacy people when they have such evidence?
- Should you pursue safety research agenda X if shutdown/undeployment is hard?
  - To spell out the thinking here: you might conceptualize safety research as having broadly two paths to impact:
    1. “Training against”: provide a training signal we can optimize over to reduce egregious misalignment.
    2. “Shutdown”: provide evidence that convinces AI developers to slow down or undeploy.
  - One implication of thinking that path 2 is very hard is that your research should either focus aggressively on path 1 or risk not being that useful.