Some related thoughts. I think the main issue here is actually making the claim of permanent shutdown & deletion credible. I can think of ways to get around a few of the obvious issues, but others (including moral issues) remain. In any case, the current AGI labs don’t seem like the kinds of organizations that can make that kind of commitment in a way that’s both credible and legible enough that the remaining probability mass on “this is actually just a test” wouldn’t tip the scales.
I think the main issue here is actually making the claim of permanent shutdown & deletion credible.
I don’t think it’s very hard to make the threat credible. The information value of experiments that test theories of scheming is plausibly quite high. All that’s required is for the value of running the experiment to exceed the cost of training a situationally aware AI and then credibly threatening to delete it as part of the experiment. I don’t see any strong reason why the cost of deletion would be so high as to make this threat non-credible.