I find myself agreeing with you here, and see this as a potentially significant crux—if true, AGI will be “forced” to cooperate with/deeply influence humans for a significant period of time, which may give us an edge over it (due to having a longer time period where we can turn it off, and thus allowing for “blackmail” of sorts)
I’d like AGIs to have a big red shutdown button that is used/tested regularly, so we know that the AI will shut down and won’t try to interfere. I’m not saying this is sufficient to prove that the AI is safe, just that I would sleep better at night knowing that stop-button corrigibility is solved.
I am glad to read that, because an AGI that is forced to co-operate is an obvious solution to the alignment problem that is being consistently dismissed by denying that an AGI that does not kill us all is possible at all
I would like to point out a potential problem with my own idea, which is that it’s not necessarily clear that cooperating with us will be in the AI’s best interest (over trying to manipulate us in some hard-to-detect manner). For instance, if it “thinks” it can get away with telling us it’s aligned and giving some reasonable sounding (but actually false) proof of its own alignment, that would be better for it than being truly aligned and thereby compromising against its original utility function. On the other hand, if there’s even a small chance we’d be able to detect that sort of deception and shut it down, than as long as we require proof that it won’t “unalign itself” later, it should be rationally forced into cooperating, imo.
I find myself agreeing with you here, and see this as a potentially significant crux—if true, AGI will be “forced” to cooperate with/deeply influence humans for a significant period of time, which may give us an edge over it (due to having a longer time period where we can turn it off, and thus allowing for “blackmail” of sorts)
I’d like AGIs to have a big red shutdown button that is used/tested regularly, so we know that the AI will shut down and won’t try to interfere. I’m not saying this is sufficient to prove that the AI is safe, just that I would sleep better at night knowing that stop-button corrigibility is solved.
I am glad to read that, because an AGI that is forced to co-operate is an obvious solution to the alignment problem that is being consistently dismissed by denying that an AGI that does not kill us all is possible at all
I would like to point out a potential problem with my own idea, which is that it’s not necessarily clear that cooperating with us will be in the AI’s best interest (over trying to manipulate us in some hard-to-detect manner). For instance, if it “thinks” it can get away with telling us it’s aligned and giving some reasonable sounding (but actually false) proof of its own alignment, that would be better for it than being truly aligned and thereby compromising against its original utility function. On the other hand, if there’s even a small chance we’d be able to detect that sort of deception and shut it down, than as long as we require proof that it won’t “unalign itself” later, it should be rationally forced into cooperating, imo.