“Instead of building in a shutdown button, build in a shutdown timer.”
-> Isn’t that a form of corrigibility with an added constraint? I’m not sure what would prevent you from convincing humans that it’s a bad thing to respect the timer, for example. Is it because we’ll formally verify we avoid deception instance? It’s not clear to me but maybe I’ve misunderstood.
Your first two bullet points are very accurate; it would indeed be relevant to continue by addressing these points further.
Finally, regarding your last bullet point, I agree. Currently, we do not know if it is possible to develop such safeguards, and even if it were, it would require time and further research. I fully agree that this should be made more explicit!!