This is, broadly speaking, the problem of corrigibility, and how to formalize it is currently an open research problem. (There’s the separate question whether it’s possible to make systems robustly corrigible in practice without having a good formalized notion of what that even means; this seems tricky.)
This is, broadly speaking, the problem of corrigibility, and how to formalize it is currently an open research problem. (There’s the separate question whether it’s possible to make systems robustly corrigible in practice without having a good formalized notion of what that even means; this seems tricky.)