I worded the title conservatively because I only showed that corrigibility is never nontrivially VNM-coherent in this particular MDP.Maybe there’s a more general case to be proven for all MDPs, and using more realistic (non-single-timestep) reward aggregation schemes.
I worded the title conservatively because I only showed that corrigibility is never nontrivially VNM-coherent in this particular MDP. Maybe there’s a more general case to be proven for all MDPs, and using more realistic (non-single-timestep) reward aggregation schemes.