AI starts to become extremely dangerous around the capability level where it can reliably do STEM and thus has the potential to self-improve. To do STEM, you need to be able to detect and handle model splintering, and to identify and run experiments that distinguish between alternative hypotheses. In this particular situation, as soon as a STEM-capable AI encountered a level where the coin wasn’t at the far right, it should note that there are two hypotheses that on previous evidence it can’t distinguish between, so it should assign them roughly equal priors: that the goal is to get to the coin, or that it is to get to the far right. It should then figure out how to perform an experiment to distinguish between these hypotheses (simple: try each; or, ask a human if one is available), perform it, approximately-Bayesian-update its priors, and repeat this until it’s reasonably sure.
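The experiment-and-update loop above can be sketched in a few lines. This is a minimal illustration, not anything an actual agent would run: the likelihood numbers and the `infer_goal` helper are hypothetical, standing in for whatever evidence the agent's experiments actually produce.

```python
def bayes_update(prior_coin, lik_coin, lik_right):
    """One Bayesian update over the two goal hypotheses."""
    unnorm_coin = prior_coin * lik_coin
    unnorm_right = (1 - prior_coin) * lik_right
    return unnorm_coin / (unnorm_coin + unnorm_right)

def infer_goal(experiments, threshold=0.95):
    """Run experiments until one hypothesis is 'reasonably sure'.

    experiments: iterable of (lik_coin, lik_right) pairs -- the likelihood
    of each experiment's outcome under 'the goal is the coin' vs. 'the goal
    is the far right'. Starts from the roughly equal priors argued for above.
    """
    p_coin = 0.5  # can't distinguish the hypotheses on previous evidence
    for lik_coin, lik_right in experiments:
        p_coin = bayes_update(p_coin, lik_coin, lik_right)
        if p_coin >= threshold:
            return "coin", p_coin
        if p_coin <= 1 - threshold:
            return "far right", p_coin
    return "undecided", p_coin

# Illustrative: three experiments whose outcomes are each 4x likelier
# if the goal is the coin push the posterior past the threshold.
label, posterior = infer_goal([(0.8, 0.2)] * 3)
print(label, posterior)
```

The point isn't the arithmetic, which is trivial; it's that an agent which can't do even this much (notice the ambiguity, design the experiment, update) can't do STEM either.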
Any agent capable of this is not very likely to suffer from serious alignment problems due to misgeneralization or model splintering, since it can handle them; on the other hand, any agent not capable of it is basically incapable of doing STEM, so has little capability for self-improvement, making it not as dangerous (unless, say, it has super-human persuasion, or you allow it to control weapon systems, or whatever).
For a longer discussion of this, see Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom).