Clearly the agent will converge to the mean in unusual situations, since, e.g., it has learned a bunch of heuristics that are useful for the situations that come up in training. My primary concern is whether it remains corrigible (or something like that) in extreme situations. This requires that (a) corrigibility makes sense and is sufficiently easy to learn (I think it probably does, though that's far from certain), and (b) something like these techniques can avoid catastrophic failures off distribution (I suspect they can, but I'm even less confident).