Value Learning is only Asymptotically Safe

I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now.)

This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.

Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead.

In the presence of cosmic rays, then, one cannot say that the agent is safe for its entire lifetime with probability 1 (edited for clarity). Any finite sequence of observations that would cause the agent to conclude that humanity was better off dead has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory.
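To make the positive-probability step concrete, here is a rough sketch under assumptions that are not part of the argument above: suppose the harmful observation sequence requires k specific bits to flip, and each of them flips independently with probability θ, with 0 < θ < 1, during the relevant window. Both k and θ are illustrative symbols, and independence is a simplification.

```latex
\Pr[\text{all } k \text{ relevant bits flip}] = \theta^{k} > 0
\quad\Longrightarrow\quad
\Pr[\text{agent is safe for its entire lifetime}] \le 1 - \theta^{k} < 1 .
```

However small θ and however large k, the bound stays strictly below 1, which is all the argument needs.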

This agent is presumably still asymptotically safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli(θ) random variable (like the indicator of whether a bit was flipped) approaches θ, which is small enough that a competent value learner should be able to deal with it.
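The convergence claim is the strong law of large numbers, and it can be illustrated with a small simulation. This is only a sketch: the function name `running_flip_rate`, the flip probability 0.01, and the checkpoint sizes are all made up for illustration, and real cosmic ray flip rates would be far smaller.

```python
import random


def running_flip_rate(theta, checkpoints, seed=0):
    """Simulate independent Bernoulli(theta) bit-flip indicators and
    return the sample mean of the indicators at each checkpoint.

    theta and checkpoints are illustrative assumptions, not values
    taken from any real hardware.
    """
    rng = random.Random(seed)          # seeded for reproducibility
    targets = sorted(set(checkpoints))  # sample sizes at which to record
    means = {}
    flips = 0
    idx = 0
    for n in range(1, targets[-1] + 1):
        # One Bernoulli(theta) draw: 1 if the bit flipped, else 0.
        if rng.random() < theta:
            flips += 1
        if n == targets[idx]:
            means[n] = flips / n
            idx += 1
    return means


if __name__ == "__main__":
    for n, mean in running_flip_rate(0.01, [100, 10_000, 200_000]).items():
        print(f"n={n:>7}: observed flip rate = {mean:.5f}")
```

As the number of draws grows, the observed flip rate should settle near the assumed θ, so in the long run the corrupted fraction of observations stays small, even though any particular finite run of bad flips keeps positive probability.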

This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that the cosmic ray argument complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?