Reversible changes: consider a bucket of water

I’ve argued that many methods of AI control—corrigibility, amplification and distillation, low impact, etc.—require a partial definition of human preferences to make sense.

One idea I’ve heard for low impact is that of reversibility—asking how hard it is to move the situation back to its original state (or how not to close down too many options). The “Conservative Agency” paper uses something akin to that, for example.

On that issue, I’ll present the bucket of water thought experiment. A robot has to reach a certain destination; it can do so by wading through a shallow pool. At the front of the pool is a bucket of water. The water in the bucket has a slightly different balance of salts than the water in the pool (maybe due to mixing effects when the water was drawn from the pool).

The fastest way for the robot to reach its destination is to run through the pool, kicking the bucket into it as it goes. Is this a reversible action?

Well, it depends on which properties of the water in the bucket humans care about. If we care about the rough quantity of water, this action is perfectly reversible: just dip the bucket back into the pool and draw out the right amount of water. If we care about the exact balance of salts in the bucket, this is very difficult to reverse, and requires a lot of careful work. If we care about the exact molecules in the bucket, this action is completely irreversible.
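To make that dependence explicit, here is a minimal Python sketch (my own illustration, not from any existing implementation): the same kicked-over bucket counts as reversible or irreversible depending on which feature map—a stand-in for human preferences—is used to compare states. The Bucket fields, the pool, and the single refill_from_pool recovery action are all invented for the example.

```python
# A minimal sketch of preference-dependent reversibility. Everything here
# (state representation, feature maps, recovery action) is invented for
# illustration; it is not a proposed impact measure.

from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    litres: float             # rough quantity of water
    salt_ppm: float           # salt balance
    molecule_ids: frozenset   # stand-in for "the exact molecules"

# The original bucket, the pool it could be refilled from, and the state after kicking.
original = Bucket(10.0, 35.0, frozenset(range(0, 100)))
pool     = Bucket(10_000.0, 34.0, frozenset(range(100, 1_000_000)))
kicked   = Bucket(0.0, 0.0, frozenset())   # bucket emptied into the pool

def refill_from_pool(state: Bucket) -> Bucket:
    """The only recovery action we model: scoop 10 litres back out of the pool."""
    return Bucket(10.0, pool.salt_ppm, frozenset(list(pool.molecule_ids)[:100]))

# Three "preference" feature maps: reversibility is judged only on what they keep.
features = {
    "rough quantity":   lambda b: round(b.litres),
    "salt balance too": lambda b: (round(b.litres), round(b.salt_ppm, 1)),
    "exact molecules":  lambda b: (round(b.litres), b.molecule_ids),
}

def reversible(after: Bucket, before: Bucket, feature) -> bool:
    """Can one recovery step restore everything this feature map cares about?"""
    return feature(refill_from_pool(after)) == feature(before)

for name, f in features.items():
    verdict = "reversible" if reversible(kicked, original, f) else "irreversible"
    print(f"{name:16s} -> {verdict}")

# rough quantity   -> reversible
# salt balance too -> irreversible   (pool water has a different salt mix)
# exact molecules  -> irreversible
```

Nothing about the physics changes between the three rows; only the feature map—what we care about—does.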

The truth is that, with a tiny set of exceptions, all our actions are irreversible, shutting down many possibilities forever. But many things are reversible in practice, in that we can return to a state sufficiently similar that we don’t care about the difference.

But, in order to establish that, we need some estimate of what we care (and don’t care) about. In the example above, things are different if we are considering a) this is a bucket of water, b) this is a bucket of water carefully salt-balanced for an industrial purpose, or c) this is the water from the last bath my adored husband took, before he died.

EDIT: I should point out that “avoid the bucket anyway” is not a valid strategy, since “avoid doing anything that could have a large irreversible impact for some utility function” is equivalent to “don’t do anything at all”.
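As a toy illustration of that equivalence (again my own construction, with invented actions and features): give the robot three options, and a family of simple “utility functions” that each care only about one feature keeping its initial value. Every option, including staying put, is vetoed by at least one of them.

```python
# Toy illustration: if an action is forbidden whenever it is irreversible under
# SOME utility function, then every action is forbidden. The actions and
# features below are invented for this sketch.

initial = {"bucket_salt_mix": "intact", "battery": "charged", "grass": "intact"}

actions = {
    "kick the bucket, cut through the pool":
        {"bucket_salt_mix": "changed", "battery": "drained", "grass": "intact"},
    "walk the long way around":
        {"bucket_salt_mix": "intact", "battery": "drained", "grass": "trampled"},
    "stay put":
        {"bucket_salt_mix": "intact", "battery": "drained", "grass": "intact"},
}

for name, outcome in actions.items():
    # One "utility function" per feature: it only values that feature
    # keeping its initial state.
    vetoes = [feat for feat, val in outcome.items() if val != initial[feat]]
    print(f"{name:38s} vetoed by utility functions over {vetoes}")

# Even "stay put" drains the battery (and lets time pass), so a ban on any
# irreversible impact, quantified over all utility functions, bans acting at all.
```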

The robot has to be capable of kicking the bucket in some circumstances: precisely those circumstances where humans don’t care about the bucket’s contents and it is valuable to kick it. But both of those—“don’t care”, “valuable”—are human value judgements.

That’s why I don’t see how there could be any measure of irreversibility or low impact that doesn’t include some portion of human preferences. It has to distinguish between when kicking the bucket is forbidden, allowable, or mandatory—and the only things that distinguish these are human preferences.