Reversible changes: consider a bucket of water
I’ve argued that many methods of AI control (corrigibility, amplification and distillation, low impact, and so on) require a partial definition of human preferences to make sense.
One idea I’ve heard for low impact is that of reversibility: asking how hard it is to move the situation back to its original state (or how not to close down too many options). The “Conservative Agency” paper uses something akin to that, for example.
On that issue, I’ll present the bucket of water thought experiment. A robot has to reach a certain destination; it can do so by wading through a shallow pool. At the front of the pool is a bucket of water. The water in the bucket has a slightly different balance of salts than the water in the pool (maybe due to mixing effects when the water was drawn from the pool).
The fastest way for the robot to reach its destination is to run through the pool, kicking the bucket into it as it goes. Is this a reversible action?
Well, it depends on what humans care about in the bucket’s water. If we care only about the rough quantity of water, this action is perfectly reversible: just dip the bucket back into the pool and draw out the right amount. If we care about the exact balance of salts in the bucket, the action is very difficult to reverse and takes a lot of work. If we care about the exact molecules in the bucket, this action is completely irreversible.
The truth is that, with a tiny set of exceptions, all our actions are irreversible, shutting down many possibilities forever. But many things are reversible in practice, in that we can return to a state sufficiently similar that we don’t care about the difference.
But, in order to establish that, we need some estimate of what we care (and don’t care) about. In the example above, things are different depending on whether we take the bucket to contain a) just water, b) water carefully salt-balanced for an industrial purpose, or c) the water from the last bath my adored husband took before he died.
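To make the dependence concrete, here is a minimal Python sketch of the bucket example. Everything in it is a hypothetical illustration rather than a proposed impact measure: the `Bucket` class, the three `cares_about_*` functions, and the numbers are all assumptions chosen for the example. The same physical outcome (refilling the bucket from the pool) counts as "restored" or not depending only on which features we say we care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    """Toy state of the bucket (hypothetical fields, for illustration only)."""
    litres: float         # rough quantity of water
    salt_g_per_l: float   # balance of salts
    molecules: frozenset  # stand-in for "the exact molecules"


# Each function encodes one possible human preference by projecting
# the state onto the features that matter under that preference.
def cares_about_quantity(b: Bucket):
    return round(b.litres)


def cares_about_salts(b: Bucket):
    return (round(b.litres), round(b.salt_g_per_l, 3))


def cares_about_molecules(b: Bucket):
    return b.molecules


def restored(original: Bucket, after: Bucket, cares_about) -> bool:
    """True if the two states are indistinguishable under the given preference."""
    return cares_about(original) == cares_about(after)


# Assumed numbers: the bucket starts slightly saltier than the pool.
original = Bucket(10.0, 1.250, frozenset({"bucket-water"}))
# After the kick, the robot refills the bucket from the pool.
refilled = Bucket(10.0, 1.200, frozenset({"pool-water"}))

for preference in (cares_about_quantity, cares_about_salts, cares_about_molecules):
    print(preference.__name__, restored(original, refilled, preference))
```

Under these assumptions only the quantity preference treats the refill as a full reversal. Nothing in the physical description of the two states picks out which projection to use; that choice has to be supplied from outside.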
EDIT: I should point out that “avoid the bucket anyway” is not a valid strategy, since “avoid doing anything that could have a large irreversible impact for some utility function” is equivalent to “don’t do anything at all”.
The robot has to be capable of kicking the bucket in some circumstances: precisely those where humans don’t care about the bucket’s contents and it is valuable to kick it. But both of those (“don’t care”, “valuable”) are human value judgements.
That’s why I don’t see how there could be any measure of irreversibility or low impact that doesn’t include some portion of human preferences. It has to distinguish between when kicking the bucket is forbidden, allowable, or mandatory, and the only thing that distinguishes these is human preferences.