Consider a self-driving car. Call the human utility function U, and call the space of all possible worlds W. In normal operation, the car only makes decisions over a restricted space W′, say W′ = {Crash into lamppost + Rest of universe, Don't crash + Rest of universe}. In practice W′ will contain a whole bunch of things the car could do. Suppose that the programmers only know U′, the restriction of U to W′. This is enough to make a self-driving car that behaves correctly in the crash-or-don't-crash dilemma.
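As a minimal sketch (the world descriptions and utility numbers here are illustrative assumptions, not part of the setup above), U′ can be thought of as a lookup table defined only on W′:

```python
# Toy sketch: U' is only defined on the restricted world-space W'.
# The specific outcomes and utility values below are made up for illustration.
U_restricted = {
    "crash into lamppost + rest of universe": -100.0,
    "don't crash + rest of universe": 0.0,
}

def choose(options):
    """Pick the option with the highest known utility (only works inside W')."""
    return max(options, key=U_restricted.__getitem__)

print(choose(list(U_restricted)))  # -> "don't crash + rest of universe"
```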
However, suppose that the self-driving car is faced with an off-distribution situation from W∖W′. Three things it could do include:
1) Recognise the problem and shut down.
2) Fail to coherently optimise at all.
3) Coherently optimise some extrapolation of U′.
The behavior we want is to optimise U, but the information about what U is simply isn’t there.
Options (1) and (2) make the system brittle, tending to fail the moment anything goes even slightly differently from expected.
Option (3) leads to reasoning like, “I know not to crash into x, y and z, so maybe I shouldn’t crash into anything.” In other words, the extrapolation is often quite good when slightly off distribution. However, when far off distribution, you can get traffic light maximizer behavior, as in the sketch below.
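As a rough illustration (the policy labels and the crude “never crash into anything” extrapolation rule below are my own illustrative assumptions, not a proposal from the argument above), the three responses to an option outside W′ might look like this:

```python
import random

# Same toy U' as in the earlier sketch (illustrative values only).
U_restricted = {
    "crash into lamppost + rest of universe": -100.0,
    "don't crash + rest of universe": 0.0,
}

def evaluate(option, policy):
    """Toy illustration of the three off-distribution responses (assumed behaviours)."""
    if option in U_restricted:        # still inside W': U' suffices
        return U_restricted[option]
    if policy == "shutdown":          # option (1): recognise the gap and stop
        raise RuntimeError("off-distribution situation: shutting down")
    if policy == "incoherent":        # option (2): no coherent optimisation at all
        return random.uniform(-100.0, 0.0)
    if policy == "extrapolate":       # option (3): guess a rule like
        # "never crash into anything" and optimise it coherently everywhere
        return -100.0 if "crash into" in option else 0.0
    raise ValueError(f"unknown policy: {policy}")

# Slightly off distribution the extrapolation looks sensible...
print(evaluate("crash into pedestrian + rest of universe", "extrapolate"))  # -> -100.0
# ...but far off distribution the guessed rule gets optimised to absurdity.
```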
In short, the paradox of robustness exists because, when you don’t know what to optimize for, you can fail to optimize, or you can guess at something and optimize that.