We should probably install cheaply satisfied preferences within AIs. But why should that preference be myopic reward?
Why not a utility function like: “How much time does a tungsten cube spend on Dario’s desk, discounted at ≈21% per year?”
i.e. utility = ∫₀^∞ e^{-λt} · 𝟙[cube on desk at time t] dt
where λ = ln(2)/3 ≈ 0.231/yr (a per-year discount factor of e^{-λ} ≈ 0.79, i.e. the ≈21% annual rate above), chosen so that half the utility comes from the deployment period (the first 3 years) and half from the rest of history.
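As a quick numerical sanity check on these constants, here is a minimal Python sketch; the cube_on_desk indicator interface and the 300-year truncation horizon are my assumptions for illustration, not part of the proposal:

```python
import numpy as np
from scipy.integrate import quad

# Continuous discount rate: half of all achievable utility should
# accrue during the first 3 years, so lambda = ln(2)/3 per year.
LAM = np.log(2) / 3  # ~0.231/yr; per-year discount factor e^-LAM ~ 0.79

def cube_utility(cube_on_desk, horizon_years=300.0):
    """Discounted utility of a scenario.

    cube_on_desk: function t -> 0.0 or 1.0, the indicator that the
    cube is on Dario's desk at time t (in years). The horizon
    (an assumption here) truncates the improper integral; exponential
    decay makes the truncation error negligible.
    """
    value, _ = quad(lambda t: np.exp(-LAM * t) * cube_on_desk(t),
                    0.0, horizon_years, limit=500)
    return value

# Sanity checks: with the cube permanently on the desk, total utility
# is 1/lambda, and the first 3 years contribute exactly half of it.
total = cube_utility(lambda t: 1.0)
first_three, _ = quad(lambda t: np.exp(-LAM * t), 0.0, 3.0)
print(total, 1 / LAM)        # both ~4.328
print(first_three / total)   # ~0.5
```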
Some advantages of the cube preference:
We don’t have to worry about how satisfying this preference affects training and deployment.
It’s philosophically less messy to pin down what the cube utility of a given scenario would be.
Some disadvantages:
AIs will crave reward anyway, so it’s better to intensify that existing craving than to add a distinct one.
It’s easier to build AIs that intensely crave reward than ones that crave the cube thing. My guess is that this is both true and decisive, but I’d want a clearer sense of what actually goes wrong if we do something like this.