Is there a case for deliberately training in cheaply satisfied AI preferences just so we can satisfy them? I think it’s plausible that we can create AI motivations more easily than we can remove undesired ones.
Yes! I’m quite excited by this proposal and I currently plan to write more about it and study it empirically. The basic idea is to try to make AIs’ reward-hacking more responsive to satiation.
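Purely as an illustrative toy (the functional form, the `k` parameter, and the framing in terms of cumulative reward are my own assumptions, not anything from the post): one way to make craving "responsive to satiation" is a bounded, concave utility of cumulative reward, so the marginal value of extra reward falls toward zero once the agent is satiated.

```python
import math

def satiated_utility(cumulative_reward: float, k: float = 0.1) -> float:
    """Toy satiating utility: bounded in [0, 1) and concave, so the
    marginal value of extra reward, k * exp(-k * r), shrinks as the
    cumulative reward r grows. The form and k are illustrative
    assumptions only, not the post's proposal."""
    return 1.0 - math.exp(-k * cumulative_reward)

# The payoff to further reward-hacking falls off once the agent is satiated:
for r in [0, 10, 50, 100]:
    marginal = satiated_utility(r + 1) - satiated_utility(r)
    print(f"cumulative reward {r:>3} -> marginal value of +1 reward: {marginal:.4f}")
```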
We should probably install cheaply satisfied preferences within AIs; why should the preference we pick be myopic reward?
Why not a utility function like: “the amount of time there is a tungsten cube on Dario’s desk, discounted at 21% per year”?
i.e. utility = ∫₀^∞ e^{−λt} · 𝟙[cube on desk at time t] dt,
where λ = ln(2)/3 ≈ 0.231 per year, chosen so that half the utility comes from the deployment period (the first 3 years) and half from the rest of history.
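A quick numerical sanity check of the figures above (a minimal sketch that just re-derives the ~21% annual rate and the 50/50 split from λ = ln(2)/3):

```python
import math

# Decay rate chosen so that the discounted-utility integral halves at t = 3 years.
lam = math.log(2) / 3  # ≈ 0.231 per year

# Implied annual discount rate: the weight e^{-lam * t} shrinks by a factor
# e^{-lam} each year, i.e. by about 21% (20.6% before rounding).
print(f"annual discount rate: {1 - math.exp(-lam):.1%}")

# Share of total utility earned in the first 3 years:
#   ∫₀³ e^{-lam·t} dt / ∫₀^∞ e^{-lam·t} dt = 1 - e^{-3·lam} = 1 - e^{-ln 2} = 1/2
print(f"share of utility in the first 3 years: {1 - math.exp(-3 * lam):.0%}")
```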
Some advantages of the cube preference:
We don’t have to worry about how satisfying this preference affects training and deployment.
It’s less philosophically messy to work out what the cube utility of a scenario would be.
Some disadvantages:
AIs will crave reward anyway, so it’s better to intensify that existing craving than to add a distinct one.
It’s easier to build AIs that intensely crave reward than AIs that crave the cube. My guess is that this is both true and decisive, but I’d want to have a clearer sense of what actually goes wrong if we do something like this.
Wouldn’t this come at the risk of reducing usefulness? Reward-hacking is not useful to us, but the “hacking” is only measured against what we judge a useful outcome to be; from the AI’s perspective, reward-hacking is just getting reward, since it can’t see our judgement that it has gone too far. So if the AI tries less hard to get reward, it will also try less hard at completing tasks in ways that give us what we want, which would make it less useful.
Part of the post is about this.