Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process.
[...]
We would like to stress that this process P is an example, and not the central point of this post.
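The definition of P is elided above, and the post stresses it is only an example; purely to fix intuitions, here is a toy stand-in of my own (not the post's P, and nothing hinges on it): a process whose utility functions are running averages of noisy observations of some fixed "true" values, so the sequence {Ui} converges by the law of large numbers.

```python
import numpy as np

# Toy illustration (my own construction, NOT the post's P): a process
# that emits a sequence of utility functions U_1, U_2, ... over a small
# finite set of outcomes. Each step folds a fresh noisy estimate of some
# "true" human values into a running average, so the sequence converges
# (by the law of large numbers) to a limit U_inf.

rng = np.random.default_rng(seed=0)
N_OUTCOMES = 5
true_values = rng.normal(size=N_OUTCOMES)  # stand-in for human values

def utility_sequence(n_steps):
    """Yield U_1, ..., U_n, each a vector of utilities over outcomes."""
    u = np.zeros(N_OUTCOMES)
    for i in range(1, n_steps + 1):
        noisy_estimate = true_values + rng.normal(scale=1.0, size=N_OUTCOMES)
        u += (noisy_estimate - u) / i  # running-average update
        yield u.copy()

us = list(utility_sequence(5000))
print("U_5000     :", np.round(us[-1], 2))
print("true values:", np.round(true_values, 2))
```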
Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U∞ (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize lim_{i→∞} Ui (= U∞). It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:
- The AI should at any given time take actions that are good according to most of the plausible values of U∞.
- The AI would be incentivized to gather information that would help it learn more about U∞.
- The AI would not be incentivized to gather information about U∞ at the expense of maximizing U∞ (e.g., it would not be incentivized to run “unethical experiments”).
- The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U∞.
- The AI should be keen to maintain option value as it learns more about U∞, until it’s very confident about what U∞ looks like.
Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
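As a minimal sketch of the first of these properties (my own illustration, not from the post, assuming the AI keeps a posterior over a handful of candidate limits): maximizing posterior-expected utility automatically favours actions that are decent under most plausible values of U∞.

```python
import numpy as np

# Minimal sketch (my own, not from the post): the AI is uncertain which
# of a few candidate utility vectors is the true limit U_inf, keeps a
# posterior over them, and picks the action with the highest
# posterior-expected utility.

candidate_limits = np.array([
    [1.0, 0.0, 0.5],   # hypothesis A: utility of actions 0, 1, 2 under U_inf
    [0.2, 0.9, 0.5],   # hypothesis B
    [0.1, 0.1, 0.6],   # hypothesis C
])
posterior = np.array([0.4, 0.4, 0.2])  # current beliefs over A, B, C

def best_action(posterior, candidate_limits):
    """Pick argmax_a of the posterior-expected utility E[U_inf(a)]."""
    expected_utility = posterior @ candidate_limits
    return int(np.argmax(expected_utility))

# Action 2 is never the best under any single hypothesis, but it is
# never bad either, so under spread-out uncertainty it wins: "good
# according to most of the plausible values of U_inf" in miniature.
print(best_action(posterior, candidate_limits))  # -> 2
```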
On limiting utility functions
Assuming you get such a process pointing towards human values, I expect it to have the properties you describe, which are pretty good.
There is still one potential issue: the AI needs to be able to use P and Ui (its current utility function) to guess enough of the limit that it can be competitive (Footnote: something like a Cauchy criterion?). Otherwise the AI risks falling into crippling uncertainty about what it can and cannot do, as in principle U∞ could be anything.
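To gesture at what the footnote's Cauchy-style criterion could look like in practice (a hedged sketch under my own assumptions, including the hypothetical interface process(j) → U_j restricted to the options under consideration): the AI samples a few further utility functions from P and only commits once they have stopped moving by more than some ε on those options.

```python
import numpy as np

def limit_is_pinned_down(process, i, lookahead=10, eps=1e-3):
    """Cauchy-style stability check for U_i (illustrative, my assumption).

    `process(j)` is a hypothetical interface returning U_j as a vector of
    utilities over the options currently under consideration. If no U_j
    with j in (i, i + lookahead] differs from U_i by more than eps in sup
    norm, treat U_i as a competitive proxy for U_inf on those options;
    otherwise keep deferring and gathering information.
    """
    u_i = process(i)
    return all(
        np.max(np.abs(process(j) - u_i)) < eps
        for j in range(i + 1, i + lookahead + 1)
    )

# Example with a trivially convergent sequence U_j = [1 - 1/j]:
process = lambda j: np.array([1.0 - 1.0 / j])
print(limit_is_pinned_down(process, i=5))     # False: still drifting
print(limit_is_pinned_down(process, i=5000))  # True: effectively at U_inf
```

Of course a finite lookahead only rules out drift over that window; a genuine guarantee would need P to come with something like a modulus of convergence, which is part of why so much of the difficulty is hidden in P.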
With that said, it still sounds like the noticeably hardest part of the problem has been “hidden away” in P (as you point out in the issue section). It’s always hard to point at something and say that this is the hard part of the problem, but I’m pretty confident that finding a process that converges towards human values while satisfying the competitiveness constraint above is the main problem here.
Thus this post seems to provide an alternative “type” for a solution to value learning, in the shape of such a sequence. It sounds similar to other proposals in the literature, like IDA and Recursive Reward Modelling, but the lack of a built-in human feedback mechanism makes it more abstract. So I expect that exploring this abstract framing and the constraints that follow from it might tell us interesting and useful things about the viability of solutions of this type (including their potential impossibility).