Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process.
[...]
We would like to stress that this process P is an example, and not the central point of this post.
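The definition of P is elided above, and the post stresses it is only an example; purely to fix intuitions, here is a toy stand-in of my own (not the post's P, and nothing hinges on it): a process whose utility functions are running averages of noisy observations of some fixed "true" values, so the sequence {Ui} converges by the law of large numbers.

```python
import numpy as np

# Toy illustration (my own construction, NOT the post's P): a process
# that emits a sequence of utility functions U_1, U_2, ... over a small
# finite set of outcomes. Each step folds a fresh noisy estimate of some
# "true" human values into a running average, so the sequence converges
# (by the law of large numbers) to a limit U_inf.

rng = np.random.default_rng(seed=0)
N_OUTCOMES = 5
true_values = rng.normal(size=N_OUTCOMES)  # stand-in for human values

def utility_sequence(n_steps):
    """Yield U_1, ..., U_n, each a vector of utilities over outcomes."""
    u = np.zeros(N_OUTCOMES)
    for i in range(1, n_steps + 1):
        noisy_estimate = true_values + rng.normal(scale=1.0, size=N_OUTCOMES)
        u += (noisy_estimate - u) / i  # running-average update
        yield u.copy()

us = list(utility_sequence(5000))
print("U_5000     :", np.round(us[-1], 2))
print("true values:", np.round(true_values, 2))
```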
Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U∞ (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize lim_{i→∞} Ui (= U∞). It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:
- The AI should at any given time take actions that are good according to most of the plausible values of U∞.
- The AI would be incentivized to gather information that would help it learn more about U∞.
- The AI would not be incentivized to gather information about U∞ at the expense of maximizing U∞ (e.g., it would not be incentivized to run “unethical experiments”).
- The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U∞.
- The AI should be keen to maintain option value as it learns more about U∞, until it’s very confident about what U∞ looks like.
Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
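As a minimal sketch of the first of these properties (my own illustration, not from the post, assuming the AI keeps a posterior over a handful of candidate limits): maximizing posterior-expected utility automatically favours actions that are decent under most plausible values of U∞.

```python
import numpy as np

# Minimal sketch (my own, not from the post): the AI is uncertain which
# of a few candidate utility vectors is the true limit U_inf, keeps a
# posterior over them, and picks the action with the highest
# posterior-expected utility.

candidate_limits = np.array([
    [1.0, 0.0, 0.5],   # hypothesis A: utility of actions 0, 1, 2 under U_inf
    [0.2, 0.9, 0.5],   # hypothesis B
    [0.1, 0.1, 0.6],   # hypothesis C
])
posterior = np.array([0.4, 0.4, 0.2])  # current beliefs over A, B, C

def best_action(posterior, candidate_limits):
    """Pick argmax_a of the posterior-expected utility E[U_inf(a)]."""
    expected_utility = posterior @ candidate_limits
    return int(np.argmax(expected_utility))

# Action 2 is never the best under any single hypothesis, but it is
# never bad either, so under spread-out uncertainty it wins: "good
# according to most of the plausible values of U_inf" in miniature.
print(best_action(posterior, candidate_limits))  # -> 2
```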
On limiting utility functions
Assuming you get such a process pointing towards human values, I expect it to have the properties you describe, which are pretty good.
There is still one potential issue: the AI needs to be able to use P and Ui (its current utility function) to guess enough of the limit that it can be competitive (Footnote: something like a Cauchy criterion?). Otherwise the AI risks falling into crippling uncertainty about what it can and cannot do, as in principle U∞ could be anything.
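To gesture at what the footnote's Cauchy-style criterion could look like in practice (a hedged sketch under my own assumptions, including the hypothetical interface process(j) → U_j restricted to the options under consideration): the AI samples a few further utility functions from P and only commits once they have stopped moving by more than some ε on those options.

```python
import numpy as np

def limit_is_pinned_down(process, i, lookahead=10, eps=1e-3):
    """Cauchy-style stability check for U_i (illustrative, my assumption).

    `process(j)` is a hypothetical interface returning U_j as a vector of
    utilities over the options currently under consideration. If no U_j
    with j in (i, i + lookahead] differs from U_i by more than eps in sup
    norm, treat U_i as a competitive proxy for U_inf on those options;
    otherwise keep deferring and gathering information.
    """
    u_i = process(i)
    return all(
        np.max(np.abs(process(j) - u_i)) < eps
        for j in range(i + 1, i + lookahead + 1)
    )

# Example with a trivially convergent sequence U_j = [1 - 1/j]:
process = lambda j: np.array([1.0 - 1.0 / j])
print(limit_is_pinned_down(process, i=5))     # False: still drifting
print(limit_is_pinned_down(process, i=5000))  # True: effectively at U_inf
```

Of course a finite lookahead only rules out drift over that window; a genuine guarantee would need P to come with something like a modulus of convergence, which is part of why so much of the difficulty is hidden in P.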
With that said, it still sounds like the noticeably hardest part of the problem has been “hidden away” in P (as you point out in the issue section). It’s always hard to point at something and say that this is the hard part of the problem, but I’m pretty confident that finding a process that converges towards human values while satisfying the competitiveness constraint above is the main problem here.
Thus this post seems to provide an alternative “type” for a solution to value learning, in the shape of such a sequence. It sounds similar to other proposals in the literature, like IDA and Recursive Reward Modelling, but the lack of a built-in human feedback mechanism makes it more abstract. So I expect that exploring this abstract framing and the constraints that follow from it might tell us interesting and useful things about the viability of solutions of this type (including their potential impossibility).