A utility-maximizing variant of AIXI
Response to: Universal agents and utility functions
Related approaches: Hibbard (2012), Hay (2005)
Background
Here is the function implemented by finite-lifetime AIξ:
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_m \in Y} \sum_{x_m \in X} \big( r(x_k) + \dots + r(x_m) \big) \cdot \xi(\dot{y}\dot{x}_{<k}\, \underline{yx}_{k:m}),$$
where $m$ is the number of steps in the agent's lifetime, $k$ is the current step being computed, $X$ is the set of possible observations, $Y$ is the set of possible actions, and $r$ is a function that extracts a reward value from an observation. A dot over a variable indicates that its value is known to be the true value of the action or observation it represents, an underline indicates that the variable is an input to a probability distribution, and $\xi$ is a function that returns the probability of a sequence of observations, given a known history and a sequence of actions, starting from the Solomonoff prior. More formally,
$$\xi(\dot{y}\dot{x}_{<k}\, \underline{yx}_{k:m}) = \left( \sum_{q \in Q :\, q(y_{\le m}) = x_{\le m}} 2^{-\ell(q)} \right) \Big/ \left( \sum_{q \in Q :\, q(\dot{y}_{<k}) = \dot{x}_{<k}} 2^{-\ell(q)} \right),$$
where $Q$ is the set of all valid programs, $\ell$ is a function that returns the length of a program in bits, and a program applied to a sequence of actions returns the resulting sequence of observations. Notice that the denominator is a constant, depending only on the already-known $\dot{y}\dot{x}_{<k}$, and multiplying by a positive constant does not change the argmax, so we can simply drop the denominator. The set of valid programs is prefix-free — if $q$ is a valid program, then no longer program with $q$ as a prefix is a valid program — so $\sum_{q \in Q} 2^{-\ell(q)} \le 1$.
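As a sanity check on those last two points, here is a minimal Python sketch, using a hypothetical four-program stand-in for $Q$: it verifies that a prefix-free set satisfies Kraft's inequality $\sum_q 2^{-\ell(q)} \le 1$, and that scaling scores by a positive constant leaves an argmax unchanged (which is why the denominator of $\xi$ can be dropped).

```python
from itertools import product

# Hypothetical prefix-free "program" set: no string is a prefix of another.
# (A stand-in for Q; the real Q would be all valid programs.)
programs = ["0", "10", "110", "111"]

def is_prefix_free(codes):
    return not any(a != b and b.startswith(a) for a, b in product(codes, repeat=2))

assert is_prefix_free(programs)

# Kraft's inequality: a prefix-free set satisfies sum of 2^-len(q) <= 1.
kraft_sum = sum(2 ** -len(q) for q in programs)
assert kraft_sum <= 1  # here it is exactly 1.0

# Multiplying scores by any positive constant never changes the argmax,
# so the constant denominator of xi can be ignored.
scores = {"a": 0.2, "b": 0.5, "c": 0.3}
c = 7.3  # an arbitrary positive constant
assert max(scores, key=scores.get) == max(scores, key=lambda y: c * scores[y])
```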
Problem
A problem with this formulation is that the agent can only optimize over the input it receives, not over aspects of the external world that it cannot observe. Given the chance, AIξ would hack its input channel so that it only observed good things, instead of trying to make good things happen (in other words, it would wirehead itself). Anja specified a variant of AIξ in which she replaced the sum of rewards with a single utility value, and made the domain of the utility function the entire sequence of actions and observations rather than a single observation, like so:
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_m \in Y} \sum_{x_m \in X} U(\dot{y}\dot{x}_{<k}\, yx_{k:m}) \cdot \xi(\dot{y}\dot{x}_{<k}\, \underline{yx}_{k:m}).$$
This doesn't really solve the problem, because the utility function still takes only what the agent can see, rather than what is actually going on outside the agent. The situation is a little better, because the utility function also takes the agent's actions into account, so it could punish actions that look like attempts to wirehead. But if there were any flaw in the instructions not to wirehead, the agent would exploit it, so the incentive not to wirehead would have to be perfect, and this formulation offers little insight into how to achieve that.
[Edit: Hibbard (2012) also presents a solution to this problem. I haven’t read all of it yet, but it appears to be fairly different from what I suggest in the next section.]
Solution
Here's what I suggest instead: everything that happens is determined by the program the world is running on and the agent's actions, so the domain of the utility function should be $Q \times Y^m$. The apparent problem is that the formula for AIξ makes no mention of elements of $Q$, so if we simply took the original formula and replaced $r(x_k) + \dots + r(x_m)$ with $U(q, \dot{y}_{<k} y_{k:m})$, the variable $q$ would be unbound. However, if we expand out $\xi$ in the original formula (excluding the unnecessary denominator), we can move the sum of rewards inside the sum over programs, like this:
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_m \in Y} \sum_{x_m \in X} \sum_{q \in Q :\, q(y_{\le m}) = x_{\le m}} \big( r(x_k) + \dots + r(x_m) \big) \, 2^{-\ell(q)}.$$
Now it is easy to replace the sum of rewards with the desired utility function.
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_m \in Y} \sum_{x_m \in X} \sum_{q \in Q :\, q(y_{\le m}) = x_{\le m}} U(q, \dot{y}_{<k} y_{k:m}) \, 2^{-\ell(q)}.$$
With this formulation, there is no danger of the agent wireheading, and all U has to do is compute everything that happens when the agent performs a given sequence of actions in a given program, and decide how desirable it is. If the range of U is unbounded, then this might not converge. Let’s assume throughout this post that the range of U is bounded.
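To make the formula concrete, here is a toy Python sketch of the single-step case $m = k$, with a hypothetical three-element stand-in for $Q$ (each "program" is just a lookup table from action to observation, plus a length in bits) and a made-up utility function. Since each deterministic program produces exactly one observation sequence per action sequence, the sum over observations collapses, and the agent simply maximizes $\sum_q U(q, y)\, 2^{-\ell(q)}$ over consistent programs. Note that $U$ sees only the program and the action — never the observation channel.

```python
# A minimal one-step (m = k) sketch of the utility-based formula, with a
# hypothetical finite stand-in for Q.  All names here are made up.

programs = [
    {"name": "q1", "len": 2, "out": {"left": "cake", "right": "rock"}},
    {"name": "q2", "len": 3, "out": {"left": "rock", "right": "cake"}},
    {"name": "q3", "len": 3, "out": {"left": "cake", "right": "cake"}},
]

def utility(program, action):
    # Hypothetical U: rewards worlds in which the action leads to cake,
    # computed from what actually happens inside that program.
    return 1.0 if program["out"][action] == "cake" else 0.0

def best_action(actions, programs):
    # argmax over y of sum over q of U(q, y) * 2^-l(q); the sum over
    # observations is implicit because each deterministic q yields one x.
    def value(y):
        return sum(utility(q, y) * 2 ** -q["len"] for q in programs)
    return max(actions, key=value)

print(best_action(["left", "right"], programs))  # -> left
```

Here "left" wins with value $2^{-2} + 2^{-3} = 0.375$ against $2^{-3} + 2^{-3} = 0.25$ for "right": the shorter program dominates, as the $2^{-\ell(q)}$ prior intends.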
[Edit: Hay (2005) presents a similar formulation to this.]
Extension to infinite lifetimes
The previous discussion assumed that the agent would only have the opportunity to perform a finite number of actions. The situation gets a little tricky when the agent is allowed to perform an unbounded number of actions. Hutter uses a finite look-ahead approach for AIξ, where on each step $k$, it pretends that it will only be performing $m_k$ actions, where $\forall k\ m_k \gg k$:
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_{m_k} \in Y} \sum_{x_{m_k} \in X} \big( r(x_k) + \dots + r(x_{m_k}) \big) \cdot \xi(\dot{y}\dot{x}_{<k}\, \underline{yx}_{k:m_k}).$$
If we make the same modification to the utility-based variant, we get
$$\dot{y}_k := \arg\max_{y_k \in Y} \sum_{x_k \in X} \max_{y_{k+1} \in Y} \sum_{x_{k+1} \in X} \dots \max_{y_{m_k} \in Y} \sum_{x_{m_k} \in X} \sum_{q \in Q :\, q(y_{\le m_k}) = x_{\le m_k}} U(q, \dot{y}_{<k} y_{k:m_k}) \, 2^{-\ell(q)}.$$
This is unsatisfactory because the domain of $U$ was supposed to consist of all the information necessary to determine everything that happens, but here, it is missing all the actions after step $m_k$. One obvious thing to try is to set $m_k := \infty$. This will be easier to do using a compacted expression for AIξ:
$$\dot{y}_k := \arg\max_{y_k \in Y} \max_{p \in P :\, p(\dot{x}_{<k}) = \dot{y}_{<k} y_k} \sum_{q \in Q :\, q(\dot{y}_{<k}) = \dot{x}_{<k}} \big( r(x^{pq}_k) + \dots + r(x^{pq}_{m_k}) \big) \, 2^{-\ell(q)},$$
where $P$ is the set of policies that map sequences of observations to sequences of actions, and $x^{pq}_i$ is shorthand for the last observation in the sequence returned by $q(p(\dot{x}_{<k} x^{pq}_{k:i-1}))$. If we take this compacted formulation, modify it to accommodate the new utility function, set $m_k := \infty$, and replace the maximum with a supremum (since there is an infinite number of possible policies), we get
$$\dot{y}_k := \arg\max_{y_k \in Y} \sup_{p \in P :\, p(\dot{x}_{<k}) = \dot{y}_{<k} y_k} \sum_{q \in Q :\, q(\dot{y}_{<k}) = \dot{x}_{<k}} U(q, \dot{y}_{<k} y_k y^{pq}_{k+1:\infty}) \, 2^{-\ell(q)},$$
where $y^{pq}_i$ is shorthand for the last action in the sequence returned by $p(q(\dot{y}_{<k} y_k y^{pq}_{k+1:i-1}))$.
But there is a problem with this, which I will illustrate with a toy example. Suppose $Y := \{a, b\}$, $U(q, y_{1:\infty}) = 0$ when $\forall k \in \mathbb{N}\ y_k = a$, and, for any $n \in \mathbb{N}$, $U(q, y_{1:\infty}) = 1 - \frac{1}{n}$ when $y_n = b$ and $\forall k < n\ y_k = a$ ($U$ does not depend on the program $q$ in this example). An agent following the above formula would output $a$ on every step and end up with a utility of $0$, when it could have gotten arbitrarily close to $1$ by eventually outputting $b$.
To avoid problems like that, we could assume the reasonable-seeming condition that if $y_{1:\infty}$ is an action sequence and $\{y^n_{1:\infty}\}_{n=1}^\infty$ is a sequence of action sequences that converges to $y_{1:\infty}$ (by which I mean $\forall k \in \mathbb{N}\ \exists N \in \mathbb{N}\ \forall n > N\ y^n_k = y_k$), then $\lim_{n \to \infty} U(q, y^n_{1:\infty}) = U(q, y_{1:\infty})$.
Under that assumption, the supremum is in fact a maximum, and the formula gives you an action sequence that will reach that maximum (proof below).
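The toy example, and why it violates the continuity condition, can be checked numerically. In this Python sketch an infinite action sequence is represented by a finite prefix assumed to be followed by $a$ forever; the utilities of the sequences $a^{n-1} b\, a\, a \dots$ creep toward $1$, yet those sequences converge pointwise to the all-$a$ sequence, whose utility is $0$.

```python
# The toy U from the example above: actions over Y = {"a", "b"};
# U = 1 - 1/n if the first "b" occurs at step n, and 0 if there is no "b".

def U(prefix):
    # prefix: finite list of actions; everything after it is taken to be "a".
    for n, y in enumerate(prefix, start=1):
        if y == "b":
            return 1 - 1 / n
    return 0.0  # no "b" ever

# The supremum over sequences is 1, but no single sequence attains it:
values = [U(["a"] * (n - 1) + ["b"]) for n in range(1, 6)]
assert values[0] == 0.0 and values[1] == 0.5
assert values == sorted(values)  # strictly creeping up toward 1

# Yet these sequences converge pointwise to the all-"a" sequence, whose
# utility is 0 -- so this U violates the continuity condition, which is
# exactly why supremum-chasing fails on it.
assert U([]) == 0.0
```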
If you don't like the condition I imposed on $U$, you might not be satisfied by this. But without it, there is not necessarily a best policy. One thing you can do is, on step 1, pick some extremely small $\varepsilon > 0$, pick any element from
$$\left\{ p^* \in P : \sum_{q \in Q} U(q, y^{p^*q}_{1:\infty}) \, 2^{-\ell(q)} > \left( \sup_{p \in P} \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} \right) - \varepsilon \right\},$$

and then follow that policy for the rest of eternity, which guarantees that you do not miss out on more than $\varepsilon$ of expected utility.
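A minimal sketch of this $\varepsilon$-fallback, assuming a hypothetical finite pool of candidate policies standing in for $P$, with their expected utilities already computed:

```python
# Epsilon-fallback sketch: commit to any policy whose expected utility
# exceeds the supremum minus epsilon, then follow it forever.

def pick_eps_optimal(policy_values, eps):
    # policy_values: dict mapping (hypothetical) policy name -> expected utility.
    best = max(policy_values.values())  # stands in for the supremum over P
    for name, v in policy_values.items():
        if v > best - eps:
            return name  # loses at most eps of expected utility

candidates = {"p1": 0.90, "p2": 0.97, "p3": 0.99}
assert pick_eps_optimal(candidates, eps=0.05) == "p2"  # first within 0.05 of 0.99
```

With an infinite $P$ the maximum becomes a supremum and no exact maximizer need exist, but the same selection rule still returns some element of the set displayed above.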
Proof of criterion for supremum-chasing working
definition: If $y_{1:\infty}$ is an action sequence and $\{y^n_{1:\infty}\}_{n=1}^\infty$ is an infinite sequence of action sequences such that $\forall k \in \mathbb{N}\ \exists N \in \mathbb{N}\ \forall n > N\ y^n_k = y_k$, then we say $\{y^n_{1:\infty}\}_{n=1}^\infty$ converges to $y_{1:\infty}$. If $p$ is a policy and $\{p_n\}_{n=1}^\infty$ is a sequence of policies such that $\forall k \in \mathbb{N}\ \forall x_{<k} \in X^{k-1}\ \exists N \in \mathbb{N}\ \forall n > N\ p(x_{<k}) = p_n(x_{<k})$, then we say $\{p_n\}_{n=1}^\infty$ converges to $p$.
assumption (for lemma 2 and theorem): If $\{y^n_{1:\infty}\}_{n=1}^\infty$ converges to $y_{1:\infty}$, then $\lim_{n \to \infty} U(q, y^n_{1:\infty}) = U(q, y_{1:\infty})$.
lemma 1: The agent described by
$$\dot{y}_k := \arg\max_{y_k \in Y} \sup_{p \in P :\, p(\dot{x}_{<k}) = \dot{y}_{<k} y_k} \sum_{q \in Q :\, q(\dot{y}_{<k}) = \dot{x}_{<k}} U(q, \dot{y}_{<k} y_k y^{pq}_{k+1:\infty}) \, 2^{-\ell(q)}$$

follows a policy that is the limit of a sequence of policies $\{p_n\}_{n=1}^\infty$ such that
$$\lim_{n \to \infty} \sum_{q \in Q} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} = \sup_{p \in P} \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)}.$$
proof of lemma 1: Any policy can be completely described by the last action it outputs for every finite observation sequence. Observations are returned by a program, so the set of possible finite observation sequences is countable. It is possible to fix the last action returned on any particular finite observation sequence to be the argmax, and still get arbitrarily close to the supremum with suitable choices for the last action returned on the other finite observation sequences. By induction, it is possible to get arbitrarily close to the supremum while fixing the last action returned to be the argmax for any finite set of finite observation sequences. Thus, there exists a sequence of policies approaching the policy that the agent implements whose expected utilities approach the supremum.
lemma 2: If $p$ is a policy and $\{p_n\}_{n=1}^\infty$ is a sequence of policies converging to $p$, then
$$\sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} = \lim_{n \to \infty} \sum_{q \in Q} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)}.$$
proof of lemma 2: Let $\varepsilon > 0$. On any given sequence of inputs $x_{1:\infty} \in X^\infty$, $\{p_n(x_{1:\infty})\}_{n=1}^\infty$ converges to $p(x_{1:\infty})$, so, by the assumption, $\forall q \in Q\ \exists N \in \mathbb{N}\ \forall n \ge N\ \left| U(q, y^{pq}_{1:\infty}) - U(q, y^{p_n q}_{1:\infty}) \right| < \frac{\varepsilon}{2}$.
For each $N \in \mathbb{N}$, let $Q_N := \left\{ q \in Q : \forall n \ge N\ \left| U(q, y^{pq}_{1:\infty}) - U(q, y^{p_n q}_{1:\infty}) \right| < \frac{\varepsilon}{2} \right\}$. The previous statement implies that $\bigcup_{N \in \mathbb{N}} Q_N = Q$, and each element of $\{Q_N\}_{N \in \mathbb{N}}$ is a subset of the next, so
$$\exists N \in \mathbb{N}\quad \sum_{q \in Q \setminus Q_N} 2^{-\ell(q)} < \frac{\varepsilon}{2 (\sup U - \inf U)}.$$

The range of $U$ is bounded, so $\sup U$ and $\inf U$ are defined. This also implies that the difference in expected utility, given any information, between any two policies is bounded. More formally,

$$\forall Q' \subset Q\ \forall p', p'' \in P\quad \left| \frac{\sum_{q \in Q'} U(q, y^{p'q}_{1:\infty}) \, 2^{-\ell(q)}}{\sum_{q \in Q'} 2^{-\ell(q)}} - \frac{\sum_{q \in Q'} U(q, y^{p''q}_{1:\infty}) \, 2^{-\ell(q)}}{\sum_{q \in Q'} 2^{-\ell(q)}} \right| \le \sup U - \inf U,$$
so in particular, for any $n \ge N$,

$$\left| \left( \sum_{q \in Q \setminus Q_N} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} \right) - \left( \sum_{q \in Q \setminus Q_N} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} \right) \right| \le (\sup U - \inf U) \sum_{q \in Q \setminus Q_N} 2^{-\ell(q)} < \frac{\varepsilon}{2}.$$

Combining this with the definition of $Q_N$ via the triangle inequality, for any $n \ge N$,

$$\left| \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} - \sum_{q \in Q} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} \right| \le \left| \sum_{q \in Q_N} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} - \sum_{q \in Q_N} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} \right| + \left| \sum_{q \in Q \setminus Q_N} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)} - \sum_{q \in Q \setminus Q_N} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} \right| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.$$

Since such an $N$ exists for every $\varepsilon > 0$, the limit follows.
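The head/tail bookkeeping in this proof can be sanity-checked numerically. This sketch uses six made-up programs with $U$ bounded in $[0, 1]$ (so $\sup U - \inf U = 1$): the tail $Q \setminus Q_N$ carries weight below $\varepsilon/2$, and the total gap between the two policies' expected utilities stays below $\varepsilon$.

```python
# Numeric sanity check of the head/tail split in lemma 2, with made-up numbers.

eps = 0.1
weights = [2 ** -l for l in (1, 2, 3, 4, 10, 11)]  # 2^-l(q) for six toy programs
U_p  = [0.50, 0.60, 0.70, 0.80, 0.00, 1.00]        # U(q, y^{pq}) per program
U_pn = [0.51, 0.58, 0.73, 0.79, 1.00, 0.00]        # U(q, y^{p_n q}) per program

# Head Q_N: programs whose utility gap is < eps/2; tail: everything else.
head = [i for i in range(6) if abs(U_p[i] - U_pn[i]) < eps / 2]
tail = [i for i in range(6) if i not in head]

# The tail weight 2^-10 + 2^-11 is below eps / (2 * (sup U - inf U)) = eps/2.
assert sum(weights[i] for i in tail) < eps / 2

# So the difference in weighted expected utility stays below eps/2 + eps/2.
total_gap = abs(sum((U_p[i] - U_pn[i]) * weights[i] for i in range(6)))
assert total_gap < eps
```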
theorem:
$$\sum_{\dot{q} \in Q} U(\dot{q}, \dot{y}_{1:\infty}) \, 2^{-\ell(\dot{q})} = \sup_{p \in P} \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)},$$
where

$$\dot{y}_k := \arg\max_{y_k \in Y} \sup_{p \in P :\, p(\dot{x}_{<k}) = \dot{y}_{<k} y_k} \sum_{q \in Q :\, q(\dot{y}_{<k}) = \dot{x}_{<k}} U(q, \dot{y}_{<k} y_k y^{pq}_{k+1:\infty}) \, 2^{-\ell(q)}.$$
proof of theorem: Let's call the policy implemented by the agent $p^*$.
By lemma 1, there is a sequence of policies $\{p_n\}_{n=1}^\infty$ converging to $p^*$ such that
$$\lim_{n \to \infty} \sum_{q \in Q} U(q, y^{p_n q}_{1:\infty}) \, 2^{-\ell(q)} = \sup_{p \in P} \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)}.$$
By lemma 2,
$$\sum_{q \in Q} U(q, y^{p^* q}_{1:\infty}) \, 2^{-\ell(q)} = \sup_{p \in P} \sum_{q \in Q} U(q, y^{pq}_{1:\infty}) \, 2^{-\ell(q)}.$$