A utility-maximizing variant of AIXI

Response to: Universal agents and utility functions

Related approaches: Hibbard (2012), Hay (2005)


Here is the function implemented by finite-lifetime AI\xi:

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\left(r\left(x_{k}\right)+...+r\left(x_{m}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)},

where m is the number of steps in the lifetime of the agent, k is the current step being computed, X is the set of possible observations, Y is the set of possible actions, and r is a function that extracts a reward value from an observation. A dot over a variable indicates that its value is known to be the true value of the action or observation it represents, and underlines indicate that the variable is an input to a probability distribution. \xi is a function that returns the probability of a sequence of observations, given a certain known history and sequence of actions, starting from the Solomonoff prior. More formally,

{\displaystyle \xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)=\left(\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}2^{-\ell\left(q\right)}\right)},

where Q is the set of all programs, \ell is a function that returns the length of a program in bits, and a program applied to a sequence of actions returns the resulting sequence of observations. Notice that the denominator is a constant, depending only on the already known \dot{y}\dot{x}_{<k}, and multiplying by a positive constant does not change the argmax, so we can pretend that the denominator doesn't exist. If q is a valid program, then any longer program with q as a prefix is not a valid program, so {\displaystyle \sum_{q\in Q}2^{-\ell\left(q\right)}\leq1}.
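To make the nested max/sum structure concrete, here is a brute-force sketch in Python. It is purely illustrative, not Hutter's construction: the "programs" are two hand-picked environment functions with assumed bit-lengths (rather than a real enumeration of programs), the reward of an observation is the observation itself, and the unnormalized \xi weight is applied at the leaves.

```python
# Toy sketch of the finite-lifetime expectimax (illustrative assumptions only).
# An "environment" maps a tuple of actions to a tuple of observations, and
# carries an assumed description length in bits; r(x) = x.
ACTIONS = (0, 1)  # Y
OBS = (0, 1)      # X
ENVS = [
    (lambda ys: ys, 2),                        # echoes actions; "length" 2 bits
    (lambda ys: tuple(1 - y for y in ys), 3),  # inverts actions; "length" 3 bits
]

def value(acts, obs, m):
    """Nested max/sum value. At depth m, weight the summed rewards by the
    total 2^-length mass of environments consistent with (acts, obs)."""
    if len(acts) == m:
        weight = sum(2.0 ** -l for env, l in ENVS if env(acts) == obs)
        return sum(obs) * weight
    return max(
        sum(value(acts + (y,), obs + (x,), m) for x in OBS)
        for y in ACTIONS
    )

def first_action(m):
    """The argmax over the first action (the \\dot{y}_1 of the formula)."""
    return max(ACTIONS, key=lambda y: sum(value((y,), (x,), m) for x in OBS))
```

Here the echoing environment carries more prior mass than the inverting one, so the agent plays 1, which is observed (and hence rewarded) as 1 in the likelier world.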


A problem with this is that it can only optimize over the input it receives, not over aspects of the external world that it cannot observe. Given the chance, AI\xi would hack its input channel so that it would only observe good things, instead of trying to make good things happen (in other words, it would wirehead itself). Anja specified a variant of AI\xi in which she replaced the sum of rewards with a single utility value and made the domain of the utility function be the entire sequence of actions and observations instead of a single observation, like so:

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}U\left(\dot{y}\dot{x}_{<k}yx_{k:m}\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)}.

This doesn’t really solve the problem, because the utility function still only takes what the agent can see, rather than what is actually going on outside the agent. The situation is a little better because the utility function also takes the agent’s actions into account, so it could punish actions that look like the agent is trying to wirehead itself. But if there were a flaw in the instructions not to wirehead, the agent would exploit it, so the incentive not to wirehead would have to be perfect, and this formulation is not very enlightening about how to do that.

[Edit: Hibbard (2012) also presents a solution to this problem. I haven’t read all of it yet, but it appears to be fairly different from what I suggest in the next section.]


Here’s what I suggest instead: everything that happens is determined by the program that the world is running on and the agent’s actions, so the domain of the utility function should be Q\times Y^{m}. The apparent problem with that is that the formula for AI\xi does not contain any mention of elements of Q. If we just take the original formula and replace r\left(x_{k}\right)+...+r\left(x_{m}\right) with U\left(q,\dot{y}_{<k}y_{k:m}\right), it wouldn’t make any sense. However, if we expand out \xi in the original formula (excluding the unnecessary denominator), we can move the sum of rewards inside the sum over programs, like this:

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}\left(r\left(x_{k}\right)+...+r\left(x_{m}\right)\right)2^{-\ell\left(q\right)}}.

Now it is easy to replace the sum of rewards with the desired utility function.

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}U\left(q,\dot{y}_{<k}y_{k:m}\right)2^{-\ell\left(q\right)}}.

With this formulation, there is no danger of the agent wireheading, and all U has to do is compute everything that happens when the agent performs a given sequence of actions in a given program, and decide how desirable it is. If the range of U is unbounded, then this might not converge. Let’s assume throughout this post that the range of U is bounded.
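Continuing the toy Python sketch from before (again with hand-picked environments and a made-up utility function, here called U_example), the only change from the reward-based version is at the leaves: each program consistent with the observation sequence contributes U(q, actions) times 2^{-\ell(q)}, instead of a reward extracted from the observations.

```python
# Toy sketch of the utility-based leaf (illustrative assumptions only).
ACTIONS = (0, 1)  # Y
OBS = (0, 1)      # X
ENVS = [
    (lambda ys: ys, 2),                        # echoes actions; "length" 2 bits
    (lambda ys: tuple(1 - y for y in ys), 3),  # inverts actions; "length" 3 bits
]

def value_u(acts, obs, m, U):
    # Leaf: sum U(program, action sequence) * 2^-length over the programs
    # consistent with the observation sequence.
    if len(acts) == m:
        return sum(U(env, acts) * 2.0 ** -l
                   for env, l in ENVS if env(acts) == obs)
    return max(
        sum(value_u(acts + (y,), obs + (x,), m, U) for x in OBS)
        for y in ACTIONS
    )

# A made-up utility that ignores the program and just likes action 1:
U_example = lambda env, acts: sum(acts)

def first_action_u(m, U):
    return max(ACTIONS, key=lambda y: sum(value_u((y,), (x,), m, U) for x in OBS))
```

Because U is scored against the program and the action sequence directly, there is no observation channel for the agent to manipulate in this evaluation.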

[Edit: Hay (2005) presents a formulation similar to this.]

Extension to infinite lifetimes

The previous discussion assumed that the agent would only have the opportunity to perform a finite number of actions. The situation gets a little tricky when the agent is allowed to perform an unbounded number of actions. Hutter uses a finite look-ahead approach for AI\xi, where on each step k, it pretends that it will only be performing m_{k} actions, where \forall k\, m_{k}\gg k.

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\left(r\left(x_{k}\right)+...+r\left(x_{m_{k}}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_{k}}\right)}.

If we make the same modification to the utility-based variant, we get

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\sum_{q\in Q:q\left(y_{\leq m_{k}}\right)=x_{\leq m_{k}}}U\left(q,\dot{y}_{<k}y_{k:m_{k}}\right)2^{-\ell\left(q\right)}}.

This is unsatisfactory because the domain of U was supposed to consist of all the information necessary to determine everything that happens, but here, it is missing all the actions after step m_{k}. One obvious thing to try is to set m_{k}:=\infty. This will be easier to do using a compacted expression for AI\xi:

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\max_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}\left(r\left(x_{k}^{pq}\right)+...+r\left(x_{m_{k}}^{pq}\right)\right)2^{-\ell\left(q\right)}},

where P is the set of policies that map sequences of observations to sequences of actions, and x_{i}^{pq} is shorthand for the last observation in the sequence returned by q\left(p\left(\dot{x}_{<k}x_{k:i-1}^{pq}\right)\right). If we take this compacted formulation, modify it to accommodate the new utility function, set m_{k}:=\infty, and replace the maximum with a supremum (since there is an infinite number of possible policies), we get

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)}},

where y_{i}^{pq} is shorthand for the last action in the sequence returned by p\left(q\left(\dot{y}_{<k}y_{k}y_{k+1:i-1}^{pq}\right)\right).

But there is a problem with this, which I will illustrate with a toy example. Suppose Y:=\left\{ a,b\right\}, and U\left(q,y_{1:\infty}\right)=0 when \forall k\in\mathbb{N}\, y_{k}=a, and for any n\in\mathbb{N}, U\left(q,y_{1:\infty}\right)=1-\frac{1}{n} when y_{n}=b and \forall k<n\, y_{k}=a. (U does not depend on the program q in this example.) An agent following the above formula would output a on every step, and end up with a utility of 0, when it could have gotten arbitrarily close to 1 by eventually outputting b.
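A minimal numeric rendering of this toy example, where first_b_step stands for the first step at which the agent outputs b, and None means it never does:

```python
# Utility in the toy example: playing a forever scores 0; first playing b
# at step n scores 1 - 1/n, which increases toward (but never reaches) 1.
def toy_utility(first_b_step):
    if first_b_step is None:  # the all-a sequence
        return 0.0
    return 1.0 - 1.0 / first_b_step
```

Since delaying b is always strictly better, every candidate sequence is beaten by another one, and the pointwise limit of the improving sequences (a forever) is the worst of all; that is exactly how chasing the supremum step by step goes wrong here.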

To avoid problems like that, we could assume the reasonable-seeming condition that if y_{1:\infty} is an action sequence and \left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty} is a sequence of action sequences that converges to y_{1:\infty} (by which I mean \forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}), then {\displaystyle \lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)}.

Under that assumption, the supremum is in fact a maximum, and the formula gives you an action sequence that will reach that maximum (proof below).

If you don’t like the condition I imposed on U, you might not be satisfied by this. But without it, there is not necessarily a best policy. One thing you can do is, on step 1, pick some extremely small \varepsilon>0, pick any element from

{\displaystyle \left\{ p^{*}\in P:\sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}>\left(\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\varepsilon\right\} }, and then follow that policy for the rest of eternity, which will guarantee that you do not miss out on more than \varepsilon of expected utility.
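As a trivial finite illustration of this \varepsilon-fallback (hypothetical policy names and values throughout; in the actual setting the set of policies is infinite and the supremum need not be attained by any of them):

```python
# Pick any policy whose (toy, precomputed) expected utility is within eps
# of the best available one; the names and numbers here are made up.
def pick_eps_optimal(policy_values, eps):
    # max over a finite dict stands in for the supremum over all policies
    target = max(policy_values.values()) - eps
    return next(name for name, v in policy_values.items() if v > target)

choice = pick_eps_optimal({"p1": 0.50, "p2": 0.90, "p3": 0.89}, eps=0.05)
```

Any element of the near-optimal set will do; committing to one of them once and for all is what bounds the lost expected utility by \varepsilon.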

Proof that supremum-chasing works under the above criterion

definition: If y_{1:\infty} is an action sequence and \left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty} is an infinite sequence of action sequences, and \forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}, then we say \left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty} converges to y_{1:\infty}. If p is a policy and \left\{ p_{n}\right\} _{n=1}^{\infty} is a sequence of policies, and \forall k\in\mathbb{N}\,\forall x_{<k}\in X^{k-1}\,\exists N\in\mathbb{N}\,\forall n>N\, p\left(x_{<k}\right)=p_{n}\left(x_{<k}\right), then we say \left\{ p_{n}\right\} _{n=1}^{\infty} converges to p.

assumption (for lemma 2 and theorem): If \left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty} converges to y_{1:\infty}, then {\displaystyle \lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)}.

lemma 1: The agent described by

{\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)}} follows a policy that is the limit of a sequence of policies \left\{ p_{n}\right\} _{n=1}^{\infty} such that

{\displaystyle \lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}.

proof of lemma 1: Any policy can be completely described by the last action it outputs for every finite observation sequence. Observations are returned by a program, so the set of possible finite observation sequences is countable. It is possible to fix the last action returned on any particular finite observation sequence to be the argmax, and still get arbitrarily close to the supremum with suitable choices for the last action returned on the other finite observation sequences. By induction, it is possible to get arbitrarily close to the supremum while fixing the last action returned to be the argmax for any finite set of finite observation sequences. Thus, there exists a sequence of policies approaching the policy that the agent implements whose expected utilities approach the supremum.

lemma 2: If p is a policy and \left\{ p_{n}\right\} _{n=1}^{\infty} is a sequence of policies converging to p, then

{\displaystyle \sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}=\lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}}.

proof of lemma 2: Let \varepsilon>0. On any given sequence of inputs x_{1:\infty}\in X^{\infty}, \left\{ p_{n}\left(x_{1:\infty}\right)\right\} _{n=1}^{\infty} converges to p\left(x_{1:\infty}\right), so, by the assumption, \forall q\in Q\,\exists N\in\mathbb{N}\,\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}.

For each N\in\mathbb{N}, let Q_{N}:=\left\{ q\in Q:\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}\right\}. The previous statement implies that {\displaystyle \bigcup_{N\in\mathbb{N}}Q_{N}=Q}, and each element of \left\{ Q_{N}\right\} _{N\in\mathbb{N}} is a subset of the next, so

{\displaystyle \exists N\in\mathbb{N}\,\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2\left(\sup U-\inf U\right)}}. The range of U is bounded, so \sup U and \inf U are defined. This also implies that the difference in expected utility, given any information, of any two policies is bounded. More formally,

{\displaystyle \forall Q'\subset Q\,\forall p',p''\in P\,\left|\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p'q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)-\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p''q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)\right|\leq\sup U-\inf U},

so in particular,

{\displaystyle \left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left(\sup U-\inf U\right)\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2}}.

{\displaystyle \left|\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left|\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|+\left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|<\frac{\varepsilon}{2}+\frac{\varepsilon}{2}=\varepsilon}. The same bound holds with p_{n} in place of p_{N} for every n\geq N, and \varepsilon was arbitrary, so the limit follows.


theorem:

{\displaystyle \sum_{\dot{q}\in Q}U\left(\dot{q},\dot{y}_{1:\infty}\right)2^{-\ell\left(\dot{q}\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}},

where {\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}.

proof of theorem: Let’s call the policy implemented by the agent p^{*}.

By lemma 1, there is a sequence of policies \left\{ p_{n}\right\} _{n=1}^{\infty} converging to p^{*} such that

{\displaystyle \lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}.

By lemma 2,

{\displaystyle \sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}.