# A utility-maximizing variant of AIXI

Response to: Universal agents and utility functions

Related approaches: Hibbard (2012), Hay (2005)

## Background

Here is the function implemented by finite-lifetime AI$\xi$:

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\left(r\left(x_{k}\right)+...+r\left(x_{m}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)}$,

where $m$ is the number of steps in the lifetime of the agent, $k$ is the current step being computed, $X$ is the set of possible observations, $Y$ is the set of possible actions, $r$ is a function that extracts a reward value from an observation, a dot over a variable represents that its value is known to be the true value of the action or observation it represents, underlines represent that the variable is an input to a probability distribution, and $\xi$ is a function that returns the probability of a sequence of observations, given a certain known history and sequence of actions, and starting from the Solomonoff prior. More formally,

${\displaystyle \xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right):=\frac{\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}2^{-\ell\left(q\right)}}{\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}2^{-\ell\left(q\right)}}}$,

where $Q$ is the set of all programs, $\ell$ is a function that returns the length of a program in bits, and a program applied to a sequence of actions returns the resulting sequence of observations. Notice that the denominator is a constant, depending only on the already-known $\dot{y}\dot{x}_{<k}$, and multiplying by a positive constant does not change the argmax, so we can pretend that the denominator doesn’t exist. If $q$ is a valid program, then any longer program with $q$ as a prefix is not a valid program, so ${\displaystyle \sum_{q\in Q}2^{-\ell\left(q\right)}\leq1}$.
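To make the expectimax concrete, here is a minimal Python sketch, assuming a tiny hand-enumerated environment class in place of $Q$ and treating the observation itself as the reward; none of these specifics are part of the formalism, and the denominator is dropped as above.

```python
# Toy sketch of the finite-lifetime expectimax (denominator dropped).
# A tiny hand-picked class of environments stands in for Q; each "program"
# maps an action sequence to an observation sequence and carries a length
# in bits, giving it prior weight 2^-length.
ACTIONS = [0, 1]
OBSERVATIONS = [0, 1]

def make_env(bit, length):
    # An environment that emits `bit` on every step, ignoring the actions.
    return {"run": lambda ys, b=bit: [b] * len(ys), "len": length}

PROGRAMS = [make_env(0, 2), make_env(1, 3)]

def r(x):
    # r extracts a reward from an observation; here the observation is the reward.
    return x

def value(ys, xs, m):
    """Best achievable (unnormalized) expected reward-sum, continuing the
    history of actions ys and observations xs out to horizon m."""
    if len(xs) == m:
        # weight by the programs consistent with the whole interaction
        weight = sum(2.0 ** -q["len"] for q in PROGRAMS if q["run"](ys) == xs)
        return sum(r(x) for x in xs) * weight
    # max over the next action of the sum over the next observation
    return max(sum(value(ys + [y], xs + [x], m) for x in OBSERVATIONS)
               for y in ACTIONS)

print(value([], [], 2))  # both toy environments ignore the actions here
```

Since these toy environments ignore the agent’s actions, every action sequence scores the same; the point is only the shape of the alternating max/sum and the $2^{-\ell(q)}$ weighting.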

## Problem

A problem with this is that it can only optimize over the input it receives, not over aspects of the external world that it cannot observe. Given the chance, AI$\xi$ would hack its input channel so that it would only observe good things, instead of trying to make good things happen (in other words, it would wirehead itself). Anja specified a variant of AI$\xi$ in which she replaced the sum of rewards with a single utility value and made the domain of the utility function be the entire sequence of actions and observations instead of a single observation, like so:

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}U\left(\dot{y}\dot{x}_{<k}yx_{k:m}\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)}$.

This doesn’t really solve the problem, because the utility function still only takes what the agent can see, rather than what is actually going on outside the agent. The situation is a little better because the utility function also takes the agent’s actions into account, so it could punish actions that look like the agent is trying to wirehead itself. But if there were a flaw in the instructions not to wirehead, the agent would exploit it, so the incentive not to wirehead would have to be perfect, and this formulation is not very enlightening about how to achieve that.

[Edit: Hibbard (2012) also presents a solution to this problem. I haven’t read all of it yet, but it appears to be fairly different from what I suggest in the next section.]

## Solution

Here’s what I suggest instead: everything that happens is determined by the program that the world is running on and the agent’s actions, so the domain of the utility function should be $Q\times Y^{m}$. The apparent problem with that is that the formula for AI$\xi$ does not contain any mention of elements of $Q$. If we just take the original formula and replace $r\left(x_{k}\right)+...+r\left(x_{m}\right)$ with $U\left(q,\dot{y}_{<k}y_{k:m}\right)$, it wouldn’t make any sense. However, if we expand out $\xi$ in the original formula (excluding the unnecessary denominator), we can move the sum of rewards inside the sum over programs, like this:

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}\left(r\left(x_{k}\right)+...+r\left(x_{m}\right)\right)2^{-\ell\left(q\right)}}$.

Now it is easy to replace the sum of rewards with the desired utility function.

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}U\left(q,\dot{y}_{<k}y_{k:m}\right)2^{-\ell\left(q\right)}}$.

With this formulation, there is no danger of the agent wireheading, and all $U$ has to do is compute everything that happens when the agent performs a given sequence of actions in a given program, and decide how desirable it is. If the range of $U$ is unbounded, then this might not converge. Let’s assume throughout this post that the range of $U$ is bounded.
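Collapsing the observation sums (i.e., fixing a full action sequence in advance), the inner quantity being optimized is a prior-weighted sum of $U$ over programs. A sketch, with a hypothetical two-program class and a made-up bounded $U$:

```python
# Prior-weighted expected utility of a fixed action sequence: the inner sum
# over programs from the formula above, with toy stand-ins for Q and U.
def expected_utility(actions, programs, U):
    return sum(U(q, actions) * 2.0 ** -q["len"] for q in programs)

# Illustrative program class: each "program" is just a label and a length.
programs = [{"name": "q1", "len": 1}, {"name": "q2", "len": 2}]

# Illustrative bounded utility in [0, 1]: q1 prefers action 1, q2 prefers 0.
def U(q, actions):
    target = 1 if q["name"] == "q1" else 0
    return sum(1.0 for y in actions if y == target) / max(len(actions), 1)

print(expected_utility([1, 1], programs, U))  # 0.5 * 1 + 0.25 * 0 = 0.5
```

Note that $U$ here sees the program itself, not the observations it produces, which is the whole anti-wireheading point of the formulation.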

[Edit: Hay (2005) presents a similar formulation to this.]

The previous discussion assumed that the agent would only have the opportunity to perform a finite number of actions. The situation gets a little tricky when the agent is allowed to perform an unbounded number of actions. Hutter uses a finite look-ahead approach for AI$\xi$, where on each step $k$, it pretends that it will only be performing $m_{k}$ actions, where $\forall k\, m_{k}\gg k$.

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\left(r\left(x_{k}\right)+...+r\left(x_{m_{k}}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_{k}}\right)}$.

If we make the same modification to the utility-based variant, we get

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}...\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\sum_{q\in Q:q\left(y_{\leq m_{k}}\right)=x_{\leq m_{k}}}U\left(q,\dot{y}_{<k}y_{k:m_{k}}\right)2^{-\ell\left(q\right)}}$.

This is unsatisfactory because the domain of $U$ was supposed to consist of all the information necessary to determine everything that happens, but here, it is missing all the actions after step $m_{k}$. One obvious thing to try is to set $m_{k}:=\infty$. This will be easier to do using a compacted expression for AI$\xi$:

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\max_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}\left(r\left(x_{k}^{pq}\right)+...+r\left(x_{m_{k}}^{pq}\right)\right)2^{-\ell\left(q\right)}}$,

where $P$ is the set of policies that map sequences of observations to sequences of actions and $x_{i}^{pq}$ is shorthand for the last observation in the sequence returned by $q\left(p\left(\dot{x}_{<k}x_{k:i-1}^{pq}\right)\right)$. If we take this compacted formulation, modify it to accommodate the new utility function, set $m_{k}:=\infty$, and replace the maximum with a supremum (since there is an infinite number of possible policies), we get

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$,

where $y_{i}^{pq}$ is shorthand for the last action in the sequence returned by $p\left(q\left(\dot{y}_{<k}y_{k:i-1}^{pq}\right)\right)$.
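The interaction sequences these shorthands describe can be sketched as an alternating rollout. The concrete policy and program below are hypothetical stand-ins, and for simplicity they are written as next-step functions rather than the sequence-to-sequence maps used in the formalism:

```python
# Rolling out a policy p against a program q: p maps the observation
# history to the next action, q maps the action history so far to the
# next observation. The returned ys, xs are the interaction sequences
# y^{pq} and x^{pq}, truncated to `steps` interactions.
def rollout(p, q, steps):
    ys, xs = [], []
    for _ in range(steps):
        ys.append(p(xs))   # next action from the observations so far
        xs.append(q(ys))   # next observation from the actions so far
    return ys, xs

# Hypothetical examples: a policy that flips the last observation
# (starting with 1), against a program that echoes the last action.
policy = lambda xs: 1 - xs[-1] if xs else 1
program = lambda ys: ys[-1]

print(rollout(policy, program, 4))
```

Each index of the rollout depends only on the earlier entries, which is why $y_{i}^{pq}$ and $x_{i}^{pq}$ are well defined by mutual recursion.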

But there is a problem with this, which I will illustrate with a toy example. Suppose $Y:=\left\{ a,b\right\}$, and $U\left(q,y_{1:\infty}\right)=0$ when $\forall k\in\mathbb{N}\, y_{k}=a$, and for any $n\in\mathbb{N}$, $U\left(q,y_{1:\infty}\right)=1-\frac{1}{n}$ when $y_{n}=b$ and $\forall k<n\, y_{k}=a$. ($U$ does not depend on the program $q$ in this example.) An agent following the above formula would output $a$ on every step, and end up with a utility of $0$, when it could have gotten arbitrarily close to $1$ by eventually outputting $b$.
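The counterexample can be checked numerically; here an action sequence is represented just by the index of its first $b$ (with `math.inf` standing for the all-$a$ sequence), which is all this $U$ depends on:

```python
import math

# U for the toy example: 0 on the all-a sequence; 1 - 1/n if the first b
# occurs at step n (with a's before it). The program argument is omitted
# since U ignores it in this example.
def U(first_b):
    if math.isinf(first_b):
        return 0.0
    return 1.0 - 1.0 / first_b

# Delaying the b is always strictly better, so the supremum over action
# sequences is 1; but the limiting sequence (all a's) scores 0.
values = [U(n) for n in range(1, 1001)]
assert all(values[i] < values[i + 1] for i in range(len(values) - 1))
print(max(values), U(math.inf))
```

Supremum-chasing picks $a$ at every step because every continuation through a later $b$ dominates an immediate $b$, yet the realized sequence scores $0$.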

To avoid problems like that, we could assume the reasonable-seeming condition that if $y_{1:\infty}$ is an action sequence and $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ is a sequence of action sequences that converges to $y_{1:\infty}$ (by which I mean $\forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}$), then ${\displaystyle \lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)}$.

Under that assumption, the supremum is in fact a maximum, and the formula gives you an action sequence that will reach that maximum (proof below).

If you don’t like the condition I imposed on $U$, you might not be satisfied by this. But without it, there is not necessarily a best policy. One thing you can do is, on step 1, pick some extremely small $\varepsilon>0$, pick any element from

${\displaystyle \left\{ p^{*}\in P:\sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}>\left(\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\varepsilon\right\} }$, and then follow that policy for the rest of eternity, which will guarantee that you do not miss out on more than $\varepsilon$ of expected utility.

## Proof of criterion for supremum-chasing working

definition: If $y_{1:\infty}$ is an action sequence and $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ is an infinite sequence of action sequences, and $\forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}$, then we say $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ converges to $y_{1:\infty}$. If $p$ is a policy and $\left\{ p_{n}\right\} _{n=1}^{\infty}$ is a sequence of policies, and $\forall k\in\mathbb{N}\,\forall x_{<k}\in X^{k-1}\,\exists N\in\mathbb{N}\,\forall n>N\, p_{n}\left(x_{<k}\right)=p\left(x_{<k}\right)$, then we say $\left\{ p_{n}\right\} _{n=1}^{\infty}$ converges to $p$.

assumption (for lemma 2 and theorem): If $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ converges to $y_{1:\infty}$, then ${\displaystyle \lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)}$.

lemma 1: The agent described by

${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$ follows a policy that is the limit of a sequence of policies $\left\{ p_{n}\right\} _{n=1}^{\infty}$ such that

${\displaystyle \lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$.

proof of lemma 1: Any policy can be completely described by the last action it outputs for every finite observation sequence. Observations are returned by a program, so the set of possible finite observation sequences is countable. It is possible to fix the last action returned on any particular finite observation sequence to be the argmax, and still get arbitrarily close to the supremum with suitable choices for the last action returned on the other finite observation sequences. By induction, it is possible to get arbitrarily close to the supremum while fixing the last action returned to be the argmax for any finite set of finite observation sequences. Thus, there exists a sequence of policies approaching the policy that the agent implements whose expected utilities approach the supremum.

lemma 2: If $p$ is a policy and $\left\{ p_{n}\right\} _{n=1}^{\infty}$ is a sequence of policies converging to $p$, then

${\displaystyle \sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}=\lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}}$.

proof of lemma 2: Let $\varepsilon>0$. On any given sequence of inputs $x_{1:\infty}\in X^{\infty}$, $\left\{ p_{n}\left(x_{1:\infty}\right)\right\} _{n=1}^{\infty}$ converges to $p\left(x_{1:\infty}\right)$, so, by assumption, $\forall q\in Q\,\exists N\in\mathbb{N}\,\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}$.

For each $N\in\mathbb{N}$, let $Q_{N}:=\left\{ q\in Q:\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}\right\}$. The previous statement implies that ${\displaystyle \bigcup_{N\in\mathbb{N}}Q_{N}=Q}$, and each element of $\left\{ Q_{N}\right\} _{N\in\mathbb{N}}$ is a subset of the next, so

${\displaystyle \exists N\in\mathbb{N}\,\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2\left(\sup U-\inf U\right)}}$. The range of $U$ is bounded, so $\sup U$ and $\inf U$ are defined. This also implies that the difference in expected utility, given any information, of any two policies, is bounded. More formally: ${\displaystyle \forall Q'\subset Q\,\forall p',p''\in P\,\left|\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p'q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)-\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p''q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)\right|\leq\sup U-\inf U}$,
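As a numeric sanity check of this step (with illustrative prior weights and a $U$ ranging in $[0,1]$, all made up for the purpose), the tail’s contribution to the difference between two policies is indeed controlled by its prior mass:

```python
# Numeric sanity check of the tail bound used in lemma 2: the difference
# between two prior-weighted utility sums over a tail set of programs is
# at most (sup U - inf U) times the tail's total prior weight.
weights = [2.0 ** -l for l in range(1, 30)]   # stand-in prior weights 2^-l
u_p  = [0.9] * len(weights)                   # U(q, y^{pq}),   all in [0, 1]
u_pn = [0.1] * len(weights)                   # U(q, y^{p_N q}), all in [0, 1]

N = 10
tail = range(N, len(weights))
diff = abs(sum(u_p[i] * weights[i] for i in tail)
           - sum(u_pn[i] * weights[i] for i in tail))
tail_mass = sum(weights[i] for i in tail)
assert diff <= (1.0 - 0.0) * tail_mass  # sup U - inf U = 1 here
```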

so in particular,

${\displaystyle \left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left(\sup U-\inf U\right)\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2}}$.

${\displaystyle \left|\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left|\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|+\left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|<\frac{\varepsilon}{2}+\frac{\varepsilon}{2}=\varepsilon}$.

theorem:

${\displaystyle \sum_{\dot{q}\in Q}U\left(\dot{q},\dot{y}_{1:\infty}\right)2^{-\ell\left(\dot{q}\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$,

where ${\displaystyle \dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$.

proof of theorem: Let’s call the policy implemented by the agent $p^{*}$.

By lemma 1, there is a sequence of policies $\left\{ p_{n}\right\} _{n=1}^{\infty}$ converging to $p^{*}$ such that

${\displaystyle \lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$.

By lemma 2,

${\displaystyle \sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}}$.

• I like how you specify utility directly over programs; it describes very neatly how someone who sat down and wrote a utility function

$U\left(\dot{y}\dot{x}_{1:m_{k}}\right)$

would do it: first determine how the observation could have been computed by the environment, and then evaluate that situation. This is a special case of the framework I wrote down in the cited article; you can always set

${\displaystyle U\left(\dot{y}\dot{x}_{1:m_{k}}\right)=\sum_{q:q\left(y_{1:m_{k}}\right)=x_{1:m_{k}}}U\left(q,y_{1:m_{k}}\right)}$.

This solves wireheading only if we can specify which environments contain wireheaded (non-dualistic) agents, delusion boxes, etc.

• True, the U(program, action sequence) framework can be implemented within the U(action/observation sequence) framework, although you forgot to multiply by 2^-l(q) when describing how. I also don’t really like the finite look-ahead (until m_k) method, since it is dynamically inconsistent.

This solves wireheading only if we can specify which environments contain wireheaded (non-dualistic) agents, delusion boxes, etc.

Not sure what you mean by that.

• you forgot to multiply by 2^-l(q)

I think then you would count that twice, wouldn’t you? Because my original formula already contains the Solomonoff probability...

• Oh right. But you still want the probability weighting to be inside the sum, so you would actually need ${\displaystyle U\left(\dot{y}\dot{x}_{1:m_{k}}\right)=\frac{1}{\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_{k}}\right)}\sum_{q:q\left(y_{1:m_{k}}\right)=x_{1:m_{k}}}U\left(q,y_{1:m_{k}}\right)2^{-\ell\left(q\right)}}$.

• Let’s stick with delusion boxes for now, because assuming that we can read off from the environment whether the agent has wireheaded breaks dualism. So even if we specify utility directly over environments, we still need to master the task of specifying which action/environment combinations contain delusion boxes to evaluate them correctly. It is still the same problem, just phrased differently.

• If I understand you correctly, that sounds like a fairly straightforward problem for AIXI to solve. Some programs q_1 will mimic some other program q_2’s communication with the agent while doing something else in the background, but AIXI considers the possibilities of both q_1 and q_2.

• Thanks for putting in all this work!

I recommend you post this to the MAGIC list and ask for feedback. You’ll also see a discussion of Anja’s post there.

• Formulas aren’t being rendered correctly; they appear as LaTeX code.

• Looks like it’s back up now.

• That appears to be because the formulas are images hosted by codecogs.com, which is down.

• This is great!

I really like your use of $U\left(q,y_{1:\infty}\right)$. This seems to be an important step along the way to eliminating the horizon problem. I recently read in Orseau and Ring’s “Space-Time Embedded Intelligence” that in another paper, “Provably Bounded-Optimal Agents” by Russell and Subramanian, they define $V\left(\pi,q\right)=u\left(h\left(\pi,q\right)\right)$, where $h\left(\pi,q\right)$ generates the interaction history $ao_{1:\infty}$. (I have yet to read this paper.)

Your counterexample of supremum chasing is really great; it breaks my model of how utility maximization is supposed to go. I’m honestly not sure whether one should chase the path of U = 0 or not. This is clouded by the fact that the probabilistic nature of things will probably push you off that path eventually.

The dilemma reminds me a lot of exploring versus exploiting. Sometimes it seems to me that the rational thing for a utility maximizer to do, almost independent of the utility function, would be to just maximize the resources it controlled, until it found some “end” or limit, and then spend all its resources creating whatever it wanted in the first place. In the framework above we’ve specified that there is no “end” time, and AIXI is dualist, so there are no worries of it getting destroyed.

There’s something else that is strange to me. If we are considering infinite interaction histories, then we’re looking at the entire binary tree at once. But this tree has uncountably many paths! Almost all of the (infinite) paths are incomputable sequences. This means that any computable AI couldn’t even consider traversing them. And it also seems to have interesting things to say about the utility function. Does it only need to be defined over computable sequences? What if we have utility over incomputable sequences? These could be defined by second-order logic statements, but remain incomputable. It gives me lots of questions.

• I’m honestly not sure whether one should chase the path of U = 0 or not. This is clouded by the fact that the probabilistic nature of things will probably push you off that path eventually.

Making the assumption that there is a small probability that you will deviate from your current plan on each future move, and that these probabilities add up to a near-guarantee that you eventually will, has a more complicated effect on your planning than just justifying chasing the supremum.

For instance, consider this modification to the toy example I gave earlier. Y:={a,b,c}, and if the first b comes before the first c, then the resulting utility is 1 − 1/n, where n is the index of the first b (all previous elements being a), as before. But we’ll change it so that the utility of outputting an infinite stream of a is 1. If there is a c in your action sequence and it comes before the first b, then the utility you get is −1000. In this situation, supremum-chasing works just fine if you completely trust your future self: you output a every time, and get a utility of 1, the best you could possibly do. But if you think that there is a small risk that you could end up outputting c at some point, then eventually it will be worthwhile to output b, since the gain you could get from continuing to output a gets arbitrarily small compared to the loss from accidentally outputting c.
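This trade-off can be sketched numerically. Purely as an illustration, suppose each planned a independently comes out as c with probability eps; the expected utility of the plan “output a until step n, then b” is then maximized at a small finite n:

```python
# Expected utility of "output a for n-1 steps, then b", assuming each
# planned a slips to c with probability eps. A premature c scores -1000;
# reaching the b at step n scores 1 - 1/n. (The all-a plan scores 1 only
# if no slip ever happens, which almost surely fails as n grows.)
def plan_value(n, eps):
    safe = (1.0 - eps) ** (n - 1)   # probability of no slip before step n
    return safe * (1.0 - 1.0 / n) + (1.0 - safe) * -1000.0

eps = 1e-4
best_n = max(range(1, 10000), key=lambda n: plan_value(n, eps))
print(best_n, plan_value(best_n, eps))
```

With these made-up numbers the optimum is a small finite n: the marginal gain from waiting one more step shrinks like 1/n², while each extra step adds roughly 1000·eps of expected loss.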

There’s something else that is strange to me. If we are considering infinite interaction histories, then we’re looking at the entire binary tree at once. But this tree has uncountably many paths! Almost all of the (infinite) paths are incomputable sequences. This means that any computable AI couldn’t even consider traversing them. And it also seems to have interesting things to say about the utility function. Does it only need to be defined over computable sequences? What if we have utility over incomputable sequences? These could be defined by second-order logic statements, but remain incomputable. It gives me lots of questions.

I don’t really have answers to these questions. One thing you could do is replace the set of all policies ($P$) with the set of all computable policies, so that the agent would never output an uncomputable action sequence [Edit: oops, not true. You could consider only computable policies, but then end up at an uncomputable policy anyway by chasing the supremum].

• Making the assumption that...

Yeah, I was intentionally vague with “the probabilistic nature of things”. I am also thinking about how any AI will have logical uncertainty, uncertainty about the precision of its observations, et cetera, so that as it considers further points in the future, its distribution becomes flatter. And having a non-dualist framework would introduce uncertainty about the agent’s self, its utility function, its memory, …

• I think there is something off with the formulas that use policies: If you already choose the policy

$p:p\left(x_{<k}\right)=y_{<k}y_{k}$,

then you cannot choose a $y_{k}$ in the argmax.

Also for the Solomonoff prior you must sum over all programs

$q:q\left(y_{1:m_{k}}\right)=x_{1:m_{k}}$.

Could you maybe expand on the proof of Lemma 1 a little bit? I am not sure I get what you mean yet.

• If you already choose the policy … then you cannot choose a $y_{k}$ in the argmax.

The argmax comes before choosing a policy. In ${\displaystyle \arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}...}$, there is already a value for $y_{k}$ before you consider all the policies such that $p\left(x_{<k}\right)=y_{<k}y_{k}$.

Also for the Solomonoff prior you must sum over all programs

Didn’t I do that?

Could you maybe expand on the proof of Lemma 1 a little bit?

Look at any finite observation sequence. There exists some action you could output in response to that sequence that would allow you to get arbitrarily close to the supremum expected utility with suitable responses to the other finite observation sequences (for instance, you could get within 1/2 of the supremum). Now look at another finite observation sequence. There exists some action you could output in response to that, without changing your response to the previous finite observation sequence, such that you can get arbitrarily close to the supremum (within 1/4). Look at a third finite observation sequence. There exists some action you could output in response to that, without changing your responses to the previous 2, that would allow you to get within 1/8 of the supremum. And keep going in some fashion that will eventually consider every finite observation sequence. At each step n, you will be able to specify a policy that gets you within $2^{-n}$ of the supremum, and these policies converge to the policy that the agent actually implements.

I hope that helps. If you still don’t know what I mean, could you describe where you’re stuck?

• I wouldn’t use N element of N, but rather n element of N or something. Just a small nitpick.

EDIT: I was watching on my laptop and didn’t see that the second N is a bit fatter.

• $\mathbb{N}$ means the set of natural numbers.

• Yes, the fat N. It’s clear at a higher resolution.