Asymptotically Benign AGI

We present an algorithm, then show (given four assumptions) that in the limit, it is human-level intelligent and benign.

Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don’t yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or revoke it before it is printed.

I will distribute at least min($365, number of comments × $15) in prizes by April 1st (via Venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments’ quality. If one commenter finds an error, and another commenter tinkers with the setup or tinkers with the assumptions in order to correct it, then I expect both comments will receive a similar prize (if those comments are at the level of prize-winning, and neither person is me). If others would like to donate to the prize pool, I’ll provide a comment that you can reply to.

To organize the conversation, I’ll start some comment threads below:

• Positive feedback

• General Concerns/Confusions

• Minor Concerns

• Concerns with Assumption 1

• Concerns with Assumption 2

• Concerns with Assumption 3

• Concerns with Assumption 4

• Concerns with “the box”

• Adding to the prize pool

• If I have a great model of physics in hand (and I’m basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

• I know I don’t prove it, but I think this agent would be vastly superhuman, since it approaches Bayes-optimal reasoning with respect to its observations. (“Approaches” because MAP → Bayes).

For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I’m unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.
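The speed-prior point can be sketched with a toy weighting. This is an illustrative sketch only, not the paper’s actual prior: the function and all numbers here are made up, but it shows the qualitative behavior — a slow world-model of the same description length starts with much less prior mass, so the agent needs evidence before entertaining it.

```python
import math

def speed_prior_weight(description_bits, compute_steps):
    """Toy speed-prior-style weight: penalize both description length
    and (log of) computation time, so slower world-models start with
    less prior mass and need more evidence before they dominate."""
    return 2.0 ** -(description_bits + math.log2(compute_steps))

# Two hypothetical world-models of equal description length:
fast_coarse  = speed_prior_weight(description_bits=100, compute_steps=2**10)
slow_physics = speed_prior_weight(description_bits=100, compute_steps=2**40)
# The slow, high-fidelity model starts with 2**30 times less prior weight.
```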

That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

• For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I’m unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.

I’m keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

(Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about “why do you need this memory stuff?”)

• That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

Definitely agree. I don’t think it’s the case that a shift to super uncompetitiveness is actually an “ingredient” to benignity, but my only discussion of that so far is in the conclusion: “We can only offer informal claims regarding what happens before BoMAI is definitely benign...”

• That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

Surely that just depends on how long you give them to think. (See also HCH.)

• By competitiveness, I meant usefulness per unit computation.

• The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I’d bet against—the humans could do this particular brute force search, in which case you’d have a tie, but they’d probably do something smarter.
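To make the count concrete, here is a minimal brute-force expectimax sketch. The world-model and reward here are toy stand-ins (not from the paper); with binary actions it performs exactly 2^{episode length} model evaluations, which is the cost being discussed.

```python
from itertools import product

def expectimax_plan(model, horizon, actions=(0, 1)):
    """Evaluate every action sequence of length `horizon` under the
    world-model and return the best one: |actions|**horizon evaluations."""
    evaluations = 0
    best_seq, best_value = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        evaluations += 1
        value = model(seq)  # expected total reward under the model
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq, best_value, evaluations

# Toy deterministic "world-model": reward is the number of 1-actions taken.
best_seq, best_value, n_evals = expectimax_plan(lambda s: sum(s), horizon=10)
# n_evals == 2**10 == 1024
```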

• I obviously haven’t solved the Tractable General Intelligence problem. The question is whether this is a tractable/competitive framework. So expectimax planning would naturally get replaced with a Monte-Carlo tree search, or some better approach we haven’t thought of. And I’ll message you privately about a more tractable approach to identifying a maximum a posteriori world-model from a countable class (I don’t assign a very high probability to it being a hugely important capabilities idea, since those aren’t just lying around, but it’s more than 1%).

It will be important, when considering any of these approximations, to evaluate whether they break benignity (most plausibly, I think, by introducing a new attack surface for optimization daemons). But I feel fine about deferring that research for the time being, so I defined BoMAI as doing expectimax planning instead of MCTS.

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

• I’m sympathetic to this picture, though I’d probably be inclined to try to model it explicitly—by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think “just write down the algorithm, and be happier if it looks like a ‘normal’ algorithm” is an OK starting point, though.

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.” If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I’m going to quickly be left behind.

It’s certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit “nothing goes irreversibly wrong at any finite time” assumption).

• Starting a new thread on this:

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

here.

• From Paul:

I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one is that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resemble good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.

My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is because I pared it down to a minimum viable product.

Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*

I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.

I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds—whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
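The scoring-rule point can be made concrete. Below is a sketch with made-up numbers; the Brier score is just one example of a proper scoring rule, standing in for the unspecified “better scoring rule.” Under a right/wrong rule with a low base rate of true claims, flatly answering “it won’t work” is optimal, while under a proper scoring rule the expected score is maximized by reporting the true probability.

```python
def zero_one(report, outcome):
    """'Obvious' scoring rule: 1 if the thresholded guess matches."""
    return 1.0 if (report >= 0.5) == bool(outcome) else 0.0

def brier(report, outcome):
    """Negative Brier loss: a proper scoring rule (higher is better)."""
    return -(report - outcome) ** 2

def expected_score(report, p_true, rule):
    return p_true * rule(report, 1) + (1 - p_true) * rule(report, 0)

p_true = 0.1  # made-up base rate: most proposed interventions don't work
grid = [i / 100 for i in range(101)]

# Under the right/wrong rule, any report below 0.5 ("it won't work") is optimal.
best_01 = max(grid, key=lambda r: expected_score(r, p_true, zero_one))

# Under Brier, the expected score is uniquely maximized at the true probability.
best_brier = max(grid, key=lambda r: expected_score(r, p_true, brier))
```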

I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.

* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.

• Here is an old post of mine on the hope that “computationally simplest model describing the box” is actually a physical model of the box. I’m less optimistic than you are, but it’s certainly plausible.

From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I’d bet against at 1:1 odds, but not 1:2 odds.

• I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models?

If I’m understanding correctly, and I’m very unsure that I am, you’re comparing the model-based approach of [learn the environment then do good planning] with [learn to imitate a policy]. (Note that any iterated approach to improving a policy requires learning the environment, so I don’t see what “training cognitive policies” could mean besides imitation learning.) And the question you’re wondering about is whether optimization daemons become easier to avoid when following the [learn the environment then do good planning] approach.

Imitation learning is about prediction just as much as predictive models are—predictive models imitate the environment. So I suppose optimization daemons are about equally likely to appear?

My real answer, though, is that I’m not sure, but vanilla imitation learning isn’t competitive.

But I suspect I’ve misunderstood your question.

• the hope that “computationally simplest model describing the box” is actually a physical model of the box

I don’t actually rely on this assumption, although it underpins the intuition behind Assumption 2.

• I agree that you don’t rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don’t need to care about any of the considerations of the kind I was raising in my post.

• Given that you are taking limits, I don’t see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:

• You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some number of episodes such that, if you ran that many episodes with everything working as intended, and in particular with no memory corruption, then you would get “benign” behavior).

• Any time the MAP model makes a different prediction from the intended model, it loses some likelihood. So this can only happen finitely many times in any possible world. Just take that cutoff to be after the last time it happens w.h.p.

What’s wrong with this?

• Notational note: I use one symbol to denote the episode when BoMAI becomes demonstrably benign, and a different one for something else.

Any time the MAP model makes a different prediction from the intended model, it loses some likelihood.

Any time any model makes a different on-policy prediction from the intended model, it loses some likelihood (in expectation). The off-policy predictions don’t get tested. Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆. The approach here is to set it up so that world-models like ν† either start with a lower prior, or else eventually halt when they exhaust their computation budget.

• Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆.

I agree with that, but if they are always making the same on-policy prediction it doesn’t matter what happens to their relative probability (modulo exploration). The agent can’t act on an incentive to corrupt memory infinitely often, because each time requires the models making a different prediction on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode. Agree/disagree?

(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can’t discourage exploration. And if they don’t agree on the exploration distribution, then the bad model will eventually get tested.)
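The on-policy likelihood argument in this exchange can be sketched numerically. All probabilities below are illustrative: one model plays the role of the intended ν⋆, another agrees with it on-policy (as ν† does under a non-tampering policy), and a third actually mispredicts. Only the third loses likelihood relative to the intended model.

```python
def likelihood(preds, obs):
    """preds[i] is the model's probability that observation i equals 1."""
    l = 1.0
    for p, x in zip(preds, obs):
        l *= p if x == 1 else 1.0 - p
    return l

obs       = [1, 0, 1, 1]             # on-policy observations
nu_star   = [0.9, 0.2, 0.9, 0.8]     # intended model's predictions
nu_dagger = [0.9, 0.2, 0.9, 0.8]     # agrees on-policy: ratio stays 1
nu_wrong  = [0.9, 0.2, 0.5, 0.8]     # differs on round 2: loses mass

ratio_agree = likelihood(nu_dagger, obs) / likelihood(nu_star, obs)
ratio_wrong = likelihood(nu_wrong, obs) / likelihood(nu_star, obs)
# ratio_agree == 1.0, while ratio_wrong < 1.0
```

This is why agreement on-policy means no loss of relative probability mass, and why the setup instead handles ν†-like models through the prior or their computation budget.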

• Ah I see what you’re saying.

I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

Also, you can add another parameter to BoMAI where you just have the human explorer explore for the first E episodes. The episode index in the Eventual Benignity Theorem can be thought of as the max of i’ and i″. i’ comes from the corresponding index in Lemma 1 (Rejecting the Simple Memory-Based). i″ comes from the point in time when the MAP world-model is ε-accurate on-policy, which renders Lemma 3 applicable. (And Lemma 2 always applies). My initial thought was to set E so that the human explorer is exploring for the whole time when the MAP world-model was not necessarily benign. This works for i’. E can just be set to be greater than i’. The thing it doesn’t work for is i″. If you increase E, the value of i″ goes up as well.

So in fact, if you set E large enough, the first time BoMAI controls the episode, it will be benign. Then, there is a period where it might not be benign. However, from that point on, the only “way” for a world-model to be malign is by being worse than ε-inaccurate on-policy, because Lemmas 1 and 2 have already kicked in, and if it were ε-accurate on-policy, Lemma 3 would kick in as well. The first point to make about this is that in this regime, benignity comes in tandem with intelligence—it has to be confused to be dangerous (like a self-driving car). The second point is: I can’t come up with an example of a world-model which is plausibly maximum a posteriori in this interval of time, and which is plausibly dangerous (for what that’s worth; and I don’t like to assume it’s worth much, because it took me months to notice ν†).

• I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

I think my point is this:

• The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)

• You probably don’t need the memory trick to establish the theorem itself.

• Even with the memory trick, I’m not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble—the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.

• the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia

This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.

1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?

2. I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?

3. The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?

4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”

(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also I haven’t read most of the math yet and may have more comments and questions once I understand the motivations and math better.)

• 4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...

When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.

• 1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?

Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe from insatiable curiosity: the agent has to try killing everyone just to check to make sure it’s not a weird cheat code.

• 3. The system seems limited to answering questions that the human operator can correctly evaluate...

Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way).

One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.

I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.

REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”

• I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.

This seems useful if you could get around the mind hacking problem, but how would you do that?

On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”

I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.

• This seems useful if you could get around the mind hacking problem, but how would you do that?

On second thought, (even assuming away the mind hacking problem) if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.

ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?

• ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?

No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.

• I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?

• Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?

Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.

• I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option.

• Do you have thoughts on this?

Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.

• you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on

I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.

… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.

• I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.

Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.

• I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.

You’re right (see the redaction). Why Wei is right. Here’s an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent’s actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent’s actions are maximally inconvenient.
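A minimal sketch of that minimax idea, with a made-up zero-sum payoff table: each agent treats the environment as responding to a pair of actions and picks its own action assuming the opponent's choice is maximally inconvenient.

```python
# payoff[(a, b)]: reward to agent A when A plays a and B plays b;
# agent B receives 1 - payoff[(a, b)] (the clone-with-flipped-reward setup).
# All numbers are a made-up example.
payoff = {
    (0, 0): 0.5, (0, 1): 0.2,
    (1, 0): 0.9, (1, 1): 0.4,
}

def minimax_action(payoff, my_actions=(0, 1), their_actions=(0, 1)):
    """Choose the action with the best worst-case payoff, assuming the
    opponent responds as inconveniently as possible."""
    return max(my_actions,
               key=lambda a: min(payoff[(a, b)] for b in their_actions))

a_choice = minimax_action(payoff)  # action 1: worst case 0.4 beats 0.2
```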

• So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.

• Yep.

• With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., reads any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.

(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)

• *but A could concoct a story … counterarguments from B … mind hacked by B, right?

I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn't be a problem.

That said, here's one possibility: if A ever suggests that you not read more output from B, don't read anything more from B, then flip coins to give A a 25% chance of victory.
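
That rule can be sketched as a tiny judge procedure (the message-matching condition and function names are hypothetical; the 25% figure is the one proposed above):

```python
import random

def judge_protocol(transcript_from_a, rng=random.random):
    """If A ever asks the judge to stop reading B's output, stop reading B
    and give A only a 25% chance of victory (per the proposal above)."""
    for message in transcript_from_a:
        if "stop reading B" in message:  # stand-in for detecting the suggestion
            return "A" if rng() < 0.25 else "B"
    return None  # no such suggestion: the debate continues normally

print(judge_protocol(["B's claim is wrong because...", "stop reading B now"],
                     rng=lambda: 0.9))  # 0.9 >= 0.25, so this prints "B"
```

The point of the 25% (rather than 0%) is to leave A some incentive to raise a true mind-hacking accusation; the discussion below examines whether 25% is the right number.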

• *but A could concoct a story … counterarguments from B … mind hacked by B, right?

Yeah, I mixed up the A's and B's at the end. It's fixed now. Thanks for pointing it out.

I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn't be a problem.

I don't understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen?

That said, here's one possibility: if A ever suggests that you not read more output from B, don't read anything more from B, then flip coins to give A a 25% chance of victory.

But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn't read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they're trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has a >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting a 75% chance of victory.

• Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen?

If the assumption is true, we could demand that A use their words, and counter our being mind-hacked by poking holes in what B is saying rather than by demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.

So actually I framed my point above wrong: "demanding that A use their words" could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).

But your original point was "why doesn't A just claim B is mind-hacking?", not "why doesn't B just mind-hack?" The answer to that point was "demand A use their words rather than negotiate an end to the conversation", or more moderately, "75%-demand that A do this."

• If the assumption is true, we could demand that A use their words, and counter our being mind-hacked by poking holes in what B is saying rather than by demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.

Oh, I see, I didn't understand "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer-overflow bug in a network stack, the scheme would fail. In that case, A would not be able to "pierce through" B's attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).

I don't think the "AI safety via debate" paper actually makes arguments for this assumption (at least I couldn't find where it does). Do you have reasons to think it's true, or ideas for how to verify that it's true, short of putting a human in a BoMAI?

• Yeah… I don't have much to add here. Let's keep thinking about this. I wonder if Paul is more bullish than I am on the premise that "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge"?

Recall that this idea was to avoid

essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on

Even if it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn't originally what brought us here.

• This seems useful if you could get around the mind-hacking problem, but how would you do that?

If the person leaves having been hypnotized, since it's not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2; in either event, Operator 2 could end the episode immediately. (That would be Operator 2's only ability.) More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.

A bit of a side note: I'm curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).

• Instead of hypnosis, I'm more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or giving some sequence of answers that feel extremely insightful (and invite further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn't optimized for by the AI).

If you try to solve the mind-hacking problem iteratively, you're more likely to find a way to get useful answers out of the system, but you're also more likely to hit upon an existentially catastrophic form of mind hacking.

A bit of a side note: I'm curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).

I guess it depends on how many interactions there are per episode and how long each answer can be. I would say >0.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in "AI safety via debate" if I remember correctly.

• but you're also more likely to hit upon an existentially catastrophic form of mind hacking.

Can you explain this?

• Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is "Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design," the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead "Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer."
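
A toy version of this argument in Python (the counts n, k, and number of draws are made up purely for illustration): if each mind-hacked run reveals one of n hack forms, stopping after the first draw risks far less catastrophe exposure than iterating through several.

```python
# Toy model of the point above: each run that ends in mind-hacking draws one
# of n hack forms without replacement, of which k are existentially
# catastrophic. Plan 1 stops after the first hack; Plan 2 blocks the specific
# form found and retries, so it can end up sampling many forms.

from fractions import Fraction

def p_catastrophe(n, k, max_draws):
    """Probability of drawing a catastrophic form within `max_draws`
    draws without replacement from n forms, k of them catastrophic."""
    p_safe = Fraction(1)
    for i in range(max_draws):
        p_safe *= Fraction(n - k - i, n - i)
    return 1 - p_safe

print(p_catastrophe(10, 2, 1))  # Plan 1, one draw:   1/5
print(p_catastrophe(10, 2, 5))  # Plan 2, five draws: 7/9
```

This ignores any correlation between how "findable" a hack form is and how catastrophic it is, which the real disagreement below partly turns on.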

• Hm. This doesn't seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)

For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1...

but I don't imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even for the possible instrumental aim of operator-devotion, still doesn't seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere.)

Maybe other people can weigh in on this, and we can come back to it.

• the operator might feel a desire afterwards to spread this world view

It is plausible to me that there is selection pressure to make the operator "devoted" in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.

Just to step back and frame this conversation, we're discussing the issue of outside-world side effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, and without optimization pressure, surprising things don't. (Again, that is just the default before we look closer.) This doesn't mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.

• Yeah, the threat model I have in mind isn't the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity's potential (which is part of the definition of "existential risk"). For example, fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.

• Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.

Empirical analysis might not be useless here in evaluating the "surprisingness" of X. I don't think Christianity makes the cut, either for virulence or for incompatibility with some satisfactory level of The True Good.

I'm adding this not for you, but to clarify for the casual reader: we both agree that a superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.

I'm open to other terminology. Yes, there is no guarantee about what happens to the operator. As I'm defining it, benignity is defined as not having outside-world instrumental goals, and the intuition for the term is "not existentially dangerous."

• The best alternative to "benign" that I could come up with is "unambitious". I'm not very good at this type of thing though, so maybe ask around for other suggestions, or indicate somewhere prominent that you're interested in giving out a prize specifically for this?

• There's still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn't optimized it for that purpose); for example, it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but "benign" and "not existentially dangerous" both seem to be claiming a bit too much. I'll think about what other term might be more suitable.

• The first nuclear reaction initiated an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of "could have powerful effects outside the box" is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side effects of inside-the-box events, since these side effects are the only reason it's useful. And there is also certainly no way to rule out how those side effects "might turn out to be," without a complete view of the future.

Would you agree that unprecedentedness captures the concern?

• How does your AI know to avoid running internal simulations containing lots of suffering?

• It does not; thank you for pointing this out! This feature would have to be added on. Maybe you can come up with a way.

• Would you mind explaining what the retracted part was? Even if it was a mistake, pointing it out might be useful to others thinking along the same lines.

• Sorry, I probably shouldn't have written the sentence in the first place; it was an AI capabilities idea.

• From the formal description of the algorithm, it looks like you use a universal prior to pick the world-model, and then allow the Turing machine a bounded number of steps per episode, but don't penalize the running time of the machine that outputs the world-model's description. Is that right? That didn't match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I'm misunderstanding.

• Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don't. The universal prior picks the index (the number) by outputting it as a string, and that same string describes the corresponding Turing machine. So you end up looking at the Kolmogorov complexity of the description of the Turing machine, and the construction of the description of the world-model isn't time-penalized. This doesn't change the asymptotic result, so I went with the more familiar Kolmogorov complexity rather than translating this new speed prior into a measure over finite strings, which would require some more exposition. But I agree with you that it feels like there might be some strange outcomes "before the limit" as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.

The other reasonable choices here, instead of the Kolmogorov complexity, are a speed-prior-style complexity (constructed to be like the new speed prior here) and the plain length of the description. But the plain length basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over world-models that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
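
The MAP-versus-mixture distinction running through this thread can be made concrete with a toy sketch (all numbers and model names are made-up labels, not anything from the paper): a Bayes mixture averages predictions by posterior weight, while MAP keeps only the single highest-posterior model.

```python
# Sketch: Bayes mixture vs. maximum a posteriori (MAP) over a toy model class.
# In the mixture, several low-weight models can jointly sway the prediction;
# under MAP, only the single top model counts.

def posterior(prior, likelihoods):
    weights = {m: prior[m] * likelihoods[m] for m in prior}
    z = sum(weights.values())
    return {m: w / z for m, w in weights.items()}

def mixture_prediction(post, predictions):
    return sum(post[m] * predictions[m] for m in post)

def map_prediction(post, predictions):
    best = max(post, key=post.get)
    return predictions[best]

prior = {"physics": 0.5, "model_a": 0.3, "model_b": 0.2}
likelihoods = {"physics": 0.9, "model_a": 0.9, "model_b": 0.9}  # all fit the data
predictions = {"physics": 0.0, "model_a": 1.0, "model_b": 1.0}

post = posterior(prior, likelihoods)
print(mixture_prediction(post, predictions))  # ~0.5: the two small models add up
print(map_prediction(post, predictions))      # 0.0: only the top model counts
```

This is the sense in which, under MAP, sub-dominant world-models "can't team up", a point made explicitly further down the thread.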

• This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you'll want to change the prior to something less extreme here.

(I might well be misunderstanding something; apologies in advance.)

Suppose the "intended" physics takes at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose the per-step penalty is around 1E-3 (I think you need much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness.

For almost the same description complexity, I could write down physics + "precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table." This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e., in order for the "speed" part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the intended model.

That appears to happen only once N > BB(1E12). Does that seem right to you?
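
The arithmetic in this argument can be laid out explicitly (the per-step penalty rate here is an assumed stand-in, since the comment's exact parameter did not survive):

```python
# Arithmetic behind the lookup-table argument: the intended physics pays a
# runtime penalty compounding per step, while the precomputed table pays a
# description-length penalty of K(N|physics) extra bits.

import math

steps = 1e15          # steps to simulate one episode of the intended physics
penalty_rate = 1e-3   # assumed per-step log-penalty (nats); not from the source

runtime_penalty_nats = steps * penalty_rate
print(runtime_penalty_nats)  # 1e12 nats: the intended model loses exp(1e12)

# For the intended model to win anyway, the lookup table's extra description
# length K(N|physics) must exceed the equivalent number of bits -- which, by
# a counting argument, only happens once N > BB(1e12).
extra_bits_needed = runtime_penalty_nats / math.log(2)
print(extra_bits_needed > 1e12)  # True: more than 1e12 bits in base 2
```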

We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the "speed" part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.

Using the speed prior seems more reasonable, but I'd want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe, to save time, I'd want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.

• Does that seem right to you?

Yes. I recall thinking about precomputing observations for various actions in this phase, but I don't recall noticing how bad the problem was before the limit.

your take on whether the proposed version is dominated by consequentialists at some finite time.

This goes in the category of "things I can't rule out". I say maybe a 1/5 chance it's actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1/5 is very high, and I'll concede this point.

I'd want to know which version of the speed prior and which parameters

The speed prior is a measure over binary strings. Instead, let's try weighting a string by 2^(-length) times c^(runtime), where length is the length of the string, runtime is the time it takes to run it on the UTM, and c is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then c could be above a threshold depending on d, where d is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.
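
A sketch of that weighting as code (the value of c and the example lengths and runtimes are arbitrary illustrations): length and runtime both cost weight, so of two equally short programs, the slower one loses.

```python
# Sketch of the proposed prior variant: weight a program by
# 2^(-length) * c^(runtime), so longer *and* slower programs both lose
# weight. A constant c just below 1 gives a gentle per-step penalty.

def weight(length_bits, runtime_steps, c=0.999):
    return 2.0 ** -length_bits * c ** runtime_steps

fast = weight(length_bits=100, runtime_steps=1_000)
slow = weight(length_bits=100, runtime_steps=100_000)
print(fast > slow)  # True: equal length, but the slow program is penalized
```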

What problem do you think bites you?

• I say maybe a 1/5 chance it's actually dominated by consequentialists

Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply?

What problem do you think bites you?

What's the constant? Is it O(1) or really tiny? And which value of the penalty do you want to consider, polynomially small or exponentially small?

But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.

Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.

• Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply?

Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on the UTM), and that is the consequentialists' real competition; then [the probability that the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).

• I don't see why their methods would be elegant. In particular, I don't see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).

I don't see how MAP helps things either: doesn't the same argument suggest that for most possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn't MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)

• I don't see why their methods would be elegant.

Yeah, I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I'm not sure this is a hugely important disagreement for us to resolve.

I don't see how MAP helps things either

I didn't realize that you think a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:

it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you're not first, you're last, and more importantly, you can't team up with other consequentialist-controlled world-models in the mixture.

• Let's say , .

Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss?

Oh yeah, that's good news.

• Are there UDT-ish concerns with breaking the isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it's in a game-theory situation with the AI at episodes 116 and 118, and trade rewards with them acausally, leading to long-term goal-directed behavior?

• Are there UDT-ish concerns

BoMAI is a causal decision theorist.

• It's actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there's no reason for the operator to enter a reward at every timestep, and instead we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible, and would partially address item 4 in my comment.

• So "updateless" is a reasonable term to apply to BoMAI, but it's not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn't mediated by the actual choice of action, updatelessness ends up having consequences.

If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it's isomorphic whether it computes the expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it's the same calculation being run. And this calculation is CDT.
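
A toy illustration of that isomorphism claim (the environment, rewards, and horizon are all made up): in a setting where the policy affects the world only through its actions, replanning by expectimax at every step collects exactly the value computed once at the start.

```python
# Sketch: planning once at the start vs. re-planning at every step are
# isomorphic when a policy affects the world only through its actions.
# Tiny deterministic two-step toy environment.

def expectimax(state, steps, transition, reward, actions):
    """Return (value, best_action) by exhaustive lookahead."""
    if steps == 0:
        return 0.0, None
    return max(
        ((reward(state, a) + expectimax(transition(state, a), steps - 1,
                                        transition, reward, actions)[0], a)
         for a in actions),
        key=lambda pair: pair[0])

actions = [0, 1]
transition = lambda s, a: s + a
reward = lambda s, a: float(s == 1 and a == 1)  # reward only for a=1 at s=1

# Plan-as-you-go: recompute the expectimax at each step.
s, total = 0, 0.0
for step in range(2):
    _, a = expectimax(s, 2 - step, transition, reward, actions)
    total += reward(s, a)
    s = transition(s, a)

# Plan-once: the value computed at the start equals the reward collected.
v0, _ = expectimax(0, 2, transition, reward, actions)
print(total == v0)  # True
```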

I guess this means (and I've never thought about this before, so this could easily be wrong) that under the assumption that a policy's effect on the world is screened off by which actions it takes, CDT is reflectively stable.

(And yes, you could just give one reward, which ends the episode.)

• My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.

• If I understand correctly, it's actually updateless within an episode, and that's the only thing it cares about, so I don't see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn't have the ability to create a non-CDT agent without leaving the room, which would end the episode. (I guess it could hack the operator's mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)

• With the correction that it is updateless and CDT (see here), I agree with the rest of this.

• does it have an incentive to manipulate the external world to make episode 117 happen many times somehow

For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 − 1, if you care). In none of these world-models, under no actions that it considers, does "episode 117 happen twice."

• In none of these world-models, under no actions that it considers, does "episode 117 happen twice."

Yes, episode 117 happens only once in the world-model; and suppose the agent cares only about episode 117 in the "current execution". The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the "current execution" is one that ends with the agent receiving a high reward.

• Okay, so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world in a way that looks like this. But the "malign output that would result in additional invocations of itself" would require the operator to leave the room, so this has the same form as, for example, . At this point, I think we're no longer considering anything that sounds like "episode 117 happening twice," but that's fine. Also, just a side note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.

• Other algorithms… would eventually seek arbitrary power in the world in order to intervene in the provision of its own reward; this follows straightforwardly from its directive to maximize reward

The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems untrue.

• AUP seems really promising. I just meant other algorithms that have been proven generally intelligent, which is really just AIXI, the Thompson Sampling Agent, BayesExp, and a couple of other variants on Bayesian agents with large model classes.

• This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a "computable function" with no such limit? Is it because the number of Turing machines is infinite?

• Not a dumb question; bounded computation time here means bounded computation time per episode, so really it's linear computation time.

• My main concern is that the system will be neither safe nor useful. Its usefulness is limited by the number of tasks which could be solved in such a setup. These tasks are probably pieces of advice or projects: "should we develop wind energy?" or "print the blueprint of a nanorobot". The main problem is that advice can look good but have unintended consequences which are not obvious. For example, widespread wind farms will kill birds and result in pest proliferation. If the AI continued to work, it could adjust its advice to this new data, but the program has stopped.

In other words, a sealed Oracle can give bad advice and can't be punished for that.

There will be a strong incentive for the system's owner to disable all these safety measures and use the system in the real world continuously, so all of Omohundro's drives appear all over again, but at the level of the system's owners.

Moreover, the system could feel rewarded now by knowing that it will in the future dominate the universe, so it may feel the need to put a secret code in its output (if the output is long) which will create many copies of itself in the wild, for example in order to tile the universe with wind farms. If it can't output complex code, it will be mostly useless.

• The main prob­lem is that an ad­vice can look good but have some un­in­tended con­se­quences, which are not ob­vi­ous.

link­ing this to the dis­cus­sion be­tween Wei Dai and me here.

• There will be a strong incentive for the system’s owner to disable all these safety measures

If the operators believe that without the safety measures humanity would be wiped out, I think they won’t jettison them. More to the point, running this algorithm does not put more pressure on the operators to try out a dangerous AI. Whatever incentive existed already is not the fault of this algorithm.

• The problem for any human operator is other human operators, e.g. Chinese vs. American. This creates exactly the same dynamics as Omohundro explored: a strong incentive to grab power and take riskier actions.

You dissect the whole system into two parts, and then claim that one of the parts is “safe”. But the same thing can be done with any AI: just say that its memory or any other part is safe.

If we look at the whole system, including the AI and its human operators, it will have unsafe dynamics as a whole. I wrote about this more in “Military AI as a convergent goal of AI development.”

• What would constitute a solution to the problem of the race to the bottom between teams of AGI developers, as they sacrifice caution to secure a strategic advantage, besides the conjunction of a) technical proposals and b) multilateral treaties? Is your complaint that I don’t discuss b? I think we can focus on these two things one at a time.

• There could in fact be many solutions, ranging from preventing the creation of AI entirely to creating so many AIs that they balance each other. I have an article with an overview of possible “global” solutions.

I don’t think you should discuss different global solutions, as that would be off topic. But a discussion of the whole system of “boxed AI + AI creators” may be interesting.

• the system can feel rewarded now by knowing that it will in future dominate the universe

I do not see any mapping from these concepts to its action-selection criterion of maximizing the rewards for its current episode.

• Comment thread: concerns with Assumption 1

• Since the real world is quantum, does your UTM need to be quantum too? More generally, what happens if there’s a mismatch between what computations can be done efficiently in the real world vs. on the UTM?

Also, I’m not sure what category this question falls under, but can you explain the new speed prior that you use, e.g., what problems in the old speed priors it was designed to solve? (I recall noticing some issues with Schmidhuber’s speed prior but can’t find the post where I wrote about it now.)

• can you explain the new speed prior

Yes! If you’re a proper Bayesian using the speed prior on sequence prediction for infinite sequences, you end up with surprisingly good loss bounds. This is surprising because the speed prior assigns probability 0 to infinite sequences, so the truth has no prior support.

If you use a maximum a posteriori estimate instead of a full Bayesian mixture, and the truth has prior support, you also do fine.

But as far as I can tell, things break if you try both at once. So what I needed was a way of penalizing slow world-models while still making sure that the true environment had prior support (and in particular, the possibility of the true environment running for infinitely many timesteps). Otherwise, you don’t get any sort of intelligence result.
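To make the trade-off concrete, here is a minimal sketch of a speed-prior-style weight, assuming a form like 2^(−length) · β^(steps); the function name and all numbers are my own illustrative assumptions, not the paper’s exact definitions.

```python
# Illustrative sketch: a speed-prior-style weight penalizes a world-model
# both for its description length and for the computation it uses.
def speed_prior_weight(length_bits, compute_steps, beta):
    """Weight ~ 2^-length * beta^steps, with 0 < beta < 1."""
    return 2.0 ** (-length_bits) * beta ** compute_steps

# A short-but-slow model vs. a longer-but-fast one (made-up numbers):
short_slow = (50, 10_000)   # (description length in bits, steps per episode)
long_fast = (120, 100)

# With beta near 1, the slowness penalty fades and description length wins:
assert speed_prior_weight(*short_slow, beta=0.999) > speed_prior_weight(*long_fast, beta=0.999)
# With a more severe beta, compute dominates and the fast model wins:
assert speed_prior_weight(*short_slow, beta=0.99) < speed_prior_weight(*long_fast, beta=0.99)
```

Decreasing β penalizes slowness more severely; pushing it toward 1 recovers something closer to a plain length-based prior.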

• does your UTM need to be quantum too?

I feel like I’ve heard something about a quadratic slowdown when simulating quantum mechanics on a classical machine? But it doesn’t really matter: a simulation that required 100 computation steps per episode on a quantum computer might require 10,000 computation steps per episode on a Turing machine, but it’s still bounded.

• These three sources all say that simulating a quantum system or computer on a classical computer takes exponential time. Does that make a difference?

• It doesn’t make a difference. Thank you for actually looking into this, though.

I was thinking of this.

• It doesn’t make a difference.

I’m surprised by this. Can you explain a bit more? I was thinking that an exponentially large computation bound for the TM that accurately simulates the real world would make its speed prior so small that it would be practically impossible for the AI to get enough inputs (i.e., messages from the operator) to update on to make that world model have the highest weight in the posterior.

• Yes, in that sense it’s bad; I just meant it doesn’t make a difference to the theorems or pose a threat to Assumption 1.

And it isn’t quite that bad. This effect also means you could safely increase β (penalize slowness less), which causes BoMAI to start entertaining useful models of reality sooner. To be clear, this point mitigates the difficulty you describe; it certainly doesn’t go beyond cancelling it out.

• My worry at this point is that if simulating the real world using actual physics takes exponential time on your UTM, the world model with the greatest posterior may not be such a simulation, but instead, for example, an alien superintelligence that runs efficiently on a classical TM and is predicting the behavior of the operator (using various algorithms that it came up with that run efficiently on a classical computer); at some point the alien superintelligence will cause BoMAI to output something to mind-hack the operator and then take over our universe. I’m not sure which assumption this would violate, but do you see this as a reasonable concern?

• The theorem is consistent with the aliens causing trouble any finite number of times. But each time they cause the agent to do something weird, their model loses some probability, so there will be some episode after which they stop causing trouble (if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments).

• Thanks. Is there a way to derive a concrete bound on how long it will take for BoMAI to become “benign”, e.g., is it exponential or something more reasonable? (Although if even a single “malign” episode could lead to disaster, this may be only of academic interest.) Also, to comment on this section of the paper:

“We can only offer informal claims regarding what happens before BoMAI is definitely benign. One intuition is that eventual benignity with probability 1 doesn’t happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent benign.”

If BoMAI can be effectively controlled by alien superintelligences before it becomes “benign”, that would suggest “everything is conspiring to make the agent benign” is misleading as far as reasoning about what BoMAI might do in the meantime.

(if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments)

Is this noted somewhere in the paper, or just implicit in the arguments? I guess what we actually need is either a guarantee that all episodes are “benign” or a bound on the utility loss that we can incur through such a scheme. (I do appreciate that “in the absence of any other algorithms for general intelligence which have been proven asymptotically benign, let alone benign for their entire lifetimes, BoMAI represents meaningful theoretical progress toward designing the latter.”)

• Is there a way to derive a concrete bound on how long it will take for BoMAI to become “benign”, e.g., is it exponential or something more reasonable?

The closest thing to a discussion of this so far is Appendix E, but I have not yet thought through this very carefully. When you ask if it is exponential, what exactly are you asking if it is exponential in?

• When you ask if it is exponential, what exactly are you asking if it is exponential in?

I guess I was asking if it’s exponential in anything that would make BoMAI impractically slow to become “benign”, so basically just using “exponential” as a shorthand for “impractically large”.

• Is this noted somewhere in the paper

I don’t think it is; thank you for pointing this out.

• If BoMAI can be effectively controlled by alien superintelligences before it becomes “benign” that would suggest “everything is conspiring to make the agent benign” is misleading as far as reasoning about what BoMAI might do in the meantime.

Agreed, that would be misleading, but I don’t think it would be controlled by alien superintelligences.

• the world model with the greatest posterior may not be such a simulation but instead for example an alien superintelligence that runs efficiently on a classical TM which is predicting the behavior of the operator

Consider the algorithm the alien superintelligence is running to predict the behavior of the operator, which runs efficiently on a classical TM (Algorithm A). Now compare Algorithm A with Algorithm B: simulate aliens deciding to run Algorithm A; run Algorithm A; except at some point, figure out when to do a treacherous turn, and then do it.

Algorithm B is clearly slower than Algorithm A, so Algorithm B loses.

There is an important conversation to be had here: your particular example isn’t concerning, but maybe we just haven’t thought of an analog that is concerning. Regardless, I think this has become divorced from the discussion about quantum mechanics.

This is why I try to write down all the assumptions: to rule out a whole host of world-models we haven’t even considered. In the argument in the paper, the assumption that rules out this example is the Natural Prior Assumption (Assumption 3), although I think for your particular example, the argument I just gave is more straightforward.

• Algorithm B is clearly slower than Algorithm A.

Yes, but Algorithm B may be shorter than Algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and fewer bits to pick out an alien superintelligence who has an instrumental reason to invent such an algorithm. If β is set so near 1 that the exponential-time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference, and everything comes down to program length.

Regardless, I think this has become divorced from the discussion about quantum mechanics.

Quantum mechanics is what makes B being slower than A not matter (via the above argument).

• If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time...

So I’m a bit baffled by the philosophy here, but here’s why I haven’t been concerned with the long time it would take BoMAI to entertain the true environment (and it might well take a long time, given a safe value of β).

There is a relatively clear distinction one can make between objective probabilities and subjective ones. The asymptotic benignity result makes use of world-models that perfectly match the objective probabilities rising to the top.

Consider a new kind of probability: a “t-optimal subjective probability.” That is, the best (in the sense of KL divergence) approximation of the objective probabilities that can be sampled from using a UTM and at most t computation steps. Suspend disbelief for a moment, and suppose we thought of these probabilities as objective probabilities. My intuition here is that everything works just great when agents treat subjective probabilities like real probabilities, and to a t-bounded agent, it feels like there is some sense in which these might as well be objective probabilities; the more intricate structure is inaccessible. If no world-models were considered that allowed more than t computation steps per timestep (or per episode, I guess, whatever), then just by calling “t-optimal subjective probabilities” “objective,” the same benignity theorems would apply, where the role in the proofs of [the world-model that matches the objective probabilities] is replaced by [the world-model that matches the t-optimal subjective probabilities]. And in this version, that point comes much sooner, and the limiting value of intelligence is reached much sooner.

Of course, “the limiting value of intelligence” is much less, because only fast world-models are considered. But that just goes to show that even if, on a human timescale, BoMAI basically never fields a world-model that actually matches objective probabilities, along the way it will still be fielding the best ones available that use a more modest computation budget. Once the computation budget surpasses the human brain’s, that should suffice for it to be practically intelligent.

EDIT: if one sets β to be safe, then if this logic fails, BoMAI will be useless, not dangerous.
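One hedged way to write down the “t-optimal subjective probability” sketched above (this is my notation; the paper may formalize it differently):

```latex
% Among measures sampleable on the UTM within t computation steps per
% timestep, take the best approximation (in KL divergence) of the
% objective measure \mu:
\nu_t^{*} \;=\; \operatorname*{arg\,min}_{\nu \in \mathcal{M}_t} \mathrm{KL}\left(\mu \,\|\, \nu\right),
\qquad
\mathcal{M}_t \;=\; \{\nu : \nu \text{ is sampleable on the UTM in at most } t \text{ steps per timestep}\}.
```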

• If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description. I bet that description is much shorter than “in Conway’s Game of Life, the efficient approximation of quantum mechanics that whatever lifeform emerges will probably come up with.”

But I’m hesitant here. This is exactly the sort of conversation I wanted to have.

• If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description.

I doubt that there’s an efficient classical approximation of quantum dynamics in general. There are probably tricks to speed up the classical approximation of a human mind, though (or parts of a human mind), that an alien superintelligence could discover. Consider this analogy. Suppose there’s a robot stranded on a planet without technology. What’s the shortest algorithm for controlling the robot such that it eventually leaves that planet and reaches another star? It’s probably some kind of AGI that has an instrumental goal of reaching another star, right? (It could also be a terminal goal, but there are many other terminal goals that call for interstellar travel as an instrumental goal, so the latter seems more likely.) Leaving the planet calls for solving many problems that come up, on the fly, including inventing new algorithms for solving them. If you put all these individual solutions and algorithms together, that would also be an algorithm for reaching another star, but it could be a lot longer than the code for the AGI.

• I see—so I think I make the same response on a different level, then.

My model for this is: the world-model is a stochastic simple world, something like Conway’s Game of Life (but with randomness). Life evolves. The output channel has distinguished within-world effects, so that inhabitants can recognize it. The inhabitants control the output channel and use some of their world’s noise to sample from a universal prior, which they then feed into the output channel. But they don’t just use any universal prior—they use a better one, one which updates the prior over world-models as if the observation has been made: “someone in this world-model is sampling from the universal prior.” Maybe they also started with a speed prior of some form (which would cause them to be more likely to output the fast approximation of the human mind we were just discussing). And then after a while, they mess with the output.

Whatever better universal prior they come up with (e.g. an anthropically updated speed prior), I think it has a short description, shorter than [−log prob(intelligent life evolves and picks it) + description of simple universe].

• It doesn’t make sense to me that they’re sampling from a universal prior and feeding it into the output channel, because the aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offer greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I’m not seeing it. If you think it can or should be viewed that way, can you explain more?

In particular, if they’re trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from an “anthropically updated speed prior” would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).

• I think you might be more or less right here.

I hadn’t thought about the can-do and the worth-doing updates, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.

The reason I think of these steps as updates to what started as a universal prior is that they would like to take over as many worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.

they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from an “anthropically updated speed prior” would do the equivalent of all that

If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence my reduction from [research about various topics] to [sampling from a speed prior].

But when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.

So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, “predict well” is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare case that treacherous world-models are shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about.) But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.

Another point I was considering, and I’m not exactly sure where this fits in, but I have the intuition that for the speed prior they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and will lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior and take whatever advantage they can get from the anthropic update and from picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”

• I’m glad that I’m getting some of my points across, but I think we still have some remaining disagreements or confusions here.

If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior.

That doesn’t seem right to me. A speed prior still favors short algorithms. If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of length?

Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU-maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?

I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to.

What happens if β is too small? If it’s really tiny, then the world model with the highest posterior is random, right, because it’s “computed” by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with the highest posterior starts doing fast and then increasingly compute-intensive predictions?

As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm.

I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs “natively” on the TM and can decide to execute arbitrary algorithms “natively” on the TM.

Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like “how to build a safe unbounded AGI”) because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don’t know an argument for that, and B) even if it’s true, taking advantage of it would seem to require fine-tuning β, and I don’t see how to do that, given that trial and error wouldn’t be safe.

• a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.

At the end of the day, it will be running some subroutine for its gain-trust/predict-accurately phase.

I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation: the more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
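The simulation-overhead intuition can be demonstrated in miniature. The toy interpreter and instruction set below are purely illustrative assumptions; the point is only that the interpreted version pays dispatch overhead on every step that the native loop avoids.

```python
# Toy illustration of simulation overhead: a "native" loop vs. the same
# computation run through a minimal interpreter for a made-up instruction set.

def native_sum(n):
    # Specialized machine: one tight loop, no dispatch.
    total = 0
    for i in range(n):
        total += i
    return total

def interpret(program, n):
    # Universal machine: fetch, decode, and dispatch every instruction.
    env = {"total": 0, "i": 0, "n": n}
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "add":        # env[a] += env[b]
            env[op[1]] += env[op[2]]
            pc += 1
        elif op[0] == "inc":      # env[a] += 1
            env[op[1]] += 1
            pc += 1
        elif op[0] == "jlt":      # jump to op[3] if env[a] < env[b]
            pc = op[3] if env[op[1]] < env[op[2]] else pc + 1
        else:                     # halt
            break
    return env["total"]

SUM_PROGRAM = [
    ("add", "total", "i"),
    ("inc", "i"),
    ("jlt", "i", "n", 0),
    ("halt",),
]

# Same answer, but the interpreted version does several host-level
# operations per simulated instruction:
assert interpret(SUM_PROGRAM, 1000) == native_sum(1000) == 499500
```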

• Ok, I see, so in other words the AGI doesn’t have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However, I think the AGI can still win a race against a straightforward “predict accurately” algorithm, because it can do two things: 1) include the most important inner loops of the “predict accurately” algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior), and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than “pays for” the remaining slowdown that is incurred.

• Let the AGI’s “predict accurately” algorithm be fixed.

What you call a sequence of improvements to the prediction algorithm, let’s just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as “predict accurately.” I think this reconceptualization eliminates 2) as a concern, and if I’m understanding correctly, 1) is only able to mitigate slowdown, not overpower it.

Also, I think 1) doesn’t work—maybe you came to this conclusion as well?

Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”.

But maybe you’re saying that doesn’t apply because:

(this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior)

I think this throws off the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, and aliens decide to run “predict accurately” + “treacherous turn”. Longer description, less overhead: an AGI that runs “predict accurately” + “treacherous turn.” Longer description still, even less overhead: an AGI with some of the subroutines involved already (conveniently) baked into its architecture. Once all the subroutines are “baked into its architecture” you just have the algorithm “predict accurately” + “treacherous turn”. And in this form, that has a longer description than just “predict accurately”.

• Once all the subroutines are “baked into its architecture” you just have the algorithm “predict accurately” + “treacherous turn”

You only have to bake in the innermost part of one loop in order to get almost all the computational savings.

• I’ve made a case that the two endpoints in the trade-off are not problematic. I’ve argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing “predict accurately” and “treacherous turn” all at once. This goes back to the general principle I proposed above: “The more general a system is, the less well it can do any particular task.” The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that could be native to an AGI that would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call], sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.

• (I actually have a more basic confusion; started a new thread.)

• If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use-after-free in physics :) ), and guarantees the aliens have zero slowdown.

For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn’t get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I’m with Wei Dai that it probably doesn’t rule out simpler scientists.

• Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them.

That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?

all you have to do is insert one use-after-free in physics

I don’t understand this sentence.

• That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?

I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

I don’t understand this sentence.

If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.

• I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

Why is it a desirable property? I’m not seeing why it would be bad to choose a UTM that doesn’t have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?

• I just mean: “universality” in the sense of a UTM isn’t a sufficient property when defining the speed prior; the analogous property of the UTM is something more like “you can run an arbitrary Turing machine without too much slowdown.” Of course that’s not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).

I agree that it would be fine to sacrifice this property if it was helpful for safety.

Each world-model is a Turing machine, whose prior relates to the Kolmogorov complexity (on some universal Turing machine) of the description of the Turing machine—all the transition rules, and whatnot. Usually, this would be isomorphic (within a constant), but since we’re considering speed, programs actually aren’t simulated on a UTM.

• What happens if β is too small?

Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.

B) even if it’s true, to take advantage of it would seem to require fine-tuning β, and I don’t see how to do that, given that trial and error wouldn’t be safe.

Fine-tuning from both sides isn’t safe. Approach from below.

• Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.

I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world-model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn’t increase with each episode, whereas the accuracy advantage gets accumulated across episodes.

Assuming that the most accurate world-model is an exponential-time quantum simulation, that’s what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster-than-exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increases? If so, I’m no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world-model gets more accurate, and I’m not sure why the reason BoMAI might not be benign at high β couldn’t also apply at low β (if we run it for a long enough time).

(If you’re going to discuss all this in your “longer reply”, I’m fine with waiting for it.)
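A toy numerical sketch of this dynamic (all weights and numbers here are invented for illustration; this is not BoMAI’s actual posterior): the speed penalty is paid once per episode at a fixed rate, while a per-episode accuracy advantage compounds, so for any β > 0 the more accurate model eventually dominates.

```python
import math

beta = 1e-6  # hypothetical computation penalty base

def log_weight(log_prior, compute_steps, log_lik_per_episode, episodes):
    """Unnormalized log posterior weight of a world-model after some episodes:
    simplicity prior + fixed speed penalty + accumulated predictive accuracy."""
    return log_prior + compute_steps * math.log(beta) + episodes * log_lik_per_episode

# Cheap-but-crude model vs. an expensive-but-more-accurate one (toy numbers).
fast = lambda t: log_weight(-10.0, 10, -1.0, t)       # slightly worse each episode
slow = lambda t: log_weight(-50.0, 10_000, -0.9, t)   # huge one-time speed penalty

# The accurate model starts far behind, but its fixed penalty is eventually
# outweighed by the accumulating accuracy advantage.
crossover = next(t for t in range(10**7) if slow(t) > fast(t))
```

The qualitative point survives any choice of β: shrinking β pushes the crossover later but never removes it.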

• The longer reply will include an image that might help, but a couple other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of the Rejecting the Simple Memory-Based Lemma, which isn’t that long). The heuristic reason for why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is that while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don’t need concern for speed to dominate concern for accuracy; we need it to dominate concern for “simplicity” (on some reference machine).

• so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length

Yeah, I understand this part, but I’m not sure why, since the benign one can be extremely complex, the malign one can’t have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we’re going through many more different world-models as the number of episodes increases, that also gives malign world-models more chances to “win”? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we’ve been doing, and we need to figure out the actual math somehow.

• And since (with low β) we’re going through many more different world-models as the number of episodes increases, that also gives malign world-models more chances to “win”?

Check out the order of the quantifiers in the proofs. One works for all possibilities. If the quantifiers were in the other order, they couldn’t be trivially flipped, since the number of world-models is infinite, and the intuitive worry about malign world-models getting “more chances to win” would apply.

Let’s continue the conversation here, and this may be a good place to reference this comment.

• Fine-tuning from both sides isn’t safe. Approach from below.

Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is; otherwise, even if the band exists, you could overshoot it and end up in the unsafe region.

ETA: But the first question is: is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI”? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?

• Some visualizations which might help with this:

But then one needs to factor in “simplicity”, or the prior penalty from description length:

Note also that these are average effects; they are just for forming intuitions.

is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI” [after a reasonable number of episodes]?

This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:

Set β safely. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use less than some threshold number of computation steps per episode. Every episode, increase that threshold until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the “corresponding” benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), this schedule should be safe for some choice of the constants.

I need to think more carefully about what happens here, but I think the design space is large.
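A minimal toy of the schedule in that proposal (the numbers and the “cheapest eligible model wins” rule are illustrative assumptions, not the paper’s algorithm): keep β small, and instead raise a per-episode computation floor so that fast, crude world-models get excluded over time.

```python
# Hypothetical world-models: per-episode computation steps and a "skill" level.
models = [
    {"steps": 10, "skill": 0.2},    # fast, crude
    {"steps": 100, "skill": 0.6},   # slower, better
    {"steps": 1000, "skill": 0.9},  # slowest, human-level (by assumption)
]

def chosen(models, floor):
    """With a steep speed penalty, the cheapest model above the floor wins."""
    eligible = [m for m in models if m["steps"] >= floor]
    return min(eligible, key=lambda m: m["steps"])

# Raise the floor each episode until performance reaches the target.
floor = 1
while chosen(models, floor)["skill"] < 0.9:
    floor *= 2
```

The point of approaching from below is visible here: the floor only ever excludes models that are *too fast*, so it never forces the posterior onto a model slower (and potentially more dangerous) than needed.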

• Fixed your images. You have to press space after you use that syntax for the images to actually get fetched and displayed. Sorry for the confusion.

• Thanks!

• Longer response coming. On hold for now.

• Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU-maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility?

All the better. They don’t know what universe is using the prior. What are the odds our universe is the single most susceptible universe to being taken over?

I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all-in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes. And I was assuming the worst in assuming they’d be so good at this, they’d be able to do this for a large number of universes at once.

To clarify: diminishing marginal returns of takeover probability of a universe with respect to the weight you give that universe in your prior that you pipe to output.

• I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all-in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes.

There are massive diminishing marginal returns; in a naive model you’d expect essentially *every* universe to get predicted in this way.

But Wei Dai’s basic point still stands. The speed prior isn’t the actual prior over universes (i.e., it doesn’t reflect the real degree of moral concern that we’d use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, and (b) using engineering reasoning to make the utility-maximizing predictions, given that faster predictions are going to be given more weight.

(You don’t really need this to run Wei Dai’s argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)

• I think what you’re saying is that the following don’t commute:

“real prior” (universal prior) + speed update + anthropic update + can-do update + worth-doing update

compared to

universal prior + anthropic update + can-do update + worth-doing update + speed update

When the universal prior is next to the speed update, this is naturally conceptualized as a speed prior, and when it’s last, it is naturally conceptualized as “engineering reasoning” identifying faster predictions.

I’m happy to go with the second order if you prefer, in part because I think they do commute—all these updates just change the weights on measures that get mixed together to be piped to output during the “predict accurately” phase.

• If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?

You have a countable list of options. What choice do you have but to favor the ones at the beginning? Any (computable) permutation of the things on the list just corresponds to a different choice of universal Turing machine, for which a “short” algorithm just means it’s earlier on the list.

And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.

• And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.

True, but I’m arguing that this computable algorithm is just the alien itself, trying to answer the question “how can I better predict this richer world in order to take it over?” If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, instead of saying that the alien is thinking about how to answer “how can I better predict this richer world in order to take it over?” Actually, if this alien were sampling from the speed prior, then it would no longer be the shortest/fastest algorithm to come up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.

• I’m having a hard time following this. Can you expand on this, without using “sequence of increasingly better algorithms”? I keep translating that to “algorithm.”

• The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

• You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Using “reasoning” to pick which one to favor is just picking the first one in some new order. (And not really picking the first one, just giving earlier ones preferential treatment.) In general, if you have an infinite list of possibilities and you want to pick the one that maximizes some property, this is not a procedure that halts. I’m agnostic about what order you use (for now), but one can’t escape the necessity to introduce the arbitrary criterion of “valuing” earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor “simplicity” less, but you can’t make that number infinity.
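A toy version of that last point (the parameters are invented): any prior over a countable list of algorithms must favor “earlier” entries eventually; flattening it only postpones where the decay kicks in.

```python
def geometric_prior(k):
    """Prior over indices 0, 1, 2, ... putting half its total mass on the first k."""
    r = 0.5 ** (1.0 / k)                 # decay rate chosen so P(index < k) = 1/2
    return lambda i: (1.0 - r) * r ** i  # geometric distribution over indices

p_narrow = geometric_prior(1_000)        # half the mass on the first thousand
p_flat = geometric_prior(1_000_000_000)  # half the mass on the first billion

# Both priors still vanish in the tail; the flatter one just starts lower
# and decays later. Neither can weight infinitely many entries equally.
```

The flatter prior favors “simplicity” less, but it still imposes an ordering: mass must sum to 1, so the tail must go to zero.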

• Using “reasoning” to pick which one to favor is just picking the first one in some new order.

Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we’ll update in favor of the aliens and away from the rest of the speed prior.

one can’t escape the necessity to introduce the arbitrary criterion of “valuing” earlier things on the list

Probably some miscommunication here. No one is trying to object to the arbitrariness; we’re just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.

(They may still not be able to if the penalty for computation is sufficiently steep—e.g., if you penalize based on circuit complexity, so that the model might as well bake in everything that doesn’t depend on the particular input at hand. I think it’s an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)

• They may still not be able to if the penalty for computation is sufficiently steep

It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.

• The only other approach I can think of is trying to do the anthropic update ourselves.

If you haven’t seen Jessica’s post in this area, it’s worth taking a quick look.

• The only point I was trying to respond to in the grandparent of this comment was your comment:

The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens’ speed prior.

I can’t respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.

My point here was just that there’s no way to not trade off shortness with speed, so no one has a comparative advantage on us as a result of the claim “The fast algorithms to predict our physics just aren’t going to be the shortest ones.”

The “after figuring out physics” part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that’s exactly what we’re doing too. The prior they start with can’t be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their “figuring out physics”. Of course I don’t object to that—our speed prior’s posterior will be much better than the prior too.

• but that’s exactly what we’re doing too

It seems totally different from what we’re doing; I may be misunderstanding the analogy.

Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.

The speed prior gives this model a very low probability because it’s a complicated model. But “do science” gives this model a high probability, because it’s a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren’t trading off “shortness” for speed—we are trading off “looks good according to reasoning” for speed. Yes, they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

Of course the speed prior also includes a hypothesis that does “science with the goal of making good predictions,” and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).

In other words:

Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and once they’ve done that they can immediately deduce the N-bit approximation. So it sure seems like they’ll beat the speed prior. Are you objecting to this argument?

(In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy, but that doesn’t help here since we are just trying to say that the “do science” strategy dominates the speed prior.)
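The n-vs-N bit-counting above can be written as a back-of-the-envelope check (all lengths here are hypothetical placeholders, not estimates): under a simplicity prior, log prior mass is just minus the description length in bits.

```python
# Hypothetical description lengths, with N >> n as in the argument above.
n = 10**3          # bits to specify the underlying physics
N = 10**6          # bits to directly specify a fast, reasonable approximation
overhead = 10**4   # assumed O(1) bits for the "do science" machinery itself

log2_direct = -N               # the speed prior "working in the intended way"
log2_science = -(n + overhead)  # specify physics, then reason out the N bits

advantage_bits = log2_science - log2_direct  # how far "do science" wins by
```

As long as the constant overhead stays small next to N − n, the “do science” hypothesis gets overwhelmingly more prior mass than the direct specification, which is exactly why the argument says that hypothesis dominates the posterior.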

• I’m not sure which of these arguments will be more convincing to you.

Yes, they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

This is what I was trying to contextualize above. This is an unfair comparison. You’re imagining that the “reasoning”-based order gets to see past observations, and the “shortness”-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term “order” is confusing us, because we both know it’s a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before it has done the reasoning, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.

Here’s another way to look at it.

The speed prior is doing the brute-force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.

In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy

Exactly. But this does help, for reasons I describe here. The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy. (I initially typed that as “tern”, which will now be the image I have of a treacherous turn.)

• a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior

Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)

The reasoning in this case is not a Bayesian update. It’s evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.

The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy.

I can reply in that thread.

I think the only good arguments for this are in the limit where you don’t care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics; for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.

• Given a world-model ν, which takes k computation steps per episode, let νlog be the world-model that best approximates ν (in the sense of KL divergence) using only logk computation steps. νlog is at least as good as the “reasoning-based replacement” of ν.

The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than logk computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

νlog, νlogε, and a host of other approaches are prominently represented in the speed prior.

If this is what you call “the speed prior doing reasoning,” so be it, but the relevance of that terminology only comes in when you claim that “once you’ve encoded ‘doing reasoning’, you’ve basically already written the code for it to do the treachery that naturally comes along with that.” That sense of “reasoning” really only applies, I think, to the case where our code is simulating aliens or an AGI.

• (ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

Given a world-model ν, which takes k computation steps per episode, let νlog be the world-model that best approximates ν (in the sense of KL divergence) using only logk computation steps. νlog is at least as good as the “reasoning-based replacement” of ν.
The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute-force search, at least for values of the computation penalty large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than logk computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ν), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

• Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

Well, one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn’t give you a completely specified algorithm). But when I wrote that, I didn’t have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple “ideal” approach which consequentialists “discover” or approximately discover. A powerful heuristic search would be one such algorithm, I think.

(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

I don’t think anything here changes if K(x) were replaced with S(x) (if that was what you misunderstood).

If you would like to contribute, please comment with the amount. If you have venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.

• Comment thread: concerns with “the box”

• Just obvious and mundane concerns:

You might want to make clearer that “As long as the door is closed, information cannot leave the room” isn’t an assumption but a requirement of the setup. I.e., that you’re not assuming, based on your description, that opening the door is the only means for an operator to get information out; you’re assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).

• Comment thread: concerns with Assumption 4

• Comment thread: concerns with Assumption 3

• Wei Dai’s response to this comment represents a concern with Assumption 3.

• Comment thread: concerns with Assumption 2

• [Quite possibly I’m confused, but in case I’m not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator’s mind, since this bears directly on its reward.
Better understanding the operator’s mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator’s environment and actions after he leaves the room.

Here this isn’t done to understand the implications of his actions (which can’t affect the episode); it’s done to better understand his mind (which can).

In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn’t be slower than a benign model, if it’s useful for that benign model to simulate the future too.
So either I’m confused, or the justification for the assumption isn’t valid. Hopefully the former :).

If I’m right, then what you seem to need is an assumption that simulating the outside world’s future can’t be helpful in the AI’s prediction of its reward. To me, this seems like major hand-waving territory.

• I wouldn’t really use the term “incentives” to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you’ll see in the algorithm that it does simulate the outside world.

I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it’s following the “wrong path” may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful—there is a shorter computation that accomplishes the same thing.

I would love it if this assumption could look like: “the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual” and make Assumption 2 into a lemma that follows from it, but I couldn’t figure out how to formalize this.

• Ah yes—I was confusing myself at some point between forming and using a model (hence “incentives”).

I think you’re correct that “perfectly useful” isn’t going to happen. I’m happy to be wrong.

“the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual”

I don’t think you’d be able to formalize this in general, since I imagine it’s not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a sub-branch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you’d still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)

To rule this kind of thing out, I think you’d need more specific assumptions (e.g. physics-based).