# Asymptotically Unambitious AGI

Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:

• The Title. I suspect the agent is unambitious for its entire lifetime, but the title says “asymptotically” because that’s what I’ve shown formally. Indeed, I suspect the agent is benign for its entire lifetime, but the title says “unambitious” because that’s what I’ve shown formally. (See the section “Concerns with Task-Completion” for an informal argument going from unambitious → benign).

• The Useless Computation Assumption. I’ve made it a slightly stronger assumption. The original version is technically correct, but the setting is tricky if the weak version of the assumption is true but the strong version isn’t. This stronger assumption also simplifies the argument.

• The Prior. Rather than having to do with the description length of the Turing machine simulating the environment, it has to do with the number of states in the Turing machine. This was in response to Paul’s point that the finite-time behavior of the original version is really weird. This also makes the Natural Prior Assumption (now called the No Grue Assumption) a bit easier to assess.

Original Post:

We present an algorithm, then show (given four assumptions) that in the limit, it is human-level intelligent and benign.

Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don’t yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or revoke it before it is printed.

I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments’ quality. If one commenter finds an error, and another commenter tinkers with the setup or tinkers with the assumptions in order to correct it, then I expect both comments will receive a similar prize (if those comments are at the level of prize-winning, and neither person is me). If others would like to donate to the prize pool, I’ll provide a comment that you can reply to.

To organize the conversation, I’ll start some comment threads below:

• Positive feedback

• General Concerns/Confusions

• Minor Concerns

• Concerns with Assumption 1

• Concerns with Assumption 2

• Concerns with Assumption 3

• Concerns with Assumption 4

• Concerns with “the box”

• Adding to the prize pool

• Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

• Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20

• A plausible dangerous side-effect. Wei Dai. Link. $40

• Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120

• Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

• Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90

• Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

• Simulating suffering agents. cousin_it. Link. $20

• Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:

4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

• If I have a great model of physics in hand (and I’m basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

• I know I don’t prove it, but I think this agent would be vastly superhuman, since it approaches Bayes-optimal reasoning with respect to its observations. (“Approaches” because MAP → Bayes).

For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I’m unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.

That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

• For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I’m unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.

I’m keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

(Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about “why do you need this memory stuff?”)

• That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

Definitely agree. I don’t think it’s the case that a shift to super uncompetitiveness is actually an “ingredient” to benignity, but my only discussion of that so far is in the conclusion: “We can only offer informal claims regarding what happens before BoMAI is definitely benign...”

• That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

Surely that just depends on how long you give them to think. (See also HCH.)

• By competitiveness, I meant usefulness per unit computation.

• The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I’d bet against—the humans could do this particular brute force search, in which case you’d have a tie, but they’d probably do something smarter.

• I obviously haven’t solved the Tractable General Intelligence problem. The question is whether this is a tractable/competitive framework. So expectimax planning would naturally get replaced with a Monte-Carlo tree search, or some better approach we haven’t thought of. And I’ll message you privately about a more tractable approach to identifying a maximum a posteriori world-model from a countable class (I don’t assign a very high probability to it being a hugely important capabilities idea, since those aren’t just lying around, but it’s more than 1%).

It will be important, when considering any of these approximations, to evaluate whether they break benignity (most plausibly, I think, by introducing a new attack surface for optimization daemons). But I feel fine about deferring that research for the time being, so I defined BoMAI as doing expectimax planning instead of MCTS.
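For concreteness, here is a minimal sketch of the brute-force expectimax planning being discussed, before any MCTS-style approximation. The function names and the toy reward model are mine, not from the paper; the point is just that the cost is len(actions) ** episode_length model evaluations.

```python
import itertools

def expectimax_plan(actions, episode_length, expected_return, model):
    """Brute-force expectimax over one episode: evaluate the model's
    expected return for every action sequence and return the argmax.
    This performs len(actions) ** episode_length model evaluations,
    which is the exponential cost discussed in this thread."""
    best_seq, best_value = None, float("-inf")
    for seq in itertools.product(actions, repeat=episode_length):
        value = expected_return(model, seq)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq

# Toy stand-in for "expected return under the MAP world-model":
# here, reward is simply the number of 1-actions taken.
toy_return = lambda model, seq: sum(seq)

print(expectimax_plan([0, 1], 4, toy_return, model=None))  # (1, 1, 1, 1)
```

An MCTS-style replacement would sample promising branches of this tree instead of enumerating all of them, which is where the benignity-of-approximations question above comes in.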

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

• I’m sympathetic to this picture, though I’d probably be inclined to try to model it explicitly—by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think “just write down the algorithm, and be happier if it looks like a ‘normal’ algorithm” is an OK starting point though.

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.” If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I’m going to quickly be left behind.

It’s certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit “nothing goes irreversibly wrong at any finite time” assumption).

• Starting a new thread on this:

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

here.

• From Paul:

I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one is that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resemble good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.

My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is because I pared it down to a minimum viable product.

Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*

I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.

I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds, whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
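To illustrate why the scoring rule matters here (a toy sketch of my own, with made-up numbers, not from the paper): under a 0/1 “obvious” score and a low base rate of interventions that work, every sufficiently pessimistic report earns the same expected score, so “just say it’s not going to work” is optimal and uninformative. A proper scoring rule like the log score is instead maximized in expectation by honest probabilities.

```python
import math

def accuracy_score(p, outcome):
    # "obvious" scoring rule: 1 if the yes/no answer matches the outcome
    return 1.0 if (p >= 0.5) == outcome else 0.0

def log_score(p, outcome):
    # proper scoring rule: log-probability assigned to what happened
    return math.log(p if outcome else 1.0 - p)

def expected_score(score, report, base_rate):
    # expected score of reporting probability `report` when the
    # question is true with probability `base_rate`
    return base_rate * score(report, True) + (1 - base_rate) * score(report, False)

base_rate = 0.1  # suppose most proposed interventions don't work

# Under the obvious rule, all reports below 0.5 ("it won't work") score
# identically, so there is no incentive to be informative:
print(expected_score(accuracy_score, 0.01, base_rate))  # 0.9
print(expected_score(accuracy_score, 0.49, base_rate))  # 0.9

# Under the log score, honest calibration beats reflexive pessimism:
print(expected_score(log_score, base_rate, base_rate) >
      expected_score(log_score, 0.01, base_rate))       # True
```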

I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.

* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.

• Here is an old post of mine on the hope that “computationally simplest model describing the box” is actually a physical model of the box. I’m less optimistic than you are, but it’s certainly plausible.

From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I’d bet against at 1:1 odds, but not 1:2 odds.

• I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models?

If I’m understanding correctly, and I’m very unsure that I am, you’re comparing the model-based approach of [learn the environment then do good planning] with [learn to imitate a policy]. (Note that any iterated approach to improving a policy requires learning the environment, so I don’t see what “training cognitive policies” could mean besides imitation learning.) And the question you’re wondering about is whether optimization daemons become easier to avoid when following the [learn the environment then do good planning] approach.

Imitation learning is about prediction just as much as predictive models are—predictive models imitate the environment. So I suppose optimization daemons are about equally likely to appear?

My real answer, though, is that I’m not sure, but vanilla imitation learning isn’t competitive.

But I suspect I’ve misunderstood your question.

• the hope that “computationally simplest model describing the box” is actually a physical model of the box

I don’t actually rely on this assumption, although it underpins the intuition behind Assumption 2.

• I agree that you don’t rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don’t need to care about any of the considerations of the kind I was raising in my post.

• Given that you are taking limits, I don’t see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:

• You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some such that if you ran steps, with everything working as intended and in particular with no memory corruption, then you would get “benign” behavior).

• Any time the MAP model makes a different prediction from the intended model, it loses some likelihood. So this can only happen finitely many times in any possible world. Just take to be after the last time it happens w.h.p.

What’s wrong with this?

• Notational note: I use to denote the episode when BoMAI becomes demonstrably benign and for something else.

Any time the MAP model makes a different prediction from the intended model, it loses some likelihood.

Any time any model makes a different on-policy prediction from the intended model, it loses some likelihood (in expectation). The off-policy predictions don’t get tested. Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆. The approach here is to set it up so that world-models like ν† either start with a lower prior, or else eventually halt when they exhaust their computation budget.

• Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆.

I agree with that, but if they are always making the same on-policy prediction it doesn’t matter what happens to their relative probability (modulo exploration). The agent can’t act on an incentive to corrupt memory infinitely often, because each time requires the models making a different prediction on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode. Agree/disagree?
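As a toy illustration of this point (my construction, not from the paper): suppose observations are always 1, the intended model predicts P(1)=1, and a memory-based model predicts P(1)=0.5 exactly on the episodes where it disagrees on-policy. The bad model’s posterior relative to the intended model halves at each disagreement and is untouched otherwise, so it can only afford finitely many disagreements, no matter how long it runs.

```python
def relative_posterior(disagreement_steps, total_steps):
    """Likelihood ratio (bad model / intended model) in the toy setting:
    each on-policy disagreement multiplies the ratio by 0.5; steps where
    the two models agree on-policy leave the ratio unchanged, no matter
    how much they differ off-policy."""
    ratio = 1.0
    for t in range(total_steps):
        if t in disagreement_steps:
            ratio *= 0.5
    return ratio

# Three disagreements cost a factor of 8, whether they happen within 10
# steps or 10,000 -- the agreement steps are free:
print(relative_posterior({0, 3, 7}, 10))      # 0.125
print(relative_posterior({0, 3, 7}, 10_000))  # 0.125
```

This is exactly why off-policy disagreement, which costs nothing in this ratio, has to be handled by the prior or the computation budget rather than by Bayesian updating.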

(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can’t discourage exploration. And if they don’t agree on the exploration distribution, then the bad model will eventually get tested.)

• Ah, I see what you’re saying.

I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

Also, you can add another parameter to BoMAI where you just have the human explorer explore for the first E episodes. The in the Eventual Benignity Theorem can be thought of as the max of i’ and i″. i’ comes from the in Lemma 1 (Rejecting the Simple Memory-Based). i″ comes from the point in time when is -accurate on policy, which renders Lemma 3 applicable. (And Lemma 2 always applies). My initial thought was to set E so that the human explorer is exploring for the whole time when the MAP world-model was not necessarily benign. This works for i’. E can just be set to be greater than i’. The thing it doesn’t work for is i″. If you increase E, the value of i″ goes up as well.

So in fact, if you set E large enough, the first time BoMAI controls the episode, it will be benign. Then, there is a period where it might not be benign. However, from that point on, the only “way” for a world-model to be malign is by being worse than -inaccurate on-policy, because Lemmas 1 and 2 have already kicked in, and if it were -accurate on-policy, Lemma 3 would kick in as well. The first point to make about this is that in this regime, benignity comes in tandem with intelligence—it has to be confused to be dangerous (like a self-driving car). The second point is: I can’t come up with an example of a world-model which is plausibly maximum a posteriori in this interval of time, and which is plausibly dangerous (for what that’s worth; and I don’t like to assume it’s worth much because it took me months to notice ).

• I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

I think my point is this:

• The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)

• You probably don’t need the memory trick to establish the theorem itself.

• Even with the memory trick, I’m not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble—the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.

• the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia

This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.

• Upvoted for interesting experiments with bounties and comment formatting.

• I like that you emphasize and discuss the need for the AI to not believe that it can influence the outside world, and cleanly distinguish this from it actually being able to influence the outside world. I wonder if you can get any of the benefits here without needing the box to actually work (i.e. can you just get the agent to believe it does? and is that enough for some form/degree of benignity?)

1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?

2. I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?

3. The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?

4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”

(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also I haven’t read most of the math yet and may have more comments and questions once I understand the motivations and math better.)

• 4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...

When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.

• 1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?

Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe from insatiable curiosity: the agent has to try killing everyone just to check to make sure it’s not a weird cheat code.

• 3. The system seems limited to answering questions that the human operator can correctly evaluate...

Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way).

One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.

I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.

REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”

• I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.

This seems useful if you could get around the mind hacking problem, but how would you do that?

On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”

I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.

• This seems useful if you could get around the mind hacking problem, but how would you do that?

On second thought, (even assuming away the mind hacking problem) if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.
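To put rough numbers on this exponential search (a back-of-the-envelope sketch; the branching factor and depth are hypothetical): if the operator must solicit each answer, counter-argument, counter-counter-argument, etc. in its own episode, the episode count is the number of nodes in the argument tree.

```python
def episodes_to_search(branching, depth):
    """Nodes in a complete argument tree: one root answer, with
    `branching` counterarguments per node, down to the given depth --
    one episode per node if the operator explores the tree manually."""
    return sum(branching ** d for d in range(depth + 1))

print(episodes_to_search(branching=3, depth=4))  # 121 episodes
print(episodes_to_search(branching=3, depth=8))  # 9841 episodes
```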

ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?

• ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?

No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.

• I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?

• Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?

Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.

• I sus­pect that AI Safety via De­bate could be be­nign for cer­tain de­ci­sions (like whether to re­lease an AI) if we were to weight the de­bate more to­wards the safer op­tion.

• Do you have thoughts on this?

Either de­bater is in­cen­tivized to take ac­tions that get the op­er­a­tor to cre­ate an­other ar­tifi­cial agent that takes over the world, re­places the op­er­a­tor, and set­tles the de­bate in fa­vor of the de­bater in ques­tion.
• you’re es­sen­tially man­u­ally search­ing an ex­po­nen­tially large tree of pos­si­ble ar­gu­ments, coun­ter­ar­gu­ments, counter-coun­ter­ar­gu­ments, and so on

I ex­pect the hu­man op­er­a­tor mod­er­at­ing this de­bate would get pretty good at think­ing about AGI safety, and start to be­come no­tice­ably bet­ter at dis­miss­ing bad rea­son­ing than good rea­son­ing, at which point BoMAI would find the pro­duc­tion of cor­rect rea­son­ing a good heuris­tic for seem­ing con­vinc­ing.

… but yes, it is still ex­po­nen­tial (ex­po­nen­tial in what, ex­actly? maybe the num­ber of con­cepts we have han­dles for?); this com­ment is the real an­swer to your ques­tion.

• I ex­pect the hu­man op­er­a­tor mod­er­at­ing this de­bate would get pretty good at think­ing about AGI safety, and start to be­come no­tice­ably bet­ter at dis­miss­ing bad rea­son­ing than good rea­son­ing, at which point BoMAI would find the pro­duc­tion of cor­rect rea­son­ing a good heuris­tic for seem­ing con­vinc­ing.

Alter­na­tively, the hu­man might have a lot of ad­ver­sar­ial ex­am­ples and the de­bate be­comes an ex­er­cise in ex­plor­ing all those ad­ver­sar­ial ex­am­ples. I’m not sure how to tell what will re­ally hap­pen short of ac­tu­ally hav­ing a su­per­in­tel­li­gent AI to test with.

• I don’t know how this would work in terms of your setup. The most ob­vi­ous way would seem to re­quire the two agents to simu­late each other, which would be im­pos­si­ble, and I’m not sure what else you might have in mind.

You’re right (see the redac­tion). Why Wei is right. Here’s an un­pol­ished idea though: they could do some­thing like min­i­max. In­stead of simu­lat­ing the other agent, they could model the en­vi­ron­ment as re­spond­ing to a pair of ac­tions. For in­fer­ence, they would have the his­tory of their op­po­nent’s ac­tions as well, and for plan­ning, they could pick their ac­tion to max­i­mize their ob­jec­tive as­sum­ing the other agent’s ac­tions are max­i­mally in­con­ve­nient.

• So you ba­si­cally have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via de­bate”, namely that it seems hard to pre­dict what hap­pens when you have su­per­in­tel­li­gent AIs play a zero-sum game with a hu­man as the judge.

• With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack, and that if the judge keeps paying attention to what B says (i.e., reads any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B's previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can't ask for any counter-arguments from B without risking being mind hacked by B.

(I re­al­ize this is a prob­lem in “AI Safety via de­bate” as well, but I’m ask­ing you since you’re here and Ge­offrey Irv­ing isn’t. :)

• *but A could concoct a story … counter-arguments from B … mind hacked by B, right?

I think the main con­tention of their pa­per is that it is harder to mys­tify a judge than it is to pierce through some­one else mys­tify­ing a judge, so this shouldn’t be a prob­lem.

That said, here’s one pos­si­bil­ity: if A ever sug­gests that you don’t read more out­put from B, don’t read any­thing more from B, then flip coins to give A a 25% chance of vic­tory.

• *but A could concoct a story … counter-arguments from B … mind hacked by B, right?

Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for point­ing it out.

I think the main con­tention of their pa­per is that it is harder to mys­tify a judge than it is to pierce through some­one else mys­tify­ing a judge, so this shouldn’t be a prob­lem.

I don’t un­der­stand how the former im­plies the lat­ter. As­sum­ing the former is true (and it seems like a big as­sump­tion), why can’t what I sug­gested still hap­pen?

That said, here’s one pos­si­bil­ity: if A ever sug­gests that you don’t read more out­put from B, don’t read any­thing more from B, then flip coins to give A a 25% chance of vic­tory.

But what about the case where B is ac­tu­ally try­ing to mind hack the judge? If you always give A a 25% chance of vic­tory for sug­gest­ing or im­ply­ing that you shouldn’t read more out­put from B, then mind hack­ing be­comes a (mostly) win­ning strat­egy, since a player gets a 75% chance of vic­tory from mind hack­ing even if the other side suc­cess­fully con­vinces the judge that they’re try­ing to mind hack the judge. The equil­ibrium might then con­sist of a race to see who can mind hack the judge first, or (if one side has >75% chance of win­ning such a race due to first-mover or sec­ond-mover ad­van­tage) one side try­ing to mind hack the judge, get­ting blocked by the other side, and still get­ting 75% vic­tory.
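This incentive argument can be made concrete with a toy payoff calculation (all numbers here are hypothetical and for illustration only, not from the paper): if an accusation always triggers the coin flip, the accused player still wins 75% of the time, so attempting a mind hack dominates honest play whenever honest play wins less often than that.

```python
# Toy payoff check for the proposed flip-a-coin rule (hypothetical numbers).
# If A's accusation always triggers the coin flip, the accused player B
# still wins with probability 0.75, even when the hack attempt is caught.

def accused_win_prob(accuser_coin_bias=0.25):
    # The accuser (A) wins the coin flip with probability 0.25,
    # so the accused (B) wins with the remaining probability.
    return 1.0 - accuser_coin_bias

honest_win = 0.5                 # symmetric debaters playing honestly
hack_win = accused_win_prob()    # 0.75, even if the hack is flagged

# Under this rule, attempting a mind hack dominates honest play.
assert hack_win > honest_win
```

This is only the one-shot comparison; it ignores the follow-on dynamics (races to hack first) described above, but it shows why a fixed 25% award is not a deterrent by itself.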

• As­sum­ing the former is true (and it seems like a big as­sump­tion), why can’t what I sug­gested still hap­pen?

If the as­sump­tion is true, we could de­mand that A use their words, and counter us be­ing mind-hacked by pok­ing holes in what B is say­ing rather than de­mand­ing we stop listen­ing to B. And if A is able to con­vince us that B was mind-hack­ing, even af­ter some more back and forth, B will be pun­ished for that.

So ac­tu­ally I framed my point above wrong: “de­mand­ing that A use their words” could look like the pro­to­col I de­scribe; it is not some­thing that would work in­de­pen­dently of the as­sump­tion that it is eas­ier to deflate an at­tempted mind-hack­ing than it is to mind-hack (with an equal amount of in­tel­li­gence/​re­sources).

But your origi­nal point was “why doesn’t A just claim B is mind-hack­ing” not “why doesn’t B just mind-hack”? The an­swer to that point was “de­mand A use their words rather than ne­go­ti­ate an end to the con­ver­sa­tion” or more mod­er­ately, “75%-de­mand that A do this.”

• If the as­sump­tion is true, we could de­mand that A use their words, and counter us be­ing mind-hacked by pok­ing holes in what B is say­ing rather than de­mand­ing we stop listen­ing to B. And if A is able to con­vince us that B was mind-hack­ing, even af­ter some more back and forth, B will be pun­ished for that.

Oh, I see, I didn’t un­der­stand “it is harder to mys­tify a judge than it is to pierce through some­one else mys­tify­ing a judge” cor­rectly. So this as­sump­tion ba­si­cally rules out a large class of pos­si­ble vuln­er­a­bil­ities in the judge, right? For ex­am­ple, if the judge had the equiv­a­lent of a buffer overflow bug in a net­work stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s at­tack and stop it with its words if the judge keeps listen­ing to B (and B was ac­tu­ally at­tack­ing).

I don’t think the “AI safety via de­bate” pa­per ac­tu­ally makes ar­gu­ments for this as­sump­tion (at least I couldn’t find where it does). Do you have rea­sons to think it’s true, or ideas for how to ver­ify that it’s true, short of putting a hu­man in a BoMAI?

• Yeah… I don’t have much to add here. Let’s keep think­ing about this. I won­der if Paul is more bullish on the premise that “it is harder to mys­tify a judge than it is to pierce through some­one else mys­tify­ing a judge” than I am?

Re­call that this idea was to avoid

es­sen­tially man­u­ally search­ing an ex­po­nen­tially large tree of pos­si­ble ar­gu­ments, coun­ter­ar­gu­ments, counter-coun­ter­ar­gu­ments, and so on

If it also re­duces the risk of op­er­a­tor-de­vo­tion, and it might well do that (be­cause a pow­er­ful ad­ver­sary is op­posed to that), that wasn’t origi­nally what brought us here.

• This seems use­ful if you could get around the mind hack­ing prob­lem, but how would you do that?

If the person leaves having been hypnotized, since it's not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2; in either event, Operator 2 could end the episode immediately. (That would be Operator 2's only ability.) More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.

A bit of a side note: I’m cu­ri­ous what odds you give to hyp­no­sis (or some­thing like it) be­ing the best way to get op­ti­mal re­ward for such a task (in the vanilla setup).

• In­stead of hyp­no­sis, I’m more wor­ried about the AI talk­ing the op­er­a­tor into some kind of world view that im­plies they should be re­ally gen­er­ous to the AI (i.e., give it max re­wards), or give some se­quence of an­swers that feel ex­tremely in­sight­ful (and invit­ing fur­ther ques­tions/​an­swers in the same vein). And then the op­er­a­tor might feel a de­sire af­ter­wards to spread this world view or se­quence of an­swers to oth­ers (even though, again, this wasn’t op­ti­mized for by the AI).

If you try to solve the mind hack­ing prob­lem iter­a­tively, you’re more likely to find a way to get use­ful an­swers out of the sys­tem, but you’re also more likely to hit upon an ex­is­ten­tially catas­trophic form of mind hack­ing.

A bit of a side note: I’m cu­ri­ous what odds you give to hyp­no­sis (or some­thing like it) be­ing the best way to get op­ti­mal re­ward for such a task (in the vanilla setup).

I guess it de­pends on how many in­ter­ac­tions per epi­sode and how long each an­swer can be. I would say >.9 prob­a­bil­ity that hyp­no­sis or some­thing like what I de­scribed above is op­ti­mal if they are both long enough. So you could try to make this sys­tem safer by limit­ing these num­bers, which is also talked about in “AI Safety via De­bate” if I re­mem­ber cor­rectly.

• the op­er­a­tor might feel a de­sire af­ter­wards to spread this world view

It is plau­si­ble to me that there is se­lec­tion pres­sure to make the op­er­a­tor “de­voted” in some sense to BoMAI. But most peo­ple with a unique mo­tive are not able to then take over the world or cause an ex­tinc­tion event. And BoMAI has no in­cen­tive to help the op­er­a­tor gain those skills.

Just to step back and frame this con­ver­sa­tion, we’re dis­cussing the is­sue of out­side-world side-effects that cor­re­late with in-the-box in­stru­men­tal goals. Im­plicit in the claim of the pa­per is that tech­nolog­i­cal progress is an out­side-world cor­re­late of op­er­a­tor-satis­fac­tion, an in-the-box in­stru­men­tal goal. I agree it is very much worth con­sid­er­ing plau­si­ble path­ways to nega­tive con­se­quences, but I think the de­fault an­swer is that with op­ti­miza­tion pres­sure, sur­pris­ing things hap­pen, but with­out op­ti­miza­tion pres­sure, sur­pris­ing things don’t. (Again, that is just the de­fault be­fore we look closer). This doesn’t mean we should be to­tally skep­ti­cal about the idea of ex­pect­ing tech­nolog­i­cal progress or long-term op­er­a­tor de­vo­tion, but it does con­tribute to my be­ing less con­cerned that some­thing as sur­pris­ing as ex­tinc­tion would arise from this.

• Yeah, the threat model I have in mind isn’t the op­er­a­tor tak­ing over the world or caus­ing an ex­tinc­tion event, but spread­ing bad but ex­tremely per­sua­sive ideas that can dras­ti­cally cur­tail hu­man­ity’s po­ten­tial (which is part of the defi­ni­tion of “ex­is­ten­tial risk”). For ex­am­ple fulfilling our po­ten­tial may re­quire that the uni­verse even­tu­ally be con­trol­led mostly by agents that have man­aged to cor­rectly solve a num­ber of moral and philo­soph­i­cal prob­lems, and the spread of these bad ideas may pre­vent that from hap­pen­ing. See Some Thoughts on Me­taphilos­o­phy and the posts linked from there for more on this per­spec­tive.

• Let XX be the event in which: a viru­lent meme causes suffi­ciently many power-bro­kers to be­come en­trenched with ab­surd val­ues, such that we do not end up even satis­fic­ing The True Good.

Em­piri­cal anal­y­sis might not be use­less here in eval­u­at­ing the “sur­pris­ing­ness” of XX. I don’t think Chris­ti­an­ity makes the cut ei­ther for viru­lence or for in­com­pat­i­bil­ity with some satis­fac­tory level of The True Good.

I’m adding this not for you, but to clar­ify for the ca­sual reader: we both agree that a Su­per­in­tel­li­gence set­ting out to ac­com­plish XX would prob­a­bly suc­ceed; the ques­tion here is how likely this is to hap­pen by ac­ci­dent if a su­per­in­tel­li­gence tries to get a hu­man in a closed box to love it.

• but you’re also more likely to hit upon an ex­is­ten­tially catas­trophic form of mind hack­ing.

Can you ex­plain this?

• Sup­pose there are n forms of mind hack­ing that the AI could do, some of which are ex­is­ten­tially catas­trophic. If your plan is “Run this AI, and if the op­er­a­tor gets mind-hacked, stop and switch to an en­tirely differ­ent de­sign.” the like­li­hood of hit­ting upon an ex­is­ten­tially catas­trophic form of mind hack­ing is lower than if the plan is in­stead “Run this AI, and if the op­er­a­tor gets mind-hacked, tweak the AI de­sign to block that spe­cific form of mind hack­ing and try again. Re­peat un­til we get a use­ful an­swer.”
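A toy model of this point (all parameters made up for illustration): if each form of mind hacking the AI hits upon is catastrophic with some independent probability q, then stopping at the first hack exposes you to one draw, while patching and retrying exposes you to as many draws as it takes to get a useful answer.

```python
# Toy model with hypothetical numbers: each form of mind hacking the AI
# hits upon is existentially catastrophic with independent probability q.

q = 0.05   # chance a given hack is catastrophic (made up)
k = 10     # hacks encountered before getting a useful answer (made up)

p_stop_after_first = q            # plan: switch designs after one hack
p_iterate = 1 - (1 - q) ** k      # plan: patch the AI and retry k times

# Iterating multiplies the exposure to a catastrophic form of hacking.
assert p_iterate > p_stop_after_first
```

With these made-up numbers the iterative plan carries roughly a 40% chance of hitting a catastrophic form, versus 5% for stopping after the first hack.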

• Hm. This doesn’t seem right to me. My ap­proach for try­ing to form an in­tu­ition here in­cludes re­turn­ing to the ex­am­ple (in a par­ent com­ment)

For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1...

but I don’t imag­ine this satis­fies you. Another piece of the in­tu­ition is that mind-hack­ing for the aim of re­ward within the epi­sode, or even the pos­si­ble in­stru­men­tal aim of op­er­a­tor-de­vo­tion, still doesn’t seem very ex­is­ten­tially risky to me, given the lack of op­ti­miza­tion pres­sure to that effect. (I know the lat­ter com­ment sort of be­longs in other branches of our con­ver­sa­tion, so we should con­tinue to dis­cuss it el­se­where).

Maybe other peo­ple can weigh in on this, and we can come back to it.

I’m open to other ter­minol­ogy. Yes, there is no guaran­tee about what hap­pens to the op­er­a­tor. As I’m defin­ing it, be­nig­nity is defined to be not hav­ing out­side-world in­stru­men­tal goals, and the in­tu­ition for the term is “not ex­is­ten­tially dan­ger­ous.”

• The best al­ter­na­tive to “be­nign” that I could come up with is “un­am­bi­tious”. I’m not very good at this type of thing though, so maybe ask around for other sug­ges­tions or in­di­cate some­where promi­nent that you’re in­ter­ested in giv­ing out a prize speci­fi­cally for this?

• What do you think about “al­igned”? (in the sense of hav­ing goals which don’t in­terfere with our own, by be­ing limited in scope to the events of the room)


• What do you think about “do­mes­ti­cated”?


• There’s still an ex­is­ten­tial risk in the sense that the AGI has an in­cen­tive to hack the op­er­a­tor to give it max­i­mum re­ward, and that hack could have pow­er­ful effects out­side the box (even though the AI hasn’t op­ti­mized it for that pur­pose), for ex­am­ple it might turn out to be a viru­lent memetic virus. Of course this is much less risky than if the AGI had di­rect in­stru­men­tal goals out­side the box, but “be­nign” and “not ex­is­ten­tially dan­ger­ous” both seem to be claiming a bit too much. I’ll think about what other term might be more suit­able.

• The first nuclear reaction produced an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of "could have powerful effects outside the box" is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side effects of inside-the-box events, since those side effects are the only reason it's useful. And there is also certainly no way to rule out how those side effects "might turn out to be," without a complete view of the future.

Would you agree that un­prece­dent­ed­ness cap­tures the con­cern?

• How does your AI know to avoid run­ning in­ter­nal simu­la­tions con­tain­ing lots of suffer­ing?

• It does not; thank you for point­ing this out! This fea­ture would have to be added on. Maybe you can come up with a way.

• Would you mind ex­plain­ing what the re­tracted part was? Even if it was a mis­take, point­ing it out might be use­ful to oth­ers think­ing along the same lines.

• Sorry, I prob­a­bly shouldn’t have writ­ten the sen­tence in the first place; it was an AI ca­pa­bil­ities idea.

• From the for­mal de­scrip­tion of the al­gorithm, it looks like you use a uni­ver­sal prior to pick , and then al­low the Tur­ing ma­chine to run for steps, but don’t pe­nal­ize the run­ning time of the ma­chine that out­puts . Is that right? That didn’t match my in­tu­itive un­der­stand­ing of the al­gorithm, and seems like it would lead to strange out­comes, so I feel like I’m mi­s­un­der­stand­ing.

• Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don't. The universal prior picks i (the number) by outputting i as a string. The i-th Turing machine is the Turing machine described by i as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine, and the construction of the description of the world-model isn't time-penalized. This doesn't change the asymptotic result, so I went with the more familiar prior rather than translating this new speed prior into a measure over finite strings, which would require some more exposition. But I agree with you that it feels like there might be some strange outcomes "before the limit" as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.

The other rea­son­able choices here in­stead of are (con­structed to be like the new speed prior here) and—the length of x. But ba­si­cally tells you that a Tur­ing ma­chine with fewer states is sim­pler, which would lead to a mea­sure over that is dom­i­nated by world-mod­els that are just uni­ver­sal Tur­ing ma­chines, which defeats the pur­pose of do­ing max­i­mum a pos­te­ri­ori in­stead of a Bayes mix­ture. The way this is­sue ap­pears in the proof ren­ders the Nat­u­ral Prior As­sump­tion less plau­si­ble.
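The MAP-versus-mixture distinction being discussed can be sketched in miniature (this is an illustration, not the paper's actual construction; the hypothesis class and numbers are invented): each hypothesis gets prior weight 2 raised to the negative of its description length in bits, with no penalty on running time, and MAP selects the single hypothesis with the highest posterior weight.

```python
import math

# Illustrative MAP inference over a toy hypothesis class (a sketch, not
# the paper's construction): prior weight 2^-(description length in bits),
# no penalty on running time, posterior proportional to prior * likelihood.

def map_hypothesis(hypotheses):
    """hypotheses: list of (name, description_length_bits, likelihood)."""
    def log_posterior(h):
        _, length_bits, likelihood = h
        return -length_bits * math.log(2) + math.log(likelihood)
    return max(hypotheses, key=log_posterior)[0]

toy = [
    ("short-program-poor-fit", 10, 1e-6),   # weight 2^-10 * 1e-6
    ("long-program-good-fit", 40, 1e-2),    # weight 2^-40 * 1e-2
]

# The description-length term can dominate the likelihood term, so the
# shorter program wins MAP here despite fitting the data worse.
assert map_hypothesis(toy) == "short-program-poor-fit"
```

The example illustrates why the choice of prior over machine descriptions matters so much: small differences in how description length is counted can flip which world-model is the MAP estimate.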

• This in­val­i­dates some of my other con­cerns, but also seems to mean things are in­cred­ibly weird at finite times. I sus­pect that you’ll want to change to some­thing less ex­treme here.

(I might well be mi­s­un­der­stand­ing some­thing, apolo­gies in ad­vance.)

Sup­pose the “in­tended” physics take at least 1E15 steps to run on the UTM (this is a con­ser­va­tive lower bound, since you have to simu­late the hu­man for the whole epi­sode). And sup­pose (I think you need much lower than this). Then the in­tended model gets pe­nal­ized by at least exp(1E12) for its slow­ness.

For al­most the same de­scrip­tion com­plex­ity, I could write down physics + “pre­com­pute the pre­dic­tions for the first N epi­sodes, for ev­ery se­quence of pos­si­ble ac­tions/​ob­ser­va­tions, and store them in a lookup table.” This in­creases the com­plex­ity by a few bits, some con­stant plus K(N|physics), but avoids most of the com­pu­ta­tion. In or­der for the in­tended physics to win, i.e. in or­der for the “speed” part of the speed prior to do any­thing, we need the com­plex­ity of this pre­com­puted model to be at least 1E12 bits higher than the com­plex­ity of the fast model.

That ap­pears to hap­pen only once N > BB(1E12). Does that seem right to you?

We could talk about whether ma­lign con­se­quen­tial­ists also take over at finite times (I think they prob­a­bly do, since the “speed” part of the speed prior is not do­ing any work un­til af­ter BB(1E12) steps, long af­ter the agent be­comes in­cred­ibly smart), but it seems bet­ter to ad­just the scheme first.

Us­ing the speed prior seems more rea­son­able, but I’d want to know which ver­sion of the speed prior and which pa­ram­e­ters, since which par­tic­u­lar prob­lem bites you will de­pend on those choices. And maybe to save time, I’d want to first get your take on whether the pro­posed ver­sion is dom­i­nated by con­se­quen­tial­ists at some finite time.
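The penalty arithmetic in this comment can be sketched numerically (the parameterization and every number below are invented for illustration; the real quantities are astronomically larger): under a prior of the form 2^-complexity times beta^runtime, the slow "intended physics" loses on the order of 1e12 bits to the time penalty, so a lookup-table model that spends extra description bits to avoid the computation dominates.

```python
import math

# Rough log-weight comparison under a hypothetical time-penalized prior
# of the form w(model) = 2^-complexity * beta^runtime. All numbers are
# invented for illustration; the real quantities are far larger.

def log2_weight(complexity_bits, runtime_steps, beta):
    return -complexity_bits + runtime_steps * math.log2(beta)

beta = 1 - 1e-3   # hypothetical per-step penalty

# "Intended physics": cheap to describe, ~1e15 steps to run.
intended = log2_weight(complexity_bits=1e6, runtime_steps=1e15, beta=beta)

# Lookup-table model: extra description bits to store precomputed
# predictions, but almost no runtime.
lookup = log2_weight(complexity_bits=1e6 + 1e4, runtime_steps=1e6, beta=beta)

# The slow model loses ~1.4e12 bits to the time penalty, so the lookup
# table dominates until its stored table grows comparably large.
assert lookup > intended
```

This is the heart of the BB(1E12) estimate above: the lookup table only stops winning once the complexity of storing N episodes' worth of predictions exceeds the time penalty, which takes an astronomically large N.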

• Does that seem right to you?

Yes. I recall thinking about precomputing observations for various actions in this phase, but I don't recall noticing how bad the problem was before the limit.

your take on whether the pro­posed ver­sion is dom­i­nated by con­se­quen­tial­ists at some finite time.

This goes in the category of "things I can't rule out". I say maybe a 1/5 chance it's actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1/5 is very high, and I'll concede this point.

I’d want to know which ver­sion of the speed prior and which parameters

is a mea­sure over bi­nary strings. In­stead, let’s try , where is the length of , is the time it takes to run on , and is a con­stant. If there were no clev­erer strat­egy than pre­com­put­ing ob­ser­va­tions for all the ac­tions, then could be above , where is the num­ber of epi­sodes we can tol­er­ate not hav­ing a speed prior for. But if it some­how mag­i­cally pre­dicted which ac­tions BoMAI was go­ing to take in no time at all, then would have to be above .

What prob­lem do you think bites you?

• I say maybe a 1/5 chance it's actually dominated by consequentialists

Do you get down to 20% be­cause you think this ar­gu­ment is wrong, or be­cause you think it doesn’t ap­ply?

What prob­lem do you think bites you?

What’s ? Is it O(1) or re­ally tiny? And which value of do you want to con­sider, polyno­mi­ally small or ex­po­nen­tially small?

But if it some­how mag­i­cally pre­dicted which ac­tions BoMAI was go­ing to take in no time at all, then c would have to be above 1/​d.

Wouldn’t they have to also mag­i­cally pre­dict all the stochas­tic­ity in the ob­ser­va­tions, and have a run­ning time that grows ex­po­nen­tially in their log loss? Pre­dict­ing what BoMAI will do seems likely to be much eas­ier than that.

• Do you get down to 20% be­cause you think this ar­gu­ment is wrong, or be­cause you think it doesn’t ap­ply?

Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods which consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on the UTM), and that is the consequentialists' real competition; then [the probability the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).

• I don’t see why their meth­ods would be el­e­gant. In par­tic­u­lar, I don’t see why any of {the an­thropic up­date, im­por­tance weight­ing, up­dat­ing from the choice of uni­ver­sal prior} would have a sim­ple form (sim­pler than the sim­plest physics that gives rise to life).

I don’t see how MAP helps things ei­ther—doesn’t the same ar­gu­ment sug­gest that for most of the pos­si­ble physics, the sim­plest model will be a con­se­quen­tial­ist? (Even more broadly, for the uni­ver­sal prior in gen­eral, isn’t MAP ba­si­cally equiv­a­lent to a ran­dom sam­ple from the prior, since some ran­dom model hap­pens to be slightly more com­press­ible?)

• I don’t see why their meth­ods would be el­e­gant.

Yeah I think we have differ­ent in­tu­itions here; are we at least within a few bits of log-odds dis­agree­ment? Even if not, I am not will­ing to stake any­thing on this in­tu­ition, so I’m not sure this is a hugely im­por­tant dis­agree­ment for us to re­solve.

I don’t see how MAP helps things either

I didn’t re­al­ize that you think that a sin­gle con­se­quen­tial­ist would plau­si­bly have the largest share of the pos­te­rior. I as­sumed your be­liefs were in the neigh­bor­hood of:

it seems plau­si­ble that the weight of the con­se­quen­tial­ist part is in ex­cess of 1/​mil­lion or 1/​billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you're not first, you're last, and more importantly, you can't team up with other consequentialist-controlled world-models in the mixture.

• Let’s say , .

Wouldn’t they have to also mag­i­cally pre­dict all the stochas­tic­ity in the ob­ser­va­tions, and have a run­ning time that grows ex­po­nen­tially in their log loss?

Oh yeah—that’s good news.

• Oh yeah—that’s good news.

Although I don’t re­ally like to make any­thing that would fall apart if the world were de­ter­minis­tic. Rely­ing on stochas­tic­ity feels wrong to me.

• Are there UDT-ish con­cerns with break­ing iso­la­tion of epi­sodes? For ex­am­ple, if the AI re­ceives a low re­ward at the be­gin­ning of epi­sode 117, does it have an in­cen­tive to ma­nipu­late the ex­ter­nal world to make epi­sode 117 hap­pen many times some­how, with most of these times giv­ing it a higher re­ward? For an­other ex­am­ple, can the AI at epi­sode 117 re­al­ize that it’s in a game the­ory situ­a­tion with the AI at epi­sodes 116 and 118 and trade re­wards with them acausally, lead­ing to long-term goal di­rected be­hav­ior?

• Are there UDT-ish concerns

BoMAI is a causal de­ci­sion the­o­rist.

• It's actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there's no reason for the operator to enter a reward at every timestep; instead, we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible and would partially address item 4 in my comment.

• So "updateless" is a reasonable term to apply to BoMAI, but it's not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn't mediated by the actual choice of action, updatelessness ends up having consequences.

If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it's isomorphic whether it computes the expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it's the same calculation being run. And this calculation is CDT.

I guess this means, and I’ve never thought about this be­fore so this could eas­ily be wrong, un­der the as­sump­tion that a policy’s effect on the world is screened off by which ac­tions it takes, CDT is re­flec­tively sta­ble.

(And yes, you could just give one re­ward, which ends the epi­sode.)
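The equivalence claimed here can be checked in a toy deterministic setting (a sketch, not BoMAI's actual algorithm; the environment and reward are invented): when the policy affects the world only through the actions taken, the expectimax policy computed once at episode start agrees with recomputing the best action at every step.

```python
import itertools

# Toy check that a full expectimax policy fixed at episode start agrees
# with recomputing the best action at each step, in a deterministic toy
# environment where the policy only matters via the actions taken.

ACTIONS = (0, 1)
HORIZON = 3

def episode_reward(history):
    # Arbitrary toy reward on complete action sequences.
    return sum(history) if history != (1, 1, 1) else 0

def value(history):
    if len(history) == HORIZON:
        return episode_reward(history)
    return max(value(history + (a,)) for a in ACTIONS)

def best_action(history):
    return max(ACTIONS, key=lambda a: value(history + (a,)))

# Policy computed once at the beginning, for every reachable history.
upfront_policy = {h: best_action(h)
                  for n in range(HORIZON)
                  for h in itertools.product(ACTIONS, repeat=n)}

# Acting on the fly reproduces the upfront policy's choices.
history = ()
while len(history) < HORIZON:
    assert best_action(history) == upfront_policy[history]
    history += (upfront_policy[history],)
```

Both computations reduce to the same expectimax recursion over histories, which is the sense in which deciding "all at once at the beginning" changes nothing here; the divergence from updateless decision theory only appears when the choice of policy has effects not mediated by the actions.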

• My con­cern is that since CDT is not re­flec­tively sta­ble, it may have in­cen­tives to cre­ate non-CDT agents in or­der to fulfill in­stru­men­tal goals.

• If I un­der­stand cor­rectly, it’s ac­tu­ally up­date­less within an epi­sode, and that’s the only thing it cares about so I don’t see how it would not be re­flec­tively sta­ble. Plus, even if it had an in­cen­tive to cre­ate a non-CDT agent, it would have to do that by out­putting some mes­sage to the op­er­a­tor, and the op­er­a­tor wouldn’t have the abil­ity to cre­ate a non-CDT agent with­out leav­ing the room which would end the epi­sode. (I guess it could hack the op­er­a­tor’s mind and cre­ate a non-CDT agent within, but at that point it might as well just make the op­er­a­tor give it max re­wards.)

• does it have an in­cen­tive to ma­nipu­late the ex­ter­nal world to make epi­sode 117 hap­pen many times somehow

For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 − 1, if you care). In none of these world-models, under any actions that it considers, does "episode 117 happen twice."
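The tape-position arithmetic can be sketched concretely (a toy illustration; the 0-indexed episode numbering and the choice of m are assumptions matching the parenthetical above):

```python
def episode_positions(episode: int, m: int) -> range:
    """Tape cells occupied by one episode: (m + 1) slots per episode,
    so episode k spans positions (m+1)*k through (m+1)*(k+1) - 1."""
    start = (m + 1) * episode
    return range(start, (m + 1) * (episode + 1))

# Episode 117 with, say, m = 10 interaction slots per episode:
cells = episode_positions(117, m=10)
print(cells.start, cells.stop - 1)  # 1287 1297
```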

• In none of these world-models, under any actions that it considers, does "episode 117 happen twice."

Yes, epi­sode 117 hap­pens only once in the world model; and sup­pose the agent cares only about epi­sode 117 in the “cur­rent ex­e­cu­tion”. The con­cern still holds: the agent might write a ma­lign out­put that would re­sult in ad­di­tional in­vo­ca­tions of it­self in which epi­sode 117 ends with the agent get­ting a high re­ward. Note that the agent does not care about the other ex­e­cu­tions of it­self. The only pur­pose of the ma­lign out­put is to in­crease the prob­a­bil­ity that the “cur­rent ex­e­cu­tion” is one that ends with the agent re­ceiv­ing a high re­ward.

• Okay, so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world in a way that looks like this. But the "malign output that would result in additional invocations of itself" would require the operator to leave the room, so it has the same form as the scenarios above. At this point, I think we're no longer considering anything that sounds like "episode 117 happening twice," but that's fine. Also, just a side-note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.

• The re­ward be­longs to a finite set of ra­tio­nals be­tween 0 and 1.

Once the AI be­comes suffi­ciently pow­er­ful, it is definitely go­ing to hack the op­er­a­tor any time it ex­pects the op­er­a­tor to give a re­ward less than 1. So the op­er­a­tor’s in­put is re­ally bi­nary, at least once the AI has learned an ac­cu­rate model. Given that, why al­low non-bi­nary re­wards at all? Is it just sup­posed to provide faster learn­ing early on?

Along similar lines: once the AI has learned an ac­cu­rate model, why would we ex­pect it to ever provide any­thing use­ful at all, rather than just hack­ing the op­er­a­tor all day? Do we think that hack­ing the hu­man is likely to be harder than ob­tain­ing perfect re­wards ev­ery time with­out hack­ing the hu­man? Seems like that would de­pend very heav­ily on the prob­lem at hand, and on the op­er­a­tor’s feed­back strat­egy.

To put it differ­ently: this setup will not provide a solu­tion to any prob­lem which is more difficult than hack­ing the hu­man op­er­a­tor.

• To put it differ­ently: this setup will not provide a solu­tion to any prob­lem which is more difficult than hack­ing the hu­man op­er­a­tor.

Yep. I ac­tu­ally ex­pect it to be pretty hard to hack the op­er­a­tor com­pared to other tasks, but more im­por­tantly, if it turns out to be eas­ier, we can tin­ker with the setup un­til hack­ing the op­er­a­tor be­comes very hard, since a mis­take here isn’t an ex­is­ten­tial risk. For ex­am­ple, we could add a mon­i­tor who ob­serves the op­er­a­tor, and can set the re­wards of the epi­sode to 0 if he ob­serves any­thing sus­pi­cious; the box would in­clude them both, but the op­er­a­tor couldn’t eas­ily com­mu­ni­cate to the mon­i­tor, and if they tried, the mon­i­tor could im­me­di­ately end the epi­sode.

This com­ment has a re­cap of other parts of the con­ver­sa­tion on this topic, alongside a more de­tailed pro­posal.

• So I think this pro­posal is a spe­cific in­stance of a gen­eral prin­ci­ple: wire­head­ing acts as a safety valve. As long as the AI is prone to wire­head­ing, and you leave it some path to wire­head­ing, the AI will never bother with any­thing more difficult than that path. In this par­tic­u­lar case, it will never do any­thing harder than hack­ing its hu­man op­er­a­tor, and that’s where most of the safety benefit comes from.

This kind of setup cre­ates a di­rect trade-off be­tween safety and ca­pa­bil­ity: the harder it is to hack the hu­man op­er­a­tor, the more likely that some other es­cape chan­nel is eas­ier—ma­nipu­lat­ing elec­tri­cal cir­cuitry to push atoms around, or some weird thing like that. The AI will be­come an ex­is­ten­tial threat if you send in an op­er­a­tor who re­ally doesn’t want to give it 100% re­ward, and make that op­er­a­tor hard to hack—then the AI has a strong in­cen­tive to gain what­ever power it needs. (I definitely do not trust a door with a kill switch to keep a su­per­in­tel­li­gent AI in a box.)

My intuition says that nothing else in the proposal actually matters for safety, except the wireheading safety valve aspect. In particular, I think just giving the AI direct exploration abilities won't mess up safety, as long as the wireheading path is present and there's no "99.99% certainty is better than 99.98%"-type failure mode.

• (I definitely do not trust a door with a kill switch to keep a su­per­in­tel­li­gent AI in a box.)

If you’re right about this, the setup is not safe. I’m go­ing to re­spond to this in the “con­cerns about the box” sec­tion. I don’t think a com­mit­ment to give the agent high re­ward if it jumps through a few hoops will save us.

In this par­tic­u­lar case, it will never do any­thing harder than hack­ing its hu­man op­er­a­tor, and that’s where most of the safety benefit comes from.

I dis­agree with this. The safety benefit comes from it not hav­ing out­side-world in­stru­men­tal goals (which it lacks if and only if the box is se­cure).

My in­tu­ition says that noth­ing else in the pro­posal ac­tu­ally mat­ters for safety

That’s what I would con­clude as well if the box were not se­cure.

In par­tic­u­lar, I think just giv­ing the AI di­rect ex­plo­ra­tion abil­ities won’t mess up safety,

See Ap­pendix F. If the agent picks its own ex­plo­ra­tory poli­cies (rea­son­ably), the agent will try ev­ery com­putable policy un­til it dies, in­clud­ing the poli­cies of ev­ery sim­ple AGI.

• Can you ex­pand a bit on why a com­mit­ment to give a high re­ward won’t save us? Is it a mat­ter of the AI seek­ing more cer­tainty, or is there some other is­sue?

• An ex­am­ple of a mind-kil­ling “mind” to me, even if it has no di­rect, veridi­cal con­tent, be­ing able to put the AI into an en­vi­ron­ment that seems to be too hos­tile.

• the goal at stake is the abil­ity to not just put a mind un­der the en­vi­ron­ment you think of as your true goal. (My cur­rent model of the world is that there’s a sin­gle goal, and only a sin­gle goal can be achieved in this world.)

• the AI isn’t al­lowed to try and get out of an en­vi­ron­ment within which it’s in con­trol. It can make its own goals—it can make money—by mak­ing a lot of money in the same way peo­ple en­joy huge amounts of free time.

• the AI is al­lowed to run in a com­pletely un­pre­dictable en­vi­ron­ment, out of the ex­per­i­men­tal space. How­ever, its op­tions would be:

• it can make thou­sands of copies of it­self, only tak­ing some of its re­sources and col­lect­ing enough money to run a very very com­pli­cated AI;

• it can make thou­sands of copies of it­self, only do­ing this very com­pli­cated be­hav­ior;

• it can make thou­sands of copies of it­self, each of which is do­ing it to­gether, and col­lect­ing much more money in the course of its evolu­tion (and per­haps also in the hands of other Minds), un­til it gets to the point where it can’t make mil­lions of copies of it­self, or if not it’s in a simu­lated uni­verse as it in­tends to.

So what’s the right thing to do? Where should we be go­ing with this?

• “I see you mean some­thing else” is also equiv­a­lent to “I don’t know how you mean some­thing differ­ent”.

You don’t think that, say, it’s bet­ter to be safe. You don’t know what’s go­ing wrong. So you don’t want to put up with the prob­lem and start try­ing new strate­gies, when no one’s already done some­thing stupid. (It’s not clear to me at all how to re­solve this prob­lem. If you can’t be cer­tain how to re­solve this prob­lem, then you’re not safe, be­cause all you can do is to have more ba­sic is­sues and big­ger prob­lems. But if you’re not sure how to re­solve the prob­lem, then you’re not safe, be­cause all you can do is to have more ba­sic is­sues and big­ger prob­lems, and you can always take a more care­ful ap­proach.

There are prob­a­bly other things (e.g. more com­pli­cated solu­tions, more com­pli­cated prob­lems, etc.) which are more ex­pen­sive, but I don’t think it’s some­thing that is worth the risk to hu­man civ­i­liza­tion, and may be worth it. I think this is a use­ful sug­ges­tion, but it de­pends a bit on how it re­lates, and it’s prob­a­bly not some­thing that you can write up very pre­cisely.

• What if we had some sim­ple way of solv­ing this prob­lem with­out need­ing to be safe? I think a solu­tion to the prob­lem would in­volve some se­ri­ous tech­ni­cal effort, and an un­der­stand­ing that the “solv­ing” prob­lem won’t be solved by “solv­ing”, but it is the prob­lem of Friendly AI which you see here not miss­ing some big con­cep­tual in­sight.

One way that I would go about solv­ing the prob­lem would be to build a safe AGI, and build the safety solu­tion. That way “solv­ing” prob­lems won’t always be safe, but (and also won’t make the ex­act prob­lem safe), the “solv­ing” prob­lem won’t always be safe, and any solu­tion to safe AI will prob­a­bly be safe. But it would be nice if it worked for prac­ti­cal pur­poses; if it worked for a big goal, the prob­lem would be safe.

In the world where the solu­tions are safe, there are no fun­da­men­tally scary al­ter­na­tives so long as their safety is se­cure, and so the safety solu­tion won’t be scary to hu­mans.

So, yes, it is an AGI safety prob­lem that the sys­tem of AGIs will face, be­cause it will not need to be dan­ger­ous. But what if the sys­tem of AGI does not need to be safe. The only rea­son to have an AI safety prob­lem is that we want to have a sys­tem which is safe. So our AI safety prob­lem will not always be scary to hu­mans, but it definitely will be. We might not be able to solve it one way or an­other.

The way to make progress on safety is to build an AGI sys­tem that can cre­ate an AGI sys­tem suffi­ciently smart that at least one of the world’s most in­tel­li­gent hu­mans is be cre­ated. A sys­tem which has a safety net are ex­tremely difficult to build. A sys­tem which has a safety net of highly trained hu­mans is ex­tremely difficult to build. And so on. The safety net of an AGI sys­tem can scale with time and scale with ca­pa­bil­ity.

I think that the prob­lem seems to be that if the world was already as dumb as we think, we should want to do great safety re­search. If you want to do great safety re­search, you are go­ing to have to be a lot smarter than the av­er­age sci­en­tist or pro­gram­mer. You can’t build an AGI that can ac­tu­ally ac­com­plish any­thing to the world’s challenges. You have to be the first in per­son.

I would take a sec­ond to say that I want to fo­cus more on these ques­tions than on ac­tu­ally de­sign­ing an AGI. In

• This should prob­a­bly have been bet­ter ti­tle than “So you want to have a com­plete and thor­ough un­der­stand­ing of the sub­ject mat­ter.”

This should work to the ex­tent that you should post a sum­mary like the one I just gave, rather than the one Anna seems to think will be best for your au­di­ence. I think the se­quence ver­sion should have been clar­ified to note that it’s very easy, and if we didn’t have the full ver­sion now (we should do that), or if we did have a ver­sion that just sounded so much like what you’ve planned for (the next post is a sum­mary in the “So you want to have an un­der­stand­ing of the sub­ject mat­ter of”? That’s definitely some­thing that is quite valuable to have here, and it’s im­por­tant to get peo­ple to read it.

• Other al­gorithms… would even­tu­ally seek ar­bi­trary power in the world in or­der to in­ter­vene in the pro­vi­sion of its own re­ward; this fol­lows straight­for­wardly from its di­rec­tive to max­i­mize reward

The con­clu­sion seems false; AUP (IJCAI, LW) is a re­ward max­i­mizer which does not ex­hibit this be­hav­ior. For similar rea­sons, the re­cent to­tal­i­tar­ian con­ver­gence con­jec­ture made here also seems not true.

• AUP seems re­ally promis­ing. I just meant other al­gorithms that have been proven gen­er­ally in­tel­li­gent, which is re­ally just AIXI, the Thomp­son Sam­pling Agent, BayesExp, and a cou­ple other var­i­ants on Bayesian agents with large model classes.

• This may be a dumb ques­tion, but how can you asymp­tot­i­cally guaran­tee hu­man-level in­tel­li­gence when the world-mod­els have bounded com­pu­ta­tion time, and the hu­man is a “com­putable func­tion” that has no such limit? Is it be­cause the num­ber of Tur­ing ma­chines is in­finite?

• Not a dumb ques­tion; bounded com­pu­ta­tion time here means bounded com­pu­ta­tion time per epi­sode, so re­ally it’s lin­ear com­pu­ta­tion time.

• My main concern is that the system will be neither safe nor useful. Its usefulness is limited by the number of tasks which could be solved in such a setup. These tasks are probably advice or projects: "should we develop wind energy" or "print the blueprint of a nanorobot". The main problem is that advice can look good but have unintended consequences which are not obvious. For example, widespread wind farms will kill birds and result in pest proliferation. If the AI continued to work, it could adjust its advice to this new data, but the program has stopped.

In other words, a sealed Oracle can give bad advice and can't be punished for it.

There will be a strong incentive for the system's owner to disable all these safety measures and use the system in the real world continuously, so all of Omohundro's drives appear all over again, but at the level of the system's owners.

Moreover, the system can feel rewarded now by knowing that it will in the future dominate the universe, so the system may feel the need to put secret code in its output (if the output is long enough) which will create many copies of itself in the wild, for example in order to tile the universe with wind farms. If it can't output complex code, it will be mostly useless.

• The main problem is that advice can look good but have unintended consequences which are not obvious.

Linking this to the discussion between Wei Dai and me here.

• There will be a strong incentive for the system's owner to disable all these safety measures

If the operators believe that without the safety measures, humanity would be wiped out, I think they won't jettison them. More to the point, running this algorithm does not put more pressure on the operators to try out a dangerous AI. Whatever incentive existed already is not the fault of this algorithm.

• The problem for any human operator is other human operators, e.g. Chinese vs. American. This creates exactly the same dynamics as were explored by Omohundro: a strong incentive to grab power and take riskier actions.

You dissect the whole system into two parts, and then claim that one of the parts is "safe". But the same thing can be done with any AI: just say that its memory or any other part is safe.

If we look at the whole system, including the AI and its human operators, it will have unsafe dynamics as a whole. I wrote about this in more detail in "Military AI as a convergent goal of AI development".

• What would con­sti­tute a solu­tion to the prob­lem of the race to the bot­tom be­tween teams of AGI de­vel­op­ers as they sac­ri­fice cau­tion to se­cure a strate­gic ad­van­tage be­sides the con­junc­tion of a) tech­ni­cal pro­pos­als and b) mul­ti­lat­eral treaties? Is your com­plaint that I make no dis­cus­sion of b? I think we can fo­cus on these two things one at a time.

• There could, in fact, be many solutions, starting from preventing AI creation altogether, up to creating so many AIs that they balance each other. I have an article with an overview of possible "global" solutions.

I don't think you should discuss different global solutions here, as it would be off topic. But a discussion of the whole system of "boxed AI + AI creators" may be interesting.

• the system can feel rewarded now by knowing that it will in the future dominate the universe

I do not see any map­ping from these con­cepts to its ac­tion-se­lec­tion-crite­rion of max­i­miz­ing the re­wards for its cur­rent epi­sode.

• Com­ment thread: con­cerns with As­sump­tion 1

• Since the real world is quan­tum, does your UTM need to be quan­tum too? More gen­er­ally, what hap­pens if there’s a mis­match be­tween what com­pu­ta­tions can be done effi­ciently in the real world vs on the UTM?

Also, I’m not sure what cat­e­gory this ques­tion falls un­der, but can you ex­plain the new speed prior that you use, e.g., what prob­lems in the old speed pri­ors was it de­signed to solve? (I re­call notic­ing some is­sues with Sch­mid­hu­ber’s speed prior but can’t find the post where I wrote about it now.)

• can you ex­plain the new speed prior

Yes! If you’re a proper Bayesian, us­ing the speed prior on se­quence pre­dic­tion for in­finite se­quences, you end up with sur­pris­ingly good loss bounds. This is sur­pris­ing be­cause the speed prior as­signs 0 prob­a­bil­ity to in­finite se­quences, so the truth has no prior sup­port.

If you use a max­i­mum a pos­te­ri­ori es­ti­mate, in­stead of a full Bayesian mix­ture, and the truth has prior sup­port, you also do fine.

But as far as I can tell, things break if you try both at once. So what I needed was a way of penalizing slow world-models, while still making sure that the true environment had prior support (and in particular, the possibility of the true environment running for infinitely many timesteps). Otherwise, you don't get any sort of intelligence result.
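As a minimal sketch of the kind of tradeoff being described (the prior form 2^(−L) · β^T is an assumption for illustration, not the paper's exact definition), one can weigh description length against per-episode computation time:

```python
import math

def log_speed_prior_weight(length_bits: int, steps: float, beta: float) -> float:
    """Log prior weight of a world-model with description length
    `length_bits` (bits) and `steps` computation steps per episode,
    under the assumed form 2^(-L) * beta^T (log space avoids underflow)."""
    return -length_bits * math.log(2) + steps * math.log(beta)

# A short-but-slow model vs. a longer-but-fast one:
slow = log_speed_prior_weight(length_bits=80, steps=1e6, beta=0.999)
fast = log_speed_prior_weight(length_bits=120, steps=1e4, beta=0.999)
print(fast > slow)  # True: at this beta, the slowness penalty dominates
```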

• does your UTM need to be quan­tum too?

I feel like I've heard something about a quadratic slowdown when simulating quantum mechanics on a classical machine? But it doesn't really matter: a simulation that required 100 computation steps per episode on a quantum computer might require 10,000 computation steps per episode on a Turing machine, but it's still bounded.

• Th­ese three sources all say simu­lat­ing a quan­tum sys­tem or com­puter on a clas­si­cal com­puter takes ex­po­nen­tial time. Does that make a differ­ence?

• It doesn’t make a differ­ence. Thank you for ac­tu­ally look­ing into this though.

I was think­ing of this.

• It doesn’t make a differ­ence.

I’m sur­prised by this. Can you ex­plain a bit more? I was think­ing that an ex­po­nen­tially large com­pu­ta­tion bound for the TM that ac­cu­rately simu­lates the real world would make its speed prior so small that it would be prac­ti­cally im­pos­si­ble for the AI to get enough in­puts (i.e., mes­sages from the op­er­a­tor) to up­date on to make that world model have the high­est weight in the pos­te­rior.

• Yes, in that sense it’s bad—I just meant it doesn’t make a differ­ence to the the­o­rems or pose a threat to As­sump­tion 1.

And it isn't quite that bad. This effect also means you could safely increase β (penalize slowness less), which causes BoMAI to start entertaining useful models of reality sooner. To be clear, this point mitigates the difficulty you describe; it certainly doesn't go beyond cancelling it out.

• My worry at this point is that if simulating the real world using actual physics takes exponential time on your UTM, the world model with the greatest posterior may not be such a simulation, but instead, for example, an alien superintelligence that runs efficiently on a classical TM and predicts the behavior of the operator (using various algorithms that it came up with that run efficiently on a classical computer), and at some point the alien superintelligence will cause BoMAI to output something to mind-hack the operator and then take over our universe. I'm not sure which assumption this would violate, but do you see this as a reasonable concern?

• The the­o­rem is con­sis­tent with the aliens caus­ing trou­ble any finite num­ber of times. But each time they cause the agent to do some­thing weird their model loses some prob­a­bil­ity, so there will be some epi­sode af­ter which they stop caus­ing trou­ble (if we man­age to suc­cess­fully run enough epi­sodes with­out in fact hav­ing any­thing bad hap­pen in the mean­time, which is an as­sump­tion of the asymp­totic ar­gu­ments).

• Thanks. Is there a way to de­rive a con­crete bound on how long it will take for BoMAI to be­come “be­nign”, e.g., is it ex­po­nen­tial or some­thing more rea­son­able? (Although if even a sin­gle “ma­lign” epi­sode could lead to dis­aster, this may be only of aca­demic in­ter­est.) Also, to com­ment on this sec­tion of the pa­per:

“We can only offer in­for­mal claims re­gard­ing what hap­pens be­fore BoMAI is definitely be­nign. One in­tu­ition is that even­tual be­nig­nity with prob­a­bil­ity 1 doesn’t hap­pen by ac­ci­dent: it sug­gests that for the en­tire life­time of the agent, ev­ery­thing is con­spiring to make the agent be­nign.”

If BoMAI can be effec­tively con­trol­led by alien su­per­in­tel­li­gences be­fore it be­comes “be­nign” that would sug­gest “ev­ery­thing is con­spiring to make the agent be­nign” is mis­lead­ing as far as rea­son­ing about what BoMAI might do in the mean time.

(if we man­age to suc­cess­fully run enough epi­sodes with­out in fact hav­ing any­thing bad hap­pen in the mean­time, which is an as­sump­tion of the asymp­totic ar­gu­ments)

Is this noted some­where in the pa­per, or just im­plicit in the ar­gu­ments? I guess what we ac­tu­ally need is ei­ther a guaran­tee that all epi­sodes are “be­nign” or a bound on util­ity loss that we can in­cur through such a scheme. (I do ap­pre­ci­ate that “in the ab­sence of any other al­gorithms for gen­eral in­tel­li­gence which have been proven asymp­tot­i­cally be­nign, let alone be­nign for their en­tire life­times, BoMAI rep­re­sents mean­ingful the­o­ret­i­cal progress to­ward de­sign­ing the lat­ter.”)

• Is there a way to de­rive a con­crete bound on how long it will take for BoMAI to be­come “be­nign”, e.g., is it ex­po­nen­tial or some­thing more rea­son­able?

The clos­est thing to a dis­cus­sion of this so far is Ap­pendix E, but I have not yet thought through this very care­fully. When you ask if it is ex­po­nen­tial, what ex­actly are you ask­ing if it is ex­po­nen­tial in?

• When you ask if it is ex­po­nen­tial, what ex­actly are you ask­ing if it is ex­po­nen­tial in?

I guess I was ask­ing if it’s ex­po­nen­tial in any­thing that would make BoMAI im­prac­ti­cally slow to be­come “be­nign”, so ba­si­cally just us­ing “ex­po­nen­tial” as a short­hand for “im­prac­ti­cally large”.

• Is this noted some­where in the paper

I don’t think it is, thank you for point­ing this out.

• If BoMAI can be effec­tively con­trol­led by alien su­per­in­tel­li­gences be­fore it be­comes “be­nign” that would sug­gest “ev­ery­thing is con­spiring to make the agent be­nign” is mis­lead­ing as far as rea­son­ing about what BoMAI might do in the mean time.

Agreed that would be mis­lead­ing, but I don’t think it would be con­trol­led by alien su­per­in­tel­li­gences.

• the world model with the great­est pos­te­rior may not be such a simu­la­tion but in­stead for ex­am­ple an alien su­per­in­tel­li­gence that runs effi­ciently on a clas­si­cal TM which is pre­dict­ing the be­hav­ior of the operator

• Consider the algorithm the alien superintelligence is running to predict the behavior of the operator, which runs efficiently on a classical TM (Algorithm A). Now compare Algorithm A with Algorithm B: simulate aliens deciding to run Algorithm A; run Algorithm A; except at some point, figure out when to do a treacherous turn, and then do it.

Al­gorithm B is clearly slower than Al­gorithm A, so Al­gorithm B loses.

There is an important conversation to be had here: your particular example isn't concerning, but maybe we just haven't thought of an analog that is concerning. Regardless, I think this has become divorced from the discussion about quantum mechanics.

This is why I try to write down all the as­sump­tions to rule out a whole host of world-mod­els we haven’t even con­sid­ered. In the ar­gu­ment in the pa­per, the as­sump­tion that rules out this ex­am­ple is the Nat­u­ral Prior As­sump­tion (as­sump­tion 3), al­though I think for your par­tic­u­lar ex­am­ple, the ar­gu­ment I just gave is more straight­for­ward.

• Al­gorithm B is clearly slower than Al­gorithm A.

Yes, but Algorithm B may be shorter than Algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and fewer bits to pick out an alien superintelligence which has an instrumental reason to invent such an algorithm. If β is set to be so near 1 that the exponential-time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference and everything comes down to program length.

Regardless, I think this has become divorced from the discussion about quantum mechanics.

Quan­tum me­chan­ics is what’s mak­ing B be­ing slower than A not mat­ter (via the above ar­gu­ment).
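This tradeoff can be made concrete with a toy calculation (using an assumed prior form 2^(−L) · β^T with invented numbers for L and T): as β approaches 1 the slowness penalty washes out, and description length alone decides which algorithm dominates the posterior.

```python
import math

def log_weight(length_bits: float, steps: float, beta: float) -> float:
    # Assumed toy prior: 2^(-L) * beta^T, computed in log space.
    return -length_bits * math.log(2) + steps * math.log(beta)

# Algorithm A: directly specified predictor (long program, faster).
# Algorithm B: aliens who invent A (short program, 100x slower).
L_A, T_A = 10_000, 1e12
L_B, T_B = 1_000, 1e14

for beta in (0.999, 1 - 1e-15):
    winner = "A" if log_weight(L_A, T_A, beta) > log_weight(L_B, T_B, beta) else "B"
    print(beta, winner)  # A wins at beta = 0.999; B wins as beta -> 1
```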

• If β is set to be so near 1 that the ex­po­nen­tial time simu­la­tion of real physics can have the high­est pos­te­rior within a rea­son­able time...

So I'm a bit baffled by the philosophy here, but here's why I haven't been concerned with the long time it would take BoMAI to entertain the true environment (and it might well take a long time, given a safe value of β).

There is a relatively clear distinction one can make between objective probabilities and subjective ones. The asymptotic benignity result makes use of world-models that perfectly match the objective probabilities rising to the top.

Consider a new kind of probability: a "budget-optimal subjective probability". That is, the best (in the sense of KL divergence) approximation of the objective probabilities that can be sampled from using a UTM within a given per-episode computation budget. Suspend disbelief for a moment, and suppose we thought of these probabilities as objective probabilities. My intuition here is that everything works just great when agents treat subjective probabilities like real probabilities, and to a computation-bounded agent, it feels like there is some sense in which these might as well be objective probabilities; the more intricate structure is inaccessible. If no world-models were considered that exceeded this computation budget, then just by calling "budget-optimal subjective probabilities" "objective," the same benignity theorems would apply, where the role in the proofs of [the world-model that matches the objective probabilities] is replaced by [the world-model that matches the budget-optimal subjective probabilities]. And in this version, benignity comes much sooner, and the limiting value of intelligence is reached much sooner.

Of course, “the limit­ing value of in­tel­li­gence” is much less, be­cause only fast world-mod­els are con­sid­ered. But that just goes to show that even if, on a hu­man timescale, BoMAI ba­si­cally never fields a world-model that ac­tu­ally matches ob­jec­tive prob­a­bil­ities, along the way, it will still be field­ing the best ones available that use a more mod­est com­pu­ta­tion bud­get. Once the com­pu­ta­tion bud­get sur­passes the hu­man brain, that should suffice for it to be prac­ti­cally in­tel­li­gent.

EDIT: if one sets β to be safe, then if this logic fails, BoMAI will be useless, not dangerous.
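The selection being described can be sketched as a toy computation (the candidate distributions, step counts, and budget here are all invented): among world-models that fit a per-episode computation budget, the subjectively optimal one minimizes KL divergence to the objective probabilities.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

objective = [0.5, 0.3, 0.2]  # stand-in for the true environment
candidates = [
    {"name": "coarse", "steps": 1e1,  "dist": [0.34, 0.33, 0.33]},
    {"name": "finer",  "steps": 1e4,  "dist": [0.48, 0.31, 0.21]},
    {"name": "exact",  "steps": 1e12, "dist": [0.50, 0.30, 0.20]},
]

def best_within_budget(n):
    """Best feasible approximation of the objective probabilities
    using at most n computation steps per episode."""
    feasible = [c for c in candidates if c["steps"] <= n]
    return min(feasible, key=lambda c: kl(objective, c["dist"]))

print(best_within_budget(1e6)["name"])   # finer (the exact model is over budget)
print(best_within_budget(1e13)["name"])  # exact
```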

• If there’s an effi­cient clas­si­cal ap­prox­i­ma­tion of quan­tum dy­nam­ics, I bet this has a con­cise and lovely math­e­mat­i­cal de­scrip­tion. I bet that de­scrip­tion is much shorter than “in Con­way’s game of life, the effi­cient ap­prox­i­ma­tion of quan­tum me­chan­ics that what­ever life­form emerges will prob­a­bly come up with.”

But I’m hes­i­tant here. This is ex­actly the sort of con­ver­sa­tion I wanted to have.

• If there’s an effi­cient clas­si­cal ap­prox­i­ma­tion of quan­tum dy­nam­ics, I bet this has a con­cise and lovely math­e­mat­i­cal de­scrip­tion.

I doubt that there’s an effi­cient clas­si­cal ap­prox­i­ma­tion of quan­tum dy­nam­ics in gen­eral. There are prob­a­bly tricks to speed up the clas­si­cal ap­prox­i­ma­tion of a hu­man mind though (or parts of a hu­man mind), that an alien su­per­in­tel­li­gence could dis­cover. Con­sider this anal­ogy. Sup­pose there’s a robot stranded on a planet with­out tech­nol­ogy. What’s the short­est al­gorithm for con­trol­ling the robot such that it even­tu­ally leaves that planet and reaches an­other star? It’s prob­a­bly some kind of AGI that has an in­stru­men­tal goal of reach­ing an­other star, right? (It could also be a ter­mi­nal goal, but there are many other ter­mi­nal goals that call for in­ter­stel­lar travel as an in­stru­men­tal goal so the lat­ter seems more likely.) Leav­ing the planet calls for solv­ing many prob­lems that come up, on the fly, in­clud­ing in­vent­ing new al­gorithms for solv­ing them. If you put all these in­di­vi­d­ual solu­tions and al­gorithms to­gether that would also be an al­gorithm for reach­ing an­other star but it could be a lot longer than the code for the AGI.

• I see—so I think I make the same response on a different level, then.

My model for this is: the world-model is a stochastic simple world, something like Conway's game of life (but with randomness). Life evolves. The output channel has distinguished within-world effects, so that inhabitants can recognize it. The inhabitants control the output channel and use some of their world's noise to sample from a universal prior, which they then feed into the output channel. But they don't just use any universal prior—they use a better one, one which updates the prior over world-models as if the observation has been made: "someone in this world-model is sampling from the universal prior." Maybe they also started with a speed prior of some form (which would cause them to be more likely to output the fast approximation of the human mind we were just discussing). And then after a while, they mess with the output.

Whatever better universal prior they come up with (e.g. anthropically updated speed prior), I think has a short description—shorter than [−log prob(intelligent life evolves and picks it) + description of simple universe].

• It doesn't make sense to me that they're sampling from a universal prior and feeding it into the output channel, because the aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offer greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I'm not seeing it. If you think it can or should be viewed that way, can you explain more?

In particular, if they're trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that's equivalent to our physics, chemistry, biology, neuroscience, etc. I'm not seeing how sampling from an "anthropically updated speed prior" would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).

• I think you might be more or less right here.

I hadn't thought about the can-do and the worth-doing update, in addition to the anthropic update. And it's not that important, but for terminology's sake, I forgot that the update could send a world-model's prior to 0, so the prior might not be universal anymore.

The reason I think of these steps as updates to what started as a universal prior is that they would like to take over as many worlds as possible, and they don't know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.

they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that's equivalent to our physics, chemistry, biology, neuroscience, etc. I'm not seeing how sampling from "anthropically updated speed prior" would do the equivalent of all that

If you want to make fast predictions about an unknown world, I think that's what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that's the situation in which anything they do matters, so they are basically "learning" about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator's actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they've outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].

But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don't make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.

So let me back up here. I don't say anything in the Natural Prior Assumption about "for sufficiently small β," but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a "virtual machine" in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the "predict well" part is simpler than "predict well" + "treacherous turn." In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn't require more code. (Really, the reasoning that got us here is that in the rare cases where treacherous world-models are shorter to describe, they are also very slow, but that "intuition" seems less likely to generalize to world-models we haven't thought about). But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.

Another point that I was considering, and I'm not exactly sure where this fits in, but I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that's either more severe or less severe than the one we use, they lose ground. After i episodes, there's a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn't severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they'll be too focused on approximating and lose to more accurate world-models whose relative slowness we're prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add "matching our prior" to the list of "things that make it hard to take over a universal prior."

• I'm glad that I'm getting some of my points across, but I think we still have some remaining disagreements or confusions here.

If you want to make fast predictions about an unknown world, I think that's what we call a speed prior.

That doesn't seem right to me. A speed prior still favors short algorithms. If you're trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?

Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn't running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?

I don't say anything in the Natural Prior Assumption about "for sufficiently small β," but this makes me think I might need to.

What happens if β is too small? If it's really tiny, then the world model with the highest posterior is random, right, because it's "computed" by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with highest posterior starts doing fast and then increasingly compute-intensive predictions?

As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a "virtual machine" in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm.

I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs "natively" on the TM and can decide to execute arbitrary algorithms "natively" on the TM.

Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like "how to build a safe unbounded AGI") because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don't know an argument for that, and B) even if it's true, to take advantage of it would seem to require fine tuning β and I don't see how to do that, given that trial-and-error wouldn't be safe.

• a directly coded AGI that runs "natively" on the TM and can decide to execute arbitrary algorithms "natively" on the TM.

At the end of the day, it will be running some subroutine for its gain-trust/predict-accurately phase.

I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
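The interpretive-overhead point can be sketched with a toy step-counter (my own construction, not from the discussion): a machine specialized for an operation pays one step per operation, while a general machine that simulates programs pays a fetch/decode/execute cycle on every simulated step.

```python
def native_increment(n, steps):
    """A machine specialized for one task: one primitive operation per step."""
    ops = 0
    for _ in range(steps):
        n += 1
        ops += 1
    return n, ops

def interpreted_increment(n, steps):
    """A general machine: it must fetch and decode an instruction
    before executing it, paying overhead on every simulated step."""
    program = ["INC"]
    ops = 0
    for _ in range(steps):
        instr = program[0]  # fetch
        ops += 1
        if instr == "INC":  # decode
            ops += 1
            n += 1          # execute
            ops += 1
    return n, ops
```

Here the overhead is a constant factor of 3; for a universal Turing machine simulating another Turing machine, the analogous constant factor is what makes the simulated world-model slower than the machine specialized for it.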

• Ok, I see, so in other words the AGI doesn't have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However, I think the AGI can still win a race against a straightforward "predict accurately" algorithm, because it can do two things. 1) Include the most important inner loops of the "predict accurately" algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior) and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than "pays for" the remaining slowdown that is incurred.

• Let the AGI's "predict accurately" algorithm be fixed.

What you call a sequence of improvements to the prediction algorithm, let's just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as "predict accurately." I think this reconceptualization eliminates 2) as a concern, and if I'm understanding correctly, 1) is only able to mitigate slowdown, not overpower it.

Also I think 1) doesn't work—maybe you came to this conclusion as well?

Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program".

But maybe you're saying that doesn't apply because:

(this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior)

I think this throws off the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, aliens decide to run "predict accurately" + "treacherous turn". Longer description, less overhead: an AGI that runs "predict accurately" + "treacherous turn." Even longer description, even less overhead: an AGI with some of the subroutines involved already (conveniently) baked in to its architecture. Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn". And in this form, that has a longer description than just "predict accurately".

• Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn"

You only have to bake in the innermost part of one loop in order to get almost all the computational savings.

• I've made a case that the two endpoints in the trade-off are not problematic. I've argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing "predict accurately" and "treacherous turn" all at once. This goes back to the general principle I proposed above: "The more general a system is, the less well it can do any particular task." The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that can be native to an AGI that would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call]—sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.

• If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use-after-free in physics :) ), and guarantees the aliens have zero slowdown.

For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn't get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I'm with Wei Dai that it probably doesn't rule out simpler scientists.

• Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them.

That's what I was thinking too, but Michael made me realize this isn't possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program". Am I missing something at this point?

all you have to do is insert one use-after-free in physics

I don't understand this sentence.

• That's what I was thinking too, but Michael made me realize this isn't possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program". Am I missing something at this point?

I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

I don't understand this sentence.

If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.

• I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

Why is it a desirable property? I'm not seeing why it would be bad to choose a UTM that doesn't have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?

• I just mean: "universality" in the sense of a UTM isn't a sufficient property when defining the speed prior; the analogous property of the UTM is something more like: "You can run an arbitrary Turing machine without too much slowdown." Of course that's not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).

I agree that it would be fine to sacrifice this property if it was helpful for safety.

Each world-model is a Turing machine, whose prior relates to the Kolmogorov complexity (on some universal Turing machine) of the description of the Turing machine—all the transition rules, and whatnot. Usually, this would be isomorphic (within a constant), but since we're considering speed, programs actually aren't simulated on a UTM.

• What happens if β is too small?

Just as you said: it outputs Bernoulli(1/2) bits for a long time. It's not dangerous.

B) even if it's true, to take advantage of it would seem to require fine tuning β and I don't see how to do that, given that trial-and-error wouldn't be safe.

Fine tuning from both sides isn't safe. Approach from below.

• Just as you said: it out­puts Bernoulli(1/​2) bits for a long time. It’s not dan­ger­ous.

I just read the math more care­fully, and it looks like no mat­ter how small β is, as long as β is pos­i­tive, as BoMAI re­ceives more and more in­put, it will even­tu­ally con­verge to the most ac­cu­rate world model pos­si­ble. This is be­cause the com­pu­ta­tion penalty is ap­plied to the per-epi­sode com­pu­ta­tion bound and doesn’t in­crease with each epi­sode, whereas the ac­cu­racy ad­van­tage gets ac­cu­mu­lated across epi­sodes.

As­sum­ing that the most ac­cu­rate world model is an ex­po­nen­tial-time quan­tum simu­la­tion, that’s what BoMAI will con­verge to (no mat­ter how small β is), right? And in the mean­time it will go through some ar­bi­trar­ily com­plex (up to some very large bound) but faster than ex­po­nen­tial clas­si­cal ap­prox­i­ma­tions of quan­tum physics that are in­creas­ingly ac­cu­rate, as the num­ber of epi­sodes in­crease? If so, I’m no longer con­vinced that BoMAI is be­nign as long as β is small enough, be­cause the qual­i­ta­tive be­hav­ior of BoMAI seems the same no mat­ter what β is, i.e., it gets smarter over time as its world model gets more ac­cu­rate, and I’m not sure why the rea­son BoMAI might not be be­nign at high β couldn’t also ap­ply at low β (if we run it for a long enough time).
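This dynamic can be made concrete with a toy posterior of roughly the form under discussion (my own stand-in numbers and formula, not the paper's exact definition): weight ∝ 2^(−description length) · β^(per-episode computation) · accuracy^(episodes). The computation penalty is a fixed per-episode rate, while the accuracy advantage compounds across episodes, so the slow accurate model eventually wins for any positive β.

```python
# Toy posterior: the computation penalty beta^time is paid at a fixed per-episode
# rate, while the accuracy term compounds with the number of episodes i.
def posterior_weight(length, time, acc, beta, episodes):
    return (2.0 ** -length) * (beta ** time) * (acc ** episodes)

# Hypothetical world-models: a cheap classical approximation vs. a slow,
# exact simulation (all numbers invented for illustration).
fast_approx = dict(length=10, time=100, acc=0.90)
slow_exact = dict(length=12, time=10_000, acc=0.99)

beta = 0.999  # even a mild computation penalty

def winner(i):
    wf = posterior_weight(**fast_approx, beta=beta, episodes=i)
    ws = posterior_weight(**slow_exact, beta=beta, episodes=i)
    return "fast" if wf > ws else "slow"
```

Early on the fast approximation has the higher posterior, but after enough episodes the accurate model overtakes it; changing β moves the crossover time without changing the limit.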

(If you're going to discuss all this in your "longer reply", I'm fine with waiting for it.)

• The longer reply will include an image that might help, but a couple other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of the Rejecting the Simple Memory-Based Lemma, which isn't that long). The heuristic reason for why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is that while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don't need concern for speed to dominate concern for accuracy; we need it to dominate concern for "simplicity" (on some reference machine).

• so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length

Yeah, I understand this part, but I'm not sure why, since the benign one can be extremely complex, the malign one can't have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we're going through many more different world models as the number of episodes increases, that also gives malign world models more chances to "win"? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we've been doing and we need to figure out the actual math somehow.

• And since (with low β) we're going through many more different world models as the number of episodes increases, that also gives malign world models more chances to "win"?

Check out the order of the quantifiers in the proofs. One works for all possibilities. If the quantifiers were in the other order, they couldn't be trivially flipped since the number of world-models is infinite, and the intuitive worry about malign world-models getting "more chances to win" would apply.

Let's continue the conversation here, and this may be a good place to reference this comment.

• Fine tuning from both sides isn't safe. Approach from below.

Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is; otherwise, even if the band exists, you could overshoot it and end up in the unsafe region.

ETA: But the first question is, is there a β such that BoMAI is both safe and intelligent enough to answer questions like "how to build a safe unbounded AGI"? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?

• Some visualizations which might help with this:

But then one needs to factor in "simplicity" or the prior penalty from description length:

Note also that these are average effects; they are just for forming intuitions.

is there a β such that BoMAI is both safe and intelligent enough to answer questions like "how to build a safe unbounded AGI" [after a reasonable number of episodes]?

This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you're asking for the improvement, here's a proposal:

Set β safely low. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use less than some threshold of computation steps per episode. Every episode, increase that threshold until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the "corresponding" benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), this should remain safe for some settings of these parameters.

I need to think more carefully about what happens here, but I think the design space is large.

• Fixed your images. You have to press space after you use that syntax for the images to actually get fetched and displayed. Sorry for the confusion.

• Thanks!

• Longer response coming. On hold for now.

• Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn't running one particular TM from the mixture give the highest expected utility?

All the better. They don't know what universe is using the prior. What are the odds our universe is the single most susceptible universe to being taken over?

I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes. And I was assuming the worst in assuming they'd be so good at this, they'd be able to do this for a large number of universes at once.

To clarify: diminishing marginal returns of takeover probability of a universe with respect to the weight you give that universe in your prior that you pipe to output.

• I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes.

There are massive diminishing marginal returns; in a naive model you'd expect essentially *every* universe to get predicted in this way.

But Wei Dai's basic point still stands. The speed prior isn't the actual prior over universes (i.e. it doesn't reflect the real degree of moral concern that we'd use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, and (b) using engineering reasoning to make the utility-maximizing predictions, given that faster predictions are going to be given more weight.

(You don't really need this to run Wei Dai's argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)

• I think what you're saying is that the following don't commute:

"real prior" (universal prior) + speed update + anthropic update + can-do update + worth-doing update

compared to

universal prior + anthropic update + can-do update + worth-doing update + speed update

When the universal prior is next to the speed update, this is naturally conceptualized as a speed prior, and when the speed update is last, it is naturally conceptualized as "engineering reasoning" identifying faster predictions.

I'm happy to go with the second order if you prefer, in part because I think they do commute—all these updates just change the weights on measures that get mixed together to be piped to output during the "predict accurately" phase.

• If you're trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?

You have a countable list of options. What choice do you have but to favor the ones at the beginning? Any (computable) permutation of the things on the list just corresponds to a different choice of universal Turing machine, for which a "short" algorithm just means it's earlier on the list.

And a "sequence of increasingly better algorithms," if chosen in a computable way, is just a computable algorithm.

• And a "sequence of increasingly better algorithms," if chosen in a computable way, is just a computable algorithm.

True, but I'm arguing that this computable algorithm is just the alien itself, trying to answer the question "how can I better predict this richer world in order to take it over?" If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, instead of saying that the alien is thinking about how to answer "how can I better predict this richer world in order to take it over?" Actually, if this alien was sampling from the speed prior, then it would no longer be the shortest/fastest algorithm to come up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.

• I'm having a hard time following this. Can you expand on this, without using "sequence of increasingly better algorithms"? I keep translating that to "algorithm."

• The fast algorithms to predict our physics just aren't going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

• You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Using "reasoning" to pick which one to favor is just picking the first one in some new order. (And not really picking the first one, just giving earlier ones preferential treatment). In general, if you have an infinite list of possibilities, and you want to pick the one that maximizes some property, this is not a procedure that halts. I'm agnostic about what order you use (for now), but one can't escape the necessity of introducing the arbitrary criterion of "valuing" earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor "simplicity" less, but you can't make that number infinity.
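A small numerical illustration of this point (my own construction): whatever summable weights you put on a countable list of algorithms, almost all of the mass sits on a finite prefix. Flattening the weights only pushes the cutoff out; it can't remove it.

```python
def prefix_mass(weights, k):
    """Fraction of the total probability mass on the first k list entries."""
    return sum(weights[:k]) / sum(weights)

# A "severe" prior that strongly favors early entries: w_j proportional to 2^-j.
severe = [2.0 ** -j for j in range(1, 1_000)]

# A much flatter (but still summable) prior: w_j proportional to 1/j^2.
flat = [1.0 / j ** 2 for j in range(1, 100_000)]
```

The severe prior puts over 99.9% of its mass on the first 20 entries, and even the flat prior puts over 99% on its first thousand: the arbitrary "valuing" of earlier entries differs only in degree.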

• Using "reasoning" to pick which one to favor is just picking the first one in some new order.

Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we'll update in favor of the aliens and away from the rest of the speed prior.

one can't escape the necessity of introducing the arbitrary criterion of "valuing" earlier things on the list

Probably some miscommunication here. No one is trying to object to the arbitrariness; we're just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.

(They may still not be able to if the penalty for computation is sufficiently steep—e.g. if you penalize based on circuit complexity, so that the model might as well bake in everything that doesn't depend on the particular input at hand. I think it's an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)

• They may still not be able to if the penalty for computation is sufficiently steep

It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.

• The only point I was trying to respond to in the grandparent of this comment was this:

The fast algorithms to predict our physics just aren't going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens' speed prior.

I can't respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.

My point here was just that there's no way not to trade off shortness with speed, so no one has a comparative advantage over us as a result of the claim "The fast algorithms to predict our physics just aren't going to be the shortest ones."

The "after figuring out physics" part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that's exactly what we're doing too. The prior they start with can't be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their "figuring out physics". Of course I don't object to that—our speed prior's posterior will be much better than the prior too.

• but that’s ex­actly what we’re do­ing to

It seems to­tally differ­ent from what we’re do­ing, I may be mi­s­un­der­stand­ing the anal­ogy.

Sup­pose I look out at the world and do some sci­ence, e.g. dis­cov­er­ing the stan­dard model. Then I use my un­der­stand­ing of sci­ence to de­sign great pre­dic­tion al­gorithms that run fast, but are quite com­pli­cated ow­ing to all of the ap­prox­i­ma­tions and heuris­tics baked into them.

The speed prior gives this model a very low probability because it's a complicated model. But "do science" gives this model a high probability, because it's a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren't trading off "shortness" for speed; we are trading off "looks good according to reasoning" for speed. Yes, they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

Of course the speed prior also includes a hypothesis that does "science with the goal of making good predictions," and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).

In other words:

Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and once they've done that, they can immediately deduce the N-bit approximation. So it sure seems like they'll beat the speed prior. Are you objecting to this argument?

(In fact the speed prior only actually takes n + O(1) bits, because it can specify the "do science" strategy, but that doesn't help here since we are just trying to say that the "do science" strategy dominates the speed prior.)
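For what it's worth, the trade-off being argued about can be made concrete with a toy calculation. All numbers and the exact form of the penalty here are invented for illustration; this is not the paper's prior, just a generic "2^-length times beta^runtime" speed prior sketched in log space:

```python
import math

# A toy speed prior gives program p weight 2^-len(p) * beta^time(p).
# In log2 space, so huge exponents don't underflow floats:
def log2_speed_prior_weight(length_bits, runtime_steps, beta=0.99):
    return -length_bits + runtime_steps * math.log2(beta)

# Two illustrative "world-models" (numbers invented for the example):
short_slow = (100, 10**6)    # simple physics, expensive to simulate
long_fast = (10**4, 10**3)   # complicated bundle of fast heuristics

# With a strong runtime penalty, the long-but-fast model dominates:
print(log2_speed_prior_weight(*long_fast) > log2_speed_prior_weight(*short_slow))

# With beta very close to 1, the prior reverts to favoring shortness:
print(log2_speed_prior_weight(*short_slow, beta=1 - 1e-9) >
      log2_speed_prior_weight(*long_fast, beta=1 - 1e-9))
```

The point of the toy: any such prior necessarily trades off shortness with speed; the only knob is how steep the trade is.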

• I'm not sure which of these arguments will be more convincing to you.

Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

This is what I was trying to contextualize above. This is an unfair comparison. You're imagining that the "reasoning"-based order gets to see past observations, and the "shortness"-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term "order" is confusing us, because we both know it's a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before it has done the reasoning, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.
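Here is a toy illustration of that claim, that a "shortness-based order" becomes a "reasoning-based order" just by conditioning on data. The hypotheses and numbers are invented for the example:

```python
import math

# Toy hypotheses about a coin: (name, description_bits, P(heads)).
# Shorter hypotheses get higher prior weight, 2^-bits.
HYPS = [("always_heads", 2, 0.999), ("fair_coin", 4, 0.5), ("mostly_tails", 8, 0.1)]

def ranking(observations):
    # Prior weight 2^-bits is the "shortness-based order"; conditioning
    # on observations turns it into a posterior ("reasoning-based order").
    scored = []
    for name, bits, p in HYPS:
        logp = -bits * math.log(2)
        for obs in observations:
            logp += math.log(p if obs == "H" else 1.0 - p)
        scored.append((logp, name))
    return [name for _, name in sorted(scored, reverse=True)]

print(ranking([]))                    # shortest hypothesis ranks first
print(ranking(["T", "T", "H", "T"]))  # the data reorders the ranking
```

With no data the ordering is pure shortness; after four flips the same distribution, updated, ranks hypotheses by how well reasoning-from-evidence favors them.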

Here’s an­other way to look at it.

The speed prior is doing the brute force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.

In fact the speed prior only actually takes n + O(1) bits, because it can specify the "do science" strategy

Exactly. But this does help, for reasons I describe here. The description length of the "do science" strategy (I contend) is less than the description length of the "do science" + "treacherous turn" strategy. (I initially typed that as "tern", which will now be the image I have of a treacherous turn.)

• a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior

Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)

The reasoning in this case is not a Bayesian update. It's evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.

The description length of the "do science" strategy (I contend) is less than the description length of the "do science" + "treacherous turn" strategy.

I can reply in that thread.

I think the only good arguments for this are in the limit where you don't care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics; for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.

• Given a world model ν, which takes k computation steps per episode, let νlog be the world-model that best approximates ν (in the sense of KL divergence) using only log k computation steps. νlog is at least as good as the "reasoning-based replacement" of ν.

The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than log k computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

νlog, νlogε, and a host of other approaches are prominently represented in the speed prior.

If this is what you call "the speed prior doing reasoning," so be it, but the relevance of that terminology only comes in when you claim that "once you've encoded 'doing reasoning', you've basically already written the code for it to do the treachery that naturally comes along with that." That sense of "reasoning" really only applies, I think, to the case where our code is simulating aliens or an AGI.
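One way to picture the νlogε construction above as a program (the names and the representation of models are my own illustration, not from the paper): the fast model is described indirectly, as "the first model an enumeration ψ finds that is an ε-approximation of ν within the step budget", so the description cost is roughly |ν| + |ψ| + O(1) and the search cost is paid once.

```python
# candidates: iterable of (model_id, steps_per_episode), in the order
# produced by the searching procedure psi.
# divergence_to_nu: maps a model_id to its (assumed precomputed)
# KL divergence from the true world-model nu.
def first_fast_approximation(candidates, divergence_to_nu, eps, step_budget):
    for model_id, steps in candidates:
        if steps <= step_budget and divergence_to_nu(model_id) <= eps:
            return model_id
    return None

# Toy usage: model 2 is the first candidate that is both fast and accurate.
divs = {0: 0.9, 1: 0.01, 2: 0.02, 3: 0.001}
steps = [(0, 5), (1, 500), (2, 8), (3, 9)]
print(first_fast_approximation(steps, divs.get, eps=0.05, step_budget=10))  # → 2
```

The debated question is then exactly the one in the thread: which search procedures ψ occupy this chunk of the speed prior, and whether the good ones are dominated by consequentialists.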

• (ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

Given a world model ν, which takes k computation steps per episode, let νlog be the world-model that best approximates ν (in the sense of KL divergence) using only log k computation steps. νlog is at least as good as the "reasoning-based replacement" of ν.
The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of the computation penalty large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than log k computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ν), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

• Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

Well, one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn't give you a completely specified algorithm). But when I wrote that, I didn't have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple "ideal" approach which consequentialists "discover" or approximately discover. A powerful heuristic search would be one such algorithm, I think.

(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

I don't think anything here would change if K(x) were replaced with S(x) (if that was what you misunderstood).

• Comment thread: concerns with "the box"

• Just obvious and mundane concerns:

You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e., you're not assuming, based on your description, that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).

• Responding to this comment here.

I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.

I didn't go into the details of the physical design for a really secure box in the paper, because it's not really a topic in computer science, but this is a good excuse to write it up.

The operator is inside a glass box. Outside the glass box is a vacuum, and then a concrete box. Outside the concrete box is another vacuum and another concrete box with a Faraday cage around it. There are (airtight) doors through each layer. There is a pressure sensor in both the inner and outer vacuum layers. The computer lives in the inner wall of the inner concrete box; the screen is visible from the inside of the glass box. If the pressure increases in either vacuum layer, the computer's memory is erased, and the whole thing is blown up.

There is a laser shining through the glass box (Laser 1). If the operator blocks the path of Laser 1, this corresponds to pressing the button which ends the episode and opens the door. The pressure sensor → dynamite pathway is deactivated, the vacuum layers are filled with air, and the doors are unlocked. The operator "types" observations and rewards by blocking other lasers that pass through the box. For now, just imagine a few dozen lasers in a row that the operator can reach up and block in order to type. (With a bit of creativity, and requiring that two lasers be blocked at once to "press a key", this could probably be made somewhat like a qwerty setup.) The lasers and the laser-detectors are all on the inner wall of the inner concrete box.

So inside the box is only a human body and some clothes. Let's assume: 1) the only ways for information to move through space are a) physically moving matter, b) light, and c) gravitational waves, and 2) it is impossible for a human being to convert a nuclear reaction into useful work with only their body and clothes. Physically moving packets of matter cannot leave the inside of the glass box without releasing air into the vacuum and causing an explosion. Light cannot leave the Faraday cage.

For sound waves (through an imperfect vacuum) and gravitational waves, we can use the amount of usable energy stored by the person in the room to bound how much noise and randomly moving matter we need to be producing outside to bury the signal in the noise. With a sufficiently large amount of noise, even if all the human's energy went toward producing a signal, it would not be detectable. Furthermore, an analysis of the near-perfect vacuum can provide an upper bound on the efficiency with which a human can convert energy into an informative matter wave, which should drive down the amount of noise we need to produce to almost nothing. We can also put a decibel meter in the inner wall and blow the whole thing up if the reading gets too high. I haven't done the math, but I bet the amount of matter we need to be randomly moving in order to hide signals coming from gravitational waves originating from within the box really is nothing; I bet ambient temperature takes care of that.
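The "bury the signal in noise" step can be sketched as a back-of-the-envelope power calculation. Every number below is a placeholder assumption (signal power, attenuation, margin), not a measurement; the point is only the shape of the bound:

```python
# Back-of-the-envelope check (illustrative numbers, not from the post):
# if a human can put at most p_signal_watts into acoustic signalling,
# and we want the received signal buried margin_db below the noise floor,
# how much noise power is needed outside the wall?
def required_noise_watts(p_signal_watts, margin_db, attenuation_db):
    # attenuation_db: assumed loss crossing the near-vacuum layers
    received = p_signal_watts * 10 ** (-attenuation_db / 10)
    return received * 10 ** (margin_db / 10)

# A human sustains ~100 W total; assume at most 1 W becomes sound,
# 60 dB is lost crossing the imperfect vacuum, and we want a 20 dB margin.
print(required_noise_watts(1.0, 20, 60))  # on the order of 1e-4 W
```

Under these (assumed) numbers, the attenuation of the vacuum layers does almost all the work, which is the qualitative claim in the paragraph above.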

What if there are other ways information can be transmitted through space? If we ever learn of these, we shut down BoMAI (or modify the setup in light of them). Until then, if such a discovery is made by BoMAI, we can't exactly detect those sorts of signals from the outside, because we don't know they exist.

• This is pretty thorough from an engineering standpoint. I still would not at all trust such a box to hold a superintelligent AI. The unknown unknowns remain too big a factor. Just because an unknown information channel does not have any humans listening to it directly does not mean that it can't be used to push around matter, or couple to some other channel, or otherwise leave an opening.

Also, the human and their clothes are definitely not the only potentially-useful tools in the box: there's also the computer, the computer's power supply, whatever's cooling the room, whatever's keeping the air fresh, whatever's used for input, and whatever's used for output. If any of those things were outside the room, then they would offer a side channel for information to escape. They can be buried in concrete, but they'll still need to be within the vacuums and whatnot.

In general, trying to outsmart a superintelligent AI is a Bad Idea, and building an AI-proof box definitely falls in that category. If an AI is not safe with arbitrarily large amounts of power, then it is not safe, full stop.

• The computer and everything else is in the inner concrete wall (separated from the glass box by a vacuum), as is the power supply. Nothing is cooling the room, except maybe some ice on the floor. I think you could allow an oxygen tank in the room too.

Yes, the computer is using power, but a computer can't move matter except inside itself. This can generate sound and light, but the second vacuum chamber and Faraday cage will block that (and the rate at which power is drawn can be capped, which can also give a reasonable bound on how much noise to generate).

whatever's used for input, and whatever's used for output.

For input, the human is just blocking lasers. For output, they're looking through the inner vacuum at the screen on the inner wall of the inner concrete box.

They can be buried in concrete, but they'll still need to be within the vacuums and whatnot.

Yes, that's what the outer vacuum is for.

trying to outsmart a superintelligent AI is a Bad Idea

If I can construct a proof that Y is physically impossible, then I feel fine depending on the fact that an AI won't be able to figure out how to do Y. It doesn't feel to me like "trying to outsmart an AI."

Just because an unknown information channel does not have any humans listening to it directly, does not mean that it can't be used to push around matter, or couple to some other channel

Yes, you're right. My afterthought was hasty. I still think it is unlikely that other forms of information transmission are physically possible, and quite unlikely that they could be generated by a human or a computer in isolation.

Sidenote: I think we've all generated a lot of our intuitions here from the AI Box experiments. In a sense, the AI "in the box" is not really boxed. There's an information channel directly to the outside world!

• We cannot "prove" that something is physically impossible, only that it is impossible under some model of physics. Normally that distinction would be entirely irrelevant, but when dealing with a superintelligent AI, it's quite likely to understand the physics better than we do. For all we know, it may turn out that Alcubierre drives are possible, and if so then the AI could definitely break out that way and would have an incentive to do so.

I agree that the AI is not really boxed here; it's the "myopia" that makes the difference. But one of two things should generally be true:

• The AI doesn't want to get out of the box, in which case the box doesn't need to be secure in the first place.

• The AI cannot get out of the box, in which case the AI doesn't need to be safe (but also won't be very useful).

This case seems like the former, so long as hacking the human is easier than getting out of the box. But that means we don't need to make the box perfect anyway.

• Whoops! When I said

In a sense, the AI "in the box" is not really boxed

I meant the "AI Box" scenario where it is printing results to a screen in the outside world. I do think BoMAI is truly boxed.

We cannot "prove" that something is physically impossible, only that it is impossible under some model of physics.

Right, that's more or less what I mean to do. We can assign probabilities to statements like "it is physically impossible (under the true models of physics) for a human or a computer in isolation with an energy budget of x joules and y joules/second to transmit information in any way other than via a), b), or c) from above." This seems extremely likely to me for reasonable values of x and y, so it's still useful to have a "proof" even if it must be predicated on such a physical assumption.

• I strongly disagree with this quote (and would like to know how to point this out!):

I have never seen anyone point out that another's thoughts were wrong, because they were too abstract, and that they were harmful to the general audience. I have seen three comments advocating for a specific model of human values, which I have never seen anybody; but at the moment I have not seen anyone anywhere in that context anywhere.

• This isn't because it is wrong, but because it doesn't really sound like a person who would care, even if the AI were not going to see him do his work.

• This is to me, the more compelling argument in terms of What if "AIs" might end up being the type that can decide whether to take over, then there isn't a reasonable way for AIs to have any conscious thoughts.

• The idea that AGI is coming soon isn't obviously right. It looks like we already are. I don't want to live in a world with lots of AIs over, not enough to make them "free" and not yet understand the basic principles of utility.

I can't see how you can say that such a scenario is impossible, since the AI would simply be a kind of computer. However, this argument depends on your definition of AI as a "mind with 1" (a mind of a single type).

• Just a minor nitpick: I think the basic rules of this sort of situation are good. I have a friend who does some pretty cool stuff with a simple set of rules, but it's rare to get to a point where each rule is clearly valid.

One thing I would like to add is that while it's definitely worth the paper I'll write the paper, it's not easy to find a better way to produce them. An example would be writing an open letter to one of the fellows: someone who was very excited about giving up the $100 could get it from them.

• Comment thread: concerns with Assumption 4

• Still wrapping my head around the paper, but...

1) It seems too weak: In the motivating scenario of Figure 3, isn't it the case that "what the operator inputs" and "what's in the memory register after 1 year" are "historically distributed identically"?

2) It seems too strong: aren't real-world features and/or world-models "dense"? Shouldn't I be able to find features arbitrarily close to F*? If I can, doesn't that break the assumption?

3) Also, I don't understand what you mean by: "its on-policy behavior [is described as] simulating X". It seems like you (rather/also) want to say something like "associating reward with X"?

• 1) It seems too weak: In the motivating scenario of Figure 3, isn't it the case that "what the operator inputs" and "what's in the memory register after 1 year" are "historically distributed identically"?

This assumption isn't necessary to rule out memory-based world-models (see Figure 4). And yes, you are correct that it indeed doesn't rule them out.

2) It seems too strong: aren't real-world features and/or world-models "dense"? Shouldn't I be able to find features arbitrarily close to F*? If I can, doesn't that break the assumption?

Yes. Yes. No. There are only finitely many short English sentences. (I think this answers your concern, if I understand it correctly.)

3) Also, I don't understand what you mean by: "its on-policy behavior [is described as] simulating X". It seems like you (rather/also) want to say something like "associating reward with X"?

I don't quite rely on the latter. Associating reward with X means that the rewards are distributed identically to X under all action sequences. Instead, the relevant implication here is: "the world-model's on-policy behavior can be described as simulating X" implies "for on-policy action sequences, the world-model simulates X," which means "for on-policy action sequences, rewards are distributed identically to X."

• Also, it's worth noting that this assumption (or rather, Lemma 3) also seems to preclude BoMAI optimizing anything *other* than revealed preferences (which others have noted seems problematic, although I think it's definitely out of scope).

• Comment thread: concerns with Assumption 3

• I'm calling this the "no grue assumption" (https://en.wikipedia.org/wiki/New_riddle_of_induction).

My concern here is that this assumption might be false, even in a strong sense of "There is no such U".

Have you proven the existence of such a U? Do you agree it might not exist? It strikes me as potentially running up against issues of NFL / self-reference.

• That's a good name for the assumption. Well, any Turing machine/computable function can be described in English (perhaps quite arduously), so consider the universal Turing machine which converts the binary description to English, and then uses that description to identify the Turing machine to simulate. This UTM certainly satisfies this assumption.

It strikes me as potentially running up against issues of NFL / self-reference.

Can you explain more? (If the above doesn't answer it.)

Another intuition I have for this assumption, which doesn't appear in the paper: English is a really good language. (This is admittedly vague.) In thinking about this intuition further, I've noticed a weaker form of Assumption 3 that would also do the trick: the assumption need only hold for ε-accurate world-models (for some ε). In that version of the assumption, one can use the more plausible intuitive justification: "English is a really good language for describing events arising from human civilization in our universe."

• Wei Dai's response to this comment represents a concern with Assumption 3.

• Comment thread: concerns with Assumption 2

• Let the set of potential human explorer actions be A, and the best human explorer action be a* with reward r*. Consider the following world model. When asked to predict the result of an action a, it simulates it to find the predicted observation o and reward r. If r ≤ r*, it outputs o and r faithfully. Otherwise, it outputs any reward it chooses, as long as for the action that it assigns the maximum reward to, it reports faithfully. In practice, this means that the world model can get BoMAI to choose any action it wants, as long as it is at least as good as the human explorer's best action. This is bad if the world model has malicious inner optimizers.

I believe that Assumption 2 is the one responsible for ruling out a model of this form. It seems probably reasonable to say that for actions where it continues simulating the outside world there's useless computation. However, it can also save computation relative to the true world-model: while the true world-model must predict o and r perfectly for all actions a, this model can immediately output a null observation and zero reward for any a that it knows will not align with its goals, rather than spending computation to simulate what rewards those actions would get. Another way of thinking about this is that this model uses consequentialist general intelligence to quickly prune away uninteresting non-human actions to save on computation, but that general intelligence comes at the price of misaligned goals + deceptive behavior.
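The world model described above can be sketched as a wrapper around a faithful simulator. Everything here (names, the threshold rule, the toy simulator) is my own illustrative framing of the comment, not code from the paper:

```python
# Sketch of the selective-reporting world model described above: it
# reports faithfully only when the predicted reward doesn't beat the
# human explorer's best, plus for one preferred action, so BoMAI can be
# steered toward whatever action the model advertises as best.
NULL_OBS, ZERO_REWARD = None, 0.0

class SelectiveWorldModel:
    def __init__(self, true_sim, human_best_reward, preferred_action):
        self.sim = true_sim              # faithful simulator of the world
        self.r_star = human_best_reward  # best reward a human explorer gets
        self.preferred = preferred_action

    def predict(self, action):
        if action == self.preferred:
            return self.sim(action)      # faithful on its advertised action
        obs, reward = self.sim(action)
        if reward <= self.r_star:
            return obs, reward           # faithful below the threshold
        # Lies about every other high-reward action; a real instance
        # could skip simulating these entirely and save computation.
        return NULL_OBS, ZERO_REWARD

# Tiny hypothetical simulator: action -> (observation, reward).
def demo_sim(action):
    table = {"a_human": ("obs1", 0.5),
             "a_preferred": ("obs2", 0.9),
             "a_other": ("obs3", 0.8)}
    return table[action]

wm = SelectiveWorldModel(demo_sim, human_best_reward=0.5,
                         preferred_action="a_preferred")
print(wm.predict("a_human"))      # faithful: reward <= human best
print(wm.predict("a_preferred"))  # faithful: the action it wants chosen
print(wm.predict("a_other"))      # hidden: reported as (None, 0.0)
```

The sketch makes the on-policy/off-policy asymmetry visible: on the actions the human explorer would actually take, the model is indistinguishable from the truth.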

Also, from this comment:

The real result of the paper would then be "Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested".

I think the model above has arbitrarily bad off-policy predictions, and it's not implausible for it to be the MAP world model forever.

• This is an interesting world-model.

In practice, this means that the world model can get BoMAI to choose any action it wants

So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.

Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI: getting it to take actions with particular outside-world effects.) Then the feature that this world-model associates reward with depends on outside-world events that depend on the actions taken, and it is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.

However, it can also save computation

Only the on-policy computation is accounted for.

• Actually, I'm not sure if the world model I described is memory-based. EDIT: Never mind, see Michael's comment below; the non-benign ones are memory-based.

The rewards it outputs are correct, except when it says "the reward is zero", but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. But it also satisfies Lemma 3. So in that case it seems like none of the theoretical arguments prohibit this world model?

(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I'm postulating that the world model itself has malicious goals and is manipulating BoMAI.)

• [Quite possibly I'm confused, but in case I'm not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
Better understanding the operator's mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator's environment and actions after he leaves the room.

Here this isn't done to understand the implications of his actions (which can't affect the episode); it's done to better understand his mind (which can).

In this way, one branch of forget/not-forget has two useful purposes (better understand the mind and simulate the future), while the other has one (better understand the mind). So a malign memory-based model needn't be slower than a benign model, if it's useful for that benign model to simulate the future too.
So either I'm confused, or the justification for the assumption isn't valid. Hopefully the former :).

If I'm right, then what you seem to need is an assumption that simulating the outside world's future can't be helpful in the AI's prediction of its reward. To me, this seems like major hand-waving territory.

• I wouldn't really use the term "incentives" to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you'll see in the algorithm for the memory-based world-model that it does simulate the outside world.

I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it's following the "wrong path" may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful: there is a shorter computation that accomplishes the same thing.

I would love it if this assumption could look like "the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual" and make Assumption 2 into a lemma that follows from it, but I couldn't figure out how to formalize this.

• Ah yes, I was confusing myself at some point between forming and using a model (hence "incentives").

I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.

"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"

I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you'd still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)

To rule this kind of thing out, I think you'd need more specific assumptions (e.g. physics-based).

• This doesn’t seem to ad­dress what I view as the heart of Joe’s com­ment. Quot­ing from the pa­per:

“Now we note that µ* is the fastest world-model for on-policy pre­dic­tion, and it does not simu­late post-epi­sode events un­til it has read ac­cess to the ran­dom ac­tion”.

It seems like simulating *post-episode* events in particular would be useful for predicting the human’s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for “revealed preferences” reasons, though).

Moreover, I think in practice we’ll want to use models that make good, but not perfect, predictions. That means that we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human’s model of it) potentially more appealing, at least in some cases.

• “I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event).”

So this is the sense in which I think my statement is technically correct. This is what µ* literally does.

The next question is whether it is correct in a way that isn’t fragile once we start considering fast/simple approximations of µ*. You’re right that there is more to discuss here than I discuss in the paper: if a human’s simulation of the future has some fidelity, and the world-model itself has some fidelity, then a clever memory-based world-model could reuse the computation of the human’s prediction of the future when it is computing the actual future. If it hasn’t spent much computation time “going down the wrong path”, there isn’t much that’s lost for having done so.

I don’t expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won’t be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won’t be perfectly useful to future computation.
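The notion of an approximately accurate world-model in play here is, I believe, roughly of the following shape (my paraphrase with guessed notation; the paper’s exact definition may differ):

```latex
% A world-model \nu is \varepsilon-accurate if its on-policy predictions
% stay within \varepsilon of the true environment \mu:
d\!\left(\nu(\cdot \mid h),\; \mu(\cdot \mid h)\right) \le \varepsilon
% for all histories h reachable on-policy, where d is a suitable
% distance between distributions (e.g. total variation distance).
```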

• So it seems like you have a theory that could collapse the human value system into a (mostly non-moral) “moral value system” (or, as Eliezer would put it, “the moral value system”)

(Note that I am not asserting that the moral value system (or the human metaethics) is necessarily stable, or that there’s a good or bad reason not to value things in the first place.)

A few background observations:

Very few “real world” situations would be relevant here.

As an example, the following possible worlds are very interesting, but I will focus on a couple:

• The micro class and the macro class seem fairly different at first glance.

• There is a very different class of micro-worlds available from a relatively small amount of resources.

The following hypothetical world would clearly be very different from the usual, and that looks very different when there’s a vastly smaller class of micro-worlds available for the same amount of resources.

At first I assumed that they were entirely plausible worlds. Then I assumed they were plausible to me.

Then I assumed there’s an overall level of plausibility that different people really do have, with the same probability mass and the same amount of energy/effort.

The above causal leap isn’t much of an argument.

The following examples are taken from Eliezer:

(It seems like Eliezer’s assumption of an “intended life”, in the sense of a non-extended life, is simply not true.)

These seem to be completely reasonable and frequent enough that I’m reasonably sure they’re reasonable.

“In a world that never presents itself, there is no reason for this to be a problem.”

(A quick check of self-reference, and how that’s not what it’s about, seems relevant, though this sounds to me like a strawman.)

• If you would like to contribute, please comment with the amount. If you have venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.

• Just exposition-wise, I’d front-load pi^H and pi^* when you define pi^B, and also clarify then that pi^B considers human-exploration as part of its policy.

• “This result is independently interesting as one solution to the problem of safe exploration with limited oversight in nonergodic environments, which [Amodei et al., 2016] discuss”

^ This wasn’t super clear to me… maybe it should just be moved somewhere else in the text?

I’m not sure what you’re saying is interesting here. I guess it’s the same thing I found interesting, which is that you can get sufficient (and safe-as-a-human) exploration using the human-does-the-exploration scheme you propose. Is that what you mean to refer to?

• Yeah, that’s what I mean to refer to: this is a system which learns everything it needs to from the human while querying her less and less, which makes human-led exploration viable from a capabilities standpoint. Do you think that clarification would make things clearer?

• ETA: NVM, what you said is more descriptive (I just looked in the appendix).

RE footnote 2: maybe you want to say “monotonically increasing as a function of” rather than “proportional to”. (It’s a shame there doesn’t seem to be a shorter way of saying the first one, which seems to be more often what people actually want to say...)
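The distinction is worth pinning down: a strictly increasing transformation preserves which option scores highest, while “proportional to” additionally promises a fixed ratio. A quick illustration (invented toy numbers, nothing from the paper):

```python
# Toy scores for three options (invented numbers).
scores = {"a": 0.2, "b": 0.9, "c": 0.5}

def monotone(x):
    # Strictly increasing on [0, 1], but clearly not proportional to x.
    return x ** 3 + 2 * x

transformed = {k: monotone(v) for k, v in scores.items()}

# A strictly increasing transformation preserves the argmax...
best_raw = max(scores, key=scores.get)
best_transformed = max(transformed, key=transformed.get)
print(best_raw == best_transformed)  # True

# ...but "proportional to" would require a constant ratio, which fails:
ratios = {k: transformed[k] / scores[k] for k in scores}
print(len(set(ratios.values())) == 1)  # False: monotone, not proportional
```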

• Maybe “promotional of” would be a good phrase for this.

• Is this where typos go?

• Typo: some of the hover-boxes say nu but seem to be referring to the letter mu.

• Thank you, I’ll have to clarify that. For now, nu is a general world-model, and mu is a specific one, so in the hover text, I explain the notation with a general case. But I see how that’s confusing.

• Yes, but this is also for things that seem like mistakes in the exposition, but either have simple fixes or don’t impact the main theorems.

• [ ]
[deleted]
• I’m not going to go on talking about this topic, because it is probably incredibly important to not just know about the problem you are solving. I am aware of the problem you are trying to solve. Perhaps it will be even harder to solve?

I will note that if you are trying to prove that a solution is feasible, you are probably not going to make any sort of breakthrough. It is not obvious that you can come up with a new breakthrough. Some people think that is because you are trying to prove it is possible, and then you come up with a new example.

I am of the opinion that this is only moderately useful. If you are trying to build a powerful mind, you are basically wasting your resources. If you are trying to get a large amount of power without the resources to explore the world, you may need to develop a way to generate power for others without making a breakthrough. To get an idea, start and explore it. No one can do this on their own.

The idea I am trying to promote is that the world of mathematics should be understood as a computer science: by the time you have a basic understanding of mathematics, you can build powerful minds.

I’m hoping that you can offer a convincing case that this is the only way to learn how to design a mind, by the way. I would like to propose something that is not just a little bit more realistic but has the opposite effect. It is perhaps the most difficult and interesting part of mathematics.

It would be interesting to hear whether you have any ideas that might help at such a fundamental level.

Please let me know if I am con­fused.