# michaelcohen

Karma: 195 (LW), 57 (AF)

# Asymptotically Unambitious AGI

6 Mar 2019 1:15 UTC
40 points
• Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:

4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

# Impact Measure Testing with Honey Pots and Myopia

21 Sep 2018 15:26 UTC
11 points
• I’m sorry it sounded like a dig at CHAI’s work, and you’re right that “typically described” is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it’s nearly complete—I don’t think I’ve seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator’s action might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human’s actions instead of from a reward signal gets us any closer to safety, but I’m sorry to have misrepresented views. (And maybe it’s worth mentioning that I’m fiddling with something that bears a strong resemblance to Inverse Reward Design, so I’m definitely not that bearish on the whole idea.)

• A bit of a nitpick: IRD and this proposal both formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent’s beliefs about how the evaluator acts are much more general than IRD’s. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.

• From Paul:

I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one was that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resemble good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.

My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is that I pared it down to a minimum viable product.

Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*

I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.

I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside-world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds, whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
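Here’s a minimal sketch of that episode loop as I remember it; all of the function and object names (run_episode, operator.predict, outside_world.resolve, the 0/1 scoring rule) are hypothetical placeholders rather than anything from the paper:

```python
import random

def run_episode(bomai, operator, prediction_tasks, outside_world):
    """Hypothetical sketch of one episode of the prediction-task 'oracle' setup."""
    query = operator.ask_question()
    answer = bomai.answer(query)                         # BoMAI's answer to the operator
    task = random.choice(prediction_tasks)               # randomly sampled prediction task
    guess = operator.predict(task, informed_by=answer)   # operator predicts before leaving the box
    truth = outside_world.resolve(task)                  # ground truth logged later, outside the box
    reward = score(guess, truth)                         # accuracy converted into BoMAI's reward
    bomai.observe(reward)

def score(guess, truth):
    # The "obvious" 0/1 scoring rule; as noted above, this invites the degenerate
    # "just say it won't work" strategy, so a better-calibrated rule is needed.
    return 1.0 if guess == truth else 0.0
```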

I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.

* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.

# IRL in General Environments

10 Jul 2019 18:08 UTC
7 points

# Value Learning is only Asymptotically Safe

8 Apr 2019 9:45 UTC
7 points
• This is an interesting world-model.

In practice, this means that the world model can get BoMAI to choose any action it wants

So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.

Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI—getting it to take actions with particular outside-world effects.) Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, and is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.

However, it can also save computation

Only the on-policy computation is accounted for.

• These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere.)

1)

If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking the no-op action), that is the ideal according to AUP—i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose instead that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?

2)

If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward so that the agent submits 3 actions at a time, and receives the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously, and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous (see the numeric sketch after point 4 below). Since this problem appears in the math, but not in the intuition, it makes me wary of the reliability of the intuition.

3)

Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact it does) by testing them out.

4)

Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is the no-op action? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe the no-op action is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having the no-op action be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.
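Here is the minimal numeric sketch of the cancellation worry from point 2, assuming an AUP-style penalty proportional to the absolute change in an auxiliary Q-value (the numbers are made up):

```python
# Hypothetical attainable utility (auxiliary Q-value) for one auxiliary reward,
# measured before and after each primitive action inside a 3-action macro-action.
q_start = 10.0
q_after_first = 2.0     # "first" action destroys attainable utility (dangerous)
q_after_second = 2.0    # "second" action is neutral
q_after_third = 10.5    # "third" action restores it (also dangerous)

# Penalizing each primitive action registers both dangerous swings:
per_step_penalty = (abs(q_after_first - q_start)
                    + abs(q_after_second - q_after_first)
                    + abs(q_after_third - q_after_second))

# Penalizing only the macro-action sees just the net change, which nearly cancels:
macro_penalty = abs(q_after_third - q_start)

print(per_step_penalty, macro_penalty)  # 16.5 vs 0.5
```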

• a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.

At the end of the day, it will be running some subroutine for its gain-trust/predict-accurately phase.

I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
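To illustrate the kind of overhead I mean, here is a toy sketch (nothing to do with actual UTM constructions; the instruction set and the operation counts are made up) comparing a program run natively with the same program stepped through a general-purpose simulator loop:

```python
# A made-up instruction set, just to count primitive operations.
program = [("add", 3), ("add", 4), ("mul", 2)] * 1000

def run_native(program):
    """Run the program directly: roughly one primitive operation per instruction."""
    acc, ops = 0, 0
    for op, arg in program:
        acc = acc + arg if op == "add" else acc * arg
        ops += 1
    return acc, ops

def run_simulated(program):
    """Step the same program through an explicit fetch/decode/execute loop,
    the way a general-purpose simulator would, paying bookkeeping costs on
    every simulated step."""
    acc, pc, ops = 0, 0, 0
    while pc < len(program):
        op, arg = program[pc]; ops += 1                      # fetch
        is_add = (op == "add"); ops += 1                     # decode/dispatch
        acc = acc + arg if is_add else acc * arg; ops += 1   # execute
        pc += 1; ops += 1                                    # advance the simulated head
        ops += 1                                             # loop bookkeeping
    return acc, ops

print(run_native(program)[1], run_simulated(program)[1])  # 3000 vs 15000 operations
```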

# Not Deceiving the Evaluator

8 May 2019 5:37 UTC
5 points
• One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can’t necessarily tune the weighting in advance to try to take this into account.
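A toy numeric version of that first point, with made-up (linear) returns on effort for the two utility functions:

```python
# Hypothetical: the agent splits a fixed effort budget between optimizing U1 and U2,
# and U1 happens to be far cheaper to optimize per unit of effort.
def mixed_utility(effort_on_u1, budget=10.0, rate_u1=5.0, rate_u2=0.1):
    effort_on_u2 = budget - effort_on_u1
    return 0.5 * (rate_u1 * effort_on_u1) + 0.5 * (rate_u2 * effort_on_u2)

best_value, best_split = max((mixed_utility(e), e) for e in [i * 0.5 for i in range(21)])
print(best_split)  # 10.0: the entire budget goes to U1, and U2 is ignored completely
```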

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah, if they had the same priors and did unbounded reasoning, I wouldn’t be surprised anymore if there exists a mixture that they would agree to.

• 3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don’t know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still wouldn’t know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

• In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.

ETA: And for a sequence of states, the utility of the sequence is the sum of the utilities of the individual states.

A′ and A″ look like A, and BC looks like C.

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.

The agent is quite sure they’re in state A.

The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but has one key difference—from A″, the relevant action has no effect. The agent won’t capitalize on this confusion.

The optimal policy is , followed by (forever) if , otherwise followed by . Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by doing , , then forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
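Spelling out that expected value: 4/5 · 10 + 1/5 · (−1) = 8 − 0.2 = 7.8 expected reward per step under that policy.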

The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.

• A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

• Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

(Expanding on it)

So suppose the evaluator was human. The human’s lifetime of observations in the past gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and it might even be particularly disinterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.

ETA: One such “oddly specific conviction”, e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

• I don’t see why their methods would be elegant.

Yeah, I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I’m not sure this is a hugely important disagreement for us to resolve.

I don’t see how MAP helps things either

I didn’t realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:

it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you’re not first, you’re last, and more importantly, you can’t team up with other consequentialist-controlled world-models in the mixture.

• I think you might be more or less right here.

I hadn’t thought about the can-do and the worth-doing update, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.

The reason I think of these steps as updates to what started as a universal prior is that they would like to take over as many worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.

they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that

If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].

But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.

So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small ,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the “predict well” part is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare case that treacherous world-models may be shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about). But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.
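A rough sketch of how MAP plus a speed prior plays out under that intuition; the functional form of the prior (2^(-description length) divided by compute time) and all of the numbers are assumptions I’m making for illustration, not anything from the paper:

```python
import math

def speed_prior_log2_weight(description_length_bits, compute_steps):
    # One way to operationalize a speed-prior-style weight: 2^(-length) / time,
    # i.e. log2-weight = -length - log2(time). (An assumption for illustration.)
    return -description_length_bits - math.log2(compute_steps)

# Made-up numbers for two world-models that predict the same history equally well:
native = speed_prior_log2_weight(400, 1e9)   # runs the predictive algorithm "directly"
daemon = speed_prior_log2_weight(390, 1e15)  # simulated aliens on a "virtual machine":
                                             # slightly shorter description, ~10^6x slower

print(round(native, 1), round(daemon, 1))
# ≈ -429.9 vs -439.8: with these made-up numbers the slowdown penalty outweighs the
# daemon's assumed description advantage, so under MAP the daemon gets no weight at all.
```

Whether the slowdown penalty actually exceeds any description-length advantage is, of course, exactly what the Natural Prior Assumption is asserting.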

Another point I was considering (and I’m not exactly sure where this fits in): I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”