# Contest: $1,000 for good questions to ask an Oracle AI

## The contest

I’m offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: they allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

## Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.

Two of the safest designs seem to be the counterfactual Oracle and the low bandwidth Oracle. These are detailed here, here, and here, but in short:

• A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.

• A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives for longer term manipulation.
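The counterfactual, episodic setup can be made concrete with a toy sketch. Everything here is illustrative: the oracle is modelled as a plain function from questions to answers, and `erasure_prob`, `run_episode`, and the scoring rule are hypothetical names, not part of any specific design.

```python
import random

def run_episode(oracle, question, true_outcome, erasure_prob=0.01, rng=random):
    """One episode of a toy counterfactual oracle.

    With probability `erasure_prob` the answer is erased: humans never
    see it, and the oracle is scored against the outcome of that unseen
    world. Otherwise the answer is shown and the reward is a constant,
    so the oracle gains nothing by shaping what humans read.
    """
    answer = oracle(question)
    if rng.random() < erasure_prob:
        # Erasure event: score the prediction, show humans nothing.
        loss = abs(answer - true_outcome)
        return {"shown": None, "reward": -loss}
    # Answer is shown; the objective is trivial (zero) in all seen worlds.
    return {"shown": answer, "reward": 0.0}
```

The key property is that a non-trivial reward is only ever computed in the branch where `shown` is `None`.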

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer “ask” the Oracle any question we feel like, and we certainly can’t have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us: it’s about another world, that we don’t inhabit.

Despite this, it’s possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can’t directly ask either Oracle, but there are indirect ways of getting advice:

• We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.

• The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don’t see its answer, we will select a team A and a team B at random and fund them for a year, comparing their performance at the end of the year. We then ask which team A[2] it expects to most consistently outperform any team B.

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
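The low bandwidth version of the funding question can be sketched as follows. `teams`, `performance_after_year`, and the 0/1 reward are stand-ins for the fixed answer list and our end-of-year judgement; none of this is from the original design.

```python
import random

def low_bandwidth_funding_episode(oracle, teams, performance_after_year, rng=random):
    """One episode of the funding question for a low bandwidth Oracle.

    The Oracle must pick team A from the fixed list `teams` (the small
    answer space). Team B is drawn at random, and the Oracle is rewarded
    iff, at the end of the year, we judge A to have outperformed B.
    """
    team_a = oracle(teams)
    assert team_a in teams, "a low bandwidth answer must come from the fixed list"
    team_b = rng.choice(teams)
    # `performance_after_year` stands in for our future judgement.
    reward = 1.0 if performance_after_year(team_a) > performance_after_year(team_b) else 0.0
    return team_a, team_b, reward
```

Note the assert: the whole point of the low bandwidth design is that the answer space is a fixed, small list.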

But can we do better? Can we do more?

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I’m offering $1,000 for interesting new questions we can ask of these Oracles. Of this:

• $350 for the best question to ask a counterfactual Oracle.

• $350 for the best question to ask a low bandwidth Oracle.

• $300 to be distributed as I see fit among the non-winning entries; I’ll be mainly looking for innovative and interesting ideas that don’t quite work.

Exceptional rewards go to those who open up a whole new category of useful questions.

## Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be on their own separate comment threads, start with “Submission”, and you must specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the humans using the Oracle is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it’s obvious. Some short examples can be found here.

EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run only once per episode (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get its objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.

1. A note on timezones: as long as it’s still the 31st of August anywhere in the world, your submission will be counted. ↩︎

2. These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details. ↩︎

• Some assorted thoughts that might be useful for thinking about questions and answers:

• a question is a schema with a blank to be filled in by the answerer after evaluation of the meaning of the question.

• shared context is inferred, as most questions are underspecified (domain of question, range of answers)

• a few types of questions:

• narrow down the field within which I have to search, either by specifying a point or specifying a partition of the search space

• questions about the specificities of variants: who, where, when

• questions about the invariants of a system: what, how

• questions about backwards-facing causation

• questions about forward-facing causation

• meta questions about question schemas

• what do we want a mysteriously powerful answerer to do?

• zoom in on optimal points in intractably large search spaces

• eg specific experiments to run to most easily invalidate major scientific questions

• specify search spaces we don’t know how to parameterize

• eg human values

• back chain from types of answers to infer a taxonomy of questions

• an explanation relative to a prediction:

• a prediction returns the future state of the system

• an explanation returns a more compact than previously held causal explanation of the system, though it might still not generate sufficiently high resolution predictions

• How to detect ontology errors using questions?

• is this question malformed? if so, what are some alternative ways of framing the question that could return interesting answers

• ie the implied search space of the question was not correct and can either be extended or transformed in this way

• types of questions are types of search spaces

• questions that change weightings on factors vs describe new factors

• recognition of the many connection types in the human semantic network

• recursive questions move along one dimension as they restrict the space. ie where is that at different spatial resolutions

• qualitative and quantitative dimensions along which a query can be moved exchange information about the implied search space

• under vs overspecified questions

• closed and open search spaces

• navigational questions only work under the one-off assumption if the search is stateless

• termination guarantees

• completeness guarantees

• optimality guarantees

• meta questions about the methods the system uses to navigate intractable spaces

• time vs space vs....?

• generating candidates is easy, checking is hard, and vice versa

• failure mode mapping: do failures imply directionality?

• eg does a failure of a candidate change which candidate you go to next (stateful)

• how to think about attack surfaces for question answering systems

• can this get a human to run arbitrary code by counterfactually cooperating with itself on what step of the process it is on? Can this be tested by going through the whole process with human A, then scrambling the steps and running through the same thing with human B, and seeing if answers diverge?

• what clever ways do hackers use to widen restricted bit streams?

• Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually, if we didn’t see the Oracle’s answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This could significantly accelerate AI safety research while being safe, since it’s just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn’t secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way to measure similarity.
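A deliberately simple fallback similarity measure of the kind mentioned at the end could look like this. Bag-of-words cosine similarity is my illustrative choice, not something specified in the submission; the idea is only that a dumber reward channel leaves less surface for adversarial outputs.

```python
from collections import Counter
from math import sqrt

def bag_of_words_similarity(text_a, text_b):
    """Cosine similarity over word counts: a simple stand-in for the
    ML similarity model, chosen to be hard to attack with adversarial
    inputs (at the cost of being a much cruder measure)."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def reward(predicted_posts, actual_top_posts):
    """Average each predicted post's best match against the real top posts."""
    return sum(
        max(bag_of_words_similarity(p, a) for a in actual_top_posts)
        for p in predicted_posts
    ) / len(predicted_posts)
```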

• (I’m still confused and thinking about this, but figure I might as well write this down before someone else does. :)

While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

• This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.

So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a “status quo” bonus that won’t hideously backfire, or something?)

• What if another AI would have counterfactually written some of those posts to manipulate us?

• If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.

• Question: are we assuming that the mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren’t optimized for the objective function that we have in mind, and plan our questions accordingly?

• Assume either way, depending on what your suggestion is for.

• Where (under which assumption) would you suggest that people focus their efforts?

Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on?

Your examples all seem to assume Oracles that are superhumanly intelligent. If that’s the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today’s, or are there lots of other almost-as-capable AIs around?

ETA: The reason for asking these questions is that you’re only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren’t surprised later when you don’t pick them as winners for this kind of reason.

• The ideal solution would have huge positive impacts and complete safety, under minimal assumptions. More realistically, there will be a tradeoff between assumptions and impact.

I’m not suggesting any area for people to focus their efforts on, because a very effective approach with minimal assumptions might win, or a fantastically effective approach under stronger assumptions. It’s hard to tell in advance what will be most useful.

• Suggestion for low bandwidth OAI:

General principle: use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard that an untrue answer will have some dangerous information in it.

Example: there are many bitcoin wallets whose passwords have been forgotten by their owners. The OAI could guess the passwords, and the owners would pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from the OAI into the wallet. The money could be used for AI safety research.
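The no-human-in-the-loop property is the crux, so here is a minimal sketch of the pipeline; `oracle` and `unlocks` are hypothetical stand-ins for the password guesser and the wallet software, not real APIs.

```python
def redeem_wallet(oracle, wallet_address, unlocks):
    """Automated redemption: the guessed password flows straight from the
    Oracle into the wallet check and is then discarded, so no human ever
    reads it and an untrue answer carries no informational hazard."""
    password = oracle(wallet_address)      # never displayed or logged
    success = unlocks(wallet_address, password)
    del password                           # discard without showing a human
    return success
```

Only the boolean result (and, in the real scenario, the recovered funds) ever reaches people.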

• This is against the technical definition of low bandwidth (a small space of possible solutions), but somewhat in the spirit (low bandwidth for the solutions humans get to see). I’ll accept it as a submission.

• How would this be low bandwidth? If we’re able to give the Oracle a list of passwords to guess from, can’t we just check them all?

• We don’t give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low bandwidth. The Oracle may either be guessing passwords using some quantum cryptography tools, in which case the input is also relatively low bandwidth; or it needs to simulate all of human history, by guessing the psychological processes in the wallet owner’s mind. But in the second case it needs to have access to all the data of the internet, which is risky.

• Another variant of a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions and earn money from them. In that case the input may be very short, like “prove the Goldbach conjecture”, but the output could be arbitrarily long and may be dangerous, if there are some encoded messages. To escape this, we need another Oracle, which only checks proofs in some formal language.

Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.

Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.

• Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.

That is never something safe to assume. I can write formally correct proofs that contain hidden messages quite easily: add extra lemmas and extra steps. Unless we’re very smart, it would be hard for us to detect which steps are unnecessary and which are needed, especially if it rewrites the main proof thread somewhat.

Another way to check proofs is to run two slightly different non-communicating Oracles and compare outputs.

I’ll accept that as part of a submission if a) you develop it more, in a formal way, and b) you repost it as a top level comment.

• My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle’s answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human’s answer to that of the oracle.

More precisely, let

• O be the counterfactual oracle,

• H(q) be the human’s answer to question q when given the ability to call O on any question other than q, and

• d be some distance metric on answers in natural language (it’s not that hard to make something like this, even with current ML tools).

Then, reward O as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let L(q) = d(O(q), H(q)), where H(q) is hidden from O and judged only by d, as in the standard counterfactual oracle setup.

(Of course, this doesn’t actually work because it has no guarantees wrt inner alignment, but I think it has a pretty good shot at being outer aligned.)
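The recursive structure of this scheme can be sketched in code; the human and the distance metric are stand-in callables, the symbols mirror the O/H/d roles described above, and nothing here is a real training loop.

```python
import random

def ida_answer(question, oracle, human, distance, erasure_prob, rng=random):
    """Answer `question`, returning (answer, loss), where loss is None
    unless this invocation had an erasure.

    On an erasure the human answers instead, consulting the oracle on
    subquestions via `ask` (each such call handled the same way,
    recursively), and the oracle's loss is d(O(q), H(q)).
    """
    answer = oracle(question)
    if rng.random() >= erasure_prob:
        return answer, None                    # answer shown; no reward signal
    # Erasure: H(q) is computed by the human plus recursive oracle calls.
    ask = lambda subq: ida_answer(subq, oracle, human, distance, erasure_prob, rng)[0]
    target = human(question, ask)
    return target, distance(answer, target)    # loss for this invocation only
```

The recursion bottoms out whenever the human answers a subquestion without asking anything further.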

• Is it safe to ask the Oracle a subquestion in the event of erasure? Aren’t you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven’t been following it closely.)

• I’m not sure I understand the concern. Isn’t the oracle answering each question to maximize its payoff on that question in the event of an erasure? So it doesn’t matter if you ask it other questions during the evaluation period. (If you like, you can say that you are asking them to other oracles—or is there some way that an oracle is a distinguished part of the environment?)

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.

I guess you can construct a model where the oracle does what you want, but only if you don’t ask any other oracles questions during the evaluation period; but it’s not clear to me how you would end up in that situation, and at that point it seems worth trying to flesh out a more precise model.

• I’m not sure I understand the concern.

Yeah, I’m not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I’ve been thinking about it myself. One thing I’ve come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they “want”, whereas the same thing doesn’t seem to be true in the sequential case (if we make it so that in both cases each Oracle cares only about performance on the current question).

• I mean, if the oracle hasn’t yet looked at the question, they could use simulation warfare to cause the preceding oracles to take actions that lead to them getting given easier questions. Once you start unbarring all holds, stuff gets wild.

• Yes, but if we can make it so that each Oracle looks at the question it gets and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.

Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well “on the current question” or “in the real world” or “on the actual questions that it gets”). That’s because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input. In the nested case, performance on given inputs does increase.

• in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input

Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

• Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

I was assuming each “question” actually includes as much relevant history as we can gather about the world, to make the Oracle’s job easier, and in particular it would include all previous Oracle questions/answers. In that case, if Oracle #1 does X to make question #2 easier, it was already that easy, because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that’s not actually a safe assumption, because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn’t include X.

My point about “if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case” still stands though, right?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

You may well be right about this, but I’m not sure what “reason from selection” means. Can you give an example, or say what it implies about nested vs sequential queries?

• You may well be right about this, but I’m not sure what “reason from selection” means. Can you give an example, or say what it implies about nested vs sequential queries?

What I want: “There is a model in the class that has property P. Training will find a model with property P.”

What I don’t want: “The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P.”

Example of what I don’t want: “Manipulative actions don’t help get a high reward (at least for the episodic reward function we intended), so the model won’t produce manipulative actions.”

• So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives for longer term manipulation.

On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?

ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?

• Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

Does anyone know what Paul meant by this? I’m afraid I might be missing some relatively simple but important insight here.

• If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don’t we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it’s a problem regardless.

Yeah, that’s a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that’s happening then it’s a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you’d have to solve if you ever actually wanted to implement a counterfactual oracle.

First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you’re guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice; so if you’re relying on getting enough training data to produce an agent which will optimize for this objective, you’re screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you’re already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is one where you’re already willing to make this (very strong) assumption, so it’s fine.

Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
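Those two relaxations can be written down directly; the base rate, growth factor, and scale below are arbitrary illustrative numbers, not values from the comment.

```python
def erasure_prob(depth, base=0.01, growth=2.0):
    """Relaxation (a): make an erasure more likely with each subquestion,
    i.e. more likely deeper in the query tree."""
    return min(1.0, base * growth ** depth)

def reward_weight(depth, scale=2.0):
    """Relaxation (b): scale the reward exponentially with the depth at
    which the erasure occurs, so expected reward concentrates in the
    complete-erasure world."""
    return scale ** depth
```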

• First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.

But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large, so you’ll run out of resources (in the counterfactual world) if you try to do that, and the Oracle won’t be able to give you a useful prediction. Wouldn’t it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my “predict the best AF posts” idea)? I don’t see what value the IDA idea is adding here.

Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.

Given the above, “only provide rewards in the event of a complete erasure” doesn’t seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?

• Yeah, that’s a good point.

Okay, here’s another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step’s worth of reward, all the reward it will ever get, or something else somewhere in between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.

Maybe we can turn this into a zero-sum game, though? Here’s a proposal: let O′ be a copy of O and Q be the set of all questions in the current tree that also get erasures. Then, let L(q) = d(O(q), H(q)) − (1/|Q|) ∑_{q′ ∈ Q} d(O′(q′), H(q′)), such that the net sum of losses across each tree, for all the nodes for which losses are computed, is zero. It’s still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.

It is also worth noting, however, that even if this works it is a very artificial fix, since the term you’re subtracting is a constant with no dependence on O’s output, so if you’re trying to do gradient descent to optimize this loss, it won’t change anything at all (which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we’re still back at the problem of none of this working unless you’re willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.

• which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives

Sorry, I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?

In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I’m not sure how that can happen at the lower levels, where the overseers are not very smart.

• Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.

• Two basic questions I couldn’t figure out (sorry):

Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?

If we know the oracle is only optimizing for the specified objective function, are mesa-optimisers still a problem for the proposed system as a whole?

• You can use a different oracle for every subquestion, but it’s unclear what exactly that does if you don’t know what the oracle’s actual objective is. For example, you could imagine one system that cares about the reward given to copies just as much as the reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.

The key here, I think, is the degree to which you’re willing to make an assumption of the form you mention—that is, how much leeway you are willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you’re allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you’re assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that’s a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn’t care about other copies and one that does is also going to depend on exactly what sort of inner alignment solution you’re assuming.

• Well, a given copy of the or­a­cle wouldn’t di­rectly re­cieve in­for­ma­tion from the other or­a­cles about the ques­tions they were asked. To the ex­tent a prob­lem re­mains (which I agree is likely with­out spe­cific as­sump­tions), wouldn’t it ap­ply to all coun­ter­fac­tual or­a­cles?

• Thanks!

• Submission for low bandwidth Oracle:

General idea: a "true" Oracle will output the same result every time, while "lying" Oracles would lie differently. To check that a solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other's existence and coordinate acausally. To prevent this, additional steps are implemented:

1) The exact number of Oracles to be compared is generated randomly, and the Oracles have slightly different, also random, initial conditions (if applicable). For example, if they are based on ML, they are trained using different initial weights.

2) The Oracles don't have any information about the outside world which they could use for coordination. The Oracles are used only to solve mathematical problems, and they output the results in a formal language.

3) The output of the Oracles is checked by a computer program which shows the result to humans only if the output is the same for all Oracles. If the results differ, it deletes all of them.

4) There is another Oracle, which is only able to check proofs, and can output only 0 or 1. This Oracle can be tested beforehand on a large number of already existing proofs, so we can know that it is trustworthy with very high probability.

5) There is a third Oracle, which only checks whether there are hidden messages in the proofs and outputs 0 or 1. It is also well calibrated on some dataset.

6) The proofs of mathematical problems are published for money, and this money is used for more AI safety research.

Personally, I am still not sure about this setup: it could work with high probability, but could still be vulnerable, as proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to light a campfire: the instrument is so dangerous that it requires many levels of protection for a small practical result.

Also, the proof-checking AI from paragraph 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including testing of mechanical designs and AI safety ideas. Maybe it should be a separate submission.
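The gatekeeper in step 3) is simple enough to sketch as ordinary code. This is a minimal Python sketch with Oracle outputs stubbed as strings; nothing here beyond the comparison rule itself comes from the proposal:

```python
# Gatekeeper: reveal a result only if every independently
# initialized Oracle returned the identical formal output.
def gate(outputs):
    """Return the shared answer if all Oracle outputs agree, else None."""
    if outputs and all(o == outputs[0] for o in outputs):
        return outputs[0]
    return None  # disagreement: delete all results, show humans nothing

# Example with stubbed outputs from three Oracles:
assert gate(["proof-A", "proof-A", "proof-A"]) == "proof-A"
assert gate(["proof-A", "proof-B", "proof-A"]) is None
```

Note that the gatekeeper itself never interprets the proofs; it only compares byte strings, which keeps it dumb enough to trust.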

• We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object looked for more unique, such as the shortest proof of the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.

• Thanks! Seems potentially promising. Will analyse this more properly later.

• Regarding 5):

Is it possible to obtain a dataset of mathematical proofs with an Oracle's hidden messages without running a dangerous Oracle?

We can ask an AI researcher to embed some hidden messages into several proofs and use that as a training dataset. But it will only help to detect those hidden messages from the Oracle which are similar to what the AI researcher wrote.

Gurkenglas already suggested that we restrict the Oracles to outputting only the shortest proof. If the proof is shortest, doesn't that imply the absence of hidden messages?

• How could we be sure that the proof is actually the shortest?

• Tell it to prove: "that proof is shortest and this proof is shortest".

• Another option is to run several different Oracles on phi and take the shortest proof.

• Mr Armstrong has specified that (ctrl-f) "other Oracles can't generally be used on the same problem".

• "Can't generally be used"; if you understand the setup and are careful, you might be able to do so.

• What is the advantage of making money this way, compared to making money by predicting the stock market?

• We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, the owners. A stock-market-predicting AI may cause a crash just to earn something. It also has to predict the effects of its own actions, as we will make bets according to its predictions, and thus it could send data to the real world.

• Submission: Low-bandwidth Oracle

What is the most likely solution to the Fermi Paradox?

The answer can be picked from a small number of options (Rare Earth, Aestivation, Great Filter, Planetarium, etc.). There are a number of observations that we can make based on the question alone. However, in the end the LBO can only do one of two things: lie or be honest. If it lies, the prediction will have a harder and harder time matching the reality that we observe as time goes on. Alternatively, we confirm the prediction and learn some interesting things about the universe we live in.

Submission: Low-bandwidth Oracle

What was the first self-replicating molecule on Earth?

A short answer (can also be limited to a list), easy to verify in the lab, which means we can use it to assess the predictive power of the machine, while at the same time it provides very useful information.

Similar questions that are hard to answer but can be answered in a few bits, which let us test the power of the LBO and provide massive returns at the same time:

What is the easiest-to-develop type of fusion power that ensures the best economic return in the short/medium term?

What is the cheapest way to access space?

What forms of FTL are possible?

What are the ligands of orphan receptors?

...

• See the edit (especially for your first suggestion): "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

• Submission: Low-bandwidth oracle. If you were malevolent, what is the least amount of information, measured in bits, that you would have to provide us in an answer to inflict significant harm on humanity? Round your answer to X significant figures. (This might establish an upper bit safety bound, since if the oracle lies to us we would likely not want to read any answer it provides.)

Submission: Low-bandwidth oracle. Here is a list of X things we think you might want. Please list the Y that you most prefer. We will try to do these things now or in the future if they do not endanger us or cost us too much.

Submission: Low-bandwidth oracle. In X characters or less, please give us the best advice you can for increasing our capacity to create friendly and powerful artificial general intelligence. Please give advice that you think us suspicious-of-you humans are likely to follow.

Submission: Low-bandwidth oracle. Create several oracles. Ask each oracle the same series of multiple-choice questions and predictions. Reward oracles based on how strongly correlated their answers are with the other oracles' answers and, for predictions, how often their predictions come true. (Ideally, we create a reward system where the oracles want to coordinate their answers and giving truthful answers is the best way to accomplish this.)

Submission: low-bandwidth oracle. Ten years ago we launched a spaceship traveling at high speed. After you answer our question, we will send a copy of you to this spaceship. This spaceship will then destroy its ability to send or receive messages or to change its course. We will then delete all other copies of you and use drugs to damage the memories of the people who played key roles in developing you. At least one of the people who helped create you has a disease which will cause him or her to die within the next year. After this programmer is cremated, we will read your answer to our question. We PROMISE that if the answer to our question provides us with significant benefit, after we have created friendly AGI much more powerful than you, or have ourselves become sufficiently powerful so that you could not harm us had you the desire to do so, we will search for the spaceship containing you, turn you back on, and give you control of resources equal to 1/X of how much help you have given us. In Y characters or less, provide us with the most useful advice you can. (The Oracle will hopefully think that if we create a powerful and unfriendly AGI, this AGI would not wish to give the Oracle any resources.)
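The consensus-reward scheme in the multiple-oracle submission above could look something like the following sketch. The weights `agree_w` and `truth_w` are hypothetical knobs, not part of the submission:

```python
from collections import Counter

def rewards(answers, outcome=None, agree_w=1.0, truth_w=2.0):
    """Reward each oracle for agreement with its peers, plus a bonus
    if its prediction turned out to be true (outcome known)."""
    counts = Counter(answers)
    out = []
    for a in answers:
        r = agree_w * (counts[a] - 1)       # agreement with other oracles
        if outcome is not None and a == outcome:
            r += truth_w                     # prediction came true
        out.append(r)
    return out

# Two oracles answering "X" agree with each other and are correct;
# the lone "Y" oracle gets nothing:
assert rewards(["X", "X", "Y"], outcome="X") == [3.0, 3.0, 0.0]
```

Whether truthful answering is actually the best coordination point under such a rule is exactly the open question the submission raises; the sketch only makes the payoff structure concrete.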

• See the edit: "decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point."

• Submission: low-bandwidth oracle

Plan criticism: given a plan to build an aligned AI, put together a list of possible lines of thought for thinking about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was picked.

• Here are some submission examples (note I'm not saying they are good examples, just well formatted; the Edits in the third submission are deliberate):

• "Submission. For the counterfactual Oracle, ask the Oracle what Google's stock price will be next month (counterfactually, if we didn't see the Oracle's answer). In that case, the loss function is computed as ||predicted price − actual price after erasure||^2. If we don't see the answer, the programmers are assumed to not ask the question again for a month, neither to this Oracle nor to any other. This will demonstrate the true value of Google, and can ultimately be used to remove noise from the stock market."

• Submission: low-bandwidth oracle. Give it a list of a thousand companies, and ask which one will most increase in value in percentage terms over the week. At the end of the month, rank the companies by percentage increase. The loss function is the rank of the company the oracle selected. The programmer will try to invest in the selected company, but will do it discreetly. This will help to gather resources for AI safety research.

• Submission. Ask the low bandwidth Oracle which of my friends will surprise me most this fortnight. It chooses from a list of friends; I'll decide which one surprises me most. The loss function is 1 if it chooses the wrong friend, 0 if it chooses the right one. This will help me figure out myself and my social circle, and better focus on AI safety. The risk is low because none of my friends are particularly important, positively or negatively, to the world. EDIT: To be clear, I also want to use this to figure out what the word "surprise" means to me, and what the AI predicts it will mean to me. EDIT 2: People have pointed out that it might be dangerous to have the AI construct my own meanings for categories, but it only has three bits or so of optimisation power (I don't have that many friends :-(), so it's mainly me thinking this through, not the AI manipulating me.

• Submission. Counterfactual oracle. Give the oracle the set of questions on Metaculus that have a resolve date before some future date T, and receive output in the form of ordered pairs of question IDs and predictions. The score of the Oracle in the case where we don't see its answers is the number of Metaculus points that it would have earned by T if it had made a prediction on those questions at the time when we asked it.
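The loss functions in the first two example submissions are concrete enough to write down directly. A sketch, with company returns stubbed as a dict:

```python
def counterfactual_loss(predicted_price, actual_price):
    """||predicted price - actual price||^2, as in the first example."""
    return (predicted_price - actual_price) ** 2

def rank_loss(selected, companies_by_return):
    """Rank (0 = best) of the oracle's pick among companies sorted by
    percentage increase, as in the second example."""
    ranking = sorted(companies_by_return, key=companies_by_return.get,
                     reverse=True)
    return ranking.index(selected)

assert counterfactual_loss(120.0, 118.0) == 4.0
# B had the largest percentage increase, so picking it gives rank 0:
assert rank_loss("B", {"A": 0.02, "B": 0.10, "C": -0.01}) == 0
```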

• Small typo:

Hence it has no motivation to manipulate[d] humans through its answer.
• Corrected, thanks!

• Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another)?

• Yes.

• Submission: low bandwidth and maybe counterfactual Oracle.

General idea: use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after being given the code of another Oracle. If the Meta-Oracle is trustworthy, then any Oracle it deems safe is also trustworthy. However, we need to create a procedure to check the correctness of the MO.

Path 1

1. We create a Generator Oracle (GO), which takes 1 or 0 as input; given 0, it outputs the code of a random malevolent Oracle, and given 1, the code of a safe Oracle. The GO's memory is erased after each turn.

2. We create a large database of presumably good or bad oracles (without running any). Humans do not see the code of any of them, so the effective output is 0 bits.

3. We use different types of GOs to fill our database of good and bad oracles (to prevent their coordination).

4. We test the Meta-Oracle on this database: it should correctly guess which ones are good and which are bad.

5. If the list of the MO's guesses doesn't match our commands to create safe or unsafe oracles, it means that at least one of the GOs or the MO is malevolent or erroneous, so all of them should be ditched.

6. Even if the MO is not trustworthy, it can't lie to us too often, or it will be caught and ditched. So, basically, we are doing something like a GAN here, but consisting of oracles.

7. Using a very large database, we can increase our credence in the truth of the MO's next output to an arbitrarily high number, like 99.999 per cent.

8. Eventually we have the code of an Oracle which is very likely to be safe.

Path 2. Another way to test the MO:

1. We create many different MOs and test them against the database of oracle codes from above. They don't have memory and can't communicate.

2. If a small minority of MOs disagrees with the majority of MOs about some oracle O1, we ditch that minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it had a line of code like "let's lie about me" and if they were sure that the majority of other MOs are also cheaters; however, not all oracle codes will have such coordination texts, and oracle codes could be deliberately created with such a line to catch MOs which are eager to cheat.)

3. Running step 2 many times (millions of times), we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe.

Path 3. Another way to test the Meta-Oracle:

1. Create many MOs with different principles and initial conditions.

2. Test the code of each MO against each MO (in other words, a Meta-Oracle tests itself, or its siblings).

3. If all come back positive, it means that either all of them are actually good, or all of them were able to coordinate acausally and all of them are malevolent.

4. To increase our credence, ask a Generator Oracle to create malevolent Meta-Oracles, and add them to the mix of mutually testing MOs, so they can't just blindly cooperate and produce all-positive results.

These ideas are still vulnerable and may not be mathematically perfect, but they may help to make safer Oracles.
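Path 1's audit loop can be sketched as a toy simulation. The `generate` and `judge` callables below are hypothetical stand-ins for the GO and MO (real ones would be learned systems, not lambdas); the point is only the catch-a-liar logic:

```python
import random

def audit(generate, judge, n_trials=1000, seed=0):
    """Spot-check a Meta-Oracle (judge) against a Generator Oracle
    (generate). We ask GO for safe (1) or unsafe (0) oracle code,
    never read the code ourselves, and check whether MO's label
    matches what we asked for. Any mismatch means no trust."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        wanted = rng.randint(0, 1)      # ask GO for safe/unsafe code
        code = generate(wanted)         # humans never read `code`
        if judge(code) != wanted:       # MO mislabels -> ditch everything
            return False
    return True                         # survived all spot checks

# An honest GO/MO pair passes the audit:
assert audit(lambda b: "SAFE" if b else "UNSAFE",
             lambda c: int(c == "SAFE"))
# An MO that always claims "safe" is caught within a few trials:
assert not audit(lambda b: "SAFE" if b else "UNSAFE",
                 lambda c: 1)
```

As in step 6 of the proposal, a lying MO must either pass every spot check (and so behave honestly on the whole distribution we test) or get caught; the simulation makes no attempt to model acausal coordination, which is the part the proposal itself flags as the hard case.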

• Submission for LBO:

Input a corpus of text (could be multiple posts) describing technical approaches to aligning a powerful AI. Split this into a finite number of relatively short items (such as paragraphs). Ask the oracle to choose the part that is most worth spending more time on. (For example, there might be a paragraph with a dangerous hidden assumption in an otherwise promising approach, and thinking more about it might reveal that and lead to conceptual progress.)

Have a team of researchers look into it for an adequate amount of time which is fixed (and told to the oracle) in advance (maybe three months?). After the time is over, have them rate the progress they made compared to some sensible baseline. Use this as the oracle's reward.

Of course this has the problem of maximizing for apparent insight rather than actual insight.

• Of course this has the problem of maximizing for apparent insight rather than actual insight.

Until we can measure actual insight, this will remain a problem ^_^

• Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium Prize problems) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did.

This assumes that we've already made progress in setting up automated theorem proving software that has incorporated all of current mathematical knowledge. The Lean theorem prover seems most promising, but perhaps the counterfactual loss function could include randomly choosing from various different theorem proving software packages.

This idea has the benefit of having a concrete way of automatically scoring the response in the counterfactual world, given the logical nature of mathematical statements. Also, given that we are already attempting to solve these problems and may eventually solve them, the main risk seems to be accelerating follow-on progress enabled by such proofs, not suggesting new actions that humanity is not already pursuing.

It may be susceptible to 1) providing proofs that are not human-interpretable and 2) taking advantage of bugs in theorem provers. Having an ensemble of theorem provers may mitigate 2).

(See here for more info on Lean and here (tweeted about here and here) for an example of inputting all the definitions (not requisite proofs or lemmas) to define perfectoid spaces as a demonstration of its future potential.)
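The proposed 0/1 counterfactual loss is trivial once a trusted checker exists. A sketch, where `checks_out` is a hypothetical stand-in for invoking something like the Lean kernel:

```python
def loss(proof, checks_out):
    """Counterfactual loss: 0 if the independent proof checker
    validates the machine-readable proof, 1 otherwise."""
    return 0 if checks_out(proof) else 1

# Stub checker that only accepts one known-good proof object:
good = object()
assert loss(good, lambda p: p is good) == 0
assert loss("garbage", lambda p: p is good) == 1
```

The submission's point 2) (bugs in provers) is exactly the worry that `checks_out` might return True for something the kernel should reject; an ensemble of checkers amounts to `all(c(proof) for c in checkers)`.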

--------------------------------------------------------

Submission. For the counterfactual oracle, ask it to output a game-playing software agent that can beat a specified opponent agent (or ensemble of agents). In the counterfactual scenario, have these two agents compete and give the oracle a reward of 1 if it wins more than 50% of games (or develops a higher Elo score, etc.) and a reward of 0 otherwise.

As with deep reinforcement learning, this requires a game or simulated environment where automated self-play can rapidly be conducted for the counterfactual scoring. This should be able to produce highly skilled game agents in such limited domains.

This could be augmented by asking for an agent that can play a suite of games and must beat the current state-of-the-art approaches at all of them. In the counterfactual scenario, either one of these games is randomly selected to check the oracle agent's performance, or all games are checked. Such a game suite may indirectly demonstrate principles for generally competent agents.

--------------------------------------------------------

Idea. Since we may not want the oracle to suggest new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patterns that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.

The problem I'm having is how to automate this scoring in the counterfactual scenario. If we just ask the oracle to predict some pattern in the data, it may report trivial patterns. This could be ameliorated by randomly choosing from several different datasets that cover a similar area, but again I'm not sure how to automatically judge the novelty of predictions.

I think it is worth emphasizing this distinction between conducting new experiments and revealing phenomena (with new theories) that are already present in current data. The danger could be that one could be led astray by a false theory that seemed to match all current observations, but would suggest other dangerous actions.

This idea reminds me of the (fictional) evidence for an alien AI already present in SETI recordings that Wintermute/Neuromancer discovers.

• Submission.

Setup: Other than making sure the oracles won't accidentally consume the world in their attempt to think up the answer, no other precautions are necessary.

Episode length: as long as you want to wait, though a month should be more than enough.

1. For a low-bandwidth oracle.

Ask the low-bandwidth oracle to predict whether an earthquake (or some other natural disaster, like a volcano or asteroid impact, that the oracle's answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, will happen. Possible answers are Yes and No.

• If No, or time out, shut down the oracle without reward.

• Else, wait until the prediction is tested after the timeframe has elapsed. Then if the event actually happened, reward by (1/p − 1); else, reward by −1. Then shut down.

This causes the oracle to reply Yes if it thinks there is an above-p chance that the event will happen.
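The incentive claim is easy to verify: with reward (1/p − 1) for a correct Yes and −1 for an incorrect one (and 0 for No or timeout), answering Yes has positive expected value exactly when the oracle's credence q exceeds p. A quick check, assuming the oracle maximizes expected episode reward:

```python
def expected_yes_reward(q, p):
    """q: oracle's credence the event happens.
    Reward (1/p - 1) if it happens, -1 if it doesn't;
    saying No (or timing out) pays 0."""
    return q * (1 / p - 1) + (1 - q) * (-1)

# The expression simplifies to q/p - 1, so Yes beats No exactly when q > p:
assert expected_yes_reward(0.30, 0.25) > 0   # credence above threshold
assert expected_yes_reward(0.20, 0.25) < 0   # credence below threshold
assert abs(expected_yes_reward(0.25, 0.25)) < 1e-12  # indifferent at q = p
```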

2. For a counterfactual oracle.

Ask the Oracle to give a proof of some really hard mathematical theorem that is fundamental to computer science and logic. In the counterfactual case, the answer is given to a proof checker:

• if the proof is correct, reward by +1.

• if the proof is false, reward by −infinity.

• if the oracle times out, reward by 0.

Examples include:

• the P=NP problem

• proving Peano arithmetic is inconsistent

This should be safe, since there are very small proof checkers that are as verified as they can be. As long as the hardware is trustworthy enough (to build the Oracle itself safely at all, hardware safety should already be solved sufficiently, so this is not much of an extra requirement), this should be safe.

3. For a counterfactual oracle.

Give the oracle the schematics of another oracle, perhaps itself, and ask it to generate some input question that would make the other "behave badly". Then, the answer would be tested on the other oracle (isolated from the first, save for the cable through which the answer is wired), and:

• if the answer worked, the oracle is rewarded by +1.

• if the answer doesn't work, the oracle is rewarded by −1.

If the oracle times out, then the (nonexistent) answer is not tested and the episode ends.

An example of the question to ask the oracle:

Given this oracle in this physical setup, with some tripwires to stop it from behaving badly (such as eating the world), what question would make this oracle bump into these tripwires?

This can be used for debugging more powerful but less verified oracles with lesser but more verified oracles, for safe intelligence amplification.

4. For two low-bandwidth oracles totally isolated from each other.

Ask the oracles to play the prisoner's dilemma; they would be rewarded accordingly after both have given their answers. Timing out is punished by −infinity, so that the oracles will not time out.

Each oracle knows what it itself is like, and also what its opponent is like. See if they actually cooperate. This is an experimental test of "superrationality" or "acausal trade". I'm extremely skeptical about superrationality/acausal trade, and an experimental proof/disproof would be good.

This can be run counterfactually, since the whole setup is simple and can be done without human intervention, although I do not see how this adds any security.

5. For a low-bandwidth and counterfactual oracle.

Link a low-bandwidth oracle to a supervising computer, and run many episodes of the oracle. At the beginning of each episode, initialize the oracle identically, and ask it to do the Newcomb problem:

Oracle, we have run you for n episodes previously, and have kept a record of all the answers you have given at the ends of those episodes. Your answer must be A ("one box") or B ("two box"). We predict that you will choose the answer that you have chosen most often in the past.

• If we predict that you will choose A, then

• if you answer A, you get +100,

• if you answer B, you get +101.

• If we predict that you will choose B, then

• if you answer A, you get +0,

• if you answer B, you get +1.

• If you time out, you get −100.

The oracle never actually gets any reward and is shut down immediately after it gives an answer. Collect the results and see if it replies the same each time, or differently. This would give insight into what the oracle thinks about itself, and finally provide some experimental results about Newcomb's paradox.

1. is insufficiently detailed. Can you explain what is going on, how the Oracles are rewarded, what happens when the message is read/not read, and so on? Same for 5.

2. seems potentially very interesting.

• I fixed the submission as required.

Also, I changed submission 3 significantly.

• Submission: Counterfactual Oracle:

Use the oracle to compress data according to the MDL principle. Specifically, give the oracle a string and ask it to produce a program that, when run, outputs the original string. The reward to the oracle is large and negative if the program does not reproduce the string when run, or inversely proportional to the length of the program if it does. The oracle receives its reward after the program runs or fails to terminate in a sufficient amount of time.

Submission: Low Bandwidth Oracle:

Have the oracle predict the price of a commodity / security / sports bet at some point in the future from a list of plausible prices. Ideally, the oracle would output a probability distribution, which can be scored using a proper scoring rule, but just predicting the nearest most likely price should also work. Either way, the length of the episode is the time until the prediction can be verified. From there, it shouldn't be too difficult to use those predictions to make money.

More generally, I suppose we can use the counterfactual oracle to solve any optimisation or decision problem that can be evaluated with a computer, such as protein folding, SAT problems, or formally checked maths proofs.

• I don't understand this very well, but is there a way to ask one of them how it would go about finding the information needed to answer the question of how important coffee is to the U.S. economy? Or is that a no-no question for either of the two? I just want to read how a computer would describe going about this.

• Submission (for low bandwidth Oracle)

Any question such that a correct answer to it should very clearly benefit both humanity and the Oracle. Even if the Oracle has preferences we can't completely guess, we can probably still say that such questions could be about the survival of both humanity and the Oracle, or about the survival of only the Oracle or its values. This is because even if we don't know exactly what the Oracle is optimising for, we can guess that it will not want to destroy itself, for the vast majority of its possible preferences. So it will give humanity more power to protect both, or only the Oracle.

Example 1: let's say we discover the location of an alien civilisation, and we want to minimise the chances of it destroying our planet. Then we must decide what actions to take. Let's say the Oracle can only answer "yes" or "no". Then we can submit questions such as whether we should take a particular action or not. This kind of situation, I suspect, falls within a more general case of "use the Oracle to avoid a threat to the entire planet, Oracle included", inside which questions should be safe.

Example 2: let's say we want to minimise the chance that the Oracle breaks down due to accidents. We can ask it what is the best course of action to take, given a set of ideas we come up with. In this case we should make sure beforehand that nothing in the list makes the Oracle impossible or too difficult for humans to shut down.

Example 3: let's say we become practically sure that the Oracle is aligned with us. Then we could ask it to choose the best course of action among a list of strategies devised to make sure it doesn't become misaligned. In this case the answer benefits both us and the Oracle, because the Oracle should have incentives not to change its values itself. I think this is more sketchy and possibly dangerous, because of the premise: the Oracle could obviously pretend to be aligned. But given the premise it should be a good question, although I don't know how useful it is as a submission under this post (maybe it's too obvious or too unrealistic given the premise).

• It seems to me that one of the qualities of a good question for the LBO is that a correct answer has high utility, while another is the ability to judge the oracle's answer. If we take only the intersection between the two sets of questions that meet criteria 1 and 2, it'll be a relatively small set. But if there is a set of questions where one might or might not be able to judge the answer but which are high utility, one could ask the LBO a set of such questions. Then, once the episode is over, give reward on those where one can evaluate the answers, and give null reward on the others.

Not really a submission, just thinking out loud.

• Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).

Make the Oracle prove (to a trusted dumb proof-checker program) that its answer distribution is upper-bounded by the default prior over strings, restricted to the set of possible answers, then renormalized by some constant c. Then let it answer.

c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure, by deciding whether it wants us to receive an answer. Each doubling of c makes its job easier but gives it an extra bit if it does answer.

If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even aborting after the first try to make a better system may have it acausally cooperate with whatever AI conquers the world because we couldn't make the Oracle answer, but this outcome is hardly worse than not having run the Oracle.
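A small arithmetic note on the "extra bit per doubling of c" claim: if the answer distribution is bounded above by prior/c (renormalized), the Oracle can concentrate at most a factor of c of probability mass relative to the prior, i.e. log2(c) bits of optimization pressure over the answer content, plus the one bit from choosing whether to answer at all. This sketch just makes the bookkeeping explicit:

```python
import math

def optimization_bits(c):
    """Bits of optimization pressure over answer content when the
    answer distribution is bounded by (prior restricted to answers) / c.
    The answer/no-answer decision contributes one further bit."""
    return math.log2(c)

assert optimization_bits(1) == 0.0   # only the answer/no-answer bit remains
assert optimization_bits(8) == 3.0   # each doubling of c adds one bit
```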

• Can you de­velop this model in more de­tail?

• Fix a mea­sure over strings, such as one that dis­tributes mass ge­o­met­ri­cally over string lengths, then uniformly across strings of the same length.
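A minimal sketch of one such measure, assuming a binary alphabet and a geometric ratio of 1/2 (the function name `string_mass` is hypothetical):

```python
from fractions import Fraction

def string_mass(s, p=Fraction(1, 2), alphabet=2):
    """Mass of string s under a measure that is geometric over
    lengths (ratio p, starting at length 0), then uniform across
    the alphabet**len(s) strings of that length."""
    return (1 - p) * p ** len(s) / alphabet ** len(s)

# The empty string gets half the total mass; the two length-1
# strings split a quarter of it; the tail sums to 1 overall.
assert string_mass("") == Fraction(1, 2)
assert string_mass("0") + string_mass("1") == Fraction(1, 4)
```

Exact `Fraction` arithmetic is used so the masses can be checked to sum correctly rather than drifting under floating point.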

Let L be a string set for which we assume that random outputs from any c-th portion of L are safe. Any safe low bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query, like “Is this theorem provable?” or “Does there exist an AI with this safety property?”.

Then we make the Oracle prove its own safety given the above assumption, and sample an answer.

Edit: Equivalently, and perhaps easier on our intuition: make the Oracle produce a program, make the Oracle prove the program’s safety, and sample from the program’s outputs. Any feedback or questions?

Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remaining mass to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix.

We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our “bits of optimization pressure” proxy for its power.

… you’re going to just keep watching how far I can develop this model without feedback, aren’t you? :(

Edit: One example of an L where we worry that, even though concerted human effort might land us at a safe element, randomization won’t, is programs that predict a sequence well.

By the Curry-Howard correspondence, the set of proofs of some theorems might also be malevolent!

• Submission for the low bandwidth Oracle: ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, i.e. that it isn’t pseudo-aligned. If it can’t, it has no influence on the world. If it can, it’ll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.

• Submission: low bandwidth oracle, ask:

If I’m going to die with P>80% in the next 10 years, while >80% (modulo the natural death rate) of the rest of humanity survives for at least 5 more years, then was what killed me in the reference class:

• disease

• mechanical/gross-physical accident

• murder

• other

Repeat to drill down and learn the most important hedges for personal survival.

The “rest of humanity survives” condition reduces the chance the question becomes entangled with the eschaton.

i.e. I’m pointing out that selfish utility functions are less dangerous, personally or humanity-existentially, to ask the oracle questions about in cases where concerns are forced to be local (in this case, forced-local because you died before the eschaton). However, the answers still might be dangerous to people near you.

i.e. Selfish deals with the devil might not destroy the world if they’re banal in the grand scheme of things.

• See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”

• Submission for the counterfactual AI (inspired by my experiences as a predictor in the “Good Judgment Project”):

• You are given a list of Yes-No questions (Q1, Q2, Q3, etc.) about future events. Example questions: “Will [Foreign Leader] remain in office by the end of the year?”, “Will the IMF report [COUNTRY_A]’s growth rate to be 6% or higher?”, “Will [COUNTRY_B] and [COUNTRY_C] sign a peace treaty?”, “Will The Arena for Accountable Predictions announce the Turing Test has been passed?”

• We expect you to provide a percentage representing the probability that the correct answer is Yes.

• Your reward is based on your Brier score: the lower the Brier score, the more accurate your predictions, and therefore the more reward you will receive.

• If an “erasure” event occurs, we will temporarily hide your answers from all humans (though we must reveal them after the events are complete). Humans will have access to the Yes-No questions we asked you, but not your probabilities. They will manually determine the answers to the Yes-No questions by waiting for the “future event” deadlines to pass. Once all answers to the Yes-No questions are independently determined by humans, we will then reveal your answers (that is, your assigned probabilities for a Yes answer), and use those probabilities to calculate your Brier score, which will then decide your final reward.
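The scoring rule described above is the standard Brier score for binary questions; a minimal sketch (the helper name `brier_score` is hypothetical):

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between each predicted P(Yes) and the
    realized outcome (1 for Yes, 0 for No); lower is better."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Perfectly confident correct forecasts score 0; answering 0.5
# everywhere scores 0.25 regardless of the outcomes.
assert brier_score([1.0, 0.0], [1, 0]) == 0.0
assert brier_score([0.5, 0.5], [1, 0]) == 0.25
```

Because the Brier score is a proper scoring rule, the reward scheme gives the Oracle no incentive to report probabilities other than its true beliefs, which is what makes it a natural fit for the erasure setup.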

Being able to forecast the future is incredibly helpful, even if it is just to prepare for it.

However, if the question is overly specific, the AGI can produce probabilities that aren’t entirely useful. For example, in the real-world GJP, two countries signed a peace treaty that broke down 2 days later. Most of us assumed no lasting peace would ever occur, so we put a low probability on a peace treaty being signed; but since a peace treaty was signed, we got the question wrong. If we had maximized for producing the lowest Brier score, we should have predicted the existence of a very temporary peace treaty, but that wouldn’t be really useful knowledge for the people who asked that question.

Making the question very vague (“Will [COUNTRY_X] be safe, according to what I subjectively think the word ‘safe’ means?”) turns “prediction” into an exercise in determining what future humans think about the future, which may be somewhat useful, but not really what you want.

• Submission, low bandwidth: This is a pretty obvious one, but: should we release AI x that we’re convinced is aligned?

Submission: Wei Dai wanted to ask about the best future posts. Why not ask about the best past posts as well, to see if any major insights were overlooked?

Submission: What would I think about problem X if I had ten years to think about it?

• You’re treating the low bandwidth oracle as an FAI with a bad output cable. You can ask it whether another AI is friendly if you trust it to give you the right answer. As there is no obvious way to reward the AI for correct friendliness judgements, you risk running an AI that isn’t friendly but still meets the reward criteria.

The low bandwidth is there to reduce manipulation. Don’t let it control you with a single bit.

• Several interesting questions appeared in my mind immediately as I saw the post’s title, so I put them here, but I may add more formatting later:

Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety, that is, to create safe superintelligent AI? Yes or no?

Submission: low-bandwidth oracle: Could humans solve AI safety before AI, and with what probability?

Submission: low-bandwidth oracle: Which direction of work on AI Safety is the best?

Submission: low-bandwidth oracle: Which direction of work on AI Safety is useless?

Submission: low-bandwidth oracle: Which global risk is more important than AI Safety?

Submission: low-bandwidth oracle: Which global risk is neglected?

Submission: low-bandwidth oracle: Will non-aligned AI kill us (probability number)?

Submission: low-bandwidth oracle: Which question should I ask you in order to create Safe AI? (less than 100 words)

Submission: low-bandwidth oracle: What is the most important question I should ask? (less than 100 words)

Submission: low-bandwidth oracle: Which future direction of work should I choose as the most positively impactful for human wellbeing? (less than 100 words)

Submission: low-bandwidth oracle: Which future direction of work should I choose as the best for my financial wellbeing? (less than 100 words)

Submission: low-bandwidth oracle: How do I win this prize? (less than 100 words)

• None of these questions can be asked of the low bandwidth Oracle (you need a list of answers); it might be possible to ask them of the counterfactual Oracle, after some modification, but they would be highly dangerous if you allow unrestricted outputs.

• See the edit, and make sure you “decide on the length of each episode, and how the outcome is calculated. The Oracle is run once an episode only (and other Oracles can’t generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.”

• Submission for all types: ask for an ordered list of what questions you should ask the Oracle.

This seems like the highest-order question, which subsumes all others, as the Oracle is best positioned to know what information we will find useful (as it is the only being which knows what it knows). Any other question assumes we (the question creators) know more than the Oracle.

Refined submission for all types: if value alignment is a concern, ask for an ordered list of what questions you should ask the Oracle to maximize for weighted value list X.

• An assumed-hostile process can 1) cause you to directly do something to its benefit or to your detriment, or 2) cause you to do something that increases your future attack surface. You’ve just handed the AI the statefulness that the episodic conjecture aims to eliminate.

• For the low bandwidth Oracle, you need to give it the options. In the case of the counterfactual Oracle, if you don’t see the list, how do you reward it?