# AlexMennen (Alex Mennen)

Karma: 3,421
• Let α be the least countable ordinal such that there is no polynomial-time computable recursive well-ordering of length α.

Proof: Let $\prec$ be any computable well-ordering of $\mathbb{N}$. Let $t(n,m)$ be the number of steps it takes to compute whether or not $n \prec m$. Let $f(n) := \max\{t(i,j) : i,j \leq n\}$ (notice I'm using the standard ordering on $\mathbb{N}$, so this is the maximum of a finite set, and is thus well-defined). $f$ is computable in $\mathrm{poly}(n, f(n))$ time. Let $p$ be a bijective pairing function $\mathbb{N}^2 \to \mathbb{N}$ such that both the pairing function and its inverse are computable in polynomial time. Now let $\sqsubset$ be the well-ordering of $\mathbb{N}$ given by $p(n, f(n)) \sqsubset p(m, f(m))$ iff $n \prec m$, $x \sqsubset p(n, f(n))$ if $x$ is not $p(m, f(m))$ for any $m$, and $x \sqsubset y$ iff $x < y$ if neither $x$ nor $y$ is of the form $p(n, f(n))$ for any $n$. Then $\sqsubset$ is computable in polynomial time, and the order type of $\sqsubset$ is $\omega$ plus the order type of $\prec$, which is just the same as the order type of $\prec$ if that order type is at least $\omega^2$.
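
A minimal sketch of this construction in code (mine, with hypothetical helpers: `prec(n, m)` decides the given computable well-ordering $\prec$, and `f(n)` is the step-count bound defined above); the Cantor pairing function plays the role of $p$:

```python
from math import isqrt

def cantor_pair(a, b):
    # bijection N x N -> N, computable in polynomial time in both directions
    return (a + b) * (a + b + 1) // 2 + b

def cantor_unpair(z):
    # exact inverse of cantor_pair, using integer square root
    w = (isqrt(8 * z + 1) - 1) // 2
    b = z - w * (w + 1) // 2
    return w - b, b

def below(x, y, prec, f):
    # the well-ordering from the proof: is x strictly below y?
    def classify(z):
        n, k = cantor_unpair(z)
        return n, k == f(n)
    (nx, px), (ny, py) = classify(x), classify(y)
    if px and py:    # both of the form p(n, f(n)): defer to the original ordering
        return prec(nx, ny)
    if px != py:     # everything else forms an initial segment of order type omega...
        return not px
    return x < y     # ...ordered by the standard ordering
```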

The fact that you gave the answer for $\alpha$ that you did makes me suspect you were thinking of the least countable ordinal such that there is no recursive well-ordering of that length that can be proven to be a recursive well-ordering in a natural theory of arithmetic such that, for every computable function, there's a program computing that function that the given theory can prove is total iff there's a program computing that function in polynomial time.

• This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though?

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of $\mathbb{R}$ (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

• In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function $x \mapsto x^2$ or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

• I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

• It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in people's heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (i.e. the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of a utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

• Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today.

No, you can't do that today. You could produce a contraption that will deposit 1 BTC into a certain bitcoin wallet if and when some computer program halts, but this won't do the wallet's owner much good if they die before the program halts. If you reflect on what it means to award someone a utilon, rather than a bitcoin, I maintain that it isn't obvious that this is even possible in theory.

Why in the world would one expect a utility function over an uncountable domain to be computable?

There is a notion of computability in the continuous setting.

As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function a great deal easier.

This seems like a strawman to me. A better motivation would be that agents that actually exist are computable, and a utility function is determined by judgements rendered by the agent, which is incapable of thinking uncomputable thoughts.

• I think we're going to have to back up a bit. Call the space of outcomes $O$ and the space of Turing machines $M$. It sounds like you're talking about two functions, $f : M \to O$ and $U : O \to \mathbb{R}$. I was thinking of $U$ as the utility function we were talking about, but it seems you were thinking of the composition $U \circ f$.

You suggested $U$ should be computable but $U \circ f$ should not be. It seems to me that $f$ should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that $f$ is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that $f$ is continuous.

In any case, a slight generalization of Rice's theorem tells us that any computable function from Turing machines to reals that depends only on the machine's semantics must be constant, so I suppose I'm forced to agree that, if we want a utility function that is defined on all Turing machines and depends only on their semantics, then at least one of $U$ or $f$ must be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.
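
For completeness, here is a hedged sketch of why the Rice-style claim holds (my gloss; not spelled out in the original). Suppose $V := U \circ f$ is computable, depends only on semantics, and is non-constant, and let $T_0$ be a machine that diverges on every input, with $V(T_1) \neq V(T_0)$ for some $T_1$. Given a machine $T$ and input $x$, define $T'$ by: on input $y$, run $T$ on $x$, and if that halts, simulate $T_1$ on $y$. Then

$$T' \sim T_1 \ \text{ if } T \text{ halts on } x, \qquad T' \sim T_0 \ \text{ otherwise},$$

so approximating $V(T')$ to within $\tfrac{1}{2}\,|V(T_1) - V(T_0)|$ would decide the halting problem.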

• I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function $U$ enumerable if there's a program that takes $x$ as input and enumerates the rationals that are less than $U(x)$, but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you give suggests that limit-computability is enough (i.e. there's a program that takes $x$ and produces a sequence of rationals that converges to $U(x)$, with no guarantees on the rate of convergence). Though I don't agree that it's obvious we should accept such utility functions as valid.
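
As an illustration (my sketch, not from the original thread): the utility "1 utilon iff $T$ eventually halts" is limit-computable in exactly this sense, via approximations that simulate $T$ for longer and longer. Here `run_one_step` is a hypothetical single-step simulator that returns `None` once the machine has halted:

```python
def limit_approximations(run_one_step, initial_state):
    """Yield rationals converging to U(T), where U(T) = 1 if T halts, else 0.
    Every individual approximation is computable, and the sequence converges,
    but there is no computable bound on the rate of convergence."""
    state, halted = initial_state, False
    while True:
        if not halted:
            state = run_one_step(state)  # advance the simulation by one step
            halted = state is None       # convention: None signals halting
        yield 1 if halted else 0
```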

• we need not assume there are "worlds" at all. … In mathematics, it brings to mind pointless topology.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function $f : X \to Y$ of locales is a function from the lattice of open sets of $Y$ to the lattice of open sets of $X$. So a similar thing here would be to treat a utility function as a function from some lattice of subsets of $\mathbb{R}$ (the Borel subsets, for instance) to the lattice of events.
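
To make that concrete (my gloss): a pointful utility function $u : W \to \mathbb{R}$ on a set of worlds $W$ induces such a map by taking preimages,

$$u^* : \mathcal{B}(\mathbb{R}) \to \mathcal{E}, \qquad u^*(B) = \{\, w \in W : u(w) \in B \,\},$$

i.e. $u^*(B)$ is the event "utility lands in $B$"; the pointless version would keep the lattice map $u^*$ and discard the worlds.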

My understanding of the Jeffrey-Bolker framework is that its primary difference from the VNM framework is not its pointlessness, but the fact that it comes with a prior probability distribution over outcomes, which can only be updated by conditioning on events (i.e. updating on evidence that has probability 1 in some worlds and probability 0 in the rest). VNM does not start out with a prior, and allows any probability distribution over outcomes to be compared to any other, whereas Jeffrey-Bolker only allows comparison of probability distributions obtained by conditioning the prior on an event. Of course, this interpretation requires a fair amount of reading between the lines, since the Jeffrey-Bolker axioms make no explicit mention of any probability distribution, but I don't see any other reasonable way to interpret them, since if asked which of two events is better, I will often be unable to answer without further information, since the events may contain worlds of widely varying utility. Associating an event with a fixed prior conditioned on the event gives me this additional information needed to answer the question, and I don't see how anything else could work. Starting with a prior that gets conditioned on events that correspond to the agent's actions seems to build in evidential decision theory as an assumption, which makes me suspicious of it.

In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist.

This can be resolved by defining worlds to be minimal non-zero elements of the completion of the Boolean algebra of events, rather than minimal non-zero events. This is what you seemed to be implicitly doing later with the infinite bitstrings example, where the events were clopen subsets of Cantor space (i.e. sets of infinite bitstrings such that membership in the set only depends on finitely many bits), and this Boolean algebra has no minimal non-zero elements (maximally-specific events), but the minimal non-zero elements of its completion correspond to infinite bitstrings, as desired.
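
Spelling this out a bit (my notation): write $[s]$ for the event that the bitstring extends the finite string $s$. Every event splits, e.g.

$$[s] = [s0] \vee [s1],$$

so there are no minimal non-zero events. But each infinite bitstring $x$ survives as

$$w_x = \bigwedge_{n} [x_1 x_2 \ldots x_n]$$

in the completion, where such infinite meets exist, and these are the minimal non-zero elements corresponding to worlds.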

• I guess what I was trying to say is (although I think I've partially figured out what you meant; see next paragraph): cultural evolution is a process that acquires adaptations slowly-ish and transmits previously-acquired adaptations to new organisms quickly, while biological evolution is a process that acquires adaptations very slowly and transmits previously-acquired adaptations to new organisms quickly. You seem to be comparing the rate at which cultural evolution acquires adaptations to the rate at which biological evolution transmits previously-acquired adaptations to new organisms, and concluding that cultural evolution is slower.

Re-reading the part of your post where you talked about AI takeoff speeds, you argue (which I hadn't understood before) that the rise of humans was fast on an evolutionary timescale, and slow on a cultural timescale, so that if it was due to an evolutionary change, it must involve a small change that had a large effect on capabilities, so that a large change will occur very suddenly if we mimic evolution quickly, while if it was due to a cultural change, it was probably a large change, so mimicking culture quickly won't produce a large effect on capabilities unless it is extremely quick.

This clarifies things, but I don't agree with the claim. I think slow changes in the intelligence of a species are compatible with fast changes in its capabilities even if the changes are mainly in raw innovative ability rather than cultural learning. Innovations can increase the ability to innovate, causing a positive feedback loop. A species could have high enough cultural learning ability for innovations to be transmitted over many generations without having the innovative ability to ever get the innovations that will kick off this loop. Then, when they start slowly gaining innovative ability, the innovations accumulated into cultural knowledge gradually increase, until they reach the feedback loop and the rate of innovation becomes more determined by changes in pre-existing innovations than by changes in raw innovative ability. There don't even have to be any evolutionary changes in the period in which the innovation rate starts to get dramatic.

If you don't buy this story, then it's not clear why the changes being in cultural learning ability rather than in raw innovative ability would remove the need for a discontinuity. After all, our cultural learning ability went from not giving us much advantage over other animals to "accumulating decisive technological dominance in an evolutionary eyeblink" in an evolutionary eyeblink (quotation marks added for ease of parsing). Does this mean our ability to learn from culture must have greatly increased from a small change? You argue in the post that there's no clear candidate for what such a discontinuity in cultural learning ability could look like, but this seems just as true to me for raw innovative ability.

Perhaps you could argue that it doesn't matter if there's a sharp discontinuity in cultural learning ability because you can't learn from a culture faster than the culture learns things to teach you. In this case, yes, perhaps I would say that AI-driven culture could make advancements that look discontinuous on a human scale. Though I'm not entirely sure what that would look like, and I admit it does sound kind of soft-takeoffy.

• The abilities we obtained from architectural changes to our brains also came from a slow, accumulated process, taking even longer than cultural evolution does.

• There's more than one thing that you could mean by raw innovative capacity separate from cultural processing ability. First, you could mean someone's ability to innovate on their own without any direct help from others on the task at hand, but where they're allowed to use skills that they previously acquired from their culture. Second, you could mean someone's counterfactual ability to innovate on their own if they hadn't learned from culture. You seem to be conflating these somewhat, though mostly focusing on the second?

The second is underspecified, as you'd need to decide what counterfactual upbringing you're assuming. If you compare the cognitive performance of a human raised by bears to the cognitive performance of a bear in the same circumstances, this is unfair to the human, since the bear is raised in circumstances that it is adapted for and the human is not, just like comparing the cognitive performance of a bear raised by humans to that of a human in the same circumstances would be unfair to the bear. Though a human raised by non-humans would still make a more interesting comparison to non-human animals than Genie would, since Genie's environment is even less conducive to human development (I bet most animals wouldn't cognitively develop very well if they were kept immobilized in a locked room until maturity either).

I think this makes the second notion less interesting than the first, as there's a somewhat arbitrary dependence on the counterfactual environment. I guess the second notion is more relevant when trying to reason specifically about genetics as opposed to other factors that influence traits, but the first seems more relevant in other contexts, since it usually doesn't matter to what extent someone's abilities were determined by genetics or environmental factors.

I didn't really follow your argument for the relevance of this question to AI development. Why should raw innovation ability be more susceptible to discontinuous jumps than cultural processing ability? Until I understand the supposed relevance to AI better, it's hard for me to say which of the two notions is more relevant for this purpose.

I'd be very surprised if any existing non-human animals are ahead of humans by the first notion, and there's a clear reason in this case why performance would correlate strongly with social learning ability: social learning will have helped people in the past develop skills that they keep in the present. Even for the second notion, though it's a bit hard to say without pinning down the counterfactual more closely, I'd still expect humans to outperform all other animals in some reasonable compromise environment that helps both develop but doesn't involve them being taught things that the non-humans can't follow. I think there are still reasons to expect social learning ability and raw innovative capability to be correlated even in this sense, because higher general intelligence will help for both; original discovery and understanding things that are taught to you by others both require some of the same cognitive tools.

• Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far.)

The relevant definitions:

A fuzzy belief over a set $X$ is a concave function $\phi : \Delta X \to [0,1]$ such that $\sup \phi = 1$ (where $\Delta X$ is the space of probability distributions on $X$). Fuzzy beliefs are partially ordered by: $\phi \sqsubseteq \psi$ iff $\phi(\mu) \geq \psi(\mu)$ for all $\mu$. The inequalities reverse because we want to think of "more specific"/"less fuzzy" beliefs as "greater", and these are the functions with lower values; the most specific/least fuzzy beliefs are ordinary probability distributions, which are represented as the concave hull of the function assigning 1 to that probability distribution and 0 to all others; these should be the maximal fuzzy beliefs. Note that, because of the order-reversal, the supremum of a set of functions refers to their pointwise infimum.
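
A toy example (mine, for concreteness): take $X = \{0,1\}$, so that $\Delta X \cong [0,1]$ via $\mu \mapsto \mu(1)$. Then $\phi(t) = 1 - |t - \tfrac{1}{2}|$ is a fuzzy belief peaked at the uniform distribution, while the ordinary distribution with $\mu(1) = t_0$ corresponds to the tent function

$$\phi_{t_0}(t) = \begin{cases} t/t_0 & t \leq t_0 \\ (1-t)/(1-t_0) & t \geq t_0, \end{cases}$$

the concave hull of the indicator of $t_0$. Any concave $\psi \leq \phi_{t_0}$ pointwise with $\sup \psi = 1$ must peak at $t_0$ and hence equal $\phi_{t_0}$, so these are indeed maximal.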

A DCPO (directed-complete partial order) is a partial order in which every directed subset has a supremum.

In a DCPO, define $x \ll y$ to mean that for every directed set $D$ with $\sup D \geq y$, there is some $d \in D$ such that $d \geq x$. A DCPO is continuous if, for every $y$, $y = \sup \{x : x \ll y\}$.

Lemma: Fuzzy beliefs form a DCPO.

Proof sketch: Given a directed set $D$ of fuzzy beliefs, $\sup D = \inf_{\phi \in D} \phi$ is concave, and $\{\mu : (\inf_{\phi \in D} \phi)(\mu) = 1\} = \bigcap_{\phi \in D} \{\mu : \phi(\mu) = 1\}$. Each of the sets in that intersection is non-empty, hence so are finite intersections of them since $D$ is directed, and hence so is the whole intersection since $\Delta X$ is compact.

Lemma: $\phi \ll \psi$ iff $\{\mu : \psi(\mu) = 1\}$ is contained in the interior of $\{\mu : \phi(\mu) = 1\}$ and, for every $\mu$ such that $\psi(\mu) < 1$, $\phi(\mu) > \psi(\mu)$.

Proof sketch: If $\sup D \sqsupseteq \psi$, then $\bigcap_{\theta \in D} \{\mu : \theta(\mu) = 1\} \subseteq \operatorname{int}\{\mu : \phi(\mu) = 1\}$, so by compactness of $\Delta X$ and directedness of $D$, there should be $\theta \in D$ such that $\{\mu : \theta(\mu) = 1\} \subseteq \operatorname{int}\{\mu : \phi(\mu) = 1\}$. Similarly, for each $\mu$ such that $\psi(\mu) < 1$, there should be $\theta \in D$ such that $\theta(\mu) < \phi(\mu)$. By compactness, there should be some finite subset of $D$ such that any upper bound for all of them is at least $\phi$.

Lemma: $\psi = \sup \{\phi : \phi \ll \psi\}$.

Proof: clear?

# AlexMennen's Shortform

8 Dec 2019 4:51 UTC
7 points
• The part about derivatives might have seemed a little odd. After all, you might think, $\mathbb{Z}$ is a discrete set, so what does it mean to take derivatives of functions on it? One answer to this is to just differentiate symbolically using polynomial differentiation rules. But I think a better answer is to remember that we're using a different metric than usual, and $\mathbb{Z}$ isn't discrete at all! Indeed, for any number $x$, $\lim_{n \to \infty} (x + p^n) = x$, so no points are isolated, and we can define differentiation of functions on $\mathbb{Z}$ in exactly the usual way with limits.
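
For instance (a quick sanity check, not in the original): computing the derivative of $x^2$ from the limit definition, with increments $h = p^n \to 0$ in the p-adic metric:

$$\frac{(x + p^n)^2 - x^2}{p^n} = 2x + p^n \longrightarrow 2x \quad \text{as } n \to \infty,$$

agreeing with the symbolic differentiation rule.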

• The theorem: where $k$ is relatively prime to an odd prime $p$ and $0 \leq a < n$, $p^a k$ is a square mod $p^n$ iff $k$ is a square mod $p$ and $a$ is even.

The real meat of the theorem is the case $a = 0$ (i.e. a square mod $p$ that isn't a multiple of $p$ is also a square mod $p^n$). Deriving the general case from there should be fairly straightforward, so let's focus on this special case.

Why is it true? This question has a surprising answer: Newton's method for finding roots of functions. Specifically, we want to find a root of $f(x) = x^2 - k$, except in $\mathbb{Z}/p^n\mathbb{Z}$ instead of $\mathbb{R}$.

To adapt Newton's method to work in this situation, we'll need the p-adic absolute value on $\mathbb{Z}$: $|p^a k|_p = p^{-a}$ for $k$ relatively prime to $p$ (and $|0|_p = 0$). This has lots of properties that you should expect of an "absolute value": it's positive ($|x|_p \geq 0$, with $|x|_p = 0$ only when $x = 0$), multiplicative ($|xy|_p = |x|_p |y|_p$), symmetric ($|-x|_p = |x|_p$), and satisfies a triangle inequality ($|x+y|_p \leq |x|_p + |y|_p$; in fact, we get more in this case: $|x+y|_p \leq \max(|x|_p, |y|_p)$). Because of positivity, symmetry, and the triangle inequality, the p-adic absolute value induces a metric (in fact, ultrametric, because of the strong version of the triangle inequality) $d(x,y) = |x - y|_p$. To visualize this distance function, draw giant circles, and sort integers into circles based on their value mod $p$. Then draw smaller circles inside each of those giant circles, and sort the integers in the big circle into the smaller circles based on their value mod $p^2$. Then draw even smaller circles inside each of those, and sort based on value mod $p^3$, and so on. The distance between two numbers corresponds to the size of the smallest circle encompassing both of them. Note that, in this metric, $p^n$ converges to $0$.
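
Here is that absolute value in code (a small sketch of mine, handy for experimenting with the iteration below):

```python
def padic_abs(x, p):
    # |x|_p = p^(-a), where p^a is the largest power of p dividing x
    if x == 0:
        return 0.0
    a = 0
    while x % p == 0:
        x //= p
        a += 1
    return p ** -a

# p^n converges to 0 in this metric: |p^n - 0|_p = p^(-n)
assert padic_abs(7 ** 5, 7) == 7.0 ** -5
```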

Now on to Newton's method: if $k$ is a square mod $p$, let $x_0$ be one of its square roots mod $p$. Then $|f(x_0)|_p = |x_0^2 - k|_p \leq p^{-1}$; that is, $x_0$ is somewhat close to being a root of $f$ with respect to the p-adic absolute value. Also, $f'(x_0) = 2x_0$ is relatively prime to $p$, so $|f'(x_0)|_p = 1$; that is, $f$ is steep near $x_0$. This is good, because starting close to a root and the slope of the function being steep enough are things that help Newton's method converge; in general, it might bounce around chaotically instead. Specifically, it turns out that, in this case, $|f(x_0)|_p < |f'(x_0)|_p^2$ is exactly the right sense of being close enough to a root with steep enough slope for Newton's method to work.

Now, Newton's method says that, from $x_i$, you should go to $x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$. $f'(x_i) = 2x_i$ is invertible mod $p^n$, so we can do this. Now here's the kicker: $f(x_{i+1}) = \left(\frac{f(x_i)}{f'(x_i)}\right)^2$, so $|f(x_{i+1})|_p \leq |f(x_i)|_p^2$. That is, $x_{i+1}$ is closer to being a root of $f$ than $x_i$ is. Now we can just iterate this process until we reach an $x_i$ with $|f(x_i)|_p \leq p^{-n}$, and we've found our square root of $k$ mod $p^n$.
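
A runnable sketch of the whole iteration (my code; it assumes $p$ is an odd prime, $k$ is relatively prime to $p$, and $k$ actually has a square root mod $p$):

```python
def sqrt_mod_prime_power(k, p, n):
    """Find x with x^2 = k (mod p^n) via Newton's method (Hensel lifting)."""
    # brute-force a starting point: a square root of k mod p
    x = next(x for x in range(p) if (x * x - k) % p == 0)
    mod = p
    while mod < p ** n:
        mod = mod ** 2                      # each step squares |f(x)|_p, doubling precision
        inv = pow(2 * x, -1, mod)           # f'(x) = 2x is invertible mod p^m
        x = (x - (x * x - k) * inv) % mod   # Newton step: x <- x - f(x)/f'(x)
    return x % (p ** n)

# example: a square root of 2 mod 7^3
assert sqrt_mod_prime_power(2, 7, 3) ** 2 % 7 ** 3 == 2
```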

Exercise: Do the same thing with cube roots. Then with roots of arbitrary polynomials.

• The impressive part is getting reinforcement learning to work at all in such a vast state space

It seems to me that that is AGI progress? The real world has an even vaster state space, after all. Getting things to work in vast state spaces is a necessary precondition to AGI.

• Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then use preferences over lotteries involving objective probabilities and preferences over related lotteries involving other sources of uncertainty to determine what probability the agent must assign for those other sources of uncertainty.

Re: the difference between VNM and Bayesian expected utility maximization, I take it from the word "Bayesian" that the way you're supposed to choose between actions does involve first coming up with probabilities of each outcome resulting from each action, and from "expected utility maximization", that these probabilities are to be used in exactly the way the VNM theorem says they should be. Since the VNM theorem does not make any assumptions about where the probabilities came from, these still sound essentially the same, except with Bayesian expected utility maximization being framed to emphasize that you have to get the probabilities somehow first.

• I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents have preferences over "lotteries" with known, objective probabilities of each outcome. The probabilities are assumed to be objectively known from the start. The Bayesian coherence theorems do not assume probabilities from the start; they derive probabilities from the coherence criteria, and those probabilities are specific to the agent.

One can construct lotteries with probabilities that are pretty well understood (e.g. flipping coins that we have accumulated a lot of evidence are fair), and you can restrict attention to lotteries only involving uncertainty coming from such sources. One may then get probabilities for other, less well-understood sources of uncertainty by comparing preferences involving such uncertainty to preferences involving easy-to-quantify uncertainty (e.g. if A is preferred to B, and you're indifferent between 60%A+40%B and "A if X, B if not-X", then you assign probability 60% to X). Perhaps not quite as philosophically satisfying as deriving probabilities from scratch, but this doesn't seem like a fatal flaw in VNM to me.
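
The arithmetic behind that elicitation (spelled out by me): indifference between the mixed lottery and the conditional one means their expected utilities match, and since the first lottery's probabilities are objective,

$$P(X)\,u(A) + (1 - P(X))\,u(B) = 0.6\,u(A) + 0.4\,u(B) \;\implies\; P(X) = 0.6,$$

provided $u(A) \neq u(B)$.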

I do not expect agent-like systems in the wild to be pushed toward VNM expected utility maximization. I expect them to be pushed toward Bayesian expected utility maximization.

I understood those as being synonyms. What's the difference?