# FactorialCode

Karma: 169
• then what happens if the ‘donor’ doesn’t choose to convert it to a donation?

Same thing that happens when you fail to meet the requirements for any other financial instrument. You go into debt, your credit rating takes a plunge, and debt collectors will start harassing you for money.

• I’ll take a crack at this.

To a first order approximation, something is a “big deal” to an agent if it causes a “large” swing in its expected utility.
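Here’s a toy numerical reading of what I mean (all the numbers are made up for illustration): the “impact” of an event is the absolute shift in the agent’s expected utility when it updates its beliefs on that event.

```python
def expected_utility(probs, utilities):
    """Expected utility over a discrete set of possible outcomes."""
    return sum(p * u for p, u in zip(probs, utilities))

utilities = [0.0, 50.0, 100.0]   # utility the agent assigns to each outcome

prior = [0.3, 0.4, 0.3]          # beliefs before observing the event
posterior = [0.8, 0.15, 0.05]    # beliefs after observing the event

# A "big deal" event is one that produces a large swing like this one.
impact = abs(expected_utility(posterior, utilities)
             - expected_utility(prior, utilities))
print(impact)
```

The same event can be a small deal to a different agent whose posterior barely moves, which is the sense in which impact is subjective.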

• Inspired by the recent post on impact measures, I thought of an example illustrating the subjective nature of impact.

Consider taking the action of simultaneously collapsing all the stars except our sun into black holes. (Suppose you can somehow do this without generating supernovas.)

To me, this seems like a highly impactful event, potentially vastly curtailing the future potential of humanity.

But to an 11th century peasant, all this would mean is that the stars in the night sky would slowly go out over the course of millennia. Which would have very little impact on the peasant’s life.

• I find having a skateboard is a compact way to shave minutes off of the sections of my commute where I would otherwise have to walk. It turns a 15 minute walk to the bus stop into a 5 minute ride, which adds up in the long run.

• On the other end, when writing, I feel that recursively expanding upon your ideas to explain them and back them up is a skill that needs to be learned and practiced.

When I come up with an idea, I suspect that I do so with whatever abstractions and ideas my brain has on hand, but those are probably not the same as those of the target audience. When I start writing, I’ll end up with a ~1-2 sentence summary that I feel captures what I’m trying to get across. Then I need to make a conscious effort to unpack each of those component ideas and back them up with reasoning/examples to support my claims. This gets harder as I further unpack statements, because I’m more inclined to take those claims for granted. I suspect that this gets easier with practice, and that I’ll be able to write progressively more detailed posts as time goes on.

Does anyone else feel that this is a bottleneck on their ability to explain things?

• Due to the vast data requirements, most of the environments would have to be simulated. I suspect that this will make the agenda harder than it may seem at first glance—I think that the complexity of the real world was quite crucial, and that simulating environments that reach the appropriate level of complexity will be a very difficult task.

I’m skeptical of this. I think that it’s well within our capabilities to create a virtual environment with a degree of complexity comparable to the ancestral environment. For instance, the cost of developing Minecraft, with all of its complexity, can be upper bounded by the cost of paying ~25 developers over the course of 10 years. But the core features of the game, Minecraft alpha, were built by a single person in his spare time over 2 years.

I think a smallish competent team with a 10-100 million dollar budget could easily throw together a virtual environment with ample complexity, possibly including developing FPGAs or ASICs to run it at the required speed.

• I found a YouTube channel that has been providing commentary on suspected games of AlphaStar on the ladder. They’re presented from a layman’s perspective, but they might be valuable for people to get an idea of what the current AI is capable of.

• Trying to create an FAI from alchemical components is obviously not the best idea. But it’s not totally clear how much of a risk these components pose, because if the components don’t work reliably, an AGI built from them may not work well enough to pose a threat.

I think that using alchemical components in a possible FAI can lead to a serious risk if the people developing it aren’t sufficiently safety conscious. Suppose that either implicitly or explicitly, the AGI is structured using alchemical components as follows:

1. A module for forming beliefs about the world.

2. A module for planning possible actions or policies.

3. A utility or reward function.

In the process of building an AGI by alchemical means, all of the above will be incrementally improved to a point where they are “good enough”. The AI forms accurate beliefs about the world, and makes plans to get things that the researchers want. However, in a setup like this, all of the classic AI safety concerns come into play. Namely, the AI has an incentive to upgrade the first two modules and preserve the utility function. Since the utility function is only “good enough”, this becomes the classic setup for Goodhart and we get UFAI.
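To illustrate the failure mode with toy numbers of my own invention: a “good enough” proxy utility can agree with the true objective across a small search space, yet a stronger optimizer pointed at the same proxy destroys true value.

```python
# Goodhart sketch: the proxy only captures the benefit term of the true
# objective, so it tracks the true objective under weak optimization but
# diverges from it under strong optimization. (Toy functions, not from
# any real system.)

def true_utility(a):
    # What we actually want: benefits with a growing cost term.
    return a - a**2 / 10

def proxy_utility(a):
    # The "good enough" learned reward: only sees the benefit term.
    return a

def optimize(utility, search_space):
    # A brute-force optimizer; its strength is the size of its search space.
    return max(search_space, key=utility)

weak = optimize(proxy_utility, range(0, 7))      # weak optimizer
strong = optimize(proxy_utility, range(0, 101))  # stronger optimizer, same proxy

print(weak, true_utility(weak))      # modest action, decent true utility
print(strong, true_utility(strong))  # proxy maximized, true utility collapses
```

The point is that nothing about the proxy changed between the two runs; only the optimization pressure applied to it did.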

Even in a situation where the AI does not participate in its own further redesign, its effective ability to optimise the world increases as it gets more time to interact with it. As a result, an initially well behaved AGI might eventually wander into a region of state space where it becomes unfriendly, using only capabilities comparable to those of a human.

That said, it remains to be seen whether researchers will build AIs with this kind of architecture without additional safety precautions. But we do see model-free RL variants of this general architecture, such as Guided Cost Learning and Deep Reinforcement Learning from Human Preferences.

As a practical experiment to validate my reasoning, one could replicate the latter paper using a weak RL algorithm, and then see what happens if it’s swapped out with a much stronger algorithm after learning the reward function. (Some version of MPC maybe?)

• Meta-philosophy hypothesis: Philosophy is the process of reifying fuzzy concepts that humans use. By “fuzzy concepts” I mean things where we can say “I know it when I see it”, but we might not be able to describe what “it” is.

Examples that I believe support the hypothesis:

• This shortform is about the philosophy of “philosophy”, and this hypothesis is an attempt at an explanation of what we mean by “philosophy”.

• In epistemology, Bayesian epistemology is a hypothesis that explains the process of learning.

• In ethics, an ethical theory attempts to make explicit our moral intuitions.

• A clear explanation of consciousness and qualia would be considered philosophical progress.

• do you think any reasonable extension of these kinds of ideas could get what we want?

Conditional on avoiding Goodhart, I think you could probably get something that looks a lot like a diamond maximizer. It might not be perfect, and the situation with the “most diamond” might not be the maximum of its utility function, but I would expect the maximum of its utility function to still contain a very large amount of diamond. For instance, depending on the representation, and the way the programmers baked in the utility function, it might have a quirk of only recognizing something as a diamond if it’s stereotypically “diamond shaped”. This would bar it from just building pure carbon planets to achieve its goal.

IMO, you’d need something else outside of the ideas presented to get a “perfect” diamond maximizer.

• Do you think we could build a diamond maximizer using those ideas, though?

They’re almost certainly not sufficient. A full-fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.

The concern here is that the representation has to cleanly demarcate what we think of as diamonds.

I think this touches on a related concern, namely Goodharting. If we even slightly mis-specify the utility function at the boundary and the AI optimizes in an unrestrained fashion, we’ll end up with weird situations that are totally decorrelated with what we were initially trying to get the AI to optimize.

If we don’t solve this problem, I agree, the problem is extremely difficult at best and completely intractable at worst. However, if we can rein in Goodharting, then I don’t think things are intractable.

To make the point, I think the problem of an AI Goodharting a representation is very analogous to the problems being tackled in the field of adversarial perturbations for image classification. In this case, the “representation space” is the image itself. The boundaries are classification boundaries set by the classifying neural network. The optimizing AI that Goodharts everything is usually just some form of gradient descent.

The field started when people noticed that even tiny imperceptible perturbations to images in one class would fool a classifier into thinking it was an image from another class. The interesting thing is that when you take this further, you get deep dreaming and inceptionism. The lovecraftian dog-slugs that arise from the process are a result of the local optimization properties of SGD combined with the flaws of the classifier. Which, I think, is analogous to Goodharting in the case of a diamond maximizer with a learnt ontology. The AI will do something weird: it becomes convinced that the world is full of diamonds. Meanwhile, if you ask a human about the world it created, “lovecraftian” will probably precede “diamond” in the description.
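Here’s a minimal sketch of the effect, with a linear “classifier” standing in for the neural network and all the numbers made up. A tiny, uniform nudge to every pixel, each moved in the direction that raises the classifier’s score, flips the label even though no single pixel changes much:

```python
import random

# Toy adversarial perturbation: a linear score plays the role of the
# classifier's decision boundary in "representation space".
random.seed(0)
d = 100
w = [random.gauss(0, 1) for _ in range(d)]   # fixed classifier weights
x = [random.random() for _ in range(d)]      # a "natural image", pixels in [0, 1]

def score(img, bias):
    # > 0 means "dog", < 0 means "cat".
    return sum(wi * pi for wi, pi in zip(w, img)) + bias

bias = -score(x, 0.0) - 0.5   # chosen so x starts safely classified as "cat"

# FGSM-style step: nudge each pixel by eps in the direction that raises
# the score. eps is tiny relative to the [0, 1] pixel range, but the
# summed effect across many pixels crosses the class boundary.
eps = 0.02
x_adv = [pi + eps * (1 if wi > 0 else -1) for pi, wi in zip(x, w)]

print(score(x, bias) < 0, score(x_adv, bias) > 0)
```

This is the Goodhart analogy in miniature: the optimizer found a point the classifier loves that barely differs, to us, from the original.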

However, the field of adversarial examples seems to indicate that it’s possible to at least partially overcome this form of Goodharting and, by analogy, the Goodharting that we would see with a diamond maximizer. IMO, the most promising and general solution is to be more Bayesian, and keep track of the uncertainty associated with class labels. By keeping track of uncertainty in class labels, it’s possible to avoid class boundaries altogether, and optimize towards regions of the space that are more likely to be part of the desired class.

I can’t seem to dig it up right now, but I once saw a paper where they developed a robust classifier. When they used SGD to change a picture from being classified as a cat to being classified as a dog, the result was that the underlying image went from looking like a cat to looking like a dog. By analogy, a diamond maximizer with a robust classification of diamonds in its representation should actually produce diamonds.

Overall, adversarial examples seem to be a microcosm for evaluating this specific kind of Goodharting. My optimism that we can do robust ontology identification is tied to the success of that field, but at the moment the problem doesn’t seem to be intractable.

• I’m personally far more optimistic about ontology identification. Work in representation learning, blog posts such as OpenAI’s sentiment neuron, and style transfer all indicate that it’s at least possible to point at human-level concepts in a subset of world models. Figuring out how to refine these learned representations to further correspond with our intuitions, and figuring out how to rebind those concepts to representations in more advanced ontologies, are both areas that are neglected, but neither problem seems fundamentally intractable.

• Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…

I think there are ways of doing this that don’t involve explicitly working through what observation sequences lead to good outcomes. AFAICT this was originally outlined in Model Based Rewards quite a while ago. Essentially, the idea is to make the reward (or even better, utility) a function of the agent’s internal model of the world. Then when the agent goes to make a decision, the utilities of the worlds where the agent does and does not take an action are compared. Doing things this way has a couple of nice properties, including eliminating the incentive to wirehead, and making it possible to specify utilities over possible worlds rather than just over what the AI sees.

The relevant point, however, is that it takes the problem of trying to pin down what chains of events lead to good outcomes, and splits it into the problems of identifying good and bad world-states in the agent’s model, and of building an accurate model of the world. This is because an agent with an accurate model of the world will be able to figure out what sequence of actions and observations leads to any given world-state.
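Here’s a minimal sketch of the setup I’m describing, with toy states, actions, and utilities of my own invention. The point is that utility is evaluated on the model’s predicted world-states, not on observation histories:

```python
# Model-based utility sketch: the agent compares the utilities of the
# predicted worlds where it does and does not take each candidate action.

world_model = {  # (state, action) -> predicted next world-state
    ("start", "mine"): "has_ore",
    ("start", "wait"): "start",
    ("has_ore", "smelt"): "has_metal",
    ("has_ore", "wait"): "has_ore",
}

def utility(state):
    # Utility is a function of modeled world-states, not raw observations.
    return {"start": 0, "has_ore": 1, "has_metal": 5}[state]

def choose_action(state, actions):
    # Pick the action whose predicted resulting world scores highest.
    return max(actions, key=lambda a: utility(world_model[(state, a)]))

print(choose_action("start", ["mine", "wait"]))     # picks "mine"
print(choose_action("has_ore", ["smelt", "wait"]))  # picks "smelt"
```

Under this framing the hard work moves into two places: making `utility` pick out genuinely good world-states, and making `world_model` accurate; the chaining of actions toward good outcomes then falls out of ordinary planning.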

• This isn’t quite “lock in”, but it’s related in the sense that an outside force shaped the field of “deep learning”.

I suspect the videogame industry, and the GPUs that were developed for it, have locked in the type of technologies we now know as deep learning. GPUs were originally ASICs developed for playing videogames, so there are specific types of operations they were optimized to perform.

I suspect that neural network architectures that leveraged these hardware optimizations outperformed other neural networks. Conv nets and Transformers are probably evidence of this. The former leverage convolution, and the latter leverage matrix multiplication. In turn, GPUs and ASICs have been optimized to run these successful neural networks faster, with NVIDIA rolling out Tensor Cores and Google deploying their TPUs.

Looking back, it’s hard to say that this combination of hardware and software isn’t a local optimum, or that if we were to redesign the whole stack from the bottom up, the technologies with the capabilities of modern “deep learning” wouldn’t look completely different.

It’s not even clear how one could find another optimum in the space of algorithms + hardware at this point. The current stack benefits both from open source contributions and massive economies of scale.

• This, of course, leads naturally to a new app/OS idea. We need a way to semi-seamlessly use two phones together as if they were a single phone. Like dual monitors, but with phones.

• Can you list some papers that are vaguely in line with the kind of research you’re looking for?

• I’ve been lurking on LW for many years, and overall, my impression is that there’s been steady progress. At the end of a very relevant essay from Scott, way back in 2014, he states:

I find this really exciting. It suggests there’s this path to be progressed down, that intellectual change isn’t just a random walk. Some people are further down the path than I am, and report there are actual places to get to that sound very exciting. And other people are around the same place I am, and still other people are lagging behind me. But when I look back at where we were five years ago, it’s so far back that none of us can even see it anymore, so far back that it’s not until I trawl the archives that I realise how many things there used to be that we didn’t know.

Five years later, I think this still applies. It explains some of the rehashing of topics that were previously discussed. The things I point out below are some of the most notable insights I can remember.

When LW was relatively inactive, there were essays from the surrounding sphere that stuck with me. For instance, this essay by Paul Christiano, which was, for me, the first clear example of how epistemically irrational things that humans do can actually be instrumentally rational in the right setting, something that wasn’t really discussed much in the original sequences.

I think LW has also started focusing a fair bit on group rationality, along with norms and systems that foster it. That can be seen by looking at how the site has changed, along with all of the meta discussion that follows. I think that in pursuit of this, there’s also been quite a bit of discussion about group dynamics. Most notable for me were Scott’s Meditations on Moloch and The Toxoplasma of Rage. Group rationality looks like a very broad topic, and insightful discussions about it are still happening now, such as this discussion on simulacra levels.

On the AI safety side, I feel like there’s been an enormous amount of progress. Most notable for me was Stuart Armstrong’s post Humans can be assigned any values whatsoever, along with all the discussion about the pros and cons of different methods of achieving alignment, such as AI Safety via Debate, HCH, and Value Learning.

As for the sequences, I don’t have any examples off the top of my head, but I think at least some of the quoted psychology results failed to replicate during the replication crisis. I can’t remember too much else about them, since it’s been so long since I read them. Many of the core ideas feel like they’ve become background knowledge that I take for granted, even if I’ve forgotten their original source.

• This isn’t quite the same as weighting by minimum description length in the Solomonoff sense, since we care specifically about symmetries which correspond to function calls—i.e. isomorphic sub-DAGs. We don’t care about graphs which can be generated by a short program but don’t have these sorts of symmetries.

Can you elaborate on this? What would be an example of a graph that can be generated by a short program, but that does not have these sorts of symmetries?

My intuition is that the class of processes you’re describing is Turing complete, and can therefore simulate any Turing machine, and is thus just another instance of Solomonoff induction with a different MDL constant.

Edit: Rule 110 would be an example.
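To make the example concrete (my own sketch): Rule 110 is an elementary cellular automaton that is known to be Turing complete, and the program that generates its evolution is genuinely short. Each cell’s next value depends only on its 3-cell neighborhood, read off from the bits of the number 110:

```python
RULE = 110  # the bits of 110 encode the next state for each 3-cell pattern

def step(cells):
    """One update of the Rule 110 automaton on a ring of cells."""
    n = len(cells)
    return [
        (RULE >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

cells = [0] * 20 + [1]  # a single live cell on a ring of 21 cells
for _ in range(5):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

The evolution graph this produces is highly structured, but not via the kind of isomorphic-sub-DAG symmetry described above, which is what makes it a candidate counterexample: short program, no function-call symmetry.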