# Donald Hobson

Karma: 383 (LW), 5 (AF)
• In other words, the agent assigned zero probability to an event, and then it happened.

• As far as I understand it, you are proposing that the most realistic failure mode consists of many AI systems, all put into a position of power by humans, and optimizing for their own proxies. Call these Trusted Trial and Error AIs (TTEs).

The distinguishing feature of TTEs is that they were trusted. A human put them in a position of power. Humans have refined, understood and checked the code enough that they are prepared to put the algorithm in a self-driving car, or a stock management system. They are not lab prototypes. They are also trial and error learners, not one-shot learners.

Some more description of the capability range I am considering.

Suppose hypothetically that we had TTE reinforcement learners, a little better than today's state of the art, and nothing beyond that. The AIs are advanced enough that they can take a mountain of medical data and train themselves to be skilled doctors by trial and error. However, they are not advanced enough to figure out how humans work from, say, a sequenced genome and nothing more.

Give them control of all the traffic lights in a city, and they will learn how to minimize traffic jams. They will arrange for people to drive in circles rather than stay still, so that they do not count as part of a traffic jam. However, they will not do anything outside their preset policy space, like hacking into the traffic light control systems of other cities, or destroying the city with nukes.

If such technology is easily available, people will start to use it for things. Some people put it in positions of power, others are more hesitant. As the only way the system can learn to avoid something is through trial and error, the system has to cause a public outcry (probably several) before it learns not to do so. If no one told the traffic light system, via simulations or past data, that car crashes are bad (an alignment failure), then even if public opinion feeds directly into reward, it will have to cause several car crashes that are clearly its fault before it learns to only cause crashes that can be blamed on someone else. However, deliberately causing crashes will probably get the system shut off or seriously modified.

Note that we are supposing many of these systems exist, so the failures of some, combined with plenty of simulated failures, will give us a good idea of the failure modes.

The space of bad things an AI can get away with is a small and highly complex region within the space of bad things. A TTE set to reduce crime rates tries making the crime report forms longer; this reduces reported crime, but humans quickly realize what it's doing. It would have to do this and be patched many times before it came up with a method that humans wouldn't notice.

Given advanced TTEs as the most advanced form of AI, we might slowly develop a problem, but the deployment of TTEs would be slowed by the time it takes to gather data and check reliability, especially given mistrust after several major failures. And I suspect that due to the statistical similarity of training and testing, many different systems optimizing different proxies, and humans having the best abstract reasoning about novel situations, and the power to turn the systems off, any discrepancy of goals will be moderately minor. I do not expect such optimization power to be significantly more powerful or less aligned than modern capitalism.

This all assumes that no one will manage to make a linear-time AIXI. If such a thing is made, it will break out of any boxes and take over the world. So, we have a social process of adaptation to TTE AI, which is already in its early stages with things like self-driving cars, and at any time this process could be rendered irrelevant by the arrival of a superintelligence.

• 1) Extinction caused by climate change is not on the table. Low-tech humans can survive everywhere from the jungle to the arctic. Some humans will survive.

2) I suspect that climate change won't cause massive social collapse. It might well knock 10% off world GDP, but it won't stop us having an advanced high-tech society. At the moment, it's not causing damage on that scale, and I suspect that in a few decades we will have biotech, renewables or other techs that will make everything fine. I suspect that the damage caused by climate change won't increase by more than 2 or 3 times in the next 50 years.

3) If you are skilled enough to be a scientist, inventing a solar panel that's 0.5% more efficient does a lot more good than showing up to protests. Protests need many people to work; inventors can change the world by themselves. Policy advisors and academics can suggest action in small groups. Even working a normal job and sending your earnings to a well-chosen charity is likely to be more effective.

4) Quite a few people are already working on global warming. It seems unlikely that a problem needs 10,000,001 people working on it to be solved, such that 10,000,000 people won't manage. Most of the really easy work on global warming is already being done. This was not the case with AI risk as of 10 years ago, for example. (It's got a few more people working on it since then, still nothing like climate change.)

• I think the protagonist here should have looked at Earth. If there was a technological intelligence on Earth that cared about the state of Jupiter's moons, then it could send rockets there. The most likely scenarios are a disaster bad enough to stop us launching spacecraft, and an AI that only cares about Earth.

A superintelligence should assign non-negligible probability to the result that actually happened. Given the tech was available, a space probe containing an uploaded mind is not that unlikely. If such a probe was a real threat to the AI, it would have already blown up all space probes on the off chance.

The upper bound given on the amount that malicious info can harm you is extremely loose. Malicious info can't do much harm unless the enemy has a good understanding of the particular system that they are subverting.

• Yet policy exploration is an important job. Unless you think that someone posting something on a blog is going to change policy without anyone double-checking it first, we should encourage the suggestion of radically new policies.

• I would like to propose a model that is more flattering to humans, and more similar to how other parts of human cognition work. When we see a simple textual mistake, like a repeated "the", we don't notice it by default. Human minds correct simple errors automatically without consciously noticing that they are doing it. We round to the nearest pattern.

I propose that automatic pattern matching to the closest thing that makes sense is happening at a higher level too. When humans skim semi-contradictory text, they produce a more consistent world model that doesn't quite match up with what is said.

Language feeds into a deeper, sensible world-model module within the human brain, while GPT2 doesn't really have a coherent world model.

• Your belief about how well AGI is likely to go affects both the likelihood of a bet being evaluated and the chance of winning, so bets about AGI are likely to give dubious results. I also have substantial uncertainty about the value of money in a post-singularity world. Most obviously, if everyone gets turned into paperclips, no one has any use for money. If we get a friendly singleton superintelligence, everyone is living in paradise, whether or not they had money before. If we get an economic singularity, where libertarian ASI(s) try to make money without cheating, then money could be valuable. I'm not sure how we would get that, as an understanding of the control problem good enough to not wipe out humans and fill the universe with bank notes should be enough to make something closer to friendly.

Even if we do get some kind of ascendant economy, given the amount of resources in the solar system (let alone the wider universe), it's quite possible that pocket change would be enough to live for aeons of luxury.

Given how unclear it is whether the bet will get paid, and how much the cash would be worth if it was, I doubt that the betting will produce good info. If everyone thinks that money is more likely than not to be useless to them after ASI, then almost no one will be prepared to lock their capital up until then in a bet.

• I suspect that an AGI with such a design could be much safer if it was hardcoded to believe that time travel and hyperexponentially vast universes were impossible. Suppose that the AGI thought that there was a 0.0001% chance that it could use a galaxy's worth of resources to send 10^30 paperclips back in time, or create a parallel universe containing 3^^^3 paperclips. It will still chase those options.

If starting a long plan to take over the world costs it literally nothing, it will do it anyway. A sequence of short-term plans, each designed to make as many paperclips as possible within the next few minutes, could still end up dangerous. If the number of paperclips at time t is c(t), and its power at time t is p(t), then dc/dt ∝ p(t) and dp/dt ∝ p(t) would mean that both power and paperclips grew exponentially. This is what would happen if power can be used to gain power and clips at the same time, with minimal loss of either from also pursuing the other.
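A discrete-time sketch of these dynamics makes the point concrete; the rate constants and the function name are my own illustrative assumptions, not taken from anywhere else in this discussion.

```python
# Toy model of the dynamics above: in each step, power produces
# paperclips AND more power, with no trade-off between the two.
# The rate constants a and b are illustrative assumptions.
def simulate(steps, a=0.1, b=0.1, p0=1.0):
    p, c = p0, 0.0  # initial power and paperclip count
    history = []
    for _ in range(steps):
        c += a * p  # clips made in proportion to current power
        p += b * p  # power compounds on itself
        history.append((p, c))
    return history

hist = simulate(100)
# Power grows geometrically, p(t) = p0 * (1 + b)**t, and the clip
# count, a sum of geometric terms, grows at the same exponential rate.
```

If power could only be spent on one thing at a time, the clip-making step would have to subtract from `p`, and growth would fall below the compounding rate, which is the safer regime.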

If power can only be used to gain one thing at a time, and the rate power can grow at is less than the rate of time discounting, then we are safer.

This proposal has several ways to be caught out, world-wrecking assumptions that aren't certain, but if used with care, a short time frame, an ontology that considers time travel impossible, and say a utility function that maxes out at 10 clips, it probably won't destroy the world. Throw in mild optimization and an impact penalty, and you have a system whose safety relies on a disjunction of shaky assumptions, not a conjunction of them.

It is a CDT agent, or something else that doesn't try to punish you now so that you made paperclips last week. A TDT agent might decide to adopt the policy of killing anyone who didn't make clips before it was turned on, causing humans who predict this to make clips.

I suspect that it would be possible to build such an agent, prove that there are no weird failure modes left, and turn it on, with a small chance of destroying the world. I'm not sure why you would do that. Once you understand the system well enough to say it's safe-ish, what vital info do you gain from turning it on?

• Butterfly effects are essentially unpredictable, given your partial knowledge of the world. Sure, you doing homework could cause a tornado in Texas, but it's equally likely to prevent one. To actually predict which, you would have to calculate the movement of every gust of air around the world. Otherwise you're shuffling an already well-shuffled pack of cards. Bear in mind that you have no reason to distinguish the particular action of "doing homework" from a vast set of other actions. If you really did know which actions would stop the Texas tornado, they might well look like random thrashing.

What you can calculate is the reliable effects of doing your homework. So, given bounded rationality, you are probably best to base your decisions on those. The fact that this only involves homework might suggest that you have an internal conflict between a part of yourself that thinks about careers, and a short-term procrastinator.

Most people who aren't particularly ethical still do more good than harm. (If everyone looks out for themselves, everyone has someone to look out for them. The law stops most of the bad mutual defections in prisoners' dilemmas.) Evil geniuses trying to trick you into doing harm are much rarer than moderately competent nice people trying to get your help to do good.

• This is an example of a Pascal's mugging. Tiny probabilities of vast rewards can produce weird behavior. The best known solution is either a bounded utility function, or an anti-Pascalene agent (an agent that ignores the best x% and worst y% of possible worlds when calculating expected utilities; it can be money-pumped).
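A minimal sketch of such an anti-Pascalene calculation, with function name and example numbers of my own choosing: drop the most extreme tails of probability mass before averaging.

```python
# Expected utility that ignores the best x and worst y fractions of
# probability mass (sorted by utility), then renormalizes.
def truncated_eu(worlds, x=0.01, y=0.01):
    # worlds: list of (probability, utility) pairs, probabilities sum to 1
    worlds = sorted(worlds, key=lambda w: w[1])
    lo, hi = y, 1.0 - x  # retained band of cumulative probability
    eu, cum = 0.0, 0.0
    for prob, util in worlds:
        # probability mass of this world that falls inside [lo, hi]
        kept = max(0.0, min(cum + prob, hi) - max(cum, lo))
        eu += kept * util
        cum += prob
    return eu / (hi - lo)

# A mugging: a tiny probability of a vast reward dominates the naive sum,
# but truncating the best 1% of probability mass removes it entirely.
worlds = [(0.999, 1.0), (0.001, 10**12)]
naive = sum(p * u for p, u in worlds)
robust = truncated_eu(worlds)
```

This sketch does nothing to fix the money-pump problem the comment mentions; it only shows how truncation neutralizes the mugging.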

• Get a pack of cards in which some cards are blue on both sides, and some are red on one side and blue on the other. Pick a random card from the pile. If the subject is shown one side of the card, and it's blue, they gain a bit of evidence that the card is blue on both sides. Give them the option to bet on the colour of the other side of the card, before and after they see the first side. Invert the prospect theory curve to get from betting behaviour to implicit probability. The people should perform a larger update in log odds when the pack is mostly one type of card than when the pack is 50:50.
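As a baseline for this experiment (my addition, using plain Bayes rather than prospect theory): the normative log-odds update from seeing one blue side is exactly log 2, independent of the pack's composition, so any dependence of subjects' implied updates on composition is exactly the distortion being measured.

```python
import math

# Bayesian baseline: a card is blue-blue (BB) with prior q, red-blue
# (RB) with prior 1-q. A uniformly random side of the drawn card is shown.
# P(blue side | BB) = 1 and P(blue side | RB) = 1/2, so the likelihood
# ratio is 2 and the log-odds update is log 2 for every q.
def log_odds(p):
    return math.log(p / (1 - p))

def posterior_bb(q):
    return q / (q + (1 - q) * 0.5)  # Bayes' rule after seeing a blue side

updates = [log_odds(posterior_bb(q)) - log_odds(q) for q in (0.1, 0.5, 0.9)]
# every entry equals log(2), whatever the pack composition
```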

• I suspect that if voting reduced your own karma, some people wouldn't vote. As it becomes obvious that this is happening, more people stop voting, until karma just stops flowing at all. (The people who persistently vote anyway all run out of karma.)

• This is making the somewhat dubious assumption that X-risks are not so neglected that even a "selfish" individual would work to reduce them. Of course, in the not too unreasonable scenario where the cosmic commons is divided up evenly, and you use your portion to make a vast number of duplicates of yourself, the utility, if your utility is linear in copies of yourself, would be vast. Or you might hope to live for a ridiculously long time in a post-singularity world.

The effect that a single person can have on X-risks is small, but if they were selfish with no time discounting, reducing them would be a better option than hedonism now. Although a third alternative, of sitting in a padded room being very very safe, could be even better.

• I suspect that the social institutions of Law and Money are likely to become increasingly irrelevant background to the development of ASI.

Deterrence fails.

If you believe that there is a good chance of immortal utopia, and a large chance of paperclips, in the next 5 years, the threat that the cops might throw you in jail (on the off chance that they are still in power) is negligible.

The law is blind to safety.

The law is bureaucratic and ossified. It is probably not employing much top talent, as it's hard to tell top talent from the rest if you aren't as good yourself (and it doesn't have the budget or glamor to attract them). Telling whether an organization is on track to not destroy the world is HARD. The safety protocols are being invented on the fly by each team; the system is very complex and technical and only half built. The teams that would destroy the world aren't idiots; they are still producing long papers full of maths and talking about the importance of safety a lot. There are no examples to work with, or understood laws.

Likely as not (not really, too much conjunction here), you get some random inspector with a checklist full of things that sound like a good idea to people who don't understand the problem. All AI work has to have an emergency stop button that turns the power off. (The idea of an AI circumventing this was not considered by the person who wrote the list.)

All the law can really do is tell what public image an AI group wants to present, provide funding to everyone, and get in everyone's way. Telling cops to "smash all GPUs" would have an effect on AI progress. The fund vs smash axis is about the only lever they have. They can't even tell an AI project from a maths convention from a normal programming project if the project leaders are incentivized to obfuscate.

After ASI, governments are likely only relevant if the ASI was programmed to care about them. Neither paperclippers nor FAI will care about the law. The law might be relevant if we had tasky ASI that was not trivial to leverage into a decisive strategic advantage. (An AI that can put a strawberry on a plate without destroying the world, but that's about the limit of its safe operation.)

Such an AI embodies an understanding of intelligence, and could easily be accidentally modified to destroy the world. Such scenarios might involve ASI and timescales long enough for the law to act.

I don't know how the law can handle something that can easily destroy the world, has some economic value (if you want to flirt with danger), and, with further research, could grant supreme power. The discovery must be limited to a small group of people (by the law of large numbers of nonexperts, one will do something stupid). I don't think the law could notice what it was; after all, the robot in front of the inspector only puts strawberries on plates. They can't tell how powerful it would be with an unbounded utility function.

• Firstly, you are confusing dollars and utils.

If you buy this product for $100, you gain the use of it, at value U[30] to yourself. The workers who made it gain $80, at value U[80] to yourself, because of your utilitarian preferences. Total value U[110].

If the alternative was a product of cost $100, which you value the use of at U[105], but all the money goes to greedy rich people to be squandered, then you would choose the first. If the alternative was spending $100 to do something insanely morally important, U[3^^^3], you would do that.

If the alternative was a product of cost $100, that was of value U[100] to yourself, and some of the money would go to people that weren't that rich, U[15], you would do that.

If you could give the money to people twice as desperate as the workers, at U[160], you would do that.

There are also good reasons why you might want to discourage monopolies. Any desire to do so is not included in the expected value calculations. But the basic principle is that utilitarianism can never tell you if some action is a good use of a resource, unless you tell it what else that resource could have been used for.
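The comparisons above can be tabulated in a few lines; the option labels are mine, the U[...] numbers are the ones used in the comment (the U[3^^^3] case is left out as it trivially dominates).

```python
# Total utility of each way to spend the $100: use-value to yourself
# plus the utilitarian value of where the money ends up.
options = {
    "product, money to workers":      30 + 80,   # U[30] use + U[80] to workers
    "product, money to rich":         105 + 0,   # U[105] use, money squandered
    "product, money to poor":         100 + 15,  # U[100] use + U[15] to recipients
    "give to the twice-as-desperate": 160,       # pure transfer, U[160]
}
best = max(options, key=options.get)
```

The ranking only exists because every alternative use of the $100 is listed with its own utility, which is the point of the comment.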

• The information needed to describe our particular laws of physics < the info needed to describe the concept of a "habitable universe" in general < the info needed to describe a human-like mind.

The biggest slip is the equivocation of the word intelligence. The Kolmogorov complexity of AIXI-tl is quite small, so intelligences in that sense of the word are likely to exist in the universal prior.

Humanlike minds bear not only the clear mark of evolution, but the mark of stone-age tribal interactions across their psyche. An arbitrary mind will be bizarre and alien. Wondering if such a mind might be benevolent is hugely privileging the hypothesis. The most likely way to make a humanlike mind is the process that created humans. So in most of the universes with humanoid deities, those deities evolved. This becomes the simulation hypothesis.

The best hypothesis is still the laws of quantum physics or whatever.

• We don't know what we are missing out on without superintelligence. There might be all sorts of amazing things that we would just never consider making, or would dismiss as obviously impossible, without superintelligence.

I am pointing out that being able to make an FAI that is a bit smarter than you (smartness not really being on a single scale, given vastly different cognitive architectures; is Deep Blue smarter than a horse?) involves solving almost all the hard problems in alignment. When we have done all that hard work, we might as well tell it to make itself a trillion times smarter; the cost to us is negligible, the benefit could be huge.

AI can also serve as a values repository. In most circumstances, values are going to drift over time, possibly due to evolutionary forces. If we don't want to end up as hardscrapple frontier replicators, we need some kind of singleton. Most types of government or committee have their own forms of value drift, and couldn't keep enough of an absolute grip on power to stop any rebellions for billions of years. I have no ideas other than Friendly ASI oversight for how to stop someone in a cosmically vast society from creating a UFASI. Sufficiently draconian banning of anything at all technological could stop anyone from creating UFASI long term, and also stop most things since the industrial revolution.

The only reasonable scenario that I can see in which FAI is not created and the cosmic commons gets put to good use is if a small group of likeminded individuals, or a single person, gains exclusive access to self-replicating nanotech and mind uploading. They then use many copies of themselves to police the world. They do all programming, and only run code they can formally prove isn't dangerous. No one is allowed to touch anything Turing-complete.

• Both blanks are the identity function.

Here is some pseudocode:

```python
class Prover:
    def __init__(self):
        # Start out trusting only PA (a Peano Arithmetic proof checker,
        # taking a statement and an alleged proof of it).
        self.ps = [PA]

    def prove(self, p, s, b):
        # Verify proof b of statement s using an already-trusted system p.
        assert p in self.ps
        return p(s, b)

    def add_system(self, p1, p2, b):
        # Trust p2 if the trusted system p1 proves (via b) that anything
        # provable in p2 is also provable in p1.
        if self.prove(p1, "forall s: (exists b2: p2(s,b2)) => (exists b1: p1(s,b1))", b):
            self.ps.append(p2)

prover = Prover()
```