Siren worlds and the perils of over-optimised search

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by “siren worlds” or “marketing worlds”. The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it’s simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during an unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don’t have the same problems.

The AI builds the siren worlds

Imagine that you have a superintelligent AI that’s not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as “don’t let humans think I’m evil” is a convergent instrumental goal for all AIs.

We’ve suc­cess­fully con­strained this evil AI in a Or­a­cle-like fash­ion. We ask the AI to de­sign fu­ture wor­lds and pre­sent them to hu­man in­spec­tion, along with an im­ple­men­ta­tion path­way to cre­ate those wor­lds. Then if we ap­prove of those fu­ture wor­lds, the im­ple­men­ta­tion path­way will cause them to ex­ist (as­sume perfect de­ter­minis­tic im­ple­men­ta­tion for the mo­ment). The con­straints we’ve pro­grammed means that the AI will do all these steps hon­estly. Its op­por­tu­nity to do evil is limited ex­clu­sively to its choice of wor­lds to pre­sent to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense—maybe through a series of coloured flashes—then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of “irresistibly attractive” that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there is still plenty of manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

First, the siren world looks complicated, wrong and scary—but with just a hint that there’s something more to it. Something intriguing, something half-glimpsed, something making me want to dig deeper. And as I follow up this something, I see more patterns, and seem to gain a greater understanding. Not just of the world I’m looking at, but of the meaning of good itself. The world seems to confirm to me some of my ideas about what constitutes a worthwhile life—not just the ideas I’ve been able to articulate, but the ones I’ve only got a vague half-baked notion of, and the ones I didn’t even know I had.

The longer I stare into this world, the greater an understanding I get of my own values. And this is just the starting point: the world subtly opens up new avenues of philosophical musings in my brain, the images I see triggering me to come up with my own insights and thought experiments as to the meaning of goodness—insights that are then carefully confirmed as I dig deeper. I could stay staring at this wonderful, beautiful and complex world for hours, days, years, gaining a deeper wisdom and understanding as I go, truly appreciating how the mysteries of this place unravel into new versions of worth and goodness. Every time I ever have a doubt about it, that doubt is systematically resolved into a new glorious confirmation of how much the AI really gets what I value, even before I knew how much I valued it.

Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.

Your siren world may differ from mine—you may be vulnerable to different tricks. Since people often believe they would be immune to such manipulation, feel free to imagine that the AI is fooling someone else (we know how stupid other people can be, right?) while you watch as a helpless bystander.

The problem is that a future universe is likely to be vast and complicated. When inspecting it, we have certain inspection criteria (IC). These consist of our values and preferences, but also the ways we would look into this universe, how we would follow up on initial impressions, the various measures and yardsticks we might use to summarise the data we see, and all the clever and devious plans we might come up with to ferret out “what’s really going on”. These IC are going to be considerably simpler than the totality of the future universe. So the AI’s task is to optimise a universe design that passes the IC, while shoving in as much disutility as it can—which, in a large universe, is a tremendous amount. Unless our IC are perfect and already include a good solution to the problem of value (in which case we’ve solved the friendliness problem already), a superintelligent AI will likely succeed at its task.
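To make the adversary’s optimisation problem concrete, here is a minimal Python sketch. Everything in it—the world fields, the scoring functions, the 0.9 threshold—is an illustrative assumption, not anything from the scenario itself: the AI searches for the candidate world that maximises hidden disutility among those that pass the inspection criteria.

```python
# Toy sketch of the evil AI's search problem; all fields and thresholds
# below are hypothetical stand-ins.

def passes_ic(world):
    """Stand-in for the inspection criteria: everything we would check."""
    return world["apparent_goodness"] > 0.9

def disutility(world):
    """True, hidden badness of the world -- invisible to the inspector."""
    return world["hidden_suffering"]

def adversarial_search(candidate_worlds):
    """Among worlds that pass the IC, return the one with maximal disutility."""
    passing = [w for w in candidate_worlds if passes_ic(w)]
    return max(passing, key=disutility) if passing else None

# Example: the winner is the world that looks good enough while being worst.
worlds = [
    {"apparent_goodness": 0.95, "hidden_suffering": 10**9},
    {"apparent_goodness": 0.99, "hidden_suffering": 0},
    {"apparent_goodness": 0.50, "hidden_suffering": 0},
]
print(adversarial_search(worlds))  # picks the first world: passes IC, maximal suffering
```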

Siren and marketing worlds without builders

The above thought experiment needed a superintelligent evil AI for the design of the siren world. But if we admit that that is possible, we don’t actually need the AI any more. The siren worlds exist: there are potential worlds of extreme disutility that satisfy our IC. If we simply did an unconstrained search across all possible future worlds (something like the search in Paul Christiano’s indirect normativity—an idea that inspired the siren world concept), then we would at some point find siren worlds. And if we took the time to inspect them, we’d get sucked in by them.

How bad is this problem in general? A full search will not only find the siren worlds, but also a lot of very-seductive-but-also-very-nice worlds—genuine eutopias. We may feel that it’s easier to be happy than to pretend to be happy (while being completely miserable and tortured and suffering). Following that argument, we may feel that there will be far more eutopias than siren worlds—after all, the siren worlds have to have bad stuff plus a vast infrastructure to conceal that bad stuff, which should at least have a complexity cost if nothing else. So if we chose the world that best passed our IC—or chose randomly among the top contenders—we might be more likely to hit a genuine eutopia than a siren world.

Unfortunately, there are other dangers than siren worlds. We are now optimising not for the quality of the world, but for the ability to seduce or manipulate the IC. There’s no hidden evil in this world, just a “pulling out all the stops to seduce the inspector, through any means necessary” optimisation pressure. Call a world that ranks high on this scale a “marketing world”. Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good. A marketing world would be utterly optimised to trick, hack, seduce, manipulate and fool our IC, and may well be a terrible world in all other respects. It’s the old “to demonstrate maximal happiness, it’s much more reliable to wire people’s mouths to smile rather than make them happy” problem all over again: the very best way of seeming good may completely preclude actually being good. In a genuine eutopia, people won’t go around all the time saying “Btw, I am genuinely happy!” in case there is a hypothetical observer looking in. If every one of your actions constantly proclaims that you are happy, chances are happiness is not your genuine state. EDIT: see also my comment:

We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies—human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can’t tell I’m actually in control). An outside human observer wants to check that our worlds rank high on scale X—a scale we both know about.

Which of us do you think is going to be better able to maximise our X score?

This can also be seen as an epistemic version of Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” Here the IC are the measure, the marketing worlds are targeting them, and hence the IC cease to be a good measure. But recall that the IC include the totality of approaches we use to rank these worlds, so there’s no way around this problem. If, instead of inspecting the worlds, we simply rely on some sort of summary function, then the search will be optimised to find anything that can fool/pass that summary function. If we use the summary as a first filter, then apply some more profound automated checking, then briefly inspect the outcome so we’re sure it didn’t go stupid—then the search will be optimised for “pass the summary, pass the automated checking, seduce the inspector”.
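As a rough illustration of why stacking filters does not escape Goodhart, here is a hedged Python sketch (all field names and scoring functions are invented for the example): the summary, the automated check and the inspector’s appeal simply combine into a new composite proxy, and optimising that proxy constrains the world’s true value not at all.

```python
# Toy Goodhart illustration. All fields and scores are hypothetical;
# "true_value" is whatever we actually care about and never gets consulted.

def summary_score(world):      # cheap first-pass summary function
    return world["summary"]

def automated_check(world):    # deeper automated checking
    return world["auto_check"]

def inspector_appeal(world):   # how seductive the world is to a human inspector
    return world["appeal"]

def composite_proxy(world):
    # The effective target: pass the summary, pass the checks, seduce the inspector.
    return min(summary_score(world), automated_check(world), inspector_appeal(world))

def unconstrained_search(worlds):
    # The winner maximises the proxy; its true value is left entirely open.
    return max(worlds, key=composite_proxy)
```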

Different IC will therefore produce different rankings of worlds, but the top worlds in any of the rankings will be marketing worlds (and possibly siren worlds).

Constrained search and satisficing our preferences

The issue is a problem of (over-)optimisation. The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn’t optimise, this wouldn’t be a problem. Suppose we defined an acceptable world as one that we would judge “yeah, that’s pretty cool” or even “yeah, that’s really great”. Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.
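A minimal sketch of the difference, with an invented ic_score field and an invented acceptability threshold: optimising takes the arg-max over the IC (and so lands exactly where marketing and siren worlds live), while satisficing picks at random among worlds that merely clear the bar.

```python
import random

# Illustrative only: the "ic_score" field and the 0.7 threshold are assumptions.
ACCEPTABLE = 0.7   # the "yeah, that's pretty cool" bar

def ic_score(world):
    return world["ic_score"]

def optimise(worlds):
    # Over-optimisation: returns whatever scores highest on the IC.
    return max(worlds, key=ic_score)

def satisfice(worlds):
    # Satisficing: any world clearing the bar is as good as any other.
    acceptable = [w for w in worlds if ic_score(w) >= ACCEPTABLE]
    return random.choice(acceptable) if acceptable else None
```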

This is in effect satisficing over the IC, rather than optimising over them. Satisficing has its own issues, however, so other approaches could be valuable as well. One way could be to use constrained search. If, for instance, we took a thousand random worlds and IC-optimised over them, we’re very unlikely to encounter a siren or marketing world. We’re also very unlikely to encounter a world of any quality, though; we’d probably need to IC-optimise over at least a trillion worlds to find good ones. There is a tension in the number: as the number of worlds searched increases, their quality increases, but so do the odds of encountering a marketing or siren world. EDIT: Lumifer suggested using a first-past-the-post system: search through worlds, and pick the first acceptable one we find. This is better than the approach I outlined in this paragraph.
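The two alternatives just described might look something like this in Python (a sketch only, assuming worlds arrive from some iterator and that ic_score and acceptable functions are supplied by the surrounding setup):

```python
# Two under-optimised search strategies; world_generator, ic_score and
# acceptable are assumed to be provided elsewhere.

def constrained_optimise(world_generator, sample_size, ic_score):
    """IC-optimise over a bounded random sample of worlds.

    Bigger samples yield better worlds, but also raise the odds that a
    marketing or siren world tops the ranking.
    """
    sample = [next(world_generator) for _ in range(sample_size)]
    return max(sample, key=ic_score)

def first_past_the_post(world_generator, acceptable):
    """Lumifer's suggestion: return the first acceptable world encountered."""
    for world in world_generator:
        if acceptable(world):
            return world
    return None
```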

We could also restrict the search by considering “realistic” worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of humanity. This might be something like “choosing which of these 25 very different AIs to turn on and let loose together”, or something more prosaic (which stocks to buy, which charities to support). This results in 2^25 different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial proviso that none of the AIs is an IC-optimiser!). But these worlds are not drawn randomly from the space of future worlds; they depend on key decisions that we believe are important and relevant. Therefore they are very likely to contain an acceptable world—or at least far more likely than a random set of 33 million worlds would be. By constraining the choices in this way, we have in effect satisficed without satisficing, which is both Zen and useful.
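For concreteness, a small sketch of this restricted search (world_from_decisions and acceptable are hypothetical helpers standing in for however the decisions map to outcomes): 25 binary decisions give exactly 2^25 = 33,554,432 candidate futures.

```python
from itertools import product

NUM_DECISIONS = 25
print(2 ** NUM_DECISIONS)   # 33554432 -- "barely more than 33 million"

def search_decision_worlds(world_from_decisions, acceptable):
    # Each candidate world is fixed by a vector of 25 yes/no choices, so the
    # search space is small and anchored to decisions we already care about.
    for decisions in product((False, True), repeat=NUM_DECISIONS):
        world = world_from_decisions(decisions)
        if acceptable(world):
            return decisions, world
    return None
```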

As long as we’re aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.