But exactly how complex and fragile?


This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up with current thoughts on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.


The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.

The illustrations I have seen of this involve a person trying to write a description of value, conceptual-analysis style, and failing to put in things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious.

I’m not yet convinced that this is world-destroyingly hard.

Firstly, it seems like you could do better than imagined in these hypotheticals:

  1. These thoughts are from a while ago. If instead you used ML to learn what ‘human flourishing’ looked like in a bunch of scenarios, I expect you would get something much closer than if you try to specify it manually. Compare manually specifying what a face looks like and then generating examples from your description, versus using modern ML to learn it and generate them.

  2. Even in the manually describing it case, if you had like a hundred people spend a hundred years writing a very detailed description of what we want, instead of a writer spending an hour imagining ways that a more ignorant person may mess up if they spent no time on it, I could imagine it actually being pretty close. I don’t have a good sense of how far away it is.

I agree that neither of these would likely get you to exactly human values.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.

This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written-down values entirely exclude something, and b) assuming that there is a fast takeoff, so that the relevant AI has its values forever and takes over the world.

My guess is that values learned using ML, but still somewhat off from human values, are much closer, in terms of not destroying all the value in the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to say that faces have nostrils when trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

Perhaps a bigger thing for me, though, is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all-value-nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it, modify its roles in things, and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
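To make the intuition concrete, here is a toy model of that speed comparison. It is entirely my own construction, not anything from the argument itself: ‘error’ stands for how far deployed values are from human values, shrinking at some correction rate, and ‘influence’ stands for the fraction of decisions locked in by AI, growing toward one. Value is lost when influence gets locked in while error is still large.

```python
def value_lost(correction_rate: float, influence_growth: float,
               steps: int = 1000, dt: float = 0.01) -> float:
    """Fraction of value lost in a toy correction-vs-influence race."""
    error, influence, lost = 1.0, 0.0, 0.0
    for _ in range(steps):
        # value lost now = current error, weighted by influence being locked in
        lost += error * influence_growth * (1 - influence) * dt
        influence += influence_growth * (1 - influence) * dt  # saturates at 1
        error *= (1 - correction_rate * dt)  # corrections shrink the error
    return lost

# Correction much faster than influence growth loses little value;
# the reverse loses most of it.
print(value_lost(correction_rate=5.0, influence_growth=0.5))
print(value_lost(correction_rate=0.1, influence_growth=5.0))
```

On this caricature, whether imperfect alignment is catastrophic depends on the ratio of the two rates, not on the mere existence of some misalignment, which is the sense in which it is an empirical question of scales.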

These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. To my own quick judgment, it’s not obvious to me that they look bad.

For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values—most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes. It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned; they are just overall forces for redirecting toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.

So then the largest remaining worry is that it will still gain power fast, and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.