Alignment Newsletter #37


Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter.

Highlights

Three AI Safety Re­lated Ideas and Two Ne­glected Prob­lems in Hu­man-AI Safety (Wei Dai): If any par­tic­u­lar hu­man got a lot of power, or was able to think a lot faster, then they might do some­thing that we would con­sider bad. Per­haps power cor­rupts them, or per­haps they get so ex­cited about the po­ten­tial tech­nolo­gies they can de­velop that they do so with­out think­ing se­ri­ously about the con­se­quences. We now have both an op­por­tu­nity and an obli­ga­tion to de­sign AI sys­tems that op­er­ate more cau­tiously, that aren’t prone to the same bi­ases of rea­son­ing and heuris­tics that we are, such that the fu­ture ac­tu­ally goes bet­ter than it would if we mag­i­cally made hu­mans more in­tel­li­gent.

If it’s too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones. Human values don’t extrapolate well: just look at the myriad answers that people give to hypotheticals like the trolley problem. So, it’s better to learn from humans that are kept in a safe, familiar environment with all their basic needs taken care of. These are our idealized humans. In practice the AI system would learn a lot from the preferences of real humans, since that should be a very good indicator of the preferences of idealized humans. But if the idealized humans begin to have different preferences from real humans, then the AI system should ignore the “corrupted” values of the real humans.

More generally, it seems important for our AI systems to help us figure out what we care about before we make drastic and irreversible changes to our environment, especially changes that prevent us from figuring out what we care about. For example, if we create a hedonic paradise where everyone is on side-effect-free recreational drugs all the time, it seems unlikely that we would check whether this is actually what we wanted. This suggests that we need to work on AI systems that differentially advance our philosophical capabilities relative to other capabilities, such as technological ones.

One par­tic­u­lar way that “al­igned” AI sys­tems could make things worse is if they ac­ci­den­tally “cor­rupt” our val­ues, as in the he­do­nic par­adise ex­am­ple be­fore. A nearer-term ex­am­ple would be mak­ing more ad­dic­tive video games or so­cial me­dia. They might also make very per­sua­sive but wrong moral ar­gu­ments.

This could also hap­pen in a mul­ti­po­lar set­ting, where differ­ent groups have their own AIs that try to ma­nipu­late other hu­mans into hav­ing val­ues similar to theirs. The at­tack is easy, since you have a clear ob­jec­tive (whether or not the hu­mans start be­hav­ing ac­cord­ing to your val­ues), but it seems hard to defend against, be­cause it is hard to de­ter­mine the differ­ence be­tween ma­nipu­la­tion and use­ful in­for­ma­tion.

Rohin’s opinion: (A more detailed discussion is available on these threads.) I’m glad these posts were written: they outline real problems that I think are neglected in the AI safety community, and they suggest some angles of attack. The rest of this is going to be a bunch of disagreements I have, but these should be taken as disagreements on how to solve these problems, not a disagreement that the problems exist.

It seems quite difficult to me to build AI sys­tems that are safe, with­out hav­ing them rely on hu­mans mak­ing philo­soph­i­cal progress them­selves. We’ve been try­ing to figure this out for thou­sands of years. I’m pes­simistic about our chances at cre­at­ing AI sys­tems that can out­perform this huge in­tel­lec­tual effort cor­rectly on the first try with­out feed­back from hu­mans. Learn­ing from ideal­ized hu­mans might ad­dress this to some ex­tent, but in many cir­cum­stances I think I would trust the real hu­mans with skin in the game more than the ideal­ized hu­mans who must rea­son about those cir­cum­stances from afar (in their safe, fa­mil­iar en­vi­ron­ment).

I do think we want to have a general approach where we try to figure out how AIs and humans should reason, such that the resulting system behaves well. On the human side, this might mean that the human needs to be more cautious for longer timescales, or to have more epistemic and moral humility. Idealized humans can be thought of as an instance of this approach where, rather than change the policy of real humans, we indirectly change their policy in a hypothetical by putting them in safer environments.

For the prob­lem of in­ten­tion­ally cor­rupt­ing val­ues, this seems to me an in­stance of the gen­eral class of “Com­pet­ing al­igned su­per­in­tel­li­gent AI sys­tems could do bad things”, in the same way that we have the risk of nu­clear war to­day. I’m not sure why we’re fo­cus­ing on value cor­rup­tion in par­tic­u­lar. In any case, my cur­rent preferred solu­tion is not to get into this situ­a­tion in the first place (though ad­mit­tedly that seems very hard to do, and I’d love to see more thought put into this).

Over­all, I’m hop­ing that we can solve “hu­man safety prob­lems” by train­ing the hu­mans su­per­vis­ing the AI to not have those prob­lems, be­cause it sure does make the tech­ni­cal prob­lem of al­ign­ing AI seem a lot eas­ier. I don’t have a great an­swer to the prob­lem of com­pet­ing al­igned su­per­in­tel­li­gent AI sys­tems.

Leg­ible Nor­ma­tivity for AI Align­ment: The Value of Silly Rules (Dy­lan Had­field-Menell et al): One is­sue we might have with value learn­ing is that our AI sys­tem might look at “silly rules” and in­fer that we care about them deeply. For ex­am­ple, we of­ten en­force dress codes through so­cial pun­ish­ments. Given that dress codes do not have much func­tional pur­pose and yet we en­force them, should an AI sys­tem in­fer that we care about dress codes as much as we care about (say) prop­erty rights? This pa­per claims that these “silly rules” should be in­ter­preted as a co­or­di­na­tion mechanism that al­lows group mem­bers to learn whether or not the group rules will be en­forced by neu­tral third par­ties. For ex­am­ple, if I vi­o­late the dress code, no one is sig­nifi­cantly harmed but I would be pun­ished any­way—and this can give ev­ery­one con­fi­dence that if I were to break an im­por­tant rule, such as steal­ing some­one’s wallet, by­stan­ders would pun­ish me by re­port­ing me to the po­lice, even though they are not af­fected by my ac­tions and it is a cost to them to re­port me.

They for­mal­ize this us­ing a model with a pool of agents that can choose to be part of a group. Agents in the group play “im­por­tant” games and “silly” games. In any game, there is a scofflaw, a vic­tim, and a by­stan­der. In an im­por­tant game, if the by­stan­der would pun­ish any rule vi­o­la­tions, then the scofflaw fol­lows the rule and the vic­tim gets +1 util­ity, but if the by­stan­der would not pun­ish the vi­o­la­tion, the scofflaw breaks the rule and the vic­tim gets −1 util­ity. Note that in or­der to sig­nal that they would pun­ish, by­stan­ders must pay a cost of c. A silly game works the same way, ex­cept the vic­tim always gets 0 util­ity. Given a set of im­por­tant rules, the main quan­tity of in­ter­est is how many silly rules to add. The au­thors quan­tify this by con­sid­er­ing the pro­por­tion of all games that are silly games, which they call the den­sity. Since we are imag­in­ing adding silly rules, all out­comes are mea­sured with re­spect to the num­ber of im­por­tant games. We can think of this as a proxy for time, and in­deed the au­thors call the ex­pected num­ber of games till an im­por­tant game a timestep.

Now, for important games the expected utility to the victim is positive if the probability that the bystander is a punisher is greater than 0.5. So, each of the agents cares about estimating this probability in order to decide whether or not to stay in the group. If we only had important games, we would have a single game per timestep, and we would only learn whether one particular agent is a punisher. As we add more silly games, we get more games per timestep, and so we can learn the proportion of punishers much more quickly, which leads to more stable groups. However, the silly rules are not free. The authors prove that if they were free, we would keep adding silly rules and the density would approach 1. (More precisely, they show that as density goes to 1, the value of being told the true probability of punishment goes to 0, meaning that the agent already knows everything.)
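To make the arithmetic concrete, here is a rough Python sketch of the three quantities in play: the victim’s expected utility in an important game, the number of games per timestep at a given density, and how quickly an agent’s uncertainty about the proportion of punishers shrinks as it observes more games. The function names, and the choice of a uniform Beta(1, 1) prior over the proportion of punishers, are assumptions made for illustration; the paper’s own belief model may differ.

```python
import math

def expected_victim_utility(p_punisher: float) -> float:
    """Victim gets +1 if the bystander would punish (prob. p), else -1."""
    return p_punisher * 1.0 + (1.0 - p_punisher) * (-1.0)

def games_per_timestep(density: float) -> float:
    """Expected number of games until an important game, when a fraction
    `density` of all games are silly."""
    if not 0.0 <= density < 1.0:
        raise ValueError("density must be in [0, 1)")
    return 1.0 / (1.0 - density)

def posterior_mean_and_std(punishers_seen: int, games_seen: int):
    """Belief about the proportion of punishers after watching bystanders.

    Assumes a uniform Beta(1, 1) prior (an illustrative choice)."""
    a = 1 + punishers_seen
    b = 1 + games_seen - punishers_seen
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

if __name__ == "__main__":
    true_p = 0.7      # hypothetical true proportion of punishers
    timesteps = 10
    for density in (0.0, 0.5, 0.9):
        n = round(timesteps * games_per_timestep(density))
        k = round(true_p * n)  # idealised: observe exactly the true rate
        mean, std = posterior_mean_and_std(k, n)
        print(f"density={density}: games={n}, mean={mean:.3f}, std={std:.3f}")
```

The expected utility works out to 2p - 1, which is positive exactly when p > 0.5, and the posterior standard deviation shrinks as density rises because the same number of timesteps contains more observed games.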

They then present experimental results showing a few things. First, when the agents are relatively certain of the probability of an agent being a punisher, silly rules are not very useful and the group is more likely to collapse (since the cost of enforcing the silly rules starts to be important). Second, as long as c is low (so it is easy to signal that you will enforce rules), groups with more silly rules will be more resilient to shocks in individuals’ beliefs about the proportion of punishers, since they will very quickly converge to the right belief. If there aren’t any silly rules, it can take more time, and your estimate might be incorrectly low enough that you decide to leave the group even though group membership is still net positive. Finally, if the proportion of punishers drops below 0.5, making group membership net negative, agents in groups with high density will learn this faster, and their groups will disband much sooner.

Rohin’s opinion: I really like this paper; it’s a great concrete example of how systems of agents can have very different behavior than any one individual agent, even if each of the agents has similar goals. The idea makes intuitive sense and I think the model captures its salient aspects. There are definitely many quibbles you could make with the model (though perhaps it is the standard model, I don’t know this field), but I don’t think they’re important. My perspective is that the model is a particularly clear and precise way of communicating the effect that the authors are describing, as opposed to something that is supposed to track reality closely.

Tech­ni­cal AI alignment


Three AI Safety Re­lated Ideas and Two Ne­glected Prob­lems in Hu­man-AI Safety (Wei Dai): Sum­ma­rized in the high­lights!

Tech­ni­cal agen­das and prioritization

Multi-agent minds and AI al­ign­ment (Jan Kul­veit): This post ar­gues against the model of hu­mans as op­ti­miz­ing some par­tic­u­lar util­ity func­tion, in­stead fa­vor­ing a model based on pre­dic­tive pro­cess­ing. This leads to sev­eral is­sues with the way stan­dard value learn­ing ap­proaches like in­verse re­in­force­ment learn­ing work. There are a few sug­gested ar­eas for fu­ture re­search. First, we could un­der­stand how hi­er­ar­chi­cal mod­els of the world work (pre­sum­ably for bet­ter value learn­ing). Se­cond, we could try to in­vert game the­ory to learn ob­jec­tives in mul­ti­a­gent set­tings. Third, we could learn prefer­ences in mul­ti­a­gent set­tings, which might al­low us to bet­ter in­fer norms that hu­mans fol­low. Fourth, we could see what hap­pens if we take a sys­tem of agents, in­fer a util­ity func­tion, and then op­ti­mize it—per­haps one of the agents’ util­ity func­tions dom­i­nates? Fi­nally, we can see what hap­pens when we take a sys­tem of agents and give it more com­pu­ta­tion, to see how differ­ent parts scale. On the non-tech­ni­cal side, we can try to figure out how to get hu­mans to be more self-al­igned (i.e. there aren’t “differ­ent parts pul­ling in differ­ent di­rec­tions”).

Ro­hin’s opinion: I agree with the gen­eral point that figur­ing out a hu­man util­ity func­tion and then op­ti­miz­ing it is un­likely to work, but for differ­ent rea­sons (see the first chap­ter of the Value Learn­ing se­quence). I also agree that hu­mans are com­plex and you can’t get away with mod­el­ing them as Boltz­mann ra­tio­nal and op­ti­miz­ing some fixed util­ity func­tion. I wouldn’t try to make the model more ac­cu­rate (eg. a model of a bunch of in­ter­act­ing sub­agents, each with their own util­ity func­tion), I would try to make the model less pre­cise (eg. a sin­gle gi­ant neu­ral net), be­cause that re­duces the chance of model mis­speci­fi­ca­tion. How­ever, given the im­pos­si­bil­ity re­sult say­ing that you must make as­sump­tions to make this work, we prob­a­bly have to give up on hav­ing some nice for­mally speci­fied mean­ing of “val­ues”. I think this is prob­a­bly fine—for ex­am­ple, iter­ated am­plifi­ca­tion doesn’t have any ex­plicit for­mal value func­tion.
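For readers unfamiliar with the term, “Boltzmann rational” is usually taken to mean that the human picks action a with probability proportional to exp(β · U(a)), where U is a fixed utility function and β ≥ 0 controls how close to optimal the human is. A minimal Python sketch of that model follows; the function name and the toy utilities are illustrative and do not come from the post.

```python
import math
from typing import List, Sequence

def boltzmann_policy(utilities: Sequence[float], beta: float) -> List[float]:
    """Return P(a) proportional to exp(beta * U(a)) for each action a."""
    if not utilities:
        raise ValueError("need at least one action")
    if beta < 0:
        raise ValueError("beta is assumed to be non-negative")
    # Subtract the max utility before exponentiating, for numerical stability.
    top = max(utilities)
    weights = [math.exp(beta * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    toy_utilities = [1.0, 0.0, -1.0]  # hypothetical utilities for three actions
    for beta in (0.0, 1.0, 10.0):
        probs = boltzmann_policy(toy_utilities, beta)
        print(beta, [round(p, 3) for p in probs])
```

With β = 0 the modelled human acts uniformly at random, and as β grows the policy concentrates on the highest-utility action. This is the sense in which a single fixed utility function plus one noise parameter is a very constrained model of human behaviour.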

Re­ward learn­ing theory

Figur­ing out what Alice wants: non-hu­man Alice (Stu­art Arm­strong): We know that if we have a po­ten­tially ir­ra­tional agent, then in­fer­ring their prefer­ences is im­pos­si­ble with­out fur­ther as­sump­tions. How­ever, in prac­tice we can in­fer prefer­ences of hu­mans quite well. This is be­cause we have very spe­cific and nar­row mod­els of how hu­mans work: we tend to agree on our judg­ments of whether some­one is an­gry, and what anger im­plies about their prefer­ences. This is ex­actly what the the­o­rem is meant to pro­hibit, which means that hu­mans are mak­ing some strong as­sump­tions about other hu­mans. As a re­sult, we can hope to solve the value learn­ing prob­lem by figur­ing out what as­sump­tions hu­mans are already mak­ing and us­ing those as­sump­tions.

Ro­hin’s opinion: The fact that hu­mans are quite good at in­fer­ring prefer­ences should give us op­ti­mism about value learn­ing. In the frame­work of ra­tio­nal­ity with a mis­take model, we are try­ing to in­fer the mis­take model from the way that hu­mans in­fer prefer­ences about other hu­mans. This sidesteps the im­pos­si­bil­ity re­sult by fo­cus­ing on the struc­ture of the al­gorithm that gen­er­ates the policy. How­ever, it still seems like we have to make some as­sump­tion about how the struc­ture of the al­gorithm leads to a mis­take model, or a model for what val­ues are. Though per­haps we can get an an­swer that is prin­ci­pled enough or in­tu­itive enough that we be­lieve it.

Han­dling groups of agents

Leg­ible Nor­ma­tivity for AI Align­ment: The Value of Silly Rules (Dy­lan Had­field-Menell et al): Sum­ma­rized in the high­lights!

Mis­cel­la­neous (Align­ment)

Assuming we’ve solved X, could we do Y… (Stuart Armstrong): We often want to make assumptions that sound intuitive but that we can’t easily formalize, eg. “assume we’ve solved the problem of determining human values”. However, such assumptions can often be interpreted as being very weak or very strong, and depending on the interpretation we could be assuming away the entire problem, or making an assumption that doesn’t buy us anything. So, we should be more precise in our assumptions, or focus only on some precise properties of an assumption.

Ro­hin’s opinion: I think this ar­gu­ment ap­plies well to the case where we are try­ing to com­mu­ni­cate, but not so much to the case where I in­di­vi­d­u­ally am think­ing about a prob­lem. (I’m mak­ing this claim about me speci­fi­cally; I don’t know if it gen­er­al­izes to other peo­ple.) Com­mu­ni­ca­tion is hard and if the speaker uses some in­tu­itive as­sump­tion, chances are the listener will in­ter­pret it differ­ently from what the speaker in­tended, and so be­ing very pre­cise seems quite helpful. How­ever, when I’m think­ing through a prob­lem my­self and I make an as­sump­tion, I usu­ally have a fairly de­tailed in­tu­itive model of what I mean, such that if you ask me whether I’m as­sum­ing that prob­lem X is solved by the as­sump­tion, I could an­swer that, even though I don’t have a pre­cise for­mu­la­tion of the as­sump­tion. Mak­ing the as­sump­tion more pre­cise would be quite a lot of work, and prob­a­bly would not im­prove my think­ing on the topic that much, so I tend not to do it un­til I think there’s some in­sight and want to make the ar­gu­ment more rigor­ous. It seems to me that this is how most re­search progress hap­pens: by in­di­vi­d­ual re­searchers hav­ing in­tu­itions that they then make rigor­ous and pre­cise.

Near-term concerns

Fair­ness and bias

Pro­vid­ing Gen­der-Spe­cific Trans­la­tions in Google Trans­late (Melvin John­son)

Ma­chine ethics

Build­ing Ethics into Ar­tifi­cial In­tel­li­gence (Han Yu et al)

Build­ing Eth­i­cally Bounded AI (Francesca Rossi et al)

Mal­i­cious use of AI

FLI Signs Safe Face Pledge (Ariel Conn)

Other progress in AI

Re­in­force­ment learning

Off-Policy Deep Reinforcement Learning without Exploration (Scott Fujimoto et al) (summarized by Richard): This paper discusses off-policy batch reinforcement learning, in which an agent is trying to learn a policy from data which is not based on its own policy, and without the opportunity to collect more data during training. The authors demonstrate that standard RL algorithms do badly in this setting because they give unseen state-action pairs unrealistically high values, and lack the opportunity to update them. They propose to address this problem by only selecting actions from previously seen state-action pairs; they prove various optimality results for this algorithm in the MDP setting. To adapt this approach to the continuous control case, the authors train a generative model to produce likely actions (conditional on the state and the data batch) and then only select from the top n actions. Their batch-constrained Q-learning algorithm (BCQ) consists of that generative model, a perturbation model to slightly alter the top actions, and a value network and critic to perform the selection. When n = 1, BCQ resembles behavioural cloning, and when n → ∞, it resembles Q-learning. BCQ with n=10 handily outperformed DQN and DDPG on some MuJoCo experiments using batch data.
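The action-selection step described above can be sketched roughly as follows. This is a simplified illustration in plain Python, not the authors’ implementation: `generate`, `perturb` and `q_value` are hypothetical stand-ins for the trained generative model, perturbation model and value network, and the toy one-dimensional functions in the demo are made up.

```python
import random
from typing import Callable, Sequence

State = Sequence[float]
Action = float

def select_action(
    state: State,
    n: int,
    generate: Callable[[State], Action],
    perturb: Callable[[State, Action], Action],
    q_value: Callable[[State, Action], float],
) -> Action:
    """Sample n candidate actions that look like the batch data, nudge each
    one slightly, and return the candidate with the highest estimated value."""
    if n < 1:
        raise ValueError("need at least one candidate action")
    candidates = [perturb(state, generate(state)) for _ in range(n)]
    return max(candidates, key=lambda a: q_value(state, a))

if __name__ == "__main__":
    rng = random.Random(0)
    state = (0.0,)
    # Toy stand-ins: the batch data took actions near 0.2, the perturbation
    # is switched off, and the value estimate prefers actions near 1.0.
    generate = lambda s: rng.gauss(0.2, 0.05)
    perturb = lambda s, a: a
    q_value = lambda s, a: -(a - 1.0) ** 2
    for n in (1, 10, 100):
        print(n, round(select_action(state, n, generate, perturb, q_value), 3))
```

With a single candidate the agent simply imitates the data, while larger n lets the value estimate pick among more candidates. Every candidate still comes from the generative model, which is what keeps the agent away from state-action pairs it has never seen.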

Richard’s opinion: This is an in­ter­est­ing pa­per, with a good bal­ance of in­tu­itive mo­ti­va­tions, the­o­ret­i­cal proofs, and em­piri­cal re­sults. While it’s not di­rectly safety-re­lated, the broad di­rec­tion of com­bin­ing imi­ta­tion learn­ing and re­in­force­ment learn­ing seems like it might have promise. Re­lat­edly, I wish the au­thors had dis­cussed in more depth what as­sump­tions can or should be made about the source of batch data. For ex­am­ple, BCQ would pre­sum­ably perform worse than DQN when data is col­lected from an ex­pert try­ing to min­imise re­ward, and (from the pa­per’s ex­per­i­ments) performs worse than be­havi­oural clon­ing when data is col­lected from an ex­pert try­ing to max­imise re­ward. Most hu­man data an ad­vanced AI might learn from is pre­sum­ably some­where in be­tween those two ex­tremes, and so un­der­stand­ing how well al­gorithms like BCQ would work on it may be valuable.

Soft Ac­tor Critic—Deep Re­in­force­ment Learn­ing with Real-World Robots (Tuo­mas Haarnoja et al)

Deep learning

How AI Training Scales (Sam McCandlish et al): OpenAI has done an empirical investigation into the performance of AI systems, and found that the maximum useful batch size for a particular task is strongly influenced by the noise in the gradient. (Here, the noise in the gradient comes from the fact that we are using stochastic gradient descent: any difference in the gradients across batches counts as “noise”.) They also found some preliminary results showing that more powerful ML techniques tend to have more gradient noise, and that even a single model tends to have increased gradient noise over time as it gets better at the task.

Ro­hin’s opinion: While OpenAI doesn’t spec­u­late on why this re­la­tion­ship ex­ists, it seems to me that as you get larger batch sizes, you are im­prov­ing the gra­di­ent by re­duc­ing noise by av­er­ag­ing over a larger batch. This pre­dicts the re­sults well: as the task gets harder and the noise in the gra­di­ents gets larger, there’s more noise to get rid of by av­er­ag­ing over data points, and so there’s more op­por­tu­nity to have even larger batch sizes.
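The averaging intuition can be checked with a toy simulation. The sketch below assumes a one-dimensional gradient whose per-example noise is independent and Gaussian, which is a simplification chosen for illustration rather than anything measured in the OpenAI post; all names are hypothetical.

```python
import math
import random
import statistics

def minibatch_gradient(true_grad, noise_std, batch_size, rng):
    """Average `batch_size` noisy per-example gradients (1-D toy model)."""
    samples = [true_grad + rng.gauss(0.0, noise_std) for _ in range(batch_size)]
    return sum(samples) / batch_size

def empirical_noise(true_grad, noise_std, batch_size, trials, rng):
    """Standard deviation of the minibatch gradient across many batches."""
    grads = [
        minibatch_gradient(true_grad, noise_std, batch_size, rng)
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)

def predicted_noise(noise_std, batch_size):
    """With independent noise, averaging B samples divides the std by sqrt(B)."""
    return noise_std / math.sqrt(batch_size)

if __name__ == "__main__":
    rng = random.Random(0)
    true_grad, noise_std = 1.0, 4.0  # made-up signal and noise levels
    for b in (1, 4, 16, 64):
        measured = empirical_noise(true_grad, noise_std, b, 2000, rng)
        print(b, round(measured, 3), round(predicted_noise(noise_std, b), 3))
```

Under these assumptions, the noise falls as 1/sqrt(B), so a larger batch keeps helping until the noise is small relative to the true gradient. The noisier the per-example gradients, the larger the batch has to be before that point is reached.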