Two Neglected Problems in Human-AI Safety

In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed hybrid approach, and I think they need to be addressed by other AI safety approaches that currently do not take them into account.

1. How to prevent “aligned” AIs from unintentionally corrupting human values?

We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason to expect that human value functions are immune to similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly, or put them in such novel situations, that their moral development can’t keep up and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request, or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
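To make the analogy concrete, here is a toy sketch (my own construction, not from the post) of the underlying failure mode: a proxy model is fit to a value function on familiar situations, and an optimizer searching for the highest-valued situation is driven far off-distribution, where the proxy’s judgments are meaningless. The `tanh` “true value” and linear proxy are purely illustrative.

```python
import numpy as np

# Illustrative stand-in for a human value function with diminishing
# returns: true_value(x) = tanh(x), observed only on "familiar"
# situations x in [-1, 1].
def true_value(x):
    return np.tanh(x)  # saturates: more of a good thing is barely better

# Fit a proxy model to the familiar region. It extrapolates linearly,
# so it wildly overestimates value far outside the training data.
X = np.linspace(-1.0, 1.0, 50)
slope, intercept = np.polyfit(X, true_value(X), deg=1)
proxy = lambda x: slope * x + intercept

# An optimizer searching for "what the human most values" hill-climbs
# on the proxy -- in effect finding an adversarial example for it.
x = 0.0
for _ in range(1000):
    x += 0.1 * slope  # gradient of the linear proxy

print(f"optimizer ends at x = {x:.1f}")
print(f"proxy value: {proxy(x):.2f}, true value: {true_value(x):.2f}")
```

The optimizer ends up in a situation the proxy rates as enormously valuable while the true value function has long since saturated; the analogous worry is that an AI optimizing a model of our preferences ends up in exactly this kind of off-distribution regime.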

(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to safely handle them.)

2. How to defend against intentional attempts by AIs to corrupt human values?

It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned or aligned to other owners or users. In such a world there’s a strong incentive to use one’s own AIs to manipulate other people’s values in a direction that benefits oneself (even if the resulting losses to others are greater than the gains to oneself).

There is an apparent asymmetry between attack and defense in this arena. Manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check whether the target has accepted the values you’re trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do. But teaching or programming an AI to help defend against such manipulation seems much harder, because it’s unclear how to distinguish between manipulation and useful information or discussion. (One way to defend against such manipulation would be to cut off all outside contact, including from other humans, since we don’t know whether they are just being used as other AIs’ mouthpieces, but that would be highly detrimental to one’s own moral development.)

There’s also an asymmetry between AIs with simple utility functions (either unaligned, or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others’ values and protect one’s own.