Two Neglected Problems in Human-AI Safety

In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed hybrid approach, and I think they need to be addressed by other AI safety approaches that currently do not take them into account.

1. How to prevent “aligned” AIs from unintentionally corrupting human values?

We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason to expect that human value functions are free of similar problems, which even “aligned” AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly, or put them in such novel situations, that their moral development can’t keep up and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request, or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

(Some of these issues, like the invention of new addictions and of new technologies in general, would arise even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies ahead of progress in understanding how to handle them safely.)
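To make the adversarial-example analogy concrete, here is a minimal, hypothetical sketch of what “searching for adversarial examples on a value function” could look like if human preferences were approximated by a learned value model. Everything here (the `value_model`, the situation vector, the step sizes) is an illustrative assumption, not something from the original argument:

```python
# Hypothetical sketch: gradient-ascent search for inputs a learned value model
# rates very highly -- the analogue of an adversarial example that exploits
# quirks of the model rather than reflecting what we really want.
import torch

def find_adversarial_situation(value_model, situation, steps=100, step_size=0.01):
    """Search near `situation` for a point the value model scores as extremely good."""
    x = situation.clone().detach().requires_grad_(True)
    for _ in range(steps):
        score = value_model(x)              # model's (scalar) estimate of how much we value x
        score.backward()
        with torch.no_grad():
            x += step_size * x.grad.sign()  # FGSM-style step toward higher predicted value
            x.grad.zero_()
    return x.detach()
```

The point of the sketch is only that an optimizer pointed at a proxy for our values will, by default, drift toward whatever inputs the proxy mis-scores most favorably, whether or not those inputs are actually good for us.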

2. How to defend against intentional attempts by AIs to corrupt human values?

It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned, or aligned to other owners or users. In such a world there’s a strong incentive to use one’s own AIs to manipulate other people’s values in a direction that benefits oneself (even if the resulting losses to others are greater than the gains to oneself).

There is an apparent asymmetry between attack and defense in this arena. Manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check whether the target has accepted the values you’re trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do. Teaching or programming an AI to help defend against such manipulation seems much harder, because it’s unclear how to distinguish manipulation from useful information or discussion. (One way to defend against such manipulation would be to cut off all outside contact, including from other humans, because we don’t know whether they are just being used as other AIs’ mouthpieces, but that would be highly detrimental to one’s own moral development.)
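A toy sketch of the asymmetry, under the assumption that the attacker has some measurable acceptance signal to optimize against. The functions below (`generate_variants`, `acceptance_rate`) are hypothetical placeholders, not a real API:

```python
# Toy illustration of why manipulation is "a straightforward optimization problem":
# the attacker only needs a measurable acceptance signal to hill-climb against.
import random

def optimize_manipulation(initial_message, generate_variants, acceptance_rate, rounds=50):
    """Black-box hill climbing: keep whichever message variant scores highest on
    the easy-to-measure objective (did the target adopt the pushed values, or
    start behaving more favorably toward the attacker?)."""
    best_msg, best_score = initial_message, acceptance_rate(initial_message)
    for _ in range(rounds):
        candidate = random.choice(generate_variants(best_msg))
        score = acceptance_rate(candidate)
        if score > best_score:
            best_msg, best_score = candidate, score
    return best_msg

# The defender has no analogous loop to run: there is no clean
# is_manipulation(message) signal to optimize against, because the same message
# can be manipulation or legitimate persuasion depending on facts about the
# target's values that we don't know how to specify.
```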

There’s also an asymmetry between AIs with simple utility functions (either unaligned, or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others’ values while protecting their own.