The reward function is already how well you manipulate humans

We have many, many parables about the dangers of a superintelligent or supremely capable AI that wreaks ruin by maximizing a simple reward function. “Make paperclips” seems to be a popular one[1]. I worry that we don’t give deep consideration to a more realistic scenario: the world that arises from super-capable AIs that have “manipulate humans to do X” as the reward function.

This is a real worry because many human domains already have this as their reward function. The visual art that best captures human attention and manipulates emotion is what is sought out, copied, sold, and remembered. The novels that use language most effectively to manipulate human emotion and attention are what sell, gather acclaim, and spread through the culture. The advertisements that most successfully grab human attention and steer emotion and thought toward a purchase earn the most money. I’m sure anyone reading this can easily add more categories to the list.

An ML model that can manipulate humans at a superhuman level in any of these domains would generate great wealth for its owner. I would argue there is considerable evidence that humans can be manipulated. The question is: how good are humans at this manipulation? Is there enough untapped headroom in manipulative power that an AI could access it to perform superhuman manipulation?

Fiction already explores the concept of superhuman entertainment. “The Entertainment”[2] of Infinite Jest and Monty Python’s “Funniest Joke in the World”[3] both present media that can literally kill people through extreme emotional manipulation. In the real world, video games already seem to be entertaining enough to kill some people.[4] Could an AI create content so engaging that it causes dehydration, starvation, sleep deprivation, or worse? I can perhaps too easily imagine a book or show with a perfect cliffhanger for every chapter or episode, where I always want to read or watch just a little bit more. With a superhuman author, such content could sustain that pattern forever.

In a more commercial vein, imagine a company that sells ads and also has the best AI research in the world. This company trains a model that can create and serve ads such that 90% of people who view an ad buy the product. This would be a direct transfer of wealth, draining individuals’ bank accounts and enriching the ad company. Could a superhuman AI present ads so appealing, for products that seem so necessary, that individuals would spend all their savings and borrow to buy more?

Does this sound impossible? I think about the times I have seen an ad and said, “Hmmm, I really DO need that now.” A discount airfare right when I’m thinking about a trip, the perfect gift for my spouse a week before her birthday, tools for the hobby I’m considering: all cases where the ad led to a purchase that was not necessarily going to happen without it. Sometimes the bill at the end of the month surprised me.

Is it really possible for an AI model to manipulate humans to the extent I explore above? My fear is that it is more than possible; it is relatively easy. Humans have evolved with many, many hooks for emotional manipulation. This entire community is built around the idea that it is difficult to overcome our biases, and that the best we can hope for is to be less wrong. Such an AI would have enormous amounts of training data, and reinforcement is easy: people seek out and interact with emotionally manipulative media constantly, so every click, view, and purchase is a fresh reward signal.
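To make the shape of that loop concrete, here is a minimal, purely illustrative sketch in Python. Every name in it (ContentPolicy, serve_to_users, the reward weights) is a hypothetical placeholder, not any real system’s API, and real recommendation and ad pipelines are far more sophisticated. The point is only the structure: generate content, serve it, measure how strongly humans reacted, and update toward whatever reacted strongest.

```python
# A minimal sketch (not anyone's actual system) of the training loop
# described above: the reward signal is nothing but measured human
# engagement, so "manipulate humans better" is the only thing optimized.
# ContentPolicy, serve_to_users, and the reward weights are hypothetical.

import random

class ContentPolicy:
    """Stand-in for a generative model that produces a piece of content."""
    def __init__(self):
        self.params = [random.random() for _ in range(8)]

    def generate(self):
        # In a real system this would be a headline, ad, episode, etc.
        return {"params": list(self.params)}

    def update(self, reward, learning_rate=0.01):
        # Crude stand-in for a policy-gradient step: nudge each parameter
        # in a random direction, scaled by the observed reward. A real
        # system would use actual gradients; the shape of the loop is
        # what matters here.
        self.params = [p + learning_rate * reward * (random.random() - 0.5)
                       for p in self.params]

def serve_to_users(content):
    """Hypothetical deployment step: returns observed engagement metrics."""
    return {
        "click_through": random.random(),   # did humans click?
        "watch_minutes": random.random(),   # how long did they stay?
        "purchases": random.random(),       # did they buy?
    }

def engagement_reward(metrics):
    # The entire objective: a weighted sum of how strongly humans reacted.
    # Nothing here asks whether the reaction was good for the human.
    return (1.0 * metrics["click_through"]
            + 0.5 * metrics["watch_minutes"]
            + 2.0 * metrics["purchases"])

policy = ContentPolicy()
for step in range(1000):
    content = policy.generate()
    metrics = serve_to_users(content)
    policy.update(engagement_reward(metrics))
```

Notice that nothing in engagement_reward asks whether the reaction was good for the viewer; “reacted more strongly” is the entire objective.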

Is there anything we can do? Personally, I am watching how I use different media. I keep a large backlog of “safe” entertainment: books, CDs, old games. When my use of new media crosses a time threshold, I plan to cut myself off from new (particularly online) entertainment and consume only the old media. I fear entertainment the most, because that is where I know my own weakness lies. I think it is worthwhile to consider where your own weakness is, and to prepare.

  1. ^ The paperclip-maximizer thought experiment.
  2. ^ The lethally entertaining film in David Foster Wallace’s Infinite Jest.
  3. ^ The Monty Python sketch in which a joke is so funny that hearing it is fatal.
  4. ^ News reports of deaths attributed to marathon video game sessions.