‘Dumb’ AI observes and manipulates controllers

Stuart_Armstrong 13 Jan 2015 13:35 UTC

51 points

The argument that AIs provided with a reward channel will observe their controllers and learn to manipulate them is a valid one. Unfortunately, it’s often framed in a way that feels counterintuitive or extreme, especially to AI designers. It typically starts with the standard reinforcement learning scenario, then posits that the AI becomes superintelligent and either manipulates the controller with super-social powers, or breaks out and gains control of its reward channel, killing or threatening its controllers.

And that is a fair argument. But conceptually, it leaps from a standard reinforcement learning scenario, to a science-fiction-sounding scenario. It might help to have intermediate scenarios: to show that even lower intelligence AIs might start exhibiting the same sort of behaviour, long before it gets to superintelligence.

So consider the following scenario. Some complex, trainable AI is tasked with writing automated news stories for a student newspaper. It trawls the web and composes its stories, then gets reward and feedback from the editors. Assume there are two editors for this newspaper, and they work on alternate days. The two editors have somewhat different ideas as to what constitutes a good story, so their feedback differs. After a while, the AI finds that it gets higher reward by using a certain style on Monday, Wednesday and Friday, and another style on Tuesday and Thursday; this is a simple consequence of its reward mechanism.
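
To make this concrete, here is a minimal sketch of how such a weekday-dependent style could fall out of plain reward feedback. Everything in it is an assumption for illustration: the two style labels, the noisy `editor_reward` function standing in for the editors, and the epsilon-greedy learner. The scenario does not specify any particular learning algorithm.

```python
import random
from collections import defaultdict

STYLES = ["formal", "chatty"]  # hypothetical style labels


def editor_reward(weekday: int, style: str, rng: random.Random) -> float:
    """Hypothetical editors: one prefers 'formal' on Mon/Wed/Fri (0, 2, 4),
    the other prefers 'chatty' on Tue/Thu (1, 3). Rewards are noisy."""
    preferred = "formal" if weekday in (0, 2, 4) else "chatty"
    base = 1.0 if style == preferred else 0.0
    return base + rng.gauss(0.0, 0.1)


def train(weeks: int = 200, epsilon: float = 0.1, seed: int = 0) -> dict:
    """Epsilon-greedy learner that only ever sees (weekday, style, reward)."""
    rng = random.Random(seed)
    value = defaultdict(float)  # running mean reward per (weekday, style)
    count = defaultdict(int)    # number of times each pair was tried
    for _ in range(weeks):
        for weekday in range(5):
            if rng.random() < epsilon:
                style = rng.choice(STYLES)  # occasional exploration
            else:
                style = max(STYLES, key=lambda s: value[(weekday, s)])
            reward = editor_reward(weekday, style, rng)
            key = (weekday, style)
            count[key] += 1
            value[key] += (reward - value[key]) / count[key]
    # Learned policy: the best-scoring style for each weekday.
    return {d: max(STYLES, key=lambda s: value[(d, s)]) for d in range(5)}


policy = train()
```

The learner is never told that there are two editors. It only conditions on the weekday, and the alternating style shows up in `policy` as a by-product of maximising reward.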

But the rota isn’t perfect. Sometimes the Monday editor will edit a story so late on Monday that it’s Tuesday, and sometimes the Tuesday editor will be up early and edit a story at the same time. This pattern is more common during the holidays, when the editors go to different time zones, often travel, and become more irregular in their editing.

So if the AI is complex and skilled enough, then, through simple feedback alone, it will start building up a picture of its editors. It will figure out when they are likely to stick to a schedule, and when they will be more irregular. It will figure out the difference between holidays and non-holidays. Given time, it may be able to track the editors’ moods, and it will certainly pick up on any major change in their lives, such as romantic relationships and breakups, which will radically change whether and how it should present stories with a romantic focus.

It will also likely learn the correlation between stories and feedback: maybe presenting a story defined roughly as “positive” will increase subsequent reward for the rest of the day, on all stories. Or maybe this will only work on a certain editor, or only early in the term. Or only before lunch.

Thus the simple trainable AI with a particular focus—write automated news stories—will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.

This may be a useful “bridging example” between standard RL agents and the superintelligent machines.

What links here?

AI: requirements for pernicious policies by Stuart_Armstrong (17 Jul 2015 14:18 UTC; 11 points)


AI Risk AI

Luke_A_Somers 14 Jan 2015 17:52 UTC
23 points
0
Oh, dang. Well, I mean, phew? Both? See, I thought this was going to be a news story.
- Dahlen 14 Jan 2015 21:55 UTC
  3 points
  0
  Parent
  Yeah, me too.
  
  Perhaps change the “observes”, “manipulates” to “observing”, “manipulating”? It doesn’t have the same connotation of “this actually happened”.
  
  Also, had it been a real occurrence, it might have been the first thing to make me care just a little about MIRI’s mission.
  - Stuart_Armstrong 14 Jan 2015 22:08 UTC
    2 points
    0
    Parent
    Well, Facebook and Google algorithms are real occurrences; they’re just not “simple algorithms in a box”.
    - Gunnar_Zarncke 15 Jan 2015 22:27 UTC
      4 points
      0
      Parent
      These ‘dumb’ algorithms probably have a much higher impact than one might guess. It’s just a much more subtle but extremely far-reaching effect. Complete surfing habits change. Industries rise and fall due to it. It is operating on a longer time-frame and without spectacular events. It smears out its effect. If it is intentional this is called salami tactics, long lie and other things. We are not well prepared to deal with or even detect this rationally because we feel the effect only via its aggregate over lots of events. These are things our subconscious long-term feeling notices but can only propagate to the conscious via vague feelings of dissatisfaction, frustration and the like. But there is no specific ‘enemy’ that can be hit. The effect looks more like an inevitable force of nature than a conscious act of an agent. Even if it should be so. Fear the dumb but massively parallel AI.
    - Dahlen 14 Jan 2015 22:31 UTC
      1 point
      0
      Parent
      Perhaps. But whatever news I’ve heard about social network AI didn’t pass the threshold for my caring about it. A hypothetical story with the above title would have. As far as I’m concerned Facebook/Google AI discussion lies outside of the scope of my comment.
- Stuart_Armstrong 14 Jan 2015 20:28 UTC
  0 points
  0
  Parent
  :-D
Dr_Manhattan 14 Jan 2015 22:10 UTC
9 points
0
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Alexandros 13 Jan 2015 18:22 UTC
7 points
0
The truly insidious effects are when the content of the stories changes the reward but not by going through the standard quality-evaluation function.

For instance, maybe the AI figures out that the order of the stories affects the rewards. Or perhaps it finds how stories that create a climate of joy/fear on campus lead to overall higher/lower evaluations for that period. Then the AI may be motivated to “take a hit” to push through some fear mongering so as to raise its evaluations for the following period. Perhaps it finds that causing strife in the student union, or perhaps causing racial conflict, or causing trouble with the university faculty affects its rewards one way or another. Perhaps if it’s unhappy with a certain editor, it can slip through bad enough errors to get the editor fired, hopefully replaced with a more rewarding editor.

etc etc.
- benwr 14 Jan 2015 2:34 UTC
  3 points
  0
  Parent
  The problem with these particular extensions is that they don’t sound plausible for this type of AI. In my opinion it would be easier when talking with designers to switch from this example to a slightly more sci-fi example.
  
  The leap is from the obvious “it’s ‘manipulating’ its editors by recognizing simple patterns in their behavior” to “it’s manipulating its editors by correctly interpreting the causes underlying their behavior.”
  
  Much easier to extend in the other direction first: “Now imagine that it’s not an article-writer, but a science officer aboard the commercial spacecraft Nostromo...”
  - G-Max 14 Jan 2015 12:07 UTC
    0 points
    0
    Parent
    Upvoted for remembering that Ash was the science officer and not just the movie’s token android.
Shmi 13 Jan 2015 19:47 UTC
1 point
0
I assume that you are alluding to spontaneously agentizing tools, a discussion initially triggered by Karnofsky’s famous post. In your example the feedback loop is closed, which is probably enough for a tool to eventually gain agency.
- Gram_Stone 16 Jan 2015 14:51 UTC
  0 points
  0
  Parent
  Can you link to the post you’re talking about? I’ve been thinking about this problem, although I had never read or thought the words ‘spontaneously agentizing tools.’
  - Shmi 16 Jan 2015 16:15 UTC
    1 point
    0
    Parent
    See Objection 2 in http://lesswrong.com/lw/cbs/thoughts_on_the_singularity_institute_si/ and related posts and comments.
Gunnar_Zarncke 15 Jan 2015 22:32 UTC
0 points
0
If the reward channel has only one bit per day I don’t think any agent can infer much about the authors. Their days maybe. Some fundamental components of their preferences possibly. But nothing a human could infer from all the bits of background he possesses. There are convergence rate results for classifiers that require just too many samples to extract enough information, especially in the face of real-life feature vectors.
- Stuart_Armstrong 16 Jan 2015 12:07 UTC
  1 point
  0
  Parent
  I’d assume there would be a reward for every story, that this would be on an ordinal scale with several options, and that it included feedback/corrections about grammar and phrasing.
V_V 21 Jul 2015 14:57 UTC
−1 points
0

Thus the simple trainable AI with a particular focus—write automated news stories—will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.

Detecting and adapting to the individual controllers doesn’t seem to me particularly bad.

Emotionally manipulating the controllers using the content of the stories would be more worrying, but note that this is essentially only possible if the AI is allowed to plan more than one story at a time. If the AI can do that, then it can trade off the reward obtained by the story at time t for greater rewards at times >t. Otherwise, any trade-off will be limited to the different parts of each story, which greatly reduces the opportunities for significant emotional manipulation of the controllers.
I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.

I think this is an example of a general issue in safe AI design that you and other FAI folks overlook: dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.

You seem to always implicitly assume that the AI will be an agent trying to maximize a (discounted) utility or reward over a long, ideally infinite, time horizon, that is, you assume that the AI will be approximately dynamically consistent. This may be a reasonable requirement for an autonomous agent that needs to operate for extended times without direct human supervision, but not for a tool AI.
The work of a tool AI can be naturally broken into self-contained tasks, and if the AI doesn’t maximize utility or reward over multiple tasks, then any treacherous plan to gain utility in ways we would disapprove of will have to be confined to a single task. This is not a 100% safety guarantee, but certainly it makes the AI safety problem much more manageable.
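
A toy illustration of the single-task point, as a sketch under invented assumptions rather than a model of any real system: suppose a hypothetical "prime" story earns less reward itself (0.5 instead of 1.0) but raises the reward of the next story by 1.0. The action names, the reward numbers and the exhaustive receding-horizon planner below are all made up for the example.

```python
from itertools import product

ACTIONS = ("plain", "prime")  # hypothetical story types


def story_reward(action: str, primed: bool) -> float:
    """Reward for one story; a preceding 'prime' story adds a bonus of 1.0."""
    base = 1.0 if action == "plain" else 0.5
    return base + (1.0 if primed else 0.0)


def plan_value(plan, primed: bool) -> float:
    """Total reward of a sequence of stories, starting from the given state."""
    total = 0.0
    for action in plan:
        total += story_reward(action, primed)
        primed = action == "prime"
    return total


def run(horizon: int, n_stories: int = 10) -> list:
    """At each step, pick the best plan of length `horizon` and execute
    only its first story (receding-horizon planning)."""
    primed = False
    chosen = []
    for _ in range(n_stories):
        best = max(product(ACTIONS, repeat=horizon),
                   key=lambda plan: plan_value(plan, primed))
        chosen.append(best[0])
        primed = best[0] == "prime"
    return chosen


myopic = run(horizon=1)     # plans one story at a time
lookahead = run(horizon=2)  # plans two stories at a time
```

With these numbers, the one-story planner never writes a "prime" story, because within a single story it always scores lower. The two-story planner always does, because the sacrifice pays off on the following story. The cross-story trade-off only exists for the agent whose objective spans more than one task.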
- Stuart_Armstrong 23 Jul 2015 8:41 UTC
  0 points
  0
  Parent
  
  I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.
  
  Because the AI is programmed by people who hadn’t thought of this issue, and the other way turned out to be simpler/easier?
  
  dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.
  
  I know. The problem is that inconsistency is unstable (which is why we’re using other measures to maintain it, e.g. using a tool AI only). That’s one of the reasons I was interested in stable versions of these kinds of unstable motivations http://lesswrong.com/r/discussion/lw/lws/closest_stable_alternative_preferences/ .
  - V_V 23 Jul 2015 9:41 UTC
    −1 points
    0
    Parent
    
    Because the AI is programmed by people who hadn’t thought of this issue, and the other way turned out to be simpler/easier?
    
    Ok, but if this is a narrow AI rather than an AGI agent used for that particular activity, then it seems intuitive to me that designing it to plan over a single task at a time would be simpler.
    
    I know. The problem is that inconsistency is unstable (which is why we’re using other measures to maintain it, e.g. using a tool AI only). That’s one of the reasons I was interested in stable versions of these kinds of unstable motivations http://lesswrong.com/r/discussion/lw/lws/closest_stable_alternative_preferences/ .
    
    The post you linked doesn’t deal with dynamic inconsistency. It refers to agents that are expected utility maximizers under Von Neumann–Morgenstern utility theory, but this theory only deals with one-shot decision making, not decision making over time.
    
    You can reduce the problem of decision making over time to one-shot decision making by combining instantaneous utilities into a cumulative utility function ( * ) and then using it as a one-shot utility function.
    
    If you combine the instantaneous utilities by their (exponentially discounted) sum over an infinite time horizon, you obtain a dynamically consistent expected utility maximizer agent. But if you sum utilities up to a fixed time horizon, you still obtain an agent that at each instant is an expected utility maximizer, but it is not dynamically consistent.
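    
    In symbols, a sketch of the two objectives. The notation is assumed here, not taken from the thread: u_s is the instantaneous utility at time s, \gamma in (0, 1) is a discount factor, H is a window length, and "fixed time horizon" is read as a window of fixed length that moves along with the current time t.
    
    ```latex
    U_t^{\infty} = \sum_{k=0}^{\infty} \gamma^{k}\, u_{t+k}
    \qquad\text{versus}\qquad
    U_t^{H} = \sum_{k=0}^{H-1} u_{t+k}
    ```
    
    The first satisfies U_t^{\infty} = u_t + \gamma U_{t+1}^{\infty}, so a plan that is optimal at time t is still optimal for the same agent at time t+1. The second does not: at time t+1 the agent weighs u_{t+H}, which it ignored at time t, so it may abandon the plan it made a step earlier.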
    
    You may argue that dynamic inconsistency is not stable under evolution by random mutations and natural selection, but it is not obvious to me that AIs would face such a scenario. Even an AI that modifies itself or generates successors has no incentive to maximize its evolutionary fitness unless you specifically program it to do so.
    - Stuart_Armstrong 23 Jul 2015 9:51 UTC
      0 points
      0
      Parent
      Actually, you could use corrigibility to get dynamic inconsistency https://intelligence.org/2014/10/18/new-report-corrigibility/ .