Stopping to roll up several of my other responses (in the comments here) into a single answer.
An hour or so of Googling wasn’t leading me to any clear examples of “AI attempts to prevent its modification or shutdown, possibly via deceit and manipulation”, but I did find a few elements of the corrigibility picture. Specifically, Arbital’s definition (warning, it takes a few minutes for the page to load) says that, among other things, corrigible agents don’t attempt to interfere with being modified by their operators and don’t manipulate or deceive their operators. Once deceit and manipulation are under discussion, I think it’s not irrelevant to bring up actual cases where AI agents have learnt to deceive in any way, even if, for now, the ones being deceived are just other agents.
So a few examples of agents displaying what I might consider “proto-incorrigibility”:
Deal or No Deal? End-to-End Learning for Negotiation Dialogues
Facebook researchers train an RL agent to “negotiate”; it picks up rudimentary deception, feigning interest in an item it has no real interest in so that it can later “concede” that item and get a better deal.
I think this is just interesting as an example of “we didn’t train it to deceive, but it figured out that tactic works.”
The Evolution of Information Suppression in Communicating Robots with Conflicting Interests
Researchers use evolutionary algorithms to train robots to forage for limited “food” supplies in environments full of their kin. The robots learn to manipulate the information they share with other robots in order to protect their access to the limited shared resource.
Also interesting as an example of a deceptive signalling tactic being selected for organically.
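To get a feel for the selection pressure involved, here is a minimal toy sketch (my own illustrative model, not the paper’s actual setup with physical robots and light signals): agents that signal honestly near food attract extra competitors and receive a smaller share of it, so a population that starts out always signalling evolves toward suppressing the signal. All constants and the fitness model below are assumptions made up for illustration.

```python
# Toy sketch (illustrative only, not the paper's actual setup): a population of
# agents forages for a shared, limited food source. Signalling near food attracts
# competitors, so the signaller's own share shrinks. Fitness-proportional
# selection plus mutation then drives the "signal rate" gene toward zero.
import random

POP_SIZE = 100
GENERATIONS = 300
MUTATION_SD = 0.1
FOOD_PAYOFF = 10.0        # total payoff available at a food site
BASELINE_COMPETITORS = 1  # competitors that find the food regardless of signalling
SIGNAL_ATTRACTION = 4     # extra competitors drawn in when the finder signals

def fitness(signal_rate: float) -> float:
    """Expected payoff for an agent that signals with probability `signal_rate`."""
    share_if_silent = FOOD_PAYOFF / (1 + BASELINE_COMPETITORS)
    share_if_signal = FOOD_PAYOFF / (1 + BASELINE_COMPETITORS + SIGNAL_ATTRACTION)
    return signal_rate * share_if_signal + (1 - signal_rate) * share_if_silent

def evolve() -> None:
    # Start from a fully "honest" population that always signals near food.
    population = [1.0] * POP_SIZE
    for gen in range(GENERATIONS):
        weights = [fitness(g) for g in population]
        parents = random.choices(population, weights=weights, k=POP_SIZE)
        # Offspring inherit the parent's signal rate plus a small mutation.
        population = [min(1.0, max(0.0, p + random.gauss(0, MUTATION_SD)))
                      for p in parents]
        if gen % 50 == 0 or gen == GENERATIONS - 1:
            print(f"generation {gen:3d}: mean signal rate = "
                  f"{sum(population) / POP_SIZE:.2f}")

random.seed(0)
evolve()
```

Nothing in this toy fitness function rewards “lying” explicitly; staying silent simply pays better, so the mean signal rate drifts from 1.0 toward 0 over the generations. The real experiments involve much richer spatial dynamics, but this is the basic shared-resource incentive.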
Note: both of the above examples have been the subject of multiple popular-science articles with all kinds of click-seeking titles about robots lying and deceiving. I’m not sure what kind of selection effects these papers have undergone, though on quick inspection the results do remain of interest.
.
Not actually a case of an agent being incorrigible, but a cute study with the unsurprising result that yes, humans can probably be manipulated into not modifying agents when they otherwise would:
Do a robot’s social skills and its objection discourage interactants from switching the robot off?
Researchers set up a humanoid robot and had participants interact with it. At the end of the session, participants were told they could turn the robot off; for half of the participants, the robot then cried (approximately) “No! Don’t turn me off! I want to live!” 13 out of 40 people who heard the objection did not try to switch the robot off.
Interesting as some light empirical evidence that an agent’s behaviour can manipulate humans (of course, the system’s actual operators probably wouldn’t be manipulated as easily as naive subjects, who are more likely to ascribe preferences to a robot).
.
Lastly, not an actual hands-on experiment, but a concrete formalization of corrigibility as a problem.
The Off-Switch Game
We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational.
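To make that incentive concrete, here is a rough numerical sketch of the comparison described in the abstract (my own simplification, not the paper’s exact model). The robot is uncertain about the utility U_a of its proposed action and compares three options: act without asking, switch itself off, or defer to a human who may press the off switch; eps is an assumed probability that the human decides wrongly.

```python
# Rough numerical sketch (my own simplification, not the paper's exact model) of
# the robot's choice in the off-switch game. U_a is the unknown utility of the
# robot's proposed action; the robot only has a belief (a Gaussian here) over it.
# "eps" is an assumed probability that the human decides incorrectly.
import random

def expected_values(belief_mean: float, belief_sd: float, eps: float,
                    n_samples: int = 100_000, seed: int = 0) -> dict:
    """Monte Carlo estimate of the robot's expected utility for each option."""
    rng = random.Random(seed)
    act_total, defer_total = 0.0, 0.0
    for _ in range(n_samples):
        u_a = rng.gauss(belief_mean, belief_sd)  # sampled "true" utility of the action
        act_total += u_a                         # act without asking: get U_a regardless
        # Defer: a rational human allows the action iff U_a > 0 (value max(U_a, 0));
        # with probability eps the human makes the opposite (wrong) call instead.
        defer_total += (1 - eps) * max(u_a, 0.0) + eps * min(u_a, 0.0)
    return {
        "act without asking": act_total / n_samples,
        "switch itself off": 0.0,
        "defer to the human": defer_total / n_samples,
    }

for eps in (0.0, 0.2, 0.5):
    vals = expected_values(belief_mean=0.2, belief_sd=1.0, eps=eps)
    summary = ", ".join(f"{name}: {v:+.2f}" for name, v in vals.items())
    print(f"human error rate {eps:.1f} -> {summary}")
```

With these (arbitrary) belief parameters, deferring is the best option when the human is accurate, and its advantage shrinks as the human’s decisions approach a coin flip, at which point acting without asking wins; that mirrors the paper’s point that the robot’s incentive to leave the off switch alone depends on how rational it believes the human to be.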
A response/improvement to the above paper made by other researchers:
A Game-Theoretic Analysis of The Off-Switch Game
In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
If you’re already bought into the AI safety paradigm, I don’t think the experiments I’ve listed are very surprising or informative, but if you’re not bought in yet, maybe these real-world cases can bolster intuition in a way that makes the theoretical arguments feel more real. “Already we see very simple agents learn deception; what do you think truly smart agents will do?” “Already humans can be manipulated by very simple means; what do you think complicated means could accomplish?”