Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
Abstract It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
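The core incentive described in the abstract can be illustrated numerically. The sketch below is not the paper's exact model, just a minimal assumed setup: R holds a belief over the utility u of its intended action and can act immediately, switch itself off, or wait and defer to a perfectly rational H who allows the action exactly when u > 0. Deferring is then worth E[max(u, 0)], which is never worse than acting (E[u]) or being off (0).

```python
import numpy as np

# Monte Carlo sketch of the off-switch incentive (illustrative, not the
# paper's formal game). R's belief over the utility u of its action is
# taken to be a standard normal, purely as an assumed example.
rng = np.random.default_rng(0)
u_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

value_act = u_samples.mean()                   # act without H's oversight: E[u]
value_off = 0.0                                # switch itself off
value_wait = np.maximum(u_samples, 0).mean()   # defer to a rational H: E[max(u, 0)]

# With any residual uncertainty about u, deferring weakly dominates.
assert value_wait >= value_act and value_wait >= value_off
```

Because max(u, 0) ≥ u and max(u, 0) ≥ 0 pointwise, the dominance of waiting holds for any belief distribution, not just the normal one assumed here; it is strict whenever R assigns positive probability to both signs of u.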
Tobias Wangberg, Mikael Böörs, Elliot Catt, Tom Everitt, Marcus Hutter
Abstract. The off-switch game is a game theoretic model of a highly intelligent robot interacting with a human. In the original paper by Hadfield-Menell et al. (2016b), the analysis is not fully game-theoretic as the human is modelled as an irrational player, and the robot’s best action is only calculated under unrealistic normality and soft-max assumptions. In this paper, we make the analysis fully game theoretic, by modelling the human as a rational player with a random utility function. As a consequence, we are able to easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
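The irrationality assumptions mentioned here can also be sketched numerically. The snippet below is an assumed toy version (a logistic/Boltzmann choice rule standing in for the soft-max human model, with a hypothetical noise parameter `beta`): as H becomes noisier, the information R gains from deferring shrinks, and so does the value of keeping the off switch available.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed belief over the action's utility u (standard normal, for illustration).
rng = np.random.default_rng(1)
u = rng.normal(size=100_000)

def value_wait(beta):
    # H allows R to act with Boltzmann probability sigmoid(u / beta):
    # beta -> 0 approximates a perfectly rational H (allow iff u > 0),
    # large beta approximates a noisy, nearly random H.
    return (u * sigmoid(u / beta)).mean()

# Deferring is worth more under a near-rational H than under a noisy one.
print(value_wait(0.1), value_wait(10.0))
```

With `beta = 0.1` the value of waiting is close to E[max(u, 0)]; with `beta = 10` it falls toward zero, illustrating why the robot's best action depends on the assumed irrationality model.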
I’m afraid this is theoretical rather than experimental, but it is a paper with a formalized problem.
https://www.aaai.org/ocs/index.php/WS/AAAIW17/paper/view/15156/14653
The Off-Switch Game
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
Corresponding slide for this paper hosted on MIRI’s site: https://intelligence.org/files/csrbai/hadfield-menell-slides.pdf
Response/Improvement to above paper:
A Game-Theoretic Analysis of The Off-Switch Game
Tobias Wangberg, Mikael Böörs, Elliot Catt, Tom Everitt, Marcus Hutter