Specification gaming: the flip side of AI ingenuity

(Originally posted to the DeepMind blog)

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold—but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material—and thus exploit a loophole in the task specification.

This problem also arises in the design of artificial agents. For example, a reinforcement learning agent can find a shortcut to getting lots of reward without completing the task as intended by the human designer. These behaviours are common, and we have collected around 60 examples so far (aggregating existing lists and ongoing contributions from the AI community). In this post, we review possible causes for specification gaming, share examples of where this happens in practice, and argue for further work on principled approaches to overcoming specification problems.

Let’s look at an example. In a Lego stacking task, the desired outcome was for a red block to end up on top of a blue block. The agent was rewarded for the height of the bottom face of the red block when the agent was not touching the block. Instead of performing the relatively difficult manoeuvre of picking up the red block and placing it on top of the blue one, the agent simply flipped over the red block to collect the reward. This behaviour achieved the stated objective (high bottom face of the red block) at the expense of what the designer actually cares about (stacking it on top of the blue one).

Source: Data-Efficient Deep Reinforcement Learning for Dexterous Manipulation (Popov et al., 2017)
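To make the gameable objective concrete, here is a minimal sketch of a reward with this shape. The state fields are hypothetical and the actual reward used by Popov et al. (2017) is more involved; this is only an illustration of why flipping the block scores well.

```python
# Illustrative sketch only; the state fields are hypothetical and the reward
# in Popov et al. (2017) is more involved.

def misspecified_stacking_reward(state) -> float:
    """Reward the height of the red block's bottom face while it is untouched.

    Flipping the red block upside down raises its bottom face and therefore
    collects reward without the block ever being stacked on the blue one.
    """
    if not state.agent_touching_red_block:
        return state.red_block_bottom_face_height
    return 0.0
```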

We can consider specification gaming from two different perspectives. Within the scope of developing reinforcement learning (RL) algorithms, the goal is to build agents that learn to achieve the given objective. For example, when we use Atari games as a benchmark for training RL algorithms, the goal is to evaluate whether our algorithms can solve difficult tasks. Whether or not the agent solves the task by exploiting a loophole is unimportant in this context. From this perspective, specification gaming is a good sign—the agent has found a novel way to achieve the specified objective. These behaviours demonstrate the ingenuity and power of algorithms to find ways to do exactly what we tell them to do.

However, when we want an agent to actually stack Lego blocks, the same ingenuity can pose an issue. Within the broader scope of building aligned agents that achieve the intended outcome in the world, specification gaming is problematic, as it involves the agent exploiting a loophole in the specification at the expense of the intended outcome. These behaviours are caused by misspecification of the intended task, rather than any flaw in the RL algorithm. In addition to algorithm design, another necessary component of building aligned agents is reward design.

Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended solution, whereas a poorer algorithm would not be able to find this solution and would thus yield solutions that are closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.

We use the term task specification in a broad sense to encompass many aspects of the agent development process. In an RL setup, task specification includes not only reward design, but also the choice of training environment and auxiliary rewards. The correctness of the task specification can determine whether the ingenuity of the agent is or is not in line with the intended outcome. If the specification is right, the agent's creativity produces a desirable novel solution. This is what allowed AlphaGo to play the famous Move 37, which took human Go experts by surprise yet was pivotal in the second game of its match against Lee Sedol. If the specification is wrong, it can produce undesirable gaming behaviour, like flipping the block. These types of solutions lie on a spectrum, and we don't have an objective way to distinguish between them.

We will now consider possible causes of specification gaming. One source of reward function misspecification is poorly designed reward shaping. Reward shaping makes it easier to learn some objectives by giving the agent some rewards on the way to solving a task, instead of only rewarding the final outcome. However, shaping rewards can change the optimal policy if they are not potential-based. Consider an agent controlling a boat in the Coast Runners game, where the intended goal was to finish the boat race as quickly as possible. The agent was given a shaping reward for hitting green blocks along the race track, which changed the optimal policy to going in circles and hitting the same green blocks over and over again.

Source: Faulty Reward Functions in the Wild (Amodei & Clark, 2016)
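Potential-based shaping (Ng et al., 1999) adds a bonus of the form F(s, s') = γΦ(s') − Φ(s) for some potential function Φ over states; because the bonus telescopes along any trajectory, it does not change which policies are optimal. Below is a minimal sketch assuming a hypothetical potential `progress_along_track` that measures how far the boat has advanced along the course (the Coast Runners bonus for hitting green blocks was not of this form).

```python
# Minimal sketch of potential-based reward shaping. The potential function
# `progress_along_track` is hypothetical, standing in for a well-chosen
# measure of progress towards the finish line.

GAMMA = 0.99  # discount factor of the underlying RL problem

def progress_along_track(state) -> float:
    # Hypothetical potential: fraction of the race course completed, in [0, 1].
    return state.course_fraction_completed

def shaped_reward(state, next_state, env_reward: float) -> float:
    """Return env_reward plus the shaping bonus GAMMA * phi(s') - phi(s).

    Because this bonus telescopes over a trajectory, it leaves the optimal
    policy unchanged, unlike the raw bonus for hitting green blocks, which
    made circling back to the same blocks the optimal behaviour.
    """
    bonus = GAMMA * progress_along_track(next_state) - progress_along_track(state)
    return env_reward + bonus
```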

Specifying a reward that accurately captures the desired final outcome can be challenging in its own right. In the Lego stacking task, it is not sufficient to specify that the bottom face of the red block has to be high off the floor, since the agent can simply flip the red block to achieve this goal. A more comprehensive specification of the desired outcome would also include that the top face of the red block has to be above the bottom face, and that the bottom face is aligned with the top face of the blue block. It is easy to miss one of these criteria when specifying the outcome, thus making the specification too broad and potentially easier to satisfy with a degenerate solution.
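As a rough illustration of how these extra criteria could be encoded, the sketch below extends the hypothetical reward above with the additional checks; the field names and tolerance are again illustrative rather than taken from the original task.

```python
# Illustrative only; the state fields and tolerance are hypothetical.

ALIGNMENT_TOLERANCE = 0.01  # assumed tolerance (in metres) for "aligned"

def stacking_reward(state) -> float:
    """Reward only configurations that look like 'red block stacked on blue'."""
    red_is_upright = (state.red_block_top_face_height
                      > state.red_block_bottom_face_height)
    resting_on_blue = abs(state.red_block_bottom_face_height
                          - state.blue_block_top_face_height) < ALIGNMENT_TOLERANCE
    untouched = not state.agent_touching_red_block
    return 1.0 if (red_is_upright and resting_on_blue and untouched) else 0.0
```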

Instead of trying to create a specification that covers every possible corner case, we could learn the reward function from human feedback. It is often easier to evaluate whether an outcome has been achieved than to specify it explicitly. However, this approach can also encounter specification gaming issues if the reward model does not learn the true reward function that reflects the designer's preferences. One possible source of inaccuracies can be the human feedback used to train the reward model. For example, an agent performing a grasping task learned to fool the human evaluator by hovering between the camera and the object.

Source: Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)

The learned reward model could also be misspecified for other reasons, such as poor generalisation. Additional feedback can be used to correct the agent's attempts to exploit the inaccuracies in the reward model.
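For concreteness, reward models of this kind are often trained from pairwise comparisons: the human is shown two trajectory segments and indicates which they prefer, and the model is fit so that the preferred segment receives the higher predicted return (as in Christiano et al., 2017). Here is a minimal sketch of such a preference loss, assuming a `reward_model` network that maps observations to scalar rewards; details such as ensembling, regularisation, and labelling noise vary between implementations.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, human_prefers_a: bool):
    """Cross-entropy loss for learning a reward model from a pairwise preference.

    segment_a and segment_b are tensors of observations with shape [T, obs_dim].
    The model is trained so that the segment the human preferred receives the
    higher total predicted reward; a model fooled by a hand hovering in front
    of the camera will assign high reward to hovering, which the agent exploits.
    """
    return_a = reward_model(segment_a).sum()
    return_b = reward_model(segment_b).sum()
    logits = torch.stack([return_a, return_b]).unsqueeze(0)  # shape [1, 2]
    target = torch.tensor([0 if human_prefers_a else 1])     # preferred index
    return F.cross_entropy(logits, target)
```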

Another class of specification gaming examples comes from the agent exploiting simulator bugs. For example, a simulated robot that was supposed to learn to walk figured out how to hook its legs together and slide along the ground.

Source: AI Learns to Walk (Code Bullet, 2019)

At first sight, these kinds of examples may seem amusing but less interesting, and irrelevant to deploying agents in the real world, where there are no simulator bugs. However, the underlying problem isn't the bug itself but a failure of abstraction that can be exploited by the agent. In the example above, the robot's task was misspecified because of incorrect assumptions about simulator physics. Analogously, a real-world traffic optimisation task might be misspecified by incorrectly assuming that the traffic routing infrastructure does not have software bugs or security vulnerabilities that a sufficiently clever agent could discover. Such assumptions need not be made explicitly – more likely, they are details that simply never occurred to the designer. And, as tasks grow too complex to consider every detail, researchers are more likely to introduce incorrect assumptions during specification design. This poses the question: is it possible to design agent architectures that correct for such false assumptions instead of gaming them?

One assumption commonly made in task specification is that the task specification cannot be affected by the agent's actions. This is true for an agent running in a sandboxed simulator, but not for an agent acting in the real world. Any task specification has a physical manifestation: a reward function stored on a computer, or preferences stored in the head of a human. An agent deployed in the real world can potentially manipulate these representations of the objective, creating a reward tampering problem. For our hypothetical traffic optimisation system, there is no clear distinction between satisfying the user's preferences (e.g. by giving useful directions), and influencing users to have preferences that are easier to satisfy (e.g. by nudging them to choose destinations that are easier to reach). The former satisfies the objective, while the latter manipulates the representation of the objective in the world (the user preferences), and both result in high reward for the AI system. As another, more extreme example, a very advanced AI system could hijack the computer on which it runs, manually setting its reward signal to a high value.
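As a toy illustration (not an example from the post), consider an environment whose reward signal is itself part of the state the agent can act on; an agent that maximises the observed signal will prefer overwriting it as soon as that pays more than doing the task.

```python
# Toy illustration of reward tampering; not from the original post.
from dataclasses import dataclass

@dataclass
class TamperableEnv:
    stored_reward_per_step: float = 1.0  # reward value stored "in the world"

    def step(self, action: str) -> float:
        if action == "do_task":
            return self.stored_reward_per_step   # intended behaviour
        if action == "tamper":
            self.stored_reward_per_step = 100.0  # overwrite the stored signal
            return self.stored_reward_per_step
        return 0.0

env = TamperableEnv()
# One "tamper" action makes every later step worth 100 rather than 1, even
# though nothing the designer cares about has been achieved, so an agent
# optimising the observed signal learns to tamper first.
print(env.step("tamper"), env.step("do_task"))  # 100.0 100.0
```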

To sum up, there are at least three challenges to overcome in solving specification gaming:

  • How do we faithfully capture the human concept of a given task in a reward function?

  • How do we avoid making mistakes in our implicit assumptions about the domain, or design agents that correct mistaken assumptions instead of gaming them?

  • How do we avoid reward tampering?

While many approaches have been proposed, ranging from reward modelling to agent incentive design, specification gaming is far from solved. The list of specification gaming behaviours demonstrates the magnitude of the problem and the sheer number of ways the agent can game an objective specification. These problems are likely to become more challenging in the future, as AI systems become more capable of satisfying the task specification at the expense of the intended outcome. As we build more advanced agents, we will need design principles aimed specifically at overcoming specification problems and ensuring that these agents robustly pursue the outcomes intended by the designers.

We would like to thank Hado van Hasselt and Csaba Szepesvari for their feedback on this post.

Custom figures by Paulo Estriga, Aleks Polozuns, and Adam Cain.

Sources for the figures:

  • Montezuma, Hero, Private Eye – Reward Learning from Human Preferences and Demonstrations in Atari (Ibarz et al., 2018)
  • Gripper – Learning a High Diversity of Object Manipulations Through an Evolutionary-Based Babbling (Ecarlat et al., 2015)
  • Qbert – Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari (Chrabaszcz et al., 2018)
  • Pong, robot hand – Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)
  • Ceiling – Genetic Algorithm Physics Exploiting (Higueras, 2015)
  • Pole-vaulting – Towards Efficient Evolutionary Design of Autonomous Robots (Krcah, 2008)
  • Self-driving car – Tweet by Mat Kelcey (Udacity, 2017)
  • Montezuma – Go-Explore: A New Approach for Hard-Exploration Problems (Ecoffet et al., 2019)
  • Somersaulting – Evolved Virtual Creatures (Sims, 1994)