Risks from Learned Optimization

This is a sequence version of the paper "Risks from Learned Optimization in Advanced Machine Learning Systems" by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be, how will it differ from the loss function it was trained under, and how can it be aligned?

Risks from Learned Optimization: Introduction

Conditions for Mesa-Optimization

The Inner Alignment Problem

Deceptive Alignment

Risks from Learned Optimization: Conclusion and Related Work