Conclusion to ‘Reframing Impact’

Epistemic Status

I’ve made many claims in these posts. All views are my own.

  • AU theory describes how people feel impacted. I’m darn confident (95%) that this is true.

  • Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn’t realistic for real-world agents. However, I think they’re still informative. There are also strong intuitive arguments for power-seeking.

  • CCC is true. Fairly confident (70%). There seems to be a dichotomy between “catastrophe directly incentivized by goal” and “catastrophe indirectly incentivized by goal through power-seeking”, although Vika provides intuitions in the other direction.

  • AUP prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).

  • Some version of AUP solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents. Leaning towards yes (65%).

  • For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%).

  • There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).


After ~700 hours of work over the course of ~9 months, the sequence is finally complete.

This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.

Appendix: Easter Eggs

The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader.

There are a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn’t quite rise to the level of “easter egg”.

Reframing Impact

The bird’s nest contains a literal easter egg.

The paperclip-Balrog drawing contains a Tengwar inscription which reads “one measure to bind them”, with “measure” in impact-blue and “them” in utility-pink.

“Towards a New Impact Measure” was the title of the post in which AUP was introduced.

Attainable Utility Theory: Why Things Matter

This style of maze is from the video game Undertale.

Seeking Power is Instrumentally Convergent in MDPs

To seek power, Frank is trying to get at the Infinity Gauntlet.

The tale of Frank and the orange Pebblehoarder

Speaking of under-tales, a friendship has been blossoming right under our noses.

After the Pebblehoarders suffer the devastating transformation of all of their pebbles into obsidian blocks, Frank generously gives away his favorite pink marble as a makeshift pebble.

The title cuts to the middle of their adventures together, the Pebblehoarder showing its gratitude by helping Frank reach things high up.

This still at the midpoint of the sequence is from the final scene of The Hobbit: An Unexpected Journey, where the party is overlooking Erebor, the Lonely Mountain. They’ve made it through the Misty Mountains, only to find Smaug’s abode looming in the distance.

And, at last, we find Frank and the orange Pebblehoarder popping some of the champagne from Smaug’s hoard.

Since Erebor isn’t close to Gondor, we don’t see Frank and the Pebblehoarder gazing at Ephel Dúath from Minas Tirith.