Intro to Brain-Like-AGI Safety

26 Jan 2022 3:51 UTC

Suppose we someday build an Artificial General Intelligence algorithm using principles of learning and cognition similar to those of the human brain. How would we use such an algorithm safely?

I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front line of unsolved problems as I see them.

If this whole thing seems weird or stupid, you should start right in on Post #1, which contains definitions, background, and motivation. Then Posts #2–#7 are the neuroscience, arguing for a picture of the brain that combines large-scale learning algorithms (e.g. in the cortex) and specific evolved reflexes (e.g. in the hypothalamus and brainstem). Posts #8–#15 apply those neuroscience ideas directly to AGI safety, ending with a list of open questions and advice for getting involved in the field.

A major theme will be that the human brain runs a yet-to-be-invented variation on Model-Based Reinforcement Learning. The reward function of this system (a.k.a. “innate drives” or “primary rewards”) says that pain is bad, and eating-when-hungry is good, etc. I will argue that this reward function is centered around the hypothalamus and brainstem, and that all human desires—even “higher” desires for things like compassion and justice—come directly or indirectly from that reward function. If future programmers build brain-like AGI, they will likewise have a reward function slot in their source code, in which they can put whatever they want. If they put the wrong code in the reward function slot, the resulting AGI will wind up callously indifferent to human welfare. How might they avoid that? What code should they put in—along with training environment and other design choices—such that the AGI won’t feel callous indifference to whether its programmers, and other people, live or die? No one knows—it’s an open problem, but I will review some ideas and research directions.
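To make the "reward function slot" idea concrete, here is a deliberately tiny Python sketch. It is not the brain's algorithm and not a proposal from this series. Every name in it (`innate_drives`, `toy_world_model`, `ToyAgent`) is a hypothetical stand-in, the "world model" is hand-written rather than learned, and planning is a single step of lookahead. The only point is that the planning code stays fixed while whatever sits in the reward slot determines what the agent ends up pursuing.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
RewardFn = Callable[[State], float]


def innate_drives(state: State) -> float:
    """Toy stand-in for 'primary rewards': pain is bad, eating-when-hungry is good."""
    reward = -state.get("pain", 0.0)
    if state.get("hungry", 0.0) > 0.5:
        reward += state.get("eating", 0.0)
    return reward


def toy_world_model(state: State, action: str) -> State:
    """Hand-written prediction of the next state (a real system would learn this)."""
    nxt = dict(state)
    if action == "eat":
        nxt["eating"] = 1.0
    elif action == "touch_stove":
        nxt["pain"] = 1.0
    return nxt


class ToyAgent:
    """One-step-lookahead planner with a pluggable reward function (the 'slot')."""

    def __init__(self, reward_fn: RewardFn, world_model: Callable[[State, str], State]):
        self.reward_fn = reward_fn
        self.world_model = world_model

    def choose(self, state: State, actions: List[str]) -> str:
        # Pick the action whose predicted outcome scores highest under reward_fn.
        return max(actions, key=lambda a: self.reward_fn(self.world_model(state, a)))


actions = ["eat", "touch_stove", "wait"]
hungry = {"hungry": 0.9}

agent = ToyAgent(innate_drives, toy_world_model)
print(agent.choose(hungry, actions))

# Same planner, different code in the reward slot: now pain is rewarded.
bad_agent = ToyAgent(lambda s: s.get("pain", 0.0), toy_world_model)
print(bad_agent.choose(hungry, actions))
```

In this toy setup the first agent picks `eat` and the second picks `touch_stove`, even though the planner and world model are identical. The open problem described above is far harder than this, since a real reward function would have to shape desires that are learned over time rather than scored directly on hand-labeled state variables.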

To print or cite this series: I cross-posted a PDF version of this series on an academic preprint server: https://osf.io/preprints/osf/fe36n

Acknowledgements: Thanks to Beth Barnes & the Centre For Effective Altruism Donor Lottery Program for financial support when I first wrote and published this, and thanks to Jed McCaleb & Astera Institute for financial support during later (minor) revisions. Thanks to the following people for critical comments on drafts: Adam Marblestone, Linda Linsefors, Justis Mills, Charlie Steiner, Maksym Taran, Adam Scholl, Aysja Johnson, Adam Shimi, Cameron Berg, Jacob Cannell, Oliver Daniels-Koch. Thanks to the many commenters for feedback after initial publication. Thanks to Ash Dorsey for help reformatting for the PDF preprint version.

(This series was first published in early 2022, but I’ve been revising it periodically since then; see the changelog at the bottom of each post.)