My AGI safety research—2022 review, ’23 plans

“Our greatest fear should not be of failure, but of succeeding at something that doesn’t really matter.” –DL Moody (allegedly)

The short version: In this post I’m briefly summarizing how I spent my work-time in 2022, and what I’m planning for 2023.

I expect to carry on with a similar time allocation into 2023.

If you think there are other things I should be doing instead or differently, please don’t be shy: the comment section is below, or you can DM me, email me, etc.

The long version:

1. First half of 2022: Writing “Intro to Brain-Like AGI Safety”

So, I was writing some technical post in late 2021, and realized that the thing I was talking about was a detail sitting on top of a giant pile of idiosyncratic beliefs and terminology that nobody else would understand. So I started writing a background section to that post. That background section grew and grew and grew, and eventually turned into a book-length series of 15 blog posts entitled “Intro to Brain-Like AGI Safety”, which reorganized and re-explained almost everything I had written and thought about up to that point, since I started in the field around 2019. (My palimpsest!) Writing that series took up pretty much 100% of my work time until May 2022.

Then I spent much of the late spring and summer catching up on lots of miscellaneous to-do-list stuff that I had put off while writing the series, and everyone in my family caught COVID[1], and we took a family vacation, and I attended two conferences, and I switched jobs when Jed McCaleb generously offered me a home at Astera Institute, and various other things. So I didn’t get much research done during the late spring and summer.

Moving on to the rest of the year, my substantive work time has been divided, I dunno, something like 45%-45%-10% between “my main research project”, “other research”, and “outreach”. Let’s take those one at a time in the next three sections.

2. Second half of 2022 (1/​3): My main research project

2.1 What’s the project?

I’m working on the open neuroscience problem that I described in the post Symbol Grounding and Human Social Instincts, and motivated in the post Two paths forward: “Controlled AGI” and “Social-instinct AGI”. I’ll give an abbreviated version here.

As discussed in “Intro to Brain-Like-AGI Safety”, I hold the following opinions:

  • We should think of within-lifetime learning in the human brain as a kind of model-based reinforcement learning (RL) system;

  • We should think of that model-based RL system as potentially similar to how future AGIs will work;

  • We should (to a first approximation) think of the “reward function” of that RL system as encoding “innate drives”, like pain being bad and sweet tastes being good;

  • These “innate drives” correspond to specific genetically-hardwired circuitry primarily in the hypothalamus and brainstem;

  • A subset of that circuitry underlies human social and moral instincts;

  • …And the project I’m working on is an attempt to figure out what those circuits are and how they work.
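As a toy illustration of the "reward function encodes innate drives" bullet above, here is a minimal Python sketch. Everything in it is made up for illustration: `BodySignals`, the two drive names, and the weights are my own assumptions, not claims about what the hypothalamus or brainstem actually computes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BodySignals:
    """Hypothetical sensory snapshot; field names are illustrative only."""
    tissue_damage: float  # 0.0 (none) to 1.0 (severe)
    sweetness: float      # 0.0 (none) to 1.0 (very sweet)


# Each "innate drive" maps the current signals to a scalar contribution.
InnateDrive = Callable[[BodySignals], float]

# Assumed, hand-picked weights; a real system would have many more drives.
INNATE_DRIVES: Dict[str, InnateDrive] = {
    "pain_is_bad": lambda s: -2.0 * s.tissue_damage,
    "sweet_is_good": lambda s: 1.0 * s.sweetness,
}


def reward(signals: BodySignals) -> float:
    """Total reward is the sum of all innate-drive contributions."""
    return sum(drive(signals) for drive in INNATE_DRIVES.values())
```

In this picture, the learning algorithm is a separate thing that consumes `reward(...)`, and the social-instincts question is roughly: what extra entries belong in that dictionary, and what inputs do they read?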

2.2 Why do I think success on this project would be helpful for AGI safety?

I have two arguments:

The modest argument is: At some point, I hope, we will have a science that can produce predictions of the form:

(“Innate drives” a.k.a. “Reward function” X)
+ (“Life experience” a.k.a. “Training environment” Y)
→ (“Adult” AGI that’s trying to do Z)

If we knew exactly what innate drives are in humans (particularly related to sociality, morality, etc.), then we would have actual examples of X+Y→Z to ground this future science.

Even with the benefit of actual examples, building a science that can predict Z from X+Y seems very hard, don’t get me wrong. Still, I think we’ll be in a better place if we have actual examples of X+Y→Z, than if we don’t.
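To make the X+Y→Z notation concrete, here is a toy sketch in Python. It is a deliberately trivial stand-in, assuming a deterministic two-action world and simple running-average value learning with round-robin exploration; it is not model-based RL and not a claim about how brains or AGIs learn. The names `grow_up`, `innate_drives`, `env_a`, and `env_b` are hypothetical.

```python
from typing import Callable, Dict

RewardFn = Callable[[str], float]  # X: outcome -> reward
Environment = Dict[str, str]       # Y: action -> outcome (deterministic toy)


def grow_up(reward_fn: RewardFn, env: Environment,
            steps: int = 50, lr: float = 0.5) -> str:
    """Return Z: the action the trained agent ends up preferring."""
    values = {action: 0.0 for action in env}
    actions = list(env)
    for t in range(steps):
        action = actions[t % len(actions)]  # round-robin, no randomness
        r = reward_fn(env[action])
        values[action] += lr * (r - values[action])  # move estimate toward r
    return max(values, key=values.get)


def innate_drives(outcome: str) -> float:
    """X: an assumed, hand-picked reward function over outcomes."""
    return {"sugar": 1.0, "praise": 0.8, "nothing": 0.0}.get(outcome, 0.0)


# Two different "life experiences" Y for the same innate drives X.
env_a = {"forage": "sugar", "help_others": "nothing"}
env_b = {"forage": "nothing", "help_others": "praise"}

z_a = grow_up(innate_drives, env_a)
z_b = grow_up(innate_drives, env_b)
```

Here the same X paired with two different Y's yields two different Z's (`z_a` is `"forage"`, `z_b` is `"help_others"`). The real problem is of course enormously harder than this, but the shape of the prediction is the same.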

The bolder argument is: Maybe we can just steal ideas from human social instincts for AGI.

I need to elaborate here. I do not think it’s a good idea to slavishly and unthinkingly copy human social instincts into an AGI. Why is that a bad idea?

  • For one thing, human social instincts leave something to be desired! For example, I don’t want AGIs with teenage angst, or zero-sum status drive, or bloodlust, etc.

  • For another thing, in the X+Y→Z calculus above, an AGI with human-like innate drives X will not necessarily grow up into human-like goals and desires Z, unless it also has a human-like training environment Y. And I think it’s very likely that AGIs will have different training environments than human children, at least in some ways (e.g. lack of human body, more capacity to self-modify).

On the other hand, if we first understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients, that seems like quite possibly a great idea.

Again, see Two paths forward: “Controlled AGI” and “Social-instinct AGI” for further discussion.

2.3 Why is this such a priority that it warrants a large fraction of my time?

  • Impact—see above.

  • Tractability—It’s in principle tractable, in the sense that there’s a concrete algorithmic problem, and there’s a specific solution to that problem implemented in the brain, and I am “merely” trying to figure out what that solution is. And while there’s a decent chance that I’ll hit roadblocks at some point, I feel a sense of steady progress so far (see below).

  • Neglectedness—This problem seems importantly neglected, in the “bus factor” sense. I’m happily not the only person on Earth trying to bring neuroscience knowledge to bear on AGI safety /​ alignment questions in general,[2] but I do unfortunately seem to be the only one working on this particular neuroscience puzzle.[3] For example, I’m quite sure that I’m the only person in AGI safety who cares a whit about the algorithmic role of neuropeptide receptors in the lateral septum.

2.4 Recent progress on that project

2.4.1 Shoring up foundations

I spent quite a bit of time in the summer and fall getting up to speed on the hypothalamus (see my book review on that topic) and other relevant parts of the brain (basal forebrain, amygdala, NAc, etc.—this book was especially helpful!).

I have also made a lot of progress towards cleaning up some of the sketchy bits and loose ends of my big-picture understanding of model-based RL in the brain. It seems that some aspects of my neuroscience discussion in the first half of Intro to Brain-Like AGI Safety will be different in the next iteration! But generally (1) none of those mistakes has any important downstream implications for how one should think about AGI safety, and (2) those mistakes were pretty much all in areas that I had explicitly flagged as especially speculative. I mostly feel proud of myself for continuing to make progress, rather than annoyed at myself for having written things that were wrong; if you think that’s the incorrect takeaway, we can discuss in the comments.

2.4.2 OK, great, but how do social instincts work?

I still don’t know. The shoring-up-foundations work above is giving me a progressively better sense of what I’m looking for and where. But I’d better keep working!

Philosophically, my general big-picture plan /​ workflow for solving the problem is:

  • (A) Come up with plausible theories /​ pseudocode for how human social instincts might work;

  • (B) Read the literature on socially-relevant bits of the hypothalamus & brainstem, including how they interface with the striatum etc.;

  • (C) Try to match up (A)+(B)—and iterate;

  • (D) If more experiments are needed, e.g. because there’s more than one plausible theory, try to figure out which experiments, and somehow make them happen.

In the second half of 2022 I’ve been almost entirely focused on (B), but I’m finally getting to the point where it’s beneficial for me to spend more time on (A) and (C). I’m not really thinking about (D) yet, and have a looming suspicion that (D) will be intractable, especially if I wind up thinking that human social instincts are importantly different from rat social instincts, because I suspect that the kinds of experiments that we need are not possible in humans. I hope I’m wrong! But even if (D) were to fail, I think what I’m working on would still be good—I think having several plausible theories of human social instincts would still be a significant improvement over having zero, from the perspective of Safe & Beneficial AGI.

3. Second half of 2022 (2/​3): Miscellaneous other research

I do a lot of things that are not “my main research project”. Much of it is kinda scattered—email correspondence, lesswrong comments, something random that I want to get off my chest, etc. I think that’s fine.

One of the larger projects that I started was my idea to do a brain-dump-post on the AGI deployment problem, basically as a way of forcing myself to think about that topic more carefully. I’ve been publishing it in pieces—so far, this one on AGI consciousness, and this one on offense-defense balance. Hopefully there will be more. For example, I need to think more about training environments. If we raise an AGI in a VR environment for a while, and then give it access to the real world, will the AGI wind up feeling like the VR environment is “real” and the real world isn’t? (Cf. surveys about the “Experience Machine”.) If so, what can we do about that? Alternatively, if we decide to raise an AGI in a literal robot body, how on earth would that be practical and competitive? Or is there a third option? Beats me.

I’m also hoping to write a follow-up on that offense-defense balance post mentioned above, discussing how I updated from the comments /​ correspondence afterwards.

4. Second half of 2022 (3/​3): Outreach, field-building, etc.

Outreach, field-building, etc. are time-consuming, stressful for me, and not particularly my comparative advantage, I think. So I don’t do it much. Sorry everyone! One exception is outreach towards the neuroscience community in particular, which in some cases I’m somewhat-uniquely positioned to do well, I think. The “Intro to Brain-Like-AGI Safety” series itself is (in part) beginner-friendly pedagogical outreach material of that type, and later in the year I did this podcast appearance and this post. I will endeavor to continue doing things like that from time to time into 2023.

Also, I recently made a 1-hour talk (UPDATE: I also now have a 30-minute version) based on the “Intro to Brain-Like-AGI Safety” series. If you have previously invited me to give a talk, and I said “Sorry but I don’t have any talk to give”, then you can try asking me again. As long as I don’t have to travel.

5. On to 2023!

Looking back, I think I’m pretty happy with how I’ve been allocating time, and plan to just keep moving forward as I have since the summer. If you think that’s bad or suboptimal, let’s chat in the comments section!

I’d like to give my thanks to my family, to my old funder Beth Barnes /​ EA Funds Donor Lottery Program, to my new employer Astera, to my colleagues and coworkers, to my biweekly-productivity-status-check-in-friendly-volunteer-person, to the people who write interesting things for me to read, to the people who write helpful replies to and criticisms of my blog posts and comments, to Lightcone Infrastructure for running this site, and to all of you for reading this far. To a happy, healthy, and apocalypse-free 2023!

  1. Nobody got a bad case of COVID, but there was much time-consuming annoyance, particularly from lost childcare.

  2. For example (in reverse alphabetical order) (I think) Eli Sennesh, Adam Safron, Beren Millidge, Linda Linsefors, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin, plus the AIntelope people and maybe some of the shard theory people, plus various other people to whom I apologize for omitting.

  3. I hope I’m not insulting the AIntelope people here. They’re interested in the same general problem, but are using very different methods from me, methods which will hopefully ultimately be complementary to what I’m trying to do.