My take on Jacob Cannell’s take on AGI safety

Jacob Cannell wrote some blog posts about AGI safety / alignment and neuroscience between 2010 and 2015, which I read and enjoyed quite early on when I was first getting interested in the same topics a few years ago. So I was delighted to see him reappear on LessWrong a year ago, where he has been a prolific and thought-provoking blogger and commenter (in his free time while running a startup!). See complete list of Jacob’s blog posts and comments.

Having read a bunch of his writings, and talked to him in various blog comments sections, I thought it would be worth trying to write up the places where he and I seem to agree and disagree.

This exercise will definitely be helpful for me, hopefully helpful for Jacob, and maybe helpful for people who are already pretty familiar with at least one of our two perspectives. (My perspective is here.) I’m not sure how helpful it will be for everyone else. In particular, I’m probably skipping over, without explanation, important areas where Jacob & I already agree—of which there are many!

(Before publishing I shared this post with Jacob, and he kindly left some responses /​ clarifications /​ counterarguments, which I have interspersed in the text, in gray boxes. I might reply back to some of those—check the comments section in the near future.)

1. How to think about the human brain

1.1 “Evolved modularity” versus “Universal learning machine”

Pause for background:

  • A. “Evolved modularity”: This is a school of thought wherein the human brain is a mishmosh of individual specific evolved capabilities, including a specifically-evolved language algorithm, a specifically-evolved “intuitive biology” algorithm, a specifically-evolved “intuitive physics” algorithm, an “intuitive human social relations” algorithm, a vision-processing algorithm, etc., all somewhat intermingled for sure, but all innate. Famous advocates of “evolved modularity” these days include Steven Pinker (see How the Mind Works) and Gary Marcus. I’m unfamiliar with the history but Jacob mentions early work by Cosmides & Tooby.

  • B. “Universal learning machine”: Jacob made up this term in his 2015 post “The Brain as a Universal Learning Machine”, to express the diametrically-opposite school of thought, wherein the brain has one extremely powerful and versatile within-lifetime learning algorithm, and this one algorithm learns language and biology and physics and social relations etc. This school of thought is popular among machine learning people, and it tends to be emphasized by computational neuroscientists, particularly in the “connectionist” tradition.

Here are two other things that are kinda related:

  • “Evolutionary psychology” is the basic idea of getting insight into psychological phenomena by thinking about evolution. In principle, “evolutionary psychology” and “evolved modularity” are different things, but unfortunately people seem to conflate them sometimes. For example, I read a 2018 book entitled Beyond Evolutionary Psychology, and it was entirely devoted to (a criticism of) evolved modularity, as opposed to evolutionary psychology per se. Well, I for one think that evolved modularity is basically wrong (as usually conceived; see next subsection), but I also think that doing evolutionary psychology (i.e., getting insight into psychological phenomena by thinking about evolution) is both possible and an excellent idea. Not only that, but I also think that actual evolutionary psychologists have in fact produced lots of good insights, as long as you’re able to sift them out from a giant pile of crap, just like in every field.

  • “Cortical uniformity” is the idea—due originally to Vernon Mountcastle in the 1970s and popularized by Jeff Hawkins in On Intelligence—that the neocortex (also called “isocortex” if you want to show off) is more-or-less a single configuration of neurons replicated over and over—in the case of humans, either 2 million “cortical columns” or 200 million “cortical minicolumns”, depending on who you ask. Cortical uniformity is a surprising hypothesis in light of the fact that different parts of the neocortex are intimately involved in seemingly-different domains like vision, language, math, reasoning, motor control, and so on. I say “more-or-less uniform” because neither Jeff Hawkins nor anyone else to my knowledge believes in literal “cortical uniformity”. There are well-known regional differences in the neocortex, but I like to think of them as akin to learning algorithm hyperparameters and architecture. Anyway, “cortical uniformity” is closely allied to the “universal learning machine” school of thought (see §2.5.3 here), but to flesh out that story you also need to say something about the other parts of the brain that are not the neocortex. For example, both Jacob (I think) and I take another big step in the “universal learning machine” direction by hypothesizing not only (quasi)uniformity of the cortex, but also of the striatum, cerebellum, and thalamus (with some caveats). Anyway, see below.

1.2 My compromise position

To oversimplify a bit, my position on the evolved-modularity versus universal-learning-machine spectrum is:

  • “Universal Learning Machine” is an excellent starting point for thinking about the telencephalon (neocortex, hippocampus, amygdala, striatum, etc.), thalamus, and cerebellum.

    • I.e., when we try to understand those parts of the brain, we should be mainly on the lookout for powerful large-scale learning algorithms.

  • “Evolved Modularity” is an excellent starting point for thinking about the hypothalamus and brainstem.

    • I.e., when we try to understand those parts of the brain, we should be mainly on the lookout for lots of little components that do specific fitness-enhancing things and which are specifically encoded by the genome.

    • (See, for example, my discussion of a particular cluster of cells in the hypothalamus that orchestrate hunger-related behavior in Section 3 of my recent hypothalamus post.)

1.3 How complicated are innate drives?

If the within-lifetime learning algorithm of the human brain is a kind of RL algorithm, then it needs a reward function. (I actually think this is a bit of an oversimplification, but close enough.) Let’s use the term “innate drives” to refer to the things in that reward function—avoiding pain, eating sweets, etc. The reward function, in my view, is primarily calculated in the hypothalamus and brainstem.

(For more on my picture, see my posts “Learning From Scratch” in the Brain and Two subsystems: Learning & Steering.)
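As a toy illustration of the "reward function made of innate drives" framing (not a claim about how the hypothalamus and brainstem actually compute anything), the idea can be sketched in Python. The two drives are the examples from the text; the state dictionary, the weights, and all function names are made-up assumptions.

```python
from typing import Callable, Dict

# Hypothetical body/sensory state; keys and units are invented for illustration.
State = Dict[str, float]

# Each "innate drive" is a small hardwired function from state to a reward term.
# The weights (-2.0, 1.0) are arbitrary assumptions, not measured quantities.
INNATE_DRIVES: Dict[str, Callable[[State], float]] = {
    "avoid_pain": lambda s: -2.0 * s.get("pain", 0.0),
    "eat_sweets": lambda s: 1.0 * s.get("sweet_taste", 0.0),
}


def reward_breakdown(state: State) -> Dict[str, float]:
    """Contribution of each innate drive to the total reward."""
    return {name: drive(state) for name, drive in INNATE_DRIVES.items()}


def reward(state: State) -> float:
    """Total reward: the sum of all innate-drive contributions."""
    return sum(reward_breakdown(state).values())


if __name__ == "__main__":
    state = {"pain": 1.0, "sweet_taste": 0.5}
    print(reward_breakdown(state))
    print(reward(state))
```

In this framing, the disagreement about "how complicated innate drives are" is roughly a disagreement about how many entries a table like `INNATE_DRIVES` has, and how intricate each entry is.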

Jacob and I seem to have some disagreement about how complex these innate drives are, and how much we should care about that complexity; I’m on the pro-complexity side of the debate, and Jacob is on the pro-simplicity side.

For an example of where we disagree, consider the landscape preferences theory within evolutionary aesthetics. Here’s wikipedia (hyperlinks and footnotes removed):

An important choice for a mobile organism is selecting a good habitat to live in. Humans are argued to have strong aesthetical preferences for landscapes which were good habitats in the ancestral environment. When young human children from different nations are asked to select which landscape they prefer, from a selection of standardized landscape photographs, there is a strong preference for savannas with trees. The East African savanna is the ancestral environment in which much of human evolution is argued to have taken place. There is also a preference for landscapes with water, with both open and wooded areas, with trees with branches at a suitable height for climbing and taking foods, with features encouraging exploration such as a path or river curving out of view, with seen or implied game animals, and with some clouds. These are all features that are often featured in calendar art and in the design of public parks.

A survey of art preferences in many different nations found that realistic painting was preferred. Favorite features were water, trees as well as other plants, humans (in particular beautiful women, children, and well-known historical figures), and animals (in particular both wild and domestic large animals). Blue, followed by green, was the favorite color. Using the survey, the study authors constructed a painting showing the preferences of each nation. Despite the many different cultures, the paintings all showed a strong similarity to landscape calendar art. The authors argued that this similarity was in fact due to the influence of the Western calendar industry. Another explanation is that these features are those evolutionary psychology predicts should be popular for evolutionary reasons.

My snap reaction is that this evolutionary story seems probably true, and Jacob’s is that it’s probably false. We were arguing about it in this thread.

The disagreement seems to be less about the specifics of the landscape painting experiment mentioned above, and more about priors.

My prior is mainly coming from the following:

  • By default, being in the wrong micro-habitat gives a negative reward which can be both very sparse and often irreversibly fatal (e.g. a higher chance of getting eaten by a predator, starving to death, freezing to death, burning to death, falling to death, drowning, getting stuck in the mud, etc., depending on the species).

  • Therefore, it’s very difficult for an animal to learn which micro-habitat to occupy purely by trial-and-error without the help of any micro-habitat-specific reward-shaping.

  • Such reward-shaping is straightforward to implement by doing heuristic calculations on sensory inputs.

  • Animal brains (specifically brainstem & hypothalamus in the case of vertebrates) seem to be perfectly set up with the corresponding machinery to do this—visual heuristics within the superior colliculus, auditory heuristics within the inferior colliculus, taste heuristics within the medulla, smell heuristics within the hypothalamus, etc.

  • Therefore, I have a strong prior expectation that every mobile animal including humans will find types of visual input (and sounds, smells, etc.) to be inherently “appealing” /​ “pleasant”, in a way that would statistically lead the animal to spend more time in “good” micro-habitats /​ hunting grounds /​ etc. and less time in “bad” ones.

Jacob’s prior is mainly coming from the following, I think:

(I am more-or-less on board with the first top-level bullet point here[1], but disagree with the last bullet point.)

All that was kinda priors. Now we turn to the specifics of the landscape painting thing.

Jacob & I argued about it for a while. I think the following is one of the root causes of the disagreement:

  • Jacob was interpreting the hypothesis in question as “Humans have (among other things) an innate preference for looking at water, trees, etc.”.

  • The hypothesis that I believe is: “Humans have (among other things) a pleasant innate reaction upon looking at visual scenes for which F(visual input) takes a high value, where F is some rather simple function, I don’t know exactly what, but definitely way too simple a function to include a proper water-detector, or tree-detector, etc.”[2]

Jacob and I both agree that the first hypothesis is wrong. (To be fair, he wasn’t getting it from nowhere—it’s probably what most advocates of the hypothesis would say that they are arguing for!)

(And this is an example of our more general dispositions where I tend to think “10% of evolutionary psychology is true important things that we need to explain, let’s get to work explaining them properly” and Jacob tends to think “90% of evolutionary psychology is crap, let’s get to work throwing it out”. These are not inconsistent! But they’re different emphases.)

Anyway, my hypothesis is coming from:

  1. I think the function F is implemented in the superior colliculus (part of the brainstem), which is too small and low-resolution to do good image processing;

  2. We only have 25,000 genes in our whole genome, and building a proper robust tree-detector seems too complicated for that;

  3. There’s some evidence that the human superior colliculus has an innate human-face detector, but it’s not really a human-face detector, it’s really a detector of three dark blobs in a roughly triangular pattern, and this blob-detector incidentally triggers on faces. Likewise, an incoming-bird-detector in the mouse superior colliculus is really more like an “expanding dark blob in the upper field-of-view” detector (ref).
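To make "a rather simple function F" concrete, here is a minimal Python sketch of the kind of crude heuristic described in item 3: an "expanding dark blob in the upper field-of-view" detector. Everything specific here is an assumption for illustration (the frame format, the darkness threshold, the growth factor, the function names); it is not a model of the actual superior colliculus circuitry.

```python
from typing import List

Frame = List[List[float]]  # grayscale image: 0.0 = black, 1.0 = white

DARK_THRESHOLD = 0.3  # assumed: pixels below this count as "dark"
GROWTH_FACTOR = 1.5   # assumed: dark area must grow at least this much per frame


def dark_area_upper_field(frame: Frame) -> int:
    """Count dark pixels in the top half of the frame."""
    upper_rows = frame[: len(frame) // 2]
    return sum(1 for row in upper_rows for px in row if px < DARK_THRESHOLD)


def looming_signal(frames: List[Frame]) -> bool:
    """True if the dark area in the upper field keeps expanding frame to frame."""
    areas = [dark_area_upper_field(f) for f in frames]
    if len(areas) < 2:
        return False
    return all(
        prev > 0 and cur >= GROWTH_FACTOR * prev
        for prev, cur in zip(areas, areas[1:])
    )


def frame_with_blob(size: int, blob: int, upper: bool = True) -> Frame:
    """White size-by-size frame with a dark blob-by-blob square at top or bottom."""
    frame = [[1.0] * size for _ in range(size)]
    row0 = 0 if upper else size - blob
    for r in range(row0, row0 + blob):
        for c in range(blob):
            frame[r][c] = 0.0
    return frame


if __name__ == "__main__":
    approaching = [frame_with_blob(8, b) for b in (1, 2, 4)]
    print(looming_signal(approaching))
```

Note that nothing in this function knows what a bird is. It fires on any growing dark patch overhead, which is the point: the detector is a cheap proxy that happens to trigger on the ecologically relevant stimulus.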

Let’s go back to evidence from surveys and market research on wall-calendars and paintings, mentioned in that Wikipedia excerpt above. Unfortunately, it seems that neither Jacob nor I have theories that make sharp predictions on what people will want to hang on their walls. One problem is that we both agree that people can hang things on walls for reasons related to neither “innate aesthetics” nor “information value”, like impressing your friends, or bringing back sentimental memories of your first kiss. I have the additional problem that I don’t know exactly what the alleged habitat-aesthetics function F is (and there are probably several F’s), and thus I find it perfectly plausible (indeed, expected) that F can be triggered by, say, an abstract painting which nobody in their right mind would mistake for a savannah landscape. And I have no predictions about which abstract paintings![3] And conversely, the question of what does or doesn’t provide information value is likewise complicated—it depends on one’s goals and prior knowledge. Thus Jacob and I were disagreeing here about whether a window view of a river provides high or low information value. (Suppose that you’ve had that same window view for the past 3 years already, and the river never has animals or boats on it.) I say the information value of that window view is roughly zero, Jacob says it’s significantly positive (“it’s constantly changing with time of day lighting, weather, etc.…the river view suggests you can actually go out and explore the landscape”), and I’m not sure how we’re going to resolve that.

DALL-E 2 prompts: “the view out my window has high information value” (left) and “a window view with high information value” (right). 🤔🤔🤔

So it seems like we’re stuck, or at least our disagreement probably won’t get resolved by looking into people’s wall-art preferences.

1.3.2 …But this doesn’t seem to be a super-deep disagreement

Why don’t I think it’s a super-deep disagreement?

For one thing, I proposed that “fully describing the [reward function] of a human would probably take like thousands of lines of pseudocode” and Jacob said “sounds reasonable”.

For another thing, while we disagree about habitat-aesthetics-in-humans, there are structurally-similar cases where Jacob & I are in fact on the same page:

  • I brought up the case of a little camouflaged animal having an innate preference to be standing on the appropriate background to its camouflage, implemented via the superior colliculus calculating some function on visual inputs and feeding that information into the reward function (as one among many contributions to the reward function). Jacob seemed at least willing to entertain that as a plausible hypothetical.

  • Jacob definitely believes that there are innate sexual preferences related to the visual appearances of potential mates. Let’s turn to that next.

1.3.3 “Correlation-guided proxy matching”

Here is Jacob describing the idea of “correlation-guided proxy matching”:

Any time evolution started using a generic learning system, it had to figure out how to solve this learned symbol grounding problem, how to wire up dynamically learned concepts to extant conserved, genetically-predetermined behavioral circuits.

Evolution’s general solution likely is correlation-guided proxy matching: a Matryoshka-style layered brain approach where a more hardwired oldbrain is redundantly extended rather than replaced by a more dynamic newbrain. Specific innate circuits in the oldbrain encode simple approximations of the same computational concepts/​patterns as specific circuits that will typically develop in the newbrain at some critical learning stage—and the resulting firing pattern correlations thereby help oldbrain circuits locate and connect to their precise dynamic circuit counterparts in the newbrain. This is why we see replication of sensory systems in the ‘oldbrain’, even in humans who rely entirely on cortical sensory processing.

[Translation guide: When Jacob talks about “oldbrain” it’s roughly equivalent to when I talk about “hypothalamus and brainstem”.]
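One very simplified reading of the matching step in Jacob's description can be sketched in Python: a hardwired proxy circuit emits a crude signal over time, and it gets "wired" to whichever learned unit's firing pattern correlates best with that signal. The use of Pearson correlation, the unit names, and the activation traces below are all illustrative assumptions; the quoted passage does not specify a mechanism at this level of detail.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def match_proxy(
    proxy_trace: Sequence[float],
    learned_traces: Dict[str, Sequence[float]],
) -> str:
    """Name of the learned unit whose activity best correlates with the proxy."""
    return max(
        learned_traces,
        key=lambda name: pearson(proxy_trace, learned_traces[name]),
    )


if __name__ == "__main__":
    # Hypothetical firing of an innate, crude detector over 8 time steps.
    proxy = [0, 1, 0, 1, 1, 0, 0, 1]
    # Hypothetical activations of three learned ("newbrain") units.
    learned = {
        "unit_a": [0.1, 0.9, 0.0, 0.8, 1.0, 0.2, 0.1, 0.7],  # tracks the proxy
        "unit_b": [1, 0, 1, 0, 0, 1, 1, 0],                  # anti-correlated
        "unit_c": [0.5] * 8,                                 # uninformative
    }
    print(match_proxy(proxy, learned))
```

The sketch also shows where the approach can fail: if no learned unit correlates well with the proxy, `match_proxy` still returns whichever unit correlates least badly.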

In the case of innate sexual preferences, Jacob proposes “dumb simple humanoid shape detectors and symmetry detectors etc encoding a simple sexiness visual concept”[4] as an example.

Anyway, leaving aside some nitpicky arguments over implementation details, I see this as very much on the right track. I’m bringing it up because we’ll get back to it later.

1.3.4 Should we think of (almost) all innate drives as “an approximation to (self)-empowerment”?

Let’s loosely define “empowerment” as “having lots of options in the future”; see Jacob’s post Empowerment is (almost) All We Need for better discussion. I’ll get back to empowerment in Section 3 below in the context of AGI.
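For readers who want something more concrete than "lots of options", here is a toy Python sketch of one common formalization. In a deterministic world, n-step empowerment reduces to the log of the number of distinct states reachable in n steps. The gridworld, the action set, and the function names are assumptions chosen for illustration, and this is not necessarily the exact definition Jacob has in mind.

```python
import math
from typing import Set, Tuple

Pos = Tuple[int, int]

# Assumed action set: stay in place, or move one cell in a cardinal direction.
ACTIONS = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]


def step(pos: Pos, action: Tuple[int, int], width: int, height: int) -> Pos:
    """Deterministic move; bumping into the boundary leaves the agent in place."""
    x, y = pos[0] + action[0], pos[1] + action[1]
    if 0 <= x < width and 0 <= y < height:
        return (x, y)
    return pos


def reachable(pos: Pos, n: int, width: int, height: int) -> Set[Pos]:
    """All states the agent can occupy after exactly n actions."""
    frontier = {pos}
    for _ in range(n):
        frontier = {step(p, a, width, height) for p in frontier for a in ACTIONS}
    return frontier


def empowerment(pos: Pos, n: int, width: int, height: int) -> float:
    """n-step empowerment in bits, valid for this deterministic toy world."""
    return math.log2(len(reachable(pos, n, width, height)))


if __name__ == "__main__":
    print(empowerment((2, 2), 2, 5, 5))  # center of a 5x5 grid
    print(empowerment((0, 0), 2, 5, 5))  # corner of the same grid
```

The center of the grid scores higher than the corner because more distinct futures are reachable from it, which is the "lots of options" intuition in miniature.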

If a sufficiently-clear-thinking human were deliberately trying to empower herself, she would do lots of things that humans actually do. She would stay alive and healthy, she would win friends and allies and high social status, she would gain skills and knowledge, she would accumulate money or other resources, she would stay abreast of community gossip, and so on.

Maybe you’re tempted to look at the above paragraph and say “Aha! An ‘empowerment drive’ is a grand unified theory of human innate drives!!” But that would be wrong for a couple of reasons.

The first reason is that empowerment comes apart from inclusive genetic fitness in a couple places—particularly having sex, raising children, and more generally helping close relatives survive and have children à la kin selection theory. And we see this in e.g. the human innate sex drive.

The second reason is that infants cannot realistically calculate which actions will lead to “empowerment”.

Jacob responds: On the contrary I think it’s fairly clear now that the primary learning signals driving the infant brain are some combination of self-supervised learning for prediction and then value of information and optionality/​empowerment for decisions (motor and planning).

The evidence for this comes from DL experiments as well as neuroscience, but also just obvious case examples:


Indeed, I claim that even adult humans often do things that advance their own empowerment without understanding why and how. For example, if someone is quick to anger and vengeance, then that tendency can (indirectly via their reputation) increase their empowerment, via people learning not to mess with them. But that’s not why they’re quick to anger and vengeance—it’s just their personality! And if they haven’t read Thomas Schelling or whatever, they might never appreciate the underlying logic.

So we don’t have an innate drive for “empowerment” per se, because it’s not realistically computable. Instead:

  • We have a set of innate drives which can be collectively viewed as “an approximation to a hypothetical empowerment drive”. For example, innate fear-of-heights is part of an approximation to empowerment, insofar as falling off a cliff tends to be disempowering.

  • We will generally learn empowerment-advancing behaviors and patterns within our lifetimes, because those behaviors and patterns tend to be useful for lots of things. For example, I like having money, not because of any innate drive, but because of experience using money to get lots of other things I like.

Out of these two points, I think Jacob has more-or-less agreed with both. For the first one, he recognizes sex as a non-empowerment-related innate drive here (“The values that deviate from empowerment are near exclusively related to sex”—well that seems an overstatement given childrearing, but whatever.) For the second one, here I had proposed “There is innate stuff in the genome that makes humans want social status. Oh by the way, the reason that this stuff wound up in the genome is because social status tends to lead to empowerment, which in turn tends to lead to higher inclusive genetic fitness. Ditto curiosity, fun, etc.”, and Jacob at least “mostly” agreed.

Jacob responds: Social status drive emerges naturally from empowerment, which children acquire by learning cultural theory of mind and folk game theory through learning to communicate with and through their parents. Children quickly learn that hidden variables in their parents have huge effect on their environment and thus try to learn how to control those variables.

I mostly agree that curiosity—or value of information—is innate; which is not the same as optionality-empowerment, but is closely connected to it and a primary innate motivational drive. Fun is also probably an emergent consequence of value-of-information and optionality.

But unlike Jacob, I get the takeaway message: “OK, so at the end of the day, ‘empowerment’ is pretty useless as a way to think about human innate drives. Let’s not do that.” For example, I can say “fear-of-heights is part of an approximation to empowerment”, and that’s correct! But what’s the point? I can equally well say “fear-of-heights is part of an approximation to inclusive genetic fitness”. Or better yet, “fear-of-heights tends to stop you from falling off cliffs and getting injured or killed, which in turn would be bad for inclusive genetic fitness”. I don’t see how “empowerment” is adding anything to the conversation here.

And I think “empowerment” adds to confusion if we’re not scrupulously careful to avoid mixing up “empowerment” and “approximation-to-empowerment”. Approximations tend to come apart in new environments—that’s Goodhart’s law. We’ll get back to that in Section 3.3 below.

Likewise, we can say that “status drive is an approximation to empowerment”, and we’re correct to say that, but saying that gets us ≈0% of the way towards explaining exactly what status drive is or how it’s implemented.

(Unless you think that there’s no such thing as an innate status drive, and that humans engage in status-seeking and status-respecting behaviors purely because they’ve learned within their lifetime that those behaviors are instrumentally useful. That’s certainly a hypothesis worth entertaining, but I strongly believe that it’s wrong.)

Jacob responds (to “we can say that ‘status drive is an approximation to empowerment’”): Well no, I’d say status drive is not truly innate at all, but is learned very early on as a natural empowerment manifestation or proxy.

Infants don’t even know how to control their own limbs, but they automatically learn through a powerful general empowerment learning mechanism. That same general learning signal absolutely does not—and can not—discriminate between hidden variables representing limb poses (which it seeks to control) and hidden variables representing beliefs in other humans minds (which determine constraints on the child’s behavior). It simply seeks to control all such important hidden variables.

Steve sidenote: Leaving aside the question of who is correct, I think it’s helpful to note that this disagreement here has the same pattern as the one in Section 1.3.1 above—Jacob thinks that the human brain within-lifetime RL reward function is simpler (a.k.a. smaller number of different “innate drives”) and I think it’s more complicated (a.k.a. larger number of different “innate drives”).

OK, let’s switch gears to a somewhat different topic:

2. Will AGI algorithms look like brain algorithms?

2.1 The spectrum from “giant universe of possible AGI algorithms” to “one natural practical way to build AGI”

Here are two opposite schools of thought:

  • “Giant Universe” school-of-thought: There is a vast universe of possible AGI algorithms. If you zoom in enough, you can eventually find a tiny speck, and inside that speck is every human mind that has ever existed. (Cf. Eliezer Yudkowsky 2008.)

  • “Unique Solution” school-of-thought: The things we expect AGI to do (learn, understand, plan, reason, invent, etc.) comprise a problem, and maybe it turns out that there’s just one natural practical way to solve that problem. If so, we would expect future AGI algorithms to resemble human brain algorithms. (Cf. Jacob Cannell 2022)

Before proceeding, a few points of clarification:

  • People can easily talk past each other by mixing up “learning algorithm” versus “trained model”. I’m closer to the “unique solution” camp when we’re talking about the learning algorithm, and I’m closer to the “giant universe” camp when we talk about the trained model.

  • As a particularly safety-relevant example of why I’m in the “giant universe” camp for trained models, I think human-brain-like RL with 1000 different reward functions can lead to trained models that have 1000 wildly different desires /​ goals /​ intuitions about what’s good and right. (But they all might act the same for a while thanks to instrumental convergence.) In this context, I think it’s important to remember that people can (and by default will) make AGIs with reward functions that are radically different from those of any human or animal, e.g. “reward for paperclips”. (More discussion and caveats in my post here.)

  • We can also reconcile the two schools of thought by the fact that the “Giant Universe” claim is about “possible” algorithms and the “Unique Solution” claim is about “practical” algorithms. Even if there is just one unique practical learning algorithm that scales to AGI, there are certainly lots of other wildly impractical ones. Two examples in the latter category (in my opinion) would be: (1) a learning algorithm that recapitulates the process of animal evolution, and (2) computable approximations to AIXI such as “AIXItl”.

Going back to those two schools of thought, and focusing on the learning algorithm not the trained model, are there any good reasons to believe in “Unique Solution”?

It seems at least plausible to me. After all, there do seem to be “natural” solutions to at least some algorithmic problems—e.g. the Fast Fourier Transform was more-or-less independently invented multiple times. Would an intelligent extraterrestrial civilization invent the belief propagation algorithm, in a form recognizable to us? Hard to say, but it seems at least plausible, right?

We get stronger evidence from the cases where AI researchers have come up with an idea and then later discovered that they had reinvented something that evolution had already put into the human brain. Examples are controversial, but arguably include Temporal Difference learning, self-supervised learning (i.e. the idea of updating models on sensory prediction errors), and feedback control. What about the overlap between deep learning and the brain (distributed representations, adjustable weights, etc.)? Well, those things were historically brought into AI from neuroscience, which complicates our ability to draw lessons. But still, the remarkable successes of more-brain-like deep learning compared to various less-brain-like alternatives in AI do seem to be at least some evidence for “Unique Solution”. (But see next subsection.)

Jacob offers another reason that he’s strongly in the “Unique Solution” school of thought, related to his claim that brains are near various theoretical efficiency limits. Leaving aside the question of whether brains are in fact near various theoretical efficiency limits (I have no strong opinion), I don’t understand this argument. Why can’t a wildly different algorithm also approach the same theoretical efficiency limits?

Well anyway, I join Jacob in the “Unique Solution” camp, albeit with a bit less confidence and for different underlying reasons. Indeed, when I explain to people why I’m working on brain-like AGI (e.g. here), I usually offer the justification that we AGI safety researchers should be making contingency plans for any plausible AGI design that we can think of, and brain-like AGI is at least plausible. But that’s just a polite diplomatic cop-out. What I really believe is that the researchers pursuing broadly-brain-like paths to AGI are the ones who will probably succeed, and everyone else will probably fail, and/​or gradually pivot /​ converge towards brain-like approaches. If you disagree with that claim, I’m not particularly interested in arguing with you (for the obvious infohazard reasons)—we can agree to disagree, and I will fall back to my polite diplomatic cop-out answer above, and we’re all going to find out sooner or later!

2.2 How similar are brain learning algorithms to today’s deep learning algorithms? (And implications for timelines.)

Jacob and I seem to be in agreement that human brain learning algorithms are similar in some ways and different in other ways from today’s deep learning algorithms. But I have a strong sense that Jacob expects substantially bigger similarities and substantially smaller differences than I do. That’s hard to pin down, and as above I don’t want to argue about it. We’ll find out sooner or later!

In terms of timelines, Jacob & I agree that AGI is probably already possible for a reasonable price with today’s chips and data centers, and we’re just waiting on algorithmic advances. (Jacob: “So my model absolutely is that we are limited by algorithmic knowledge. If we had that knowledge today we would be training AGI right now”.)

So then my next step is to say “OK then. How long will we be waiting on those algorithmic advances? Hmm. I dunno! Maybe 5-30 years?? Then let’s also add, umm, 5-10 more years after that to work out the kinks and run trainings before we have AGI.” (When I say “5-30” years, I have a bit more going on under the hood besides wild guessing. But not much more!)

Jacob proposes more confidently that we’ll get AGI soon (“75% by 2032”). He thinks that a certain amount of compute /​ memory /​ etc. is required to train an AGI (and we can figure out roughly how much by looking at human brain within-lifetime learning), and by the time that a great many groups around the world have easy access to this much compute /​ memory /​ etc., they will come up with whatever algorithmic advances are necessary for AGI. He writes: “Algorithmic innovation is rarely the key constraint on progress in DL, due to the vast computational training expense of testing new ideas. Ideas are cheap, hardware is not.” (I have heard that Hans Moravec’s forecasts were based on a similar assumption.)

I’m much less confident than Jacob in “ideas are cheap”. It seems to me that plenty of useful algorithms are published decades later than they theoretically could have been published, for reasons unrelated to the availability of compute. For example, Judea Pearl published the belief propagation algorithm in 1982. Why hadn’t someone already published it in 1962? Or 1922?? That’s not a rhetorical question—I’m not an expert, maybe there’s a good answer! Leave a comment if you know. But anyway, where I’m at right now is that I wouldn’t be surprised if there were, say, 10 or 20 years between lots of groups having easy access to compute sufficient for AGI, and someone actually making AGI. So I have longer timelines than Jacob, although that’s a pretty low bar by “normie” standards.

Again, this all seems probably downstream of our different opinions about how similar deep learning algorithms are to brain learning algorithms—a question which (I would argue) is slightly relevant for safety and extremely relevant for capabilities, so I don’t care to talk about it. But it certainly seems likely that Jacob is imagining smaller ideas (tweaks) which are cheap, and I’m thinking of bigger ideas which are more expensive.

2.3 Will AGI use neuromorphic (or processing-in-memory) chips?

Jacob and I both agree that (1) the first AGIs that people will make will probably use “normal” chips like GPUs or other ASICs, and (2) when thousandth-generation Jupiter-brain AGIs are building Dyson spheres, they’re probably going to be using neuromorphic / processing-in-memory architectures of some sort, since those seem to have the best properties in terms of both scaling up to extremely large information capacity, and energy efficiency. (See Jacob’s discussion here.)

I think I’m a bit more negative than Jacob on the current state of neuromorphic chips and the technical challenges ahead, and thus I expect the transition to neuromorphic chips to happen later than Jacob expects, probably. I also put higher probability on AGI using fast serial coprocessors to unlock algorithmic possibilities that brains don’t have access to, both for early AGI and in the distant future. (Think of how “a human with a pocket calculator” can do things that a human can’t. Then think much bigger than that!) But whatever; this disagreement doesn’t seem to be too important for anything.

3. Human-empowerment as an AGI motivation

See Jacob’s recent post Empowerment is (almost) All We Need (and slightly earlier LOVE in a simbox is all you need).

Two questions immediately jump to mind:

The outer alignment question is: “Do we want to make an AGI that’s trying to “empower” humanity?”

The inner alignment question is: “How would we make an AGI that’s trying to “empower” humanity?”

Jacob’s answer to the latter (inner alignment) question is mostly “correlation-guided proxy matching” as described above, possibly supplemented by interpretability—see his comment here.

My perspective is that we shouldn’t really be asking these two questions separately. I think we’re going to follow Procedure X (let’s say, correlation-guided proxy matching with proxy P and hyperparameters A,B,C in environment E), and we’re going to get an AGI that’s trying to do Y. I expect that Y will not be identical to “empowerment” because perfect inner alignment is a pipe dream. So we shouldn’t ask the two questions: “(1) How similar is Y to “empowerment”, and (2) Is “empowerment” what we want?”. Instead, I think we should ask the one question “Is Y what we want?”.

So I want to push the question of empowerment to the side and just look at the actual plan. When I do, I find that Jacob’s proposals are very similar to my own! But I do think we have some minor differences worth discussing.

Jacob’s proposed plan (described here) suggests two things, one related to reverse-engineering social instincts in the brain, and the other related to interpretability. Let’s take them one at a time:

3.1 Social instincts /​ empathy

Jacob and I both agree that it would be good to understand human social instincts well enough that we could write them into future AGI source code if we wanted to (here’s my own post motivating that). We both agree that this code would probably involve something like correlation-guided proxy matching (I have a post on that too). But my impression is that Jacob expects that we’re going to get most of the way towards solving this problem by reading the (massive) existing neuroscience literature concerning morality, sociality, affects, etc., whereas I think that literature is all kinda garbage—or rather, not answering the questions that I’m interested in—and we still have our work cut out.

Jacob responds: Not quite—my prior is that success in reverse engineering human altruism (which probably depends on innate social instincts for grounding) will depend on existing neuroscience literature to about the same extent that progress in DL has.

So Jacob seems to have more of a “it’s OK we have a plan” attitude, while I’m sitting here poring over technical studies of neuropeptide receptors in the lateral septum, feeling like I’m racing the clock, even though my timelines-to-AGI are actually longer than his.

Somewhat relatedly, and echoing the discussion of innate drives above, I think Jacob expects human social instincts to be simpler than I do—maybe he expects human social instincts to comprise like 5 separable “innate reactions” (e.g. here) and I expect like 30, or whatever. So maybe he thinks we can just think about it a bit in our armchairs and write down the answer, and it will be either correct or close enough, whereas I expect more of a big research project that will produce non-obvious results.

Jacob responds: I think most of the system complexity for innate symbol grounding is split vaguely equally between sexual attraction and altruism-supporting innate social instincts, and that reverse engineering, testing and improving these mechanisms for DL agents in sim sandboxes is much of the big research project.

3.2 Interpretability

Jacob suggests that we could “use introspection/​interpretability tools to more manually locate learned models of external agents (and their values/​empowerment/​etc), and then extract those located circuits and use them directly as proxies in the next agent”. (See also here.) I think that’s a perfectly good idea (see e.g. my comment here), and I think our disagreement (such as it is) is a bit like Jacob saying “Maybe it will work” and me saying “Maybe it won’t work”. These can both be true. Hopefully we can all agree that it would be better to have a strong positive reason to believe that our plan will definitely work, particularly given challenges related to “concept extrapolation”. (See also the rest of that post.)

Jacob has a clever additional twist on interpretability in his proposal that we could “listen in” on an AGI’s internal monologue (see here). Again, I do think this is a fine idea that could help us, particularly if we can figure out interventions that make the AGI a “verbal thinker” to the greatest extent possible. But I don’t think it offers any strong guarantee that this kind of interpretability won’t miss important things. For example, I’m somewhat of a verbal thinker, I guess, but my internal monologue has lots of idiosyncratic made-up terms which are only meaningful to me. It also has lots of very different thoughts associated with the same words. Let’s explore this avenue anyway, for sure, but I don’t want to get my hopes too high.

3.3 OK, but still, is humanity-empowerment what we want?

(In other words, if we somehow made an AGI that wanted to maximize the future empowerment of “humanity”, would it be “aligned”?)

I argued just above that this is not really the right question to ask. But it’s not entirely irrelevant either. So let’s have at it.

Let’s say that “humanity” (CEV or whatever) has terminal goals T (a utopia of truth, beauty, friendship, love, fun, diversity, kittens, whatever). Let’s also say that, given the choice and knowledge and power, “humanity” would pursue instrumental empowerment-type goals P as a means to an end of achieving T.
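To make “empowerment-type goals” a bit more concrete, here is a minimal sketch of one standard formalization from the literature: n-step empowerment as the channel capacity from an agent’s action sequences to its resulting states, which for deterministic dynamics reduces to log2 of the number of distinct reachable states. This is an assumption for illustration and not necessarily the exact definition Jacob has in mind; the corridor world and the function names are hypothetical.

```python
import math
from itertools import product


def reachable_states(start, step_fn, actions, horizon):
    """Distinct states reachable by some action sequence of length `horizon`."""
    states = set()
    for seq in product(actions, repeat=horizon):
        s = start
        for a in seq:
            s = step_fn(s, a)
        states.add(s)
    return states


def empowerment(start, step_fn, actions, horizon):
    """n-step empowerment in bits, valid only for deterministic dynamics."""
    return math.log2(len(reachable_states(start, step_fn, actions, horizon)))


# Hypothetical toy world: a corridor of cells 0..9; walls clamp movement.
ACTIONS = (-1, 0, +1)


def corridor_step(state, action):
    return min(9, max(0, state + action))


print(empowerment(5, corridor_step, ACTIONS, 2))  # log2(5), about 2.32 bits
print(empowerment(0, corridor_step, ACTIONS, 2))  # log2(3), about 1.58 bits
```

In this toy, standing in the middle of the corridor is “more empowered” than standing against a wall, simply because more futures remain reachable. Nothing in the measure refers to which of those futures anyone actually wants.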

If we make an AGI that wants humanity to wind up maximally empowered in the future, it would be “aligned” to the human pursuit of P, but “misaligned” to the human pursuit of T.

Jacob responds: The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d. So optimizing for our empowerment today is equivalent to optimizing for our future ability to maximize our long term values, whatever they are. I think you are confusing optimizing for P[t] (current empowerment) with optimizing for P[t+d] (future empowerment). Convergence requires a sufficient time gap between the moment of empowerment and the future utility, which wouldn’t occur for P[t+d] and T[t+d].

In other words, the AGI does not want humans to “cash in” their empowerment to purchase T.[5]

Even worse, the AGI does not want humans to want to “cash in” their empowerment to purchase T.

Jacob responds: If the AGI is optimizing for rolling future discounted empowerment, that is equivalent only to optimizing for the long term components of our utility function. Long term utility never wants us to ‘cash’ in empowerment, and this same conflict occurs in human brains (spend vs save/​invest). The obvious solution as I mentioned is to use a learned model for the short term utility, and probably learn the discount schedule.

Also it is worth noting that lower discount rates lead to more success in the long term, and lower discount rates increase the convergence (lower the importance of short term utility).

T is the whole value of the future. T is what we’re fighting for. T is the light at the end of the tunnel. If we make a powerful autonomous AGI that doesn’t care about T, then we’re doing the wrong thing!

This seems to be the obvious objection, and indeed I find it persuasive. But Jacob offers several rebuttals.

First (see here and here), I think Jacob is imagining two stages:

  • In Stage 1, the AGI accumulates P and gives it to humans.

  • In Stage 2, the now-super-empowered uplifted posthumans (or whatever) spend their P to buy T.

Jacob responds: Yeah this is what success looks like. There may be other success stories, but the main paths look like this (empowered posthumanity). So if your AGI is not working towards this path, something is probably wrong.

Steve again: (Just to be crystal-clear, I agree that this two-stage story sounds pretty great, if we can make it happen. Here I’m questioning whether it would happen, under the given assumptions.)

I’m skeptical of this story—or at least confused. It seems like the AGI would be unhappy about (post)humanity’s decision to throw out their own option value by purchasing T in stage 2. Maybe in stage 2, the AGI is no longer able to do anything about it—it’s too late, the posthumans are super-powerful and thus back in control of their own fate. But it’s not too late in stage 1! And even in stage 1, the AGI will see this “problem” coming, and so it can and will preemptively solve it.

Jacob responds: Imagine for example that mass uploading will become feasible in 2048 (with AGI’s help), and we created the AGI to maximize our empowerment—in 2048. The AGI will then not care how we spend that empowerment in 2049. Now generalize that to a continuous empowerment schedule with a learned discount rate and learned short term utility, and we can avoid issues with the AGI changing our minds too much before handing over power.

Steve again: OK I agree that an AGI with the stable goal of “maximize human empowerment in 2048” would not have the specific problem I brought up here.
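The difference between the two objectives can be sketched in a toy model, borrowing the spend-versus-hoard example from footnote 5. Everything here is a made-up illustration: “empowerment” is crudely equated with wealth, spending converts all wealth into T, and the AGI simply vetoes or permits spending according to its score function. When the AGI is indifferent, the humans are assumed to do what they want, which is to spend.

```python
def step(wealth, action, growth=1):
    """Return (new_wealth, T_gained). 'spend' converts all wealth into T."""
    if action == "spend":
        return 0, wealth
    return wealth + growth, 0


def run(agi_score, horizon=10):
    """Simulate `horizon` steps; agi_score(t, next_wealth) is the AGI's objective."""
    wealth, total_t = 1, 0
    for t in range(horizon):
        scores = {a: agi_score(t, step(wealth, a)[0]) for a in ("hoard", "spend")}
        if scores["hoard"] == scores["spend"]:
            action = "spend"  # AGI is indifferent, so humans cash in
        else:
            action = max(scores, key=scores.get)
        wealth, gained = step(wealth, action)
        total_t += gained
    return wealth, total_t


def rolling(t, next_wealth):
    """Always cares about empowerment one step later ("empowered-later")."""
    return next_wealth


def fixed_date(date):
    """Cares about empowerment only up to a fixed date, then stops caring."""
    def score(t, next_wealth):
        return next_wealth if t < date else 0
    return score


print(run(rolling))        # (11, 0): wealth piles up, T is never purchased
print(run(fixed_date(5)))  # (0, 6): hoard until the date, then cash in for T
```

The rolling objective never lets “later” arrive, while the fixed-date objective hands over control once the date passes. Whether a learned discount schedule behaves more like the first or the second case is exactly what is in dispute.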

Thus, for example, as the AGI is going through the process of “uplifting” the humans to posthumans, it would presumably do so in a way that deletes the human desire for T and adds a direct posthuman desire for P. Right?

Jacob responds: Doubtful—that would only occur if you had no short term model of T and also a too loose conception of ‘humanity’ to empower.

Second (see here and here), Jacob notes that evolution was optimizing for inclusive genetic fitness, and got some amount of T incidentally. So maybe an AGI optimizing for P will also incidentally produce T. Or even better: maybe T just is what happens when an optimization process pursues P! Or in Jacob’s words:

Humans and all our complex values are the result of evolutionary optimization for a conceptually simple objective: inclusive fitness. A posthuman society transcends biology and inclusive fitness no longer applies. What is the new objective function for post-biological evolution? Post humans are still intelligent agents with varying egocentric objectives and thus still systems for which the behavioral empowerment law applies. So the outcome is a natural continuation of our memetic/​cultural/​technological evolution which fills the lightcone with a vast and varied complex cosmopolitan posthuman society. The values that deviate from empowerment are near exclusively related to sex which no longer serves any direct purpose, but could still serve fun and thus empowerment. Reproduction still exists but in a new form. Everything that survives or flourishes tends to do so because it ultimately serves the purpose of some higher level optimization objective.

I think there’s a Goodhart’s law problem here.

People intrinsically like fun and beauty and friendship—they’re part of the T. But simultaneously, it turns out that they serve as an approximation to human empowerment—they’re a proxy to P (see Section 1.3.4). That’s reassuring, right? No it’s not, thanks to Goodhart’s law.

I claim that if an AGI was really good at optimizing P, it would find places where fun and beauty and friendship come apart from P, and then make sure that the posthumans’ actual desire in those cases is for P, and not for fun and beauty and friendship. And the more we push into weird out-of-distribution futures, the more likely this is to happen.
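As a minimal sketch of this Goodhart dynamic, suppose a single number x stands for how far the world has been pushed from today’s distribution. The functional forms below are invented purely for illustration: P grows without bound in x, while the things we intrinsically value track P in the familiar regime and then fall away.

```python
def empowerment_P(x):
    """What the AGI optimizes (hypothetical form: grows without bound)."""
    return x


def valued_T(x):
    """Fun/beauty/friendship (hypothetical form: tracks P only for small x)."""
    return x - 0.1 * x ** 2


def optimize_P(reach, steps=1000):
    """Pick the P-maximizing point among candidates in [0, reach]."""
    candidates = [reach * i / steps for i in range(steps + 1)]
    return max(candidates, key=empowerment_P)


for reach in (1, 3, 5, 10, 20):
    x = optimize_P(reach)
    print(f"reach={reach:>2}  P={empowerment_P(x):6.2f}  T={valued_T(x):7.2f}")
```

With a weak optimizer (small reach), pushing P up also pushes T up, so the two look interchangeable. With a strong optimizer, P keeps climbing while T peaks and then goes negative. The correlation observed in the ordinary regime says nothing about the extreme one.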

Jacob responds: Empowerment is a convergent efficient universal long term value approximator that any successful AGI will end up using due to the difficulties in efficiently optimizing directly for very specific values in the long term future from issues like accumulating uncertainty and the optimizer’s curse. The real question then is whether the AGI is optimizing for its own empowerment, or ours.

Weird-out-of-distribution futures are exactly the scenarios where it’s important that the AGI is optimizing for our empowerment rather than its own.

The AGI will probably not replace our desire for fun/beauty/friendship with P because of some combination of 1.) direct approximation of T (fun/beauty/friendship) for short term utility, 2.) a conservative model of ‘humanity’ to empower that prevents changing humans too much (which is necessary for any successful scheme regardless, as otherwise the AGI just assimilates us into itself to make optimizing for its self-empowerment equivalent to optimizing for ‘our’ values simply by redefining/changing us), 3.) control over the discount schedule.

For example, maybe some clever futuristic system of smart contracts is objectively much better at managing interpersonal coordination and trade than the old-fashioned notion of “trust and friendship”. And if the AGI sets up this smart-contract system, while simultaneously making (post)humans feel no intrinsic trust-and-friendship-related feelings and drives whatsoever, maybe those (post)humans would be “more empowered”. But I don’t care! That’s still bad! I still don’t want the AGI to do that! I want the feelings of trust and friendship to survive into the distant future!

Anyway, I don’t really know what a maximal-P future looks like. (I’m not sure that, in our current state of knowledge, P is defined well enough to answer that??) But my strong expectation is that it would not look like a complex cosmopolitan posthuman society. Maybe it would look like a universe full of computronium and machinery, working full-time to build even more computronium and machinery.

Third (from here),

“Empowerment is only a good bound of the long term component of utility functions, for some reasonable future time cutoff defining ‘long term’. But I think modelling just the short term component of human utility is not nearly as difficult as accurately modelling the long term, so it’s still an important win. I didn’t investigate that much in the article, but that is why the title is now “Empowerment is (almost) all we need”.”

OK, well, insofar as I’m opposed to empowerment, I naturally think “empowerment + other stuff” is a step in the right direction! :) However, my hunch is that for a sufficiently good choice of “other stuff”, the “empowerment” part will be rendered unnecessary or counterproductive. It seems likely that, if the future goes well, the AGI will facilitate human empowerment at the end of the day, but maybe it can do so because the AGI ultimately wants to maximize human flourishing, and it can reason that increasing human empowerment is instrumentally useful towards that end, for example.

Another thing: Jacob writes, “no matter what your values are, optimizing for your empowerment today is identical to optimizing for your long term values today.” I think that kind of thinking is a bit confused. I reject the idea that if the AGI is making good decisions right now, then all is well. As mentioned above, if the AGI is motivated to manipulate human values, that motivation might only manifest in the AGI’s behavior way down the line, like when the AGI is uploading human brains but deleting the parts that entail an intrinsic desire for anything besides power. But while that problem will only manifest in the distant future, the time to solve it is right at the beginning, when we’re building the AGI and thus still have direct control over its motivations.

4. Simboxes

Jacob is a big fan of “simulation sandboxes”, which he calls “simboxes” for short. These are air-gapped virtual worlds which serve as environments in which you can train an AGI. See Jacob’s recent post LOVE in a simbox is all you need, section 5.

Jacob is optimistic about being able to set up simboxes such that the AGI-under-test does not escape (mainly because it doesn’t know it’s in a simbox, or even what a simbox is—as he writes, “these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc.”), and Jacob is also optimistic that these tests will allow us to iterate our way to AGI safety /​ alignment.

While I’m much less optimistic than Jacob about achieving both those things simultaneously, my very important take-home message is: I think simbox testing is an excellent idea. I think we should not only be doing simbox testing in the endgame, but we should be working right now to build infrastructure and culture that makes future simbox testing maximally easy and safe and effective, and maximally likely to actually happen, not just a little but a lot. (Just like every other form of code testing and validation that we can think of.) We should also be working right now to think through exactly what simbox tests to run and how. I even previously included one ingredient of the path-to-simbox-testing—namely, feature-rich user-friendly super-secure sandbox software compatible with large-scale ML—as a Steve-endorsed shovel-ready AGI safety project on my list here.

Having said all that, I think we should mainly think of simbox testing as “an extra layer of protection” on top of other reasons to expect safe and beneficial AGI.

Specifically, I proposed in this comment two ways to think about what the simbox test is doing:

  • A. We’re going to have strong theoretical reasons to expect alignment, and we’re going to use simbox testing to validate those theories.

  • B. We’re going to have an unprincipled approach that might or might not create aligned models, and we’re going to use simbox testing to explore /​ tweak specific trained models and/​or explore /​ tweak the training approach.

A is good. B is problematic, for reasons I’ll get to shortly.

But first, I want to emphasize that I see this A-vs-B distinction as a continuum, not a binary. There’s a whole spectrum from “unprincipled approach” to “strong theoretical reasons to expect alignment”, as we get a progressively more specific and fleshed-out story underlying why we expect our AGI to be aligned. For example:

  • All the way at the extreme of “strong theoretical reasons to expect alignment” would be Vanessa Kosoy’s research program working towards a rigorous mathematical proof of AGI safety (which I’m pessimistic about, but I wish her luck!).

  • All the way at “unprincipled” would be just doing capabilities research, not thinking about alignment at all, and seeing what happens with the trained models at the end. Ajeya Cotra’s “human feedback on diverse tasks” would be basically in that category.

  • Somewhere in between these two extremes would be, say, Alex Turner’s recent diamond-alignment post, where we engage in speculation about what the “baby AGI” is probably thinking about in different situations, and then try to send reward signals at carefully-chosen times to seed desired motivations. Or my toy example proposal here to make an AGI that learns the abstract concept “human flourishing” from observations, and then tries to maximize the extent to which its beliefs pattern-match to that abstract concept. These proposals may well fail, for sure, but at least we’re not totally in the dark when it comes to anticipating where and how they might fail, and what tests might help us figure that out.

In terms of simbox use strategy, I think “somewhere in between A and B” is all I’m hoping for, and I consider my research goal to be trying to get as close to A as possible.

Jacob’s response was: “As for A vs B: ideally you may want A but you settle mostly for B. That’s just how the world often works, how DL progressed, etc. We now have more established theory of how DL works as approx bayesian inference, but what actually drove most progress was B style tinkering.”

I think Jacob is selling himself short here. I think his simbox plan has a lot of “A” in it. I think Jacob has pretty specific ideas in mind for how alignment is going to happen and how it could fail, and these ideas are informing his picture of what kind of simbox testing is most useful, and what we would be looking for, etc.

By the way, what’s the problem with B? The problem is that the simboxes will be different from reality in lots of ways. For example, Jacob proposes “these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc.” Well, that’s a great idea if we want to prevent the AGI from escaping the sim! But that’s a terrible idea if we want to avoid any distribution-shift between the simboxes and reality! (Cf. “ontological crisis”.) And if there’s any distribution-shift, then there’s the possibility that the same training procedure will produce aligned AGIs in the simboxes and misaligned AGIs in reality.
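A loosely analogous toy from ordinary machine learning shows how this can go wrong. In the sketch below, a deliberately unprincipled “training procedure” keeps whichever single cue best predicts the label in its training environment. The data, the cue names, and the learner are all hypothetical. In the simbox, a cue that exists only because of how the simbox was built happens to be perfectly predictive, so the learner latches onto it and passes every simbox test.

```python
def accuracy(feature, examples):
    """Fraction of examples where the chosen cue equals the label."""
    return sum(x[feature] == y for x, y in examples) / len(examples)


def train(examples):
    """Unprincipled procedure: keep whichever single cue fits training data best."""
    n_features = len(examples[0][0])
    return max(range(n_features), key=lambda i: accuracy(i, examples))


# Each example is ((intended_cue, simbox_artifact_cue), label).
simbox = [
    ((1, 1), 1), ((1, 1), 1), ((1, 1), 1), ((0, 1), 1),
    ((0, 0), 0), ((0, 0), 0), ((0, 0), 0), ((1, 0), 0),
]
# In reality the artifact cue is always on, so it carries no information.
reality = [
    ((1, 1), 1), ((1, 1), 1), ((1, 1), 1), ((0, 1), 1),
    ((0, 1), 0), ((0, 1), 0), ((0, 1), 0), ((1, 1), 0),
]

learned = train(simbox)
print(learned)                     # 1: the learner picked the artifact cue
print(accuracy(learned, simbox))   # 1.0: looks perfect under simbox testing
print(accuracy(learned, reality))  # 0.5: chance level after the shift
```

The point is not that simbox testing is useless, but that passing simbox tests only tells us about the failure modes the simbox actually probes.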

Jacob responds: The distribution shift from humans born in 0AD to humans born in 2000AD seems fairly inconsequential for human alignment. Indeed, any useful AGI alignment mechanism should be at least as robust as human brains under such mild distribution shifts. Regardless, we can use various analogs of technological concepts if needed.

Luckily, this problem is progressively less problematic as we move from “B” towards “A”. Then we have some understanding of possible failure modes, and we can ensure that those failure modes are being probed by our simboxes.

(However, on my models, right now we are NOT close enough to “A” that all the remaining failure modes can be simbox-tested. For example, the distribution shift from “agents that are unaware of the concept of computation” to “agents that are aware of the concept of computation” is fraught with danger, difficult to reason about in our current state of knowledge (see my discussion of “concept extrapolation” here), and risky to probe in simboxes. So we still have lots more simbox-unrelated work to do, in parallel with the important simbox-prep work.)

(Thanks Jacob for bearing with me through lots of discussion over the past months, and for leaving comments above. Thanks also to Linda Linsefors & Alex Turner for critical comments on an earlier draft.)

  1. ^

    I say “more or less” because I think Jacob and I have some disagreements about the “neuroscience of novelty and curiosity” literature. For example, I think there’s a theory relating serotonin to information value, which Jacob likes and I dislike. But leaving aside those details, I am strongly on board with the more basic idea that the brain has an innate curiosity drive of some sort or another, and right now I don’t have much of a specific take on how it works.

  2. ^

    In addition to the direct effects of F (“I like looking at X because F(X) is high”), there could also be indirect effects of F (“I like looking at X because it pattern-matches to /​ reminds me of Y, which I like, and oh by the way the reason I like Y is because F(Y) was high when I looked at it as a child”). See discussion of “correlation-guided proxy matching” below.

  3. ^

    It’s not that this is unknowable, but I think figuring it out would require a heroic effort and/​or detailed connectomic data about the human superior colliculus (and maybe also the neighboring parabigeminal nucleus). And someone should totally do that!! I would be very grateful!!

  4. ^

    UPDATE: Just to be clear, I don’t have an opinion on the specific question of whether or not humans have innate visual “sexiness”-related heuristics. I do think there has to be something that solves the “symbol grounding” problem, but I’m not confident that it’s even partly visual. It could alternatively involve the sense of smell, and/​or empathetic simulation of body shape and sensations (vaguely along these lines but involving the proprioceptive and somatosensory systems). Or maybe it is visual, I don’t know.

  5. ^

    There’s a weird dynamic here in that I’m saying that an AGI which supposedly wants humanity to be empowered would be motivated to prevent humanity from exercising its power. Isn’t that contradictory? I think the way to square that circle is that the proposal as I understand it is for the AGI to want humanity to be empowered later—to eventually wind up empowered. However, there’s a tradeoff between empowerment-now and empowerment-later. If I’m empowered-now, then I can choose NOT to be empowered-later—e.g., by spending my money instead of hoarding it. Or jumping off a cliff. Therefore an AGI that always wants humanity to be empowered-later is an AGI that never wants humanity to be empowered-now. So then the “later” never arrives—not even at the end of the universe!!