[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

6.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

Thus far in the series, Post #1 set out some definitions and motivations (what is “brain-like AGI safety” and why should we care?), and Posts #2 & #3 split the brain into a Learning Subsystem (telencephalon and cerebellum) that “learns from scratch” using learning algorithms, and a Steering Subsystem (hypothalamus and brainstem) that is mostly genetically-hardwired and executes innate species-specific instincts and reactions.

Then in Post #4, I talked about the “short-term predictor”, a circuit which learns, via supervised learning, to predict a signal in advance of its arrival, but only by perhaps a fraction of a second. Post #5 then argued that if we form a closed loop involving both a set of short-term predictors in the Learning Subsystem and a corresponding set of hardwired circuits in the Steering Subsystem, we can get a “long-term predictor”. I noted that the “long-term predictor” circuit is closely related to temporal difference (TD) learning.

Now in this post, we fill in the last ingredients—roughly the “actor” part of actor-critic reinforcement learning (RL)—to get a whole big picture of motivation and decision-making in the human brain. (I’m saying “human brain” to be specific, but it would be a similar story in any other mammal, and to a lesser extent in any vertebrate.)

The reason I care about motivation and decision-making is that if we eventually build brain-like AGIs (cf. Post #1), we’ll want to build them so that they have some motivations (e.g. being helpful) and not others (e.g. escaping human control and self-reproducing around the internet). Much more on that topic in later posts.

Teaser for upcoming posts: The next post (#7) will walk through a concrete example of the model in this post, where we can watch an innate drive lead to the formation of an explicit goal, and adoption and execution of a plan to accomplish it. Then starting in Post #8 we’ll switch gears, and from then on you can expect substantially less discussion of neuroscience and more discussion of AGI safety (with the exception of one more neuroscience post towards the end).

Unless otherwise mentioned, everything in this post is “things that I believe right now”, as opposed to neuroscience consensus. (Pro tip: there is never a neuroscience consensus.) Relatedly, I will make minimal effort to connect my hypotheses to others in the literature, but I’m happy to chat about that in the comments section or by email.

Table of contents:

  • In Section 6.2, I’ll present a big picture of motivation and decision-making in the human brain, and walk through how it works. The rest of the post will go through different parts of that picture in more detail. If you’re in a hurry, I suggest reading to the end of Section 6.2 and then quitting.

  • In Section 6.3, I’ll talk about the so-called “Thought Generator”, comprising (I think) the dorsolateral prefrontal cortex, sensory cortex, and other areas. (For ML readers familiar with “actor-critic model-based RL”, the Thought Generator is more-or-less a combination of the “actor” and the “model”.) I’ll talk about the inputs and outputs of this module, and briefly sketch how its algorithm relates to neuroanatomy.

  • In Section 6.4, I’ll talk about how values and rewards work in this picture, including the reward signal that drives learning and decision-making in the Thought Generator.

  • In Section 6.5, I’ll go into a bit more detail about how and why thinking and decision-making need to involve not only simultaneous comparisons (i.e., a mechanism for generating different options in parallel and selecting the most promising one), but also sequential comparisons (i.e., thinking of something, then thinking of something else, and comparing those two thoughts). For example, you might think: “Hmm, I think I’ll go to the gym. Actually, what if I went to the café instead?”

  • In Section 6.6, I’ll comment on the common misconception that the Learning Subsystem is the home of ego-syntonic, internalized “deep desires”, whereas the Steering Subsystem is the home of ego-dystonic, externalized “primal urges”. I will advocate more generally against thinking of the two subsystems as two agents in competition; a better mental model is that the two subsystems are two interconnected gears in a single machine.

6.2 Big picture

Yes, this is literally a big picture, unless you’re reading on your cell phone. You saw a chunk of it in the previous post (Section 5.4), but now there are a few more pieces.

The big picture—The whole post will revolve around this diagram. Note that the bracketed neuroanatomy labels are a bit oversimplified.

There’s a lot here, but don’t worry, I’ll walk through it bit by bit.

6.2.1 Relation to “two subsystems”

Here’s how this diagram fits in with my “two subsystems” perspective, first discussed in Post #3:

Same as above, but the two subsystems are highlighted in different colors.

6.2.2 Quick run-through

Before getting bogged down in details later in the post, I’ll just talk through the diagram:

1. Thought Generator generates a thought: The Thought Generator settles on a “thought”, out of the high-dimensional space of every thought you can possibly think at that moment. Note that this space of possibilities, while vast, is constrained by current sensory input, past sensory input, and everything else in your learned world-model. For example, if you’re sitting at a desk in Boston, it’s generally not possible for you to think that you’re scuba-diving off the coast of Madagascar. But you can make a plan, or whistle a tune, or recall a memory, or reflect on the meaning of life, etc.

2. Thought Assessors distill the thought into a “scorecard”: The Thought Assessors are a set of perhaps hundreds or thousands of “short-term predictor” circuits (Post #4), which I discussed more specifically in the previous post (#5). Each predictor is trained to predict a different signal from the Steering Subsystem. From the perspective of a Thought Assessor, everything in the Thought Generator (not just outputs but also latent variables) is context—information that they can use to make better predictions. Thus, if I’m thinking the thought “I’m going to eat candy right now”, a Thought Assessor can predict “high probability of tasting something sweet very soon”, based purely on the thought—it doesn’t need to rely on either external behavior or sensory inputs, although those can be relevant context too.

3. The “scorecard” solves the interface problem between a learned-from-scratch world model and genetically-hardwired circuitry: Remember, the current thought and situation is an insanely complicated object in a high-dimensional learned-from-scratch space of “all possible thoughts you can think”. Yet we need the relatively simple, genetically-hardwired circuitry of the Steering Subsystem to analyze the current thought, including issuing a judgment of whether the thought is high-value or low-value (see Section 6.4 below), and whether the thought calls for cortisol release or goosebumps or pupil-dilation, etc. The “scorecard” solves that interfacing problem! It distills any possible thought / belief / plan / etc. into a genetically-standardized form that can be plugged directly into genetically-hardcoded circuitry.

4. The Steering Subsystem runs some genetically-hardwired algorithm: Its inputs are (1) the scorecard from the previous step and (2) various other information sources—pain, metabolic status, etc., all coming from its own brainstem sensory-processing system (see Post #3, Section 3.2.1). Its outputs include emitting hormones, motor commands, etc., as well as sending the “ground truth” supervisory signals shown in the diagram.[1]

5. The Thought Generator keeps or discards thoughts based on whether the Steering Subsystem likes them: More specifically, there’s a ground-truth value (a.k.a. reward, yes I know those don’t sound synonymous, see Post #5, Section 5.3.1). When the value is very positive, the current thought gets “strengthened”, sticks around, and can start controlling behavior and summoning follow-up thoughts, whereas when the value is very negative, the current thought gets immediately discarded, and the Thought Generator summons a new thought instead.

6. Both the Thought Generator and the Thought Assessors “learn from scratch” over the course of a lifetime, thanks in part to these supervisory signals from the Steering Subsystem. Specifically, the Thought Assessors learn to make better and better predictions of their “ground truth in hindsight” signal (a form of Supervised Learning—see Post #4), while the Thought Generator learns to disproportionately generate high-value thoughts. (The Thought Generator learning-from-scratch process also involves predictive learning of sensory inputs—Post #4, Section 4.7.)
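For readers who find pseudocode easier to parse than box-and-arrow diagrams, here’s a deliberately-toy Python sketch of the loop above. Everything specific in it (the vector sizes, the particular update rules, names like ThoughtGenerator and needs) is my own illustrative invention, not a claim about neural implementation details; the only point is to show how the “scorecard” acts as a fixed-size interface between a learned-from-scratch module and a genetically-hardwired one. The Thought Assessors are treated as already-trained here; their supervised learning gets its own little sketch in Section 6.3.3 below.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LATENT = 50      # dimensionality of a "thought" (latent variable activations)
N_ASSESSORS = 8    # number of Thought Assessors, i.e. entries on the "scorecard"

class ThoughtGenerator:
    """Learned-from-scratch module: proposes thoughts, reinforced by ground-truth value."""
    def __init__(self):
        self.preference = np.zeros(N_LATENT)   # learned bias toward high-value thoughts

    def generate(self, sensory_constraint):
        # A thought is constrained by sensory input and biased by past reinforcement.
        return sensory_constraint + self.preference + rng.normal(size=N_LATENT)

    def learn(self, thought, value, lr=0.01):
        # Crude RL rule: shift toward thoughts that got positive value, away otherwise.
        self.preference += lr * value * (thought - self.preference)

class ThoughtAssessors:
    """Bank of short-term predictors; pretend they are already trained here
    (their supervised learning is sketched separately in Section 6.3.3)."""
    def __init__(self):
        self.W = rng.normal(size=(N_ASSESSORS, N_LATENT))

    def scorecard(self, thought):
        # Distill an arbitrary thought into a fixed, genetically-standardized format.
        return self.W @ thought

def steering_subsystem(scorecard, interoceptive_state):
    """Hardwired module: never sees the thought itself, only the scorecard plus its own
    inputs (pain, metabolic status, ...). Emits the ground-truth value (a.k.a. reward)."""
    return float(interoceptive_state @ scorecard)   # toy hardwired weighting by "needs"

tg, ta = ThoughtGenerator(), ThoughtAssessors()
needs = rng.normal(size=N_ASSESSORS)                 # e.g. hunger, salt-deprivation, ...
for _ in range(200):
    sensory = rng.normal(size=N_LATENT)
    thought = tg.generate(sensory)                   # step 1: generate a thought
    card = ta.scorecard(thought)                     # steps 2-3: distill to a scorecard
    value = steering_subsystem(card, needs)          # step 4: hardwired judgment
    tg.learn(thought, value)                         # step 6: RL on the generator
    if value < 0:
        thought = tg.generate(sensory)               # step 5: discard and re-roll
    # a positive-value thought "sticks around" and can start driving behavior
```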

6.3 The “Thought Generator”

6.3.1 Overview

Go back to the big-picture diagram at the top. At the top-left, we find the Thought Generator. In terms of actor-critic model-based RL, the Thought Generator is roughly a combination of “actor” + “model”, but not “critic”. (“Critic” was discussed in the previous post, and more on it below.)

At our somewhat-oversimplified level of analysis, we can think of the “thoughts” generated by the Thought Generator as a combination of constraints (from predictive learning of sensory inputs) and choices (guided by reinforcement learning). In more detail:

  • Constraints on the Thought Generator come from sensory input information, and ultimately from predictive learning of sensory inputs (Post #4, Section 4.7). For example, I cannot think the thought: There is a cat on my desk and I’m looking at it right now. There is no such cat, regrettably, and I can’t just will myself to see something that obviously isn’t there. I can imagine seeing it, but that’s not the same thought.

  • But within those constraints, there’s more than one possible thought my brain can think at any given time. It can call up a memory, it can ponder the meaning of life, it can zone out, it can issue a command to stand up, etc. I claim that these “choices” are decided by a reinforcement learning (RL) system. This RL system is one of the main topics of this post.
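As a toy illustration of that constraints-versus-choices split, here’s one (made-up) way it could look in code: a hard veto on candidate thoughts that contradict the current sensory input, followed by a learned, value-weighted choice among whatever survives. The function names and the set-valued “percepts” are hypothetical conveniences, not a proposal about the actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def is_consistent(thought, sensory_input):
    # Stand-in for predictive learning of sensory inputs: a thought that asserts
    # "there is a cat on my desk right now" is vetoed if no cat shows up in the input.
    return thought["asserted_percepts"] <= sensory_input    # subset check

def choose_thought(candidates, sensory_input, learned_value, temperature=1.0):
    # Constraints: discard candidate thoughts that contradict sensory input.
    consistent = [t for t in candidates if is_consistent(t, sensory_input)]
    # Choices: sample among the survivors, biased toward learned (RL-trained) value.
    scores = np.array([learned_value(t) for t in consistent])
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return consistent[rng.choice(len(consistent), p=probs)]

# Hypothetical usage:
candidates = [
    {"name": "imagine seeing a cat",       "asserted_percepts": set()},
    {"name": "there is a cat on my desk",  "asserted_percepts": {"cat-on-desk"}},
    {"name": "plan tomorrow's errands",    "asserted_percepts": set()},
]
sensory_input = {"desk", "laptop"}                 # no cat, regrettably
value_of = lambda t: 1.0 if "plan" in t["name"] else 0.0
print(choose_thought(candidates, sensory_input, value_of)["name"])
```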

6.3.2 Thought Generator inputs

The Thought Generator has a number of inputs, including sensory inputs and hyperparameter-shifting neuromodulators. But the main one of interest for this post is ground-truth value, a.k.a. reward. I’ll talk about that in more detail later, but we can think of it as an estimate of whether a thought is good or bad, operationalized as “worth sticking with and pursuing” versus “deserving to be discarded so we can re-roll for a new thought”. This signal is important both for learning to think better thoughts in the future, and for thinking good thoughts right now.

6.3.3 Thought Generator outputs

Meanwhile, there are a lot of signals going out of the Thought Generator. Some are what we intuitively think of as “outputs”—e.g., skeletal motor commands. Other outgoing signals are, well, a bit funny…

Recall the idea of “context” from Section 4.3 of Post #4: The Thought Assessors are short-term predictors, and a short-term predictor can in principle grab any signal in the brain and leverage it to improve its ability to predict its target signal. So if the Thought Generator has a world-model, then somewhere in the world-model is a configuration of latent variable activations that encode the concept “baby kittens shivering in the cold rain”. We wouldn’t normally think of those as “output signals”—I just said in the last sentence that they’re latent variables! But as it happens, the “will lead to crying” Thought Assessor has grabbed a copy of those latent variables to use as context signals, and gradually learned through experience that these particular signals are strong predictors of me crying.

Now, as an adult, these “baby kittens in the cold rain” neurons in my Thought Generator are living a double-life:

  • They are latent variables in my world-model—i.e., they and their web of connections will help me parse an image of baby kittens in the rain, if I see one, and reason about what would happen to them, etc.

  • Activating these neurons, e.g. via imagination, is a way for me to call up tears on command.

The Thought Generator (top left) has two types of outputs: the “traditional” outputs associated with voluntary behavior (green arrows) and the “funny” outputs wherein even latent variables in the model can directly impact involuntary behaviors (blue arrows).
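To make that “double life” concrete, here’s a toy sketch of how a single Thought Assessor might come to treat world-model latent variables as context. Everything specific (the latent-variable indices, the invented one_moment_of_experience() function, the delta-rule learning) is hypothetical; the point is just that a supervised predictor trained against a Steering-Subsystem ground-truth signal will happily latch onto latent variables, and will thereafter respond to those latents whether they’re activated by perception or by imagination.

```python
import numpy as np

rng = np.random.default_rng(2)

N_LATENT = 100
KITTEN_IDX = [3, 17, 42]   # hypothetical "baby kittens shivering in the cold rain" latents

def one_moment_of_experience():
    # Random world-model activity; whenever the kitten-related latents are active,
    # the Steering Subsystem's ground-truth "crying" signal tends to fire.
    latents = (rng.random(N_LATENT) < 0.05).astype(float)
    cried = float(latents[KITTEN_IDX].any())
    return latents, cried

# The "will lead to crying" Thought Assessor: a short-term predictor that treats every
# latent variable in the Thought Generator as context (simple delta-rule learning).
w = np.zeros(N_LATENT)
for _ in range(20000):
    latents, cried = one_moment_of_experience()
    w += 0.05 * (cried - w @ latents) * latents    # supervised learning from ground truth

# The "double life": activating the kitten latents via imagination (no sensory input
# involved) still drives the crying prediction up.
imagined = np.zeros(N_LATENT)
imagined[KITTEN_IDX] = 1.0
unrelated = np.zeros(N_LATENT)
unrelated[[5, 60, 77]] = 1.0                       # some other, non-kitten latents
print("prediction for the kitten thought:  ", w @ imagined)    # large
print("prediction for an unrelated thought:", w @ unrelated)   # near zero
```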

6.3.4 Thought Generator neuroanatomy sketch

AUTHOR’S NOTE: When I first published this blog post, this section contained a discussion and diagrams of cortico-basal ganglia-thalamo-cortical loops, but it was very speculative and turned out to be wrong in various ways. It’s not too relevant for the series anyway, so I’m deleting it. I’ll write a corrected version at some point. Sorry!

Here’s the updated dopamine diagram from the previous post:

The “mesolimbic” dopamine signals on the right were discussed in the previous post (Section 5.5.6). The “mesocortical” dopamine signal on the left is new to this post. (I think there are even more dopamine signals in the brain, not shown here. They’re off-topic for this series, but see discussion here.)

There are many more implementation details inside the Thought Generator that I’m not discussing. However, this bare-bones section is more-or-less sufficient for my forthcoming posts on AGI safety. The gory details of the Thought Generator, like the gory details of almost everything else in the Learning Subsystem, are mainly helpful for building AGI.

6.4 Values and rewards

6.4.1 The cortex proposes a “value” estimate, but the Steering Subsystem may choose to override

There are two “values” in the diagram (it looks like three, but the two red ones are the same):

Two types of “value” in my model

The blue-circled signal is the value estimate from the corresponding Thought Assessor in the cortex. The red-circled signal (again, it’s one signal drawn twice) is the corresponding “ground truth” for what the value estimate should have been. (Recall that “ground-truth value” is a synonym for “reward”; yes I know that sounds wrong, see previous post (Section 5.3.1) for discussion.)

Just like the other “long-term predictors” discussed in the previous post, the Steering Subsystem can choose between “defer-to-predictor mode” and “override mode”. In the former, it sets the red equal to the blue, as if to say “OK, Thought Assessor, sure, I’ll take your word for it”. In the latter, it ignores the Thought Assessor’s proposal, and its own internal circuitry outputs some different value.[2]

Why might the Steering Subsystem override the Thought Assessor’s value estimate? Two factors:

  • First, the Steering Subsystem might be acting on information from other (non-value) Thought Assessors. For example, in the Dead Sea Salt Experiment (see previous post, Section 5.5.5), the value estimator says “bad things are going to happen”, but meanwhile the Steering Subsystem is getting an “I’m about to taste salt” prediction in the context of a state of salt-deprivation. So the Steering Subsystem says to itself “Whatever is happening now is very promising; the value estimator doesn’t know what it’s talking about!”

  • Second, the Steering Subsystem might be acting on its own information sources, independent of the Learning Subsystem. In particular, the Steering Subsystem has its own sensory-processing system (see Post #3, Section 3.2.1), which can sense biologically-relevant cues like pain status, hunger status, taste inputs, the sight of a slithering snake, the smell of a potential mate, and so on. All these things and more can be possible bases for overruling the Thought Assessor, i.e., setting the red-circled signal to a different value than the blue-circled one.

Interestingly (and unlike in textbook RL), in the big picture, the blue-circled signal doesn’t have a special role in the algorithm, as compared to the other Thought Assessors. It’s just one of many inputs to the Steering Subsystem’s hardwired algorithm for deciding what to put into the red-circled signal. The blue-circled signal might be an especially important signal in practice, weighed more heavily than the others, but ultimately everything is in the same pot. In fact, my longtime readers will recall that last year I was writing posts that omitted the blue-circled value signal from the list of Thought Assessors! I now think that was a mistake, but I retain a bit of that same attitude.
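To pin down Section 6.4.1 a bit, here’s one made-up way the red-circled signal could be computed from the blue-circled proposal plus everything else. The specific scorecard entries (predicts_salt_taste), interoceptive variables (salt_deprivation, pain), numbers, and the defer_weight knob are all things I invented for illustration; the real Steering Subsystem presumably implements a vastly more complicated (but still hardwired) rule.

```python
def steering_subsystem_value(scorecard, interoception, defer_weight=1.0):
    # The cortex's own value estimate (the blue-circled signal) is just one entry:
    proposed_value = scorecard["value"]

    # Factor 1: information from other (non-value) Thought Assessors. Dead Sea Salt
    # Experiment: predicted salt taste while salt-deprived => great plan, whatever
    # the value estimator says.
    override = 0.0
    if scorecard["predicts_salt_taste"] > 0.5 and interoception["salt_deprivation"] > 0.8:
        override += 10.0

    # Factor 2: the Steering Subsystem's own sensory channels (pain, taste, smells, ...).
    if interoception["pain"] > 0.5:
        override -= 5.0

    # Per footnote [2], deferral needn't be all-or-nothing: mix the proposal with the
    # hardwired corrections to get the ground-truth value (the red-circled signal).
    return defer_weight * proposed_value + override

# Hypothetical Dead Sea Salt moment: the value estimator says "bad things are coming",
# but the hardwired circuitry overrules it.
scorecard = {"value": -2.0, "predicts_salt_taste": 0.9}
interoception = {"salt_deprivation": 0.95, "pain": 0.0}
print(steering_subsystem_value(scorecard, interoception))   # strongly positive
```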

6.5 Decisions involve not only simultaneous but also sequential comparisons of value

Here’s a “simultaneous” model of decision-making, as described in The Hungry Brain by Stephan Guyenet in the context of studies on lamprey fish:

Each region of the pallium [= lamprey equivalent of cortex] sends a connection to a particular region of the striatum, which (via other parts of the basal ganglia) returns a connection back to the same starting location in the pallium. This means that each region of the pallium is reciprocally connected with the striatum via a specific loop that regulates a particular action…. For example, there’s a loop for tracking prey, a loop for fleeing predators, a loop for anchoring to a rock, and so on. Each region of the pallium is constantly whispering to the striatum to let it trigger its behavior, and the striatum always says “no!” by default. In the appropriate situation, the region’s whisper becomes a shout, and the striatum allows it to use the muscles to execute its action.

I endorse this as part of my model of decision-making, but only part of it. Specifically, this is one of the things that’s happening when the Thought Generator generates a thought. Different simultaneous possibilities are being compared.
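Here’s a minimal sketch of that “simultaneous” competition, under the invented assumption that each loop submits a scalar bid and the striatum disinhibits at most one winner above a threshold; the behaviors and numbers are just for illustration.

```python
def striatum_gate(bids, threshold=0.7):
    # Every loop constantly bids for access to the muscles; the striatum says "no" by
    # default, and disinhibits only the strongest bid, and only if it exceeds threshold.
    winner = max(bids, key=bids.get)
    return winner if bids[winner] > threshold else None

# Hypothetical bids from a lamprey's behavior loops at one moment:
bids = {"track prey": 0.4, "flee predator": 0.9, "anchor to rock": 0.2}
print(striatum_gate(bids))   # -> "flee predator"
```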

The other part of my model is comparisons of sequential thoughts. You think a thought, and then you think a different thought (possibly very different, or possibly a refinement of the first). The two are implicitly compared, for example by the Steering Subsystem picking a ground-truth value based on the temporal dynamics of the Thought Assessors jumping up and down. If the second thought is worse, it gets weakened such that a new thought can replace it (and that new thought might be the first thought re-establishing itself).

I could cite experiments for the sequential-comparison aspect of decision-making (e.g. Figure 5 of this paper, which is arguing the same point as I am), but do I really need to? Introspectively, it’s obvious! You think: “Hmm, I think I’ll go to the gym. Actually, what if I went to the café instead?” You’re imagining one thing, and then another thing.

And I don’t think this is a humans-vs-lampreys thing. My hunch is that comparison of sequential thoughts is universal in vertebrates. As an illustration of what I mean:

6.5.1 Made-up example of what comparison-of-sequential-thoughts might look like in a simpler animal

Imagine a simple, ancient, little fish swimming along, navigating to the cave where it lives. It gets to a fork in the road, ummm, “fork in the kelp forest”? Its current navigation plan involves continuing left to its cave, but it also has the option of turning right to go to the reef, where it often forages.

Seeing this path to the right, I claim that its navigation algorithm reflexively loads up a plan: “I will turn right and go to the reef.” Immediately, this new plan is evaluated and compared to the old plan. If the new plan seems worse than the old plan, then the new thought gets shut down, and the old thought (“I’m going to my cave”) promptly reestablishes itself. The fish continues to its cave, as originally planned, without skipping a beat. Whereas if instead the new plan seems better than the old plan, then the new plan gets strengthened, sticks around, and orchestrates motor commands. And thus the fish turns to the right and goes to the reef instead.

(In reality, I don’t know much about little ancient fish, but rats at a fork in a maze are known to imagine both possible navigation plans in succession, based on measurements of hippocampus neurons—ref.)
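Here’s the fish story in a few lines of (made-up) Python, just to pin down what I mean by comparison of sequential thoughts: only one plan is active at a time, and a candidate plan either displaces it or gets shut down. The evaluate function is a stand-in for the entire Thought-Assessors-plus-Steering-Subsystem machinery, and the hunger numbers are arbitrary.

```python
def sequential_comparison(current_plan, candidate_plan, evaluate):
    # The candidate is loaded, evaluated, and either takes over (if it assesses as
    # better) or gets shut down so the old plan can re-establish itself.
    if evaluate(candidate_plan) > evaluate(current_plan):
        return candidate_plan    # new plan strengthens, sticks around, drives behavior
    return current_plan          # old plan re-establishes itself, no beat skipped

def make_evaluator(hunger):
    # Stand-in for the whole Thought-Assessors-plus-Steering-Subsystem pipeline.
    values = {"continue to my cave": 1.0, "turn right, go to the reef": hunger}
    return lambda plan: values[plan]

print(sequential_comparison("continue to my cave", "turn right, go to the reef",
                            make_evaluator(hunger=0.2)))   # stays with the cave plan
print(sequential_comparison("continue to my cave", "turn right, go to the reef",
                            make_evaluator(hunger=2.0)))   # switches to the reef
```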

6.5.2 Comparison-of-sequential-thoughts: why it’s necessary

In my view, thoughts are complicated. To think the thought “I will go to the café”, you’re not just activating some tiny cluster of dedicated go-to-the-café neurons. Instead, it’s a distributed pattern involving practically every part of the cortex. You can’t simultaneously think “I will go to the café” and “I will go to the gym”, because they would involve different activity patterns of the same pools of neurons. They would cross-talk. Thus, the only possibility is thinking the thoughts in sequence.

As a concrete example of what I have in mind, think of how a Hopfield network can’t recall two different memories simultaneously. It has multiple stable states, but you can only explore them sequentially, one after the other. Or think about grid cells and place cells, etc.
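If you want to see the cross-talk problem directly, here’s a minimal Hopfield-network demo (standard Hebbian storage, synchronous sign updates; all parameters arbitrary). Two random patterns stand in for the café thought and the gym thought; initialize the network at a superposition of the two, let it settle, and it falls into (approximately) one of them rather than holding both at once.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 200
cafe = rng.choice([-1, 1], size=N)    # stored pattern #1: "I will go to the café"
gym = rng.choice([-1, 1], size=N)     # stored pattern #2: "I will go to the gym"

# Standard Hebbian storage of both memories in one weight matrix.
W = (np.outer(cafe, cafe) + np.outer(gym, gym)) / N
np.fill_diagonal(W, 0)

def settle(state, steps=20):
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1         # break exact ties
    return state

# Try to "think both thoughts at once" by starting from their superposition:
blend = np.sign(cafe + gym + 0.3 * rng.normal(size=N))
final = settle(blend)
print("overlap with cafe:", abs(final @ cafe) / N)
print("overlap with gym: ", abs(final @ gym) / N)
# Typically one overlap comes out near 1.0 and the other near 0: the same pool of
# units has fallen into ONE memory, rather than holding both at full strength.
```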

6.5.3 Comparison-of-sequential-thoughts: how it might have evolved

From an evolutionary perspective, I imagine that comparison-of-sequential-thoughts is a distant descendant of a very simple mechanism akin to the run-and-tumble mechanism in swimming bacteria.

In the run-and-tumble mechanism, a bacterium swims in a straight line (“runs”), and periodically changes to a new random direction (“tumbles”). But the trick is: when the bacterium’s situation / environment is getting better, it tumbles less frequently, and when it’s getting worse, it tumbles more frequently. Thus, it winds up moving in a good direction (on average, over time).
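For concreteness, here’s a toy 2-D run-and-tumble simulation (all numbers arbitrary): swim straight, occasionally pick a new random heading, but tumble less often when the “nutrient” reading is improving. That one trick is enough to drift up the gradient on average, with no world-model, no plan, and no explicit comparison of options.

```python
import numpy as np

rng = np.random.default_rng(4)

def run_and_tumble(n_steps=2000, base_tumble_prob=0.2):
    def nutrient(p):
        return -np.linalg.norm(p - np.array([50.0, 0.0]))   # concentration peaks at (50, 0)

    pos = np.zeros(2)
    heading = rng.uniform(0, 2 * np.pi)
    last_reading = nutrient(pos)
    for _ in range(n_steps):
        pos += np.array([np.cos(heading), np.sin(heading)])       # "run" one unit forward
        reading = nutrient(pos)
        # The whole trick: tumble less often when the reading is improving.
        tumble_prob = base_tumble_prob * (0.2 if reading > last_reading else 2.0)
        if rng.random() < tumble_prob:
            heading = rng.uniform(0, 2 * np.pi)                   # "tumble" to a new heading
        last_reading = reading
    return pos

print(run_and_tumble())   # ends up far closer to (50, 0) than an unbiased walk would
```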

Starting with a simple mechanism like that, one can imagine adding progressively more bells and whistles. The palette of behavioral options can get more and more complex, eventually culminating in “every thought you can possibly think”. The methods of evaluating whether the current plan is good or bad can get faster and more accurate, eventually involving learning-algorithm-based predictors as in the previous post. The new behavioral options to tumble into can be picked via clever learning algorithms, rather than randomly. Thus, it seems to me that there’s a smooth path all the way from something-akin-to-run-and-tumble to the intricate, finely-tuned, human brain system that I’m talking about in this series. (Other musings on run-and-tumble versus human motivation: 1, 2.)

6.6 Common misconceptions

6.6.1 The distinction between internalized ego-syntonic desires and externalized ego-dystonic urges is unrelated to Learning Subsystem vs. Steering Subsystem

(See also: my post (Brainstem, Neocortex) ≠ (Base Motivations, Honorable Motivations).)

Many people (including me) have a strong intuitive distinction between ego-syntonic drives that are “part of us” or “what we want”, versus ego-dystonic drives that feel like urges which intrude upon us from the outside.

For example, a food snob might say “I love fine chocolate”, while a dieter might say “I have an urge to eat fine chocolate”.

6.6.1.1 The explanation I like

I would claim that these two people are basically describing the same feeling, with essentially the same neuroanatomical locations and essentially the same relation to low-level brain algorithms. But the food snob is owning that feeling, and the dieter is externalizing that feeling.

These two different self-concepts go hand-in-hand with two different “higher-order preferences”: the food snob wants to want to eat fine chocolate while the dieter wants to not want to eat fine chocolate.

This leads us to a straightforward psychological explanation for why the food snob and dieter conceptualize their feelings differently:

  • The food snob finds it appealing to think of “the desire I feel for fine chocolate” as “part of who I am”. So he does.

  • The dieter finds it aversive to think of “the desire I feel for fine chocolate” as “part of who I am”. So he doesn’t.

6.6.1.2 The explanation I don’t like

Many people (including Jeff Hawkins, see Post #3) notice the distinction described above, and separately, they endorse the idea (as I do) that the brain has a Learning Subsystem and Steering Subsystem (again see Post #3). They naturally suppose that these are the same thing, with “me and my deep desires” corresponding to the Learning Subsystem, and “urges that I don’t identify with” corresponding to the Steering Subsystem.

I think this model is wrong. At the very least, if you want to endorse this model, then you need to reject approximately everything I’ve written in this and my previous four posts.

Most people I talk to, including me, do have separate concepts in our learned world-models for “me” and “my urges”. But I claim that these concepts did NOT come out of veridical introspective access to our own neuroanatomy. And in particular, they do not correspond respectively to the Learning & Steering Subsystems.

In my story, if you’re trying to abstain from chocolate, but also feel an urge to eat chocolate, then:

  • You have an urge to eat chocolate because the Steering Subsystem approves of the thought “I am going to eat chocolate right now”; AND

  • You’re trying to abstain from chocolate because the Steering Subsystem approves of the thought “I am abstaining from chocolate”.

(Why would the Steering Subsystem approve of the latter? It depends on the individual, but it’s probably a safe bet that social instincts are involved. I’ll talk more about social instincts in Post #13. If you want an example with less complicated baggage, imagine a lactose-intolerant person trying to resist the urge to eat yummy ice cream right now, because it will make them feel really sick later on. The Steering Subsystem likes plans that result in not feeling sick, and also likes plans that result in eating yummy ice cream.)

6.6.2 The Learning Subsystem and Steering Subsystem are not two agents

Relatedly, another frequent error is treating either the Learning Subsystem or Steering Subsystem by itself as a kind of independent agent. This is wrong on both sides:

  • The Learning Subsystem cannot think any thoughts unless the Steering Subsystem has endorsed those thoughts as being worthy of being thunk.

  • Meanwhile, the Steering Subsystem does not understand the world, or itself. It has no explicit goals for the future. It’s just a relatively simple, hardcoded input-output machine.

As an example, the following is entirely possible:

  1. The Learning Subsystem generates the thought “I am going to surgically alter my own Steering Subsystem”.

  2. The Thought Assessors distill that thought down to the “scorecard”.

  3. The Steering Subsystem gets the scorecard and runs it through its hardcoded heuristics, and the result is: “Very good thought, go right ahead and do it!”

Why not, right? I’ll talk more about that example in later posts.

If you just read the above example, and you’re thinking to yourself “Ah! This is a case where the Learning Subsystem has outwitted the Steering Subsystem”, then you’re still not getting it.

(Maybe instead try imagining the Learning Subsystem & Steering Subsystem as two interconnected gears in a single machine.)

  1. ^

    As in the previous post, the term “ground truth” here is a bit misleading, because sometimes the Steering Subsystem will just defer to the Thought Assessors.

  2. ^

    As in the previous post, I don’t really believe there is a pure dichotomy between “defer-to-predictor mode” and “override mode”. In reality, I’d bet that the Steering Subsystem can partly-but-not-entirely defer to the Thought Assessor, e.g. by taking a weighted average between the Thought Assessor and some other independent calculation.