Theoretical Neuroscience For Alignment Theory
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Many additional thanks to Steve Byrnes and Adam Shimi for their helpful feedback on earlier drafts of this post.
TL;DR: Steve Byrnes has done really exciting work at the intersection of neuroscience and alignment theory. He argues that because we’re probably going to end up at some point with an AGI whose subparts at least superficially resemble those of the brain (a value function, a world model, etc.), it’s really important for alignment to proactively understand how the many ML-like algorithms in the brain actually do their thing. I build off of Steve’s framework in the second half of this post: first, I discuss why it would be worthwhile to understand the computations that underlie theory of mind + affective empathy. Second, I introduce the problem of self-referential misalignment, which is essentially the worry that initially-aligned ML systems with the capacity to model their own values could assign second-order values to these models that ultimately result in contradictory—and thus misaligned—behavioral policies. (A simple example of this general phenomenon in humans: Jack hates reading fiction, but Jack wants to be the kind of guy who likes reading fiction, so he forces himself to read fiction.)
In this post, my goal is to distill and expand upon some of Steve Byrnes’s thinking on AGI safety. For those unfamiliar with his work, Steve thinks about alignment largely through the lens of his own brand of “big-picture” theoretical neuroscience. Many of his formulations in this space are thus original and ever-evolving, which is all the more reason to attempt to consolidate his core ideas in one space. I’ll begin by summarizing Steve’s general perspectives on AGI safety and threat models. I’ll then turn to Steve’s various models of the brain and its neuromodulatory systems and how these conceptualizations relate to AGI safety. In the second half of this post, I’ll spend time exploring two novel directions for alignment theory that I think naturally emerge from Steve’s thinking.
Steve’s work in alignment theory
In order to build a coherent threat model (and before we start explicitly thinking about any brain-based algorithms), Steve reasons that we first need to operationalize some basic idea of what components we would expect to constitute an AGI. Steve asserts that four ingredients seem especially likely: a world model, a value function, a planner/actor, and a reward function calculator. As such, he imagines AGI to be fundamentally grounded in model-based RL.
So, in the simple example of an agent navigating a maze, the world model would be some learned map of that maze, the value function might assign values to every juncture (e.g., turning left here = +5, turning right here = −5), the planner/actor would transmute these values into a behavioral trajectory, and the reward function calculator would translate certain outcomes of that trajectory into rewards for the agent (e.g., +10 for successfully reaching the end of the maze). Note here that the first three ingredients of this generic AGI—its world model, value function, and planner/actor—are presumed to be learned, while the reward function calculator is considered to be hardcoded or otherwise fixed. This distinction (reward function = fixed; everything else = learned) will be critical for understanding Steve’s subsequent thinking in AGI safety and his motivations for studying neuroscience.
Using these algorithmic ingredients, Steve recasts inner alignment to simply refer to cases where an AGI’s value function converges to the sum of its reward function. Steve thinks inner-misalignment-by-default is not only likely, but inevitable, primarily because (a) many possible value functions could conceivably converge with any given reward history, (b) the reward function and value function will necessarily accept different inputs, (c) credit assignment failures are unavoidable, and (d) reward functions will conceivably encode for mutually-incompatible goals, leading to an unpredictable and/or uncontrollable internal state of the AGI. It is definitely worth noting here that Steve knowingly uses “inner alignment” slightly differently from Risks from Learned Optimization. Steve’s threat model focuses on the risks of steered optimization, where the outer layer—here the reward function—steers the inner layer towards optimizing the right target, rather than those risks associated with mesa-optimization, where a base optimizer searches over a space of possible algorithms and instantiates one that is itself performing optimization. Both uses of the term concern the alignment of some inner and an outer algorithm (and it therefore seems fine to use “inner alignment” to describe both), but the functions and relationship of these two sub-algorithms differ substantially across the two uses. See Steve’s table in this article for a great summary of the distinction. (It is also worth noting here that both steered and mesa-optimization are describable under Evan’s training story framework, where the training goal and rationale for some systems might respectively entail mesa-optimization and why a mesa-optimizer would be appropriate for the given task, while for other systems, the training goal will be to train a steered optimizer with some associated rationale for why doing so will lead to good results.)
Steve talks about outer alignment in a more conventional way: our translation into code of what we want a particular model or agent to do will be noisy, leading to unintended, unpredictable, and/or uncontrollable behavior from the model. Noteworthy here is that while Steve buys the distinction between inner and outer alignment, he believes that robust solutions to either problem will probably end up solving both problems, and so focusing exclusively on solving inner or outer alignment may not actually be the best strategy. Steve summarizes his position with the following analogy: bridge-builders have to worry both about hurricanes and earthquakes destroying their engineering project (two different problems), but it’s likely that the bridge-builders will end up implementing a single solution that addresses both problems simultaneously. So too, Steve contends, for inner and outer alignment problems. While I personally find myself agnostic on the question—I think it will depend to a large degree on the actual algorithms that end up comprising our eventual AGI—it is worth noting that this two-birds-one-stone claim might be contested by other alignment theorists.
Proposals for alignment
I think Steve’s two big-picture ideas about alignment are as follows.
Big-picture alignment idea #1: Steve advocates for what he hopes is a Goodhart-proof corrigibility approach wherein the AGI can learn the idea, say, that manipulation is bad, even in cases where it believes (1) that no one would actually catch it manipulating, and/or (2) that manipulation is in the best interest of the person being manipulated. Borrowing from the jargon of moral philosophy, we might call this “deontological corrigibility” (as opposed to “consequentialist corrigibility,” which would opt to manipulate in (2)-type cases). With this approach, Steve worries about what he calls the 1st-person-problem: namely, getting the AGI to interpret 3rd-person training signals as 1st-person training signals. I will return to this concern later and explain why I think that human cognition presents a solid working example of the kinds of computations necessary for addressing the problem.
Steve argues that this first deontological corrigibility approach would be well-supplemented by also implementing “conservatism;” that is, a sort of inhibitory fail-safe that the AGI is programmed to deploy in motivational edge-cases. For example, if a deontologically corrigible AGI learns that lying is bad and that murder is also bad, and someone with homicidal intent is pressuring the AGI to disclose the location of some person (forcing the AGI to choose between lying and facilitating murder), the conservative approach would be for the AGI to simply inhibit both actions and wait for its programmer, human feedback, etc. to adjudicate the situation. Deontological corrigibility and conservatism thus go hand-in-hand, primarily because we would expect the former approach to generate lots of edge-cases that the AGI would likely evaluate in an unstable or otherwise undesirable way. I also think that an important precondition for a robustly conservative AGI is that it exhibits indifference corrigibility as Evan operationalizes it, which further interrelates corrigibility and conservatism.
Big-picture alignment idea #2: Steve advocates, in his own words, “to understand the algorithms in the human brain that give rise to social instincts and put some modified version of those algorithms into our AGIs.” The thought here is that what would make a safe AGI safe is that it would share our idiosyncratic inductive biases and value-based intuitions about appropriate behavior in a given context. One commonly-proposed solution to this problem is to capture these intuitions indirectly through human-in-the-loop-style proposals like imitative amplification, safety via debate, reward modeling, etc., but it might also be possible to just “cut out the middleman” and install the relevant human-like social-psychological computations directly into the AGI. In slogan form, instead of (or in addition to) putting a human in the loop, we could theoretically put “humanness” in our AGI. I think that Steve thinks this second big-picture idea is the more promising of the two, not only because he says so himself, but also because it dovetails very nicely with Steve’s theoretical neuroscience research agenda.
This second proposal in particular brings us to the essential presupposition of Steve’s work in alignment theory: the human brain is really, really important to understand if we want to get alignment right. Extended disclaimer: I think Steve is spot on here, and I personally find it surprising that this kind of view is not more prominent in the field. Why would understanding the human brain matter for alignment? For starters, it seems like by far the best example that we have of a physical system that demonstrates a dual capacity for general intelligence and robust alignment to our values. In other words, if we comprehensively understood how the human brain works at the algorithmic level, then necessarily embedded in this understanding should be some recipe for a generally intelligent system at least as aligned to our values as the typical human brain. For what other set of algorithms could we have this same attractive guarantee? As such, I believe that if one could choose between (A) a superintelligent system built in the relevant way(s) like a human brain and (B) a superintelligent system that bears no resemblance to any kind of cognition with which we’re familiar, the probability of serious human-AGI misalignment and/or miscommunication happening is significantly higher in (B) than (A).
I should be clear that Steve actually takes a more moderate stance than this: he thinks that brain-like AGIs might be developed whether it’s a good idea or not—i.e., whether or not (A) is actually better than (B)—and that we should therefore (1) be ready from a theoretical standpoint if they do, and (2) figure out whether we would actually want them to be developed the first place. To this end, Steve has done a lot of really interesting distillatory work in theoretical neuroscience that I will try to further compress here and ultimately relate back to his risk models and solution proposals.
Steve’s work in theoretical neuroscience
A computational framework for the brain
I think that if one is to take any two core notions from Steve’s computational models of the brain, they are as follows: first, the brain can be bifurcated roughly into neocortex and subcortex—more specifically, the telencephalon and the brainstem/hypothalamus. Second, the (understudied) functional role of subcortex is to adaptively steer the development and optimization of complex models in the neocortex via the neuromodulatory reward signal, dopamine. Steve argues in accordance with neuroscientists like Jeff Hawkins that the neocortex—indeed, maybe the whole telencephalon—is a blank slate at birth; only through (dopaminergic) subcortical steering signals does the neocortex slowly become populated by generative world-models. Over time, these models are optimized (a) to be accurately internally and externally predictive, (b) to be compatible with our Bayesian priors, and (c) to predict big rewards (and the ones that lack one or more of these features are discarded). In Steve’s framing, these kinds of models serve as the thought/action proposals to which the basal ganglia assigns a value, looping the “selected” proposals back to cortex for further processing, and so on, until the action/thought occurs. The outcome of the action can then serve as a supervisory learning signal that updates the relevant proposals and value assignments in the neocortex and striatum for future reference.
Steve notes that there is not just a single kind of reward signal in this process; there are really something more like three signal-types. First, there is the holistic reward signal that we ordinarily think about. But there are also “local subsystem” rewards, which allocate credit (or “blame”) in a more fine-grained way. For example, slamming on the brakes to avoid a collision may phasically decrease the holistic reward signal (“you almost killed me, %*^&!”) but phasically increase particular subsystem reward signals (“nice job slamming those breaks, foot-brain-motor-loop!”). Finally, Steve argues that dopamine can also serve as a supervisory learning signal (as partially described above) in those cases for which a ground-truth error signal is available after the fact—a kind of “hindsight-is-20/20” dopamine.
So, summing it all up, here are the basic Steve-neuroscience-thoughts to keep in mind: neocortex is steered and subcortex is doing the steering, via dopamine. The neocortex is steered towards predictive, priors-compatible, reward-optimistic models, which in turn propose thoughts/actions to implement. The basal ganglia assigns value to these thoughts/actions, and when the high-value ones are actualized, we use the consequences (1) to make our world model more predictive, compatible, etc., and (2) to make our value function more closely align with the ground-truth reward signal(s). I’m leaving out many really interesting nuances in Steve’s brain models for the sake of parsimony here; if you want a far richer understanding of Steve’s models of the brain, I highly recommend going straight to the source(s).
Relationship to alignment work
So, what exactly is the relationship of Steve’s theoretical neuroscience work to his thinking on alignment? One straightforward point of interaction is Steve’s inner alignment worry about the value function differing from the sum of the reward function calculator. In the brain, Steve posits that the reward function calculator is something like the brainstem/hypothalamus—perhaps more specifically, the phasic dopamine signals produced by these areas—and that the brain’s value function is distributed throughout the telencephalon, though perhaps mainly to be found in the striatum and neocortex (specifically, in my own view, anterior neocortex). Putting these notions together, we might find that the ways that the brain’s reward function calculator and value function interact will tell us some really important stuff about how we should safely build and maintain similar algorithms in an AGI.
To evaluate the robustness of the analogy, it seems critical to pin down whether the reward signals that originate in the hypothalamus/brainstem can themselves be altered by learning or whether they are inflexibly hardcoded by evolution. Recall in Steve’s AGI development model that while the world model, value model, and planner/actor are all learned, the reward function calculator is not—therefore, it seems like the degree to which this model is relevant to the brain depends on (a) how important it is for an AGI that the reward function calculator is fixed the model, and (b) whether it actually is fixed in the brain. For (a), it seems fairly obvious that the reward function must be fixed in the relevant sense—namely, that the AGI cannot fundamentally change what constitutes a reward or punishment. As for (b), whether the reward function is actually fixed in the brain, Steve differentiates between the capacity for learning-from-scratch (e.g., what a neural network does) and “mere plasticity” (e.g., self-modifying code in Linux), arguing that the brain’s reward function is fixed in the first sense—but probably not the second. At the end of the day, I don’t think this asterisk on the fixedness of the brain’s reward function is a big problem for reconciling Steve’s safety and brain frameworks, given the comparatively limited scope of the kinds of changes that are possible under “mere plasticity.”
Steve’s risk models also clearly entail our elucidating the algorithms in the brain give rise to distinctly human social behavior (recall big picture alignment idea #2)—though up to this point, Steve has done (relatively) less research on this front. I think it is worthwhile, therefore, to pick up in the next section by briefly introducing and exploring the implications of one decently-well-understood phenomenon that seems highly relevant to Steve’s work in this sphere: theory of mind (ToM).
Building from Steve’s framework
Theory of mind as ‘hierarchical IRL’ that addresses Steve’s first-person problem
A cognitive system is said to have theory of mind (ToM) when it is able to accurately and flexibly infer the internal states of other cognitive systems. For instance, if we are having a conversation and you suddenly make the face pictured below, my ToM enables me to automatically infer a specific fact about what’s going on in your mind: namely, that you probably don’t agree with or are otherwise unsure about something I’m saying.
This general capacity definitely seems to me like a—if not the—foundational computation underlying sophisticated social cognition and behavior: it enables empathy, perspective-taking, verbal communication, and ethical consideration. Critically, however, there is a growing amount of compelling experimental evidence that ToM is not one homogenous thing. Rather, it seems to be functionally dissociable into two overlapping but computationally distinct subparts: affective ToM (roughly, “I understand how you’re feeling”) and cognitive ToM (roughly, “I understand what you’re thinking”). Thus, we should more precisely characterize the frown-eyebrow-raise example from above as an instance of affective ToM. Cognitive theory of mind, on the other hand, is classically conceptualized and tested as follows: Jessica has a red box and blue box in front of her. She puts her phone in the red box and leaves the room. While she’s gone, her phone is moved to the blue box. When Jessica comes back into the room, which box will she look for her phone? As obvious as it seems, children under the age of about three will respond at worse-than-chance levels. Answering correctly requires cognitive ToM: that is, the ability to represent that Jessica can herself court representations of the world that are distinct from actual-world states (this set-up is thus appropriately named the false-belief task). This is also why cognitive ToM is sometimes referred to as a meta-representational capacity.
One final piece of the puzzle that seems relevant is affective empathy, which adds to the “I understand how you’re feeling...” of affective ToM: “...and now I feel this way, too!”. The following diagram provides a nice summary of the three concepts:
To the extent Steve is right that “[understanding] the algorithms in the human brain that give rise to social instincts and [putting] some modified version of those algorithms into our AGIs” is a worthwhile safety proposal, I think we should be focusing our attention on instantiating the relevant algorithms that underlie affective and cognitive ToM + affective empathy. For starters, I believe these brain mechanisms supply the central computations that enable us homo sapiens to routinely get around Steve’s “1st-person-problem” (getting a cognitive system to interpret 3rd-person training signals as 1st-person training signals).
Consider an example: I see a classmate cheat on a test and get caught (all 3rd-person training signals). I think this experience would probably update my “don’t cheat (or at least don’t get caught cheating)” Q-value proportionally—i.e., not equivalently—to how it would have been updated were I the one who actually cheated (ultimately rendering the experience a 1st-person training signal). Namely, the value is updated to whatever quantity of context-dependent phasic dopamine is associated with the thought, “if I wasn’t going to try it before, I’m sure as hell not going to try it now.”
It seems clear to me that the central underlying computational mechanism here is ToM + affective empathy: I infer the cheater’s intentional state from his behavior (cognitive ToM; “he’s gone to the bathroom five times during this test = his intention is to cheat”), the affective valence associated with the consequences of this behavior (affective ToM; “his face went pale when the professor called him up = he feels guilty, embarrassed, screwed, etc.”), and begin to feel a bit freaked out myself (affective empathy; “that whole thing was pretty jarring to watch!”).
For this reason, I’m actually inclined to see Steve’s two major safety proposals (corrigibility + conservatism / human-like social instincts) as two sides of the same coin. That is, I think you probably get the kind of deontological corrigibility that Steve is interested in “for free” once you have the relevant human-like social instincts—namely, ToM + affective empathy.
The computation(s) underlying ToM + affective empathy are indisputably open research questions, ones that I think ought to be taken up by alignment theorists who share Steve-like views about the importance of instantiating the algorithms underpinning human-like social behavior in AGI. I do want to motivate this agenda here, however, by gesturing at one intriguingly simple proposal: ToM is basically just inverse reinforcement learning (IRL) through Bayesian inference. There already exists some good theoretical and neurofunctional work that supports this account. Whereas RL maps a reward function onto behavior, IRL (as its name suggests) maps behavior onto the likely reward/value function that generated it. So, RL: you take a bite of chocolate and you enjoy it, so you take another bite. IRL: I see you take one bite of chocolate and then another, so I infer that you expected there to be some reward associated with taking another bite—i.e., I infer that you enjoyed your first bite. At first glance, IRL does seem quite a bit like ToM. Let’s look a bit closer:
The basic story this model tells is as follows: an agent (inner loop) finds itself in some state of the world at time t. Assuming a Steve-like model of the agent’s various cognitive sub-algorithms, we can say the agent uses (A) its world model to interpret its current state and (B) its value function to assign some context-dependent value to the activated concepts in its world model. Its actor/planner module then searches over these values to find a high-value behavioral trajectory that the agent subsequently implements, observes the consequences of, and reacts to. The world state changes as a result of the agent’s action, and the cycle recurs. In addition to the agent, there is an observer (outer ovals) who is watching the agent act within its environment.
Here, the cognitive ToM of the observer performs Bayesian inference over the agent’s (invisible) world model and intended outcome given their selected action. For instance, given that you just opened the fridge, I might infer (1) you believe there is food in the fridge, and (2) you probably want to eat some of that food. (This is Bayesian because my priors constrain my inference—e.g., given my assorted priors about your preferences, typical fridge usage, etc., I assign higher probability to your opening the fridge because you’re hungry than to your opening the fridge because you just love opening doors.)
The observer’s affective ToM then takes one of these output terms—the agent’s intended outcome—as input and compares it to the actual observed outcome in order to infer the agent’s reaction. For example, if you open the fridge and there is no food, I infer, given (from cognitive ToM) that you thought there was going to be food and you intended to eat some of it, that (now with affective ToM) you’re pretty disappointed. (I label this whole sub-episode as “variably visible” because in some cases, we might get additional data that directly supports a particular inference, like one’s facial expression demonstrating their internal state as in the cartoon from earlier.)
Finally, affective empathy computes how appropriate it is for the observer to feel way X given the inference that the agent feels way X. In the fridge example, this translates to how disappointed I should feel given that (I’ve inferred) you’re feeling disappointed. Maybe we’re good friends, so I feel some “secondhand” disappointment. Or, maybe your having raided the fridge last night is the reason it’s empty, in which case I feel far less for you.
I suppose this sort of simple computational picture instantiates a kind of “hierarchical IRL,” where each inference provides the foundation upon which the subsequent inference occurs (cognitive ToM → affective ToM → affective empathy). This hypothesis would predict that deficits in one inference mechanism should entail downstream (but not upstream) deficits—e.g., affective ToM deficits should entail affective empathy deficits but not necessarily cognitive ToM deficits. (The evidence for this is murky and probably just deserves a blog post of its own to adjudicate.)
Suffice it to simply say here that I think alignment theorists who find human sociality interesting should direct their attention to the neural algorithms that give rise to cognitive and affective ToM + affective empathy. (One last technical note: ToM + empathetic processing seems relevantly lateralized. As excited about Steve’s computational brain framework as I am, I think the question of functional hemispheric lateralization is a fruitful and fascinating one that Steve tends to emphasize less in his models, I suspect because of his sympathies to “neocortical blank-slate-ism.”)
The last thing I’d like to do in this post is to demonstrate how Steve’s “neocortex-subcortex, steered-steerer” computational framework might lead to novel inner alignment problems. Recall that we are assuming our eventual AGI (whatever degree of actual neuromorphism it displays) will be composed of a world model, a value function, a planner/actor, and a reward function calculator. Let’s also assume that something like Steve’s picture of steered optimization is correct: more specifically, let’s assume that our eventual AGI displays some broad dualism of (A) telencephalon-like computations that constitute the world model, value function, and actor/planner, and (B) hypothalamus-/brainstem-like computations that constitute the reward function calculator. With these assumptions in place, let’s consider a simple story:
Jim doesn’t particularly care for brussel sprouts. He finds them to be a bit bitter and bland, and his (hypothalamus-/brainstem-supplied) hardwired reaction to foods with this flavor profile is negatively-valenced. Framed slightly differently, perhaps in Jim’s vast Q-table/complex value function, the action “eat brussel sprouts” in any state where brussel sprouts are present has a negative numerical value (in neurofunctional terms, this would correspond to some reduction in phasic dopamine). Let’s also just assert that this aversion renders Jim aligned with respect to his evolutionarily-installed objective to avoid bitter (i.e., potentially poisonous) foods. But Jim, like virtually all humans—and maybe even some clever animals—does not just populate his world model (and the subsequent states that feature in his Q-table/value function) with exogenous phenomena like foods, places, objects, and events; he also can model (and subsequently feature in his Q-table/value function) various endogenous phenomena like his own personality, behavior, and desires. So, for instance, Jim not only could assign some reward-function-mediated value to “brussel sprouts;” he could also assign some reward-function-mediated value to the abstract state of “being the kind of person who eats brussel sprouts.”
If we assume that Jim’s brain selects greedy behavioral policies and that, for Jim, the (second-order) value of being the kind of guy who eats brussel sprouts relevantly outweighs the (first-order) disvalue of brussel sprouts, we should expect that Jim’s hypothalamus-/brainstem-supplied negative reaction to bitter foods will be ignored in favor of his abstract valuation that it is good to be the brussel-sprout-eating-type. Now, from the perspective of Jim’s “programmer” (the evolutionary pressure to avoid bitter foods), he is demonstrating inner misalignment—in Steve’s terms, his value function certainly differs from the sum of his “avoid-bitter-stuff” reward function.
There are many names that this general type of scenario goes by: delaying gratification, exhibition of second-order preferences (e.g., “I really wanted to like Dune, but…”), appealing to higher-order values, etc. However, in this post, I’ll more specifically refer to this kind of problem as self-referential misalignment. Informally, I’m thinking of self-referential misalignment as what happens when some system capable of self-modeling develops and subsequently acts upon misaligned second-order preferences that conflict with its aligned first-order preferences.
There seem to be at least three necessary conditions for ending up with an agent displaying self-referential misalignment. I’ll spell them out in Steve-like terminology:
The agent is a steered optimizer/online learner whose value function, world model, and actor/planner modules update with experience.
The agent is able to learn the relevant parts of its own value function, actor/planner, and/or reward function calculator as concepts within its world model. I’ll call these “endogenous models.”
The agent can assign value to endogenous models just as it can for any other concept in the world model.
If a system can’t do online learning at all, it is unclear how it would end up with Jim-like preferences about its own preferences—presumably, while bitterness aversion is hardcoded into the reward function calculator at “deployment,” his preference to keep a healthy diet is not. So, if this latter preference is to emerge at some point, there has to be some mechanism for incorporating it into the value function in an online manner (condition 1, above).
Next, the agent must be capable of a special kind of online learning: the capacity to build endogenous models. Most animals, for example, are presumably unable to do this: a squirrel can model trees, buildings, predators, and other similarly exogenous concepts, but it can’t endogenously model its own proclivities to eat acorns, climb trees, and so on (thus, a squirrel-brain-like-algorithm would fail to meet condition 2, above).
Finally, the agent must not only be capable of merely building endogenous models, but also of assigning value to and acting upon them—that is, enabling them to recursively flow back into the value and actor/planner functions that serve as their initial basis. It is not enough for Jim to be able to reason about himself as a kind of person who eats/doesn’t eat brussel sprouts (a descriptive fact); he must also be able to assign some value about this fact (a normative judgment) and ultimately alter his behavioral policy in light of this value assignment (condition 3, above).
If a system displays all three of the capacities, I think it is then possible for that system to exhibit self-referential misalignment in the following, more formal sense:
Let’s see what self-referential misalignment might look like in a more prosaic-AI-like example. Imagine we program an advanced model-based RL system to have conversations with humans, where its reward signal is calculated given interlocutor feedback. We might generally decide that a system of this type is outer aligned as long as it doesn’t say or do any hateful/violent/harmful stuff. The system is inner aligned (in Steve’s sense) if the reward signal shapes a value function that converges to an aversion to saying or doing hateful/violent/harmful stuff (for the right reasons). Throw in capability robustness (i.e., the system can actually carry on a conversation), and, given the core notion of impact alignment, I think one would then have the necessary conditions the system would have to fulfill in order to be considered aligned. So then let’s say we build a reward function that takes as input the feedback of the system’s past conversation partners and outputs some reward signal that is conducive to shaping a value function that is aligned in the aforementioned sense. It’s plausible that this value function (when interpreted) would have some of the following components: “say mean things = −75; say funny things = 25; say true things = 35; say surprising things = 10”.
Then, if the system can build endogenous models, this means that it will be conceivable that (1) the system learns the fact that it disvalues saying mean things, the fact that it values saying funny things, etc., and that (2) the system is subsequently able to assign value to these self-referential concepts in its world-model. With the four values enumerated above, for instance, the system could plausibly learn to assign some context-dependent negative value to the very fact that it disvalues saying mean things (i.e., the system learns to “resent” the fact that it’s always nice to everyone). This might be because, in certain situations, the system learns that saying something a little mean would have been surprising, true, and funny (high-value qualities)—and yet it chose not to. Once any valuation like this gains momentum or is otherwise “catalyzed” under the right conditions, I think it is conceivable that the system could end up displaying self-referential misalignment, learning to override its aligned first-order preferences in the service of misaligned higher-order values.
This example is meant to demonstrate that a steered optimizer capable of building/evaluating endogenous models might be totally aligned + capability robust over its first-order preferences but may subsequently become seriously misaligned if it generates preferences about these preferences. Here are two things that make me worry about self-referential alignment as a real and important problem:
Self-referential concepts are probably really powerful; there is therefore real incentive to build AGI with the ability to build endogenous models.
The more generally capable the system (i.e., the closer to AGI we get), the more self-referential misalignment seems (a) more likely and (b) more dangerous.
I think (1) deserves a post of its own (self-reference is probably a really challenging double-edged sword), but I will try to briefly build intuition here: for starters, the general capacity for self-reference has been hypothesized to underlie language production and comprehension, self-consciousness, complex sociality—basically, much of the key stuff that makes humans uniquely intelligent. So to the degree we’re interested in instantiating a competitive, at-least-human-level general intelligence in computational systems, self-reference may prove a necessary feature. If so, we should definitely be prepared to deal with the alignment problems that accompany it.
Regarding (2), I think that the general likelihood self-referential misalignment is proportional to the general intelligence of the system in question—this is because the more generally capable a system is, the more likely it will be to court diversified and complex reward streams and value functions that can be abstracted over and plausibly interact at higher levels of abstraction. One reason, for instance, that Jim may actually want to be the kind of person who eats his veggies is because, in addition to his bitterness aversion, his value function is also shaped by social rewards (e.g., his girlfriend thinks it’s gross that he only ate Hot Pockets in college, and Jim cares a lot about what his girlfriend thinks of him). In practice, any higher-order value could conceivably override any aligned first-order value. Thus, the more complex and varied the first-order value function of an endogenous-modeling-capable system, the more likely that one or more emergent values will be in conflict with one of the system’s foundational preferences.
On this note, one final technical point worth flagging here is that Steve’s framework almost exclusively focuses on dopamine as the brain’s unitary “currency” for reward signals (which themselves may well number hundreds), but I don’t think it’s obvious that dopamine is the brain’s only reward signal currency, at least not across larger spans of time. Specifically, I think that serotonin is a plausible candidate for another neuromodulatory (social, I think) reward-like signal in the brain. If correct, this would matter a lot: if there is more than one major neuromodulatory signal-type in the brain that shapes the telencephalic value function, I think the plausibility of getting adversarial reward signals—and consequently self-referential misalignment—substantially increases (e.g., dopamine trains the agent to have some value function, but serotonin separately trains the agent to value being the kind of agent that doesn’t reflexively cater to dopaminergically-produced values). For this reason, I think a better computational picture of serotonin in the brain to complement Steve’s Big picture of phasic dopamine is thus highly relevant for alignment theory.
This is more to say about this problem of self-referential misalignment in steered optimizers, but I will now turn my attention to discussing two potential solutions (and some further questions that need to be answered about each of these solutions).
Solution #1: avoid one or more of the necessary conditions that result in a system exhibiting conflicting second-order preferences. Perhaps more specifically, we might focus on simply preventing the system from developing endogenous models (necessary condition 2, above). I think there is some merit to this, and I definitely want to think more about this somewhere else. One important problem I see with this proposal, however, is that it doesn’t fully appreciate the complexities of embedded agency—for example, the agent’s values will inevitably leave an observable trace on its (exogenous) environment across time that may still allow the agent to learn about itself (e.g., a tree-chopping agent who cannot directly endogenously model may still be able to indirectly infer from its surroundings the self-referential notion that it is the kind of agent who cuts down trees). It’s possible that some form of myopia could helpfully address this kind of problem, though I’m currently agnostic about this.
Solution #2: simply implement the same kind of Steve-like conservative approach we might want to employ for other kinds of motivational edge-cases (e.g., from earlier, don’t be manipulative vs. don’t facilitate murder). I think this is an interesting proposal, but it also runs into problems. I suppose that I am just generally skeptical of conservatism as a competitive safety proposal, as Evan puts it—it seems to entail human intervention whenever the AGI is internally conflicted about what to do, which is extremely inefficient and would probably be happening constantly. But in the same way that the direct instantiation of human-like social instincts may be a more parsimonious and “straight-from-the-source” solution than constantly deferring to humans, perhaps so too for conflicted decision-making: might it make sense “simply” to better understand the computational underpinnings of how we trade-off various good alternatives rather than defer to humans every time the AGI encounters a motivational conflict? Like with the first solution, I think there is something salvageable here, but it requires a more critical look. (It’s worth noting that Steve is skeptical of my proposal here. He thinks that the way humans resolve motivational conflicts isn’t actually a great template for how AGI should do it, both because we’re pretty bad at this ourselves and because there may be a way for AGI to go “back to ground truth” in resolving these conflicts—i.e. somehow query the human—in a way that biology can’t—i.e., you can’t go ask Inclusive Genetic Fitness what to do in a tricky situation.)
Finally, I should note that I don’t yet have a succinct computational story of how self-referential (mis)alignment might be instantiated in the brain. I suspect that it would roughly boil down to having a neural-computational description of how endogenous modeling happens—i.e., what kinds of interactions between the areas of neocortex differentially responsible for building the value function and those responsible for building the world model are necessary/sufficient for endogenous modeling? As I was hinting at previously, there are some animals (e.g., humans) that are certainly capable of endogenous modeling, while there are others (e.g., squirrels) that are certainly not—and there are yet other animals that occupy something of a grey area (e.g., dolphins). There are presumably neurostructural/neurofunctional cross-species differences that account for this variance in the capacity to endogenously model, but I am totally ignorant of them at present. Needless to say, I think it is critical to get clearer on exactly how self-referential misalignment happens in the brain so that we can determine whether a similar algorithm is being instantiated in an AGI. I also think that this problem is naturally related to instrumental behavior in learning systems, most notably deceptive alignment, and it seems very important to elucidate this relationship in further work.
Steve Byrnes’s approach to AGI safety is powerful, creative, and exciting, and that far more people should be doing alignment theory research through Steve-like frameworks. I think that the brain is the only working example we have of a physical system that demonstrates both (a) general intelligence and, as Eliezer Yudkowsky has argued, (b) the capacity to productively situate itself within complex human value structures, so attempting to understand how it achieves these things at the computational level and subsequently instantiating the relevant computations in an AGI seems far more likely to be a safe and effective strategy than building some giant neural network that shares none of our social intuitions or inductive biases. Steve’s high-level applications of theoretical neuroscience to AGI alignment has proved a highly generative research framework, as I have tried to demonstrate here by elaborating two natural extensions of Steve’s ideas: (1) the necessity to understand the computational underpinnings of affective and cognitive theory of mind + affective empathy, and (2) the concern that a “neocortex-subcortex, steered-steerer” framework superimposed upon Steve’s “four-ingredient AGI” gives rise to serious safety concerns surrounding endogenous modeling and self-referential misalignment, both of which I claim are ubiquitously displayed by the human brain.