Human preferences as RL critic values—implications for alignment

TLDR: Human preferences might be largely the result of a critic network head, much like that used in SOTA agentic RL systems. The term “values” in humans might mean almost exactly what it does in RL: an estimate of the discounted sum of future rewards. In humans, this estimate is based on richer and more abstract representations than those available to current RL systems.

Work on aligning RL systems often doesn’t address the critic system as distinct from the actor. But using systems with a critic head may provide a much simpler interface for interpreting and directly editing the system’s values and, therefore, its goals and behavior. In addition, a powerful critic system may be advantageous for capabilities.

One way to frame this is that human behavior, and therefore that of a neuromorphic AGI, might well be governed primarily by a critic system, and aligning that critic is simpler than understanding the complex mess of representations and action habits in the remainder of the system.

Readers will hopefully be familiar with Steve Byrnes’ sequence Intro to Brain-Like-AGI Safety. What I present here seems to be entirely consistent with his theories. Post 14 of that sequence, his recent elaboration, and other recent work[1] present similar ideas. We use different terminology[2] and explanatory strategies, and I focus more on the specifics of the critic system, so hopefully the two explanations are not redundant.

The payoff: a handle for alignment

Wouldn’t it be nice if the behavior of a system were governed by a single subsystem, and that subsystem provided an easily trainable (or even hand-editable) set of weights? And if it had a clear readout, meaning something like “I’ll pursue what I’m currently thinking about, as a goal, with a priority between 0 and 1 in this context”?

Suppose we had a proto-AGI system including a component with the above properties. That subsystem is what I’m terming the critic. Now suppose further that this system is relatively well-trained but is (by good planning and good fortune) still under human control. We could prompt it with language like “think about helping humans get things they want” or ”...human flourishing” or whatever your top pick(s) are for the outer alignment problem. Then we’d hit the “set value to max” button, and all of the weights into the critic system from the active conceptual representations would be bumped up.

If we knew that the critic system’s value estimate would govern future model-based behavior, and that it would over time train new model-free habits of thought and behavior, that would seem a lot better than trying to teach it what to do by rewarding behavior alone and guessing at the internal structure leading to that behavior.
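To make the shape of that handle concrete, here is a minimal toy sketch in PyTorch. The `ActorCritic` module, the `boost_current_value` edit, and the rule of nudging the critic’s input weights toward the currently active representation are all my own illustration of the proposal, assuming a simple shared-body actor-critic; nothing here is an existing alignment technique.

```python
# Toy sketch of the proposed handle: a shared body with actor and critic heads,
# plus a hypothetical "set value to max" edit that bumps the critic's weights
# toward whatever representation is currently active. All names are illustrative.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, n_features: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())  # shared representations
        self.actor = nn.Linear(128, n_actions)  # action preferences
        self.critic = nn.Linear(128, 1)         # "how good is what I'm currently representing?"

    def forward(self, x):
        h = self.body(x)
        return self.actor(h), self.critic(h), h

def boost_current_value(model: ActorCritic, h: torch.Tensor, strength: float) -> None:
    """Hypothetical hand edit: nudge the critic's input weights toward the
    currently active hidden representation, so similar states are judged
    more valuable from now on."""
    with torch.no_grad():
        model.critic.weight += strength * h.mean(dim=0, keepdim=True)

# Usage: present the system with the target concept (standing in for "think about
# helping humans get things they want"), read off the active representation, boost it.
model = ActorCritic(n_features=32, n_actions=4)
x = torch.randn(8, 32)                      # stand-in for the prompted concept
_, value_before, h = model(x)
boost_current_value(model, h, strength=0.5)
_, value_after, _ = model(x)
print(value_before.mean().item(), value_after.mean().item())  # value goes up
```

The point is only that, in this kind of architecture, the values live in one small, inspectable set of weights (the critic head) rather than being smeared across the entire policy.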

I’m suggesting here that such a system is advantageous not only for alignment but also for capabilities. Which would be a nice conjunction, if it turns out to be true.

This wouldn’t solve the whole alignment problem by a long shot. We’d still have to somehow decode or evoke the representations feeding that system, and we’d still have to figure out outer alignment: how to define what we really want. And we’d have to worry about its values shifting after it was out of our control. But having such a handle on values would make the problem a lot easier.

Values in RL and brains: what is known

The basal ganglia and dopamine system act much like the actor and critic systems, respectively, in RL. This has been established by extensive experimental work, which is reviewed in papers and computational models, including ones I’ve worked on[3]. In brief, the amygdala and related subcortical brain areas appear to work together with the dopamine system to implement something much like the critic in RL[4].

Most readers will be familiar with the role of a critic in RL. Briefly, it learns to make educated guesses about when something good is about to happen, so that it can provide a training signal to the rest of the network. In mammals, this has been studied in lab situations like the following: a red light comes on exactly three seconds before a hungry mouse gets a food pellet, and as the mouse learns that reward reliably follows the light, its dopamine system releases more and more dopamine when the light comes on. The same dopamine critic system also learns to cancel the dopamine signal when the now-predicted food pellet arrives.

The critic system thus makes a judgment about when something seems likely to be new-and-good, or new-and-bad. It should be intuitively clear why those are the right occasions to apply learning: either to do more of what produced a new-and-probably-good outcome, or less of what produced a new-and-probably-bad outcome. In mathematical RL, this is the temporal-difference error: roughly, the moment-to-moment change in the predicted, time-discounted sum of future rewards, plus any reward actually received. Neuroscience refers to the dopamine signal as a reward prediction error; they mean the same thing.
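In standard temporal-difference notation (my formalization of the verbal description above, not something taken from the cited neuroscience papers), the signal is

$$\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)$$

where $V(s)$ is the critic’s estimate of the time-discounted sum of future rewards starting from state $s$, $r_t$ is the reward actually received, and $\gamma \in [0, 1]$ is the discount factor. A positive $\delta_t$ is the new-and-probably-good case and strengthens whatever the system just did or thought; a negative $\delta_t$ is the new-and-probably-bad case and weakens it.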

Recent work has added some wrinkles to this story. For instance, some dopamine neurons fire when punishment happens instead of pausing[5]. That and several other results add to the story but don’t fundamentally change it[6].

Most of the high-quality experimental evidence is in animals, and in pretty short and easy predictions of future rewards, like the red light coming on exactly three seconds before every reward. There are lots of variations that show rough[7] adherence to the math of a critic: a 50% chance of reward means about half as much dopamine released at the conditioned stimulus (CS in the jargon, the light) and roughly half as much dopamine at the unconditioned stimulus (US, the food pellet) when it does arrive.
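To check that arithmetic against the math, here is a minimal tabular TD(0) simulation of the partial-reinforcement case. It’s an illustrative toy of the standard equations, not a fit to any particular experiment; the “dopamine” readouts are just the TD errors.

```python
# Toy TD(0) model of the conditioning setup described above: a light comes on,
# then a single food pellet follows on 50% of trials. One learned value, V_light.
import random

random.seed(0)
alpha = 0.05             # learning rate
V_light = 0.0            # critic's reward prediction for the light-on state

for trial in range(20000):
    reward = 1.0 if random.random() < 0.5 else 0.0
    delta_us = reward - V_light   # prediction error when the pellet does / doesn't arrive
    V_light += alpha * delta_us   # that error is the learning signal for the critic

# After learning, the burst at light onset is the jump from baseline (0) to V_light:
print(round(V_light, 2))          # ~0.5: half-sized "dopamine" response at the light
# On rewarded trials, the burst at pellet time is the reward minus the prediction:
print(round(1.0 - V_light, 2))    # ~0.5: the other half arrives with the pellet
```

On unrewarded trials the same arithmetic gives a dip of about −0.5, the familiar dopamine pause at an omitted reward.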

In humans, the data is much less precise, as we don’t go poking electrodes into human heads except sometimes when and where they are needed for therapeutic purposes. In animals, we don’t have data on solving complex tasks or thinking about abstract ideas. Therefore, we don’t have direct data on how the critic system works in more complex and abstract domains.

Extrapolation: human preferences as RL values

The obvious move is to suppose that all human decision-making is done using the same computations.

The available data is quite consistent with this hypothesis. For instance, humans show blood-oxygenation (fMRI BOLD) activity in the striatum consistent with dopamine release when they receive social and monetary rewards[8], and monkeys show dopamine release when they’re about to get information about an upcoming water reward, even when that information won’t change the odds or amount of reward at all[9]. There are many other results, none directly demonstrative, but all consistent with the idea that the dopamine critic system is active and involved in all behavior.

One neuroscientist who has directly addressed this explanation of human behavior is Read Montague, who wrote a whole popular press book on this hypothesis[10]. He says we have an ability to plug in essentially anything as a source of reward and calls that our human superpower. He’s proposing that we have a more powerful critic than other mammals, and that’s the source of our intelligence. That’s the same theory I’m presenting here.

Preferences for abstractions

The extreme version of this hypothesis is that this critic system works even for very abstract representations. For instance, when a human says that they value freedom, a representation of their concept of freedom triggers dopamine release.

The obvious problem with this idea is that a system that can provide a reward signal to itself just by thinking would be dysfunctional. It would have an easy mechanism for wireheading, and that would be terrible for its performance and survival. So the system needs to draw a distinction between merely imagining freedom and making a plan that is predicted to actually produce freedom. This seems like something a critic system can learn pretty easily: it’s known that the rodent dopamine system can learn blockers, such as not predicting reward when a blue light comes on at the same time as the otherwise reward-predictive red light.

The critic system would apply to everything we’d refer to as preferences. It would also extend to ascribing value to contextually useful subgoals, like coding a function as one step in finishing a piece of software, although we wouldn’t ordinarily think of these as preferences. While such subgoals are much more numerous than the things we’d call our values, they may simply sit on a continuum of how valuable, and in what contexts, the critic system judges them to be.

Advantages of a powerful critic system

I think this extreme version does a lot of work in explaining very complex human behavior, like coding up a working piece of software. Concepts like “break the problem into pieces” and “create a function that does a piece of the problem” seem to be the types of strategies and subgoals we use to solve complex problems. Breaking problems into serial steps seems like a key computational strategy that allows us to do so much more than other mammals. I’ve written about this elsewhere[11], but I can’t really recommend those papers as they’re written for cognitive neuroscientists. I hope to write more, and more clearly, in future posts about the advantage of flexibly choosing cognitive subgoals from all of the representations one has learned.

A second advantage of a powerful critic system is in bridging the gap to rare rewards in any environment. It’s generally thought that humans sometimes use model-based computations, as when we think about possible outcomes of plans. But it also seems pretty clear that we often don’t evaluate our plans all the way out to actual material rewards; we don’t try to look to the end of the game, but rather estimate the value of board positions after looking just a few moves deep.

Current SOTA RL agents do the same. The systems I understand all use critic heads. I’m not sure whether ChatGPT uses a critic, but the AlphaZero family, the OpenAI Five family of RL agents, and others all use a critic head that shares a body with the actor head and thus has access to all of the representations the whole system learns during training. Those representations can go beyond those learned through RL: ChatGPT, EfficientZero, and almost certainly humans demonstrate the advantages of combining RL with predictive or other self-supervised learning.
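For illustration, here’s the kind of computation I have in mind when I say “estimate the value of board positions after looking just a few moves deep”: a depth-limited search that hands frontier positions to a critic instead of playing games out to the end. It’s a generic sketch of that idea under my own simplifying assumptions (the `legal_moves`, `apply_move`, and `critic` callables are hypothetical placeholders), not the actual search used by any of the systems named above.

```python
# Sketch: plan a few moves ahead, then let the critic judge the leaf positions.
# `legal_moves`, `apply_move`, and `critic` are hypothetical placeholders for
# whatever game/environment interface and learned value head you have.
from typing import Callable, List, TypeVar

S = TypeVar("S")   # a game state
M = TypeVar("M")   # a move

def lookahead_value(state: S,
                    legal_moves: Callable[[S], List[M]],
                    apply_move: Callable[[S, M], S],
                    critic: Callable[[S], float],
                    depth: int) -> float:
    """Best achievable value from `state` for the player to move, searching
    `depth` plies and letting the critic evaluate the frontier instead of
    searching all the way to the end of the game."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return critic(state)   # critic stands in for "how the rest would go"
    # Negamax convention for a two-player zero-sum game: what is good for the
    # opponent after our move is bad for us.
    return max(-lookahead_value(apply_move(state, m), legal_moves,
                                apply_move, critic, depth - 1)
               for m in moves)
```

The better the critic’s estimates, the shallower this search can afford to be.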

Conclusions, caveats, and directions

The potential payoff of such a system is an easier handle for doing some alignment work, as discussed in the first section. I wanted to say the exciting part first.

There are a few dangling pieces of the logic:

  • Current systems do not use explicit representations of their current goals.

    • That capability is central to the human-like cognition I envision above.

    • I think it’s terrifyingly dangerous to let a system choose its own subgoals, since there’s no clear delineation between subgoals and final goals.

    • But I think it’s also advantageous to do so; that requires more discussion.

  • As discussed above, such a system would not address:

    • Outer alignment

    • Interpretability

    • Value shift and representational shift over time

  • Such a system would be highly agentic. Which is a terrible idea. But:

    • I think agentic systems are inevitable,

    • they have better capabilities,

    • and they are just fascinating, even if they don’t

I hope to cover all of the above in future posts.


  1. ↩︎

    Order Matters for Deceptive Alignment makes the point that alignment would be way easier if our agent already has a good set of world representations when we align it. I think this is the core assumption made when people say “won’t an advanced AI understand what we want?”. But it’s not that easy to maintain control while an AI develops those representations. Kaj Sotala’s recent The Preference Fulfillment Hypothesis presents roughly the same idea.

  2. ↩︎

    Byrnes uses the concept of a Thought Assessor roughly as I’m using critic. He puts the basal ganglia as part of the Thought Assessor system, whereas the actor-critic hypothesis treats the basal ganglia as part of the actor. These appear to be purely terminological differences, positing at least approximately the same function.

  3. ↩︎

    Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19(23), 10502-10511.
    Hazy, T. E., Frank, M. J., & O’Reilly, R. C. (2007). Towards an executive without a homunculus: computational models of the prefrontal cortex/basal ganglia system. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1485), 1601-1613.
    Herd, S. A., Hazy, T. E., Chatham, C. H., Brant, A. M., & Friedman, N. P. (2014). A neural network model of individual differences in task switching abilities. Neuropsychologia, 62, 375-389.

  4. ↩︎

    Mollick, J. A., Hazy, T. E., Krueger, K. A., Nair, A., Mackie, P., Herd, S. A., & O’Reilly, R. C. (2020). A systems-neuroscience model of phasic dopamine. Psychological Review, 127(6), 972.

  5. ↩︎

    Brooks, A. M., & Berns, G. S. (2013). Aversive stimuli and loss in the mesocorticolimbic dopamine system. Trends in Cognitive Sciences, 17(6), 281-286.

  6. ↩︎

    Dopamine release for aversive events may serve to focus attention on their potential causes, in service of plans and actions to avoid them.

  7. ↩︎

    One important difference between these findings and mathematical RL is that the reward signal scales with recent experience. If I’ve gotten a maximum of four food pellets in this experiment, getting, or learning that I will definitely get, four pellets produces a large dopamine response. But if I’ve sometimes received ten pellets in the recent past, four pellets will only produce something like 4/10 of the dopamine response. This could be an important difference for alignment purposes, because it means that humans aren’t maximizing any single quantity; their effective utility function is path-dependent.

  8. ↩︎

    Wake, S. J., & Izuma, K. (2017). A common neural code for social and monetary rewards in the human striatum. Social Cognitive and Affective Neuroscience, 12(10), 1558-1564.

  9. ↩︎

    Bromberg-Martin, E. S., & Hikosaka, O. (2009). Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Neuron, 63(1), 119-126.

  10. ↩︎

    Montague, P. R. (2006). Why Choose This Book?: How We Make Decisions. Dutton, New York.

  11. ↩︎

    Herd, S., Krueger, K., Nair, A., Mollick, J., & O’Reilly, R. (2021). Neural mechanisms of human decision-making. Cognitive, Affective, & Behavioral Neuroscience, 21(1), 35-57.
    Herd, S. A., Krueger, K. A., Kriete, T. E., Huang, T. R., Hazy, T. E., & O’Reilly, R. C. (2013). Strategic cognitive sequencing: a computational cognitive neuroscience approach. Computational Intelligence and Neuroscience, 2013, 4-4.