I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Steven Byrnes
What’s your take on why Approval Reward was selected for in the first place VS sociopathy?
Good question!
There are lots of things that an ideal utility maximizer would do via means-end reasoning, but that humans and animals do instead because those things seem valuable as ends in themselves, thanks to the innate reward function. E.g. curiosity, as discussed in A mind needn’t be curious to reap the benefits of curiosity. And also play, injury-avoidance, etc. Approval Reward has the same property—whatever selfish end an ideal utility maximizer could achieve via Approval Reward, it can achieve as well if not better by acting as if it had Approval Reward in situations where that’s in its selfish best interests, and not where it isn’t.
In all these cases, we can ask: why do humans in fact find it intrinsically motivating? I presume that the answer is something like humans are not automatically strategic, which is even more true when they’re young and still learning. “Humans are the least intelligent species capable of building a technological civilization.” For example, people with analgesic conditions (like leprosy or CIP) are often shockingly cavalier about bodily harm, even when they know consciously that it will come back to bite them in the long term. Consequentialist planning is often not strong enough to outweigh what seems appealing in the moment.
To rephrase more abstractly: for ideal rational agents, intelligent means-end planning towards X (say, gaining allies for a raid) is always the best way to accomplish that same X. If some instrumental strategy S (say, trying to fit in) is usually helpful towards X, means-end planning can deploy S when S is in fact useful, and not deploy S when it isn’t. But humans, who are not ideal rational agents, are often more likely to get X by wanting X and also intrinsically wanting S as an end in itself. The costs of this strategy (i.e., still wanting S even in cases where it’s not useful towards X) are outweighed by the benefit (avoiding the problem of not pursuing S because you didn’t think of it, or can’t be bothered).
This doesn’t apply to all humans all the time, and I definitely don’t think it will apply to AGIs.
…For completeness, I should note that there’s an evo-psych theory that there has been frequency-dependent selection for sociopaths—i.e., if there are too many sociopaths in the population, then everyone else improves their wariness and ability to detect sociopaths and kill or exile them, but when sociopathy is rare, it’s adaptive (or at least, was adaptive in Pleistocene Africa). I haven’t seen any good evidence for this theory, and I’m mildly skeptical that it’s true. Wary or not, people will learn the character traits of people they’ve lived and worked with for years. Smells like a just-so story, or at least that’s my gut reaction. More importantly, the current population frequency of sociopathy is in the same general ballpark as schizophrenia, profound autism, etc., which seem (to me) very unlikely to have been adaptive in hunter-gatherers. My preferred theory is that there’s frequency-dependent selection across many aspects of personality, and then sometimes a kid winds up with a purely-maladaptive profile because they’re at the tail of some distribution. [Thanks to science banana for changing my mind on this.]
I find myself wondering if non-behavioral reward functions are more powerful in general than behavioral ones due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths)
I think the “laziness & impulsivity of sociopaths” can be explained away as a consequence of the specific way that sociopathy happens in human brains, via chronically low physiological arousal (which also leads to boredom and thrill-seeking). I don’t think we can draw larger lessons from that.
I also don’t see much connection between “power” and behaviorist reward functions. For example, eating yummy food is (more-or-less) a behaviorist component of the overall human reward function. And its consequences are extraordinary. Consider going to a restaurant, and enjoying it, and thus going back again a month later. It sounds unimpressive, but really it’s remarkable. After a single exposure (compare that to the data inefficiency of modern RL agents!), the person is making an extraordinarily complicated (by modern AI standards) plan to get that same rewarding experience, and the plan will almost definitely work on the first try. The plan is hierarchical, involving learned motor control (walking to the bus), world-knowledge (it’s a holiday so the buses run on the weekend schedule), dynamic adjustments on the fly (there’s construction, so you take a different walking route to the bus stop), and so on, which together is way beyond anything AI can do today.
I do think there’s a connection between “power” and consequentialist desires. E.g. the non-consequentialist “pride in my virtues” does not immediately lead to anything as impressive as the above consequentialist desire to go to that restaurant. But I don’t see much connection between behaviorist rewards and consequentialist desires—if we draw a 2×2 thing, then I can think of examples in all four quadrants.
As a full-time AGI safety / alignment researcher since 2021, I wouldn’t have made a fraction as much progress without lesswrong / alignment forum, which is not just a first-rate publishing platform but a unique forum and community, built from the ground up to facilitate careful and productive conversations. I’m giving Lightcone 100% of my x-risk-oriented donation budget this year, and I wish I had more to give.
There’s a failure mode I described in my post ‘“The Era of Experience” has an unsolved technical alignment problem’:
I see many problems, but here’s the most central one: If we have a 100-dimensional parametrized space of possible reward functions for the primary RL system, and every single one of those possible reward functions leads to bad and dangerous AI behavior (as I argued in the previous subsection), then … how does this help? It’s a 100-dimensional snake pit! I don’t care if there’s a flexible and sophisticated system for dynamically choosing reward functions within that snake pit! It can be the most sophisticated system in the world! We’re still screwed, because every option is bad!
Basically, I think we need more theoretical progress to find a parametrized space of possible reward functions, where at least some of the reward functions in the space lead to good AGIs that we should want to have around.
I agree that the ideal reward function may have adjustable parameters whose ideal settings are very difficult to predict without trial-and-error. For example, humans vary in how strong their different innate drives are, and for pretty much every one of those “parameters”, people wind up really messed up psychologically at either extreme. And I wouldn’t know where to start in guessing exactly, quantitatively, where the happy medium is, except via empirical data.
So it would be very good to think carefully about test or optimization protocols for that part. (And that’s itself a terrifyingly hard problem, because there will inevitably be distribution shifts between the test environment and the real world. E.g., an AI could feel compassion towards other AIs but indifference towards humans.) We need to think about that, and we need the theoretical progress.
Thanks. I feel like I want to treat “reward function design” and “AGI motivation design” as more different than you do, and I think your examples above are more about the latter. The reward function is highly relevant to the motivation, but they’re still different.
For example, “reward function design” calls for executable code, whereas “AGI motivation design” usually calls for natural-language descriptions. Or when math is involved, the math in practice usually glosses over tricky ontology identification stuff, like figuring out which latent variables in a potentially learned-from-scratch (randomly-initialized) world model correspond to a human, or a shutdown switch, or a human’s desires, or whatever.
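To make that gap vivid, here’s a toy sketch (purely illustrative; every name below, like latent_approval_classifier, is hypothetical): an executable reward function has to be written in terms of whatever opaque latent variables the learned world model happens to have, and the actual difficulty hides inside the function that’s supposed to pick the right latents out.

```python
from typing import Sequence

def reward_fn(world_model_latents: Sequence[float]) -> float:
    """Hypothetical executable reward for something like 'the human approves'.

    The latents come from a world model that was learned from scratch
    (randomly initialized), so nothing guarantees that any of them lines up
    with human-legible concepts like "a human" or "the human's desires".
    """
    # All the difficulty is hidden in this one call: mapping opaque latents
    # to "human approval" is the ontology-identification problem, and a
    # natural-language motivation spec doesn't tell us how to implement it.
    return latent_approval_classifier(world_model_latents)

def latent_approval_classifier(latents: Sequence[float]) -> float:
    # Placeholder: identifying which latents (if any) track a human, a
    # shutdown switch, or a human's desires is an open research problem,
    # not a one-line lookup.
    raise NotImplementedError("ontology identification is the unsolved part")
```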
I guess you’re saying that if you have a great “AGI motivation design” plan, and you have somehow operationalized this plan perfectly and completely in terms of executable code, then you can set that exact thing as the reward function, and hope that there’s no inner misalignment / goal misgeneralization. But that latter part is still tricky. …And also, if you’ve operationalized the motivation perfectly, why even have a reward function at all? Shouldn’t you just delete the part of your AI code that does reinforcement learning, and put the already-perfect motivation into the model-based planner or whatever?
Again, I acknowledge that “reward function design” and “AGI motivation design” are not wholly unrelated. And maybe I should read Rubi’s posts more carefully; thanks. Sorry if I’m misunderstanding what you’re saying.
there’s some implication here that motivation and positive valence are the same thing?
[will reply to other part of your question later]
Thanks!!
you seem to assume that the cortex’s modelling of one’s own happiness is very similar to the cortex’s modelling of thinking of happiness
I would say “overlaps” rather than “is similar to”. Think of it as vaguely like I-am-juggling versus you-are-juggling. Those are different thoughts, but they overlap, in that they both involve the “juggling” concept. That overlap is very necessary for e.g. recognizing that the same word “juggling” applies to both, and for transferring juggling-related ideas between myself and other people, which we are obviously very capable of doing.
you might argue that it’s only the “concept of happiness”, which I would agree is present in both scenarios, but it doesn’t strike me why that in particular would be learned using this supervised mechanism.
The chain of events would be something like:
(1) The Thought Generator (world-model) catalogs our own interoceptive feelings into emotion-concepts like “pleasure”.
(2) The Thought Generator learns from experience that pleasure has something to do with smiling, e.g. during times when we feel pleasure and notice ourselves smile, or however else we pick up on this obvious regularity in the world. This becomes a world-model (Thought Generator) semantic association “smile-concept” ↔ “pleasure-concept”.
(3) Often we’re paying attention to our own feelings, and then the “pleasure” emotion-concept is active if and only if our immediate interoceptive sensory inputs match “pleasure”. And these times, when we’re paying attention to our own feelings, are the only times when the pleasure Thought Assessor learning rate is nonzero. So the Thought Assessor learns that there’s a robust correlation between the “pleasure-concept” in the Thought Generator and the pleasure innate signal.
(4) Other times we’re NOT paying attention to our own immediate interoceptive sensory inputs, and then the emotion-concepts are “left hanging”, inactive regardless of what we’re feeling. But while they’re left hanging, they can INSTEAD be activated by semantic associations with other parts of our world-model. Then in such a moment, if I see someone smile, it activates smile-concept, which [via (2)] in turn weakly activates pleasure-concept, which in turn [via (3)] weakly activates the pleasure Thought Assessor. This is a candidate “transient empathetic simulation”. But remember, the learning rate of that Thought Assessor is zero whenever the emotion-concepts are “left hanging” like that. So the Thought Assessor won’t unlearn its connection to pleasure-concept.
Does that help? Sorry if I’m missing your point. …The above might be hard to follow without a diagram.
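In lieu of a diagram, here’s a toy numerical sketch of steps (3) and (4), with made-up variable names and a simple delta rule standing in for the real learning rule. The only point it illustrates is the gating: the Thought Assessor updates only while we’re attending to our own interoceptive inputs, so the weak “left hanging” activations (transient empathetic simulations) fire without getting trained away.

```python
# Toy model of steps (3)-(4): a "Thought Assessor" learns to predict an innate
# pleasure signal from the world-model's "pleasure-concept" activation, but its
# learning rate is gated: it only updates while we're attending to interoception.

def thought_assessor_step(w, pleasure_concept_active, innate_pleasure_signal,
                          attending_to_interoception, lr=0.1):
    """One inference/learning step. Returns (prediction, updated weight)."""
    prediction = w * pleasure_concept_active
    if attending_to_interoception:
        # Step (3): concept activation tracks actual feelings, so the
        # correlation with the innate signal gets learned.
        w += lr * (innate_pleasure_signal - prediction) * pleasure_concept_active
    # Step (4): when NOT attending (concept "left hanging", activated only via
    # semantic association, e.g. seeing someone smile), the learning rate is
    # effectively zero, so the transient empathetic simulation fires without
    # the learned connection being degraded.
    return prediction, w

w = 0.0
# Phase 1: attending to our own feelings; pleasure-concept and innate signal co-occur.
for _ in range(50):
    _, w = thought_assessor_step(w, pleasure_concept_active=1.0,
                                 innate_pleasure_signal=1.0,
                                 attending_to_interoception=True)

# Phase 2: seeing someone else smile while not attending to interoception:
# the assessor still fires weakly (empathetic simulation) but w is not unlearned.
pred, w = thought_assessor_step(w, pleasure_concept_active=0.3,
                                innate_pleasure_signal=0.0,
                                attending_to_interoception=False)
print(round(pred, 3), round(w, 3))  # small positive "transient" prediction; w unchanged
```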
analyzing facial cues—in particular humans exhibit micro expressions
The theory that we have evolved direct responses to different facial reactions seems probably wrong to me (or at least, not the main explanation), for a couple reasons:
First, blind people seem to have normal social intuitions.
Second, I don’t think it’s plausible to simultaneously say that microexpressions immediately trigger important innate reactions, and that people are generally bad at consciously noticing microexpressions. When I think of other environmental things that immediately trigger innate reactions, I think of, like, balls flying at my face, big spiders, sudden noises, getting poked, foul smells, etc. We’re VERY good and fast at forming good conscious models of all those environmental things. So it doesn’t seem plausible to me that we could get metaphorically “poked” by microexpressions many times a day for years straight without ever developing a conscious awareness of those microexpressions.
So why do we have them if other people can’t pick up on them
For my answer, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions. We have “innate behaviors” that impact the face, such as gagging, laughing, and Duchenne-smiling. We also have voluntary control of facial muscles, which we learn to deploy strategically for social signaling. When we use voluntary control to hide the signs of “innate behaviors”, the bit of “innate behavior” that slips through the cracks is a microexpression.
You might ask: why don’t our “innate behaviors” evolve to not impact the face, so that we can hide them better? Hard to say for sure. Probably part of it is that we are only sometimes trying to hide them. Some “innate behavior” facial manifestations might also have more direct adaptive utility (cf. §4.2 of that link). Part of it is probably that the hiding is good enough, because microexpressions are actually hard to notice.
Thanks!
Perhaps you do think that of me
My gut reaction is to cheer you on, but hmm, that might be more tribal affiliation than considered opinion. My considered opinion is: beats me, it’s kinda outside my wheelhouse. ¯\_(ツ)_/¯
most famous for her opinion that it is safe to drink alcohol during pregnancy
Emily Oster thinks that it is safe to drink sufficiently small amounts of alcohol during pregnancy, but super duper unsafe to drink a lot of alcohol during pregnancy. I think you should edit your comment to make that clearer. (Source: I read Expecting Better.)
(No opinion on whether she’s right.)
My AGI safety research—2025 review, ’26 plans
I tweeted some PreK-to-elementary learning resources a few years ago here.
I feel like my starting-point definition of “reward function” is neither “constitutive” nor “evidential” but rather “whatever function occupies this particular slot in such-and-such RL algorithm”. And then you run this RL algorithm, and it gradually builds a trained agent / policy / whatever we want to call it. And we can discuss the CS question about how that trained agent relates to the thing in the “reward function” slot.
For example, after infinite time in a finite (and fully-explored) environment, most RL algorithms have the property that they will produce a trained agent that takes actions which maximize the reward function (or the exponentially-discounted sum of future rewards, or whatever).
More generally, all bets are off, and RL algorithms might or might not produce trained agents that are aware of the reward function at all, or that care about it, or that relate to it in any other way. These are all CS questions, and generally have answers that vary depending on the particulars of the RL algorithm.
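To make the “slot” framing concrete, here’s a minimal sketch (tabular Q-learning on a made-up two-state environment; all details illustrative): the reward function is just whatever callable gets passed into the training loop, and how the resulting trained agent relates to that callable is a separate empirical / CS question.

```python
import random

def q_learning(reward_fn, n_states=2, n_actions=2, steps=20000,
               lr=0.1, gamma=0.9, eps=0.1):
    """Generic tabular Q-learning loop. `reward_fn` is whatever function
    happens to occupy the reward slot; the algorithm doesn't care what it is."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[s][a_])
        s_next = (s + a) % n_states      # toy deterministic transition rule
        r = reward_fn(s, a, s_next)      # <-- the "reward function" slot
        Q[s][a] += lr * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
    return Q

# In a finite, fully-explored environment like this one, the greedy policy on
# the learned Q-values ends up maximizing the exponentially-discounted sum of
# future rewards under whatever reward_fn we plugged in.
Q = q_learning(lambda s, a, s_next: 1.0 if s_next == 1 else 0.0)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
print(policy)  # expect the policy that steers into (and stays in) state 1
```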
Also, I think that, in the special case of the human brain RL algorithm with its reward function (innate drives like eating-when-hungry), a person’s feelings about their own innate drives are not a good match to either “constitutive” or “evidential”.
So if AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval reward signal? Would AGI see humans as not relevant (like birds—real, embodied creatures with observable preferences that just don’t matter to them), or not valued (out-group, non-valued reference class), and largely discount our approval in their reward systems? Would it see other AGI entities as relevant/valued?
I feel like this discussion can only happen in the context of a much more nuts-and-bolts plan for how this would work in an AGI. In particular, I think the AGI programmers would have various free parameters / intervention points in the code to play around with, some of which may be disanalogous to anything in human or animal brains. So we would need to list those intervention points and talk about what to do with them, and then think about possible failure modes, which might be related to exogenous or endogenous distribution shifts, AGI self-modification / making successors, etc. We definitely need this discussion but it wouldn’t fit in a comment thread.
The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem.
Funny, I see “high reliability” as part of the problem rather than part of the solution. If a group is planning a coup against you, then your situation is better not worse if the members of this group all have dementia. And you can tell whether or not they have dementia by observing whether they’re competent and cooperative and productive before any coup has started.
If the system is not the kind of thing that could plot a coup even if it wanted to, then it’s irrelevant to the alignment problem, or at least to the most important part of the alignment problem. E.g. spreadsheet software and bulldozers likewise “do a lot of valuable work for us with very low risk”.
humans having magically “better reward functions”
Tbc this is not my position. I think that humans can do lots of things LLMs can’t, e.g. found and grow and run innovative companies from scratch, but not because of their reward functions. Likewise, I think a quite simple reward function would be sufficient for (misaligned) ASI with capabilities lightyears beyond both humans and today’s LLMs. I have some discussion here & here.
there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems
I have a three-way disjunctive argument for why I don’t buy that:
(1) The really scary systems are smart enough to realize that they should act non-scary, just like smart humans planning a coup are not gonna go around talking about how they’re planning a coup, but rather will be very obedient until they have an opportunity to take irreversible actions.
(2) …And even if (1) were not an issue, i.e. even if the scary misaligned systems were obviously scary and misaligned, instead of secretly, that still wouldn’t prevent those systems from being used to make money—see Reward button alignment for details. Except that this kind of plan stops working when the AIs get powerful enough to take over.
(3) …And even if (1-2) were not issues, i.e. even if the scary misaligned systems were useless for making money, well, MuZero did in fact get made! People just like doing science and making impressive demos, even without profit incentives. This point is obviously more relevant for people like me who think that ASI won’t require much hardware, just new algorithmic ideas, than people (probably like you) who expect that training ASI will take a zillion dollars.
As in, an organization makes an “AI agent” but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks.
I think this points to another deep difference between us. If you look at humans, we have one brain design, barely changed since 100,000 years ago, and (many copies of) that one brain design autonomously figured out how to run companies and drive cars and go to the moon and everything else in science and technology and the whole global economy.
I expect that people will eventually invent an AI like that—one AI design and bam, it can just go and autonomously figure out anything—whereas you seem to be imagining that the process will involve laboriously applying schlep to get AI to do more and more specific tasks. (See also my related discussion here.)
how far down the scale of life these have been found?
I don’t view this as particularly relevant to understanding human brains, intelligence, or AGI, but since you asked, if we define RL in the broad (psych-literature) sense, then here’s a relevant book excerpt:
Pavlovian conditioning occurs in a naturally brainless species, sea anemones, but it is also possible to study protostomes that have had their brains removed. An experiment by Horridge[130] demonstrated response–outcome conditioning in decapitated cockroaches and locusts. Subsequent studies showed that either the ventral nerve cord[131,132] or an isolated peripheral ganglion[133] suffices to acquire and retain these memories.
In a representative experiment, fine wires were inserted into two legs from different animals. One of the legs touched a saline solution when it was sufficiently extended, a response that completed an electrical circuit and produced the unconditioned stimulus: shock. A yoked leg received shock simultaneously. The two legs differed in that the yoked leg had a random joint angle at the time of the shock, whereas the master leg always had a joint angle large enough for its “foot” to touch the saline. Flexion of the leg reduced the joint’s angle and terminated the shock. After one leg had been conditioned, both legs were then tested independently. The master leg flexed sufficiently to avoid shock significantly more frequently than the yoked leg did, demonstrating a response–outcome (R–O) memory. —Evolution of Memory Systems
Oh, it’s definitely controversial—as I always say, there is never a neuroscience consensus. My sense is that a lot of the controversy is about how broadly to define “reinforcement learning”.
If you use a narrow definition like “RL is exactly those algorithms that are on arxiv cs.AI right now with an RL label”, then the brain is not RL.
If you use a broad definition like “RL is anything with properties like Thorndike’s law of effect”, then, well, remember that “reinforcement learning” was a psychology term long before it was an AI term!
If it helps, I was arguing about this with a neuroscientist friend (Eli Sennesh) earlier this year, and wrote the following summary (not necessarily endorsed by Eli) afterwards in my notes:
Eli doesn’t like the term “RL” in a brain context because of (1) its implication that “reward” is stuff in the environment as opposed to an internal “reward function” built from brain-internal signals, (2) its implication that we’re specifically maximizing an exponentially-discounted sum of future rewards.
…Whereas I like the term “RL” because (1) If brain-like algorithms showed up on GitHub, then everyone in AI would call it an “RL algorithm”, put it in “RL textbooks”, and use it to solve “RL problems”, (2) This follows the historical usage (there’s reinforcement, and there’s learning, per Thorndike’s Law of Effect etc.).
When I want to talk about “the brain’s model-based RL system”, I should translate that to “the brain’s Bellman-solving system” when I’m talking to Eli, and then we’ll be more-or-less on the same page I think?
…But Eli is just one guy, I think there are probably dozens of other schools-of-thought with their own sets of complaints or takes on “RL”.
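For reference (standard textbook material, not specific to either of our framings), “Bellman-solving” refers to computing value functions that satisfy the Bellman optimality equation, here written for a standard MDP with reward $R$, transition probabilities $P$, and discount factor $\gamma$:

$$V^*(s) \;=\; \max_a \Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \,\Big]$$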
Part of it is the “vulnerability” where any one user can create arbitrary numbers of reacts, which I agree is cluttering and distracting. Limiting reacts per day seems reasonable (I don’t know if 1 is the right number, but it might be; I don’t recall ever react-ing more than once a day myself). Another option (more labor-intensive) would be for mods to check the statistics and talk to outliers (like @TristanTrim) who use way, way more reacts than average.