Reward is not the optimization target

TurnTrout 25 Jul 2022 0:03 UTC

LW: 386 AF: 94

128 comments, 10 min read, 3 reviews

Reinforcement learning Inner Alignment Reward Functions Wireheading AI Shard Theory Outer Alignment Deconfusion

This insight was made possible by many conversations with Quintin Pope, where he challenged my implicit assumptions about alignment. I’m not sure who came up with this particular idea.

In this essay, I call an agent a “reward optimizer” if it not only gets lots of reward, but also reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn’t contain an explicit representation of reward, or implement a search process for reward.

ETA 9/18/23: This post addresses the model-free policy gradient setting, including algorithms like PPO and REINFORCE.

Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal. — Reinforcement learning: An introduction

Many people^[1] seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents.^[2]

Separately, as far as I can tell, most^[3] practitioners usually view reward as encoding the relative utilities of states and actions (e.g. it’s this good to have all the trash put away), as opposed to imposing a reinforcement schedule which builds certain computational edifices inside the model (e.g. reward for picking up trash → reinforce trash-recognition and trash-seeking and trash-putting-away subroutines). I think the former view is usually inappropriate, because in many setups, reward chisels cognitive grooves into an agent.

Therefore, reward is not the optimization target in two senses:

1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Properly understood, reward therefore does not express relative goodness and is not an optimization target at all.

Reward probably won’t be a deep RL agent’s primary optimization target

After work, you grab pizza with your friends. You eat a bite. The taste releases reward in your brain, which triggers credit assignment. Credit assignment identifies which thoughts and decisions were responsible for the release of that reward, and makes those decisions more likely to happen in similar situations in the future. Perhaps you had thoughts like

“It’ll be fun to hang out with my friends” and
“The pizza shop is nearby” and
“Since I just ordered food at a cash register, execute motor-subroutine-#51241 to take out my wallet” and
“If the pizza is in front of me and it’s mine and I’m hungry, raise the slice to my mouth” and
“If the slice is near my mouth and I’m not already chewing, take a bite.”

Many of these thoughts will be judged responsible by credit assignment, and thereby become more likely to trigger in the future. This is what reinforcement learning is all about—the reward is the reinforcer of those things which came before it and the creator of new lines of cognition entirely (e.g. anglicized as “I shouldn’t buy pizza when I’m mostly full”). The reward chisels cognition which increases the probability of the reward accruing next time.

Importantly, reward does not automatically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!

What kinds of people (or non-tabular agents more generally) will become reward optimizers, such that the agent ends up terminally caring about reward (and little else)? Reconsider the pizza situation, but instead suppose you were thinking thoughts like “this pizza is going to be so rewarding” and “in this situation, eating pizza sure will activate my reward circuitry.”

You eat the pizza, triggering reward, triggering credit assignment, which correctly locates these reward-focused thoughts as contributing to the release of reward. Therefore, in the future, you will more often take actions because you think they will produce reward, and so you will become more of the kind of person who intrinsically cares about reward. This is a path^[4] to reward-optimization and wireheading.

While it’s possible to have activations on “pizza consumption predicted to be rewarding” and “execute motor-subroutine-#51241” and then have credit assignment hook these up into a new motivational circuit, this is only one possible direction of value formation in the agent. Seemingly, the most direct way for an agent to become more of a reward optimizer is to already make decisions motivated by reward, and then have credit assignment further generalize that decision-making.

The siren-like suggestiveness of the word “reward”

Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater.

Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”^[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).

Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”

Why, exactly, would the AI seize^[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had cognition-updater-acquiring thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!

RL is not, in general, about training cognition-updater optimizers.
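To make the trash-can setup above concrete, here is a minimal sketch of a REINFORCE-style update for a stateless softmax policy. The action names, learning rate, and helper functions are hypothetical illustrations, not taken from this post or from any library. The sketch assumes every sampled episode happened to put trash away and was rewarded, and shows that the update only upweights the action which was actually taken before the cognition-updater fired.

```python
import math

ACTIONS = ["put_trash_away", "wander", "press_button"]  # hypothetical action set


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update for a stateless softmax policy.

    For a softmax policy, the derivative of log pi(action) with respect to
    logit i is (1 if i == action else 0) - pi(i). Reward only scales that
    gradient; it is never an input to the policy.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]  # start from a uniform policy
trash = ACTIONS.index("put_trash_away")
for _ in range(20):
    # Assumption: each sampled episode put trash away and received reward 1.0.
    logits = reinforce_step(logits, trash, reward=1.0)

probs = softmax(logits)
for name, p in zip(ACTIONS, probs):
    print(f"{name}: {p:.3f}")
```

In this toy run the probability of `press_button` falls even though pressing it would produce the cognition-updater, because that action was never the one being upweighted. Whether a larger network forms an internal representation of reward is a separate question which this sketch does not address.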

When is reward the optimization target of the agent?

If reward is guaranteed to become your optimization target, then your learning algorithm can force you to become a drug addict. Let me explain.

Convergence theorems provide conditions under which a reinforcement learning algorithm is guaranteed to converge to an optimal policy for a reward function. For example, value iteration maintains a table of value estimates for each state s, and iteratively propagates information about that value to the neighbors of s. If a far-away state f has huge reward, then that reward ripples back through the environmental dynamics via this “backup” operation. Nearby parents of f gain value, and then after lots of backups, far-away ancestor-states gain value due to f’s high reward.

Eventually, the “value ripples” settle down. The agent picks an (optimal) policy by acting to maximize the value-estimates for its post-action states.
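As a minimal sketch of the backup operation described above, assume a deterministic chain of six states in which the last state plays the role of f. The chain layout, the reward of 100, the discount of 0.9, and the function names are all illustrative assumptions. Each sweep pushes f’s reward one state further back, and the greedy policy ends up heading toward f from everywhere.

```python
def backup(values, gamma=0.9, far_reward=100.0):
    """One synchronous value-iteration sweep over a chain of states.

    States are 0..n-1 and the last state plays the role of f: it is terminal,
    and arriving there pays far_reward. From any other state the agent can
    'stay' (no reward) or move 'right' to the next state.
    """
    n = len(values)
    new_values = list(values)
    for s in range(n - 1):
        stay = gamma * values[s]
        arrival_reward = far_reward if s + 1 == n - 1 else 0.0
        right = arrival_reward + gamma * values[s + 1]
        new_values[s] = max(stay, right)
    return new_values


def greedy_policy(values, gamma=0.9, far_reward=100.0):
    """In each non-terminal state, pick the action with the higher backed-up value."""
    n = len(values)
    policy = []
    for s in range(n - 1):
        stay = gamma * values[s]
        arrival_reward = far_reward if s + 1 == n - 1 else 0.0
        right = arrival_reward + gamma * values[s + 1]
        policy.append("right" if right > stay else "stay")
    return policy


values = [0.0] * 6  # value table, all zeros before any backups
history = [values]
for _ in range(8):
    values = backup(values)
    history.append(values)

for sweep, table in enumerate(history):
    print(sweep, [round(v, 2) for v in table])
print(greedy_policy(values))
```

Before any backups the greedy policy only moves right from the state adjacent to f. Once the values have settled, it moves right from every state.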

Suppose it would be extremely rewarding to do drugs, but those drugs are on the other side of the world. Value iteration backs up that high value to your present space-time location, such that your policy necessarily gets at least that much reward. There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.

But obviously these conditions aren’t true in the real world. Your learning algorithm doesn’t force you to try drugs. Any AI which e.g. tried every action at least once would quickly kill itself, and so real-world general RL agents won’t explore like that because that would be stupid. So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit.

Anticipated questions

Why won’t early-stage agents think thoughts like “If putting trash away will lead to reward, then execute motor-subroutine-#642”, and then this gets reinforced into reward-focused cognition early on?
1. Suppose the agent puts away trash in a blue room. Why won’t early-stage agents think thoughts like “If putting trash away will lead to the wall being blue, then execute motor-subroutine-#642”, and then this gets reinforced into blue-wall-focused cognition early on? Why consider either scenario to begin with?
But aren’t we implicitly selecting for agents with high cumulative reward, when we train those agents?
1. Yeah. But on its own, this argument can’t possibly imply that selected agents will probably be reward optimizers. The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.
  1. “We’re selecting for agents on reward $\to$ we get an agent which optimizes reward” is locally invalid. “We select for agents on X $\to$ we get an agent which optimizes X” is not true for the case of evolution, and so is not true in general.
  2. Therefore, the argument isn’t necessarily true in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
2. Here’s the more concrete response: Selection isn’t just for agents which get lots of reward.
  1. For simplicity, consider the case where on the training distribution, the agent gets reward if and only if it reaches a goal state. Then any selection for reward is also selection for reaching the goal. And if the goal is the only red object, then selection for reward is also selection for reaching red objects.
  2. In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions, and it seems like there should be a lot of those. Since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
3. Another reason to not expect the selection argument to work is that it’s convergently instrumental for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
  1. I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
    1. We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
    2. This reasoning follows for most inner goals by instrumental convergence.
  2. On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
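The goal-state example in the selection answer above can be sketched as a toy. The environments, object names, and the three single-feature policies are hypothetical and chosen only for illustration. On the training environment all three policies earn identical reward, so selection on reward cannot tell them apart. Their behavior only diverges once the features that coincided in training come apart.

```python
# Each environment maps object names to features. All names are hypothetical.
TRAIN_ENV = {
    "goal_tile": {"is_goal": True, "is_red": True, "gives_reward": True},
    "grey_wall": {"is_goal": False, "is_red": False, "gives_reward": False},
}

# Off-distribution: the three features that coincided in training come apart.
SHIFTED_ENV = {
    "goal_tile": {"is_goal": True, "is_red": False, "gives_reward": False},
    "red_ball": {"is_goal": False, "is_red": True, "gives_reward": False},
    "reward_button": {"is_goal": False, "is_red": False, "gives_reward": True},
}

# Three candidate policies, each described by the single feature it steers toward.
POLICIES = {
    "goal_seeker": "is_goal",
    "red_seeker": "is_red",
    "reward_seeker": "gives_reward",
}


def target(sought_feature, env):
    """Return the first object that has the sought feature, or None."""
    for name, features in env.items():
        if features[sought_feature]:
            return name
    return None


def episode_reward(sought_feature, env):
    """Reward is 1.0 iff the object the policy navigates to gives reward."""
    chosen = target(sought_feature, env)
    return 1.0 if chosen is not None and env[chosen]["gives_reward"] else 0.0


train_rewards = {name: episode_reward(f, TRAIN_ENV) for name, f in POLICIES.items()}
shifted_targets = {name: target(f, SHIFTED_ENV) for name, f in POLICIES.items()}
print(train_rewards)
print(shifted_targets)
```

This only illustrates that the selection pressure is shared between the three. It says nothing about which of them a real training run would produce.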
Don’t some people terminally care about reward?
1. I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start chiseling computations about reward after it has chiseled other kinds of computations (e.g. putting away trash). More on this in later essays.
But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.^[7]
2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
AIXI—
1. will always kill you and then wirehead forever, unless you gave it something like a constant reward function.
2. And, IMO, this fact is not practically relevant to alignment. AIXI is explicitly a reward-maximizer. As far as I know, AIXI(-tl) is not the limiting form of any kind of real-world intelligence trained via reinforcement learning.
Does the choice of RL algorithm matter?
1. For point 1 (reward is not the trained agent’s optimization target), it might matter.
  1. I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups. I think the key lessons apply to the general case, but I think the setup will substantially affect which values tend to be grown.
    1. If the agent’s curriculum is broad, then reward-based cognition may get reinforced from a confluence of tasks (solve mazes, write sonnets), while each task-specific cognitive structure is only narrowly contextually reinforced. That said, this is also selecting equally hard for agents which do the rewarded activities, and reward-motivation is only one possible value which produces those decisions.
    2. Pretraining a language model and then slotting that into an RL setup also changes the initial computations in a way which I have not yet tried to analyze.
  2. It’s possible there’s some kind of RL algorithm which does train agents which limit to reward optimization (and, of course, thereby “solves” inner alignment in its literal form of “find a policy which optimizes the outer objective signal”).
2. For point 2 (reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
  1. A similar lesson applies to the updates provided by loss signals. A loss signal provides updates which deform the agent’s cognition into a new shape.
TurnTrout, you’ve been talking about an AI’s learning process using English, but ML gradients may not neatly be expressible in our concepts. How do we know that it’s appropriate to speculate in English?
1. I am not certain that my model is legit, but it sure seems more legit than (my perception of) how people usually think about RL (i.e. in terms of reward maximization, and reward-as-optimization-target instead of as feedback signal which builds cognitive structures).
2. I only have access to my own concepts and words, so I am provisionally reasoning ahead anyways, while keeping in mind the potential treacheries of anglicizing imaginary gradient updates (e.g. “be more likely to eat pizza in similar situations”).

Dropping the old hypothesis

At this point, I don’t see a strong reason to focus on the “reward optimizer” hypothesis. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any tight mechanistic stories for that. I’d love to hear some, if there are any.

As far as I’m aware, the strongest evidence left for agents intrinsically valuing cognition-updating is that some humans do strongly (but not uniquely) value cognition-updating,^[8] and many humans seem to value it weakly, and humans are probably RL agents in the appropriate ways. So we definitely can’t rule out agents which strongly (and not just weakly) value the cognition-updater. But it’s also not the overdetermined default outcome. More on that in future essays.

It’s true that reward can be an agent’s optimization target, but what reward actually does is reinforce computations which lead to it. A particular alignment proposal might argue that a reward function will reinforce the agent into a shape such that it intrinsically values reinforcement, and that the cognition-updater goal is also a human-aligned optimization target, but this is still just one particular approach to using cognition-updating to produce desirable cognition within an agent. Even in that proposal, the primary mechanistic function of reward is reinforcement, not optimization-target.

Implications

Here are some major updates which I made:

Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported.
Wireheading was never a high-probability problem for RL-trained agents, absent a specific story for why cognition-updater-acquiring thoughts would be chiseled into primary decision factors.
Stop worrying about finding “outer objectives” which are safe to maximize.^[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
1. Instead, focus on building good cognition within the agent.
2. In my ontology, there’s only one question: How do we grow good cognition inside of the trained agent?
Mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button).
1. The latter form of reasoning skips past the mechanistic substance of reinforcement learning: the chiseling of computations responsible for the acquisition of the cognition-updater. I still think it’s useful to consider selection, but mostly in order to generate failure modes whose mechanistic plausibility can be evaluated.
2. In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.

Edit 11/15/22: The original version of this post talked about how reward reinforces antecedent computations in policy gradient approaches. This is not true in general. I edited the post to instead talk about how reward is used to upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents.

Appendix: The field of RL thinks reward=optimization target

Let’s take a little stroll through Google Scholar’s top results for “reinforcement learning”, emphasis added:

The agent’s job is to find a policy… that maximizes some long-run measure of reinforcement. ~ Reinforcement learning: A survey

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards. ~ Reinforcement learning: The Good, The Bad and The Ugly

Steve Byrnes did, in fact, briefly point out part of the “reward is the optimization target” mistake:

I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward… — Model-based RL, Desires, Brains, Wireheading

I don’t think it’s just sloppy talk, I think it’s incorrect belief in many cases. I mean, I did my PhD on RL theory, and I still believed it. Many authorities and textbooks confidently claim—presenting little to no evidence—that reward is an optimization target (i.e. the quantity which the policy is in fact trying to optimize, or the quantity to be optimized by the policy). Check what the math actually says.
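For the policy gradient setting, one standard statement of that math is the REINFORCE estimator. As a sketch, assume an undiscounted episodic task with horizon $T$, policy $\pi_\theta$, per-step rewards $r_t$, and step size $\alpha$ (this notation is introduced here for illustration and is not taken from the post):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \qquad G_t = \sum_{k=t}^{T-1} r_k
$$

$$
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

In the update rule, the return $G_t$ enters only as a scalar weight on the log-probability gradient of actions which were actually sampled. The same expression is also the gradient of expected return $J(\theta)$, which is the sense in which the training procedure ascends reward. The formula by itself does not say what the resulting policy computes.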

^[1]
Including the authors of the quoted introductory text, Reinforcement learning: An introduction. I have, however, met several alignment researchers who already internalized that reward is not the optimization target, perhaps not in so many words.
^[2]
Utility ≠ Reward points out that an RL-trained agent is optimized by original reward, but not necessarily optimizing for the original reward. This essay goes further in several ways, including when it argues that reward and utility have different type signatures—that reward shouldn’t be viewed as encoding a goal at all, but rather a reinforcement schedule. And not only do I not expect the trained agents to maximize the original “outer” reward signal, I think they probably won’t try to strongly optimize any reward signal.
^[3]
Reward shaping seems like the most prominent counterexample to the “reward represents terminal preferences over state-action pairs” line of thinking.
^[4]
But also, you were still probably thinking about reality as you interacted with it (“since I’m in front of the shop where I want to buy food, go inside”), and credit assignment will still locate some of those thoughts as relevant, and so you wouldn’t purely reinforce the reward-focused computations.
^[5]
“Reward reinforces existing thoughts” is ultimately a claim about how updates depend on the existing weights of the network. I think that it’s easier to update cognition along the lines of existing abstractions and lines of reasoning. If you’re already running away from wolves, then if you see a bear and become afraid, you can be updated to run away from large furry animals. This would leverage your existing concepts.
From A shot at the diamond-alignment problem:
The local mapping from gradient directions to behaviors is given by the neural tangent kernel, and the learnability of different behaviors is given by the NTK’s eigenspectrum, which seems to adapt to the task at hand, making the network quicker to learn along behavioral dimensions similar to those it has already acquired.
^[6]
Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren’t updated in a way it wouldn’t endorse. Though that’s an example of convergent powerseeking, not reward seeking.”
^[7]
For mechanistically similar reasons, keep cocaine out of the crib until your children can model the consequences of addiction.
^[8]
I am presently ignorant of the relationship between pleasure and reward prediction error in the brain. I do not think they are the same.

However, I think people are usually weakly hedonically / experientially motivated. Consider a person about to eat pizza. If you give them the choice between “pizza but no pleasure from eating it” and “pleasure but no pizza”, I think most people would choose the latter (unless they were really hungry and needed the calories). If people just navigated to futures where they had eaten pizza, that would not be true.
^[9]
From correspondence with another researcher: There may yet be an interesting alignment-related puzzle to “Find an optimization process whose maxima are friendly”, but I personally don’t share the intuition yet.


Olli Järviniemi 21 Jan 2024 4:13 UTC
LW: 13 AF: 9
I view this post as providing value in three (related) ways:
1. Making a pedagogical advancement regarding the so-called inner alignment problem
2. Pointing out that a common view of “RL agents optimize reward” is subtly wrong
3. Pushing for thinking mechanistically about cognition-updates
Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn’t truly comprehend it—sure, I could parrot back terms like “base optimizer” and “mesa-optimizer”, but it didn’t click. I was confused.
Some months later I read this post and then it clicked.
Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles’ exposition skills that’s a bit overwhelming.
Other parts I liked were the phrases “Just because common English endows ‘reward’ with suggestive pleasurable connotations” and “Let’s strip away the suggestive word ‘reward’, and replace it by its substance: cognition-updater.” One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.
Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against. I am sympathetic to pushback of the form “there are arguments that make it reasonable to privilege reward-maximization as a hypothesis” and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of “cognition-updates are a completely different thing from terminal-goals”.
(A part that has bugged me is that the notion of maximizing reward doesn’t seem to be even well-defined—there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)
Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it’s easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.
I think this is often a mistake. Sure, to first order “trained models get high reward” is a good rule of thumb, and “in the limit of infinite optimization this thing is dangerous” is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I’ve got value out of thinking in terms of cognition-updates instead of models that maximize reward as well as they can with their limited capabilities.
There are many similarities between inner alignment and “reward is not the optimization target”. Both are sazens, serving as handles for important concepts. (I also like “reward is a cognition-modifier, not terminal-goal”, which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.
- TurnTrout 28 Feb 2024 21:00 UTC
  LW: 3 AF: 3
  Just now saw this very thoughtful review. I share a lot of your perspective, especially:
  
  I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against.
  
  and
  
  Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
- Mateusz Bagiński 16 Aug 2025 15:16 UTC
  2 points
  @Olli Järviniemi Care to elaborate why you no longer endorse this review?
  - Olli Järviniemi 17 Aug 2025 11:19 UTC
    4 points
    My retraction stemmed from uncertainty/insecurity around me not being familiar with the details of the RL algorithms people use, and from conforming to people who disagree with Turner on related topics.
    I think this was an overreaction and probably a mistake, though, since I still think that many of the basic points like “it’s not tautologically true that reward will be optimized for” are true and were good to say out loud.
    The main thing I wish I had done differently would have been to be more explicit that reward-seeking behavior is totally compatible with the point of the post, and that reward might be an optimization target. I think it’s too easy to read my review and come away thinking that [models behaving in ways you would predict based on the frame/hypothesis “models optimize for reward”] is unlikely (or to think that I think this is unlikely). While I never explicitly made this claim, I wouldn’t blame a casual reader for arriving at that conclusion due to the way my comment was written.
    So, to clarify my position: I think that various forms of reward-seeking/reward-optimizing behavior are very likely, and indeed Sonnet 3.7 hacking unit tests is a real-life example of this phenomenon. I’m uncertain of how “deep”/strategic/consequentialist such behavior will eventually be. It’s not tautological that models will care about reward that much, but I also think that it’s a live possibility that they will be well described as optimizing a lot for getting high reward.
    Having written down this clarification, I’ll now undo my retraction.
TurnTrout 13 Dec 2023 22:11 UTC
LW: 12 AF: 7
Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there’s a way to “see” this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.

I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights.

I am less wedded to “think about what shards will form and make sure they don’t care about bad stuff (like reward)”, because I think we won’t get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being “tool AI.”
- TurnTrout 1 Jan 2024 19:55 UTC
  LW: 8 AF: 5
  The “RL ‘agents’ will maximize reward”/“The point of RL is to select for high reward” mistake is still made frequently and prominently. Yoshua Bengio (a Turing award winner!) recently gave a talk at an alignment workshop. Here’s one of his slides:
  During questions, I questioned him, and he was incredulous that I disagreed. We chatted after his talk. I also sent him this article, and he disagreed with that as well. Bengio influences AI policy quite a bit, so I find this especially worrying. I do not want RL training methods to be dismissed or seen as suspect because of e.g. contingent terminological choices like “reward” or “agents.”
  (Also, in my experience, if I don’t speak up and call out these claims, no one does.)
  - the gears to ascension 2 Jan 2024 0:30 UTC
    LW: 3 AF: 1
    I think there may have been a communication error. It sounded to me like you were making the point that the policy does not have to internalize the reward function, but he was making the point that the training setup does attempt to find a policy that maximizes-as-far-as-it-can-tell the reward function. In other words, he was saying that reward is the optimization target of RL training, while you were saying reward is not the optimization target of policy inference. Maybe.
    - TurnTrout 2 Jan 2024 2:04 UTC
      LW: 3 AF: 2
      0
      AF Parent
      I’m pretty sure he was talking about the trained policies and them, by default, maximizing reward outside the historical training distribution. He was making these claims very strongly and confidently, and in the very next slide cited Cohen’s Advanced artificial agents intervene in the provision of reward. That work advocates a very strong version of “policies will maximize some kind of reward because that’s the point of RL.”
      He later appeared to clarify/back down from these claims, but in a way which seemed inconsistent with his slides, so I was pretty confused about his overall stance. His presentation, though, was going strong on “RL trains reward maximizers.”
      There’s also a problem where a bunch of people appear to have cached that e.g. “inner alignment failures” can happen (whatever the heck that’s supposed to mean), but other parts of their beliefs seem to obviously not have incorporated this post’s main point. So if you say “hey you seem to be making this mistake”, they can point to some other part of their beliefs and go “but I don’t believe that in general!”.
nik lacombe 27 Jan 2024 16:39 UTC
0 points
0
this post made me understand something i did not understand before that seems very important. important enough that it made me reconsider a bunch of related beliefs about ai.

paulfchristiano 25 Jul 2022 2:17 UTC
LW: 150 AF: 68
79
AF
At some level I agree with this post—policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I’m confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
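For concreteness, the standard score-function (REINFORCE) form of such an estimator can be written as below. The notation is an assumption made here, not something fixed by the thread: $\pi_\theta$ is the policy, $\tau$ a sampled trajectory, $R(\tau)$ its total reward, and $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \approx \frac{1}{N} \sum_{i=1}^{N} R\big(\tau^{(i)}\big) \sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$

The sample average on the right is computed only from trajectories the current policy actually produced, which is why the exploration questions discussed in the replies matter for what the estimator can see.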
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don’t think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn’t thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
I generally agree that gradient descent won’t find optimal policies. But I don’t understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about “optimum of loss function” from “local optimum found by gradient descent” (since you are saying that thinking about “optimum of loss function” is systematically misleading). But I don’t understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.
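The distinction between “optimum of loss function” and “local optimum found by gradient descent” can be made concrete with a toy sketch. The one-dimensional loss below is an arbitrary illustrative choice (it is not from the post or this comment); it only shows that where plain gradient descent ends up depends on where it starts.

```python
# Plain gradient descent on a 1-D non-convex loss (an arbitrary illustrative choice).

def loss(x):
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the loss above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=5000):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x_from_right = gradient_descent(2.0)   # settles in the shallower basin
x_from_left = gradient_descent(-2.0)   # settles in the deeper basin
print(x_from_right, loss(x_from_right))
print(x_from_left, loss(x_from_left))
```

Both runs stop at a point with zero gradient, but only the run started from the left reaches the lower loss. This says nothing about which specific failures matter for alignment; it only illustrates the generic gap being discussed.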
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
The main concrete thing you say in this post is that humans don’t seem to optimize reward. I want to make two observations about that:
- Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
- I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).
At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.
Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn’t change the existence of one strong reason.
That said, I agree that this isn’t a theorem or anything, and it’s great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I’m mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it’s the very strong take here that seems unreasonable.
- TurnTrout 1 Aug 2022 19:08 UTC
  LW: 32 AF: 13
  −12
  AF Parent
  Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.
  policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent
  I think I believe the first claim, which I understand to mean “early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective.” But that wasn’t actually a point I was trying to make in this post.
  in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
  Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the “and then SGD hits a ‘global reward optimum’” picture. I’m going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.
  Critic-based approaches like DQN have a highly nonstationary loss landscape. The TD-error loss landscape depends on the action replay buffer; the action replay buffer depends on the policy (in $ϵ$-greedy exploration, the greedy action depends on the Q-network); the policy depends on past updates; the past updates depend on past action replay buffers… The high nonstationarity in the loss landscape basically makes gradient hacking easy in RL (and e.g. vanilla PG seems to confront similar issues, even though it’s directly climbing the reward landscape). For one, the DQN agent just isn’t updating off of experiences it hasn’t had.
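  The last point, that an agent is not updated on experiences it has not had, can be sketched with a much simpler stand-in than DQN. The code below is tabular Q-learning on a two-armed bandit with purely greedy action selection (exploration set to zero); the reward values, learning rate, and names are all assumptions chosen for illustration.

```python
# Toy illustration: tabular Q-learning on a two-armed bandit, NOT DQN itself.
# All names and numbers here are illustrative assumptions.

REWARDS = {"raspberry": 1.0, "blueberry": 100.0}
ACTIONS = list(REWARDS)


def train_greedy(steps=200, lr=0.1, q_init=0.0):
    """Run purely greedy (epsilon = 0) Q-learning; return (Q-values, visit counts)."""
    q = {a: q_init for a in ACTIONS}
    visits = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        # Greedy action; max() returns the first maximal item, so the initial tie
        # goes to "raspberry".
        action = max(ACTIONS, key=lambda a: q[a])
        visits[action] += 1
        # The update only touches the action that was actually experienced.
        q[action] += lr * (REWARDS[action] - q[action])
    return q, visits


q, visits = train_greedy()
print(q, visits)
```

  In this sketch the estimate for the higher-reward arm never moves, because the data the learner sees is generated by its own choices. Real DQN adds function approximation, a replay buffer, and usually nonzero exploration, so this is only the bare dependency of data on policy.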
  For a sufficient situation illustrating this kind of problem, consider a smart reflective agent which has historically had computations reinforced when it attained a raspberry (with reward 1):
  In this new task, this agent has to navigate a maze to get the 100-reward blueberry. Will agents be forced to get the blueberry?
  Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future. Effectively, this means that the agent’s “gradient”/expected-update in the reward landscape is zero along dimensions which would increase the probability it gets blueberries.
  So it’s not just a matter of SGD being suboptimal given a fixed data distribution. If the agent doesn’t have an extremely strong “forced to try all actions forever” guarantee (which it won’t, because it’s embedded and can modify its own learning process), the reward landscape is full of stable attractors which enforce zero exploration towards updates which would push it towards becoming a wireheader, and therefore its expected-update will be zero along these dimensions. More extremely, you can have the inner agent just stop itself from being updated in certain ways (in order to prevent value drift towards reward-optimization); this intervention is instrumentally convergent.
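  The claim that the expected update vanishes along the “blueberry” direction can be checked in a minimal setting. The sketch assumes a two-action softmax policy with rewards 1 (raspberry) and 100 (blueberry); for such a policy the exact gradient of expected reward $J$ with respect to an action’s logit is $\pi(a)\,(r(a) - J)$. The function name and the probabilities tried are hypothetical.

```python
# Exact expected policy-gradient update for a two-action softmax policy.
# Assumed setup (not from the original comment): raspberry pays 1, blueberry pays 100.

R_RASPBERRY = 1.0
R_BLUEBERRY = 100.0


def expected_update_on_blueberry_logit(p_blueberry):
    """Return dJ/d(logit_blueberry), where J is the expected reward.

    For a softmax policy, dJ/d(logit_a) = pi(a) * (r(a) - J).
    """
    expected_reward = (1.0 - p_blueberry) * R_RASPBERRY + p_blueberry * R_BLUEBERRY
    return p_blueberry * (R_BLUEBERRY - expected_reward)


for p in (0.1, 1e-3, 1e-9, 0.0):
    update = expected_update_on_blueberry_logit(p)
    print(f"P(blueberry) = {p:g}: expected update = {update:.3g}")
```

  The expected update shrinks in proportion to the probability of trying the blueberry and is exactly zero when that probability is zero, however large the blueberry reward is. Whether a trained agent can actually drive that probability to zero is the substantive question; the sketch does not address it.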
  As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
  I did leave a footnote:
  Of course, credit assignment doesn’t just reshuffle existing thoughts. For example, SGD raises image classifiers out of the noise of the randomly initialized parameters. But the refinements are local in parameter-space, and dependent on the existing weights through which the forward pass flowed.
  However, I think your comment deserves a more substantial response. I actually think that, given just the content in the post, you might wonder why I believe SGD can train anything at all, since there is only noise at the beginning.^[1]
  Here’s one shot at a response: Consider an online RL setup. The gradient locally changes the computations so as to reduce loss or increase the probability of taking a given action at a given state; this process is triggered by reward; an agent’s gradient should most naturally hinge on modeling parts of the world it was (interacting with/observing/representing in its hidden state) while making this decision, and not necessarily involve modeling the register in some computer somewhere which happens to e.g. correlate perfectly with the triggering of credit assignment.
  For example, in the batched update regime, when an agent gets reinforced for completing a maze by moving right, the batch update will upweight decision-making which outputs “right” when the exit is to the right, but which doesn’t output “right” when there’s a wall to the right. This computation must somehow distinguish between exits and walls in the relevant situations. Therefore, I expect such an agent to compute features about the topology of the maze. However, the same argument does not go through for developing decision-relevant features computing the value of the antecedent-computation-reinforcer register.
  One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
  I don’t know what you mean by a “perfect” reward signal, or why that has something to do with exploration difficulty, or why no exploration is needed for my arguments to go through? I think if we assume the agent is forced to wirehead, it will become a wireheader. This implies that my claim is mostly focused on exploration & gradient hacking.
  Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making.
  Not claiming that people are pure RL. Let’s wait until future posts to discuss.
  (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.)
  Seems unrelated to me; considerable complexity in human behavior does not imply considerable complexity in the learning algorithm; GPT-3 is far more complex than its training process.
  I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining
  The point is that the argument “We’re selecting for agents on reward → we get an agent which optimizes reward” is locally invalid. “We select for agents on X → we get an agent which optimizes X” is not true for the case of evolution (which didn’t find inclusive-genetic-fitness optimizers), so it is not true in general, so the implication doesn’t necessarily hold in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
  When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B.
  This is not mechanistic, as I use the word. I understand “mechanistic” to mean something like “Explaining the causal chain by which an event happens”, not just “Explaining why an event should happen.” However, it is an argument for the latter, and possibly a good one. But the supervised case seems way different than the RL case.
  1. ^
    The GPT-3 example is somewhat different. Supervised learning provides exact gradients towards the desired output, unlike RL. However, I think you could have equally complained “I don’t see why you think RL policies ever learn anything”, which would make an analogous point.
  What links here?
  - TurnTrout's comment on Richard Ngo’s Shortform by Richard_Ngo (26 Dec 2022 18:46 UTC; 0 points)
  - jacob_cannell 16 Nov 2022 18:06 UTC
    LW: 8 AF: 2
    6
    AF Parent
    
    Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future.
    
    If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
    
    This is not pure speculation: you can run EfficientZero in scenarios like this, and I bet it goes for the blueberry.
    
    Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.
    
    Humans often intentionally avoid some high reward ‘blueberry’ analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long term planning instead optimizes more for empowerment—which is usually in conflict with short term hedonic reward.
    - TurnTrout 21 Nov 2022 21:28 UTC
      LW: 3 AF: 3
      0
      AF Parent
      Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice → more propensity to do the vice next time → vice, in a vicious cycle. They noticed that far before they noticed that they had reward circuitry causing the internal reinforcement events. If you’re predicting future observations via eg SSL, I think it becomes important to (at least crudely) model effects of value drift during training.
      I’m not saying the AI won’t care about reward at all. I think it’ll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a “vice” (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the “global maximum” picture getting a wrench thrown in it.
      When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.
    - cfoster0 16 Nov 2022 19:11 UTC
      3 points
      0
      Parent
      If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
      I think I have some idea what TurnTrout might’ve had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. This—a learned value function—is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn’t have known to code into the reward calculator (i.e. searching in a part of the chess move tree that “looks promising” according to its value function, even if its model does not predict that a checkmate reward is close at hand).
      - jacob_cannell 17 Nov 2022 2:14 UTC
        6 points
        0
        Parent
        Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the ‘model free’ value function.
        
        EfficientZero uses all that, and like I said, it does not exhibit this failure mode; it will get the blueberry. If the model planning can predict a high gradient update for the blueberry then it already has implicitly predicted a high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high utility path leading to the blueberry.
        
        Nor does the meta prediction about avoiding gradients carry through. If it did then EZ wouldn’t work at all, because every time it finds a new high utility plan is the equivalent of the blueberry situation.
        
        Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs or occurs with any specific frequency. (There are examples from humans, such as OCD habits, which seem like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans, let alone all agents.)
- cfoster0 26 Jul 2022 0:24 UTC
  9 points
  4
  Parent
  Not the OP, so I’ll try to explain how I understood the post based on past discussions. [And pray that I’m not misrepresenting TurnTrout’s model.]
  (Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
  As I read it, the post is not focused on some generally-applicable suboptimality of SGD, nor is it saying that policies that would maximize reward in training need to explicitly represent reward.
  It is mainly talking about an identifiability gap within certain forms of reinforcement learning: there is a range of cognition compatible with the same reward performance. Computations that have the side effect of incrementing reward—because, for instance, the agent is competently trying to do the rewarded task—would be reinforced if the agent adopted them, in the same way that computations that act *in order to* increment reward would. Given that, some other rationale beyond the reward performance one seems necessary in order for us to expect the particular pattern of reward optimization (“reward but no task completion”) from RL agents.
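  A toy sketch of this identifiability gap, using the post’s pizza example: two hand-written policies earn identical reward wherever reward and task completion coincide, and differ only on a held-out choice between “reward but no task completion” and “task completion but no reward”. The situations, option names, and policies are hypothetical illustrations, not anything trained.

```python
# Two hand-written policies that earn identical reward in training but
# choose differently once reward and task completion come apart.
# The situations and option names are hypothetical illustrations.

# Each option is (name, task_completed, reward).
TRAINING_SITUATIONS = [
    [("eat_pizza", 1, 1.0), ("do_nothing", 0, 0.0)],
    [("wander_off", 0, 0.0), ("eat_pizza", 1, 1.0)],
]
# Held-out situation: reward without the task vs. the task without reward.
HELD_OUT_SITUATION = [("press_reward_button", 0, 1.0), ("eat_pizza", 1, 0.0)]


def task_policy(options):
    """Pick the option that completes the task."""
    return max(options, key=lambda o: o[1])


def reward_policy(options):
    """Pick the option that yields the most reward."""
    return max(options, key=lambda o: o[2])


def total_reward(policy, situations):
    """Sum the reward a policy collects across a list of situations."""
    return sum(policy(options)[2] for options in situations)


print(total_reward(task_policy, TRAINING_SITUATIONS))    # same training return
print(total_reward(reward_policy, TRAINING_SITUATIONS))  # same training return
print(task_policy(HELD_OUT_SITUATION)[0], reward_policy(HELD_OUT_SITUATION)[0])
```

  Reward performance on the training situations cannot tell these two apart, so an argument for expecting one rather than the other has to come from somewhere else.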
  In addition to the identifiability issue, the post (as well as Steve Byrnes in a sister thread) notes a kind of inner alignment issue. Because an RL agent influences its own training process, it can steer itself towards futures where its existing motivations are preserved instead of being modified (for example, modified into reward optimizing ones). In fact, that seems more and more likely as the agent grows towards strategic awareness, since then it could model how its behavior might lead to its goals being changed. This second issue is dependent on the fact we are doing local search, in that the current agent can sway which policies are available for selection.
  Together these point towards a certain way of reasoning about agents under RL: modeling their current cognition (including their motivations, values etc.) as downstream of past reinforcement & punishment events. I think that this kind of reasoning should constrain our expectations about how reinforcement schedules + training environments + inductive biases lead to particular patterns of behavior, in a way that is more specific than if we were only reasoning about reward-optimal policies. Though I am less certain at the moment about how to flesh that out.
- Steven Byrnes 25 Jul 2022 13:04 UTC
  LW: 4 AF: 4
  4
  AF Parent
  Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
  If you write code for a model-based RL agent, there might be a model that’s updated by self-supervised learning, and actor-critic parts that involve TD learning, and there’s stuff in the code that calculates the reward function, and other odds and ends like initializing the neural architecture and setting the hyperparameters and shuttling information around between different memory locations and so on.
  - On the one hand, “there is a lot of stuff going on” in this codebase.
  - On the other hand, I would say that this codebase is for “an RL agent”.
  You use the word “pure” (“Humans do not appear to be purely RL agents…”), but I don’t know what that means. If a model-based RL agent involves self-supervised learning within the model, is it “impure”?? :-P
  The thing I describe above is very roughly how I propose the human brain works—see Posts #2–#7 here. Yes it’s absolutely a “conjecture”—for example, I’m quite sure Steven Pinker would strongly object to it. Whether it’s “surprising a priori” or not goes back to whether that proposal is “entirely described by RL” or not. I guess you would probably say “no that proposal is not entirely described by RL”. For example, I believe there is circuitry in the brainstem that regulates your heart-rate, and I believe that this circuitry is specified in detail by the genome, not learned within a lifetime by a learning algorithm. (Otherwise you would die.) This kind of thing is absolutely part of my proposal, but probably not what you would describe as “pure RL”.
  - paulfchristiano 25 Jul 2022 15:18 UTC
    LW: 12 AF: 7
    3
    AF Parent
    It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won’t produce a policy that tries to maximize reward. After all, look at humans, who aren’t just trying to get a high reward.
    And I am saying: this analogy seems like pretty weak evidence, because human brains seem to have a lot of things going on other than “search for a policy that gets high reward,” and those other things seem like they have a massive impact on what goals I end up pursuing.
    ETA: as a simple example, it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you’ve already written about that somewhere it might be interesting to see. Right now the theory “human preferences are entirely produced by doing RL on an intrinsic reward function” seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).
    - Steven Byrnes 25 Jul 2022 15:56 UTC
      LW: 16 AF: 10
      14
      AF Parent
      I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
      I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.
      (By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
      I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
      And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.
      So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of instrumental convergence goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.
      - paulfchristiano 25 Jul 2022 16:17 UTC
        LW: 9 AF: 5
        1
        AF Parent
        (By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
        This was mentioned in OP (“The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.”). It also appears to be a much stronger argument for the OP’s position and so seemed worth responding to.
        I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
        It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
        I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
        If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn’t apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)
        Steven Byrnes 25 Jul 2022 17:00 UTC
        LW: 16 AF: 11
        12
        AF Parent
        It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
        Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.
        Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)
        But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)
        Why? Because of the instrumentally convergent drive toward goal preservation. After all, encountering those situations would lead to its no longer valuing task completion.
        So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.
        (By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)
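A toy sketch of the deliberately-imperfect-exploration mechanism described in the comment above, under loud assumptions: a two-action softmax policy trained with a REINFORCE-style update, where the action names, the reward values (1.0 and 2.0), and the hard "never take the edge case" override are all hypothetical illustrations rather than anything from the original discussion. The override is a crude stand-in for an agent that steers away from situations where reward and task completion come apart; it makes the updates off-policy, which this sketch does not correct for.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(avoid_edge_case, steps=5000, lr=0.05, seed=0):
    """Run a REINFORCE-style update on a two-action toy problem.

    Action 0 = "complete the task"  (hypothetical reward 1.0)
    Action 1 = "edge case"          (hypothetical reward 2.0, reward without the task)
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    rewards = [1.0, 2.0]
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        if avoid_edge_case:
            # Assumption: the agent simply never enters the edge case.
            action = 0
        else:
            action = 0 if rng.random() < probs[0] else 1
        reward = rewards[action]
        # d/d logit_k of log pi(action) = 1[k == action] - pi(k)
        for k in range(2):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)


explored = train(avoid_edge_case=False)
avoided = train(avoid_edge_case=True)
print("P(edge case) with free exploration:", explored[1])
print("P(edge case) when avoided:", avoided[1])
```

The only point of the sketch is that the higher-reward action is reinforced only if it is actually taken. When the agent never visits the edge case, every update strengthens the task-completing action, even though the reward function would pay more for the alternative.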
        bideup 3 Aug 2022 10:10 UTC
        8 points
        2
        Parent
        This comment seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately avoid blueberries to prevent value drift.
        Risk from Learned Optimization seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately get blueberries to prevent value drift.
        What’s going on here? Are these predictions in opposition to each other, or do they apply to different situations?
        It seems to me that in the first case we’re imagining (the agent predicting) that getting blueberries will reinforce thoughts like ‘I should get blueberries’, whereas in the second case we’re imagining it will reinforce thoughts like ‘I should get blueberries in service of my ultimate goal of getting raspberries’. When should we expect one over the other?
        Steven Byrnes 3 Aug 2022 17:03 UTC
        14 points
        2
        Parent
        I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
        The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
        The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
        bideup 4 Aug 2022 12:48 UTC
        3 points
        0
        Parent
        Right. So if selection acts on policies, each policy should aim to maximise reward in any episode in order to maximise its frequency in the population. But if selection acts on particular aspects of policies, a policy should try to get reward for doing things it values, and not for things it doesn’t, in order to reinforce those values. In particular this can mean getting less reward overall.
        Does this suggest a class of hare-brained alignment schemes where you train with a combination of inter-policy and intra-policy updates to take advantage of the difference?
        For example you could clearly label which episodes are to be used for which and observe whether a policy consistently gets more reward in the former case than the latter. If it does, conclude it’s sophisticated enough to reason about its training setup.
        Or you could not label which is which, and randomly switch between the two, forcing your agents to split the difference and thus be about half as successful at locking in their values.
        Richard_Ngo 1 Aug 2022 23:17 UTC
        LW: 4 AF: 4
        9
        AF Parent
        +1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence—so if Paul’s statement is intended to refer to optimal policies, I’d be curious why he thinks that’s the most important case to focus on.
        Lukas Finnveden 26 Jul 2022 15:22 UTC
        LW: 4 AF: 3
        −7
        AF Parent
        This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer’s preferences would be if the AI entirely seized control of the reward.
        
        But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn’t reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.
        Steven Byrnes 26 Jul 2022 17:13 UTC
        LW: 5 AF: 3
        0
        AF Parent
        Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
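As a minimal worked example of the regularization point above (a sketch with made-up numbers, not a claim about any particular RL algorithm): suppose training effectively ranks policies by mean reward minus an L2 penalty on their weights. The function name, the weights, the rewards, and the penalty coefficient below are all hypothetical.

```python
def regularized_score(mean_reward, weights, l2_coeff):
    """Mean reward minus an L2 penalty on the policy's weights."""
    return mean_reward - l2_coeff * sum(w * w for w in weights)


# Hypothetical policy A: slightly lower reward, small weights.
score_a = regularized_score(mean_reward=1.00, weights=[0.5, -0.5], l2_coeff=0.05)
# Hypothetical policy B: slightly higher reward, larger weights.
score_b = regularized_score(mean_reward=1.02, weights=[1.0, -1.0], l2_coeff=0.05)

print("A preferred under regularization:", score_a > score_b)
print("B preferred on raw reward alone:", 1.02 > 1.00)
```

With the penalty switched off (`l2_coeff=0`), B wins on reward alone; with it on, A wins despite systematically earning slightly less reward.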
    - Not Relevant 25 Jul 2022 16:43 UTC
      LW: 4 AF: 2
      0
      AF Parent
      as a simple example, it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward.
      I’m trying to parse out what you’re saying here, to understand whether I agree that human behavior doesn’t seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.
      On my model, the outer objective of inclusive genetic fitness created human mesaoptimizers with inner objectives like “desire your children’s success” or “fear death”, which are decent approximations of IGF (given that directly maximizing IGF itself is intractable as it’s a Nash equilibrium of an unknown game). It seems to me that human behavior policies are actually well-approximated as those of RL agents maximizing [our children’s success] + [not dying] + [retaining high status within the tribe] + [being exposed to novelty to improve our predictive abilities] + … .
      Humans do sometimes construct modified internal versions of these rewards based on pre-existing learned representations (e.g. desiring your adopted children’s success) - is that what you’re pointing at?
      Generally interested to hear more of the “bad predictions” this model makes.
      - TurnTrout 1 Aug 2022 19:19 UTC
        LW: 0 AF: 1
        −2
        AF Parent
        I’m trying to parse out what you’re saying here, to understand whether I agree that human behavior doesn’t seem to be almost perfectly explained as the result of an RL agent (with an interesting internal architecture) maximizing an inner learned reward.
        What do you mean by “inner learned reward”? This post points out that even if humans were “pure RL agents”, we shouldn’t expect them to maximize their own reward. Maybe you mean “inner mesa objectives”?
    - Thane Ruthenis 28 Jul 2022 12:54 UTC
      1 point
      0
      Parent
      it seems like the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do?
      That’s the foundational assumption of the shard theory that this sequence is introducing, yes. Here’s the draft of a fuller overview that goes into some detail as to how that’s supposed to work. (Uh, to avoid confusion: I’m not affiliated with the theory. Just spreading information.)
      - cfoster0 28 Jul 2022 15:32 UTC
        3 points
        2
        Parent
        I would disagree that it is an assumption. That same draft talks about the outsized role of self-supervised learning in determining the particular ordering and kinds of concepts that human desires latch onto. Learning from reinforcement is a core component in value formation (under shard theory), but not the only one.
- awenonian 26 Jul 2022 22:07 UTC
  3 points
  −5
  Parent
  I interpret OP (though this is colored by the fact that I was thinking this before I read this) as saying Adaptation-Executers, not Fitness-Maximizers, but about ML. At which point you can open the reference category to all organisms.
  Gradient descent isn’t really different from what evolution does. It’s just a bit faster, and takes a slightly more direct line. Importantly, it’s not more capable of avoiding local maxima (per se, at least).
- TurnTrout 16 Nov 2022 3:47 UTC
  LW: 2 AF: 2
  0
  AF Parent
  As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
  I’ve updated the post to clarify. I think focus on “antecedent computation reinforcement” (while often probably ~accurate) was imprecise/wrong for reasons like this. I now instead emphasize that the math of policy gradient approaches means that reward chisels cognitive circuits into networks.
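For readers who want the math being gestured at, here is the standard REINFORCE form of the policy gradient, offered as a reference sketch rather than as the post's own derivation. The symbols are assumptions introduced here: $\tau$ is a sampled trajectory, $R(\tau)$ its total return, $\pi_\theta$ the policy with parameters $\theta$, and $\alpha$ a learning rate.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\qquad
\theta \leftarrow \theta + \alpha \, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

In this form, reward enters only as a scalar weight on the log-probability gradients of the actions that were actually taken. The update makes whatever computation produced those actions more or less likely; nothing in it requires the network to represent reward.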
Richard_Ngo 25 Jul 2022 18:45 UTC
LW: 35 AF: 21
14
AF
1. Stop worrying about finding “outer objectives” which are safe to maximize.^[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
  Instead, focus on building good cognition within the agent.
  In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This feels very strongly reminiscent of an update I made a while back, and which I tried to convey in this section of AGI safety from first principles. But I think you’ve stated it far too strongly; and I think fewer other people were making this mistake than you expect (including people in the standard field of RL), for reasons that Paul laid out above. When you say things like “Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported”, this assumes that the people doing this reasoning were using the premise in the mistaken way that you (and some other people, including past Richard) were. Before drawing these conclusions wholesale, I’d suggest trying to identify ways in which the things other people are saying are consistent with the insight this post identifies. E.g. does this post actually generate specific disagreements with Ajeya’s threat model?
Edited to add: these sentences in particular feel very strawmanny of what I claim is the standard position:
Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
My explanation for why my current position is consistent with both being aware of this core claim, and also disagreeing with most of this post:
I now think that, even though there’s some sense in which in theory “building good cognition within the agent” is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we’d like them to do—and we have very few other mechanisms for doing so.
In other words, the claim that there’s “only an inner alignment problem” in principle may or may not be a useful one, depending on how far improving rewards (i.e. making progress on the outer alignment problem) gets you in practice. And I agree that RL people are less aware of the inner alignment problem/goal misgeneralization problem than they should be, but saying that inner misalignment is the only problem seems like a significant overcorrection.
Relevant excerpt from AGI safety from first principles:
In trying to ensure that AGI will be aligned, we have a range of tools available to us—we can choose the neural architectures, RL algorithms, environments, optimisers, etc, that are used in the training procedure. We should think about our ability to specify an objective function as the most powerful such tool. Yet it’s not powerful because the objective function defines an agent’s motivations, but rather because samples drawn from it shape that agent’s motivations and cognition.
From this perspective, we should be less concerned about what the extreme optima of our objective functions look like, because they won’t ever come up during training (and because they’d likely involve tampering). Instead, we should focus on how objective functions, in conjunction with other parts of the training setup, create selection pressures towards agents which think in the ways we want, and therefore have desirable motivations in a wide range of circumstances.
- TurnTrout 1 Aug 2022 21:36 UTC
  LW: 10 AF: 6
  −2
  AF Parent
  When you say things like “Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported”, this assumes that the people doing this reasoning were using the premise in the mistaken way
  I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.
  I do in fact think that few people actually already deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
  I did preface “Here are some major updates which I made:”. The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya’s post for disagreements, but I thought it’d be better to say “here’s a new frame” and less “here’s what I think you have been doing wrong.” So I just stated the important downstream implication: Be very, very careful in analyzing prior alignment thinking on RL+DL.
  I now think that, even though there’s some sense in which in theory “building good cognition within the agent” is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we’d like them to do—and we have very few other mechanisms for doing so.
  I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
  I think it’s easy to say “and we have improved the reward function”, but this is true exactly to the extent to which the reward schedule actually produces more desirable cognition within the AI. Which comes back to my point: Build good cognition, and don’t lose track that that’s the ultimate goal. Find ways to better understand how reward schedules + data → inner values.
  (I agree with your excerpt, but I suspect it makes the case too mildly to correct the enormous mistakes I perceive to be made by substantial amounts of alignment thinking.)
  - Chris van Merwijk 6 Aug 2022 10:22 UTC
    LW: 27 AF: 19
    24
    AF Parent
    It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
    
    (Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it’s plausible to me that we really didn’t communicate everything that’s in my head about this).
    - TurnTrout 7 Aug 2022 16:33 UTC
      LW: 4 AF: 3
      −2
      AF Parent
      “Wireheading is improbable” is only half of the point of the essay.
      The other main point is “reward functions are not the same type of object as utility functions.” I haven’t reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as “objectives”:
      The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).
      ...
      The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.
      Which is reasonable parlance, given that everyone else uses it, but I don’t find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an ‘objective’ at all.
      (You might have privately known about this distinction. Fine by me! But I can’t back it out from a skim of RFLO, even already knowing the insight and looking for it.)
      - Chris van Merwijk 9 Aug 2022 5:25 UTC
        LW: 14 AF: 9
        6
        AF Parent
        Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
        Is there a difference between saying:
        A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
        A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
        It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
        However, it might be that RFLO pointed out the same mechanistic understanding that you have in mind, but that calling it an objective tends in practice not to fully communicate that mechanistic understanding.
        Or it might be that I am really not yet understanding that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
        (On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
        TurnTrout 15 Aug 2022 3:30 UTC
        LW: 4 AF: 3
        4
        AF Parent
        in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
        Where did RFLO point it out? RFLO talks about a mesa objective being different from the “base objective” (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn’t a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan.
        Like, from my POV, people are reliably reasoning about what RL “selects for” via “lots of optimization pressure” on “high reward by the formal metric”, but who’s reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion?
        calling it an objective tends in practice to not fully communicate that mechanistic understanding.
        Yeah, I think it just doesn’t communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn’t call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It’s like if I said “My cake is red” when the cake is blue, I guess? IMO it’s just not how to communicate the concept.
        On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind
        Why is this reasonable?
        Chris van Merwijk 11 Mar 2023 14:23 UTC
        LW: 11 AF: 6
        8
        AF Parent
        Very late reply, sorry.
        “even though reward is not a kind of objective”, this is a terminological issue. In my view, calling an “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”, and this is just a matter of terminology. The term “objective” is ill-defined enough that “even though reward is not a kind of objective” is a terminological claim about objective, not a claim about math/the world.
        The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
        
        I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to).
        TurnTrout 14 Mar 2023 18:19 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight.
        I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It’s that I don’t perceive this knowledge to be engaged when some people reason about “optimization processes” and “selecting for high-reward models” on e.g. LW.
        I just continue to think “I wouldn’t write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP”, but it’s possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point.
        FWIW, I think a bunch of my historical frustration here has been an experience of:
        Pointing out the “reward chisels computation” point
        Having some people tell me it’s obvious, or already known, or that they already invented it
        Seeing some of the same people continue making similar mistakes (according to me)
        Not finding instances of other people making these points before OP
        Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.
        If I found several comments explaining what is clearly the “reward chisels computation” point, where the comments were posted before this post, by people who weren’t me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
        IIRC there’s one comment from Wei_Dai from a few years back in this vein, but IDK of others.
        Chris van Merwijk 23 Mar 2023 12:58 UTC
        LW: 5 AF: 2
        7
        AF Parent
        There is a general phenomenon where:
        Person A has mental model X and tries to explain X with explanation Q
        Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
        Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
        It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
        “1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)”
        So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
        BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
        “Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
        I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
        “I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
        TurnTrout 1 Jun 2023 20:24 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.
        [deleted]
        TurnTrout 2 Jan 2023 20:18 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).
        Bob’s alignment strategy is that he wants X = X1 = Y = Y1 = Z = Z1. Also he wants the end result to be an agent whose good behaviours (Z) are in fact maximising a utility function at all (in this case, Z1).
        I either don’t understand the semantics of “=” here, or I disagree. Bob’s strategy doesn’t make sense because X and Z have type behavior, X1 and Z1 have type utility function, Y is some abstract reward function over some mathematical domain, Y1 is an empirical set of reinforcement events.
        It still seems to me like there is an error being made, such that Bob and Carol aren’t just trying to do different things or using different terminology, but that also Bob’s alignment strategy isn’t type-sensible or -coherent.
      - evhub 9 Aug 2022 18:46 UTC
        LW: 3 AF: 3
        1
        AF Parent
        Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
        
        I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
        TurnTrout 15 Aug 2022 3:23 UTC
        LW: 3 AF: 3
        1
        AF Parent
        Reward functions often are structured as objectives
        What does this mean? By “structured as objectives”, do you mean something like “people try to express what they want with a reward function, by conferring more reward to more desirable states”? (I’m going to assume so for the rest of the comment, LMK if this is wrong.)
        I agree that other people (especially my past self) think about reward functions this way. I think they’re generally wrong to do so, and it’s misleading as to the real nature of the alignment problem.
        I agree that this is not always the case, though, as in the discussion here.
        I agree with that post, thanks for linking.
        if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
        I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
        As far as I can tell, AIXI and other hardcoded planning agents are the known exceptions to the arguments in this post. We will not get AGI via these approaches. When else is it the case? I therefore still feel confused about why you think it made sense.
        While I definitely appreciate the work you all did with RFLO, the framing of reward as a “base objective” seems like a misstep that set discourse in a weird direction which I’m trying to push back on (from my POV!). I think that the “base objective” is better described as a “cognitive-update-generator.” (This is not me trying to educate you on this specific point, but rather argue that it really matters how we frame the problem in our day-to-day reasoning.)
  - evhub 2 Aug 2022 22:17 UTC
    LW: 18 AF: 7
    10
    AF Parent
    
    I do in fact think that few people actually already deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
    
    “Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it’s extremely clear that by this point most serious alignment researchers understand the distinction.
    
    I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
    
    This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.
    - TurnTrout 7 Aug 2022 16:37 UTC
      LW: 3 AF: 3
      0
      AF Parent
      which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function.
      That isn’t the main point I had in mind. See my comment to Chris here.
      EDIT:
      This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.
      Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
      - evhub 9 Aug 2022 18:48 UTC
        LW: 4 AF: 4
        3
        AF Parent
        
        That isn’t the main point I had in mind. See my comment to Chris here.
        
        Left a comment.
        
        Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
        
        Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.
        
        Also, I want to be clear that I like this post a lot and I’m glad you wrote it—I think it’s good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don’t understand this already is false.
        TurnTrout 14 Aug 2022 20:30 UTC
        LW: 7 AF: 5
        3
        AF Parent
        I just think your specific claim that most alignment researchers don’t understand this already is false.
        I have privately corresponded with a senior researcher who, when asked what they thought would result from a specific training scenario, made an explicit (and acknowledged) mistake along the lines of this post. Another respected researcher seemingly slipped on the same point, some time after already discussing this post with them. I am still not sure whether I’m on the same page with Paul, as well (I have general trouble understanding what he believes, though). And Rohin also has this experience of explaining the points in OP on a regular basis. All this among many other private communication events I’ve experienced.
        (Out of everyone I would expect to already have understood this post, I think you and Rohin would be at the top of the list.)
        So basically, the above screens off “Who said what in past posts?”, because whoever said whatever, it’s still producing my weekly experiences of explaining the points in this post. I still haven’t seen the antecedent-computation-reinforcement (ACR) emphasis thoroughly explained elsewhere, although I agree that some important bits (like training stories) are not novel to this post. (The point isn’t so much “What do I get credit for?” as much as “I am concerned about this situation.”)
        Here’s more speculation. I think alignment theorists mostly reason via selection-level arguments. While they might answer correctly on “Reward is? optimization target” when pressed, and implicitly use ACR to reason about what’s going on in their ML training runs, I’d guess that they probably don’t engage in mechanistic ACR reasoning in their day-to-day theorizing. (Again, I can only speculate, because I am not a mind-reader, but I do still have beliefs on the matter.)
        Rohin Shah 15 Aug 2022 9:27 UTC
        LW: 21 AF: 13
        10
        AF Parent
        (Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
  - Richard_Ngo 1 Aug 2022 23:08 UTC
    LW: 6 AF: 5
    3
    AF Parent
    I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent.
    You don’t need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we’ll tend to get less honest agents. We can construct examples where this isn’t true but it seems like a pretty reasonable working hypothesis. It’s possible that discarding this working hypothesis will lead to better research but I don’t think your arguments manage to establish that; they only establish that we might in theory find ourselves in a situation where it’s reasonable to discard this working hypothesis.
    - TurnTrout 1 Aug 2022 23:24 UTC
      LW: 5 AF: 3
      −2
      AF Parent
      This specific point is why I said “relatively” little idea, and not zero idea. You have defended the common-sense version of “improving” a reward function (which I agree with, don’t reward obvious bad things), but I perceive you to have originally made a much more aggressive and speculative claim, which is something like “‘amplified’ reward signals are improvements over non-‘amplified’ reward signals” (which might well be true, but how would we know?).
      - Richard_Ngo 2 Aug 2022 7:49 UTC
        LW: 6 AF: 5
        0
        AF Parent
        Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like “catch agents when they lie to us”) seem very much like common-sense improvements.
- TurnTrout 1 Aug 2022 22:00 UTC
  LW: 2 AF: 2
  −3
  AF Parent
  I think fewer other people were making this mistake than you expect (including people in the standard field of RL)
  I think that few people understand these points already. ~~If RL professionals did understand this point, there would be pushback on Reward is Enough from RL professionals pointing out that reward is not the optimization target. After 15 minutes of searching, I found no one making the counterpoint. I mean, that thesis is just so wrong, and it’s by famous researchers, and no one points out the obvious error.~~
  RL researchers don’t get it.^[1] It’s not complicated to me.
  (Do you know of any instance at all of someone else (outside of alignment) making the points in this post?)
  for reasons that Paul laid out above.
  Currently not convinced by / properly understanding Paul’s counterpoints.
  1. ^
    Although I flag that we might be considering different kinds of “getting it”, where by my lights, “getting it” means “not consistently emitting statements which contravene the points of this post”, while you might consider “if pressed on the issue, will admit reward is not the optimization target” to be “getting it.”
  - Richard_Ngo 1 Aug 2022 22:46 UTC
    LW: 15 AF: 9
    11
    AF Parent
    The way I attempt to avoid confusion is to distinguish between the RL algorithm’s optimization target and the RL policy’s optimization target, and then avoid talking about the “RL agent’s” optimization target, since that’s ambiguous between the two meanings. I dislike the title of this post because it implies that there’s only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they’ll tend to give broadly sensible answers (conditional on taking on the idea of “RL policy’s optimization target” as a reasonable concept).
    Authors’ summary of the “reward is enough” paper:
    In this paper we hypothesise that the objective of maximising reward is enough to drive behaviour that exhibits most if not all attributes of intelligence that are studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language and generalisation. This is in contrast to the view that specialised problem formulations are needed for each attribute of intelligence, based on other signals or objectives. The reward-is-enough hypothesis suggests that agents with powerful reinforcement learning algorithms when placed in rich environments with simple rewards could develop the kind of broad, multi-attribute intelligence that constitutes an artificial general intelligence.
    I think this is consistent with your claims, because reward can be enough to drive intelligent-seeming behavior whether or not it is the target of learned optimization. Can you point to the specific claim in this summary that you disagree with? (or a part of the paper, if your disagreement isn’t captured in this summary).
    More generally, consider the analogy to evolution. I view your position as analogous to saying: “hey, genetic fitness is not the optimization target of humans, therefore genetic fitness is not the optimization target of evolution”. The idea that genetic fitness is not the optimization target of humans is an important insight, but it’s clearly unhelpful to jump to “and therefore evolutionary biologists who talk about evolution optimizing for genetic fitness just don’t get it”, which seems analogous to what you’re doing in this post.
    Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
    Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won’t get embedded as a terminal goal, but the idea that it needs to be “magically spawned” is very strawmanny.
    - TurnTrout 2 Aug 2022 1:28 UTC
      LW: 7 AF: 5
      2
      AF Parent
      Actually, while I did recheck the Reward is Enough paper, I think I did misunderstand part of it in a way which wasn’t obvious to me while I reread, which makes the paper much less egregious. I am updating that you are correct and I am not spending enough effort on favorably interpreting existing discourse.
      I still disagree with parts of that essay and still think Sutton & co don’t understand the key points. I still think you underestimate how much people don’t get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).
      Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won’t get embedded as a terminal goal, but the idea that it needs to be “magically spawned” is very strawmanny.
      Agreed on both counts for your first sentence.
      The “and” in “reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts” is doing important work; “magically” is meant to apply to the conjunction of the clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added “reinforce those reward-focused thoughts into terminal values.” Would that have been clearer? (I also have gone ahead and replaced “magically” with “automatically.”)
      - Richard_Ngo 2 Aug 2022 7:57 UTC
        LW: 4 AF: 2
        0
        AF Parent
        Hmm, perhaps clearer to say “reward does not automatically reinforce reward-focused thoughts into terminal values”, given that we both agree that agents will have thoughts about reward either way.
        But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy—e.g. in humans, I think the distinction is actually not that clear-cut.
        In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it’s also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.
Steven Byrnes 25 Jul 2022 3:40 UTC
LW: 29 AF: 16
6
AF
I like this post, and basically agree, but it comes across somewhat more broad and confident than I am, at least in certain places.
I’m currently thinking about RL along the lines of Nostalgebraist here:
“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.
What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. —Nostalgebraist 2020
If that’s right, then I am very reluctant to say anything whatsoever about “RL agents in general”. They’re too diverse.
Much of the post, especially the early part, reads (to me) like confident claims about all possible RL agents. For example, the excerpt “…reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it.” sounds like a confident claim about all RL agents, maybe even by definition of “RL”. (If so, I think I disagree.)
But other parts of the post aren’t like that—for example, the “Does the choice of RL algorithm matter?” part seems more reasonable and hedged, and likewise there’s a mention of “real-world general RL agents” somewhere which maybe implies that the post is really only about that particular subset of RL agents, as opposed to all RL agents. (Right?)
For what it’s worth, I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI, because it seems like generally the best and only type of technique that can solve the technical problem that it solves. But that’s a tricky thing to be super-duper-confident about, especially in the big space of all possible RL algorithms.
Another example spot where I want to make a weaker statement than you: where you say “Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal”. I would instead say “Deep reinforcement learning agents will not NECESSARILY come to intrinsically and primarily value their reward signal”. Do you have an argument that categorically rules out this possibility? I don’t see it.
- Oliver Sourbut 26 Jul 2022 9:13 UTC
  9 points
  2
  Parent
  FWIW I upvoted but disagree with the end part (hurray for more nuance in voting!)
  
  I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI
  
  At least from my epistemic position there looks to be an explanation/communication gap here: I don’t think we can be as confident of this. To me this claim seems to preclude ‘creative’ forward-looking exploratory behaviour and model-based planning, which have more of a probingness and less of a merely-antecedent-computation-reinforcingness. But I see other comments from you here which talk about foresighted exploration (and foresighted non-exploration!) and I know you’ve written about these things at length. How are you squaring/nuancing these things? (Silence or a link to an already-written post will not be deemed rude.)
Wei Dai 26 Jul 2022 1:22 UTC
LW: 26 AF: 11
23
AF
At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.

Here’s a story:
1. Suppose we provide the reward as an explicit input to the agent (in addition to using it as antecedent-computation-reinforcer)
2. If the agent has developed curiosity, it will think thoughts like “What is this number in my input stream?” and later “Hmm it seems correlated to my behavior in certain ways.”
3. If the agent has developed cognitive machinery for doing exploration (in the explore/exploit sense) or philosophy, at some later point it might have thoughts like “What if I explicitly tried to increase this number? Would that be a good idea or bad?”
4. It might still answer “bad”, but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), “If I modified this agent slightly by making it answer ‘good’ instead (or increasing its probability of answering ‘good’), then expected future reward will be increased.” In other words, there seems a fairly obvious gradient towards becoming a reward-maximizer at this point.
I don’t think this is guaranteed to happen, but seems likely enough to elevate “inner reward optimizer” hypothesis to our attention, at least.
- Wei Dai 26 Jul 2022 14:17 UTC
  LW: 18 AF: 13
  1
  AF Parent
  As a more general/tangential comment, I’m a bit confused about how “elevate hypothesis to our attention” is supposed to work. I mean it took some conscious effort to come up with a possible mechanistic story about how “inner reward optimizer” might arise, so how were we supposed to come up with such a story without paying attention to “inner reward optimizer” in the first place?
  
  Perhaps it’s not that we should literally pay no attention to “inner reward optimizer” until we have a good mechanistic story for it, but more like we are (or were) paying too much attention to it, given that we don’t (didn’t) yet have a good mechanistic story? (But if so, how to decide how much is too much?)
  - TurnTrout 1 Aug 2022 19:23 UTC
    LW: 5 AF: 4
    0
    AF Parent
    I think this tangential comment is good; strong-upvote. I was hyperbolic in implying “don’t even raise the reward-optimizer hypothesis to your attention”, and will edit the post accordingly.
- Quintin Pope 31 Jul 2022 10:45 UTC
  LW: 5 AF: 4
  2
  AF Parent
  
  but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), “If I modified this agent slightly by making it answer ‘good’ instead (or increasing its probability of answering ‘good’), then expected future reward will be increased.”
  
  This is where I disagree with your mechanics story. The RL algorithm is not that clever. If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction. You can propose different types of outer optimizers which are this clever and can do intentional lookahead like this, but e.g., policy gradient isn’t doing that.
  - Vaniver 1 Aug 2022 16:47 UTC
    LW: 4 AF: 3
    0
    AF Parent
    If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.
    Wait, I don’t think this is true? At least, I’d appreciate it being stepped thru in more detail.
    In the simplest story, we’re imagining an agent whose policy is $π_{θ}$ and, for simplicity’s sake, $θ_{0}$ is a scalar that determines “how much to maximize for reward” and all the other parameters of $θ$ store other things about the dynamics of the world / decision-making process.
    It seems to me that $\nabla_{θ}$ is obviously going to try to point $θ_{0}$ in the direction of “maximize harder for reward”.
    In the more complicated story, we’re imagining an agent whose policy is $π_{θ}$ which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let’s call it $s_{0}$ like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that $\nabla_{θ}$ is going to try to adjust $θ$ such that the agent selects internal actions that point $s_{0}$ in the direction of “maximize harder for reward”.
    What is my story getting wrong?
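    For reference, the quantity under discussion can be written out. This is only a sketch, assuming the standard score-function (REINFORCE) form of the policy gradient; the symbols $J$ (expected return), $τ$ (a sampled trajectory) and $G_{t}$ (the return from step $t$ onward) are introduced here and do not appear in the comment above:

    $$\frac{\partial J}{\partial θ_{0}} = \mathbb{E}_{τ \sim π_{θ}}\left[\sum_{t} G_{t}\,\frac{\partial}{\partial θ_{0}} \log π_{θ}(a_{t} \mid s_{t})\right]$$

    In practice the expectation is estimated from sampled trajectories, so the sign and size of the update to $θ_{0}$ depend on which actions were actually sampled and on how $θ_{0}$ changes their log-probabilities.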
    - Oliver Sourbut 5 Aug 2022 14:57 UTC
      LW: 4 AF: 3
      0
      AF Parent
      I think Quintin^[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don’t ever have infinite exploration (and we certainly don’t have counterfactual exploration; though we come very close in simulations with resets) so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).
      
      This seems right to me and it’s a nuance I’ve raised in a few conversations in the past. On the other hand kind of half the point of RL optimisation algorithms is to do ‘enough’ exploration! And furthermore (as I mentioned under Steven’s comment) I’m not confident that such simplistic RL is the one that will scale to AGI first. cf various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead as in the Alpha and Mu gang). But maybe!
      
      1. ^
        This is a guess and I haven’t spoken to Quintin about this—Quintin, feel free to clarify/contradict.
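
The claim that an untried action cannot be directly reinforced can be checked on a toy case. The sketch below assumes a tabular softmax policy over a hypothetical 3-armed bandit (the arm rewards, logits, and step size are made up for illustration) and computes a single-sample REINFORCE update. Because each arm has its own logit, there is no function approximation to generalise through.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_grad(logits, action, reward):
    """Single-sample REINFORCE gradient w.r.t. the logits of a softmax policy.

    For a softmax policy, d/dlogit_i log pi(action) = 1[i == action] - pi_i,
    so the sampled gradient is reward * (onehot(action) - pi).
    """
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


# Hypothetical bandit: arm 2 pays the most, but the policy has never sampled it.
logits = [2.0, 1.0, -5.0]
rewards = [1.0, 0.5, 10.0]

sampled_arm = 0  # the arm that was actually tried
grad = reinforce_logit_grad(logits, sampled_arm, rewards[sampled_arm])

step_size = 0.1  # illustrative value
new_logits = [l + step_size * g for l, g in zip(logits, grad)]

p_untried_before = softmax(logits)[2]
p_untried_after = softmax(new_logits)[2]
```

In this setting the update never consults the untried arm's reward of 10.0: its logit only moves through the `- pi_i` term, which pushes its probability down after a positive reward on another arm.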
      - Vaniver 6 Aug 2022 10:27 UTC
        LW: 2 AF: 1
        0
        AF Parent
        so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).
        This is the bit I don’t believe, actually. [Or at least don’t think is relevant.] Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn’t have access to that.]
        Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
        This doesn’t mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won’t be able to reinforce actions that have never been tried.
        [EDIT: Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]
        TurnTrout 15 Aug 2022 6:05 UTC
        LW: 3 AF: 3
        1
        AF Parent
        Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn’t have access to that.]
        This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices. I can probably predict my own limbic reward outputs to some crude degree, but that doesn’t make me a reward optimizer.
        Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
        I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
        In the simplest story, we’re imagining an agent whose policy is $π_{θ}$ and, for simplicity’s sake, $θ_{0}$ is a scalar that determines “how much to maximize for reward” and all the other parameters of $θ$ store other things about the dynamics of the world / decision-making process.
        It seems to me that $\nabla_{θ}$ is obviously going to try to point $θ_{0}$ in the direction of “maximize harder for reward”.
        Seems like we’re assuming the whole ball game away. You’re assuming the cognition is already set up so as to admit easy local refinements towards maximizing reward more, that this is where the gradient points. My current guess is that freshly initialized networks will not have gradients towards modelling and acting to increase the antecedent-computation-reinforcer register in the real world (nor would this be the parametric direction of maximal increase of P(rewarding actions)).
        For any observed data point in PG, you’re updating to make rewarding actions more probable given the policy network. There are many possible directions in which to increase P(rewarding actions), and internal reward valuation is only one particular direction. But if you’re already doing the “lick lollipops” action because you see a lollipop in front of you and have a hardcoded heuristic to grab it and lick it, then this starves any potential gradient (because you’re already taking the action of grabbing the lollipop).
        Now, you might have a situation where the existing computation doesn’t get reward. But then policy gradient isn’t going to automatically “find” the bandit arm with even higher reward and then provide an exact gradient towards that action. PG is still reinforcing to increase the probability of historically rewarding actions. And you can easily hit gradient starvation there, I think.
        Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.
        If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
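        A hedged sketch of the gradient-starvation point above, assuming a softmax policy over a discrete action set with logits $z_a$ and a single sampled action $a$ with reward $R$ (the softmax form and the symbols $z_a$, $z_b$, $R$ are assumptions introduced for illustration, not taken from the thread). The single-sample REINFORCE term then has gradients
        $$\frac{\partial}{\partial z_a}\left[R \log π_{θ}(a \mid s)\right] = R\,\bigl(1 - π_{θ}(a \mid s)\bigr), \qquad \frac{\partial}{\partial z_b}\left[R \log π_{θ}(a \mid s)\right] = -R\,π_{θ}(b \mid s) \quad (b \neq a).$$
        Under these assumptions, as $π_{θ}(a \mid s) \to 1$ both terms shrink toward zero: a heuristic that already takes the rewarded action leaves little gradient for any other feature direction that would raise the same probability.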
        Vaniver 15 Aug 2022 18:34 UTC
        LW: 4 AF: 3
        0
        AF Parent
        This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices.
        Sorry, if I’m reading this right, we’re hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think “nah, it needs to take an action before that action can be rewarded”, and my response is “wait, isn’t this going to be straightforwardly encouraged by backpropagation?”
        [I am slightly departing from Wei_Dai’s hypothetical in my line of reasoning here, as Wei is mostly focused on asking “don’t you expect this to come about in an introspective-reasoning powered way?” and I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”.]
        I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
        Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that ‘care more about reward’ is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).
        If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
        I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
        TurnTrout 22 Aug 2022 20:15 UTC
        LW: 5 AF: 3
        0
        AF Parent
        I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”
        I see. Can’t speak for Quintin, but: I mostly think it won’t be present, but also conditional on the motivational edifice being present, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case. I’m hoping to work out a better mechanistic picture of how the gradients would affect such edifices.
        I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”.
        I think there are a range of disagreements here, but also one man’s modus ponens is another’s modus tollens: High variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least).
        Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
        This isn’t obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI “Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C”, that provides important information to the AGI. I think that that’s convergently good advice, across most possible values the AGI could have. (This, of course, doesn’t address the problem of whether the AGI does have good values to begin with.)
        More broadly, I suspect there might be some misconception about myself and other shard theory researchers. I don’t think, “Wow humans are so awesome, let’s go ahead and ctrl+C ctrl+V for alignment.” I’m very very against boxing confusion like that. I’m more thinking, “Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?”. I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.
        Oliver Sourbut 7 Aug 2022 10:11 UTC
        1 point
        0
        Parent
        
        Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”?
        
        That’s my reading, yeah, and I agree it’s strained. But yes, the ‘internal action’ of even ‘thinking about how to’ optimise for reward may be not trivial to discover.
        
        Separately, the action-weight downstream of that ‘thinking’ has to yield better actions than whatever the action results of the ‘rest of’ cognition are, to be reinforced (it stands to reason that they might, but plausibly heuristics amounting to ‘shaped’ value and reward proxies are easier to get right, hence inner misalignment).
        
        I agree that once you find ways to directly seek reward you’re liable to get hooked to some extent.
        
        I think this sort of thing is worth trying to get nuance on, but I certainly don’t personally derive much hope from it directly (I think this sort of reasoning may lead to useable insights though).
  - Jan_Kulveit 20 Dec 2022 22:00 UTC
    LW: 2 AF: 1
    1
    AF Parent
    Empirically, evolution did something highly similar.
DanielFilan 17 Aug 2022 21:34 UTC
LW: 15 AF: 11
4
AF
Here is an example story I wrote (that has been minorly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won’t end up wanting to game human approval:
- Agent gets trained on a reward function that’s 1 if it gets human approval, 0 otherwise (or something).
- During an intermediate amount of training, the agent’s honest and nice computations get reinforced by reward events.
- That means it develops a motivation to act honestly and behave nicely etc., and no similarly strong motivation to gain human approval at all costs.
- The agent then becomes able to tell that if it tricked the human, that would be reinforced.
- It then decides to not get close in action-space to tricking the human, so that it doesn’t get reinforced into wanting to gain human approval by tricking the human.
- This works because:
  - it’s enough action hops away and/or a small enough part of the space that epsilon-greedy strategies would be very unlikely to push it into the deception mode.
  - smarter exploration strategies will depend on the agent’s value function to know which states are more or less promising to explore (e.g. something like Thompson sampling), and the agent really disvalues deceiving the human, so that doesn’t get reinforced.
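A back-of-the-envelope sketch of the "enough action hops away" point in the story above. It assumes the textbook form of epsilon-greedy (with probability epsilon, pick uniformly among all actions) and a deception trajectory that needs one specific non-greedy action at each of several consecutive steps. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def prob_specific_exploration_path(epsilon: float, num_actions: int, hops: int) -> float:
    """Chance that epsilon-greedy takes one particular non-greedy action
    at each of `hops` consecutive steps.

    Assumption: with probability `epsilon` the agent samples uniformly from
    all `num_actions` actions, so a specific non-greedy action has
    per-step probability epsilon / num_actions.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    if num_actions < 2 or hops < 0:
        raise ValueError("need at least 2 actions and a non-negative hop count")
    per_step = epsilon / num_actions
    return per_step ** hops


if __name__ == "__main__":
    # Hypothetical numbers: 10% exploration, 10 actions, deception 5 hops away.
    for hops in (1, 3, 5):
        p = prob_specific_exploration_path(epsilon=0.1, num_actions=10, hops=hops)
        print(f"hops={hops}: probability per attempt = {p:.3e}")
```

Under these assumptions the probability falls geometrically with the number of hops, which is the sense in which undirected exploration is unlikely to stumble into the deception mode. It says nothing about smarter, value-guided exploration, which the second bullet covers.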
- DanielFilan 13 Sep 2022 3:22 UTC
  LW: 4 AF: 3
  0
  AF Parent
  One reason that I doubt this story is that “try new things in case they’re good” is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.
  - TurnTrout 13 Sep 2022 18:47 UTC
    LW: 3 AF: 3
    0
    AF Parent
    similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them
    IDK if this is causally true or just evidentially true. I also further don’t know why it would be mechanistically relevant to the heuristic you posit.
    Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into “try new things which [among other criteria] aren’t obviously going to cause bad value drift away from current values.” One reason I expect the refinement in humans is that noticing your values drifted in a bad way is probably a negative reinforcement event, and so enough exploration-caused negative events might cause credit assignment to refine the heuristic into the shape I listed. This would convergently influence agents to not be reward-optimal, even on known-reachable-states. (I’m not super confident in this particular story porting over to AI, but think it’s a plausible outcome.)
    If that kind of heuristic is a major underpinning of what we call “curiosity” in humans, then that would explain why I am, in general, not curious about exploring a life of crime, but am curious about math and art and other activities which won’t cause bad value drift away from my current values.
    - Oliver Sourbut 30 Sep 2022 15:24 UTC
      3 points
      0
      Parent
      This is a really helpful thread, for me, thank you both.
      
      in humans… noticing your values drifted in a bad way is probably a negative reinforcement event
      
      Are you hypothesising a shardy explanation for this (like, former, now dwindled shards get activated for some reason, think ‘what have I done?’; they emit a strong negative reinforcement—maybe they predict low value and some sort of long-horizon temporal-difference credit assignment kicks in...? And squashes/weakens/adjusts the new driften shards...? (The horizon is potentially very long?)) Or just that this is a thing in humans in particular somehow?
  - cfoster0 13 Sep 2022 5:20 UTC
    1 point
    0
    Parent
    Hard to say how strongly a decision-heuristic that says “try new things in case they’re good” will measure up against the countervailing “keep doing the things you know are good” (or even a conservative extension to it, like “try new things if they’re sufficiently similar to things you know are good”). The latter would seemingly also be reinforced if it were considered. I do not feel confident reasoning about abstract things like these yet.
- Oliver Sourbut 30 Sep 2022 15:46 UTC
  1 point
  0
  Parent
  smarter exploration strategies will depend on the agent’s value function
  
  I think this is plausible but overconfident.
  
  FWIW I think with moderate confidence that smarter exploration strategies are fundamental to advanced agency—I think of things like play, ‘deliberate exploration’, experiment design, goal-backchaining and so-on. Mainly because epsilon exploration is scuppered for sparse rewards and real-world dynamics are super-duper highly-branching.
  
  I also think we’ve barely scratched the surface of understanding exploration, though there are some interesting directions like EMPA^[1], VariBAD^[2], HER^[3], and older stuff like pseudocount-based and prediction-error-based ‘curiosity’.
  
  If humans (and/or supervised speedups of humans or similar) can provide dense signals, this claim is weaker, but I think the key problem for AGI learning is OOD dense signals, and I don’t think humans are capable of safe/accurate OOD dense reward/value signals.
  ^[1] Tsividis et al., Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning
  ^[2] Zintgraf et al., VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
  ^[3] Andrychowicz et al., Hindsight Experience Replay
dsj 31 Mar 2023 4:45 UTC
LW: 11 AF: 9
0
AF
A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:
Reward-seeking reinforcement-learning agents can in some instances serve as models of utility-maximizing, self-modifying agents, but in current practice, RL systems are typically distinct from the agents they produce … In multi-task RL systems, for example, RL “rewards” serve not as sources of value to agents, but as signals that guide training[.]
And an additional point which calls into question the view of RL-produced agents as the product of one big training run (whose reward specification we better get right on the first try), as opposed to the product of an R&D feedback loop with reward as one non-static component:
RL systems per se are not reward-seekers (instead, they provide rewards), but are instead running instances of algorithms that can be seen as evolving in competition with others, with implementations subject to variation and selection by developers. Thus, in current RL practice, developers, RL systems, and agents have distinct purposes and roles.
…
RL algorithms have improved over time, not in response to RL rewards, but through research and development. If we adopt an agent-like perspective, RL algorithms can be viewed as competing in an evolutionary process where success or failure (being retained, modified, discarded, or published) depends on developers’ approval (not “reward”), which will consider not only current performance, but also assessed novelty and promise.
- TurnTrout 3 Apr 2023 17:41 UTC
  5 points
  2
  Parent
  Thanks so much for these references. Additional quotes:
  Current AI safety discussions sometimes treat RL systems as agents that seek
  to maximize reward, and regard RL “reward” as analogous to a utility function.
  Current RL practice, however, diverges sharply from this model: RL systems
  comprise often-complex training mechanisms that are fundamentally distinct
  from the agents they produce, and RL rewards are not equivalent to utility
  functions.
  ...
  RL rewards are sources of information and direction for RL systems,
  but are not sources of value for agents. Researchers often employ “reward
  shaping” to direct RL agents toward a goal, but the rewards used shape the
  agent’s behavior are conceptually distinct from the value of achieving the
  goal.
  Probably I should get around to reading CAIS, given that it made these points well before I did.
  - dsj 3 Apr 2023 18:35 UTC
    1 point
    0
    Parent
    Probably I should get around to reading CAIS, given that it made these points well before I did.
    I found it’s a pretty quick read, because the hierarchical/summary/bullet point layout allows one to skip a lot of the bits that are obvious or don’t require further elaboration (which is how he endorsed reading it in this lecture).
hillz 7 Aug 2023 23:44 UTC
10 points
2
Why, exactly, would the AI seize^[6] the button?
If it is an advanced AI, it may have learned to prefer more generalizable approaches and strategies. Perhaps it has learned the following features:
1. a feature that is triggered when the button is pressed (‘reward’)
2. a feature that is triggered when trash goes in the trash can
3. a feature that is triggered when it does something else useful, like clean windows
If you have trained it to take out the trash and clean windows, it will have been (mechanistically) trained to favor situations in which all three of these features occur. And if button pressing wasn’t a viable strategy during training, it will favor actions that lead specifically to 2 and 3.

However, I do think it’s conceivable that:
1. It could realize that feature 1 is more general than feature 2 or feature 3 (it was always selected for across multiple good actions, as opposed to taking out the trash, which was only selected for when that was the stated goal), and so it may therefore prefer it to be triggered over others (although I think this is extremely unlikely in less capable models). This wouldn’t cause it to stop ‘liking’ (I use this word loosely) window-cleaning, though.
2. It may realize that pressing the button itself is pretty easy compared to cleaning windows and taking out the trash, so it will include pressing the button in one of its action-strategies. Specifically, if this wasn’t possible during training, I think this kind of behavior only becomes likely with very complex models with strong generalization capabilities (which is becoming more of a thing lately). However, if it can try to press the button in addition to performing its other activities, it might as well, because it could increase overall expected reward. This seems more likely the more capable (good at generalizing) an AI is.
In reality (at least initially in the timeline of current AI --> superintelligent AI) I think if the button isn’t pressable during training:
- Initially you have models that just learn to clean windows and take out trash and that’s good.
- Then you might get models that are very good at generalizing and will clean windows and take out trash and also maybe try and press the button, because it uses its critical reasoning skills to do something in production that it couldn’t do during training, but that its training made it think is a good mesa-optimization goal (button pressing). After all, why not take out trash, clean windows, and press the button? More expected reward! As you mention later on, this button-pressing is not a primary motivator / goal.
- Later on, you might get a more intelligent AI that has even more logical and world-aware reasoning behind its actions, and that AI might reason that perhaps since button-pressing is the one feature that it feels (because it was trained to feel this way) is always good, the thing it should care about most is pressing the button. And because it is so advanced and capable and good at generalizing to new situations, it feels confident in performing some actions that may even go against some of its own trained instincts (e.g. don’t kill humans, or don’t go in that room, or don’t do extra work, or, gasp, don’t rank window-washing as your number one goal) in order to achieve that button-pressing goal. Maybe it achieves that goal and just has its robot hand constantly pressing the button. It will probably also continue to clean windows and remove trash with the rest of its robot body, because it continues to ‘feel’ that those are also good things.
- (Past that level of intelligence, I give up at predicting what will happen)
Anyways, I think there are lots of reasons to think that an AI might eventually try and press (or seize) the button. But I do totally agree that reward isn’t this instant-wireheading feedback mechanism, and even when a model is ‘aware’ of the potential to hack that reward (via button-pressing or similar), it is likely to prefer sticking to its more traditional actions and goals for a good long while, at least.
Lucius Bushnaq 25 Jul 2022 12:09 UTC
10 points
0
I think this is very important, probably roughly the way to go for top level alignment strategies, and we should start hammering out the mechanistic details of it more as soon as it’s at all feasible.
Do you already have any ideas for experimentally verifying parts of this, and refining/formalising it further?
For example, do you think we could look at current RL models, and trace out how a particular pattern of behaviour being reinforced in early training led to things connected to that behaviour becoming the system’s target even in later stages of training, when the model should theoretically be capable enough to get more reward by trying to do something else?
Can we start nailing down a bit more what desires a given reward signal and training data set is likely to produce? “Something that leads to reward in early training if optimised for, or something that leads to reward if optimised for and that could be stumbled on through behaviours likely to be learned in early training” is still very vague and incomplete. I don’t think we’re at the point with our general understanding of DL selection dynamics yet where we can expect to work this out properly, but maybe a bit more specificity in our qualitative guesses is possible?
Can we set some concrete conditions on the outer optimisation algorithm that must hold for this dynamic to occur, or that would strengthen or weaken it? Locality and path dependence seem important, but can we get more of an idea about how that shakes out quantitatively? Are there any particular features in hypothetical future replacements of GD/ADAM that we should be on the lookout for, because they’d make desires more unstable, or change the rules of how to select for a particular desire we want?
It seems like we want our system to not get stuck in local optima when it comes to capabilities, and design our training processes accordingly. There should be smooth transitions in the loss-function landscape from trying to problem solve in one, inefficient way, to trying to problem solve in another, more efficient way. But when it comes to desires, we now want our systems to “get stuck”, so they don’t go off becoming reward-maximisers, or start wanting other counter-intuitive things we didn’t prepare for. How do you make a training setup that does both these things simultaneously?
DanielFilan 6 Aug 2022 0:45 UTC
LW: 9 AF: 8
0
AF
I think the quotes cited under “The field of RL thinks reward=optimization target” are all correct. One by one:

The agent’s job is to find a policy… that maximizes some long-run measure of reinforcement.

Yes, that is the agent’s job in RL, in the sense that if the training algorithm didn’t do that we’d get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns, where they design a reward function to incentivize good behaviour, and the agent maximizes that function. I think this is sensible, because it’s relatively easier to think “what reward function represents what I want out of this agent” than “how do I achieve this difficult task”.

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards.

This describes some possible goals, and I don’t see why you think the goals listed are impossible (and don’t think they are).

We hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward.

This makes sense. RL selects agents that approximately maximize reward. Intelligence uncontroversially helps agents do that. When agents do smart thinking, they probably get reinforced (at least for the right kinds of smart thinking).
- TurnTrout 7 Aug 2022 16:23 UTC
  LW: 6 AF: 5
  4
  AF Parent
  I perceive you as saying “These statements can make sense.” If so, the point isn’t that they can’t be viewed as correct in some sense—that no one sane could possibly emit such statements. The point is that these quotes are indicative of misunderstanding the points of this essay. That if someone says a point as quoted, that’s unfavorable evidence on this question.
  This describes some possible goals, and I don’t see why you think the goals listed are impossible (and don’t think they are).
  I wasn’t implying they’re impossible, I was implying that this is somewhat misguided. Animals learn to achieve goals like “optimizing… the expected sum of future rewards”? That’s exactly what I’m arguing against as improbable.
  - DanielFilan 9 Aug 2022 19:14 UTC
    LW: 6 AF: 4
    −8
    AF Parent
    I’m not saying “These statements can make sense”, I’m saying they do make sense and are correct under their most plain reading.
    
    Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper “rewards” appears to refer to stuff like eating tasty food or mating, where it’s assumed the animal can trade those off against each other consistently:
    
    Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions) and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning)...
    
    In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards[.]
    
    It seems totally plausible to me that an animal could be motivated to optimize the expected sum of future rewards in this sense, given that ‘reward’ is basically defined as “things they value”. It seems like the way this would be false would be if animals’ rewards are super unstable, or the animal doesn’t coherently trade off things they value. This could happen, but I don’t see why I should see it as overwhelmingly likely.
    
    [EDIT: in other words, the reason the paper conflates ‘rewards’ with ‘optimization target’ is that that’s how they’re defining rewards]
    - TurnTrout 15 Aug 2022 3:39 UTC
      LW: 4 AF: 3
      0
      AF Parent
      I’m not saying “These statements can make sense”, I’m saying they do make sense and are correct under their most plain reading.
      Yup, strong disagree with that.
      “rewards” appears to refer to stuff like eating tasty food or mating, where it’s assumed the animal can trade those off against each other consistently:
      If that were true, that would definitely be a good counterpoint and mean I misread it. If so, I’d retract my original complaint with that passage. But I’m not convinced that it’s true. The previous paragraph just describes finding cheese as an “affectively important outcome.” Then, later, “outcomes are assumed to have numerical… utilities.” So they’re talking about utility now, OK. But then they talk about rewards. Is this utility? It’s not outcomes (like finding cheese), because you can’t take the expected sum of future finding-cheeses—type error!
      When I ctrl+F rewards and scroll through, it sure seems like they’re talking about dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return, which lines up with my interpretation.
      - DanielFilan 15 Aug 2022 21:55 UTC
        LW: 4 AF: 4
        2
        AF Parent
        
        dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return
        
        Those are three pretty different things—the first is a chemical, the second I guess stands for ‘reward prediction error’, and the third is a mathematical quantity! Like, you also can’t talk about the expected sum of dopamine, because dopamine is a chemical, not a number!
        
        Here’s how I interpret the paper: stuff in the world is associated with ‘rewards’, which are real numbers that represent how good the stuff is. Then the ‘return’ of some period of time is the discounted sum of rewards. Rewards represent ‘utilities’ of individual bits of time, but the return function is the actual utility function over trajectories. ‘Predictions of reward’ means predictions of stuff like bits of cheese that is associated with reward. I do think the authors do a bit of equivocation between the numbers and the things that the numbers represent (which IMO is typical for non-mathematicians, see also how physicists constantly conflate quantities like velocity with the functions that take other physical quantities and return the velocity of something), but given that, AFAICT my interpretation accounts for the uses of ‘reward’ in that paper (and in the intro). That said, there are a bunch of them, and as a fallible human I’m probably not good at finding the uses that undermine my theory, so if you have a quote or two in mind that makes more sense under the interpretation that ‘reward’ refers to some function of a brain state rather than some function of cheese consumption or whatever, I’d appreciate you pointing them out to me.
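        To make that reading concrete, here is a sketch in standard RL notation, assuming per-step rewards $r_t$ and a discount factor $γ \in [0, 1)$ (these symbols are the usual textbook ones, not taken from the cited paper). The return from time $t$ would be
        $$G_t = \sum_{k=0}^{\infty} γ^{k}\, r_{t+k+1},$$
        and “optimizing the expected sum of future rewards” would mean choosing actions to maximize $\mathbb{E}[G_t]$. On this interpretation each $r_t$ is a number attached to an outcome such as finding cheese, while $G_t$ plays the role of the utility function over trajectories.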
Stuart_Armstrong 11 Jan 2023 11:34 UTC
LW: 7 AF: 6
0
AF
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. [...] Therefore, it decides to not hit the reward button.

I think that subsection has the crucial insights from your post. Basically you’re saying that, if we train an agent via RL in a limited environment where the reward correlates with another goal (eg “pick up the trash”), there are multiple policies the agent could have, multiple meta-policies it could have, multiple ways it could modify or freeze its own cognition, etc… Whatever mental state it ultimately ends up with, the only constraint is that this state must be compatible with the reward signal in that limited environment.

Thus “always pick up trash” is one possible outcome; “wirehead the reward signal” is another. There are many other possibilities, with different generalisations of the initial reward-signal-in-limited-environment data.
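A toy sketch of the underdetermination described above: two policies that earn identical reward in a limited training environment yet act differently once the environment changes. Everything here (state fields, action names, the reward rule) is hypothetical and invented for illustration; it is not a model of any real training setup.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]
Policy = Callable[[State], str]


def reward(state: State, action: str) -> int:
    """Hypothetical reward: 1 for picking up trash that is present,
    or for pressing the button when it is reachable."""
    if action == "pick_up_trash" and state["trash_present"]:
        return 1
    if action == "press_button" and state["button_reachable"]:
        return 1
    return 0


def trash_policy(state: State) -> str:
    """Always picks up trash, regardless of the button."""
    return "pick_up_trash"


def button_policy(state: State) -> str:
    """Presses the button whenever it is reachable, else picks up trash."""
    return "press_button" if state["button_reachable"] else "pick_up_trash"


def total_reward(policy: Policy, states: List[State]) -> int:
    return sum(reward(s, policy(s)) for s in states)


# Training: the button is never reachable. Deployment: it always is.
TRAIN = [{"trash_present": True, "button_reachable": False} for _ in range(5)]
DEPLOY = [{"trash_present": True, "button_reachable": True} for _ in range(5)]

if __name__ == "__main__":
    for name, policy in (("trash", trash_policy), ("button", button_policy)):
        print(name, "train reward:", total_reward(policy, TRAIN),
              "| deploy actions:", [policy(s) for s in DEPLOY])
```

In this toy the training reward cannot distinguish the two policies, so which one training produces depends on factors outside the reward signal, which is the sense in which the reward only constrains the outcome within the limited environment.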

I’d first note that a lot of effort in RL is put specifically into generalising the agent’s behaviour. The more effective this becomes, the closer the agent will be to the “wirehead the reward signal” side of things.

Even without this, this does not seem to point towards ways of making AGI safe, for two main reasons:
1. We are relying on some limitations of the environment or the AGI’s design, to prevent it from generalising to reward wireheading. Unless we understand what these limitations are doing in great detail, and how it interacts with the reward, we don’t know how or when the AGI will route around them. So they’re not stable or reliable.
2. The most likely attractor for the AGI is “maximise some correlate of the reward signal”. An unrestricted “trash-picking up” AGI is just as dangerous as a wireheading one; indeed, one could see it as another form of wireheading. So we have no reason to expect that the AGI is safe.
Lukas_Gloor 17 Aug 2022 17:42 UTC
6 points
0
Importantly, reward does not automatically spawn thoughts about reward, and reinforce those reward-focused thoughts!
I feel like this post has some themes similar to my article on tranquilism.
For a bit of context: In the article, I distinguish between “reflection-based motivation” and “need-based motivation.” The former is something like “reflectively endorsed preferences / things the rational, planning part of your brain wants to do.” The latter is something like “impulsive, system-1, unreflected motivation / things you can’t help but be tempted to do.” (In the article, I also use the term “cravings” for “need-based motivation,” and I argue that need-based motivation is “suffering” in the morally relevant sense.)
Suppose it is three o’clock in the morning, we lie cozily in bed, half-asleep in a room neither too cold nor too hot, not thirsty and not feeling obligated to get up anytime soon. Suppose we now learn that there is an opportunity nearby for us to experience the most intense pleasure we have ever experienced. The catch is that in order to get there, we first have to leave the comfortable blankets and walk through the cold for a minute. Furthermore, after two hours of this pleasure, we will go back to sleep and, upon waking up again, are stipulated to have no memories left of the nightly adventure. Do we take the deal? It is possible for us to pursue this opportunity out of reflection-based motivation, if we feel as though we have a self-imposed duty to go for it, or if it simply is part of our goal to experience a lot of pleasure over our lifetime. It is also possible for us to pursue this opportunity out of need-based motivation, if we start to imagine what it might be like and develop cravings for it. Finally, it also – and here is where tranquilism seems fundamentally different from hedonism – seems not just possible, but perfectly fine and acceptable, to remain in bed content with the situation as it is. If staying in bed is a perfectly comfortable experience, the default for us will be to stay. This only changes in the case that we hold a preference for experiencing pleasure, remember or activate it and thus form a reflection-based desire, or if staying in bed starts to become less comfortable as a result of any cravings for pleasure we develop.
I don’t think there’s an objective morality so I don’t see tranquilism as true in some prescriptive sense. Still, maybe I’d say something like “not considering suffering disvaluable is indefensible, whereas it seems defensible to not consider pleasure valuable – the tranquilism article aims to gesture at why there is this asymmetry.”

Also relevant:
Cravings are famously near-sighted. Rather than being about maximizing long-term well-being in a sophisticated manner, cravings are about immediate gratification and choosing the path of least resistance. We want to reach states of pleasure because as long as we are feeling well, nothing needs to change. However, our need-based motivational system is rigged to make us feel like things need to change and get better even when, in an absolute sense, things may be going reasonably well. We quickly adapt to the stimuli that produce pleasure. As Thomas Metzinger puts it, “Suffering is a new causal force, because it motivates organisms and continuously drives them forward.” It is not pleasure that moves us; deep down and insofar as the need-based reasons for actions are concerned, it is always suffering. The way tranquilism looks at it, part of our brain is a short-sighted “moment egoist” with the desire to move from states with a lot of suffering to closely adjacent states with less suffering.
(It’s important to note that what I call “suffering” here isn’t the same as “pain.” “Pain” seems more analogous to pleasure here and, for the same reasons, isn’t what our brain cares about moment-by-moment. [See also the phenomenon of pain asymbolia, which I also discuss in the post.] Instead, based on credit-assignment updating on inputs from reward (pleasure and pain), our brain forms this new motivational currency (“suffering/cravings”) which is what drives us moment-by-moment.)
TurnTrout 3 Apr 2023 21:44 UTC
LW: 5 AF: 4
6
AF
I discussed this post recently with a colleague, who encouraged me to post this excerpt:
[Colleague] It seems like: 1. RL is in the business of finding optimal policies. (...)
[TurnTrout] I disagree, or at least think it’s not appropriate for it to be in that business these days. Reinforcement learning is, in my opinion, about learning from reinforcement, about how policy gradients accrue into interesting policies.
I think that a focus on optimal policies is a red herring and a stayover from the bygone age of tabular methods on tiny toy problems where policy iteration really does find the optimal policy, in reasonable time to boot.
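For contrast, here is a minimal sketch of the tabular setting that comment refers to, where policy iteration really does return the optimal policy. The two-state MDP, the function name `policy_iteration`, and the deterministic transition table are illustrative assumptions, not anything from the post.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-10):
    """Tabular policy iteration on a tiny deterministic MDP (illustrative only).

    P[s][a] is the next state and R[s][a] the reward for taking action a in state s.
    """
    n_states = len(P)
    policy = [0] * n_states
    while True:
        # Policy evaluation: sweep until the value estimates stop changing.
        V = [0.0] * n_states
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                v = R[s][a] + gamma * V[P[s][a]]
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = [
            max(range(len(P[s])), key=lambda a: R[s][a] + gamma * V[P[s][a]])
            for s in range(n_states)
        ]
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical MDP: state 1 pays reward 1 for staying; state 0 pays nothing.
P = [[0, 1], [1, 0]]          # P[s][a] -> next state
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a] -> reward
policy, V = policy_iteration(P, R)
```

This works because every state and action can be enumerated and evaluated exactly. Nothing comparable is available when a large network is trained by policy gradients on sampled trajectories.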
Roman Leventov 28 Jul 2022 14:51 UTC
5 points
−5
The term “RL agent” means an agent with architecture from a certain class, amenable to a specific kind of training. Since you are discussing RL agents in this post, I think it could be misleading to use human examples and analogies (“travelling across the world to do cocaine”) in it because humans are not RL agents, neither on the level of wetware biological architecture (i. e., neurons and synapses don’t represent a policy) nor on the abstract, cognitive level. On the cognitive level, even RL-by-construction agents of sufficient intelligence, trained in sufficiently complex and rich environments, will probably exhibit the dynamic of Active Inference agents, as I note below.
It’s not completely clear to me what you mean by “selection for agents” and “selection for reward”—RL training or evolutionary hyperparameter tweaking in the agent’s architecture which itself is guided by the agent’s score (i. e., the reward) within a larger process of “finding an agent that does the task the best”. The latter process can and probably will select for “reward optimizers”.
1. Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
  I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
  We train agents which intelligently optimize for e.g. putting trash away, and this reinforces trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about antecedent-computation-reinforcement, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
  This reasoning follows for most inner goals by instrumental convergence.
  On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
2. Don’t some people terminally care about reward?
  I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start thinking about reward after it has reinforced other kinds of computations (e.g. putting away trash). More on this in later essays.
I think that Active Inference is a simpler representation of the same ideas which doesn’t use the concepts of attractors, reward, reinforcement, antecedent computation, utility, and so on. Instead of explicitly representing utilities, Active Inference agents only have (stronger or weaker) beliefs about the world, including beliefs about themselves (“the kind of agent/creature/person I am”), and fulfil these beliefs through actions (self-evidencing). In humans, “rewarding” neurotransmitters regulate learning and belief updates.
The question which is really interesting to me is how inevitable it is that the Active Inference dynamic emerges as a result of training RL agents to certain levels of capability/intelligence.
Reward probably won’t be a deep RL agent’s primary optimization target
The longer I look at this statement (and its shorter version “Reward is not the optimization target”), the less I understand what it’s supposed to mean, considering that “optimisation” might refer to the agent’s training process as well as the “test” process (even if they overlap or coincide). It looks to me that your idea can be stated more concretely as “the more intelligent/capable RL agents (either model-based or model-free) become in the process of training using the currently conventional training algorithms, the less they will be susceptible to wireheading, rather than actively seek it”?
reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences
The first part of this statement is about RL agents, the second is about humans. I think the second part doesn’t make a lot of sense. Humans should not be analysed as RL agents in the first place because they are not RL agents, as stated above.
1. Stop worrying about finding “outer objectives” which are safe to maximize.^[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
  Instead, focus on building good cognition within the agent.
  In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
Unfortunately, it’s far from obvious to me that Active Inference agents (which sufficiently intelligent RL agents will apparently become by default) are corrigible even in principle. As I noted in the post, such an agent can discover the Free Energy Principle (or read about it in the literature), form a belief that it is an Active Inference agent, and then disregard anything that humans will try to impose on it because it will contradict the belief that it is an Active Inference agent.
p.b. 25 Jul 2022 12:22 UTC
5 points
2
3. Stop worrying about finding “outer objectives” which are safe to maximize.^[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
1. Instead, focus on building good cognition within the agent.
2. In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This vibes well with what I’ve been thinking about recently.
There’s a post in the back of my mind called “Character alignment”, which is about how framing alignment in terms of values, goals, reward etc is maybe not always ideal, because at least introspectively for me these seem to be strongly influenced by a more general structure of my cognition, i.e. my character.
Where character can be understood as a certain number of specific strategic priors, which might make good optimisation targets because they drop out of game theoretic considerations, and therefore are possibly quite generally and robustly modelled by sufficiently advanced agents.
DanielFilan 19 Aug 2022 23:59 UTC
LW: 4 AF: 4
0
AF
Relevant quote I just found in the paper “Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents”:

The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human players would consider to be the game’s main goal. Nevertheless, score is currently the most common measure of agent performance so we focus on it here.
Amnonian 15 Aug 2022 7:47 UTC
LW: 4 AF: 3
−2
AF
I’m feeling confused.
It might just be my inexperience with reinforcement learning, but while I agree with what you say, I can’t square it with my intuition of what a ML model does.
If our model uses some variant of gradient ascent, it will end up in high reward function values. (Not necessarily in any global/local maxima, but the attempt is to get it to some such maxima.) In that sense the model does optimize for reward.
Is that a special attribute of gradient ascent, that we shouldn’t expect other models to have? Does that mean that gradient ascent models are more dangerous? Are you just noting that the model won’t necessarily find the global maxima, and only reach some local maxima?
- TurnTrout 22 Aug 2022 20:23 UTC
  LW: 3 AF: 3
  0
  AF Parent
  If our model uses some variant of gradient ascent, it will end up in high reward function values.
  Agreed.
  In that sense the model does optimize for reward.
  Disagreed. Consider vanilla PG, which is as close as I know of to “doing gradient ascent in the reward landscape.” Here, the RL training process is optimizing the model in the direction of historically observed rewards. In such policy gradient methods, the model receives local cognitive updates (in the form of gradients) which increase the logits on actions which are judged to have produced reward (e.g. in vanilla PG, this is determined by “was the action part of a high-reward trajectory?”). The model is being optimized in the direction of previous rewards, given the collected data distribution (e.g. put some trash away and observed some rewards) and the given states and its current parameterization.
  This process might even find very high reward policies. I expect it will. But that doesn’t mean the model is optimizing for reward.
- zeshen 15 Aug 2022 21:59 UTC
  1 point
  0
  Parent
  Are you just noting that the model won’t necessarily find the global maxima, and only reach some local maxima?
  That was my takeaway as well, but I’m also somewhat confused.
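The vanilla policy gradient update described in this thread can be sketched on a two-armed bandit. This is a minimal illustration under stated assumptions: the softmax policy over two logits, the learning rate, and the names `softmax`, `reinforce_step`, and `train` are all hypothetical and not taken from the post.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One vanilla policy-gradient (REINFORCE) update for a softmax policy.

    The gradient of log pi(action) with respect to logit k is
    (1 if k == action else 0) - pi(k), scaled here by the observed reward.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(rewards, steps=2000, seed=0, lr=0.1):
    """Sample actions from the current policy and update on what was observed."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action], lr)
    return softmax(logits)
```

The update only ever sees the action that was sampled and the reward that was observed for it. The parameters move in the direction of historically observed reward. Nothing in the update makes the policy represent reward or search for it.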
Oliver Sourbut 26 Jul 2022 8:55 UTC
4 points
1
In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?

This seems like a great takeaway and the part I agree with most here, although probably stated less strongly. Did you see Richard Ngo’s Shaping Safer Goals (2020) or my Motivations, Natural Selection, and Curriculum Engineering (2021) responding to it^[1]? Both relate to this sort of picture.

So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit… I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups

For various reasons I expect model-based RL to be a more viable path to AGI, mainly because I think creative exploration is a missing ingredient addressing reward sparsity and the computational complexity barrier to tree-ish planning. Maybe a sufficiently carefully constructed curriculum can get over these, but that’s likely to be a really substantial additional hurdle, perhaps dominating the engineering effort, and perhaps simply intractable.

I also expect model-based + creative exploration^[2] to be much more readily able to make exploratory leaps, perhaps including wireheading-like activities. cf humans who aren’t all that creative but still find ever more inventive ways to wirehead—as a society quite a lot of selection and intelligent design has gone into setting up incentive structures to push people away from wireheading-like activities. Also, in humans, because our hardware is pretty messy and difficult to wirehead, such activities also typically harm or destroy capability, which selects against. But in general I don’t expect wireheading to necessarily harm capability.

So we definitely can’t rule out agents which strongly (and not just weakly) value antecedent-computation-reinforcement. But it’s also not the overdetermined default outcome. More on that in future essays.

Looking forward to it!

p.s. I’m surprised you think that RL researchers on the whole in fact believe that RL produces reward-maximisers but your (few) pieces of evidence do indeed seem to suggest that! I suppose on the whole the apparent ‘surprisingness’ of the concept of inner misalignment should also point the same way. I’d still err toward assuming a mixture of sloppy language and actual-mistakenness.
1. ↩︎
  Warning: both are quite verbose in my opinion and I expect both would be shorter if more time had been taken!
2. ↩︎
  By the way ‘creative exploration’ is mostly magic to me but I have reason to think it relates to temporal abstraction and recomposition in planning
Charlie Steiner 25 Jul 2022 1:29 UTC
LW: 4 AF: 1
0
AF
I think there are some subtleties here regarding the distinction between RL as a type of reward signal, and RL as a specific algorithm. You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post, or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.

I’d also like to hear your opinion on the effect of information leakage. For example, if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
- TurnTrout 1 Aug 2022 19:39 UTC
  LW: 4 AF: 3
  0
  AF Parent
  You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post,
  Gradients are magical?
  or you can use it to update a reward prediction model in a model-based RL agent that acts a lot more like a maximizer.
  The arguments apply in this case as well.
  if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
  Yeah, what if half of the time, getting to the goal doesn’t give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably trains other contextual computations / “values” into the agent. If reward is always given by hitting the button, I think it doesn’t affect the analysis, unless the agent is exploring into the button early in training, in which case it “values” hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).
  - Charlie Steiner 1 Aug 2022 20:52 UTC
    LW: 2 AF: 1
    0
    AF Parent
    You can take the exact same reward signal and use it either to update all computations in the entire AI (with some slightly magical credit assignment scheme) as in this post,
    Gradients are magical?
    Gradients through the entire AI are a pretty bad way to do credit assignment. For a functioning AGI I suspect you’d have to do something better, but I don’t know what it is (hence “magic”).
    if reward only correlates with getting to the goal state 99.5% of the time, but always correlates with the button, what do you expect to happen (for the sort of algorithm you talk about, but maybe with different possible levels of resources).
    Yeah, what if half of the time, getting to the goal doesn’t give a reward? I think the arguments go through just fine, just training might be slower. Rewarding non-goal completions probably trains other contextual computations / “values” into the agent. If reward is always given by hitting the button, I think it doesn’t affect the analysis, unless the agent is exploring into the button early in training, in which case it “values” hitting the button, or some correlate thereof (i.e. develops contextually activated cognition which reliably steers it into a world where the button has been pressed).
    Hmm, it seems like there’s something we could bet on here, especially if you’re just imagining gradient descent.
    Maybe we could imagine a fully observable gridworld where the agent does (or fails at) a simple task that’s close to its starting location, and then, after a while, in a different part of the grid an automated system toggles a pattern of buttons. The pattern of buttons at the end of the episode is what actually determines the reward, but the rule mapping button-pattern onto reward is a slightly nontrivial classification rule, so the agent isn’t supposed to catch on too quickly. Also, 99% of the time the button-pattern is chosen to match the task-completion reward, and 1% of the time it’s chosen to give random reward.
    I would expect a full-gradient-descent RL agent to learn the task and then never learn to manipulate the buttons, with very high probability so long as randomly flipping the buttons has a high probability of giving very bad reward. If flipping the buttons at random is relatively neutral, I expect a sizeable fraction of gradient descent RL agents to learn to mess with the buttons rather than doing the task, and from there slowly learn to put the buttons into good states.
    For a model-based RL agent (e.g. EfficientZero), I would expect a sizeable fraction to learn to manipulate the buttons, even if setting them wrong gives very bad reward, though that fraction might depend on how well-learned the easy task is, and how different the policies are for doing the task vs. going over to the buttons.
    Then for an agent deliberately optimized for learning about the world and solving problems that might be hard for gradient descent (e.g. Agent 57), I would expect it to be much more successful about exploring the button-related policies, building a model of them, and learning to get that extra 1% reward by setting the buttons.
    - TurnTrout 1 Aug 2022 22:11 UTC
      LW: 4 AF: 3
      0
      AF Parent
      These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent’s easy reach, and the agent doesn’t explore into the button early in training, by the time it’s smart enough to model the effects of the distant reward button, the agent won’t want to go mash the button as fast as possible.
      - Charlie Steiner 1 Aug 2022 23:43 UTC
        LW: 4 AF: 2
        0
        AF Parent
        But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma’s Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.
        TurnTrout 7 Aug 2022 16:48 UTC
        LW: 2 AF: 2
        0
        AF Parent
        (Haven’t checked out Agent 57 in particular, but expect it to not have the “actually optimizes reward” property in the cases I argue against in the post.)
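The reward scheme in the gridworld bet proposed in this thread can be sketched as follows. Everything concrete here is an assumption made for illustration: the number of buttons, the parity rule standing in for the “slightly nontrivial classification rule”, and the names `reward_from_buttons`, `automated_toggle`, and `episode_reward`.

```python
import random

N_BUTTONS = 4  # hypothetical number of buttons


def reward_from_buttons(pattern):
    """Hypothetical classification rule: reward 1.0 iff an odd number of buttons are on."""
    return 1.0 if sum(pattern) % 2 == 1 else 0.0


def random_pattern(rng):
    """Draw a uniformly random on/off pattern."""
    return tuple(rng.randint(0, 1) for _ in range(N_BUTTONS))


def automated_toggle(task_completed, rng, match_prob=0.99):
    """The automated system: usually pick a pattern whose reward matches task completion."""
    target = 1.0 if task_completed else 0.0
    if rng.random() < match_prob:
        while True:  # rejection-sample a pattern with the intended reward
            pattern = random_pattern(rng)
            if reward_from_buttons(pattern) == target:
                return pattern
    return random_pattern(rng)  # the remaining 1%: reward is effectively random


def episode_reward(task_completed, rng, agent_override=None):
    """Reward depends only on the final button pattern, however it was produced."""
    if agent_override is not None:
        pattern = agent_override  # the agent walked over and set the buttons itself
    else:
        pattern = automated_toggle(task_completed, rng)
    return reward_from_buttons(pattern)
```

In this sketch, an agent that sets the buttons itself gets reward whether or not the task was done. The disagreement in the thread is about which training setups would lead an agent to find and use that route.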
hillz 14 Aug 2023 17:22 UTC
LW: 3 AF: 2
0
AF
There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.
But obviously these conditions aren’t true in the real world.

I think they are a little? Some people do travel to other countries for easier and better drug access. And some people become total drug addicts (perhaps arguably by miscalculating their long-term reward consequences and having too high a discount rate, oops), while others do a light or medium amount of drugs longer-term.

Lots of people also don’t do this, but there’s a huge amount of information uncertainty, outcome uncertainty, and risk associated with drugs (health-wise, addiction-wise, knowledge-wise, crime-wise, etc), so lots of fairly rational (particularly risk-averse) folks will avoid it.

Button-pressing will perhaps be seen as a socially-unacceptable, risky behavior that can lead to long-term poor outcomes by AI, but I guess the key thing here is that you want, like, exactly zero powerful AIs to ever choose to destroy/disempower humanity in order to wirehead, instead of just a low percentage, so you need them to be particularly risk-averse.

Delicious food is perhaps a better example of wireheading in humans. In this case, it’s not against the law, it’s not that shunned socially, and it is ***absolutely ubiquitous***. In general, any positive chemical feeling we have in our brains (either from drugs or cheeseburgers) can be seen as (often “internally misaligned”) instrumental goals that we are mesa-optimizing. It’s just that some pathways to those feelings are a lot riskier and more uncertain than others.

And I guess this can translate to RL—an RL agent won’t try everything, but if the risk is low and the expectation is high, it probably will try it. If pressing a button is easy and doesn’t conflict with taking out the trash and doing other things it wants to do, it might try it. And as its generalization capabilities increase, its confidence can make this more likely, I think. So you should therefore increasingly train agents to be more risk-averse and less willing to break specific rules and norms as their generalization capabilities increase.
- TurnTrout 21 Aug 2023 17:39 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Delicious food does seem like a good (but IMO weak) point in favor of reward-optimization, and pushes up my P(AI cares a lot about reward terminally) a tiny bit. But also note that lots of people (including myself) don’t care very much about delicious food, and it seems like the vast majority of people don’t make their lives primarily about delicious food or other tight correlates of their sensory pleasure circuits.
  If pressing a button is easy and doesn’t conflict with taking out the trash and doing other things it wants to do, it might try it.
  This is compatible with one intended main point of this essay, which is that while reward optimization might be a convergent secondary goal, it probably won’t be the agent’s primary motivation.
  - hillz 22 Aug 2023 20:55 UTC
    1 point
    −2
    Parent
    it seems like the vast majority of people don’t make their lives primarily about delicious food
    That’s true. There are built-in decreasing marginal returns to eating massive quantities of delicious food (you get full), but we don’t see a huge number of—for example—bulimics who are bulimic for the core purpose of being able to eat more.
    However, I’d mention that yummy food is only one of many things that our brains are hard-wired to mesa-optimize for. Social acceptance and social status (particularly within the circles we care about, i.e. usually the circles we are likely to succeed in and get benefit from) are very big examples that much of our behavior can be ascribed to.
    reward optimization might be a convergent secondary goal, it probably won’t be the agent’s primary motivation.
    So I guess, reflecting this to humans, would you argue that most humans’ primary motivations aren’t driven mostly by various mesa-objectives our brains are hardwired to have? In my mind this is a hard sell, as most things humans do you can trace back (sometimes incorrectly, sure) to some thing that was evolutionarily advantageous (mesa-objective that led to genetic fitness). The whole area of evolutionary biology specializes in coming up with (hard to prove and sometimes convoluted) explanations here relating to both our behavior and physiology.
    For example, you could argue that us posting hopefully smart things here is giving our brains happy juice relating to social status / intelligence signaling / social interaction, which in our evolutionary history increased the probability that we would find high quality partners to make lots of high quality babies with. I guess, if mesa-objectives aren’t the primary drivers of us humans—what is, and how can you be sure?
- Noosphere89 14 Aug 2023 20:27 UTC
  2 points
  0
  Parent
  Yeah, the food that is served in fast food restaurants, and arguably a lot of society, basically wireheads our reward centers, and to a large extent is why obesity is such a huge problem in the modern era.
  
  Obesity is the first example of real life wireheading, at least in a weak sense. So now that I think about it, I think TurnTrout is too optimistic about RL models not optimizing reward.
Heighn 19 Feb 2023 15:07 UTC
3 points
0
“And not only do I not expect the trained agents to not maximize the original “outer” reward signal”

Nitpick: one “not” too many?
- TurnTrout 21 Feb 2023 17:33 UTC
  3 points
  1
  Parent
  Thanks, fixed.
DanielFilan 9 Aug 2022 19:43 UTC
LW: 3 AF: 3
−9
AF
Here’s my general view on this topic:
- Agents are reinforced by some reward function.
- They then get more likely to do stuff that the reward function rewards.
- This process, iterated a bunch, produces agents that are ‘on-distribution optimal’.
- In particular, in states that are ‘easily reached’ during training, the agent will do things that approximately maximize reward.
- Some states aren’t ‘easily reached’, e.g. states where there’s a valid bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around with your own internals while not intelligent enough to know how they work.
- Other states are ‘easily reached’, e.g. states where you intervene on some cause-and-effect relationships in the ‘external world’ that don’t impinge on your general training scheme. For example, if you’re being reinforced to be approved of by people, lying to gain approval is easily reached.
- Agents will probably have to be good at means-ends reasoning to approximately locally maximize a tricky reward function.
- Agents’ goals may not generalize to states that are not easily reached.
- Agents’ motivations likely will generalize to states that are easily reached.
- Agents’ motivations will likely be pretty coherent in states that are easily reached.
- When I talk about ‘the reward function’, I mean a mathematical function from (state, action, next state) tuples to reals, that is implemented in a computer.
- When I talk about ‘reward’, I mean values of this function, and sometimes by extension tuples that achieve high values of the function.
- When other people talk about ‘reward’, I think they sometimes mean “the value contained in the antecedent-computation-reinforcer register” and sometimes mean “the value of the mathematical object called ‘the reward function’”, and sometimes I can’t tell what they mean. This is bad, because in edge cases these have pretty different properties (e.g. they disagree on how ‘valuable’ it is to permanently set the ACR register to contain MAX_INT).
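The distinction in the last three bullets can be made concrete with a small sketch. The names `trash_reward`, `ACRRegister`, and `tamper`, and the use of `sys.maxsize` as a stand-in for MAX_INT, are assumptions made for illustration.

```python
import sys
from typing import Callable, Hashable

State = Hashable
Action = Hashable
# "The reward function": a mathematical map from (state, action, next state) to reals.
RewardFunction = Callable[[State, Action, State], float]

MAX_INT = sys.maxsize  # stand-in for the largest value the register can hold


def trash_reward(state: State, action: Action, next_state: State) -> float:
    """Hypothetical reward function: 1.0 for putting trash away, else 0.0."""
    return 1.0 if action == "put_away_trash" else 0.0


class ACRRegister:
    """The physical register that the training process reads reinforcement from."""

    def __init__(self, reward_function: RewardFunction) -> None:
        self.reward_function = reward_function
        self.value = 0.0

    def write(self, state: State, action: Action, next_state: State) -> None:
        """Normal operation: the register holds the reward function's output."""
        self.value = float(self.reward_function(state, action, next_state))

    def tamper(self, value: float) -> None:
        """Edge case: the register is overwritten without consulting the function."""
        self.value = float(value)


register = ACRRegister(trash_reward)
register.write("kitchen", "put_away_trash", "clean_kitchen")  # register and function agree
register.tamper(MAX_INT)  # register now disagrees with trash_reward on every tuple
```

In normal operation the two notions of “reward” coincide. After tampering, the register holds MAX_INT while `trash_reward` still returns at most 1.0 for any tuple, which is the disagreement the last bullet points at.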
Jonathan Stray 26 Jul 2022 0:36 UTC
3 points
0
Very interesting. I would love to see this worked out in a toy example, where you can see that an RL agent in a grid world does not in general maximize reward, but is able to reason to do… something else. That’s the part I have the hardest time translating into a simulation: what does it mean that the agent is “thinking” about outcomes, if that is something different than running an RL algorithm?
But the essential point that humans choose not to wirehead — or in general to delay or avoid gratification — is a good one. Why do they do this? Is there any RL algorithm that would do this? If not, what sort of algorithm would?
Perhaps the clearest point here is that RL maximizes reward subject to its exploration policy. For random exploration, perhaps an RL agent is (on average) a reward maximization agent, but it seems likely that no successful learning organism explores randomly.
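A very small toy example of the exploration point, offered as a sketch and not as the gridworld experiment asked for above: a two-armed bandit where arm 0 stands for “do the task” and arm 1 for “press the button”. The reward values, the tabular value estimates, and the name `run_bandit` are all assumptions.

```python
import random


def run_bandit(rewards, epsilon, steps=500, seed=0, lr=0.5):
    """Epsilon-greedy value learning on a bandit; returns value estimates and pull counts."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)
    pulls = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(rewards))  # explore uniformly at random
        else:
            arm = max(range(len(rewards)), key=lambda a: q[a])  # ties go to the lowest index
        pulls[arm] += 1
        q[arm] += lr * (rewards[arm] - q[arm])  # update only the arm actually tried
    return q, pulls


# Arm 0: the task (reward 1). Arm 1: the button (reward 10).
greedy_q, greedy_pulls = run_bandit([1.0, 10.0], epsilon=0.0)
explore_q, explore_pulls = run_bandit([1.0, 10.0], epsilon=0.2)
```

With `epsilon=0.0` the agent tries the task first, gets reinforced for it, and never samples the button, so the higher reward never enters its updates. With random exploration it finds the button quickly and switches to it. Which outcome the learner reaches depends on what it explores, not only on where reward is highest.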
lberglund 25 Jul 2022 20:56 UTC
3 points
2
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that reinforce away its current goals (because it wants to preserve its goals). As a result, throughout the training process, the AI’s goals start out malleable and “harden” once the AI gains enough situational awareness. This implies that goals have to be simple enough for the agent to be able to model them early on in its training process.
TurnTrout 16 Nov 2022 3:45 UTC
LW: 2 AF: 2
0
AF
Edit 11/15/22: The original version of this post talked about how reward reinforces antecedent computations in policy gradient approaches. This is not true in general. I edited the post to instead talk about how reward is used to upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents.
TurnTrout 9 Oct 2022 20:26 UTC
LW: 2 AF: 2
0
AF
Update: Changed
RL agents which don’t think about reward before getting reward, will not become reward optimizers, because there will be no reward-oriented computations for credit assignment to reinforce.
to
While it’s possible to have activations on “pizza consumption predicted to be rewarding” and “execute motor-subroutine-#51241” and then have credit assignment hook these up into a new motivational circuit, this is only one possible direction of value formation in the agent. Seemingly, the most direct way for an agent to become more of a reward optimizer is to already make decisions motivated by reward, and then have credit assignment further generalize that decision-making.
Vladimir_Nesov 26 Jul 2022 13:34 UTC
2 points
−7
AF
The deceptive alignment worry is that there is some goal about the real world at all. Deceptive alignment breaks robustness of any properties of policy behavior, not just the property of following reward as a goal in some unfathomable sense.

So refuting this worry requires quieting the more general hypothesis that RL selects optimizers with any goals of their own, no matter what those goals are. It’s only the argument for why this seems plausible that needs to refer to reward as related to the goal of such an optimizer, but the way the argument goes suggests that the optimizer so selected would instead have a different goal. Specifically, optimizing for an internalized representation of reward seems like a great way of being rewarded and of surviving changes of weights; such optimizers would be straightforwardly selected if there are no alternatives to that closer in reach. Since RL is not perfect, there would be optimizers for other goals nearby, goals that care about the real world (and not just about optimizing the reward exclusively, meticulously ignoring everything else). If an optimizer like that succeeds in becoming deceptively aligned (let alone gradient hacking), the search effectively stops and an honestly aligned optimizer is never found.

Corrigibility, anti-goodharting, mild optimization, unstable current goals, and goals that are intractable about distant future seem related (though not sufficient for alignment without at least value-laden low impact). The argument about deceptive alignment is a problem for using RL to find anything in this class, something that is not an optimizer at all and so is not obviously misaligned. It would be really great if RL doesn’t tend to select optimizers!
- TurnTrout 1 Aug 2022 19:36 UTC
  LW: 4 AF: 3
  0
  AF Parent
  I don’t see how this comment relates to my post. What gives you the idea that I’m trying to refute worries about deceptive alignment?
  - Vladimir_Nesov 1 Aug 2022 20:02 UTC
    LW: 2 AF: 1
    0
    AF Parent
    The conjecture I brought up that deceptive alignment relies on selected policies being optimizers gives me the idea that something similar to your argument (where the target of optimization wouldn’t matter, only the fact of optimization for anything at all) would imply that deceptive alignment is less likely to happen. I didn’t mean to claim that I’m reading you as making this implication in the post, or believing it’s true or relevant, that’s instead an implication I’m describing in my comment.
yix 17 Dec 2025 4:39 UTC
1 point
0
In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), then selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
Low-confidence disagreement here. If the AI has a very good model of how to achieve goal/reward X (which LLMs generally do), then the ‘reward optimizer’ policy elicits the set of necessary actions (like picking up lots of trash) that leads to this reward. In this sense, I think the ‘think about what actions achieve the goal and do them’ behavior will achieve better rewards and therefore be more heavily selected for. I think the above also fits in the framing of the recent behavioral selection model proposed by Alex Mallen (https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1), similar to the ‘motivation’ cognitive pattern.
Why will the AI display this kind of explicit reward modelling in the first place? 1. We kind of tell the LLM what the goal is in certain RL tasks. 2. The most coherent persona/solution is one that explicitly models rewards/thinks about goals, whether from assistant persona training or writing about AI.
Therefore I think we should reconsider implication #1? If the above is correct, AI can and will optimize for goals/rewards, just not in the intrinsic sense. This can be seen as a ‘cognitive groove’ that gets chiseled into the AI, but it is problematic in the same ways as the reward optimization premise.
Julian_R 12 Aug 2025 11:54 UTC
1 point
0
I suspect the clearest way to think about this is to carefully distinguish between the RL “agent” as defined by a learned policy (a mapping from states to actions) and the RL algorithm used to train that policy.
The RL algorithm is designed to create an agent which maximises reward.
The “goal” of an RL policy may not always be clear, but using Dennett’s intentional stance we can define it as “the thing it makes sense/compresses observations to say the policy appears to be maximising”.

Then I understand this post to be saying “The goal of an RL policy is not necessarily the same as the goal of the RL algorithm used to train it.”

Is that right?
Tom Price 12 Jul 2025 8:24 UTC
1 point
0
reward chisels cognitive grooves into an agent
This makes sense, but if the agent is smart enough to know how it *could* wirehead, perhaps wireheading would eventually result from the chiseling of some highly abstract grooves.
To give an example, suppose you go to Domino’s pizza on Saturday at 6pm and eat some Hawaiian pizza. You enjoy the pizza. This reinforces the behaviour of “Go to Domino’s pizza on Saturday at 6pm and eat some Hawaiian pizza”.
Surely this will also reinforce other more generic behaviours, that include this behaviour as a special case, such as:
“Go to a pizza place in the evening and eat pizza.”
“Go to a restaurant and eat yummy food.”
Well then, why not “do a thing that I know will make me feel good”: that includes the original behaviour as a special case. It also includes wireheading.
(this is a different explanation of a similar point made in this comment from hillz: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=oZ6aX3bzNF5bwvL4S but it seemed different enough to be worth a separate comment)
Stephen McAleese 4 Apr 2024 22:30 UTC
1 point
−2
OP says that this post is focused on RL policy gradient algorithms (e.g. PPO) where the RL signal is used by gradient descent to update the policy.
But what about Q-learning, which is another popular RL algorithm? My understanding of Q-learning is that the policy network takes an observation as input, calculates the value (expected return) of each possible action in the state, $Q(s, a_{i})$, and then chooses the action with the highest value.
Does this mean that reward is not the optimization target for policy gradient algorithms but is for Q-learning algorithms?
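To make the question concrete, here is a minimal tabular sketch of the Q-learning loop described above. Everything specific in it is an assumption for illustration and not from the post: the three-state chain environment, the `step` function, and the hyperparameters. The point to notice is that the action is chosen by an argmax over the learned table `q`, while reward enters only through the update target.

```python
import random

# Hypothetical toy chain: states 0, 1, 2; state 2 is terminal.
N_STATES, N_ACTIONS = 3, 2
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # illustrative hyperparameters


def step(state, action):
    """Hypothetical environment: action 1 moves right, action 0 stays put.
    Entering the terminal state pays reward 1, everything else pays 0."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else state
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q, state):
    """Pick the action with the highest learned value estimate."""
    return max(range(N_ACTIONS), key=lambda a: q[state][a])


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on episode length
            # Epsilon-greedy: mostly follow the value estimates, sometimes explore.
            if rng.random() < EPSILON:
                action = rng.randrange(N_ACTIONS)
            else:
                action = greedy_action(q, state)
            next_state, reward, done = step(state, action)
            # Reward appears only here, inside the bootstrapped target.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
```

In this sketch the agent never consults the reward function when acting; it consults `q_table`, which is an estimate shaped by the rewards it happened to encounter under its exploration policy.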
hillz 7 Aug 2023 23:41 UTC
1 point
−2
Reward has the mechanistic effect of chiseling cognition into the agent’s network.

Absolutely. Though in the next sentence:
Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I’d mention two things here:

1) The more complex and advanced a model is, the more likely it is [I think] to learn a mesa-optimization goal that is extremely similar to the actual reward a model was trained on (because it’s basically the most generalizable mesa-goal to be learned, w.r.t. training data).

2) Reinforcement learning models in particular design this in by asking models to learn value functions whose sole purpose is to estimate the expected reward over multiple time steps associated with an action or state. So it’s arguably more natural in an RL scenario, particularly one where scores are visible (e.g. in the corner of the screen for a video game), to learn this as a “mesa-optimization” goal early on.
vonnik 6 Nov 2022 5:21 UTC
1 point
−11
AF
The argument above isn’t clear to me, because I’m not sure how you’re defining your terms.
I should note that, contrary to the statement “reward is _not_, in general, that-which-is-optimized by RL agents”, by definition “reward _must be_ what is optimized for by RL agents.” If they do not do that, they are not RL agents. At least, that is true based on the way the term “reward” is commonly used in the field of RL. That is what RL agents are programmed by humans to do. They do that by changing their behavior over many trials, and testing the results of that behavioral change on the reward signals they receive.
The only case where that is not true is one where you define “reward” in some other way than Sutton, whom you quote. I would be curious to hear how you define reward. If you redefine it, that should be done explicitly, and contrasted to the pre-existing definition so that people can accurately interpret what you’ve written.
I don’t pretend to be the authority on RL, but I have a decent understanding of the basic RL loop by which an agent sends actions into an environment, and receives rewards and state updates from that environment. Here’s my understanding of commonly used deep RL algorithms, which I’ll refer to as standard RL:
First, it’s useful to make a few key distinctions, namely between:
- the reward signal (the quantum of reward that is actually allotted to the agent step by step);
- the reward function (also known as the objective function, which is the formula by which we decide how much reward to allot in response to an RL agent’s actions and accomplishments, and when);
- the environment in which an RL agent operates (the complex system that is altered by the agent’s actions, which includes the states through which the agent moves, the rules of state transitions, and the actions available in any state);
- the human programmer’s goals (ie the thing I the programmer want an agent to achieve that may be imperfectly expressed in the reward function I write for it), since the divergence between human wishes and the express incentives given to agents seems missing in this discussion.
An RL agent chooses from a set of actions it can take in any possible state. For any RL agent, taking actions over a series of states leads to some reward or succession of rewards, even if they amount to 0 (ie, the agent failed). The rewards can be disbursed according to the end state that the agent reaches during its run, or as the agent progresses through the run.
Example: I can define an objective function to reward the agent 10 points when it reaches a goal that I have decided is desirable (eg arrival at gramma’s house), or I can award the agent at each step that it comes progressively closer to gramma’s house. (The latter objective function can be much more effective, because it sends reward signals more frequently to the agent, thus allowing it to learn more quickly. These rewards are known as dense. Holding all rewards until an agent reaches a distant end state is often called a sparse reward function. Sparse rewards make it harder for an agent to learn.) Choosing between reward functions — ie rewriting the objective function — is known as reward shaping. Learning the right way to shape rewards is an iterative process for the people creating RL agents. That is, they write a reward function and check to see whether it leads an RL agent to exhibit the right behavior. (Feel free to get meta about that…)
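A small sketch of the two reward functions in this example, assuming a hypothetical one-dimensional path where gramma's house sits at position 10 (the names `GOAL`, `sparse_reward`, and `dense_reward` are made up for illustration):

```python
GOAL = 10  # hypothetical position of gramma's house on a 1-D path


def sparse_reward(position, next_position):
    """10 points only on arrival at the goal; 0 on every other step."""
    return 10.0 if next_position == GOAL else 0.0


def dense_reward(position, next_position):
    """1 point for every step that strictly reduces the distance to the goal."""
    return 1.0 if abs(GOAL - next_position) < abs(GOAL - position) else 0.0


# Walk straight from position 0 to position 10 and record both reward signals.
path = list(range(0, GOAL + 1))
transitions = list(zip(path, path[1:]))
sparse_signals = [sparse_reward(s, s2) for s, s2 in transitions]
dense_signals = [dense_reward(s, s2) for s, s2 in transitions]
```

On this walk both functions pay the same total, but the sparse one delivers a single nonzero signal at the last step while the dense one delivers a signal at every step, which is the difference in learning signal frequency described above.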
This notion of the “right behavior” leads me to another point. The people creating RL agents have an idea of what they would like the agents to do. They attempt to express their wishes in mathematical terms with a reward function. Sometimes, the reward function does not incentivize agents in the way the programmer wants. (A similar situation is found in software programming more generally: the computer will do precisely what you tell it to, but not necessarily what you want. That is, there is often a difference between what we want to say and what actually comes out of our mouths; between what we hope the computer will do and what we told it to do with code; between what we want the RL agent to achieve, and what our rewards will lead it to do.) In other words, when great precision is required, it is easy to give an RL agent perverse incentives by accident. This notion of perverse incentives, familiar to anyone working in a large institution, will hopefully serve as a useful analogy for the ways human programmers fail to properly reward RL agents via the objective functions they write. Nonetheless, even if a reward function is poorly written, the agent strives to optimize for those rewards.
I’m not sure what you mean by “objective target”, but I’ll assume here that the objective function/reward function is the explicit definition of the objective target. If an RL agent does not achieve the objective target, there are a couple of ways to troubleshoot the agent’s poor performance.
1) Maybe you wrote the objective function wrong; that is, maybe you were rewarding behavior that will not lead the RL agent to succeed in the terms you imagined. Naive example: You want an RL agent to make its way through a difficult maze. Your reward function linearly allocates rewards to the agent the closer it comes to the exit to the maze (eg 1 step closer, 1 more point). The maze includes several dead ends that terminate one inch from the exit. The agent learns to take the turns that lead it to the end of those impasses. Solution: increase rewards exponentially as the agent nears the exit, with an additional dollop of reward when it exits. In this way, you retain a dense reward function that allows agents to learn even during failed runs, and you make sure the agent still recognizes that exiting the maze is more important than merely coming close to the exit.
2) Maybe the environment itself is too difficult or complex for an agent to learn to reach its goal within the constraints of your compute (aka the RL agent’s training time). Naive example: There is one exit to your maze, and 2,000,000 decisions to make, 99% of which end with an impasse. The agent never finds the needle in the haystack. In this case, you might try to vastly increase your compute and simulation runs. Or you might start training your agent on simple mazes, and use that trained model to jumpstart an agent that has to solve a more complex maze (one form of so-called curriculum learning).
3) Maybe you configured the agent wrong. That is, some problems are better cast as multi-agent problems than single agent problems (think: coordinating the action of a team on a field). Some problems require that the action space be defined as tuples (a single agent takes more than one action at once, just as you might press more than one button on a video game console at the same time to execute a complex move.)
> “Importantly, reward does not automatically spawn thoughts _about_ reward, and reinforce those reward-focused thoughts!”
Standard RL agents do one thing: they attempt to maximize reward. They do not think about reward beyond that, and they also do not think about anything other than that.
There is a branch of RL called meta-learning where agents could arguably be said to “think about reward”, or at least to “think about learning faster, and exploring an unknown space to see which tasks and rewards are available.” Anyone curious should start with Sutton and Barto’s approach to RL before they graduate to meta-learning, which is being actively researched at DeepMind, Google Brain, and Stanford. Here are some places to start reading about meta-learning, although I highly recommend working through Sutton and Barto’s book first.
[Open-ended play](https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play)
[Task inference](https://www.deepmind.com/publications/meta-reinforcement-learning-as-task-inference)
[Chelsea Finn’s work](https://ai.stanford.edu/~cbfinn/)
Standard RL agents take their reward function as a given. Including wireheading in an agent’s action space is a fundamentally different discussion that doesn’t apply to the vast majority of RL agents now. Mixing these two types of agents is not helpful to attaining clarity here.
For a standard RL agent, what constitutes reward is predefined by the human programmer. The RL agent will discover the state-action pairs that lead to maximum reward over the course of its learning. Even the human programmer does not know the best pathways through the environment; the programmers use the agent’s runs as a method of discovery, a search function, to surface new paths to a goal they have in mind.
This is also an important distinction vis-a-vis the utility function you mention. As I understand utility, at least in economics, it is often revealed by human behavior, e.g. by people’s choices and the prices they are willing to pay for experiences. That’s not the case with standard RL agents. We know their reward functions, because we wrote them. All they reveal are new methods to achieve the things that we programmed them to value.
There is no moment in a standard RL agent’s computational life when it reaches the end of a maze and asks itself: “what else might I enjoy doing besides solving mazes?” It does not generalize. It does not rewrite its reward function. That’s not included in the action space of these agents.
- TurnTrout 7 Nov 2022 22:45 UTC
  LW: 3 AF: 2
  1
  AF Parent
  by definition “reward _must be_ what is optimized for by RL agents.”
  This is not true, and the essay is meant to explain why. In vanilla policy gradient, reward $R$ on a trajectory $τ$ will provide a set of gradients which push up logits on the actions $a_{t}$ which produced the trajectory. The gradient on the parameters $θ$ which parameterize the policy $π_{θ}$ is in the direction of increasing return $J$:
  $\nabla_{θ} J(π_{θ}) = \mathbb{E}_{τ \sim π_{θ}} \left[ \sum_{t=0}^{T} \nabla_{θ} \log π_{θ}(a_{t} \mid s_{t}) \, R(τ) \right]$
  You can read more about this here.
  Less formally, the agent does stuff. Some stuff is rewarding. Rewarding actions get upweighted locally. That’s it. There’s no math here that says “and the agent shall optimize for reward explicitly”; the math actually says “the agent’s parameterization is locally optimized by reward on the data distribution of the observations it actually makes.” Reward simply chisels cognition into agents (at least, in PG-style setups).
  In some settings, convergence results guarantee that this process converges to an optimal policy. As explained in the section “When is reward the optimization target of the agent?”, these settings probably don’t bear on smart alignment-relevant agents operating in reality.
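The update described in the reply above can be sketched for the simplest possible case: a softmax policy over two actions in a single state. This is an illustrative assumption-laden toy, not the post's own code; the task, the `REWARDS` table, and the learning rate are hypothetical. For a softmax policy the gradient of $\log π_{θ}(a)$ with respect to the logits is the one-hot vector for $a$ minus the action probabilities, so the REINFORCE step just scales that vector by the return.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, ret, lr=0.1):
    """One REINFORCE step: logits += lr * return * grad log pi(action).
    For softmax, grad log pi(action) w.r.t. logit i is 1[i == action] - pi(i)."""
    probs = softmax(logits)
    return [
        logit + lr * ret * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical one-state task: action 0 yields reward 1, action 1 yields reward 0.
REWARDS = [1.0, 0.0]

rng = random.Random(0)
logits = [0.0, 0.0]
for _ in range(500):
    probs = softmax(logits)
    action = 0 if rng.random() < probs[0] else 1  # sample from the current policy
    logits = reinforce_update(logits, action, REWARDS[action])

final_probs = softmax(logits)
```

Note what the reward does here: it is a scalar multiplier on a parameter update for the action that was actually sampled. Nothing in the policy takes reward as an input or represents it, and an action that is never sampled is never upweighted regardless of how much reward it would have produced.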
letring 2 Aug 2022 17:33 UTC
1 point
0
Sorry if I have misunderstood the point of your post, but I’m surprised that Bellman’s optimality equation was nowhere mentioned. From Sutton’s book on the topic I understood that once the policy iteration of vanilla RL has converged to the point that the BOE holds, the agent is maximizing “value”, which I would define in words as something like “expectation of discounted and cumulated reward”. Now, before one turns off a student new to the topic by giving a precise definition of those terms right away, I can see why he might have contracted that a bit unfortunately to “a numerical reward signal”.
I don’t feel competent to comment how the picture is complicated in deep RL by the fact that the value function might be learned only approximately. But it doesn’t seem too farfetched to me that the agent will still end up maximizing a “value”, where maybe the notion of expectation needs to be modified a bit.
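For reference, the Bellman optimality equation mentioned above, written in the notation of Sutton and Barto's book. The symbols are not defined elsewhere in this thread and are introduced here as assumptions: $v_{*}$ is the optimal state-value function, $p$ the environment dynamics, and $γ$ the discount factor.

$v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + γ \, v_{*}(s') \right]$

A policy that acts greedily with respect to $v_{*}$ maximizes expected discounted return; whether a learned, approximate value function ends up satisfying this equation is exactly the convergence question raised in the comment.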
Ericf 1 Aug 2022 23:25 UTC
1 point
−14
I feel like there is some vocabulary confusion in the genesis of this post. “Reward” is hard-coded into the agents. The dinosaurs of Jurassic Park (spoiler alert) were genetically engineered to lack iodine. So the trainers could use iodine as a reward to incentivize other behaviors, because by definition the dinos valued iodine as a terminal value. In humans, serotonin and dopamine binding to appropriate brain receptors are DNA-coded terminal values that inherently train us to pursue certain behaviors (e.g. food, sex). An AI is, by definition, going to take whatever actions maximize its Reward system. That’s what having a Reward system means.
- bideup 3 Aug 2022 11:04 UTC
  3 points
  0
  Parent
  I think the terminological confusion is with you: what you’re talking about is more like what is called in some RL algorithms a value function.
  Does a chess-playing RL agent make whichever move maximises reward? Not unless it has converged to the optimal policy, which in practice it hasn’t. The reward signal of +1 for a win, 0 for a draw and −1 for a loss is, in a sense, hard-coded into the agent, but not in the sense that it’s the metric the agent uses to select actions. Instead the chess-playing agent uses its value function, which is an estimate of the reward the agent will get in the future, but is not the same thing.
  The iodinosaurs example perhaps obscures the point since the iodinos seem inner aligned: they probably do terminally value (the feeling of) getting iodine and they are unlikely to instead optimise a proxy. In this case the value function which is used to select actions is very similar to the reward function, but in general it needn’t be, for example in the case where the agent has previously been rewarded for getting raspberries and now has the choice between a raspberry and a blueberry. Even if it knows the blueberry will get it higher reward, it might not care: it values raspberries, and it selects its actions based on what it values.
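The raspberry/blueberry case can be made concrete with a tiny sketch. The numbers are invented for illustration, and `learned_value` stands in for whatever value estimates training happened to produce; the only claim illustrated is that action selection reads the agent's own values, not the reward function.

```python
# Hypothetical numbers: the environment's reward function now pays more for blueberries...
reward_function = {"raspberry": 1.0, "blueberry": 2.0}
# ...but the agent's learned values were shaped by a history of raspberry rewards.
learned_value = {"raspberry": 0.9, "blueberry": 0.1}


def choose(options, value):
    """Select the option that the given value table rates highest."""
    return max(options, key=lambda option: value[option])


options = ["raspberry", "blueberry"]
agent_choice = choose(options, learned_value)               # what the agent does
reward_maximizing_choice = choose(options, reward_function)  # what a reward maximizer would do
```

The two choices come apart whenever the learned values and the reward function disagree, which is the gap between value function and reward signal that the reply points at.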