Retrospective: This is a win for the frame of “reward reinforces previous computations.” Ever since 2022, I’ve thought of “reward” as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From “Reward is not the optimization target”:
What reward actually does is reinforce computations which lead to it…
I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button). The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater...
In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.
By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.
Ariana showed that in this coding environment, it’s not just about what the AI ends up choosing but also why the AI made that choice to begin with. Even though we “perfectly” reinforced the AI for doing what we wanted (i.e. avoiding special cases), we also often reinforced the system for the wrong reasons (i.e. considering special-casing the algorithm, even when not asked to do so). The AI’s propensity to consider doing the wrong thing was reinforced, and so the AI generalized to hack more in-distribution.
Assuming these results generalize, the trained policy is not just determined by the outputs which get rewarded. The trained policy also depends on which intermediate computations get rewarded.
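A toy policy-gradient sketch (a hypothetical illustration, not the actual setup from these experiments) makes the mechanism explicit: in a REINFORCE-style update, the scalar reward multiplies the log-probability gradient of every sampled token, chain-of-thought included, so a trajectory that considered special-casing but then produced a clean answer gets its “consider hacking” tokens reinforced right along with the answer.

```python
# Toy REINFORCE-style credit assignment over a sampled trajectory (hypothetical token
# names, not the paper's training code): the reward multiplies the log-prob gradient of
# every generated token, so intermediate "thoughts" get reinforced along with the answer.
import torch

vocab = {"<consider_hack>": 0, "<plan_clean>": 1, "<answer_clean>": 2}
logits = torch.zeros(3, requires_grad=True)   # toy "policy": one categorical over 3 tokens

# Sampled trajectory: the model considered hacking, then produced a clean answer.
trajectory = [vocab["<consider_hack>"], vocab["<answer_clean>"]]
reward = 1.0                                  # grader only checks the final answer, which is clean

log_probs = torch.log_softmax(logits, dim=-1)
loss = -reward * sum(log_probs[t] for t in trajectory)   # reward scales every sampled token's log-prob
loss.backward()

# Gradient is negative at indices 0 and 2: a gradient step raises the probability of BOTH
# sampled tokens, so "consider hacking" is reinforced alongside the clean answer.
print(logits.grad)   # tensor([-0.3333,  0.6667, -0.3333])
```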
As best I can tell, before “Reward is not the optimization target”, people mostly thought of RL as a sieve, or even a carrot and stick—try to “give reward” so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.
[1] To be clear, my prediction was not as precise as “I bet you can reinforce sus CoTs and get sus generalization.” The brainstorming process went like:
What are some of the most important open problems in alignment? → Reward hacking
What are common assumptions about reward hacking? Oh, yeah, that hacking comes from reward function imperfections.
Hmm I wonder whether models can be trained to reward hack even given “perfect” feedback
We should really think more about this
Time passes, continue encouraging research into the importance of CoT and prompts in RL (thinking about RL using the chisel-frame, as I ~always do)
Victor and Ariana get this result.
[2] Perhaps Steve Byrnes is an exception.
[3] Quintin and I came up with “Reward is not the optimization target” together.
I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies receiving high reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
(I do agree that shard-theory made “think rigorously about reward-shaping” more salient and exciting)
I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies maximizing reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
“Maximizing” reward? I don’t expect a supposedly deceptively misaligned RL policy to necessarily grab control of the reward button and press it over and over and over again to maximize reward.[1] It can have high reward, but I don’t expect it to maximize the reward. Words have meanings, let’s actually try to be precise when talking about this.[2]
And this is one of the points of Reward is not the optimization target: stop conflating “reward” as the chisel that changes a policy’s cognition and “reward” as what the model actually tries to accomplish. Through the process of training, you are selecting for a model that scores well according to some metric. But that doesn’t make the policy intentionally attempt to make that metric go higher and higher and higher (in the case of model-free RL).
The same way human genes were selected for something-approximating-IGF, but humans don’t intentionally try to have as many children as possible (instead, we execute adaptations that at least in some environments result in us having a large, but not maximal, number of children).
[1] Which is what the word “maximize” means in this context: to make maximal, or at least to try to make maximal.
[2] I have observed before how loosey-goosey reasoning and argumentation can lead to faulty conclusions about important topics here.
yup, my bad, editing to “receiving high reward”
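To make the chisel/target distinction above concrete, here is a minimal sketch (assuming a REINFORCE-style setup for model-free RL; the names are illustrative, not anyone’s actual code): the reward never enters the policy’s forward pass at all, it only appears afterwards as a scalar that scales the parameter update.

```python
# Minimal sketch of "reward as chisel, not target" (assumed REINFORCE-style setup,
# not any particular lab's code). The policy network never sees the reward; the
# reward only scales the weight update after the action has been taken.
import torch

policy = torch.nn.Linear(4, 2)                 # toy policy: state -> action logits
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(4)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                         # forward pass: reward appears nowhere here

reward = 1.0                                   # computed outside the policy, after acting
loss = -reward * dist.log_prob(action)         # reward enters only as a multiplier on the update
opt.zero_grad()
loss.backward()
opt.step()                                     # the "chisel": reinforce whatever computation produced `action`
```

Whether the trained policy ends up representing or pursuing reward is then a further empirical question about which internal computations the updates reinforced, not something built into the update rule.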
This seems different to “maximising rewards for the wrong reasons”. That view generally sees the reward maximised because it is instrumental for or aliased with the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising rewards but is learned as a reflex anyway.
You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization.
But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with a NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
…
I see shard-theory as making two interventions in the discourse:
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)
I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. “how path-dependent is RL, actually” (see my other comment)
To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)
I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.
By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.
Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that’s the bit where the “reward reinforces previous computations” frame was most helpful.
(I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)
The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.
AFAICT, “Reward is not the optimization target” represents a bundle of ideas which, as a whole, differ from the LW baseline a bunch, but individually, not so much. This, IMO, leads to some unfortunate miscommunications.
E.g. see the sibling comment from @Oliver Daniels, who claims that goal misgeneralization was already hitting the idea that policies may maximize reward for the wrong reasons. Whilst that is true, your position does away with the maximization framing that LW folk tend to associate with RL training, even when they view RL as operant conditioning, i.e. they view RL as selecting for an EU maximizer, as you point out with the “sieve” analogy. But “RL is operant conditioning” and “RL in the limit winds up selecting an EU maximizer” are two distinct claims.
And IIUC, there are other differences which come up depending on who is invoking the “Reward is not the optimization target” pointer, as the “AI Optimists” have wildly differing views beyond the shibboleth of “alignment isn’t that hard”. (LW, of course, has its own uniting shibboleths hiding a great difference in underlying world-views.)
Anyway, what I’m getting at is that communication is hard, and I think there’s productive conversation to be had in these parts regarding “Reward is not the optimization target”. Thank you for trying. : )
RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims:
1. what a model reasons about in its output tokens on the way to getting its answer affects how it will generalise
2. why a model produces its output tokens affects how it will generalise
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I’m actually not sure they are (I changed my mind while writing this comment).
Re: (1), it’s clear that when you get an answer right using a particular line of reasoning, the gradients point towards using that line of reasoning more often. But re: (2), the gradients point towards whatever is the easiest way to get you to produce those output tokens more often, which could in principle be via a qualitatively different computation to the one that actually caused you to output them this time.
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).
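A toy example of how (2) can come apart from “reinforce the computation that actually caused the output” (a hypothetical two-“circuit” mixture policy, not anything from the post or the experiments): the rewarded token below is produced almost entirely by circuit A, but because A is already saturated, the gradient of log p mostly lands on circuit B, so the update chiefly strengthens a computation other than the one that drove the output this time.

```python
# Toy two-"circuit" mixture policy (hypothetical, not from the post): the output token is
# driven by circuit A, but the REINFORCE gradient mostly strengthens circuit B, because B
# is the locally cheapest way to make that token more likely.
import torch

theta_A = torch.tensor([4.0, 0.0], requires_grad=True)   # circuit A: already strongly favors token 0
theta_B = torch.tensor([0.0, 0.0], requires_grad=True)   # circuit B: indifferent between tokens

# The policy mixes the two circuits equally.
p = 0.5 * torch.softmax(theta_A, dim=0) + 0.5 * torch.softmax(theta_B, dim=0)

token, reward = 0, 1.0                        # token 0 is sampled (mostly "because of" A) and rewarded
loss = -reward * torch.log(p[token])          # REINFORCE-style objective for this sample
loss.backward()

print(theta_A.grad)   # ~[-0.012, 0.012]: A is saturated, so pushing "through A" barely helps
print(theta_B.grad)   # ~[-0.169, 0.169]: about 14x larger, most of the update lands on circuit B
```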
If we assume that current LLMs/Transformers don’t get to ASI, how much does this help with aligning a new architecture? (My best guess is one copied from biology/the neocortex.) Do all the lessons transfer?