I think that there is a further distinction that might be drawn between “constitutive” and “evidential” interpretations of the reward signal.
On the constitutive interpretation, the reward signal is the thing being optimized for; call it intrinsic value. (This is what I take to be the textbook interpretation in RL.) On the evidential interpretation, the reward signal is evidence of intrinsic value that the agent uses to update its expected-value representations. On this view, the reward signal is more like a perceptual signal, but with intrinsic value as its content.
Assuming the evidential interpretation, standard RL models like TD learning are a special case where the reward signal is perfectly accurate, and known to be so. However, we could construct a generalized model in which intrinsic value is a latent variable that the value function estimates, and the reward signal is modelled as a noisy observation of it.
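To make the "observation of a latent variable" idea concrete, here is a minimal sketch (entirely my own toy construction, not a standard algorithm): a conjugate Gaussian update that treats each reward as a noisy observation of a latent intrinsic value θ, in place of the usual running-average value update.

```python
# Toy sketch (my own illustration): treat the intrinsic value theta of a
# single state as a latent variable, and each reward as a noisy observation
# r = theta + eps, with eps ~ N(0, obs_var). A conjugate Gaussian update
# then plays the role that "V <- V + alpha * (r - V)" plays in standard RL.

def update_value_belief(mu, var, r, obs_var):
    """Posterior belief (mean, variance) over latent theta after observing r."""
    k = var / (var + obs_var)      # how much to trust this observation
    mu_new = mu + k * (r - mu)     # shift the estimate toward the reward
    var_new = (1 - k) * var        # uncertainty shrinks with each observation
    return mu_new, var_new

# Rewards that hover around 1.0 pull the belief toward theta ~ 1.0.
mu, var = 0.0, 1.0
for r in [1.2, 0.8, 1.1, 0.9]:
    mu, var = update_value_belief(mu, var, r, obs_var=0.5)
```

Note that as obs_var goes to 0 the update reduces to "believe the reward exactly", which recovers the constitutive special case described above: standard RL falls out as the perfectly-accurate-signal limit.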
Why would any of this matter? I’ve been thinking that an agent that is uncertain about whether its reward signal accurately tracks the thing it’s trying to optimize for would rationally hesitate to pursue any action with excessive conviction. This would create a rational pressure against irreversible actions that give up option value, given the risk that, with more evidence, the agent might find that what it thought was valuable turns out not to be.
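To gesture at why, here is a toy numerical sketch (my own construction; the zero-payoff reversible fallback and the Gaussian noise model are assumptions): committing to an irreversible action with latent payoff θ is worth max(mu, 0) under the current belief N(mu, var), while waiting for one more noisy reward observation before deciding is worth at least as much, and the gap grows with the agent's uncertainty about θ.

```python
import random

# Toy sketch (my own illustration): the agent believes the latent value
# theta of an irreversible action is N(mu, var), and rewards are noisy
# observations r = theta + eps. Committing now is worth max(mu, 0), where
# 0 is the reversible "do nothing" fallback. Waiting for one observation
# and then deciding is worth E[max(posterior_mean, 0)] -- never less, and
# the gap (the option value of hesitating) grows with uncertainty.

def commit_now_value(mu):
    return max(mu, 0.0)

def wait_and_see_value(mu, var, obs_var, n=200_000, seed=0):
    rng = random.Random(seed)
    k = var / (var + obs_var)
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(mu, var ** 0.5)
        r = theta + rng.gauss(0.0, obs_var ** 0.5)
        post_mu = mu + k * (r - mu)    # posterior mean after seeing r
        total += max(post_mu, 0.0)     # decide with the extra evidence in hand
    return total / n

gap_certain = wait_and_see_value(0.0, 0.01, 0.5) - commit_now_value(0.0)
gap_uncertain = wait_and_see_value(0.0, 4.0, 0.5) - commit_now_value(0.0)
# The more uncertain the agent is about theta, the more an extra
# observation is worth, i.e. the stronger the pressure to keep options open.
```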
I could see myself being convinced otherwise, but currently I think this is an important and neglected distinction.
I feel like my starting-point definition of “reward function” is neither “constitutive” nor “evidential” but rather “whatever function occupies this particular slot in such-and-such RL algorithm”. And then you run this RL algorithm, and it gradually builds a trained agent / policy / whatever we want to call it. And we can discuss the CS question about how that trained agent relates to the thing in the “reward function” slot.
For example, after infinite time in a finite (and fully-explored) environment, most RL algorithms have the property that they will produce a trained agent that takes actions which maximize the reward function (or the exponentially-discounted sum of future rewards or whatever).
More generally, all bets are off, and RL algorithms might or might not produce trained agents that are aware of the reward function at all, or that care about it, or that relate to it in any other way. These are all CS questions, and generally have answers that vary depending on the particulars of the RL algorithm.
Also, I think that, in the special case of the human brain RL algorithm with its reward function (innate drives like eating-when-hungry), a person’s feelings about their own innate drives are not a good match to either “constitutive” or “evidential”.
Thanks for responding, and apologies for the belated response in turn. Generally, I agree with all of this, but I think it might miss the point that I had in mind. What I intended to say was that a reward function performs two separate roles that can in principle be separated.
The first is to provide what I elsewhere call a success metric (contrasting it with learned target states). The other is to provide evidence of whether that success metric is achieved or not.
By a constitutive interpretation, I meant that the reward signal itself constitutes the success metric, in which case this distinction is beside the point. However, in principle, one could also define the success metric as a latent variable that a reward signal provides observations of, in which case the distinction applies.
I take all of this to be mostly orthogonal to the question of whether trained agents are aware of their reward function, for example.
I’m very confused about what you’re trying to say. In my mind:
“Reward function” is whatever you put in the reward function slot in the RL algorithm that you’re using.
“Success” (in technical alignment) is if the AI is trying to do things that you (the human programmer) wanted it to be trying to do, via methods that you generally wanted it to be using, etc.
A “success metric” is a thing that might or might not exist at all, but if it did exist, it would be code that computes a number such that, when the number is high, you have “success” (previous bullet).
Maybe I’m finding your comments confusing because you’re adopting the AI’s normative frame instead of the human programmer’s? …But you used the word “interpretation”. Who or what is “interpreting” the reward function? The AI? The human? If the latter, why does it matter? (I care a lot about what some piece of AI code will actually do when you run it, but I don’t directly care about how humans “interpret” that code.)
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
I read your linked post but found it unhelpful, sorry.
Thanks for responding again. This would probably benefit from a longer and more precise writeup if there is anything of value to be said, but I think that some of the confusion you raised here is something I can clarify.
You are correct that by “success metric” I meant success from the agent’s (in this case an AI’s) own perspective, not that of a principal aiming to align it. Really, all I had in mind was a framework-neutral expression for the value that is being maximized in any expected-value representation of an agent. So this is meant to denote the number that “counts as success” for the agent itself.
On the “interpretation” point, this was probably a poor choice of words on my part. I meant to say that reward functions typically play two roles in RL agents: (a) they determine what counts as valuable for the agent (i.e. what I intended to express with the success metric above), and (b) they signal whether a state is valuable in that way. My point was that, in principle, these roles can come apart. For example, you can have a function that provides a noisy signal about a quantity that the agent is maximizing in expectation, without that signal itself being the thing the agent aims to maximize. In this case, the signal is merely evidence. (See below.)
My language above was meant to say that a reward function performing both of these roles is constitutive: it both determines and evinces what has value. One that is evidential merely evinces what has value, and that value can be some other quantity than the reward signal itself. So:
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
Yes, so what I’m saying makes sense only if this is true. And I think it is. Standard TD learning, for example, has a unified (“constitutive”) reward function, in that the observed reward r is both what is (in long-term discounted expectation) being approximated by the value function, and a signal about the value of states. But yes, we can also construct an algorithm where the agent is optimizing for some underlying “intrinsic value” (i.e. what I called the success metric above) θ_i, and observes r as a noisy signal about θ_i (e.g. r = θ_i + ε, where ε ∼ N(0, σ²)). In this case, the reward signal plays a “merely” evidential function.
I don’t know of a well-known algorithm that fits this bill, but I would think Bayesian RL algorithms like BOSS could be adapted to it. And Cooperative Inverse Reinforcement Learning definitely uses some similar ideas, though there the latent value is typically assumed to be a human reward function. My point here is rather that sheer uncertainty can be exploited to change incentives. For example, it’s not clear that you have an incentive to wirehead a reward signal if you’re not optimizing for its output, since tampering would just deliberately introduce noise.