Thanks for responding again. This would probably benefit from a longer and more precise writeup if there is anything of value to be said, but I think I can clarify some of the confusion you raised here.
You are correct that by “success metric” I meant success from the agent’s (in this case an AI’s) own perspective, not that of a principal aiming to align it. Really, all I had in mind was a framework-neutral expression for the value that is being maximized in any expected value representation of an agent. So this is meant to denote the number that “counts as success” for the agent itself.
On the “interpretation” point, this was probably a poor choice of words on my part. I meant to say that reward functions typically play two roles in RL agents: (a) they determine what counts as valuable for the agent (i.e., what I intended to express with “success metric” above), and (b) they signal whether a state is valuable in that way. My point was that, in principle, these roles can come apart. For example, you can have a function that provides a noisy signal about a quantity the agent is maximizing in expectation, without that signal itself being the thing the agent aims to maximize. In this case, the signal is merely evidence (see below).
My language above was meant to say that a reward function that performs both of these roles is constitutive: it both determines and evinces what has value. One that is evidential merely evinces what has value, and that value can be some other quantity that is not the reward signal itself. So:
> Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
Yes, what I’m saying makes sense only if this is true, and I think it is. Standard TD-learning, for example, has a unified (“constitutive”) reward function, in that the observed reward r is both what the value function approximates (in long-term discounted expectation) and a signal about the value of states. But yes, we can also construct an algorithm where the agent optimizes for some underlying “intrinsic value” (i.e., what I called the success metric above) θ, and observes r as a noisy signal about θ (e.g. r = θ + ε, where ε ~ N(0, σ²)). In this case, the reward signal plays a “merely” evidential role.
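To make the contrast concrete, here is a minimal sketch (my own toy construction, not a named algorithm): one learner tracks the reward stream itself with a running average, the other treats each reward as a noisy Gaussian observation of a latent θ and updates a conjugate Gaussian posterior over it.

```python
import random

random.seed(0)

theta = 2.0    # latent "intrinsic value" (chosen for illustration)
sigma2 = 1.0   # known variance of the observation noise

# Constitutive: the estimate tracks the reward signal itself
# (a running exponential average, as TD does in a one-state setting).
v, alpha = 0.0, 0.1

# Evidential: a Gaussian belief over the latent theta, updated by
# conjugate Bayesian inference from each noisy reward observation.
mu, tau2 = 0.0, 10.0  # prior mean and variance over theta

for _ in range(1000):
    r = theta + random.gauss(0.0, sigma2 ** 0.5)  # r = theta + eps
    v += alpha * (r - v)                  # value defined BY the reward stream
    prec = 1.0 / tau2 + 1.0 / sigma2      # posterior precision
    mu = (mu / tau2 + r / sigma2) / prec  # precision-weighted posterior mean
    tau2 = 1.0 / prec                     # posterior variance shrinks

print(round(v, 1), round(mu, 1))  # both land near theta = 2.0
```

The two numbers coincide here, but they stand for different things: v just is (an average of) the reward stream, while mu is an estimate of something the reward merely evinces.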
I don’t know a well-known algorithm that fits this bill, but I would think Bayesian RL algorithms like BOSS could be adapted to it. And Cooperative Inverse Reinforcement Learning definitely uses similar ideas, though there the latent value is usually explicitly assumed to be a human reward function. My point here is rather that sheer uncertainty can be exploited to change incentives. For example, it’s not clear that you have an incentive to wirehead a reward signal if you’re not optimizing for its output, since doing so would just be deliberately introducing noise.
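To gesture at that incentive point numerically (again a toy construction of mine, not an established result): for an agent whose belief about θ is Gaussian, tampering with the channel in a way it correctly models as extra noise leaves the posterior mean unbiased and strictly inflates the posterior variance, so the tampering buys it nothing.

```python
def posterior_var(n, sigma2, tau2_prior=10.0):
    """Posterior variance over the latent theta after n observations
    of theta corrupted by Gaussian noise with variance sigma2."""
    return 1.0 / (1.0 / tau2_prior + n / sigma2)

honest = posterior_var(100, sigma2=1.0)     # unmodified reward channel
tampered = posterior_var(100, sigma2=10.0)  # agent injected extra noise

# Wireheading-as-noise only blurs the evidence about theta:
assert tampered > honest
```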