Thanks for responding, and apologies for the belated response in turn. Generally, I agree with all of this, but I think it might miss the point that I had in mind. What I intended to say was that a reward function performs two separate roles that can in principle be separated.
The first is to provide what I elsewhere call a success metric (contrasting it with learned target states). The other is to provide evidence of whether that success metric is achieved or not.
By a constitutive interpretation, I meant that the reward signal itself constitutes the success metric, in which case this distinction is beside the point. However, in principle, one could also define the success metric as a latent variable that a reward signal provides observations of, in which case the distinction applies.
I take all of this to be mostly orthogonal to the question of whether trained agents are aware of their reward function, for example.
I’m very confused about what you’re trying to say. In my mind:
“Reward function” is whatever you put in the reward function slot in the RL algorithm that you’re using.
“Success” (in technical alignment) is if the AI is trying to do things that you (the human programmer) wanted it to be trying to do, via methods that you generally wanted it to be using, etc.
A “success metric” is a thing that might or might not exist at all, but if it did exist, it would be code that computes a number such that, when the number is high, you have “success” (previous bullet).
Maybe I’m finding your comments confusing because you’re adopting the AI’s normative frame instead of the human programmer’s? …But you used the word “interpretation”. Who or what is “interpreting” the reward function? The AI? The human? If the latter, why does it matter? (I care a lot about what some piece of AI code will actually do when you run it, but I don’t directly care about how humans “interpret” that code.)
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
Thanks for responding again. This would probably benefit from a longer and more precise writeup if there is anything of value to be said, but I think that some of the confusion you raised here is something I can clarify.
You are correct that by “success metric” I meant success from the agent’s (in this case an AI’s) own perspective, not that of a principal aiming to align it. Really, all I had in mind was a framework-neutral expression for the value that is maximized in any expected-value representation of an agent. So it is meant to denote the number that “counts as success” for the agent itself.
On the “interpretation” point, this was probably a poor choice of words on my part. I meant to say that reward functions typically play two roles in RL agents: (a) they determine what counts as valuable for the agent (i.e. what I intended to express with “success metric” above), and (b) they signal whether a state is valuable in that way. My point was that, in principle, these roles can come apart. For example, you can have a function that provides a noisy signal about a quantity the agent is maximizing in expectation, without that signal itself being the thing the agent aims to maximize. In that case, the signal is merely evidence (see below).
My language above was meant to say that a reward function performing both of these roles is constitutive: it both determines and evinces what has value. An evidential reward function, by contrast, merely evinces what has value, and that value can be some other quantity that is not the reward signal itself. So:
Or are you saying that some RL algorithms have a “constitutive” reward function while other RL algorithms have an “evidential” reward function? If so, can you name one or more RL algorithms from each category?
Yes, what I’m saying makes sense only if this is true, and I think it is. Standard TD-learning, for example, has a unified (“constitutive”) reward function, in that the observed reward r is both what the value function approximates (in long-run discounted expectation) and a signal about the value of states. But we can also construct an algorithm where the agent optimizes for some underlying “intrinsic value” θ_i (i.e. what I called the success metric above) and observes r as a noisy signal about θ_i (e.g. r = θ_i + ε, where ε ~ N(0, σ²)). In this case, the reward signal plays a “merely” evidential role.
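To make the contrast concrete, here is a minimal toy sketch (the function names and framing are mine, not a standard API): a tabular TD(0) update, where the observed reward is itself the quantity being backed up into the value function, next to a conjugate-normal posterior update, where rewards are treated only as noisy observations of a latent θ_i.

```python
# --- Constitutive role: tabular TD(0). The observed reward r is directly
# what the value function approximates (in discounted expectation).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V

# --- Evidential role: the agent cares about a latent theta_i and treats
# each reward r = theta_i + eps (eps ~ N(0, sigma^2)) as a noisy
# observation. With a conjugate normal prior, the posterior mean over
# theta_i is a precision-weighted average of prior and observations.
def posterior_mean(prior_mu, prior_var, observations, noise_var):
    post_precision = 1.0 / prior_var + len(observations) / noise_var
    return (prior_mu / prior_var + sum(observations) / noise_var) / post_precision
```

In the first function the reward both defines and signals value; in the second, the rewards are evidence only, and what the agent maximizes in expectation is its estimate of θ_i.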
I don’t know of a well-known algorithm that fits this bill, but I would think Bayesian RL algorithms like BOSS could be adapted to it. And Cooperative Inverse Reinforcement Learning definitely uses similar ideas, though there the latent value is usually explicitly assumed to be a human reward function. My point here is rather that sheer uncertainty can be exploited to change incentives. For example, it’s not clear that you have an incentive to wirehead a reward signal if you’re not optimizing for its output, since that would just be deliberately introducing noise.
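A quick numerical way to see that last point (a toy sketch under my own framing, not a named algorithm): if the agent values its posterior estimate of the latent θ_i rather than the reward readout itself, then corrupting the reward channel just raises the effective observation noise, which widens the posterior without systematically moving it.

```python
# Posterior variance over the latent theta_i under a conjugate normal
# model with n_obs observations of noise variance noise_var.
def posterior_var(prior_var, n_obs, noise_var):
    return 1.0 / (1.0 / prior_var + n_obs / noise_var)

# Wireheading modeled as injecting extra noise into the reward channel:
# the posterior over theta_i only gets wider, so the agent loses
# information about what it actually values and gains nothing.
clean = posterior_var(prior_var=1.0, n_obs=10, noise_var=1.0)
tampered = posterior_var(prior_var=1.0, n_obs=10, noise_var=5.0)
assert tampered > clean  # tampering strictly degrades the estimate
```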
I read your linked post but found it unhelpful, sorry.