I basically agree with the thrust of this post, namely that we need a distinction between our values and goodness. Without it, we could not even ask whether we want what is good. Or to put it differently: there is a conceptual distinction between what is desired and what is desirable, whatever it is that determines the latter.
Furthermore, I agree that it is common to see what is desirable as some kind of function of what we in fact value. In economics, for instance, welfare is routinely identified with preference satisfaction. However, even those who see such a close relation between what we want and what would be good to want tend not to identify the two outright, typically arguing instead that the latter is given by something like a coherent extrapolated volition (e.g. Bernard Williams or Richard Brandt).
With that said, I also think that the OP bakes in a few too many commitments about what it means for something to be valued, or to be good for that matter. On the value side, I agree with Steven Byrnes that valuing is best identified with something like desire. However, it is worth noting that desires, understood as motivational pulls, don’t necessarily come with a phenomenology of yumminess. Some of the things I care about most, I feel the least when thinking about them, such as having a room to sleep in. And other things I might feel positively bad about when thinking of them, such as realizing that I have to give up something I was really looking forward to in order to meet a commitment I deeply value. (There are variations of these cases that might blunt them, e.g. if I faced the prospect of losing my room, but I think the point stands that the relation between experienced yumminess and desire is contingent and empirical.)
On the goodness side, I am a bit worried that the OP conflates a few different things. To start, we want to distinguish between our representations of things as good (e.g. norms) and the good things themselves. For example, I don’t think we want to identify goodness with our norms about which art is good, as opposed to the good art itself. Furthermore, the OP seems to identify goodness with moral goodness specifically. We probably want to keep those separate and treat the morally good as a subset of the good. For example, healthy food and good conversation are plausibly good things that we should desire, but they are not obviously morally good.
Notably, I’m taking no stance here on what makes something good or bad. It might be that things are good or bad because our norms say so, but that is a stronger commitment than merely saying that goodness is separate from what we want, and not one I think we want to bake into the distinction itself. (So, contrary to Nina Panickssery, I don’t think the realist/antirealist distinction is central here.)
Finally, I think that this is probably a discussion where the conversation would benefit from some context in analytic philosophy, where many people have discussed this question at length, I believe quite fruitfully. Some classic papers I like on this include Railton (1986), ‘Moral Realism’ (on a naturalistic account of goodness and its relation to desire), and Quinn (1994), ‘Putting Rationality in Its Place’ (an argument against identifying goodness with what we want, or rational behavior with the mere satisfaction of desires for that matter).
paul_dfr
I think that there is a further distinction that might be drawn between “constitutive” and “evidential” interpretations of the reward signal.
On the constitutive interpretation, the reward signal just is the thing being optimized for; call that thing intrinsic value. (This is what I take to be the textbook interpretation in RL.) On the evidential interpretation, the reward signal is evidence about intrinsic value, which the agent uses to update its expected value representations. On this view, the reward signal functions more like a perceptual signal, with intrinsic value as its content.
Assuming the evidential interpretation, standard RL models like TD learning fall out as a special case where the reward signal is perfectly accurate and known to be so. But we could formulate a generalized model in which intrinsic value is a latent variable that the value function estimates, with the reward signal modelled as a noisy observation of it. A sketch of what I have in mind is below.
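To make that concrete, here is a minimal sketch (my own illustration, not anything from the OP): a tabular setting where each state carries a Gaussian belief over its latent intrinsic value, and each reward is treated as a noisy observation of that value via a Kalman-style update. The `OBS_NOISE` parameter and the single-state toy run are hypothetical, and I’m omitting bootstrapping/temporal structure for brevity. As `OBS_NOISE` goes to zero the update gain goes to one, recovering the standard treat-the-reward-as-ground-truth case.

```python
import numpy as np

n_states = 5
mean = np.zeros(n_states)       # posterior mean of latent intrinsic value per state
var = np.full(n_states, 10.0)   # posterior variance (initial uncertainty)
OBS_NOISE = 2.0                 # variance of the reward signal *as an observation*

def observe_reward(state: int, reward: float) -> None:
    """Kalman-style update: treat the reward as noisy evidence about
    the latent intrinsic value of `state`, not as ground truth."""
    gain = var[state] / (var[state] + OBS_NOISE)  # in [0, 1); -> 1 as OBS_NOISE -> 0
    mean[state] += gain * (reward - mean[state])
    var[state] *= 1.0 - gain                      # uncertainty shrinks with evidence

# Toy run: state 0 has true intrinsic value 1.0, seen through a noisy channel.
rng = np.random.default_rng(0)
for _ in range(200):
    observe_reward(0, 1.0 + rng.normal(0.0, np.sqrt(OBS_NOISE)))

print(mean[0], var[0])  # mean near 1.0; variance small but nonzero
```

The point of the extra machinery is the residual variance: unlike standard TD learning, the agent retains an explicit representation of how much it still doesn’t trust its value estimates.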
Why would any of this matter? I’ve been thinking that an agent which is uncertain whether the reward signal accurately tracks the thing it is trying to optimize for would rationally hesitate to pursue any action with excessive conviction. This creates a rational pressure against irreversible actions that give up option value, given the risk that, on further evidence, what seemed valuable turns out not to be. A toy version of this pressure is sketched below.
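Here is one crude way that pressure could fall out, again just my illustration: score actions by a lower confidence bound on the agent’s value beliefs, so that options whose value is still uncertain get penalized. The `caution` parameter and the two toy actions are made up, and this doesn’t model irreversibility explicitly, only the penalty on uncertain commitments.

```python
import numpy as np

def lcb_choice(means, variances, caution=1.0):
    """Pick the action maximizing a lower confidence bound: mean - caution * std.
    Higher `caution` penalizes actions whose value is still uncertain."""
    scores = np.asarray(means) - caution * np.sqrt(np.asarray(variances))
    return int(np.argmax(scores))

# Two toy actions: an irreversible gamble with high apparent value but high
# uncertainty, vs. a modest, well-understood reversible option.
means = [5.0, 3.0]
variances = [4.0, 0.1]
print(lcb_choice(means, variances, caution=0.5))  # -> 0: the gamble wins when bold
print(lcb_choice(means, variances, caution=2.0))  # -> 1: the safe option wins when cautious
```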
I could see myself being convinced that this distinction doesn’t matter much, but currently I think it is an important and neglected one.