…with states/actions sampled according to a uniform distribution. I give a justification for why this is a very natural way to define distance in a separate comment.
A uniform distribution actually seems like a very weird choice here.
Defining utility functions over full world states seems fine (even if not practical at larger scale), and defining alignment as dot products over full trajectory/state-space utility functions also seems fine, but only if using true expected utility (i.e., the actual Bayesian posterior distribution over states). That of course can get arbitrarily complex.
But it also seems necessary: for one to say that two utility functions are truly ‘close’, that must cash out to closeness of (perhaps normalized) expected utilities under the true distribution of future trajectories.
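To make that concrete, here is a minimal sketch (toy NumPy utilities; all values are made up for illustration) of how the same pair of utility functions can look ‘close’ or ‘opposed’ depending on which state distribution supplies the weights for the dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 1000

# Two toy utility functions over the full state space.
u = rng.normal(size=n_states)
v = u.copy()
v[:10] = -u[:10]  # agree everywhere except a handful of states, where preferences reverse

def alignment(u, v, p):
    """Normalized dot product (weighted cosine similarity) under state distribution p."""
    return np.sum(p * u * v) / np.sqrt(np.sum(p * u * u) * np.sum(p * v * v))

uniform = np.full(n_states, 1.0 / n_states)  # uniform over all states
visited = np.zeros(n_states)                 # distribution of a policy that happens
visited[:10] = 1.0 / 10                      # to concentrate on the disagreement states

print(alignment(u, v, uniform))  # ~0.98: nearly aligned under uniform weighting
print(alignment(u, v, visited))  # -1.0: maximally opposed on the realized distribution
```

Under the uniform weighting the two functions look almost identical; under the distribution a policy actually induces, they are maximally opposed, which is why closeness seems to have to be judged against the true distribution of future trajectories.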
Do you see a relation between the early stopping criterion and the regularization/generalization of the proxy reward?
The reason we’re using a uniform distribution is that it follows naturally from the math, but maybe an intuitive explanation is the following: the reason this seems weird is that most realistic distributions only sample from a small number of states/actions, whereas the uniform distribution more or less encodes that the reward functions are similar across most states/actions. So it’s encoding something about generalization.
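For concreteness, here is a rough sketch of the kind of distance this suggests (an illustration under my own assumptions, not the paper’s exact definition): state-action pairs are sampled uniformly, so disagreement anywhere in the space shows up in the distance, even on states a realistic policy would rarely visit.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 1000, 10

def true_reward(s, a):
    # Hypothetical reward for illustration only.
    return np.sin(0.01 * s) + 0.1 * a

def proxy_reward(s, a):
    # Matches the true reward except on a small region of state space.
    return true_reward(s, a) + np.where(s < 10, 5.0, 0.0)

def uniform_distance(r1, r2, n_samples=100_000):
    """Monte Carlo estimate of the L2 distance between two reward functions,
    with state-action pairs sampled uniformly."""
    s = rng.integers(n_states, size=n_samples)
    a = rng.integers(n_actions, size=n_samples)
    diff = r1(s, a) - r2(s, a)
    return np.sqrt(np.mean(diff ** 2))

print(uniform_distance(true_reward, proxy_reward))  # ~0.5: uniform sampling still
# "sees" the disagreement, even though only ~1% of states are affected
```

A distance weighted by some particular policy’s visitation distribution would ignore disagreement off that policy’s path; weighting uniformly demands agreement across (almost) the whole space, which is the sense in which it encodes generalization.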
Thanks, that clarifies somewhat, but I guess I’ll need to read the paper; I’m still a bit confused about the justification for a uniform distribution.