Rényi divergence as a secondary objective
Summary: Given that both imitation and maximization have flaws, it might be reasonable to interpolate between these two extremes. It’s possible to use Rényi divergence (a family of measures of distance between distributions) to define a family of these interpolations.
Rényi divergence is a measure of distance between two probability distributions. It is defined as
with . Special values can be filled in by their limits.
Particularly interesting values are and .
Consider some agent choosing a distribution over actions to simultaneously maximize a score function and minimize Rényi divergence from some base distribution . That is, score according to
where controls how much the secondary objective is emphasized. Define . We have , and is a quantilizer with score function and base distribution (with the amount of quantilization being some function of , , and ). For , will be some interpolation between and .
It’s not necessarily possible to compute . To approximate this quantity, take samples and compute
Of course, this requires and to be specified in a form that allows efficiently estimating probabilities of particular values. For example, and could both be variational autoencoders.
As approaches 1, this limits to
As approaches , this limits to
Like a true quantilizer, a distribution trained to maximize this value (an approximate quantilizer) will avoid assigning much higher probability to any action than does.
These approximations yield training objectives for agents which will interpolate between imitating and maximizing . What do we use these for? Patrick suggested that could be an estimation of the distribution of actions a human would take (trained using something like this training procedure). Then, the distribution maximizing the combined objective will try to maximize score in a somewhat human-like way; it will interpolate between imitation and score-maximization.
There are problems, though. Suppose a human and an AI can both solve Sudoku, but the AI can’t solve it the way a human would. Suppose the AI trains a distribution over ways of filling out the puzzle to imitate the human. will usually not solve the puzzle, since the AI can’t solve the puzzle the way a human would. Suppose the AI is choosing a distribution over ways of filling out the puzzle to maximize a combined objective based on solving the puzzle and having low Renyi divergence from . If , then will be an approximate quantilizer with base distribution , so it is unlikely to solve the puzzle unless is very low (since very rarely solves the puzzle). With , there is not much of a guarantee that the AI is solving the puzzle the way a human would; unlike a quantilizer, a distribution trained with may assign much higher probability to some ways of filling out the puzzle than does. Something like meeting halfway might be necessary to ensure that the AI solves the problem in a humanlike way.