RobertKirk comments on Deconfusing Direct vs Amortised Optimization

RobertKirk 13 Dec 2022 15:10 UTC
LW: 3 AF: 3
An existing example of something like the difference between amortised and direct optimisation is RLHF (without a KL penalty, to make the comparison exact) versus rejection sampling (RS) with a trained reward model (RM). RLHF amortises the cost of directly finding good outputs according to the RM, so that at evaluation time the model can produce a good output with a single generation. RS requires no training beyond the RM itself, but uses much more compute at evaluation time, since it generates many candidates and filters them with the RM. (This case doesn’t exactly match the description in the post, as the amortised optimisation here uses RL rather than SL. That could be fixed by gathering data with RS, doing supervised fine-tuning on that data, and comparing the resulting policy with RS itself.)
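To make the contrast concrete, here is a minimal toy sketch of the SL variant described in the parenthetical. Everything in it is an assumption for illustration: the "generations" are just integers 0 to 9, `reward_model` is a hand-written stand-in for a trained RM, and "supervised fine-tuning" is reduced to fitting a categorical distribution by counting. It is not how any real RLHF or RS pipeline is implemented.

```python
import random
from collections import Counter

OUTPUTS = list(range(10))  # toy output space standing in for generations


def reward_model(x: int) -> float:
    """Hypothetical stand-in for a trained RM: prefers outputs near 7."""
    return -float((x - 7) ** 2)


def sample(policy: dict, rng: random.Random) -> int:
    """Draw one output from a categorical policy {output: probability}."""
    return rng.choices(list(policy), weights=list(policy.values()), k=1)[0]


def rejection_sample(policy: dict, n: int, rng: random.Random) -> int:
    """Direct optimisation: generate n candidates at evaluation time
    and keep the one the RM scores highest."""
    candidates = [sample(policy, rng) for _ in range(n)]
    return max(candidates, key=reward_model)


def distil_rs(policy: dict, n: int, rounds: int, rng: random.Random) -> dict:
    """Amortised optimisation: collect RS outputs once, then fit a new
    policy to them by maximum likelihood (here, just normalised counts)."""
    counts = Counter(rejection_sample(policy, n, rng) for _ in range(rounds))
    return {x: counts[x] / rounds for x in OUTPUTS if counts[x] > 0}


def mean_reward(draw, trials: int) -> float:
    """Average RM score of `trials` outputs produced by calling draw()."""
    return sum(reward_model(draw()) for _ in range(trials)) / trials


rng = random.Random(0)
base = {x: 1 / len(OUTPUTS) for x in OUTPUTS}  # uniform base policy

# Pay the optimisation cost up front (16 * 5000 base samples)...
amortised = distil_rs(base, n=16, rounds=5000, rng=rng)

base_score = mean_reward(lambda: sample(base, rng), 2000)
# ...or pay it on every query (16 base samples per output).
rs_score = mean_reward(lambda: rejection_sample(base, 16, rng), 2000)
# The amortised policy needs only one sample per output.
amortised_score = mean_reward(lambda: sample(amortised, rng), 2000)
```

In this toy setting the distilled policy should score about as well as RS under the RM while needing a single sample per query, which is the amortisation trade-off. It says nothing about how the two behave once the RM is imperfect, which is the question the rest of this comment is about.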
Given we have these two types of optimisation, I think two key things to consider are how each type of optimisation interacts with Goodhart’s Law, and how they both generalise (kind of analogous to outer/inner alignment, etc.):
- The work on over-optimisation scaling laws in this setting shows that, at least on distribution, there does seem to be a meaningful difference in over-optimisation behaviour between the two types of optimisation, as shown by the different functional forms for RS vs RLHF.
- I think the generalisation point is most relevant when we consider that the optimisation process used (either in direct optimisation to find solutions, or in amortised optimisation to produce the dataset to amortise) may not generalise perfectly. In the setting above, this corresponds to the reward model not generalising perfectly. It would be interesting to see an investigation similar to the over-optimisation work but for generalisation properties: how does the generalisation of the RLHF policy relate to the generalisation of the RM, and likewise for the RS policy? Of course, over-optimisation and generalisation probably interact, so it may be difficult to disentangle whether poor performance under distribution shift is due to over-optimisation or misgeneralisation, unless we have a gold RM that also generalises perfectly.
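For reference, the functional forms mentioned in the first bullet are, as far as I recall them from the over-optimisation scaling laws work (worth checking against the paper before relying on them), written in terms of a distance $d$ from the initial policy. The symbols $\alpha$ and $\beta$ are fitted coefficients, and $R$ is the gold reward.

```latex
% d measures how far the optimised policy has moved from the initial one
d := \sqrt{D_{\mathrm{KL}}\left(\pi \,\|\, \pi_{\mathrm{init}}\right)}

% Rejection sampling (best-of-n): quadratic in d
R_{\mathrm{bo}n}(d) = d \left( \alpha_{\mathrm{bo}n} - \beta_{\mathrm{bo}n} \, d \right)

% RL against the reward model: d times a term logarithmic in d
R_{\mathrm{RL}}(d) = d \left( \alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d \right)

% For best-of-n the KL is available in closed form as a function of n
D_{\mathrm{KL}}^{\mathrm{bo}n} = \log n - \frac{n - 1}{n}
```

Both forms rise and then fall as $d$ grows, but the RS penalty term grows linearly in $d$ while the RL one grows only logarithmically, which is the sense in which the two types of optimisation over-optimise differently.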