The way I see it, LLMs are already computing properties of the next token that correspond to predictions about future tokens (e.g. see gwern’s comment). RLHF, to first order, just finds these pre-existing predictions and uses them in whatever way gives the biggest gradient of reward.
If that makes it non-myopic, it can’t be by virtue of considering totally different properties of the next token. Nor can it be by doing something that’s impossible to train a model to do with pure sequence-prediction. Instead it’s some more nebulous thing like “how it’s most convenient for us to model the system,” or “it gives a simple yet powerful rule for predicting the model’s generalization properties.”