For DTs, it's really just a linear function that converts the scalar reward into the same dimensions as the token embeddings.
So e.g. a single token’s embedding has a hidden state of size 1024.
We can learn a linear function that takes this scalar and outputs something of size 1024.
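To make that concrete, here is a minimal PyTorch sketch of what I mean. The names `reward_proj` and `embed_reward` are made up for illustration, and 1024 is just the example size from above, so treat this as a sketch rather than the exact code from any DT implementation.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 1024  # example embedding width from the text above

# Learnable linear map: scalar reward -> vector with the same width as a token embedding.
# (Hypothetical name; a real model would register this as a submodule.)
reward_proj = nn.Linear(in_features=1, out_features=HIDDEN_SIZE)

def embed_reward(rewards: torch.Tensor) -> torch.Tensor:
    """Map rewards of shape (batch,) to embeddings of shape (batch, 1, HIDDEN_SIZE)."""
    # view(-1, 1) gives each scalar its own feature dimension of size 1;
    # unsqueeze(1) adds a sequence dimension so it can sit next to token embeddings.
    return reward_proj(rewards.view(-1, 1)).unsqueeze(1)

rewards = torch.tensor([0.25, 1.0])
reward_emb = embed_reward(rewards)
```

The result has the same last dimension as the token embeddings, so it can be concatenated with them along the sequence axis.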
The more annoying (PITA) part was offsetting the positions, attention masks, and labels for this.
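Roughly, the offsetting looks like the sketch below, assuming the reward embedding is prepended as one extra position at the start of the sequence and padding is on the right. The function name `prepend_reward_slot` is hypothetical, and `-100` is used only because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`; your setup may differ.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def prepend_reward_slot(token_embs, reward_emb, attention_mask, labels):
    """Prepend one reward position and shift the mask, labels, and position ids to match.

    token_embs:     (batch, seq, hidden)
    reward_emb:     (batch, 1, hidden)
    attention_mask: (batch, seq), 1 for real tokens and 0 for padding
    labels:         (batch, seq)
    """
    batch = token_embs.shape[0]
    device = token_embs.device

    # The sequence is now one position longer.
    inputs_embeds = torch.cat([reward_emb, token_embs], dim=1)

    # The reward slot should always be attended to.
    ones = torch.ones(batch, 1, dtype=attention_mask.dtype, device=device)
    attention_mask = torch.cat([ones, attention_mask], dim=1)

    # There is no token to predict at the reward slot, so the loss ignores it.
    ignore = torch.full((batch, 1), IGNORE_INDEX, dtype=labels.dtype, device=device)
    labels = torch.cat([ignore, labels], dim=1)

    # Position 0 is the reward; the original tokens move to positions 1..seq.
    seq_len = inputs_embeds.shape[1]
    position_ids = torch.arange(seq_len, device=device).unsqueeze(0).expand(batch, -1)

    return inputs_embeds, attention_mask, labels, position_ids

# Tiny usage example with dummy tensors (hidden size 8 to keep it small).
token_embs = torch.zeros(2, 5, 8)
reward_emb = torch.ones(2, 1, 8)
mask = torch.ones(2, 5, dtype=torch.long)
labels = torch.arange(10).view(2, 5)
embs_out, mask_out, labels_out, pos_out = prepend_reward_slot(
    token_embs, reward_emb, mask, labels
)
```

Everything ends up with sequence length 6 instead of 5, and it is easy to get an off-by-one here if one of the tensors is forgotten.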
I do agree. I think there are two product use cases with instruct that have distinct optimal levels of entropy:
1. The more explorative use cases you have mentioned, for example when users do want diversity (e.g. generating story ideas).
2. Having factual / accurate answers.
I’m not sure how exactly OpenAI set their “KL budgets” for davinci instruct.
For WebGPT they “compared a couple of KL budgets using human evaluations”, and those evaluations were for how factual the answers were.
So in that scenario, we’ll see a KL budget that optimizes for 2, since the users don’t care about the diversity of multiple generations. They just care about the factual quality of a single generation.
Now I’m interested to see what happens if we somehow change the evaluations so that users are shown e.g. 3 examples from each model, in a scenario where diversity is desirable (e.g. generating story ideas). When deciding on the KL budget there, we would probably get a much lower number, and that would allow them to serve a model more suited to use case 1.