My intuition about the KL-divergence term is that it mostly operates at the sentence level. It reduces the chance that the model goes off the rails and outputs something that looks like random gibberish. If you set it high enough, it forces the model to always output natural-seeming English text.
But I don't think it puts much regularization on the long-term plans or preferences of the model. You can still, for example, get blatant reward hacking. It's just that the model will produce long, beautiful paragraphs whose natural conclusion is that egregious reward hacking is the way to go.
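For concreteness, here is a rough sketch of how the KL penalty is typically wired into RLHF-style training (the function name and the beta value are illustrative, not from any particular codebase). The thing to notice is that the penalty is computed token by token against the frozen reference model, which is why it constrains local fluency far more than it constrains where the trajectory as a whole is heading.

```python
import torch
import torch.nn.functional as F

def kl_shaped_rewards(logits, ref_logits, tokens, task_reward, beta=0.1):
    """Per-token KL penalty as commonly used in RLHF-style training.

    logits / ref_logits: [seq_len, vocab] from the policy and the frozen reference model.
    tokens: [seq_len] sampled token ids.
    task_reward: scalar reward-model score, credited at the final token.
    """
    # Log-probability each model assigns to the tokens actually sampled.
    logp = F.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    # The penalty is local: each token is pushed toward what the reference
    # model would plausibly say next, regardless of where the sequence as a
    # whole is going.
    kl_per_token = logp - ref_logp

    rewards = -beta * kl_per_token
    rewards[-1] += task_reward  # the sequence-level reward arrives only at the end
    return rewards
```

Under this shaping, a reward-hacking completion made of perfectly fluent sentences pays almost no KL cost, which matches the intuition above.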