I think it might be that the undesired response in RLHF/DPO settings isn’t good enough.
Imagine two responses, one leveraging stylistic devices and persuasive words while the other… well, just doesn’t.
Naturally the first is better and more desirable. If we now look across the whole training batch, the distinguishing feature of the preferred responses becomes clear: they lean on stylistic devices, and they do so at arbitrary points. That is, a phrase like “It’s not an X, it’s a Y” will occur at many different positions throughout the positive examples, while the negative examples rarely, if ever, showcase such pleasant phrasing.
But wouldn’t constantly repeated stylistic devices then be exactly the behavior we should expect? This clear contrast between positive and negative examples is what we distill into our final model, basically telling it that stylistic devices are preferred at any point.
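To make that concrete, here is the standard DPO objective in a minimal PyTorch sketch (the function and variable names are mine, and nothing here is tied to any particular training setup). The loss only ever sees the difference between the chosen and the rejected completion, so whatever tokens separate the two, stylistic phrases included, are exactly what gets pushed up.

```python
# Minimal sketch of the standard DPO loss; names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All inputs are summed sequence log-probs, shape (batch,)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The objective depends only on the contrast between the two completions:
    # if stylistic phrases are what separate chosen from rejected, those
    # phrases carry the gradient, regardless of where they appear.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```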
To move away from this, hunting for even better, higher-quality positive examples won’t help at all. Instead, we need the negative examples to move closer and closer to the positive examples over the course of training, just as a writer progresses through their career: first learning about stylistic devices, then understanding when to use them meaningfully and when less is more, and finally mastering them fully. This contrast between a good writer and a really good writer needs to be captured much more in the post-training data for something like DPO.
Do take this with a grain of salt; it’s just a theory I came up with after thinking about this for ten minutes or so. I haven’t looked into the empirical state of research on this hypothesis, but it does seem somewhat convincing to me, at least.
There are many different stylistic devices that human writers use. I believe there’s a subset of stylistic devices that all LLMs use. Do you believe this isn’t the case?
Ah, I think I might have slightly misunderstood the intent of your post’s title and tried answering a different question: why LLM writing often seems shallow and bad, rather than why LLMs specifically seem biased toward a subset of stylistic devices.
I honestly don’t use LLMs much for chatting or writing, so my personal experience is rather limited. But I do find the point others made, that the post-training data distribution just isn’t an accurate sample, convincing enough, just not particularly satisfying.
So, here are my thoughts on why not just RLHF but also SFT or DPO could, even with a perfect sample of training data, end up with a collapsed distribution of stylistic devices.
In the case of RLHF, we can go even further and assume the distillation of the training data went perfectly: the reward model isn’t biased toward any stylistic devices but is a perfect representation of its training data.
Even then, the key problem is that the reward model only sees a single trace. This is important because it makes the reward model unable to determine whether the distribution of stylistic devices seen in the trace is simply a reasonable sample from the whole distribution or only a subset of it.
And under constant optimization pressure, mastering only a few stylistic devices (just enough to fool the RM in a single trace) will quickly become the path forward.
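As a purely illustrative toy (every name and number below is my own assumption, not a real RLHF setup), here is the single-trace blind spot in a few lines: a reward that only ever scores one trace at a time cannot prefer a varied repertoire over a single pet device, so the narrow policy does at least as well under optimization pressure.

```python
# Toy illustration: per-trace reward cannot see (or reward) diversity across traces.
import random

DEVICES = ["contrast", "rule_of_three", "rhetorical_question", "metaphor", "anaphora"]

def reward_model(trace_devices):
    # Sees a single trace: rewards the presence of any device, and has no way
    # to know how varied the policy is across traces.
    return 1.0 if trace_devices else 0.0

def narrow_policy():
    # Learned shortcut: always the same device.
    return ["contrast"]

def varied_policy():
    # Human-like: a different device (or none at all) each time.
    return random.sample(DEVICES, k=random.randint(0, 2))

for name, policy in [("narrow", narrow_policy), ("varied", varied_policy)]:
    avg = sum(reward_model(policy()) for _ in range(1000)) / 1000
    print(name, round(avg, 2))
# The narrow policy matches or beats the varied one on per-trace reward,
# so there is no pressure to preserve a broad repertoire.
```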
Now what about something like SFT? After all, here we don’t do any rollouts anymore.
This does help. Because the loss is unbiased, we can assume the model’s distribution of stylistic devices is pretty accurate when it is conditioned on prefixes from the training examples. But that’s the extent of what we can say: training was completely offline.
The traces at inference time look very different from the training data: errors propagate during token generation, biases accumulate, and suddenly the model is conditioning on only a subset of the training distribution, or worse, on something it never encountered at all. Assuming that the distribution of stylistic devices will still be unbiased once the model faces a completely different distribution of traces is wishful thinking at best.
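Here is a toy sketch of that exposure-bias argument (entirely my own construction, with made-up numbers): the model matches human flourish usage on common prefixes but is under-constrained on the rare prefix “a flourish just occurred” and over-uses flourishes there. Teacher forcing rarely puts it in that situation; free-running generation puts it there more and more often, and the error feeds back on itself.

```python
# Toy exposure-bias simulation; all rates are illustrative assumptions.
import random

HUMAN_RATE = 0.05   # how often human text uses a flourish at any step

def model_p(prev_was_flourish):
    # Well fit on the common context, poorly constrained on the rare one.
    return 0.60 if prev_was_flourish else HUMAN_RATE

def teacher_forced_rate(n=100_000):
    # Prefixes are drawn from human text, so the rare context shows up ~5% of the time.
    return sum(model_p(random.random() < HUMAN_RATE) for _ in range(n)) / n

def free_running_rate(steps=200, n=2_000):
    total = 0.0
    for _ in range(n):
        prev, used = False, 0
        for _ in range(steps):
            prev = random.random() < model_p(prev)   # conditions on its own output
            used += prev
        total += used / steps
    return total / n

print(f"human {HUMAN_RATE:.3f} | teacher-forced {teacher_forced_rate():.3f} "
      f"| free-running {free_running_rate():.3f}")
# Roughly 0.05 vs 0.08 vs 0.11: the mismatch about doubles once the model
# conditions on its own traces instead of human prefixes.
```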
Online training where you look beyond a single trace seems most promising. This could happen either by including information like logits (KL distillation; see this post for an idea that should also work well) or by incorporating multiple traces into the judgement of any one of them: how diverse is this trace compared to the others generated, including its stylistic devices?
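A hypothetical sketch of that second option, judging a trace against its siblings: sample several rollouts for the same prompt, then discount each trace’s reward by how much it overlaps with the others (here via crude trigram overlap). The function names, the overlap measure, and the penalty weight are all assumptions of mine, not an existing method.

```python
# Sketch: diversity-adjusted rewards over a group of rollouts for one prompt.
from itertools import combinations

def trigrams(text):
    toks = text.split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def overlap(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / max(1, min(len(ta), len(tb)))

def diversity_adjusted_rewards(traces, base_rewards, penalty=0.5):
    """base_rewards: per-trace scores from the usual RM or judge."""
    adjusted = list(base_rewards)
    for i, j in combinations(range(len(traces)), 2):
        o = overlap(traces[i], traces[j])
        # Spread the similarity penalty over each trace's comparisons.
        adjusted[i] -= penalty * o / (len(traces) - 1)
        adjusted[j] -= penalty * o / (len(traces) - 1)
    return adjusted
```

The penalty weight is a free knob here; the point is only that the score for one trace now depends on what the other rollouts look like, which a single-trace RM can never capture.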