There are many different stylistic devices that human writers use. I believe there’s a subset of stylistic devices that all LLMs use. Do you believe this isn’t the case?
Ah, I think I might have slightly misunderstood the intent of your post’s title and answered a different question: why LLM writing often seems shallow and bad, rather than why LLMs specifically seem biased toward a subset of stylistic devices.
I honestly don’t use LLMs much to chat or write with, so my personal experience is rather limited. But I do find the point others made, that the data distribution for post-training just isn’t an accurate sample, convincing enough, just not particularly satisfying.
So, here are my thoughts on why not just RLHF but also SFT or DPO could, even with a perfect sample of training data, result in a collapsed distribution of stylistic devices.
In the case of RLHF, we can go even further and assume the distillation of the training data into the reward model went perfectly: the reward model isn’t biased towards any stylistic devices but is a faithful representation of its training data.
Even then, the key problem is that the reward model only ever sees a single trace. This matters because it leaves the reward model unable to determine whether the stylistic devices seen in that trace are simply a reasonable sample from the whole distribution or only a narrow subset of it.
And under constant optimization pressure, mastering only a few stylistic devices (just enough to satisfy the RM on any single trace) quickly becomes the path of least resistance.
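To make that concrete, here is a toy Python sketch. The reward model and the traces are entirely invented for illustration; the point is only that a reward computed one trace at a time cannot penalize stylistic collapse.

```python
# Hypothetical reward model: it likes seeing at least one "strong" stylistic
# device, but it only ever looks at one trace in isolation.
def reward_model(trace: str) -> float:
    strong_devices = {"metaphor", "parallelism", "rhetorical question"}
    return 1.0 if any(d in trace for d in strong_devices) else 0.0

# Two hypothetical policies: one spreads across several devices, one
# has collapsed onto a single device that reliably scores well.
diverse_policy = ["a metaphor ...", "some parallelism ...", "a rhetorical question ..."]
collapsed_policy = ["a metaphor ...", "another metaphor ...", "yet another metaphor ..."]

# Both policies collect identical reward, so optimization pressure is free
# to keep collapsing onto whatever single device passes on any one trace.
print(sum(map(reward_model, diverse_policy)))    # 3.0
print(sum(map(reward_model, collapsed_policy)))  # 3.0
```

Nothing in the per-trace score distinguishes the two policies, so diversity is never rewarded.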
Now what about something like SFT? After all, here we don’t do any rollouts anymore. This does help: because the loss is unbiased, we can assume the model’s distribution of stylistic devices is pretty accurate when it is conditioned on the training examples. But that’s the extent of what we can say, since training was completely offline.
The traces seen during inference are very different from the training data: errors propagate during token generation, biases accumulate, and suddenly the model is conditioning on only a subset of the training distribution, or worse, on something it never encountered at all. Assuming the distribution of stylistic devices will still be unbiased once it is conditioned on a completely different distribution of traces is wishful thinking at best.
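As a toy illustration of that drift (all probabilities below are made up for the example, not measured from any model), here is a tiny Markov-chain simulation of how a mild per-step bias compounds once the model conditions on its own outputs instead of ground-truth prefixes:

```python
import random

# Assumed learned conditionals for a binary stylistic choice per step:
P_SAFE_GIVEN_SAFE = 0.8    # stick with the "safe" device if the prefix is safe
P_SAFE_GIVEN_RISKY = 0.4   # switch to the "safe" device after a risky one

def rollout(length: int = 200) -> float:
    """Fraction of 'safe' choices when the model feeds on its own outputs."""
    safe, count = random.random() < 0.5, 0
    for _ in range(length):
        p = P_SAFE_GIVEN_SAFE if safe else P_SAFE_GIVEN_RISKY
        safe = random.random() < p
        count += safe
    return count / length

random.seed(0)
# Under teacher forcing with 50% safe prefixes, the average per-step
# prediction is (0.8 + 0.4) / 2 = 0.6; free-running drifts higher because
# drifted prefixes feed back into the next prediction.
print(sum(rollout() for _ in range(1000)) / 1000)  # ≈ 0.67
```

The per-step behaviour looks fine in isolation, but the rollout distribution no longer matches what the loss was unbiased on.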
Online training where you look beyond a single trace seems most promising. This can happen either by including information like logits (KL distillation; see this post for an idea that should also work well) or by incorporating multiple traces into the judgement of any one of them: how diverse is this trace, including its stylistic devices, compared to the others generated?
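Here is a minimal sketch of what judging a trace relative to its siblings could look like. The reward and distance functions are placeholders I made up, not any existing method; the only point is that the score of a trace now depends on the other rollouts in the batch.

```python
def per_trace_reward(trace: str) -> float:
    # Placeholder for whatever reward model or judge is already available;
    # here just a stand-in based on vocabulary richness.
    return float(len(set(trace.split())))

def pairwise_distance(a: str, b: str) -> float:
    # Jaccard distance over word sets as a cheap proxy for stylistic diversity.
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)

def batch_rewards(traces: list[str], diversity_weight: float = 0.5) -> list[float]:
    # Each trace's score mixes its own reward with how different it is
    # from the other traces generated for the same prompt.
    rewards = []
    for i, t in enumerate(traces):
        distances = [pairwise_distance(t, o) for j, o in enumerate(traces) if j != i]
        diversity = sum(distances) / max(len(distances), 1)
        rewards.append(per_trace_reward(t) + diversity_weight * diversity)
    return rewards

# Traces that reuse the same few devices get dragged down relative to a
# trace that explores a different part of the stylistic distribution.
print(batch_rewards([
    "a metaphor dressed as an answer",
    "a metaphor dressed as an answer",
    "short declaratives, then a rhetorical question",
]))
```

Whether the comparison happens through logits, embeddings, or an explicit diversity term, the key change is that the training signal can finally see more than one sample from the policy at a time.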