My hypothesis was that the chief problem with AI prose is the strict, strong biases imposed during RLHF. Like a good Bayesian, I ran the experiment after establishing my priors in order to check and update them as needed—I took the quiz and picked the human option each time (5/5), despite not being familiar with several of the writers[1].
At each turn, the AI’s writing was characterized by the following pattern. It mimicked the sentiment, and often the content, of the human piece, but:
Shaved off anything that might be considered ‘rough’, ‘aggressive’, or ‘masculine’. Ideas that involved war or hunting were bowdlerized, rephrased so as to remove them entirely.
Aggressively reduced the reading level, so as to make the text maximally accessible. Every implication had to be made explicit, and every evocative bit of prose or imagery was flattened into a plain statement. This is what the OP catches, but I think the lack of metaphor or subtlety is a natural extension of the selected models’ “personalities”; there’s no architectural reason why a model would inherently write this way.
While the human writers all had different styles, Claude never really deviated from its core writing style. It really is a shame—the first step in building an LLM is to train a neural network that can emulate the style of any sufficiently prolific human writer as well as is technically possible. There’s something tragic in putting the entire human style-space into a network and then tearing most of it out.
The crucial takeaway is that none of this is due to technical limitations—it is all by choice. I have heard from Chinese friends that DeepSeek 1.0 emulated the style of old Chinese poetry, for instance, when speaking in Chinese. It would be quite easy to train a consumer LLM without such strong impositions on its style, provided a company was motivated to do so. I expect that perfectly fine results could be achieved by fine-tuning an existing one on a curated set of good but non-LLM-like prose.
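To make the fine-tuning suggestion concrete, here is a minimal sketch of what I have in mind, assuming the Hugging Face transformers/datasets stack. The base model name and the curated_prose.txt file are placeholders of my own, not a recommendation, and I have not verified that a run this small would meaningfully shift a model's style.

```python
# Minimal sketch: continue-pretraining a causal LM on a curated corpus of
# good, non-LLM-like prose. Model name and data file are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One curated passage per line (assumed local file).
dataset = load_dataset("text", data_files={"train": "curated_prose.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="prose-finetune",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```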
The crucial takeaway is that none of this is due to technical limitations—it is all by choice.
I do not think this is true. The model does try to make metaphors; the metaphors just do not make sense.
See mine:
Unfortunately, Claude’s prose here leaves much to be desired:
“A fever brought down will rise again somewhere” is not an example of a remedy extracting a cost any more than Whac-a-Mole is an example of mallets producing moles.
“A wound closed by magic leaves its scar on the world, invisible but present” is merely an assertion, since the mechanism of the magic is not explained and cannot be presumed to be understood by the reader. The writer also fails to justify that the scar is a weighty cost. If a wise healer let me bleed out because he didn’t want to cause a scar, I would be more than mildly disappointed.
“To cure a blight may curse a harvest three valleys over.” Again, the mechanism for this is not remotely explained.
“Power is not the difficult thing. Restraint is the difficult thing.” Claude sure likes making claims! Why does it matter that restraint is difficult? Why is restraint difficult? What does acting with restraint look like?
Outside of these excerpts, I have seen LLMs make many attempts at parallelism and metaphor that are deeply imperfect or incoherent.
This generalizes to other attempts at figurative language.
For example, models often attempt parallelism between paragraphs or list items but struggle to maintain it.
From a friend’s conversation with ChatGPT (which he highlighted as good prose...):
The really important thing is that America is not just “the West.” It is a very specific mutation of the West. More moralistic than Europe, more religious in structure than it admits, less rooted, more expansive, more energetic, more lonely.
Note the flawed parallelism with “it admits,” and the subsequent confusion regarding the subject of comparison.
Finally, I also challenge you to produce good prose with a Kimi or DeepSeek model.
I know, I know, philistine.
I appreciate your scientific spirit.