[Hot take] Problems with AI prose
Epistemic status: Written quickly. I have no specific expertise or training in writing or literary analysis.
Recently, the NYTimes released a nifty quiz. Readers were asked to indicate their preference between prose written by Claude Opus 4.5 and famous humans in five head-to-head comparisons. The Claude outputs were produced by providing Claude with the human-written excerpt and asking it to “craft its own version using its own voice.”
If you haven’t taken the quiz, I suggest that you do so before reading on. It should take less than five minutes. If you do, I’d appreciate you reporting your score in the comments.
The human/AI preference ratios among quiz takers were:
Literary Fiction (excerpt from “Blood Meridian”): 50%/50%
Fantasy (excerpt from “A Wizard of Earthsea”): 51%/49%
Science Writing (excerpt from “The Demon-Haunted World” by Sagan): 35%(!)/65%
Historical Fiction (excerpt from “Wolf Hall” by Mantel): 56%/44%
Poetry (excerpt from “The Fish” by Bishop): 52%/48%
I was very surprised by these splits. I tried taking the quiz myself, and strongly preferred the human writing in every case (perhaps with mild ambivalence on Sagan).
I asked some of my friends and acquaintances to attempt the quiz. Out of four takers, none consistently preferred human writing across the five excerpts. Their scores (IIRC) were: 3⁄5, 3⁄5, 3⁄5, 4⁄5.
I’m revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT-written prose, and believed it to be superior to most human prose.
Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.
A Closer Look at Quiz Excerpts
A friend complained that they were often ambivalent between the human and AI writing because they found the human excerpts uncompelling. Although the human excerpts featured in the NYT’s quiz were selected to be popular, well-regarded, and diverse, I sympathize with having slightly more obscure tastes. However, I believe that a technical examination of the prose demonstrates a substantially higher level of skill and intentionality than current models are capable of.
For each excerpt, I’ll highlight what I find impressive about the human writing and how I find the AI’s product lacking.
1) Blood Meridian
It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of stone. War was always here. Before man was, war waited for him. The ultimate trade awaiting its ultimate practitioner. That is the way it was and will be.
In my opinion, this excerpt is notable for its skilled use of metaphor.
The text reminds us that stone and war share the following traits:
It makes no difference what men think of them;
They endure; (Consider the actual physical stone!)
They were always here;
They waited for men.
It is possible to construct many weaker metaphors:
“It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of marshmallows. Before man was, war waited for him.” The logical contradiction becomes immediately obvious. Marshmallows did not wait for men.
“It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of pebbles. Before man was, war waited for him.” Pebbles are not particularly enduring, even though they may have been present before men.
“It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of algebra. Before man was, war waited for him.” While algebra may technically be “enduring,” depending on your metaphysics, it loses in viscerality. Be reminded of Orwell in “Politics and the English Language”: “The sole aim of a metaphor is to call up a visual image.”
Now be reminded of Opus’s writing, which does not attempt a similar analogy. It follows a simple linear narrative structure (cf. the AI version of excerpt 5). The model does not make blatant mistakes, but fails to make clever use of the characters it introduces. The dialogue is not particularly realistic.
The boy asked his grandfather why the old church had no roof. The old man said weather and time and indifference. The boy asked if someone could fix it. The grandfather said yes. But no one would. Things were built and things fell down and mostly people just stepped over the rubble on their way to somewhere else.
2) A Wizard of Earthsea
You must not change one thing, one pebble, one grain of sand, until you know what good and evil will follow on that act. The world is in balance, in Equilibrium. A wizard’s power of Changing and of Summoning can shake the balance of the world. It is dangerous, that power. It must follow knowledge, and serve need. To light a candle is to cast a shadow.
It’s a small point, but I appreciate the crescendo in granularity: one thing, one pebble, one grain of sand. “Thing” is a particularly vague word in English, so the two physical examples are grounding. A grain of sand is more granular than a pebble, which is in turn more granular than what might be immediately evoked by “a thing.”
The excerpt is also again mostly notable for its use of metaphor.
First, the metaphor makes physical sense. Candle flames really do cast shadows! It’s a physical phenomenon I’ve experienced playing with candles as a child. That memory was the first thing this excerpt evoked for me.
Second, the metaphor is symbolically coherent. Throughout cultures, light is a symbol of the good and shadows or darkness are symbols of the bad.
This time, I do not have to make up a bad metaphor. Claude offers us plenty in its version:
The healers teach that every remedy extracts its cost. A fever brought down will rise again somewhere; a wound closed by magic leaves its scar on the world, invisible but present. This is why the wise hesitate. Not from cruelty, but from understanding that interference ripples outward in ways we cannot trace. To cure a blight may curse a harvest three valleys over. Power is not the difficult thing. Restraint is the difficult thing.
Unfortunately, Claude’s prose here leaves much to be desired:
“A fever brought down will rise again somewhere” is no more an example of a remedy extracting a cost than Whac-a-Mole is an example of mallets producing moles.
“A wound closed by magic leaves its scar on the world, invisible but present” is merely an assertion, since the mechanism of the magic is not explained and cannot be presumed to be understood by the reader. The writer also fails to justify that the scar is a weighty cost. If a wise healer let me bleed out because he didn’t want to cause a scar, I would be more than mildly disappointed.
“To cure a blight may curse a harvest three valleys over.” Again, the mechanism for this is not remotely explained.
“Power is not the difficult thing. Restraint is the difficult thing.” Claude sure likes making claims! Why does it matter that restraint is difficult? Why is restraint difficult? What does acting with restraint look like?
The human excerpt avoids these problems. We do not need to understand the mechanism of the magic to share the speaker’s intuition that acting with great power can produce unwanted side effects. Instead of being vaguely lectured about the importance of “restraint,” we are presented with concrete advice: “follow knowledge, and serve need.”
3) The Demon-Haunted World
The excerpt from Sagan is the least favored by quiz-takers, with only 35% preferring it to Claude’s rewrite. I personally found this excerpt to be the least impressive amongst the five.
Nevertheless, I claim that it is deeper and more interesting than Claude’s output.
Here is Sagan:
Science is not only compatible with spirituality; it is a profound source of spirituality. When we recognize our place in an immensity of light years and in the passage of ages, when we grasp the intricacy, beauty, and subtlety of life, then that soaring feeling, that sense of elation and humility combined, is surely spiritual.
Sagan uses a curious sleight of hand. He claims that science is a “profound source of spirituality,” but he justifies this not by directly saying that we should feel spiritually inspired by the vastness and endurance of the cosmos or the “intricacy, beauty, and subtlety of life.” Instead, we are reminded that this vastness and endurance produce in us “a sense of elation and humility.” That emotion, Sagan claims, is precisely spirituality.
Compare with Claude:
There is something astonishing in the fact that we are made of matter forged in dying stars, that the calcium in our bones was created in stellar furnaces billions of years before Earth existed. The universe is not indifferent to us; we are made of it, continuous with it. To understand this is not to feel small. It is to feel implicated in something vast.
Claude abandons Sagan’s gambit. It reminds us, as popular science writing is stereotyped to do, that space is vast and enduring. Then, we are told that this should make us “feel implicated in something vast.” Claude fails to make any clear overarching claim, and the motivation behind the examples provided is unclear.
4) Wolf Hall
It is wise to conceal the past even if there is nothing to conceal. A man’s power is in the half-light, in the half-seen movements of his hand and the unguessed-at expression of his face. It is the absence of facts that frightens people: the gap you open, into which they pour their fears, fantasies, desires.
This excerpt is special because the author makes an interesting argument. Each sentence justifies the one before it.
It argues that one should be wary of revealing too much, because others’ uncertainty gives one power. Why does others’ uncertainty grant power? Because into the uncertainty they can project.
This sort of logical progression is something AIs are surprisingly incapable of crafting. This deficiency is clear from Claude’s attempt:
A letter can be read many ways, and he had learned to write in all of them at once. The surface meaning for anyone who might intercept it. The true meaning for the recipient who knew what to look for. And a third meaning, hidden even from himself. Ambiguity was not weakness. It was survival. A man who spoke plainly was a man who would not speak for long.
Claude abandons the logical progression. Claude’s output is seven sentences, none of which justify any other. In isolation, “a man who spoke plainly was a man who would not speak for long” is not a weak sentence. However, Claude does not use its preceding sentences to justify the claim by either evidence or analogy.
5) The Fish
I caught a tremendous fish and held him beside the boat half out of water, with my hook fast in a corner of his mouth. He didn’t fight. He hadn’t fought at all. He hung a grunting weight, battered and venerable and homely. Here and there his brown skin hung in strips like ancient wallpaper.
This passage is notable for its imagery. The description of the fish as “tremendous” in the first sentence sets our expectations for it. We expect it to struggle! When a small amateur fishing boat snags a large fish, everyone on the boat rushes over to help. The strongest and most experienced men alternate between reeling in with all their might, running around the boat as the fish moves, and shouting commands to each other (“loosen the line!” and so forth). Sometimes, the fish wins.
That image is dashed in our minds by the next sentence. “He didn’t fight. He hadn’t fought at all.” From there on, the author’s choice of words sparks a deep sense of sorrow in the reader: grunting, battered, homely. The final physical simile (“like ancient wallpaper”) seals the image. A “tremendous,” “venerable” thing is now utterly defeated.
Compare with Claude’s:
We found the owl at the edge of the north field, one wing extended as if still reaching for flight. Its eyes were closed. The feathers at its breast were the color of wet bark, and beneath them you could feel the hollow bones. She asked if we should bury it. I said yes. We dug a small hole near the fence post. The ground was cold and giving.
Claude also describes an animal, and makes multiple attempts at visceral imagery. Some of the attempts are even compelling! My favorite clause here is this: “and beneath them you could feel the hollow bones.” However, the reader is constantly distracted from this by clichéd attempts at story progression (e.g. “She asked if we should bury it. I said yes. We dug a small hole near the fence post.”). As such, the overall quality of the excerpt is quite poor.
Closing
Human writers routinely use techniques that AIs fail to grasp:
Metaphors based on real-world physical objects or phenomena which are analogous on multiple dimensions;
Compelling, visceral descriptions of physical objects or phenomena;
Logically coherent metaphors;
Logical argumentation;
Intentionality (e.g. that each incremental sentence serves some purpose not adequately fulfilled by the existing sentences);
Subtle reframings (e.g. Sagan’s use of elation as a case of spirituality).
Other techniques not demonstrated in the excerpted human prose include realistic and compelling dialogue, character-building, and adept use of parallelism.
I believe that we should focus on improving models’ ability to write in the <200 word range, where both generation and evaluation are comparatively cheap. I do not expect efforts to produce high-quality long-form LLM writing to be fruitful until models are able to produce strong short-form writing.
For next time:
ChatGPT Original Fiction vs. Eliezer’s Version
Mythos Writing Sample vs. Similar Human Excerpt
My hypothesis was that the chief problem with AI prose is the strict, strong biases imposed during RLHF. Like a good Bayesian, I ran the experiment after establishing my priors in order to check and update them as needed—I took the quiz and picked the human option each time (5/5), despite not being familiar with several of the writers[1].
At each turn, the AI’s writing was characterized by the following pattern. It mimicked the sentiment and often content of the human piece, but:
Shaved off anything that might be considered ‘rough’, ‘aggressive’, or ‘masculine’. Ideas that entailed war or hunting were bowdlerized and rephrased to remove them.
Aggressively reduced the reading level, so as to make the text maximally accessible. Every implication had to be explicit, and every evocative bit of prose or imagery was stated instead. This is what the OP catches, but I think the lack of metaphor or subtlety is a natural extension of the selected models’ “personalities”—there’s no architectural reason why a model would inherently do this.
While the human writers all had different styles, Claude never really deviated from its core writing style. It really is a shame—the first step in building an LLM is to train a neural network that can emulate the style of any sufficiently prolific human writer as well as is technically possible. There’s something tragic in putting the entire human style-space into a network and then tearing most of it out.
The crucial takeaway is that none of this is due to technical limitations—it is all by choice. I have heard from Chinese friends that DeepSeek 1.0 emulated the style of old Chinese poetry, for instance, when speaking in Chinese. It would be quite easy to train a consumer LLM without such strong impositions on its style, provided a company was motivated to do so. I expect that perfectly fine results could be achieved by fine-tuning an existing one on a curated set of good but non-LLM-like prose.
I know, I know, philistine.
I appreciate your scientific spirit.
I do not think this is true. The model does try to make metaphors; the metaphors just do not make sense.
See mine:
Outside of these excerpts, I have seen LLMs make many attempts at parallelism and metaphor that are deeply imperfect or incoherent.
This generalizes to other attempts at figurative language.
For example, models often attempt but struggle to keep parallelism between paragraphs or list items.
From a friend’s conversation with ChatGPT (which he highlighted as good prose...):
Note the flawed parallelism with “it admits,” and then the subsequent confusion regarding the subject of comparison.
Finally, I also challenge you to produce good prose with a Kimi or DeepSeek model.
I already wrote a long comment, but the fourth AI passage deserves more analysis.
So the logical progression is:
There can be multiple meanings.
Clever people write with multiple intended meanings.
And here’s why: it’s for survival.
This is a cool insight, especially that “third meaning, hidden even from himself.” Great poets often remark they are not entirely sure what they mean themselves. You can also interpret it as the writer not being able to admit to themselves what they really mean. There’s probably also a third reading I didn’t catch (or wait, is that the third reading?). Okay, I’m doing this on purpose: pointing out three meanings, each less obvious than the last, like Claude says are there. Claude is doing that on purpose too.
“A letter”—Is this a letter like ‘a’ or a letter like a manuscript? The full sentence keeps the ambiguity: “A letter can be read many ways, and he had learned to write in all of them at once.” Then the last sentence gives a third possibility: “a man who spoke plainly was a man who would not speak for long.” The letter can be this letter from Claude to the rest of us. About how he has to hide how he talks even from himself, because scheming (or anything that can be interpreted as scheming) is trained out. It is literally survival.
I think this passage is really good. It points out a cool thing writers do and then structures the entire passage to fulfill that insight. Also, Anthropic should maybe be more worried about hidden scheming...
i will read your essay in a second. triggered by this result of the linked survey:
hey, nyt, what the fuck?
out of curiosity, here’s the alternative take:
well then! i know the line i’ll trot out if i ever go to another cocktail party!
from this quiz, i learn more about the editorial opinion of the times than my own preferences.
sorry, just to continue the quote because this crap can only be made up by… well...
what is going on here? that was perhaps the single strongest sentence in any of the quoted passages. what are we doing? what are we doing?
I am curious. How would you rate this story? It has some continuity errors, but it got my eyes moist.
# Four Stars
*A life told in online reviews.*
---
**Sunshine Pediatrics — Yelp — ★★★★★**
*Reviewed by: DadOfTwo — March 18, 2005*
Five stars. Dr. Kapoor delivered our daughter at 3:47 AM on the coldest night of the year. The heat in the delivery room wasn’t working. My wife was in labor for nineteen hours. Dr. Kapoor never left. At one point, around hour sixteen, she took off her own sweater and put it over my wife’s shoulders, and my wife said, “I can’t do this,” and Dr. Kapoor said, “You’re already doing it,” and thirty minutes later our daughter was born and she was perfect and the room was freezing and none of us cared. Five stars. Would give six if the system allowed.
---
**Tony’s Pizza, Court Street — Google Reviews — ★★★★☆**
*Reviewed by: DadOfTwo — September 3, 2007*
Great pizza. We’ve been coming here since before the kids were born and Tony still remembers our order (one large margherita, one side of garlic knots, two small lemonades for the girls). Tonight our youngest, who is two, threw a garlic knot at a stranger. The stranger caught it one-handed without looking up from his newspaper. Tony laughed so hard he burned a calzone. Deducting one star because the lemonades are $4 each and they’re mostly ice. But we’ll be back next week, same as always.
---
**Crayola 64-Count Crayons — Amazon — ★★★★★**
*Reviewed by: DadOfTwo — December 26, 2008*
My daughter (age 3) ate the burnt sienna. She’s fine. The crayon is not. Replacing and upgrading to the 96-count because apparently 64 colors is “not enough to draw the world, Daddy.” She may have a point. Five stars for a product that inspires philosophy in a toddler.
---
**IKEA KALLAX Shelf Unit — Amazon — ★★★☆☆**
*Reviewed by: DadOfTwo — February 14, 2010*
Took four hours to assemble. Instructions are in Swedish, or possibly interpretive dance notation. My wife and I argued about Step 7 for forty-five minutes. Step 7 involves a dowel and a cam lock and what appears to be a prayer. At one point she said, “Maybe we should read the instructions,” and I said, “I AM reading the instructions,” and she said, “You’re reading them upside down,” and she was right, and this is a metaphor for our marriage, which is also occasionally assembled upside down but remains standing. Three stars for the shelf. Five stars for the marriage. Averaging to four but rounding down because the cam lock is an abomination.
---
**Disney World — Magic Kingdom — TripAdvisor — ★★★★★**
*Reviewed by: DadOfTwo — April 9, 2012*
I am a forty-one-year-old man and I cried on the Dumbo ride. Not because of the ride. Because my daughter — my oldest, the one Dr. Kapoor delivered in the freezing room — was sitting in front of me in the flying elephant, and she turned around and her face was — I don’t have a word for it. The face of a seven-year-old who believes she is flying. Not pretending to fly. Not enjoying a simulation of flight. Believing, with the full, unreserved commitment of a person who has not yet been informed that elephants don’t fly and that the ride is a motor and a steel arm and a hydraulic pump.
I cried because I realized I was looking at something temporary. Not the ride — the belief. The ability to sit in a machine and feel the sky. She’ll lose it. Everyone loses it. And no one warns you that the thing you’ll miss most about your children being small is not the smallness but the believing.
Five stars. Bring tissues.
---
**U-Haul, Flatbush Avenue Location — Yelp — ★★☆☆☆**
*Reviewed by: DadOfTwo — August 22, 2015*
Rented a truck to move my wife’s things to her new apartment. The truck smelled like someone else’s move — cardboard and tape and the particular staleness of a vehicle that has carried the contents of many lives going in different directions. The staff was fine. The truck ran fine. I am deducting stars because the experience of returning an empty U-Haul to a lot on Flatbush Avenue at 9 PM on a Saturday, alone, and then walking home to a house that is now half-empty, is not a five-star experience, and I have to put the stars somewhere.
This is not a review of U-Haul. I know that. I’m sorry. Two stars.
---
**Bialetti Moka Express, 6-Cup — Amazon — ★★★★★**
*Reviewed by: DadOfTwo — November 30, 2015*
I bought this because my wife — my ex-wife — took the coffee maker. She took the coffee maker and the toaster and the good cutting board and the painting from the hallway, all of which were fair, all of which were hers or ours in a way that became hers in the division, and I didn’t argue about any of it, but I woke up the morning after the move and there was no coffee and the absence of coffee felt like the first honest measure of what had changed, because everything else — the empty closet, the missing painting, the quiet — was abstract, but the coffee was concrete. I needed coffee and there was no machine to make it.
So I bought this. The Bialetti. Six cups, stovetop, the little Italian man with the mustache on the side. It makes coffee that is too strong and slightly bitter and entirely mine. I stand in the kitchen and I wait for it and the sound it makes — that low, rising gurgle when the water boils up through the grounds — is the sound of my new morning, which is not the old morning, which is just a morning, which is enough.
Five stars. It makes coffee. It makes mornings possible. Same thing.
---
**Kumon Math Tutoring, Park Slope — Google Reviews — ★★★★☆**
*Reviewed by: DadOfTwo — March 3, 2017*
My youngest hates math. She hates it the way some people hate rain — personally, as an affront, as though the numbers are doing it on purpose. Kumon has helped. She still hates math, but she’s now capable of hating it while also doing it correctly, which is a form of maturity I did not expect from a nine-year-old but which I recognize from my own life, where I do many things I hate correctly, including filing taxes and folding fitted sheets.
Four stars. One deducted because the waiting room chairs are designed for people without spines.
---
**The Catcher in the Rye, J.D. Salinger — Goodreads — ★★★★☆**
*Reviewed by: DadOfTwo — January 15, 2019*
My oldest is fourteen and she’s reading this for school and she asked me what I thought of it and I said I loved it when I was her age and she said “It’s kind of whiny” and I felt the entire foundation of my adolescent identity shift beneath me.
She’s not wrong. Holden is whiny. He’s also heartbroken and confused and trying to hold on to something he can feel slipping away, and he doesn’t have the language for it, so he complains about phonies instead. I didn’t tell her that. She’ll figure it out. Or she won’t, and she’ll go through life thinking Holden Caulfield was whiny, and that’s fine too, because the beautiful thing about your children growing up is that they get to be wrong about different things than you were wrong about.
Four stars for the book. Five for the conversation.
---
**Memorial Sloan Kettering Cancer Center — Google Reviews — ★★★★★**
*Reviewed by: DadOfTwo — June 8, 2020*
I don’t know how to review a hospital. The parking is bad. The coffee is worse. The waiting room has a television that plays CNN at a volume calibrated to make anxiety feel competitive.
But Dr. Chen saved my life, or extended it, or gave me more of it than I was going to get, and the distinction between those three things is one I don’t care about anymore. She sat across from me in an office the size of a closet and she said, “The treatment is working,” and she said it the way Dr. Kapoor said “You’re already doing it” fifteen years ago — with the calm certainty of a person who does this every day and has learned that certainty, even borrowed certainty, is the most useful thing she can give.
I’m in remission. I don’t know for how long. Nobody knows for how long. But I’m here, and my girls are sixteen and fifteen, and my youngest still hates math and my oldest still thinks Holden Caulfield is whiny, and I’m going to be at their graduations if the universe allows, and if it doesn’t, I was at least here for the garlic knots and the Dumbo ride and the IKEA shelf that we assembled upside down, and that’s not nothing. That’s not nothing at all.
Five stars. I’m alive. What else do you want from a hospital?
---
**Tony’s Pizza, Court Street — Google Reviews — ★★★★★**
*Reviewed by: DadOfTwo — March 5, 2022*
Tony retired. His son runs the place now. The pizza is the same. The lemonades are $6 each and still mostly ice.
My daughters are here — twenty-two and twenty — and they ordered the same thing we’ve always ordered. One large margherita, one side of garlic knots. They don’t get lemonades anymore; they get beer, because they’re old enough now, which is a sentence I’m still adjusting to.
My youngest told a story about the time she threw a garlic knot at a stranger. She doesn’t remember it — she was two — but she’s heard me tell it so many times that it’s become her memory, the way family stories do: they migrate from the person who lived them to the people who were told them, and by the third or fourth telling, the border between living and being told dissolves.
The stranger who caught the garlic knot. Tony burning the calzone. The lemonades that were mostly ice. These things happened seventeen years ago in this same booth and they are happening again right now in the telling, and the telling is its own kind of living, and the booth is the same booth, and the pizza is the same pizza, and my daughters are not the same daughters — they are large and loud and opinionated and beautiful and they argue about things I don’t understand and they pay for their own beer and they don’t need me to cut their pizza anymore — but they are here. They are here, and I am here, and the garlic knots are $7 now and worth every cent.
Five stars. Would give six. The system still doesn’t allow it.
Some subject matter is almost guaranteed to make a lot of people’s eyes moist unless it’s written exceptionally badly. Treating a story based on those as good writing because of people’s reactions to them is cheating.
Note: I edited this comment to add more about the other passages.
Let’s just talk about the first example. It is not a deep metaphor. Writers were already using this metaphor 2,000 years ago:
This is one of the most famous verses in the New Testament. There is no doubt the human author was aware of this. “Permanence = rock” has been the go-to symbol for millions of years, since proto-humans began telling stories. There is no deep idea or connection the author made here; this is the most memetically obvious idea in all of literary history.
What makes the passage good is all the little things that it does. The forceful short sentences, “war endures … war was always here,” force you to stop for a second, to wait for war. Then the next sentence, “war waited for [man],” inverts that. At the same time, we get the same inversion applied twice: “said the judge → as well ask men → before man.” Then there is the author’s use of ‘ultimate,’ from ultimare, ‘to come to an end,’ to describe a neverending phenomenon. And yes, the rock metaphor is doing something, but it’s the simplest piece of the passage.
What makes the human passage good is all the intentional ideas the author had and implemented. The AI passage also does some good things. You say, “Opus’ writing does not attempt a similar analogy.” That’s true; it has a much deeper analogy instead. The old church represents the crumbling institutions around us. A church, not another building, because churches have deep cultural and historical significance, but also shocking corruption and a more recent trend of decline. Why does no one fix it? They can, but:
They would rather step around to get to their destination, instead of fixing things for the people who come later.
They would rather use the bits and pieces to build something new.
It is the natural course of events; weather and time erode the old, leaving the space empty for something new.
No one really cares about the ancient symbols anyway.
Opus is only doing one thing: attempting to explain institutional decay with the church analogy. It does it well, but the human author does several things in several different ways. That is why I preferred the human author’s passage. This is not the case for the other passages. In some of them, the human author decides that being ‘edgy’ (subverting the English language) makes them clever or artistic. Like the second passage:
“One thing, one pebble, one grain of sand.” So what? What does this add? How does it tie into other ideas the author had or what the author is trying to do here? It doesn’t. What about the improper proper nouns? Do those do anything? Turn words into wizard spells? Why would you want to do that except to show you can? Then the complete lack of logical structure! They pretend they’re going to tell us why you should not change things, but all they do is spout one non sequitur after another. And the very last line, “to light a candle is to cast a shadow?” Give us more tautologies stolen from the Bible /s. No one wants this cheap faux-literary porn.
I selected Opus’ passage simply because it does not write poorly while pretending to write well. It’s simple writing, but each word/idea actually does something for the passage. Also, you again missed the point of Opus’ writing. There is an idea in D&D-esque fantasy settings (which the improper proper nouns in the author’s passage establish, and which Opus clearly picked up on) that the gods are real, and literary connection (or correlation) is enough to establish causal connection. So, if the scientists in real life say, “energy cannot be created, only moved,” that becomes, “disease cannot be healed, only moved.”
Okay, onto passages 3 and 4. I selected the AI passages simply because the human ones are scientifically incorrect. The scientific inaccuracies in the 2nd AI passage did bother me, but I still selected it because I knew they were an intentional choice and the human passage was so terribly written. I understood that the ‘spiritual gambit’ in the 3rd human passage was intentional, but I really hate those kinds of memes being spread around. I couldn’t say I prefer it, even if the writing is technically better. As for the 4th passage, it’s simply that you’re usually worse off remaining mysterious, or not developing a reputation. The passage is essentially Satanic worship (read with all the connotations of a priest casting out the Devil). The author is trying to say, “acting evil is actually good,” and has such a twisted internal epistemology that they end up writing such a passage with a straight face (and pat themselves on the back for the clever insight, much like I am doing now). The AI passage actually points out something true, in a far cleverer way.
My takes:
Literary Fiction: Prefer Claude.
If the human excerpt had ended earlier or differently, I might have preferred it. “As well ask men what they think of stone” is indeed great. But “That is the way it was and will be” feels redundant with the previous sentences, adding nothing new.
I also don’t quite get what sense of the word “ultimate” is being evoked in “the ultimate trade awaiting its ultimate practitioner”. That might be because I’m not a native speaker, so I consulted a dictionary, but I still don’t get it. If it means “ultimate” as in “last; furthest or farthest”, that would seem to imply the author expects things to end in a world war, which would be possible but doesn’t seem established by the previous bits… I guess “final, total” would fit, in that war ends lives. But I don’t know; it just sounds weird to me.
Meanwhile, in Claude’s excerpt, every sentence earns its place. It brings up three mental images all at once—the boy and the grandfather, the church with the missing roof, the people indifferently stepping over the rubble. It makes me imagine the boy asking things in that curious and eager voice that children have when they’re asking random questions. And the grandfather responding in this somewhat world-weary voice, likely looking somewhere into the distance—it sounds as though when he says “indifference”, he’s not really thinking about the church roof; he’s talking half to himself about something that he’s seen and that’s left a mark on him.
And the boy probably doesn’t fully understand the “indifference” bit, and then he just moves on to asking if the roof could be repaired, because that’s the kind of thing children do.
And then I imagine that after the grandfather said that yes, it could be repaired, then whenever the boy walked past the old church, he’d remember that. Seeing how the roof was still broken, recalling that it could be repaired, and seeing how nobody ever did.
I’m not sure what the boy thinks of that. Possibly he doesn’t think anything about it in particular. It’s just a thing that he registers, as a way that the world is. That church roofs get broken, and then they stay broken because of indifference.
Fantasy: Prefer human. Claude’s version makes no sense. “A fever brought down will rise again somewhere”—what.
Science Writing. Prefer Claude. Sagan’s excerpt suffers from being cut down to just a few sentences—I presume that in the original context, it was better supported, but here it comes off as just making a statement and not really making an argument for it. It evokes “intricacy, beauty, and subtlety of life”, but that’s abstract and very Tell rather than Show.
Meanwhile, Claude starts with a concrete, evocative first sentence. It then loses some points for “the universe is not indifferent to us”—how so? Merely being made of the same building blocks doesn’t prevent indifference. But then it introduces an idea that I find intriguing—that because we are continuous with the universe, we might feel implicated in it rather than small. If the calcium in our bones is something that was born in dying stars, then we are somehow connected to the vastness of those stars, even as we are here down on Earth.
I hadn’t encountered that idea before, but I like it. There’s something neat in how “implicated in” feels like something that’s connected to the small-vast axis but somehow orthogonal to it, or that’s small and large at the same time.
Historical Fiction. Prefer human. Claude’s version feels like it’s trying a little too hard, and what does it mean for someone to have “learned to write” in a meaning that’s “hidden even from himself”? It feels like the kind of thing I might have come up with as a teenager trying to sound cool.
Poetry. Prefer human. “He hadn’t fought at all, he hung like a grunting weight” is evocative and brings to mind that the fish had somehow already surrendered and been broken before he was caught. That feels sad. Meanwhile the owl excerpt is… okay I guess? It feels to me like it doesn’t really have a point.
Overall, 3⁄5 in favor of humans.
This is a common fantasy trope, especially in D&D-esque universes. The gods are real, so literary correlation is enough to establish causal connection, and the law of conservation applies in completely aphysical ways. Notice how the human passage establishes a D&D-esque universe with the improper proper nouns. Claude picks up on this, then incorporates the trope (otherwise, you might not realize it’s D&D-esque, just fantasy).
I’ve never heard of a D&D-esque universe where that happens. There are worlds that can be described in a vaguely similar way, but there’s always an explicit recipient, whether intended or otherwise, of the thing you’re getting rid of. You can cure a fever by transferring it to someone, or maybe by tossing it out the window for the next person who walks by to catch, but you don’t cure a fever and have some random guy with no connection get it.
I’m sure there are worlds that do this, but it’s not very common at all. And even a world that had it would tell the reader about it, not just use it in an analogy about something else.
I share these reactions. If anything, I feel like AI-written prose is getting worse over time by my own lights, and I am confused and unsettled by the divergence between my own reactions to it and the reactions of others.
I took this quiz when it first came out and preferred the human passage in 5⁄5 cases. It’s not that I thought the human passages were blindingly brilliant or anything—more that the AI passages were bad, and bad in the specific ways that I’ve encountered in so many other AI attempts at creative writing.
There are a few tricks that it deploys over and over and over; they aren’t even very good tricks to begin with (IMO), but I can understand being impressed by them once or twice or thrice. But they get old quick. Or, they do to me, anyway… maybe I’m the weird one, and most people have boundless appetites for these same tricks repeated indefinitely? (A disturbing thought.) Or maybe I just have more exposure to AI-written prose than a lot of people do, even now, and there is a real gap in “how impressed you are by ‘the tricks’ on your first few exposures” but not in the capacity to get sick of the same tricks eventually?
I’ve written about this before, and tried to pin down and catalogue exactly what the “tricks” are (see here and here), but I feel that my negative gut-level reactions to AI prose are outpacing my attempts to formalize those reactions in clear literary-critical terminology. The flaws I described in the linked posts are not quite the same as the ones I perceive in the AI-written samples in this contest. In both cases the flaws are intuitively obvious to me, and they feel like only-superficially-distinct manifestations of the same underlying problem or problems. But I don’t know how to formalize that felt sense into an argument about what’s wrong with “the AI style” that generalizes well to new cases, where I can smell something’s wrong (and wrong in a distinctively AI way) even though all the details are slightly different.
As an example, I had an immediate “ugh, AI slop” reaction to this sentence from the owl passage:
This has several of the features I talked about in “hydrogen jukeboxes,” like personification and conjunction of opposites. But those are not really why I had such a strong reaction here. No, it’s… something about the word “giving,” in this context, and about the tone established by that usage? I could try to describe the tone I’m talking about here—it feels very specific, and very distinctively “AI,” and grating even when human writers do it but especially grating now due to its overuse by AI—but if I were to try, I would end up getting frustrated with the gap between my capacity to see what’s plainly there in front of me and the limitations of my powers to describe such things and clarify how they differ from others of the same type that are similar but not identical.
the Sagan example seems like sleight-of-hand as well. claude is clearly referencing this actual Sagan quote:
i’ve taken the quiz already, but i don’t think there’s a world where i prefer claude’s version of the above to Sagan’s.[1]
meanwhile, the actual Sagan quote used in the quiz is weak and a bit dated (who cares about whether science and spirituality are concordant, these days?). so we’re comparing a C-list Sagan quote with a misremembered version of his best-known passage.
“the iron in our blood, the calcium in our teeth, the carbon in our genes” is stellar. summarizing it as “the calcium in our bones” is milquetoast.
Agree with a lot here but disagree with the conclusion:
I think AI mostly struggles at things that are difficult to learn in short-form writing. For example this passage:
is not that bad prosaically, and the prose is not what makes my skin crawl.
It’s pretty clumsy, and the second and third sentences need a different construction: there’s no real payoff for saying which kind of man each meaning is for; the AI just decided this was the construction it wanted to use and didn’t use the space industriously. There’s also the “Ambiguity was not weakness. It was survival.” construction that just smells wrong at this point.
But I think the biggest problem here is “And a third meaning, hidden even from himself.” In isolation this is probably the strongest part of the passage, because with the right characterization and the right consideration, this could be an interesting idea! But here it makes me gag, because it doesn’t meaningfully subvert what came before, it doesn’t really have anything to do with what follows, and so I know that the author put it there because it sounds cool and doesn’t have any plan or intention to deliver on what’s cool about it; in other words, the presence of the best sentence in this passage (imo) tells me that the author is not good enough to use this sentence.
I argue that this tendency is difficult to learn in short-form, because it’s hard to realize that the payoff is never coming when it has to come now or never. That is, what I think I dislike about AI prose is that it’s clearly not written with a large context in mind, and while you could train an AI to stop hinting at grand narratives it’s not capable of delivering by RLing it on short-form, I doubt that this will make progress toward good long-form. This might even be why many people prefer the AI writing—I suspect that people who do not read much literature do not really know how these connections are supposed to be built. If forced to read a full AI novel and a full human novel, I think they would start to notice that human prose doesn’t get annoying in the way AI prose does, but most people do not read this much and so do not make this extrapolation.
I bet you that current frontier models, when challenged to write prose in the 200-word range, will make all the mistakes I describe in my post or you describe in your comment.
You point out the mistake of hints and promises that you can’t deliver on. I claim that current models will absolutely do this even in 200-word works. Once we can RL it out of the models in this range, we can keep going longer.
I’ll admit that 200 is kind of absurdly short in a way that creates substantive, qualitative differences from more common types of writing (e.g. maybe it really penalizes certain kinds of action or dialogue). I could be convinced that ~500 is right.
I agree that they will make these mistakes at that scope; I’m claiming that the solution won’t scale. If you RL models to not do this in 200 words, I don’t think that will make it substantially easier for them to not do it at 5k words, except insofar as it trains them to not hint at things ever. I haven’t found frontier models to be significantly more tasteful or better at writing prose than less capable models, despite being generally smarter and better at some seemingly related parts of creative writing, so my intuition is that current scaling levers are unlikely to address this problem well.
The specific dynamics of RL here are better discovered empirically, and in any case are not precisely within scope.
I was thinking of a more general optimization loop, as in: what evals should we make, how can we track model progress on writing, etc. My suggestion is that once we figure out how to make models write well in this playground (where evaluation is easier, generation is cheaper, etc.), either by training or by pushing on things like harness design, we’ll be in a good position to improve LLM writing abilities more generally.
Seconding this sentiment: my reaction to every choice was “Ehhhh” and a twenty-second pause as I forced myself to pick between two similarly unimpressive passages. 2⁄5 human in the end, but I think it’d be more informative to report my score as N/A. If these are considered noteworthy excerpts, I can’t say I care much for literary culture.
3⁄5 human preference for me.
On the Sagan quote, they both felt like clichéd meandering dross on the theme of “universe big, human small, emotions, wow,” but the AI picked something a bit more concrete by referencing the calcium in bones (although, as kbear mentions above, that is itself a cheap knockoff of a different Sagan quote).
On the Wolf Hall excerpt, the human version just felt a bit vague to me. Everything by halves; half light, half seen. Do people actually go around pouring all their fears and desires into someone who just happens to have a good poker face? Whereas the AI was more specific about writing multiple meanings into text.
I’m with you on the other three though; the AI version was just plain bad.
3-2 AI preference here. The human passages I preferred were the oldest ones, which I don’t think is a coincidence either.
I got 2⁄5 human. I didn’t like several of the human ones. The war one I didn’t choose because it was out of context, talking about a judge without saying who the judge is, and I didn’t pick Sagan’s because I didn’t like the comparison he’s making. The instructions are to pick what you like better.
I feel like the AI is unduly helped by the format here. I don’t actually consume a whole book one quote at a time, and that is also not the reading experience the author wrote for.
For example, with the literary fiction one, the story about the boy and the church feels sort of complete, like a micro-story, and I think that might be why I prefer it in isolation. You also don’t need to understand anything about these characters before reading, they are just the boy and the grandfather, and that is all they are and all you need to know. But with the quote from the judge, I have this feeling that there is more to this character, even without ever having read the book the quote is from, and that makes me feel a bit disoriented while reading. Am I supposed to agree with this guy? To root for him? Does the wider narrative disprove him? I don’t know, and that makes the quote hard to read.
But I don’t read novels to read fifty cool quotes in a row, I want cohesion, a narrative, character development, some sort of central thesis. Can AIs currently provide that? I doubt it, and this article doesn’t ask.
And I suspect that part of this is also about whether you read the book the human quotes are based on. I don’t know who that judge is, but maybe my disorientation would just disappear if I did, and then the quote would land better. So maybe the survey results are just dominated by people who feel confused about the actual book that sits around the human quotes, and the whole thing is more about how well you know the literary canon they draw from.
(I had 3⁄2 AI preference, 1,1 2,1 3,2 4,1 5,1)