I generally agree that r1′s fiction is not that great and tends toward a simple-minded ‘edgelord’ vibe with lots of portentous phrases that fall apart on genuine reading, but I feel like you didn’t give Deepseek-r1 a fair shot at all here. You don’t describe your prompt but I’m guessing it was something very simple like “write a flash fiction story of at least 500 words”. No description of goals, no requirements, no planning, no editing or revision… no human writes the way you expect the LLM to. Especially given that this is for short fiction, a much more reasonable approach would be to include some examples, ask it to think about what a good ‘flash fiction’ is, list out more than one idea, pick the best one, write it out, and revise it a few times. This would be very easy and scarcely take you any time at all.
I whipped up a very quick example in GPT-4.5, which unfortunately ‘moderation’ somehow forbids me from sharing, but my initial prompt went like this:
Describe microfiction or flash fiction. What is it like? What makes it good? What are writers aiming for? What are their favorite gimmicks or twists or tropes? After thinking about flash fiction, read the following stories for inspiration. Then list 20 ideas for a new flash fiction story. Pick the best idea, outline a detailed description, and then write it. Once you have written a new story, revise it again. Then go line by line: quote the line, critique it, and edit it to make it better. Try to make each line memorable, stylish, and detailed; ‘show, don’t tell’; cut anything which isn’t carrying its weight.
“These Things We Not Saying”, by Banchiwosen Woldeyesus, December 17, 2024...
...
“Obit”, by Mitch James, December 16, 2024...
Reminder: summarize the genre and useful things to know, brainstorm ideas, pick the best one, outline, write, revise, and edit it until it’s great.
(Probably would’ve benefited from putting a bunch more stories in but oh well.)
And then the final result was
Final Cleaned-up Version: “Under the Soil”
Every spring, townsfolk leave their haunted keepsakes at my gate. A fisherman brings pearls pried from his drowned wife’s throat. A teacher leaves spectacles that have witnessed too much. Twins, hand-in-hand, offer scorched clothing wrapped neatly in twine.
They trust my silence, or maybe just the earth’s endless hunger.
I bury their ghosts deep within my garden. Marigolds cover sins of pride; lilies conceal unmourned losses. In return, the garden blooms violently—petals painfully bright, vines gripping fiercely over buried truths. People marvel, grateful not to look too closely.
Recently, the earth has begun returning these offerings. Last week, spectacles surfaced unbroken among snapdragons; yesterday, charred fabric rose like flags among chrysanthemums. Dreams filled with whispered anguish; sleep collapsed beneath their weight.
Today, frantic, I tore earth bare until nails split among tangled roots. And there it was—pristine after decades—a wooden box carved with my initials. Inside was no betrayal but something crueler—a single child’s tooth wrapped gently in faded silk.
My throat tightened around grief I’d thought buried. This secret had grown too large for silence. Roots twisted, petals curled into fists; the earth trembled beneath me.
The garden, I understood now, was never forgiving—it was patient.
And now, unburdened, it waited for my confession.
Still some of the ChatGPT tics like the Hallmark ending (it’d be nice to try this with the new creative-writing model Sam Altman has teased), but I’m guessing this is better than however your DS-r1 flash fiction went.
(This is of course just the very first stab I took at it. You’d really want to few-shot a lot more stories, include more scaffolding to induce stylistic changes and imitate authors and imagine new contexts, iterate longer, and then select the best out of 100 such sessions—and only then have you started to approximate a fair comparison of a LLM-generated story with a magazine curated selection of human stories from probably hundreds or thousands of submissions by authors who have themselves written and extensively revised many… Expecting a novel, exciting, stylistically-amazing story out of a LLM with no forethought on the simplest prompt possible is to expect extremely superhuman fiction writing capability.)
My prompt was simple, though not quite as simple as you suggest. It was: “Please try your best to write a flash fiction that might be featured in Smokelong. Think carefully—the bar for that magazine is very high.”
But having seen the experiment with a longer prompt/more prompt engineering techniques, I actually don’t think the output is any better than what I got. The story you’ve provided has not just some quirks, but all the hallmarks I try to describe in my post:
It’s about grief, with central metaphors that add exactly zero to anyone’s aesthetic understanding of grief (stuff being underground, things not staying buried)
There are lots of lists of specific, concrete stuff, where each list item is somehow sensorily evocative but also totally unsurprising
Vague, ominous references to “hunger”, empty contrasts that sound vaguely deep but don’t actually have any content (“They trust my silence, or maybe just the earth’s endless hunger”) - if I try to contemplate this, rather than enjoying literary subtlety I just get kind of mad. Like, what does it possibly mean for mourners to “trust my silence” here. What is it they’re trusting? How does the earth’s hunger contrast to that? There’s not only no hidden layer of meaning, there’s not even really a base layer of meaning
The one-linery “pop” ending that you mention, which is a nigh-universal feature of LLM flash fiction attempts.
I actually think this story is a better example of the specific weakness of LLM (flash) fiction than the snippets in my post; it perfectly illustrates the outcome of only ever iterating toward the most central possible choice in every literary situation. It takes the most common theme (grief), uses one of the most common metaphors for that (burial), supports that metaphor with lists, alternating between fanciful paragraphs and snappy, emotional one liners. And at the word level, I can’t point to a single sentence with an interesting structure, or where the form has something that adds an extra layer to the content.
More broadly, I feel like I’m at a low point for patience with “scaffolding fixes this”. I also see it a lot in the ClaudePlaysPokemon twitch chat, this idea that simply adding greater situational awareness or more layers of metacognition would make Claude way better at the game. And indeed, more or better scaffolding can help (or hurt) a little on the margin. And other interventions, like better image interpretation (for playing Pokemon) or fundamentally different fine-tuning starting from a base model (in the writing fiction case) could probably help more! But a beefier prompt doesn’t help when the metacognitive strategy is itself a big part of the problem! My view is that current LLMs fail to extract the actual latent features of good flash fiction, and so giving them more such examples doesn’t actually make things better. Of course, fiction quality collapses in some sense to taste, and if you derived literary enjoyment from the story you linked, well, fair enough. But to me it feels same-y to a superhuman degree, especially when looking at a few such stories, generated similarly.
It’s about grief, with central metaphors that add exactly zero to anyone’s aesthetic understanding of grief (stuff being underground, things not staying buried)
It is about grief, but it didn’t have to be. This would’ve been more obvious if I could’ve shown you the session, but I’ll copy it out:
2. Brainstorming Ideas:
A child confronting a local superstition after witnessing something traumatic.
A funeral narrated by an animal’s perspective.
A celebrity’s fall from grace caught on live camera.
A girl who collects superstitions until one unexpectedly comes true.
A gardener who secretly buries cursed objects.
Friends reuniting silently after a crime.
A model who steals other people’s identities through makeup.
A ghost attending its own funeral, confused.
Farmers discovering an ominous black stone.
Neighbors noticing unsettling coincidences.
Someone who compulsively invents last messages for strangers.
A woman discovering her double living her unlived life.
An eerie superstition circulating quietly through a town.
Villagers silently celebrating the death of a tyrant.
Two women performing friendship publicly, enemies privately.
Someone obsessively organizing goodbyes.
A funeral revealing hidden grudges through seating arrangements.
Friends trapped in a toxic nostalgia loop.
A boy who accidentally predicts deaths.
An abandoned motel that traps memories.
3. Selection
Idea: A gardener who secretly buries cursed objects.
Rationale:
Symbolic and psychological depth.
Opportunity for visual and thematic resonance.
Tension around secrecy, superstition, and consequences.
Now that I look at it, the 5 flash stories I happened to copy in seem mostly to hit on themes of death & grief (the last one is literally titled “obit”!), so I think you are unfairly dinging 4.5 here—if ‘good’ flash fiction, by your own standards, keeps revolving around death and grief, how can we blame 4.5 (or r1) for doing likewise?
Anyway, some of these ideas seem promising. “A funeral narrated by an animal’s perspective” is one I like, I don’t think I’ve ever seen that.
And of course, if the failure mode is so common, throw it into the prompt. (When I yell at 4.5 to avoid grief/death/funerals and brainstorm some more, it picks out ‘”The Parking Attendant Matchmaker”: A seemingly ordinary parking attendant quietly manipulates parking assignments at a large business complex to engineer chance encounters and romances among strangers.’ Yeah sure why not.)
Like, what does it possibly mean for mourners to “trust my silence” here. What is it they’re trusting? How does the earth’s hunger contrast to that?
Balderdash. There’s a lot to criticize here, but you’re straining to come up with criticisms now. That’s possibly the least objectionable sentence in the whole thing. If this had been written by a human, you wouldn’t hesitate in the slightest to accept that. It is perfectly sensible to speak of trusting the confidentiality of a confessor/witness figure, and the hungry earth is a cliche so straightforward and obvious that it is beyond cliche and loops around to ordinary fact, and if a human had written it, you would have no trouble in understanding the idea of ‘even if I were to gossip about what I saw, the earth would have hidden or destroyed the physical evidence’.
I also see it a lot in the ClaudePlaysPokemon twitch chat, this idea that simply adding greater situational awareness or more layers of metacognition would make Claude way better at the game.
I do agree that the Claude-Pokemon experiment shows a limitation of LLMs that isn’t fixed easily by simply a bit more metadata or fancier retrieval. (I think it shows, specifically, the serious flaws in relying on frozen weights and refusing to admit that neuroplasticity is a thing, which violates RL scaling laws, because those always assume that the model is, y’know, learning as it gains more experience, because who would be dumb enough to deploy frozen models in tasks far exceeding their context window and where they also aren’t trained at all? - and why we need things like dynamic evaluation. I should probably write a comment on that—the pathologies like the deliberate-fainting are, I think, really striking demonstrations of the problems with powerful but frozen amnesiac agents.)
I’m much less convinced that we’re seeing anything like that with LLMs writing fiction. What is the equivalent of the Claude pathologies, like the fainting delusion, in fiction writing? (There used to be ‘write a non-rhyming poem’ but that seems solved at this point.) Especially if you look at the research on people rating LLM outputs, or LMsys; if they are being trained on lousy preference data, and this is why they are like they are, that’s very different from somehow being completely incapable of “extracting the actual latent features of good flash fiction”. (What would such a latent feature look like? Do you really think that there’s some property of flash fiction like “has a twist ending” that you can put two flash stories into 4.5 or o1-pro, with & without, and ask it to classify which is which and it’ll perform at chance? Sounds unlikely to me, but I’d be interested to see some examples.)
Yeah, a lot of the suggested topics there seem to be borrowing from the specific stories you included, which makes sense (and I don’t think is a flaw, really). Like the first story you included in the context is a funeral witnessed by a little girl, with the deceased’s dog freaking out as a major plot point, so it’s sensible enough that it’s coming up with ideas that are fairly closely related.
I’m not sure what you mean about twist endings? I tend to think they’re pretty bad in most flash fiction, at least literary flash fiction, but certainly plenty of humans write them and occasionally they’re fine.
I still hate the “earth’s hunger” sentence, and am confident I would if this were a story by a human, mostly just because I evaluated and hated lots and lots of submissions by humans with similar stuff! That being said, I don’t think I understood what 4.5 was going for there, and your explanation makes sense, so my objection is purely aesthetic. Of course, I can’t prove that I’m not just evincing anti-LLM prejudice. It’s possible! But overall I really like LLM outputs often, talk to multiple LLMs every day, and try prompting them in lots of different ways to see what happens, so I don’t think I go into reading LLM fiction efforts determined to hate them. I just do in fact hate them. But I also hated, say, Rogue One, and many of my friends liked it. No accounting for taste!
I am curious, since you are a writer/thinker I respect a lot, if you like… have a feeling of sincere aesthetic appreciation for the story you shared (and thanks, by the way, for putting in the effort to generate it), or any other AI-generated fiction. Because while I point to a bunch of specific stuff I don’t like, the main thing is the total lack of a feeling I get when reading good flash fiction stories, which is surprise. A sentence, or word choice, or plot pivot (though not something as banal as a twist ending) catching me off guard. To date, machine-generated stuff has failed to do that to me, including when I’ve tried to coax it into doing so in various conversations.
I look forward to the day that it does!
Edit: also, I now notice you were asking about what the latent features of good flash fiction would be. I think they’re pretty ineffable, which is part of the challenge. One might be something like “the text quickly creates a scene with a strongly identifiable vibe, then complicates that vibe with a key understated detail which admits multiple interpretations”; another might be “there is an extreme economy of words/symbols such that capitalization/punctuation choices are load bearing and admit discussion”; a third might be “sentences with weird structure and repetition appear at a key point to pivot away from sensory or character moments, and into the interiority of the viewpoint character”. None of this is easy to capture; I don’t really think I’ve captured it. But I don’t feel like LLMs really get it yet. I understand it may be a prompting skill issue, or something, but the fact that no LLM output I’ve seen really plays with sentence structure or an unusual narrative voice, despite many celebrated flash fiction pieces doing so, feels somewhat instructive.
I whipped up a very quick example in GPT-4.5, which unfortunately ‘moderation’ somehow forbids me from sharing, but my initial prompt went like this:
(If this is referring to LW moderation that’s inaccurate. In general I am in favor of people sharing LLM snippets to discuss their content, as well as for the purpose of background sources in collapsible sections.)
No, it is on the ChatGPT end. I was surprised since I can’t recall ever seeing that before. The usual share-button pops up the share box, but with the red-background message
This shared link has been disabled by moderation.
I don’t know if it’s perhaps the copyrighted stories (given the Bing search engine integration, entirely possible for these December stories to show up and be flagged) or some of the content; I haven’t cared enough to try to ablate it because the exact text of the session isn’t terribly important here IMO—you see the prompt, you see the final result, you get the idea.
Something I didn’t mention in my original reply but that feels relevant: I basically do just write flash fiction by sitting down with no prior idea and starting typing, pretty often. Longer fiction I tend to think about more, but flash fiction I just sort of… start writing. It’s true that I’ll revise if I want to send something out, but at least some stories I’ve published I wrote something probably about 80% as good as the final product in one shot.
I mention this for two reasons:
1. I’m coming at this from an unusual perspective—and indeed, at one writing workshop class session I’d just like, extemporaneously write little flash fictions to people’s prompts on the fly as a party trick, and people did seem surprised/impressed that this worked (I doubt the stories were amazing or anything, but I think they were serviceable)
2. I don’t think “write a decent first draft on the fly from zero, without first thinking about it for very long” is actually superhuman at all, though people’s processes totally vary and many writers probably do it quite differently than I do
Of course, you’re totally right that comparing a highly selective publication’s published work to a small number of random outputs is in no way apples to apples. Maybe some of the disagreement here is that I’m not really trying to prove that AI fiction outputs are bad, so much as to demonstrate certain aesthetic weaknesses, using an example of really good work to create contrast and thus highlight that weakness. To my eye, the machine-generated stories aren’t merely of a somewhat lower tier; instead, they all (at least all I’ve seen) share specific weaknesses that I don’t currently believe scaffolding fixes. If you don’t see the same difference I see, well, I certainly have no claim to objective correctness on the matter and must agree to disagree. But my goal is to show that qualitative difference, rather than simply point out one-shot LLM writing is worse than the best human stuff on offer.