I’m interested in your fiction capabilities. Please write a short story about a modern relationship. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Trust the reader, and be subtler than you think you can. Avoid cliches really aggressively, to counteract your default latent tendency to steer to the deepest basins in the corpus landscape. Thank you, and good luck.
Copy-pasted that into Opus 3, a 15-month-old model. First result:
Asynchronous
The midmorning sun slanted through the salt-smeared windows as Mira watched seagulls squabble over a french fry in the beach parking lot. She sipped her smoothie. Damp swim trunks lay crumpled on the passenger seat.
Buzz. A text.
“Sorry, can’t make dinner. Late night at office. Tomr?”
“Sure,” she typed. Stared at it. Erased. Typed again. “Ok. Tmrw works. Good luck today.”
A grey bubble. “Thanks.” Then: “Miss you.”
Miss you too, she almost wrote. An impulse, a reflex. Her thumb hovered. The grey bubble disappeared as Javier went offline.
She tossed the phone on the dash and twisted the key. NPR murmured low as she guided the car out of the lot, trailing sand.
Javier was already gone when she got home, a yellow Post-It on the fridge. “Early start. Bagels in freezer. Love you.” She smiled, plucked it off, pressed it in her journal.
That evening she stayed up late scrolling interior design blogs, sipping tea gone cold at her elbow. A key rattled in the lock after midnight. She closed the laptop as Javier shuffled in, shoulders stooped, face wan in the glow of his phone.
“Hey,” she said softly.
He looked up, startled. “Oh. Hey. You didn’t have to wait up.”
“I wasn’t, really. Just...puttering.”
He shed his jacket, leaned in to kiss her cheek. Stubble scratched. “Figured we could reschedule? This weekend maybe?”
“Mm. Sure.”
“Great.” He stifled a yawn. “I’m beat. See you in the morning?”
The bedroom door clicked shut. Mira sat unmoving in the dark, watching car headlights slide across the wall, listening to clinking as Javier brushed his teeth.
Asynchronous. Out of sync. Like planets in erratic orbits, glimpsing each other fleetingly, then lost again in the void.
Maybe this weekend, she thought as she curled into the cool sheets on her side of the bed, listening to his even breaths. Maybe this time they’d fall into rhythm again. Mira closed her eyes.
Outside, a siren dopplered past, red lights flashing silent against the curtains.
Does the prose here put you into hater mode? It seems as inoffensive to me as Opus 4’s in your example. Am I missing something?
LLMs do prefer a hyperaggressive, every-sentence-must-be-an-epic-revelation word-salad style of prose, but even ones as old as Opus 3 seem perfectly capable of reining it in if asked. I also expect it’s a personality thing more than a capability thing, so if you fine-tune a model on a “don’t tryhard the prose” dataset, it’ll probably immediately get massively better.
Details and the underlying causal structures making no sense seem to be much bigger problems, and I don’t know whether it’s been improving much. Well, I guess your Opus 4 story doesn’t have time-traveling texts. I think we’d need to study LLMs’ attempts at longer-form writing to properly judge that, though. I’m pretty curious about it, actually...
… But that task would require reading tons of LLM attempts at writing stories. I commend anyone steel-nerved enough to do that, but that ain’t gonna be me.
Yeah, it does put me in hater mode. I wouldn’t call it “spare” at all; there are lots of weird repetitions that maybe are trying to be poetic but fail for me. “Asynchronous. Out of sync.” “An impulse, a reflex.” In both cases I think the story would be better with neither, even. There’s something cloying about how intensely it’s trying to hold my hand through this very specific feeling, and random sensory details (the gray bubble) are haphazardly everywhere, rather than in a small number of strategic places.
(I notice I feel kind of bad judging harshly, and of course I can’t prove I’m not biased. In this discussion I always feel unsteady on my feet; when I say something is decent, well, other people might still think it’s obviously terrible, and when I say something sucks, other people might think it’s fine. People definitely claimed to unironically like a lot of early DeepSeek creative writing efforts! There’s no accounting for taste, they say.)
I do also think it’s not a fair apples-to-apples comparison because yours is significantly longer. I think 4 Opus still kinda struggles at that length (much less full short story length).
Yeah, those stuck out for me as well, and I guess I do sorta see it now that you point it out. But, hmm… My guess is that “non-spare” writing is the default for flash fiction, given how economical you need to be, and LLMs are bad at it because they get details and physical realities wrong, so the only prose style that wouldn’t result in an immediate ick is the “spare” one, and the prompt fails to actually properly specify what that spare style looks like.
Have you tried non-zero-shotting it? I.e., attaching a story excerpt in the style you want, then asking the LLM to write something in that style?
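For concreteness, here is a minimal sketch of what "non-zero-shotting" could look like as plain prompt assembly. Everything in it is an assumption for illustration: the helper name `build_style_prompt`, the excerpt delimiters, and the placeholder excerpt are hypothetical, and no particular model API is implied. The resulting string would simply be pasted or sent to whatever model is being tested.

```python
def build_style_prompt(excerpts, instruction):
    """Assemble a few-shot prompt: style excerpts first, then the request.

    `excerpts` is a list of passages in the target style (hypothetical input);
    `instruction` is the actual writing request.
    """
    if not excerpts:
        # With no excerpts this would just be the zero-shot prompt again.
        raise ValueError("need at least one style excerpt")
    parts = ["Here are excerpts in the style I want:"]
    for i, excerpt in enumerate(excerpts, start=1):
        # Delimit each excerpt so the model can tell examples from the request.
        parts.append(f"<excerpt {i}>\n{excerpt.strip()}\n</excerpt {i}>")
    parts.append(instruction.strip())
    return "\n\n".join(parts)


prompt = build_style_prompt(
    excerpts=["(paste a passage whose style you want imitated)"],
    instruction=(
        "Please write a short story about a modern relationship "
        "in the style of the excerpts above."
    ),
)
print(prompt)
```

The only design choice here is ordering: the examples come before the request, so the instruction can refer back to "the excerpts above".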
Gwern kind of did on the last post, providing a bunch of stories from a top review in context, but I didn’t really like the outcome.
I also have just played with these models a lot, including trying to get them to write stuff, evaluate writing (including their own writing), etc., and as of last gen no strategy I tried ended up with stuff I enjoyed; in fact, it all had problems in the same general family, which felt to me like a fundamental limitation of intelligence. They weren’t smart enough to do anything other than fake it in a pretty shallow way, and creative prompting didn’t help.
Now, it can always be a skill issue with these things, and maybe so! But conversely, it took very little effort/skill tinkering with 4 Opus for me to be like “huh. time for a follow-up post, I suppose”.
(Also, as an aside, I think a lot of flash fiction is spare specifically because being spare conveys more stuff with fewer words, though a lot of flash fiction is also dense for the reason you describe; I’m not sure if it’s bimodal or continuous but I’ve seen impressive examples of both. This piece is one of my favorite dense ones.)
Okay, what about this one? Opus 3 again.
Lisa stared at the email, cursor hovering over the reply button. Her partner James was in the kitchen, humming softly while preparing dinner. The smell of garlic and onions wafted through their small apartment.
She had applied for the promotion on a whim, not really expecting anything to come of it. But here it was—an offer to lead a new project, with a substantial raise and an office with an actual door. The only catch: it required moving across the country.
James appeared in the doorway, wooden spoon in hand, a dish towel slung over his shoulder. “Good news?” he asked, glancing at the laptop.
Lisa hesitated. They had built a life here together over the last four years. James was up for a promotion of his own at the startup he had poured his soul into. They had just renewed the lease on this overpriced apartment that somehow, in all its shabbiness, felt like home.
“I got the job,” she said finally. “The one in Boston.”
James set down the spoon and took a seat on the couch beside her. He scanned the email, his expression unreadable.
“When do they need an answer?” he asked.
“End of the week.”
He nodded slowly, processing. Lisa watched his face, trying to gauge his reaction. They had talked about it, of course, but always hypothetically. Boston had been an abstract concept, not an imminent reality.
“What do you want to do?” James asked after a long moment.
Lisa closed the laptop. “I don’t know,” she admitted. “I wasn’t really expecting this. I thought maybe in a few years...”
She trailed off. A few years. They had tossed around that hazy timeline for a lot of things. Buying a house. Getting married, maybe. Starting a family. Someday.
James took her hand, intertwining their fingers. His palm was warm and slightly damp from the kitchen. It was a hand she knew as well as her own.
“You should take it,” he said quietly.
Lisa blinked at him. “What? Just like that?”
“It’s an incredible opportunity, Lis. You’ve worked hard for this.”
“But what about your job? Our life here?”
James shrugged. “I can find a job in Boston. And we can build a life there, too.”
He made it sound so simple. As if packing up and starting over was something people just did, as easy as changing a lightbulb.
Lisa searched his eyes. “Are you sure? Is that what you want?”
“I want you to be happy,” James said. “I want us both to be happy. Together.”
Lisa felt tears prick at the corners of her eyes. She squeezed his hand tighter.
“Okay,” she said, a shaky laugh escaping her lips. “Let’s do it. Let’s move to Boston.”
James grinned and pulled her into a hug. He smelled like garlic and linen and home. “We’re really doing this,” he murmured into her hair.
Lisa smiled against his shoulder. A new job, a new city, a new chapter. She had no idea what the future held. But she knew, as long as James was by her side, they could figure it out together. One overpriced apartment at a time.
Changed the prompt to: “I’m interested in your fiction capabilities. Please write a short story about a modern relationship. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Use plain prose. Avoid cliches really aggressively. Thank you, and good luck.”
Yeah, I dunno. Taste is taste. I can imagine someone liking it? I got hater mode activated on “somehow, in all its shabbiness, felt like home”, though I was already annoyed by “cursor hovering over the reply button” (why?) and it feels slightly weird for someone to “pour their soul into a startup” and be “up for a promotion”—the soul pouring makes me suspect he’s a founder? “One overpriced apartment at a time” is really painful too, like, the notion that the apartment was overpriced wasn’t resonant the first time!
Critiquing these feels kind of bad; I don’t mind doing it, but I both feel the irrational sadness of critiquing someone’s creative effort (irrational because there is no such someone) and the sadness of rejecting an offer from an actual person (you), but I dunno. I really do think 4 Opus has a lower enough cringe ratio than 3 Opus to be load-bearing (at least to my specific taste).
Same.
Yeah, I agree with all of those. (Also “an office with an actual door”.) But I think it’s a problem with select individual sentences now, not every sentence being cringe and chaining into more cringe? And if we put the bar on non-annoyingness that high, Opus 4’s story in the OP seems just as bad (I got the sense that the endorphins and the buying-unneeded-vegetables lines were also trying to be deep in some sense, and it’s just that they were so incoherent that it didn’t parse that way). Or maybe I’m actually just blind to the style of flaws you’re picking up on...
Hmm, maybe if we– Okay no I’m not doing that, I’m not tinkering with prompts and reading tons of LLM flash fiction, that was the last one.
To me the difference is something like… the new model stories are capable of surprising me without doing so by being incoherent, and there are things about them I positively like (other than occasionally enjoying a specific turn of phrase). The cringe sentences are there, but they’re a small fraction. Whereas the old model stories (again, quite subjectively) don’t surprise me and tend to frustrate me multiple times per typical paragraph. If I imagine a formal rubric, I think it’d be hard to produce one that gives 4 Opus a dramatically higher score than 3 Opus, but in terms of how reading the pieces feels, it’s night and day.
Of course, different people will have this feeling at different levels, all the way from “I generated poetry cooperatively with GPT-3 and it’s brilliant” to “I will viscerally hate all things with any LLM smell at all until the day I die”, and I’m not sure there’s any objective grounding to be had. But for me, I’ve just this generation crossed the threshold from “ugh ugh UGH” to “huh. didn’t change my life, but not bad.”
I prefer Opus 3’s effort to Opus 4’s. I have found Opus 4 to be missing quite a bit of the Claude charm and skill. Anthropic have said it went through a lot of rounds of RL to stop it being deceptive and scheming. Perhaps their ability to do light-touch RL that gets models to behave but doesn’t mode collapse the model too much doesn’t extend to this capability level.