I’ve found that Pangram Labs produces false negatives when a panel of AIs revises a story in a loop, even without adversarial prompting, as shown here or here. Out of six earlier stories I checked, none was detected as being more than 55% AI-written.
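The loop I mean is roughly the following (a minimal sketch; the function names, panel makeup, and round count here are stand-ins for illustration, not the exact pipeline):

```python
def revise(story: str, critic: str) -> str:
    # Stand-in for an LLM call: in the real pipeline each panel member
    # returns an edited draft plus ratings and critique.
    return story + f" [revised per {critic}]"

def panel_revision_loop(story: str, panel: list[str], rounds: int) -> str:
    # Each round, every panel member proposes edits in turn. Note that no
    # "evade the detector" instruction is ever given; the false negatives
    # emerge from ordinary iterative revision alone.
    for _ in range(rounds):
        for critic in panel:
            story = revise(story, critic)
    return story
```

The point is just the structure: many small edits by different models, compounding over rounds, with no adversarial objective anywhere.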
Separate point:
On the second link you say “I present 6 stories that are the pinnacle of AI short-story writing in 2/2026, close to best possible today. Each story is the result of 100s of edits, ratings, comparisons, and debates by a panel of top LLMs, and is highly rated by other LLMs that were not involved.” Do you think these stories are actually good or the best that AI can do? These stories are super LLM-y in a bad way; I can elaborate on this if you want to talk about it.
Later in the thread you say “The basis stories had to integrate 10 required elements, which is very difficult and almost never leads to stories a human would enjoy. This is more about refining the content and style within those initial limitations and AI rating works fine for that.”
So what do you even mean when you say the stories are close to the best possible? Is it just that LLMs rate them highly? That’s not what most people mean when they talk about stories being good.
Yes, the idea was that these are the kinds of stories LLMs treat as the pinnacle. I found that LLMs don’t actually rate an unconstrained story more highly, but I think that clarification was needed, which is also why I had them create the almost-unconstrained story I linked.
I put the story in your first link “Real Time The Sunday before Claire turns sixty-five...” into Pangram and got 62% AI.
My current way of interpreting these numbers is: if the score is anything above 0% AI, treat the text as very possibly fully AI-written. Also consider that if someone is using AI to write part of a piece, they’re probably using AI either to rewrite what they wrote or to write the whole piece. I think a process like “I have a piece I want to write, I write 5 paragraphs I’m super happy with, I ask AI to write the last 5 paragraphs and don’t have it rework my first 5 paragraphs” is very uncommon.
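To make the heuristic concrete, here is a tiny sketch of the decision rule I’m describing (the function name and labels are mine, not anything Pangram provides):

```python
def interpret_score(ai_fraction: float) -> str:
    """Heuristic reading of a detector score, where ai_fraction is the
    reported fraction of the text judged AI-written (0.0 to 1.0)."""
    # Any nonzero score suggests AI involvement, and partial AI use in
    # practice usually means the AI rewrote or produced the whole piece,
    # so a nonzero score is treated as "possibly fully AI-written".
    if ai_fraction > 0.0:
        return "very possibly fully AI-written"
    return "no AI detected"
```

So under this rule the 62% result above and, say, a 5% result get the same reading.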
With a paid account, you can see which parts it considers human-written and which AI-written. For the story in the first link, the score falls below 50% once the em dashes are replaced; it flags the first half as AI-written and the rest as human-written. One of the stories in the second link is rated 100% human-written after more iterations.