I infer you mean “the claim that LLM writing is a slog to get through.”
I mean centrally the claim ‘to most human beings, AI prose is something sus. If you use AI to write something, people will know. Not everyone, but the people paying attention, who aren’t newcomers or distracted or intoxicated.’ I should have been clearer about that.
It’s specifically that factual claim that I think tends to be strongly self-reinforcing. If you don’t have a way to identify false positives and false negatives, there’s no way to get evidence that updates you against that claim.
I’m intending to make a broad epistemic claim here that’s not specific to LLM detection. One case that really drove this point home for me personally was the great ‘do mp3s sound worse?’ debate a number of years ago. Many people were utterly confident that they could hear the difference between mp3s and uncompressed files—and of course they could in some cases, when the mp3s were poor quality! But in the many cases where they couldn’t, they assumed they were hearing uncompressed files, and so it didn’t change their minds at all.
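To make that failure mode concrete, here’s a minimal Python sketch (all the rates are made up for illustration, not measured): a judge only ever gets apparent confirmation on the texts they flag, while AI text that passes unnoticed generates no feedback at all.

```python
import random

random.seed(0)

N = 10_000
# Half the texts are AI-written, half human-written.
texts = ["ai"] * (N // 2) + ["human"] * (N // 2)

# Illustrative, made-up rates (not measurements).
P_TELL_GIVEN_AI = 0.30     # fraction of AI texts with obvious tells
P_FLAG_GIVEN_HUMAN = 0.05  # fraction of human texts wrongly flagged

flags = hits = 0
for source in texts:
    if source == "ai" and random.random() < P_TELL_GIVEN_AI:
        flags += 1   # flagged, and really is AI: feels like confirmation
        hits += 1
    elif source == "human" and random.random() < P_FLAG_GIVEN_HUMAN:
        flags += 1   # false positive, and rarely discovered as such
    # AI text without tells passes silently; the judge never learns of it

print(f"perceived precision (what the judge experiences): {hits / flags:.2f}")
print(f"actual recall on AI text (what a blind test measures): {hits / (N // 2):.2f}")
```

With these numbers, roughly 85% of the judge’s flags are correct, so everyday experience seems to keep confirming the skill, while about 70% of the AI text sails past unexamined. Only something like a blind test, which also scores the texts that were never flagged, can surface the false negatives.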
But I see so much LLM writing in the wild and professionally that just sucks.
I absolutely agree that there’s a ton of sucky writing out there, including lots of sucky writing that’s identifiable as being from LLMs (and probably some sucky writing that isn’t identifiable as being from LLMs). Ninety percent of everything is crap.
In situations like this, where I see a lot of people doing something that seems like a pretty big mistake, I want to just say “don’t do that.”
Fair point! I’m not necessarily disagreeing with the advice, only with the epistemic validity of the factual claim.
That said, separately from the epistemic point: after posting the comment, I got curious about the current state of the evidence on human detection of LLM writing. I asked a couple of frontier models[1] to do a review of 2025/2026 papers that used blind testing[2]. As I read it, the upshot is:
Most readers can’t tell LLM text from human text at better than chance (see the sketch after this list for what that means operationally).
Expert LLM users and people specifically trained in LLM detection do better, though the evidence is somewhat mixed. Notably, their edge came in significant part from learned surface tells rather than from judging the output as low-quality.
Subject-matter experts do better for some types of writing, less so for others.
There are simple interventions (eg getting the model to play a particular persona) that make the writing much harder to identify as LLM-generated.
Note that most of the data is on 2023/2024-era models, eg GPT-3.5, GPT-4o, Claude 2. I would expect humans to do worse at detecting output from current frontier models.
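For what it’s worth, one common way these studies operationalize ‘better than chance’ is a one-sided binomial test against p = 0.5 on forced-choice judgments. A quick sketch with hypothetical numbers (54 correct out of 100; not a figure from any of the papers):

```python
from math import comb

def p_better_than_chance(correct: int, trials: int) -> float:
    """One-sided binomial test: P(X >= correct) under pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# Hypothetical result: 54 of 100 two-choice judgments correct.
print(f"{p_better_than_chance(54, 100):.3f}")  # roughly 0.24: consistent with guessing
```

The same machinery shows the flip side: with only 100 judgments, even a judge who is genuinely somewhat better than chance (say 58% accurate) will often fail to reach significance, which is worth keeping in mind when reading the mixed results above.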
Yeah, it’s an interesting question how good human detection is. My guess is that people who are paying attention are getting better at sniffing out AI faster than AI is getting less distinctively scented, but “people who are paying attention” is a heck of a sleight of hand.
Overall, I suppose my main feeling is that I see AI generated stuff all the time in lots of different arenas, and I see other people judging it, and it sort of feels like an Eternal September where some people are freshly excited by some AI use case in their thinking, and don’t realize how it comes off (or do, but it hasn’t occurred to them that it comes off badly for good reasons as well as straightforward prejudice). There may also be lots of people using AI so skillfully that they don’t fall into any of these traps. It’s even possible they far outnumber the people who are (I think) bumbling. Perhaps my doubt for this last proposition is a stuck prior. But if so, it is well and truly stuck.
Absolutely agreed re: people naively using LLMs in the simplest possible way for writing and that coming off very badly. And I think getting such people to consider the issue is well worthwhile!
Thanks for the thoughtful reply!
[1] A debatable choice in this context, I realize, but that was what I had time for ;)
[2] Opus-4.6, ChatGPT-5.4-Thinking