I think there’s a weak moral panic brewing here in terms of LLM usage, leading people to jump to conclusions they otherwise wouldn’t, and assume “xyz person’s brain is malfunctioning due to LLM use” before considering other likely options. As an example, someone on my recent post implied that the reason I didn’t suggest using spellcheck for typo fixes was because my personal usage of LLMs was unhealthy, rather than (the actual reason) that using the browser’s inbuilt spellcheck as a first pass seemed so obvious to me that it didn’t bear mentioning.
Even if it’s true that LLM usage is notably bad for human cognition, it’s probably bad to frame specific critique as “ah, another person mind-poisoned” without pretty good evidence for that.
(This is distinct from critiquing text for being probably AI-generated, which I think is a necessary immune reaction around here.)
Perhaps one effect is that the help of LLMs in writing brings some people from “too incoherent or bad at words to post anything at all, or if they do post it’ll get no attention” to “able to put something together that’s good enough to gain attention”, and the LLMs are just increasing the verbal fluency without improving the underlying logic. Result: we see more illogical stuff being posted, technically as a result of people using LLMs.
The base rate for acute psychosis is so high that I am deeply skeptical that LLMs are making a difference without much more evidence. And in a lot of cases, the people at risk were going to be triggered by something, and we’re not really doing a good job of assessing whether it was just bad luck that an LLM appears to be the proximate cause. I’m not saying it can’t be true, but I have strong reservations.
Do you happen to know the number? Or is this a vibe claim?
I am a psychiatry resident, so my vibes are slightly better than the norm! In my career so far, I haven’t met or heard of a single psychotic person admitted to either hospital in my British city whose episode even my colleagues claimed was triggered by a chatbot.
But the actual figures have very wide error bars; one source claims around 50 per 100,000 per year for new episodes of first-time psychosis. https://www.ncbi.nlm.nih.gov/books/NBK546579/
But others claim around:
>The pooled incidence of all psychotic disorders was 26·6 per 100 000 person-years (95% CI 22·0-31·7).
That’s from a large meta-analysis in the Lancet
https://pubmed.ncbi.nlm.nih.gov/31054641/
It can vary widely from country to country, and then there’s a lot of heterogeneity within each.
To sum up:
About 1 in 2,000 to 1 in 3,800 people develop new psychosis each year (roughly what the two incidence figures above work out to). And about 1 in 250 people have ongoing psychosis at any point in time.
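(For the arithmetic: a minimal sketch, just inverting the two incidence figures quoted above and assuming the “1 in 250” point prevalence corresponds to roughly 0.4%.)

```python
# Convert "per 100,000 person-years" incidence into "1 in N people per year".
incidence_per_100k = {
    "first-episode psychosis (StatPearls figure)": 50.0,
    "all psychotic disorders (Lancet meta-analysis)": 26.6,  # 95% CI 22.0-31.7
}

for source, rate in incidence_per_100k.items():
    print(f"{source}: ~1 in {100_000 / rate:,.0f} people per year")

# Assumed point prevalence of ongoing psychosis (~0.4%), i.e. the "1 in 250" figure.
point_prevalence = 0.004
print(f"ongoing psychosis: ~1 in {1 / point_prevalence:,.0f} people at any time")
```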
I feel quite happy calling that a high base rate. As the first link notes, episodes of psychosis may be detected by statements along the lines of:
>For example, “Flying mutant alien chimpanzees have harvested my kidneys to feed my goldfish.” Non-bizarre delusions are potentially possible, although extraordinarily unlikely. For example: “The CIA is watching me 24 hours a day by satellite surveillance.” The delusional disorder consists of non-bizarre delusions.
If a patient of mine were to say such a thing, I think it would be rather unfair of me to pin the blame for their condition on chimpanzees, the practice of organ transplants, Big Aquarium, American intelligence agencies, or Maxar.
(While the CIA certainly didn’t help my case with the whole MK ULTRA thing, that’s sixty years back. I don’t think local zoos or pet shops are implicated.)
Other reasons for doubt:
Case reports ≠ incidence. The handful of papers describing “ChatGPT-induced psychosis” are case studies and at risk of ecological fallacies.
People already at ultra-high risk for psychosis are over-represented among heavy chatbot users (loneliness, sleep disruption, etc.). Establishing causality would require a cohort design that controls for prior clinical risk; no such study exists yet.
It is quite high:
>The current thinking is that although around 1.5 to 3.5% of people will meet diagnostic criteria for a psychotic disorder, a significantly larger, variable number will experience at least one psychotic symptom in their lifetime.
The cited study is a survey of 7076 people in the Netherlands, which mentions:
1.5% sample prevalence of “any DSM-III-R diagnosis of psychotic disorder”
4.2% sample prevalence of “any rating of hallucinations and/or delusions”
17.5% sample prevalence of “any rating of psychotic or psychosislike symptoms”
This post tried making some quick estimates:
>How many people have mental health issues that cause them to develop religious delusions of grandeur? We don’t have much to go on here, so let’s do a very very rough guess with very flimsy data. This study says “approximately 25%-39% of patients with schizophrenia and 15%-22% of those with mania / bipolar have religious delusions.” 40 million people have bipolar disorder and 24 million have schizophrenia, so anywhere from 12-18 million people are especially susceptible to religious delusions. There are probably other disorders that cause religious delusions I’m missing, so I’ll stick to 18 million people. 8 billion people divided by 18 million equals 444, so 1 in every 444 people are highly prone to religious delusions. [...]
>If one billion people are using chatbots weekly, and 1 in every 444 of them are prone to religious delusions, 2.25 million people prone to religious delusions are also using chatbots weekly. That’s about the same population as Paris.
>I’ll assume 10,000 people believe chatbots are God based on the first article I shared. Obviously no one actually has good numbers on this, but this is what’s been reported on as a problem. [...]
>Of the people who use chatbots weekly, 1 in every 100,000 develops the belief that the chatbot is God. 1 in every 444 weekly users were already especially prone to religious delusions. These numbers just don’t seem surprising or worth writing articles about. When a technology is used weekly by 1 in 8 people on Earth, millions of its users will have bad mental health, and for thousands that will manifest in the ways they use it.
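(A minimal sketch reproducing the quoted estimate, using only the numbers the post itself assumes: the delusion-rate ranges, the 40M/24M prevalence figures, 1 billion weekly chatbot users, and the assumed 10,000 believers.)

```python
# Back-of-envelope from the quoted post, with its own assumed inputs.
world_population = 8e9
bipolar = 40e6           # people with bipolar disorder
schizophrenia = 24e6     # people with schizophrenia

# 25-39% of schizophrenia and 15-22% of bipolar patients have religious delusions
prone_low = 0.25 * schizophrenia + 0.15 * bipolar    # ~12 million
prone_high = 0.39 * schizophrenia + 0.22 * bipolar   # ~18 million
prone = 18e6                                         # the post sticks with the upper end

one_in_n = world_population / prone                  # ~1 in 444 people
weekly_users = 1e9
prone_users = weekly_users / one_in_n                # ~2.25 million ("about Paris")

believers = 10_000                                   # assumed count who think the chatbot is God
print(f"~1 in {one_in_n:.0f} people especially prone to religious delusions")
print(f"~{prone_users / 1e6:.2f} million such people among weekly chatbot users")
print(f"~1 in {weekly_users / believers:,.0f} weekly users reportedly believe the chatbot is God")
```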
The relevant class here isn’t people with acute psychosis. The relevant class is people whose reasoning, in general or on a specific issue, has degenerated, which describes a substantial portion of the population for various reasons (politics being the classic example): people engaging in motivated reasoning or other biases to a stronger degree than is typical. (So while full-blown acute psychosis may in fact be rare, milder versions that still lead people to make far stronger claims than are reasonable are more central to what I believe most people are referring to.)
I think the blanket moral panic is a bit unfounded. My ratio of LLM to search-engine use is about 4:1, and I think LLMs often just explain things faster: I’d likely land on the same answer after searching 3-4 sites, because I roughly know what kind of information I’m looking for. If using an LLM the way I do is damaging, then using Google on a regular basis would be similarly damaging, which I don’t think it is.
I think the unhealthy use cases of LLMs don’t scale as uniformly as people might expect, at least not currently; there was recent research from METR on how experienced programmers actually lose time when using LLMs.
I haven’t noticed anyone else come out and say it here, and I may express this more rigorously later, but, like, GPT-5 is a pretty big bear signal, right? Not just in terms of benchmarks suggesting a non-superexponential trend but also, to name a few angles/anecdata:
It did slightly worse than o3 at the first thing I tried it on (with thinking mode on)
It failed to one-shot a practical coding problem that was entirely in javascript, which is generally among its strengths (and the problem was certainly in principle solvable)
It’s hallucinating pretty obviously when I give it a roughly 100-page document to read and comment on (it references a lot from the document, but gets many details wrong and over-fixates on others)
It’s described as a family of models with automatic routing that picks the best one for the situation, which seems like exactly what you’d do if nothing else was really giving you the sauce
The main reddit reaction seems to be that the demo graphs were atrocious, which is not exactly glowing praise
All the above paired with the fact that this is what they chose to call GPT-5, and with the fact that Claude’s latest release was a well-named and justified 0.1 increment
I’m largely with Zvi that even integrating this stuff as it already exists into the economy does some interesting stuff, and that we live in interesting times already. But other than what’s already priced in by integrations and efficiency optimizations, progress feels s-curvier to me today than it did a week ago.
Broadly agreed. My sense is that our default prediction from here should be to extrapolate out the METR horizon length trend (until compute becomes more scarce at least) with somewhere between the more recent faster doubling time (122 days) and the slower doubling time we’ve seen longer term (213 days) rather than expecting substantially faster than this trend in the short term.
So, maybe I expect a ~160-day doubling trend over the next 3 years, which implies a 50% reliability horizon length of ~1 week in 2 years and ~1.5 months in 3 years. By the end of this trend, I expect small speed-ups due to substantial automation of engineering in AI R&D and slow-downs due to reduced availability of compute, while simultaneously these AIs will be producing a bunch of value in the economy and AI revenue will continue to grow pretty quickly. But 50% reliability at 1.5-month-long easy-to-measure SWE tasks (or 80% reliability at week-long tasks) doesn’t yield crazy automation, though such systems may well be superhuman in many ways that let them add a bunch of value.
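(Sanity-checking those horizon numbers: a minimal sketch assuming the current 50%-reliability horizon is on the order of 2 hours and grows with a 160-day doubling time; the 2-hour starting point and the work-week/work-month conversions are my assumptions, not figures from the comment above.)

```python
# Extrapolate a 50%-reliability time horizon under a steady 160-day doubling time.
start_horizon_hours = 2.0   # assumed current 50% horizon (order of recent METR figures)
doubling_days = 160

def horizon_after(days: float) -> float:
    """Horizon (in hours) after `days` of exponential growth."""
    return start_horizon_hours * 2 ** (days / doubling_days)

WORK_WEEK_H = 40
WORK_MONTH_H = 167  # ~40 h/week * 52 weeks / 12 months

for years in (2, 3):
    h = horizon_after(365 * years)
    print(f"{years} years out: ~{h:.0f} h "
          f"(~{h / WORK_WEEK_H:.1f} work-weeks, ~{h / WORK_MONTH_H:.1f} work-months)")
```

Under those assumptions this lands at roughly a work-week in 2 years and roughly 1.5 work-months in 3 years, matching the figures above.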
(I think you’ll be seeing some superexponentiality in the trend due to a mix of achieving generality and AI R&D automation, but I currently don’t expect this to make much of a difference by 50% reliability at 1.5-month-long horizon lengths with easy-to-measure tasks.)
But, this isn’t that s-curve-y in terms of interpretation? It’s just that progress will proceed at a steady rate rather than yielding super powerful AIs within 3 years.
Also, I think ways in which GPT-5 is especially bad relative to o3 might be more evidence about this being a bad product launch in particular rather than evidence that progress is that much slower overall.
I plan on writing more about this topic in the future.
same, I was surprised. I think my timelines … ooh man I don’t know what my mental timeline predictor did but I don’t think “get longer” is a full description. winter is a scary time because summer might come any day
Mostly agree, but also: the slower (pre-reasoning-model) exponential on task length is certainly not falsified, and even looks slightly more plausible.