It would be IMO pretty dumb to write a post that is e.g. trying to learn from past social movement failures and not have an LLM look over it to see whether it’s saying anything historically inaccurate.
Do you then look over what the LLM has said and see whether it’s saying anything historically inaccurate, without using an LLM?
Not most of the time! Like, I sometimes ask multiple LLMs, but I don’t verify every fact an LLM tells me, unless it’s a domain where I predict LLMs are particularly likely to hallucinate. I keep in mind that stuff is sometimes hallucinated, but most of the time it’s enough to know that something is quite probably true.
There’s no such thing as “a domain where LLMs are particularly likely to hallucinate”. In every domain there’s some obscure jagged boundary, not very far from the normal questions you’d ask, where LLMs will hallucinate, usually in ways that sound plausible to a non-expert.
To me, this sounds like you’re simply pushing the problem a little bit downstream without actually addressing it. You’re still not verifying the facts; you’re just getting another system with similar flaws to the first (namely, you) to do the checking. You aren’t actually fact-checking at any point.
That is not how Bayesian evidence works. I am treating LLM output as somewhat less trustworthy than what a colleague of mine says, but not fundamentally different. I am skeptical that you spend your days double-checking every conversation you have with another human. I also don’t think you should spend your days double-checking every single thing an LLM tells you.
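To make that concrete, here’s a minimal sketch of the update, with numbers that are entirely made up for illustration: an assertion from a less reliable source still shifts my credence toward the claim, just by less, and at no point does the update require me to independently verify anything.

```python
def posterior(prior, p_assert_if_true, p_assert_if_false):
    """P(claim is true | source asserted it), by Bayes' rule."""
    evidence_for = p_assert_if_true * prior
    evidence_against = p_assert_if_false * (1 - prior)
    return evidence_for / (evidence_for + evidence_against)

# P(source asserts X | X is true) vs P(source asserts X | X is false):
# made-up reliability numbers, starting from a 50/50 prior.
print(posterior(0.5, 0.9, 0.1))    # colleague-grade source -> 0.90
print(posterior(0.5, 0.75, 0.25))  # LLM-grade source       -> 0.75
```

Weaker evidence is still evidence; it just moves me less.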
This feels kind of like the early conversations about Wikipedia where people kept trying to insist Wikipedia is “not a real source”.
I am treating LLM output as somewhat less trustworthy than what a colleague of mine says, but not fundamentally different.
If you’re asking a human about some even mildly specialized topic, like the history of Spain in the 17th century, different crop rotation methods, or ordinary differential equations, and there’s no special reason for them to want to appear like they know what they’re talking about, they’ll generally just say “IDK”. LLMs are much less like that IME. I think this is actually a big difference in practice, at least in the domains I’ve tried (reproductive biology). LLMs routinely give misleading / false / out-of-date / vague-but-deceptively-satiating summaries.
I agree the LLMs are somewhat worse, especially compared to rationalist-adjacent experts in specialized fields, but they really aren’t that bad for most things. Like I researched the state of the art of datacenter security practices yesterday, and I am not like 99% confident that the AI got everything right, but I am pretty sure it helped me understand the rough shape of things a lot better.
This seems fine and good—for laying some foundations, which you can use for your own further theorizing, which will make you ready to learn from more reliable + rich expert sources over time. Then you can report that stuff. If instead you’re directly reporting your immediately-post-LLM models, I currently don’t think I want to read that stuff, or would want a warning. (I’m not necessarily pushing for some big policy, that seems hard. I would push for personal standards though.)
Fwiw, in my experience LLMs lie far more than early Wikipedia or any human I know, and in subtler and harder to detect ways. My spot checks for accuracy have been so dismal/alarming that at this point I basically only use them as search engines to find things humans have said.
I am wondering whether your experiences were formed via the first generation of reasoning models, and my guess is you are also thinking of different kinds of questions than I am.
The thing LLMs are really great at is speaking and thinking in the ontology and structure that is prevalent among experts in any given field. This is usually where the vast majority of the evidence comes from. LLMs aren’t going to make up whole ontologies about how bankruptcy law works, or how datacenter security works. They might totally make up details, but they won’t make up the high-level picture.
Second, this has just gotten a lot better over the last 6 months. GPT-5 still lies a good amount, but vastly less than o1 or o3. I found o1 almost unusable on this dimension.
Datapoint: I’m currently setting up a recording studio at Lighthaven, and I am using them all the time to get guides for things like “how to change a setting on this camera” or “how to use this microphone” or “how to use this recording software”.
Yes, they confabulate menus and things a lot, but as long as I keep uploading photos of what I actually see, they know the basics much better than I do (e.g. what bit rate to set for the video vs the audio, where to look to kill the random white noise input I’m getting, etc.).
I’d say they confabulate like 50% of the time, but they’re still a much more effective search engine for me than Google, and can read the manual much faster than I can. My guess is I simply couldn’t do some of the projects I’m doing without them.