I’ve had a similar experience in trying to have research discussions with LLMs. Every time I poke at my own conceptual confusion on a topic, they just seem to kind of break down: saying inconsistent stuff in loops, retreating back to what has already been said on the topic. They’re even worse than this, since they also often get really basic stuff wrong. E.g., just the other day Claude told me that the K-complexity (Kolmogorov complexity) of a random string was the same as that of a crystal. This was in the context of a conversation that was probably confusing for it, where I was trying to grok complexity measures more deeply and so was really pushing on the confusions around them; still, it’s pretty revealing (imo) that this happens. Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
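(For anyone who wants the intuition behind why that answer is wrong: here is a minimal sketch using compressed size as a crude upper-bound proxy for Kolmogorov complexity, since the real quantity is uncomputable. The particular strings and lengths are just illustrative, not anything Claude or I actually ran.)

```python
import os
import zlib

# Kolmogorov complexity itself is uncomputable, but compressed size gives a
# rough upper-bound proxy: a random string has no structure to exploit, while
# a highly regular, "crystal-like" string collapses to almost nothing.
random_bytes = os.urandom(10_000)    # ~maximally incompressible
crystal_bytes = b"ABAB" * 2_500      # perfectly periodic, same length

print(len(zlib.compress(random_bytes)))   # roughly 10,000 bytes
print(len(zlib.compress(crystal_bytes)))  # a few dozen bytes
```

The two descriptions differ by orders of magnitude, which is the whole point of the contrast between randomness and crystalline regularity.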
But I’m always wondering whether it’s me who is crazy here, as my social environment seems to believe that LLMs are formidable forces of intellect, getting better by the year. My own sense-making of this situation is similar to Jeremy’s: it does seem like something is getting better, just something more along the lines of ~”filling in between the lines of what is already known” and less “raw intelligence,” whatever that is. But it’s of course nearly impossible to talk about any of these things, or even to really know what the difference is, and so on. And hearing more and more hype about LLMs getting better at coding, and not being much of a coder myself, I have been worrying that my own experience isn’t very representative. Maybe you can just get excellence out if you train super hard on a given domain, I don’t know. But also, maybe people are pointing at the same sort of thing when they say LLMs are “good at coding” as when they say they are getting smarter. So it’s an interesting data point for me, to see Jeremy describe it as such here.
I think there is a lot more reason to trust the facts cited in an NYT article. For one, the New York Times, along with most major news publications, has standards for fact checking. They try hard to get primary source validation, or at least secondary source validation (some of those guidelines are stated here); falsifying information is a fireable offense. They also have a reputation to uphold, a major part of which rests on their ability to convey the news truthfully. These kinds of checks don’t really exist for LLMs.
Nor do we have much insight into how LLM information is generated. With news publications, we can at least understand the sorts of biases that might be introduced via the mechanisms by which stories are produced: people interviewing a bunch of people, maybe in misleading ways, leaving out some facts, etc. With LLMs, we have much less of an idea of what kinds of errors might emerge, and hence what to mentally correct for, since we don’t understand the process that generates their outputs.
Perhaps this is just a personal difference, but I would much rather take “technically true but misleading” over “totally wrong but subtle enough and authoritative enough and seems-kind-of-right enough that you can barely notice unless you really dig into the claims or already have extensive background knowledge.”
My response upon reading that LLMs did substantial research or writing for a post is generally to not make any update. That doesn’t mean parts of it aren’t right (they likely are); it just means that it takes a ton of work for me to suss out what’s true (much more than for a human-written post, for reasons that Gwern outlined above), and it’s usually not worth it.