I’ve had a similar experience in trying to have research discussions with LLMs. Every time I poke at my own conceptual confusion on a topic they just seem to kind of break down: saying inconsistent stuff in loops, retreating back to what has already been said on the topic. They’re even worse than this, since they do also often get really basic stuff wrong. E.g., just the other day Claude told me that the k-complexity of a random string was the same as that of a crystal. This was in the context of a conversation that was probably confusing for it, where I was trying to grok complexity measures more deeply and so was really pushing on the confusions around them; still, it’s pretty revealing (imo) that this happens. Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
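To make the random-string-versus-crystal point concrete: Kolmogorov complexity (k-complexity) is uncomputable, but the length of a compressed encoding gives a crude upper bound on it. The sketch below is only an illustration under that assumption. It uses zlib as the stand-in compressor, a repeated `b"NaCl"` pattern as a toy "crystal," and `os.urandom` as the source of the random string; none of these choices come from the original conversation.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough upper-bound proxy for k-complexity."""
    return len(zlib.compress(data, 9))


n = 100_000

# A perfectly periodic string, standing in for a crystal: a short program
# ("repeat this unit cell n/4 times") fully describes it.
crystal = b"NaCl" * (n // 4)

# Random bytes: with overwhelming probability there is no description much
# shorter than the string itself.
random_bytes = os.urandom(n)

crystal_size = compressed_size(crystal)
random_size = compressed_size(random_bytes)

print(f"crystal: {n} bytes -> {crystal_size} bytes compressed")
print(f"random:  {n} bytes -> {random_size} bytes compressed")
```

The periodic string shrinks to a tiny fraction of its length, while the random one does not shrink at all, which is the sense in which the two sit at opposite ends of this particular complexity measure.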
But I’m always wondering if it is me who is crazy here, as my social environment seems to believe that LLMs are formidable forces of intellect, getting better by the year. My own sense-making of this situation is similar to Jeremy’s: it does seem like something is getting better, just something more along the lines of ~“filling in between the lines of what is already known” and less “raw intelligence,” whatever that is. But it’s of course impossible to talk about any of these things or to even really know what the difference is and so on. And in hearing more and more hype about LLMs getting better at coding, and not being much of a coder myself, I have been worrying that my own experience isn’t very representative. Maybe you can just get excellence out if you train super hard on a given domain, I don’t know. But also, maybe people are pointing at the same sort of thing when they say LLMs are “good at coding” as they are when they say they are getting smarter. So it’s an interesting data point for me, to see Jeremy describe it as such here.
This is not that far off from my own experience, including the part about wondering whether I’m crazy / whether it’s just me / whether there’s something I’m missing.
(Except for the “not being much of a coder” thing—I do a lot of coding, and have for many years, and the hype is very confusing to me. I’ve been using coding assistants on and off for around a year now, starting with Sonnet 3.7; at the time, they offered me nothing more than an incremental speed-up in certain atypically well-scoped tasks that I myself was relatively bad at, whereas now, with the latest models and harnesses, they… still offer me an incremental speed-up in certain atypically well-scoped tasks that I myself am relatively bad at. I’ve actually stopped using them entirely, recently, because I got fed up with having to maintain the cruft they wrote.)
That said, I do find LLMs very useful in a variety of ways, and have even found it helpful to discuss research with them at times.
I’m not sure I agree with the notion that they’re only good “inside the training distribution,” because it’s not clear what counts as “being inside the training distribution” if we’re conceding that LLMs can synthesize attributes which each appeared somewhere in training but never co-occurred in a single training example. Once the concept of “inside the training distribution” has been widened to include “everything anyone has ever written and anything that can be formed by ‘recombining’ those texts in arbitrarily abstract ways,” what doesn’t count as inside the training distribution? Of course one can always point at a failure post-hoc and say “oh, I must have strayed outside the distribution,” but this explains nothing unless the failures can be predicted in advance, and it’s not clear to me that they can.
The most useful heuristic I have for when today’s LLMs seem brilliant vs. dumb is, instead, that their failures are very often the result of them not having sufficient context about what’s going on and what exactly you need from them[1].
The lack of context is so often the bottleneck that, these days, I basically think of Claude as being strictly smarter than me in every way provided that Claude knows all the relevant information I know… which, of course, Claude never really does. But the more I supply that info to Claude (that is, the closer I get to that asymptotic limit of Claude really, truly knowing exactly what my current research problem is and exactly what dead ends I’ve tried and why I think they failed, etc. etc.), the closer Claude gets to matching, or exceeding, my own level of competence.
And so I’m always thinking to myself, these days, about how to “get context into Claude.” If I’m doing something on a computer instead of asking Claude to do it, it’s because I have decided that getting the requisite context into Claude would be more onerous than just doing the whole thing myself. (Which is usually the case in practice.)
And this observation makes me feel like I’m going crazy when I see all the hype about LLM agents, about long-running autonomy and “one-shotting” and so on.
Because: “getting context into Claude” is not a task that Claude is very good at doing!
The reason for this is intuitive: it’s a cold start problem, a bootstrap paradox, whatever you want to call it. Claude is weak and unreliable until it has enough context—which means that if you let Claude handle the task of giving itself context, it will do so weakly and unreliably. And since its performance at everything else is so strongly gated on this one foundational step, executing that step weakly and unreliably will have catastrophic consequences.
Instead, I tend to keep things on a pretty tight leash—with a workflow that looks more like a carefully pre-designed and inflexible sequence of stages than an “agent” doing whatever it feels like—and I do a lot of verbose and detailed writing work to spell out everything that matters, in detail.
This is true both when I’m chatting directly with Claude (or another LLM), and when I’m writing code that uses LLMs to process data or make decisions. In both cases, I write very long and detailed messages (or message templates), in which I put a lot of effort into things like “clarifying potential confusions before they arise” and “including arguably extraneous information because it helps ‘set the scene’ in which the work is taking place” and “giving the LLM tips on how to think about the problem and how to check its work to verify it isn’t making a mistake I’ve seen it make before.”
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on[2]. But it is a very different approach from the sort of thing that is getting hyped, and it requires a lot of upfront manual effort that often isn’t worth it.
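For concreteness, here is a minimal sketch of what a fixed-stage, context-heavy workflow of the kind described above might look like in code. It is an illustration, not the commenter's actual setup: the prompt layout, the `Stage` fields, `run_pipeline`, and the `fake_llm` stand-in are all hypothetical names introduced here, and a real model call would replace `fake_llm`.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, Dict, List

# Hypothetical prompt layout; the section names are illustrative assumptions.
PROMPT = Template("""\
BACKGROUND (scene-setting, even if arguably extraneous):
$background

DEAD ENDS ALREADY TRIED, AND WHY THEY SEEMED TO FAIL:
$dead_ends

TASK FOR THIS STAGE ($stage_name):
$task

POTENTIAL CONFUSIONS, CLARIFIED UP FRONT:
$clarifications

HOW TO CHECK YOUR WORK:
$checks

OUTPUT OF THE PREVIOUS STAGE:
$previous
""")


@dataclass(frozen=True)
class Stage:
    name: str
    task: str
    clarifications: str
    checks: str


def run_pipeline(stages: List[Stage], background: str, dead_ends: str,
                 call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Run the stages in a fixed order, feeding each output into the next prompt."""
    outputs: Dict[str, str] = {}
    previous = "(none; this is the first stage)"
    for stage in stages:
        # Every stage gets the full hand-written context, not just its own task.
        prompt = PROMPT.substitute(
            background=background, dead_ends=dead_ends,
            stage_name=stage.name, task=stage.task,
            clarifications=stage.clarifications, checks=stage.checks,
            previous=previous,
        )
        previous = call_llm(prompt)
        outputs[stage.name] = previous
    return outputs


# Stand-in for a real model call, so the sketch runs without any API.
def fake_llm(prompt: str) -> str:
    return f"[model reply to a {len(prompt)}-character prompt]"


stages = [
    Stage("extract", "List the claims made in the document.",
          "A 'claim' here means a falsifiable statement, not an opinion.",
          "Re-read the document and confirm each claim appears verbatim."),
    Stage("assess", "Rate each extracted claim as supported or unsupported.",
          "Judge only against the document, not outside knowledge.",
          "Confirm every claim from the previous stage received a rating."),
]
results = run_pipeline(stages, background="(long hand-written context)",
                       dead_ends="(what was tried before)", call_llm=fake_llm)
```

The point of the design is that the model never chooses what to do next or what context to gather: the human fixes the stage order in advance and writes the background, clarifications, and checks by hand.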
[1] Arguably this is sort of like the “training distribution” story, except adjusted to take in-context learning into account.
You can teach the LLM new tricks without re-training it—but you need to give it enough information to precisely specify the tricks in question and distinguish them from the many other slightly different tricks which you, or someone else, might hypothetically have wanted it to learn instead. After all, it has been trained so that it could work with many different users who all want different things, and it has no way of “determining which one of the users you are” except by the use of distinguishing factors that you provide to it.
More speculatively, I think even some of the cases when the LLM says something really dumb or vacuous (as opposed to “correctly solving a task, but the wrong one”) might really be further instances of “correctly solving the wrong task,” only at a different level of abstraction.
For all it knows at the outset, you might be the sort of person who would be most satisfied by some hand-wavey low-effort bluster that’s phrased in an eye-catching manner but which seems obviously stupid to a reader with sufficient expertise if they’re paying enough attention. So there’s an element of “proving yourself to the LLM,” of demonstrating in-context that you are that expert-who’s-paying-attention as opposed to, like, the median LM Arena rater or something.
[2] To clarify, I mean that the stuff I’ve seen is more impressive specifically because I can somewhat-reliably elicit it once I’ve done all the laborious setup work, whereas when I try to skip that work and just let Claude Code make all the decisions, it usually makes those decisions wrong and ends up lost in some irrelevant cul-de-sac, beating its head against an irrelevant wall.
If the hype were actually representative—if the agent actually could do the equivalent of my “laborious setup work” and bootstrap itself into a state where it knows enough to meaningfully contribute—that would of course be a very different story.
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on
Could you give an example?
Your explanation has left me wondering how much of the work done in achieving these feats is you providing the right context. Certainly, when I’m solving problems, a lot of the work is finding the right context.
Fwiw I similarly still experience them to be bad at coming up with useful novel math research ideas, even as they’ve gotten much more competent at coding. Though they aren’t great at coding yet either.
However, I don’t think this ‘filling in the blanks’ is something fundamentally different in kind from ‘raw intelligence’. I don’t think there’s a hard boundary here. Anything that isn’t a literal lookup table is applying algorithms to extrapolate what it knows to new situations. Even something as minor as changing the tense of a memorised sentence is novel invention of a sort, just a tiny little bit. I think current llms can’t extrapolate as far as some humans yet, but the average distance they can extrapolate over seems to me to have increased over time. They’re still bad at coming up with novel math research ideas now, but three years ago they were much worse.
Separately from this, llms just know a lot of things most humans don’t, which can make them a value add to some intellectual tasks even if they can’t extrapolate the things they know very far.
I agree that the AIs are pretty bad at handling conceptually confusing stuff. I basically think of them as being incredibly knowledgeable, not that smart, and having huge amounts of intuition on how to program (mostly due to their knowledge and their having read huge amounts of code).
My guess is that for any reasonable operationalization of “raw intelligence”, they’re getting smarter?
Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
These feel like very different unrelated statements to me (not sure if you meant to imply they are connected). I think you can do real chunks of novel/scientific thought while being too incoherent to see it all the way through.
I’m not sure how you’re defining “real”/“novel”/“scientific” thoughts. I’m pretty sure they can and do have them; the thing they don’t do is persistently and strategically follow through on them and string them together in a useful way.