This is not that far off from my own experience, including the part about wondering whether I’m crazy / whether it’s just me / whether there’s something I’m missing.
(Except for the “not being much of a coder” thing—I do a lot of coding, and have for many years, and the hype is very confusing to me. I’ve been using coding assistants on and off for around a year now, starting with Sonnet 3.7; at the time, they offered me nothing more than an incremental speed-up in certain atypically well-scoped tasks that I myself was relatively bad at, whereas now, with the latest models and harnesses, they… still offer me an incremental speed-up in certain atypically well-scoped tasks that I myself am relatively bad at. I’ve actually stopped using them entirely, recently, because I got fed up with having to maintain the cruft they wrote.)
That said, I do find LLMs very useful in a variety of ways, and have even found it helpful to discuss research with them at times.
I’m not sure I agree with the notion that they’re only good “inside the training distribution,” because it’s not clear what counts as “being inside the training distribution” if we’re conceding that LLMs can synthesize attributes which each appeared somewhere in training but never co-occurred in a single training example. Once the concept of “inside the training distribution” has been widened to include “everything anyone has ever written and anything that can be formed by ‘recombining’ those texts in arbitrarily abstract ways,” what doesn’t count as inside the training distribution? Of course one can always point at a failure post-hoc and say “oh, I must have strayed outside the distribution,” but this explains nothing unless the failures can be predicted in advance, and it’s not clear to me that they can.
The most useful heuristic I have for when today’s LLMs seem brilliant vs. dumb is, instead, that their failures are very often the result of them not having sufficient context about what’s going on and what exactly you need from them[1].
The lack of context is so often the bottleneck that, these days, I basically think of Claude as being strictly smarter than me in every way provided that Claude knows all the relevant information I know… which, of course, Claude never really does. But the more I supply that info to Claude—the closer I get to that asymptotic limit of Claude really, truly knowing exactly what my current research problem is and exactly what dead ends I’ve tried and why I think they failed etc. etc.—the closer Claude gets to matching, or exceeding, my own level of competence.
And so I’m always thinking to myself, these days, about how to “get context into Claude.” If I’m doing something on a computer instead of asking Claude to do it, it’s because I have decided that getting the requisite context into Claude would be more onerous than just doing the whole thing myself. (Which is usually the case in practice.)
And this observation makes me feel like I’m going crazy when I see all the hype about LLM agents, about long-running autonomy and “one-shotting” and so on.
Because: “getting context into Claude” is not a task that Claude is very good at doing!
The reason for this is intuitive: it’s a cold start problem, a bootstrap paradox, whatever you want to call it. Claude is weak and unreliable until it has enough context—which means that if you let Claude handle the task of giving itself context, it will do so weakly and unreliably. And since its performance at everything else is so strongly gated on this one foundational step, executing that step weakly and unreliably will have catastrophic consequences.
Instead, I tend to keep things on a pretty tight leash—with a workflow that looks more like a carefully pre-designed and inflexible sequence of stages than an “agent” doing whatever it feels like—and I do a lot of verbose and detailed writing work to spell out everything that matters, in detail.
This is true both when I’m chatting directly with Claude (or another LLM), and when I’m writing code that uses LLMs to process data or make decisions. In both cases, I write very long and detailed messages (or message templates), in which I put a lot of effort into things like “clarifying potential confusions before they arise” and “including arguably extraneous information because it helps ‘set the scene’ in which the work is taking place” and “giving the LLM tips on how to think about the problem and how to check its work to verify it isn’t making a mistake I’ve seen it make before.”
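For concreteness, a minimal sketch of what this kind of "inflexible sequence of stages" with verbose templates can look like in code. Everything here is hypothetical and illustrative: `call_llm` is a stand-in for whatever client you actually use, and the template slots just mirror the categories described above (scene-setting background, pre-emptive clarifications, tips and self-checks).

```python
# Sketch of a rigid, pre-designed staged workflow with verbose templates.
# `call_llm` is a placeholder for a real LLM client; nothing here is a real API.

STAGE_TEMPLATE = """\
## Background (scene-setting, even if arguably extraneous)
{background}

## Task for this stage
{task}

## Clarifications (pre-empting confusions seen before)
{clarifications}

## Tips and self-checks
{tips}

## Input
{payload}
"""


def build_stage_prompt(background, task, clarifications, tips, payload):
    """Fill the fixed template; every slot is mandatory on purpose,
    so no stage can silently skip the context-supplying work."""
    return STAGE_TEMPLATE.format(
        background=background,
        task=task,
        clarifications="\n".join(f"- {c}" for c in clarifications),
        tips="\n".join(f"- {t}" for t in tips),
        payload=payload,
    )


def run_pipeline(stages, payload, call_llm):
    """Run a pre-designed sequence of stages in a fixed order.
    The model makes no control-flow decisions; it only fills in
    the output of each stage, which becomes the next stage's input."""
    for stage in stages:
        payload = call_llm(build_stage_prompt(payload=payload, **stage))
    return payload
```

The point of the fixed template is that the "on-rails" structure, not the model, decides what happens when; the model's latitude is confined to the content of each stage's answer.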
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on[2]. But it is a very different approach from the sort of thing that is getting hyped, and it requires a lot of upfront manual effort that often isn’t worth it.
Arguably this is sort of like the “training distribution” story, except adjusted to take in-context learning into account.
You can teach the LLM new tricks without re-training it—but you need to give it enough information to precisely specify the tricks in question and distinguish them from the many other slightly different tricks which you, or someone else, might hypothetically have wanted it to learn instead. After all, it has been trained so that it could work with many different users who all want different things, and it has no way of “determining which one of the users you are” except by the use of distinguishing factors that you provide to it.
More speculatively, I think even some of the cases when the LLM says something really dumb or vacuous—as opposed to “correctly solving a task, but the wrong one”—might really be further instances of “correctly solving the wrong task,” only at a different level of abstraction.
For all it knows at the outset, you might be the sort of person who would be most satisfied by some hand-wavey low-effort bluster that’s phrased in an eye-catching manner but which seems obviously stupid to a reader with sufficient expertise if they’re paying enough attention. So there’s an element of “proving yourself to the LLM,” of demonstrating in-context that you are that expert-who’s-paying-attention as opposed to, like, the median LM Arena rater or something.
To clarify, I mean that the stuff I’ve seen is more impressive specifically because I can somewhat-reliably elicit it once I’ve done all the laborious setup work, whereas when I try to skip that work and just let Claude Code make all the decisions, it usually makes those decisions wrong and ends up lost in some irrelevant cul-de-sac, beating its head against the wrong wall.
If the hype were actually representative—if the agent actually could do the equivalent of my “laborious setup work” and bootstrap itself into a state where it knows enough to meaningfully contribute—that would of course be a very different story.