This resonates a lot. I once wrote about what I’d say is a specific instantiation of this concept, namely that people often ask questions that appear to have a binary return type, when that is, in fact, just the wrong return type to look for. Many binary-sounding questions actually require a return type of “function that takes a potentially large number of input values and returns a boolean [or sometimes even something more complex than a boolean]”. When looking for and debating binary answers to such questions, one sweeps the most interesting parts of the answer under the rug.
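To make the “return type” framing concrete, here is a tiny, purely illustrative sketch (all names and the decision rule are hypothetical, just showing the shape of the answer):

```python
from typing import Callable

# Naive framing: "Should we rewrite the legacy service?" -> expects a bool.
NaiveAnswer = bool

# More honest framing: the answer is a function of many inputs, not a constant.
Circumstances = dict[str, float]  # e.g. expected lifetime, maintenance cost, team size
HonestAnswer = Callable[[Circumstances], bool]

def should_rewrite(ctx: Circumstances) -> bool:
    # Hypothetical decision rule, only to illustrate the function-valued answer.
    return ctx["expected_lifetime_years"] > 3 and ctx["maintenance_cost_per_year"] > 50_000
```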
Two more points I’d add:
Diminishing returns in code produced. When code production is limited, one naturally works only on those pieces of code that are most useful. But if you can create 10x more code, you likely don’t reap 10x the benefits. I notice this a lot in private projects: I quickly run out of ideas for what to even code, because finding anything whose value exceeds even the time it takes to prompt it becomes non-trivial relatively quickly. Maybe this is somewhat of a skill issue, but I think this effect will always be present to some degree.
Code doesn’t live in isolation. It often requires changes outside of the code to go along with it. So you often can’t just speedrun the code to its final form; many things around it hold it in place to some degree, or have to advance slowly alongside it. Things like user expectations & knowledge, deployment, code in other projects that may depend on it and are outside of your control, general project planning / decisions, and so on. It reminds me of my observation about parameter updates during gradient descent here:
While most parameters have “settled down” after the first few hundred epochs, clearly some of them are still on a mission, though. I was also a bit surprised how smoothly many of them are moving. [...] many parameters are systematically drifting into one direction, albeit now in a more noisy fashion. It gives me “just update all the way bro!” vibes—but perhaps this is a bit of a coordination problem: multiple parameters have to move more or less in unison, and none of them can reasonably update further even though the longer term trajectory is clear.
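(For what it’s worth, the kind of logging behind that observation is easy to reproduce. Here is a minimal, self-contained toy, not the setup from that post: a nearly collinear feature pair creates one slow direction, so two parameters keep drifting in a coordinated way long after the rest have settled.)

```python
import numpy as np

# Toy linear regression trained with plain gradient descent; the only point is to
# log every parameter at every epoch and look at who is still drifting late on.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.99 * X[:, 0] + 0.14 * rng.normal(size=200)  # nearly collinear pair -> one slow direction
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(5)
lr = 0.01
history = []                                   # parameter snapshot per epoch

for epoch in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)      # gradient of mean squared error
    w -= lr * grad
    history.append(w.copy())

history = np.array(history)                    # shape: (epochs, n_params)
late_drift = history[-1] - history[-500]       # movement over the last 500 epochs
print(np.round(late_drift, 4))                 # the collinear pair w[0], w[3] still moves in a
                                               # coordinated way; the others have essentially settled
```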
I agree, and this seems like a special case of what I sometimes think of as “extrapolating too far”, which also occurs in reasoning of all kinds quite often and particularly when discussing the future.
An example would be the assumption, which people sometimes argue for, that some scarce material resource eventually just “runs out” more or less suddenly. In such cases, it’s almost always true that scarcity increases gradually and feeds into a feedback loop with the search for alternatives. But if one just extrapolates the “this resource will eventually run out” idea in isolation, without taking into account the changes this has on the rest of the world and on the relevant process, one can reach the conclusion that it eventually just “hits 0”.
Of course, extrapolating in more reasonable ways is often extremely difficult, as the systems involved are difficult/impossible to fully predict. But when one does isolated extrapolations, it’s at least helpful to keep in mind that this necessarily comes with simplifications that won’t hold, and that the world around the particular thing in question will likely react to the changes that the extrapolation entails. Or, as in your post, the extrapolated thing itself doesn’t follow a linear trajectory to begin with.
Some related observations I’ve made over the last months:
most new projects start out fast and easy, with simple prompts leading to swift progress; but after a few iterations, I usually end up in situations where just explaining my ideas for the next marginal improvement becomes more and more time-consuming. Many projects then stall, as even with vibe-coding I reach the point where the effort of thinking through & explaining the next steps is no longer worth the expected payoff (this, of course, depends greatly on the project; if it’s a product that thousands or millions of people use, the equation is different than if it’s just some gimmick for personal use)
back when I actually wrote code, there was a kind of two-way communication between “product level” and “code level”—sometimes, certain abstractions in the code just made it very worthwhile to add certain features that otherwise maybe wouldn’t have occurred to me. This “code → product” communication channel has almost entirely disappeared now. I’m not sure how important this is, but it certainly makes me a little sad, because these “oh, I can write the code in this way, and then this opens all kinds of doors and naturally allows us to do X!” moments used to be very nice.
Even a few years ago, in my software job I think I spent maybe ~40% of my time, often less, on actual coding tasks, and the rest was happening somewhere outside of my IDE. While some of these other tasks can also be handled in Claude Code via MCPs, that hasn’t had a big impact yet on most of my activities. So, while that ~40% of my prior work is now sped up by some considerable factor, much of the rest is still just as slow as before.
Other activities have indeed grown compared to before, because other people are using Claude Code as well and increasing their output. If more happens in my organization, then more coordination is needed. To some degree that’s code reviews (although these are somewhat sped up by AI tools as well), but also just the process of having to refine & understand more tickets if we are working through them more quickly.
I suspect that some of that coordination is less efficient than before, because our developers have a less well-rounded theory of the code now, meaning they’re less quick to provide informed judgment on the fly. I don’t have much evidence of this besides personally having this impression of myself and having a reasonably strong prior that many others will have a similar experience.
I agree with many of your thoughts and considerations, but end up at the opposing prediction—I do think that coding agents will very likely improve fast enough that the problem of decaying vibe-coded code bases will be outpaced by their abilities in many cases. Naturally, I don’t think this is true across the board, but as a general trend, this seems likely to me. For the following reasons:
development since Opus 4.5 has been extremely fast, with many dimensions seeing improvements, from the models themselves, to their harnesses, to the UX & how people interact with them, to skills extending their capabilities in all kinds of ways; while it’s possible the progress of the recent ~5 months won’t continue at this rapid pace, there are so many axes along which improvement is happening that a severe slowdown would surprise me
I tend to think that one of the main advantages of good theory about a code base is that it allows you to make good predictions, which is useful e.g. when you’re in a meeting with other stakeholders and they need a quick assessment of how much effort different features/changes would entail, or what risks are involved. Perhaps processes will adapt in ways that make this advantage less important.
Coding agents most likely will eventually get better at building persistent & sharable representations of theory[1], if this is indeed as important as one might think (which I’m not so sure about)
I think people overestimate how much theory is really present in human-built code bases in big organizations; there is so much movement/churn involved, people switch teams, leave the company, work on new things and forget their old code. How common is it that things “built as a prototype” end up making it to production, that hacky solutions are shipped and corners are cut on all sides, compromising the theory on all ends? I think it’s highly common in many places. Of course there are always some experienced people you can point at who have good theory of their part of the code, and they’ll be principled about it and often say “no, we don’t do it like that”, and I suppose that’s often indeed a good thing. But still, I’d assume that 90% of many production code bases are already pretty much theory-less even without vibe coding. Vibe coding will increase this share, but it also comes with better ways of dealing with theory-less code by, e.g., making it vastly more efficient to gain knowledge about how some unknown piece of code works and how it relates to other parts of the system.
I do find it very uncomfortable, in some ways, to rely on AI tools more and more in coding. It worries me to lose my grip on the theory. It feels like a dangerous route to take, and nobody can say with certainty if it’s worth the risk. I’m also worried about my skills atrophying with every instance of asking Claude to implement something that I could also do myself, if I had a little more patience—and what am I really contributing, when the skills I’ve built over decades are not something I’m using anymore in my daily work? And maybe I’m just telling myself that “learning AI tools now is important so I should use them all the time” because that’s a convenient excuse to do less mental work myself. But even then, after the development of the past few months, I can’t help but feel that insisting on humans having to think about code at all a year from now seems to vastly underestimate the trajectory that we’re seemingly on. (But then again, it wouldn’t be the first time I overestimated a recent trend and would then be surprised by it slowing down against my expectations—so I guess I’m leaning 60:40 towards my claims here being broadly in the right direction, and feel generally highly uncertain about where things are headed over the next 1-2 years)
[1] Admittedly, part of the theory that people maintain in their heads (and that coding agents may have while working on something with everything in context, but which doesn’t carry over across sessions) may be somewhat abstract/conceptual/tacit and difficult to put into words in a way that any reader could fully recover, as it’s a non-trivial kind of inverse problem to reconstruct a theory from writing that was produced by that theory. Also, a lot of a theory may be very implicit and hard to fully extract, as it might largely consist of “unknown knowns” rather than explicit pieces of knowledge, and these unknown knowns may only be elicited when certain situations come up (such as someone raising a particular question to test against your theory, which then turns out to touch on a part of it you had never thought about before, but which emerges out of your theory). But even if all this is true, I don’t think improving the theory sharing of coding agents is a futile endeavor, and significant progress may yet be made.
While the comments provide a lot of counterexamples, I think the post still makes a very good point. I’ve done some self-experimentation, see e.g. my melatonin self-RCT, and I’m currently running a ~150-day experiment on several mood + productivity interventions in parallel, and I have to say, the power analysis beforehand is always disappointing. Even at 150 days, I’m basically biting the bullet of low statistical robustness of my findings, as I wasn’t willing to commit to doing this for a year or two. Additionally, this experiment can’t be blinded, so I can’t even be certain I’m measuring more than reporting bias (at least for some metrics). If I’m honest, I’m probably mostly doing it because I love data analysis and just look forward to that part. Ideally, I’ll get some insights out of it, but it’s unlikely they’ll be super surprising rather than just weak evidence roughly in the direction I already expect.
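To give a sense of what “disappointing power analysis” looks like in practice, here is a rough sketch with statsmodels, assuming daily measurements split evenly into intervention and control days and a smallish effect; all numbers are illustrative, not my actual design:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect a small effect (Cohen's d = 0.3) with ~75 intervention days
# vs ~75 control days, two-sided alpha = 0.05:
power = analysis.power(effect_size=0.3, nobs1=75, alpha=0.05, ratio=1.0)
print(f"power ≈ {power:.2f}")   # roughly 0.45 — close to a coin flip on detecting a real effect

# Days per arm needed to reach 80% power for the same effect size:
n_needed = analysis.solve_power(effect_size=0.3, power=0.8, alpha=0.05)
print(f"days per arm ≈ {n_needed:.0f}")   # roughly 175 per arm, i.e. close to a year of daily data
```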
I once heard someone make the argument that self experimentation is worthwhile, but if you need statistical tools to evaluate it, then you’re doing it wrong, and you should rather look for effect sizes large enough that you easily and confidently notice them without calculating p-values. Seems like a valid claim to me. As long as there are high-variance things to try that may work amazingly well for you, it surely often makes sense to prioritize these rather than your average “this may improve my mood by 3%” intervention.
True, it’s possible larger context windows aren’t even needed and 1M is sufficient for the majority of our economy to get automated.
I also think it’s easy to underestimate how much context humans actually gather over the years though. E.g. in my job there’s a huge amount of information I picked up over time. And I never fully know in advance what subset of that information I might need on any given day. It would be futile to even try to write down everything that I know, because much of that knowledge is latent/fuzzy/hard to put in words/seems irrelevant but isn’t necessarily.
To list a few such things:
Company culture and structure
Teams and responsibilities
Many dozens of co-workers, their tenure, skills, personalities, common memories, what they look like, their voices
Dozens of tools, how to use and navigate them, when and why to use them, when and why they were introduced
A huge code base, or at least many many bits and pieces of it
The product(s), including their design, future roadmap and past development, and some known issues and limitations
Context about how users interact with our software
Our competition and how we relate to them
I’d assume that my visual knowledge alone (what products, tools, people, logos etc look like) could fill a significant part of a 1M context window (given the current state of the tech).
I recently tried to compile one really thorough readme for LLMs about one project I had worked on. I think it ended up at around 50k tokens, but it was very far from complete, as I have so much latent knowledge about it that I can’t just easily export it on demand—it just lives somewhere in my brain, stashed away until some situation arises where I actually need it. That said, it’s possible that “the essence” of that knowledge could be compressed to, say, 10-20% of that token count, which would indeed make your argument very plausible.
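(For anyone wanting to check their own readmes: a minimal sketch of how I’d estimate the token count, using tiktoken’s cl100k_base tokenizer as a rough stand-in for whatever model you target; the file name is hypothetical.)

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "project_readme_for_llms.md" is a placeholder file name.
with open("project_readme_for_llms.md", encoding="utf-8") as f:
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens:,} tokens ≈ {n_tokens / 1_000_000:.1%} of a 1M-token context window")
```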
Often, qualitative differences turn out to be quantitative, especially in AI progress. As The Bitter Lesson pointed out in 2019, jumps in capabilities often don’t need some breakthrough or human ingenuity, but merely (much) more of the same, that is, scaled-up compute. And so we went from GPT-2, which could produce English text with mostly flawless grammar but not much more, to the multilingual GPT-3.5 that could write entire essays, to later models that are coming for most white-collar jobs.
This naturally raises the question of which other limitations in AI seem qualitative but end up being pretty much solved by the same thing, just bigger. I wonder about three areas in particular:
Continual learning
Reliability & hallucinations
Multi-modality much closer to the human experience (something like audio-visual with depth- and time perception)
For all of these, it’s tempting to claim that they require some big breakthrough or entirely different approach than LLMs, and that the default would be that these current limitations will pose natural upper bounds to the impact of LLMs on our world. And I can well imagine that certain breakthroughs could greatly accelerate progress in these areas. But I also can’t help but suspect that even without major breakthroughs, we’ll inevitably see serious progress on these fronts anyway.
Continual learning: context window sizes didn’t see the rapid progress of some other areas & benchmarks, but even so, today’s frontier models have ~10x the context window of 2023’s. It’s not the primary thing labs are optimizing, but it seems overwhelmingly likely to me that algorithmic + hardware progress will lead to larger context windows over the years. And if we do reach 10M or 100M token context windows eventually, I wouldn’t be surprised if that (combined with other capability improvements) were sufficient to make in-context learning capable enough to mostly alleviate the need for true continual learning for most economically valuable purposes. Sure, if somebody figures out truly scalable & robust continual learning, then that’s an even bigger deal[1]. But I’d argue that even if this for whatever reason does not come to pass, merely scaling up context window sizes could eventually be sufficient to surpass the “context persistence advantages” of humans.[2]
Reliability & hallucinations: some people assume that LLMs will always hallucinate and it will take a fundamentally different approach to overcome this. Maybe they’re right, but at least in agentic coding we see that if you get the feedback loops right and “tether the model” to some verifiable part of reality, hallucinations mostly become a non-issue. It’s unclear to me how far this will actually work & scale in other areas, and Sam Altman’s prediction from 2023 that two years from then “we won’t still talk about” hallucinations certainly turned out to be incorrect. But I wouldn’t be surprised if some relatively marginal changes, such as forms of embodiment[3] or best-of-n style answers, or whatever other surprisingly simple strategy will be identified in the meantime, end up increasing reliability greatly.
Multi-Modality: in principle, a larger context window might allow just providing an LLM with 100s of images representing some form of livestream from a camera (or two), and appropriate training or reasoning might allow it to “perceive” movement. On the one hand, I’d think that it’s a huge disadvantage for the LLM if the “time modality” is not properly represented in the way its inputs are tokenized[4]. But on the other hand, it still seems conceivable that even such a suboptimal encoding of movement as “100 separate tokenized still images” could be handled by more advanced LLMs well enough to basically solve current limitations of LLM perception[5].
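To make the point of footnote [4] a bit more concrete, here is a small PyTorch sketch (purely illustrative): a Conv2d layer bakes in the prior that nearby pixels matter for each other, whereas a stack of frames passed as independent images gets no analogous prior over time.

```python
import torch
import torch.nn as nn

# Spatial locality prior: each output pixel of a 3x3 convolution depends only on a
# small neighborhood of input pixels.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
frame = torch.randn(1, 3, 64, 64)       # one RGB image
out = conv(frame)
print(out.shape)                         # torch.Size([1, 8, 64, 64])

# Feeding video as separate still images gives the model no comparable prior over time:
frames = torch.randn(16, 3, 64, 64)     # 16 frames treated as independent images in a batch
per_frame = conv(frames)                 # no information flows across the time dimension
print(per_frame.shape)                   # torch.Size([16, 8, 64, 64])
```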
I’m not claiming that any of this is what is going to happen. Multi-modality in particular seems like something labs could expand a lot if it were a priority, but they just happen to focus on other areas that are more lucrative on the current margin. Either way, the point of this post is just that these developments seem to me like a rough lower bound on AI progress. Even if no major breakthroughs occur, I’d still assume we eventually end up
with in-context learning capable enough to surpass humans in many areas where we would currently assume continual learning to be required
with fewer and fewer hallucinations in many areas
and with AI models that can perceive the world in very similar ways to us, in so far as that’s helpful for the area they’re deployed in (and in many ways that may go way beyond the limits of human perception)
[1] And to be fair, my best guess is that continual learning will see some breakthroughs in the next 1-3 years and will essentially get solved.
[2] Somewhat related to this, I also get the impression that much of what’s currently happening in the AI coding landscape (around skills, MCPs, agents/claude.md files, memory, context management...) is to some degree “overfitting” on the current margin of AI capability and will become obsolete in future generations, once LLMs become better at dynamically building & persisting meaningful context themselves. We’re in this fun phase where humans can still teach LLMs a lot to make them more useful, but I highly doubt this phase will last very long.
[3] My thought here being that some form of embodiment “nails” the AI to reality and (directionally) prevents it from spiraling out of control into strange failure modes; of course, it might still turn psychotic for various reasons, but having a constant stream of “reality” would very likely have some grounding influence compared to its current reality, which largely consists of its own thoughts, system prompts, and the ramblings of its conversation partner.
[4] E.g., CNNs seem conceptually nice in that they encode a certain prior about the modality of images, namely that neighboring pixels tend to be more relevant to each other than more distant pixels. Conversely, providing the frames of a video as entirely separate images just seems lacking, as the temporal connection isn’t really encoded, but just kind of “interpreted into it” after the fact.
[5] To name one example of the limitations I mean here: if you’re working on a website and add some subtle animations to improve UX, this is something today’s coding agents have a very hard time testing. They can generally use browsers, click around, and look at different screenshots, but this usually happens “one screenshot at a time” and does not include animations. They can still implement animations, and often do a good job at that, but they’re typically doing this blindly. Any human using the website, on the other hand, would instantly and automatically perceive the animations and notice when they’re off in any considerable way.
I expect that 9/125 rate to climb quickly.
DoxxBench here we come...
About the out-of-distribution game: perhaps they had an internal “competition” for employees to one-shot games and then chose the best ones? I’m not sure what would give you the confidence that such a process would be too cherry-picky for OpenAI. To me this seems like a typical tech-company approach, but who knows.
I agree the general question of the best possible games different LLMs can create is very interesting and informative. My experiences so far have been mixed, but I also micromanaged the process a lot. Turn-based generally seems to be easier than realtime, as they struggle to get the controls to feel right and can’t properly e2e test realtime creations. And prompting these things is also super finicky. My impression is also that their judgement about which things are important in a game is really unreliable—or maybe I’m just too opinionated, hard to say.
“was not associated” tells us more about the sample size than about the effect, though, as far as I can tell, doesn’t it? The 0.82-2.42 CI does not seem very reassuring. Especially given this is just observational—it could well be that people who brush immediately after consuming something that’s bad for their teeth are generally conscientious about their dental health, so if they still end up with worse outcomes in this study (albeit not reaching statistical significance), then brushing quickly after acid intake could potentially be even worse than this CI (weakly) suggests.
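Just to illustrate how wide a CI in that range is relative to the kind of counts that could produce it, here is a rough sketch using the standard log-odds-ratio (Woolf) approximation with made-up numbers, not the study’s data:

```python
import numpy as np

# Hypothetical 2x2 table: rows = brushed soon after acid vs waited,
# columns = dental erosion yes / no.
a, b = 30, 70   # brushed soon: 30 with erosion, 70 without
c, d = 20, 80   # waited:       20 with erosion, 80 without

or_point = (a * d) / (b * c)                    # odds ratio ≈ 1.71
se_log = np.sqrt(1/a + 1/b + 1/c + 1/d)         # Woolf standard error of log(OR)
lo, hi = np.exp(np.log(or_point) + np.array([-1.96, 1.96]) * se_log)
print(f"OR = {or_point:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")   # ≈ (0.89, 3.29)
```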
That said, the measured odds ratios for fruit/acids between meals were so much larger that it might indeed make more sense to focus on these than on the exact timing of brushing.
After using Claude Code for a while, I can’t help but conclude that today’s frontier LLMs mostly meet the bar for what I’d consider AGI—with the exception of two things, that, I think, explain most of their shortcomings:
lack of real multimodality
context window limitations
Most frontier models are marketed as multimodal, but this is often limited to text + some way to encode images. And while LLM vision is OK for many practical purposes, it’s far from perfect, and even if they had perfect sight, being limited to singular images is still a huge limitation[1].
Imagine you, with your human general intelligence, were sitting in a dark room, and were conversing with someone who has a complex, difficult problem to solve, and you do your best to help them. But you can only communicate through a mostly text-based interface that allows this person to send you occasional screenshots or photos. Further imagine that every hour or so you lose your entire memory & mental model of the problem, and find yourself with nothing but a high-level and very lossy summary of what has been discussed before.
I think it’s very likely that under such restrictive circumstances, it’s just very hard to not run into all kinds of failure modes and limitations of capability, even for the undoubtedly general intelligence that is you.
So, in some sense, I’d think that there’s an “intelligence overhang”, where the raw intelligence that exists in these LLMs can’t fully unfold due to modality & context window limitations. These limitations mean that Claude Code et al. don’t yet show the effects on the economy and world as a whole that many would have expected from AGI. But I’d argue it makes sense to decouple the actual “intelligence” from the limiting way in which it’s currently bound to interact with the world—even if, as some might correctly argue, modality & context window are just an inherent property of LLMs. Because this is an important detail about the state of things that, I suppose, is neither part of most of the definitions people gave for AGI in the past, nor of the vague intuitions they had about what the term means.
[1] As opposed to, say, understanding video, including sound and a sense of time. (This is not to say that vision is necessary for general intelligence, of course; but that’s kind of my whole point: the general intelligence is already there, it’s just that the modality + context restrictions mean AI is still much less effective at influencing the world than a “naively” imagined AGI would be.)
It seems to me that narratives are skewed and highly simplified abstractions of (empirical) reality that are then subject to selection pressure, such that the most viral ones (within any subculture) dominate, where virality is often negatively correlated with accuracy. Yet, when hearing narratives from people we like and trust, we humans seem to have deeply ingrained urges to quickly believe them. This becomes most apparent when you hear the narratives other subcultures are spreading that affect you or your beliefs negatively. Hearing the narratives of AI skeptics & ethicists (say about AI water usage, about AI not being “actually intelligent”, or about all AI doomers secretly trying to inflate stock prices) really drove home a Gell-Mann-Amnesia-style realization for me of how deeply flawed narratives tend to be, and that this is very likely true for the narratives I’m affected by (without even realizing these are narratives!).
Narratives are usually a combination of an overly simplistic conclusion about some part of the world paired with radically filtered evidence. (And I guess this claim in itself is a bit of a narrative about narratives)
I agree with you though that narratives may be required to actually do things in the world and pure empiricism will be insufficient.
I read your title and thought “exactly!”. I then read your post and it was pretty much exactly what I expected after reading the title. So, ironically, it seems like you perfectly compressed the state of your mind into a few words. :) But to be fair, that’s probably mostly because we’ve had very similar experiences, and it doesn’t translate to human<->LLM communication.
When vibe-coding, many things go really fast, but I often end up in these cases where the thing I want changed is very nuanced, and I can see that just blurting it out would cause the LLM to do something different from what I have in mind. So I sometimes have to write like 5 paragraphs to describe one relatively small change. Then the LLM comes up with a plan, which I have to read, which again takes time, and sometimes there are 1-2 more details to clear up, so it’s a whole process, and all of this would kind of happen naturally, without me even noticing, if I were writing the code myself.
A year ago I wrote a post in a somewhat similar direction, but the recent months of vibe coding with Opus 4.5 really gave me a new appreciation for all the different bottlenecks that remain. Once “writing code” is automated—which is basically now—it’s not like programmers are instantly replaced (evidently), we just hop on to the next bottleneck below. So, the average programmer will maybe be sped up by some percentage, with only extreme outliers getting a multiple-fold increase in output, and the rest merely shifts to focus on different things in their work. It’s still kind of mindblowing to me that that’s how it is. Perhaps it gets “solved” once the entire stack, from CEO to PM to testers to programmers, is AIs—but then I guess they would also have to communicate via not-flawlessly-efficient means with each other (and sometimes themselves, until continual learning is solved), and would still run into these coordination overhead issues? But I guess all that overhead is less notable when the systems themselves run at 100x our speed and work 24h/day.
Even with a car, there are cases where traffic and/or finding a parking spot can cause huge variance. It really depends on the type of meeting / circumstances of the other people whether it’s worth completely minimizing the risk of being late at the expense of potentially wasting a lot of your own time.
E.g., when I visit somebody at their home, then it will likely be bearable for them to welcome me 10 minutes later. Whereas if we meet at some public space, it may be very annoying for the person to stand around on their own (particularly if the person has social anxiety and gets serious disutility from the experience).
That all being said, probably the majority of minutes by which people are late to things are self-inflicted, and I agree with OP that it generally makes sense to reduce that part (and, more generally, to strive to be a reliable person).
I can relate to a lot of this. But I think in my case the motivation for reinventing the wheel also comes down to fundamentally not enjoying activities like “reading documentation” or generally “understanding what another person has done”. But implementing my own library is usually fun. And I can often justify it to myself (and sometimes others) because it will then match the given use case perfectly and will be exactly as big/complex as needed, rather than being some huge, highly general, universal solution full of bells and whistles we won’t even need. Which can be a real advantage—but it’s also just one side of a trade-off, and I tend to weigh that side more highly than others, probably for rather self-serving reasons.
I once heard from a developer friend that he sometimes just reads things like the Docker documentation for fun in his spare time. It gave me great appreciation for how different people can be and how difficult it really is to overcome the typical mind fallacy… :) I never would have thought people can enjoy that. And now I’m interested in somehow finding that same enjoyment in myself, because I think it would make many things much easier if I could overcome that aversion that keeps pushing me in the direction of reinventing all the wheels.
I’m not sure what you’re hinting at, but in 99.9% of cases when I’m out of the house, I do carry a smartphone around. If you mean that it’s annoying when the display gets confused by water, then I agree that’s a real disadvantage (but I doubt people’s attitude towards being exposed to rain changed that much between 2006 and today, so there certainly is some severe general dislike of rain independent from smartphones). If this is not what you mean, then please elaborate. :)
Agreed, that’s one of the exceptions I was thinking of—if you’re getting soaked and have no way to get into dry clothes anytime soon, there’s little way around finding that rather unpleasant. But I’d say 95% of my rain encounters are way less severe than that, and in these cases, my (previous) attitude towards the rain really was the main issue about the whole situation.
People compare things that are close together in some way. You compare yourself to your neighbors or family, or to your colleagues at work, or to people that do similar work as you do in other companies.
Isn’t one pervasive problem today that many people compare themselves to those they see on social media, often including influencers with a very different lifestyle? So it seems to me that not-so-local comparisons are in fact often made; it primarily depends on what you’re exposed to—which to some degree is indeed the people around you, but nowadays more and more also includes the skewed images that people on the internet (who often don’t even know you exist) broadcast to the world.
But maybe this is also partially your point. Maybe it would theoretically help to expose people a lot to “the reality of the 90s” or something, but I guess it’s a bit of an anti-meme and hence hard to do.
I agree that telling people how well off they are on certain scales is probably not super effective, but I’m still sometimes glad these perspectives exist and I can take them into consideration during tough times.
Firstly, I agree that this post covers many highly relevant considerations for how AI may develop over the next years, and why.
That said, it’s quite meandering, and I feel like it could convey its key points in probably a third of the words. The most likely reason for that is that it seems largely LLM-written, and LLM prose has a tendency to be very verbose, performative, and full of “punchlines” rather than just getting to the point. IMHO, future posts, if you decide to post more on LessWrong, would benefit a lot from having more of your own voice, which, I suspect, would automatically improve the length/focus aspect as well.