Like überall? Maybe jirgendwo vs nüberall are the better words. Like neverywhere or nalways.
silentbob
I agree it makes sense to raise the “does this prove too much” question. But I’d tend to think it doesn’t (as far as your example goes). Three thoughts:
I do think the company in your scenario does have a huge problem. Whether it’s an existential problem for them, or merely a “we’ll have some rough times ahead and might need to take two years to somehow regain a level of competence within our workforce” situation, depends on their circumstances, but I’d say that most companies in most situations will struggle severely when 11 out of 11 experienced software developers spontaneously leave. Well, some companies may be able to just maintain their prior level and be in a comfortable enough spot that the severe slowdown in software development is not a big deal for them. But this is probably more of an exception.
Neither in this case, nor in the case of AI automation, would I call the task impossible. Just very hard. My main goal here was to put the “coding agents are magical and change everything” impression that one can very easily get into context, as I think these magical capabilities don’t easily transfer to larger-scale organizations.
As you already hint at with your last sentence, some of the challenges I mentioned affect LLMs in particular, and hence catching up would, imho, be much more realistic for human developers (at anything close to current capability levels of AI). I’m not sure if larger context window sizes would solve this (I’ve sort of argued before that it might; I’ve since updated somewhat in the opposite direction, but am unsure). I do think that the fact that context windows are stored in text makes them less useful. And while this limitation exists, I think it will always lead to problems, although it’s conceivable that such problems (LLMs subtly misunderstanding things or missing nuances and hence creating worse code or making bad judgment calls) would just not matter all that much and would be outweighed by the advantages. I could imagine that context windows on the order of 10-100M tokens would allow capturing the most important 95-99% of context, if they’re filled wisely and deliberately, but that’s really just spitballing. Such context window sizes are not impossible, but at recent trends, I’d be a bit surprised if we get there sooner than 2-3 years from now. And even when we do: this might still leave other bottlenecks in place, plus it would still require very targeted efforts to utilize these larger context windows properly.
I’m curious though, would you say you can also model “volumetric” 3D in your head, or more the typical shape/surface, e.g. seeing a 3D orange in your head, but only “from the outside”, without having a good detailed intuition about its internal structure?
I think I agree about your detail observation, but these details in my case are still mostly 2D surfaces within 3D space, rather than “true 3D” in the sense I was trying to get at in the post.
On the other side, having an LLM delete your production database or cause something catastrophic seems (I don’t have data on this) to happen way more often than catastrophically bad chatbot conversations.
I also don’t have any reliable data, but I would be very surprised if this were the case. I remember maybe ~3 publicly discussed cases of “deleted a production database”-grade LLM failures, but my impression is that there are probably at least tens of thousands of cases of LLM psychosis or similarly bad/extreme outcomes, and I could well imagine that number to be much higher.
For AGI, none of these constraints may be relevant. Minds can fork and merge. Training can be instant through weight sharing. Coordination happens at silicon speed without contracts. When one AI masters a new domain—say, protein folding or contract law—it won’t need to teach others through language or demonstration. It will simply share the relevant weights, like copying a file. The receiving AI instantly acquires years of “experience” in milliseconds.
I wonder if this actually holds up once continual learning is solved. Currently, I see ~three general ways in which that might potentially happen:
Some form of online weight updating. But that would mean different instances of the same original AI may not be “compatible” anymore in the sense that they could easily share something they learned with each other. The only viable way then would be to create identical clones of an AI that has learned something important (which is still highly useful, of course).
Context windows become so enormously large that AIs can just put an entire career worth of context in there, and in-context learning is strong enough for this alone to surpass the level of humans in most domains. In this case, they could in principle just share the relevant parts from their context window, describing in sufficient detail how to perform some skill perfectly, with another AI, and that might work. But it’s also possible that the way they stored that skill in their context relates in numerous ways to other things they personally have learned, and isolating a particular skill to share it with another AI may not work well, as, e.g., it tends to use words in different ways and thereby generalizes differently from what the provided context contains. (It would still likely work much better/faster than whatever humans do to share knowledge/skills with each other, though)
Perhaps some in-between thing that’s neither on weight level nor in plain language, like some form of persistent memory of embeddings or so. No idea what that might look like in practice, but the blurry image I have of it still looks like it might make it difficult to extract some isolated thing out of it without corrupting it beyond usefulness.
I think it still seems very likely that AIs will be much better than humans at all of this, in many relevant ways, so I agree with your point directionally. I don’t want to rule out that “sharing years of experience in milliseconds” does turn out true. Just wanted to point out that to me, it’s not at all obvious that this will happen, and solving certain problems on the way to AGI may come at the expense of the feasibility of instant skill sharing between AI instances.
For the videos I mentioned, my p(at least some phrases for this came out of an LLM) ranges from maybe 80% (SeaGate) to 97% (Mo Bitar). So I definitely see a chance I may be wrong about one of these samples. But I’d be very surprised if I’m wrong about the general trend and several of the cases I showed turn out to be fully human-written after all.
What makes me confident is the density of these patterns, that most of them occur together in most cases, and that this seems, as far as I can tell, to be a pretty recent development. I’m interested in quantifying that pattern density, and will do so when I find the time.
Of course, one can find any single one of these patterns in writing from before 2022. But I’d assume that it’s very difficult to find text with such a density of all the specific patterns that LLMs show. It could of course be the case that this writing style is just a sort of “persona selection” that occurred during post-training and there really were people speaking like this online in the past, rather than LLMs having been the first entities to truly own that style. But even then: the ubiquity of this style nowadays seems way too high to me to be explained without LLMs being heavily involved in the process.
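To make “pattern density” a bit more concrete, here is a minimal sketch of how one might count such patterns per 1,000 words. The two regexes are hypothetical, rough illustrations of the “Not X, but Y” family and are not validated detectors; the function name `pattern_density` and the pattern names are my own placeholders.

```python
import re

# Hypothetical, hand-written patterns; real LLM tells would need a much
# larger and carefully validated set.
PATTERNS = {
    # "not X <dash> Y": a negation followed by an em dash or "--" in the same sentence
    "not_x_dash_y": re.compile(r"\bnot\b[^.!?\n]{1,80}?(?:\u2014|--)", re.IGNORECASE),
    # "It's not X. It's Y.": a negated sentence directly followed by its correction
    "its_not_its": re.compile(
        r"\bit['\u2019]?s not\b[^.!?\n]{1,80}[.!?]\s+it['\u2019]?s\b", re.IGNORECASE
    ),
}


def pattern_density(text: str, per_words: int = 1000) -> dict[str, float]:
    """Return matches per `per_words` whitespace-separated words for each pattern."""
    n_words = len(text.split())
    if n_words == 0:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: len(rx.findall(text)) * per_words / n_words
        for name, rx in PATTERNS.items()
    }


if __name__ == "__main__":
    sample = "It's not a tool. It's a partner. This is not hype -- it is real."
    print(pattern_density(sample))
```

Comparing these numbers between pre-2022 texts and recent ones would be one way to check the “recent development” impression, though simple regexes will produce both false positives and misses.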
Interesting! If you come across any such examples, I’d be very curious to see them.
I guess it depends on the alternative. The writing of many inexperienced writers will surely get better in all kinds of ways. But the writing I would want to read almost certainly gets worse in ways that I care about. LLM writing to me almost always feels very thin and “style over substance” (and then even in a style I grew to dislike). Naturally, writing is a very high-dimensional thing, and “things different people value in writing” equally so. So there will be different answers for different writers and readers. To me, the negatives are:
It’s a less accurate representation of the author’s thinking (assuming thinking on the author’s side took place)
It tends to be full of “empty sentences”, hedging, shallow examples
Writing all over the world gets heavily correlated
I subjectively find the style annoying
It potentially robs us of a useful signal of who expends actual effort in their work
The “Not X—Y” pattern in particular often seems quite useless. The “not”-part could usually just be omitted without making things worse. It’s rarely something that people would have thought that needs to be corrected. And in the rare cases that it is: why not write in a way to avoid such misconceptions to begin with instead of repeatedly creating and then correcting them? Occasionally, it can be useful to get people to a certain notion and then correct it, as a rhetorical or pedagogical move, but certainly not 5x within any given text.
Although it was capabilities gapped a few years ago, at this point it’s trivially easy to apply a bit of prompting creativity to bypass 99% of people’s slop detectors on various social media platforms.
Can you elaborate? My impression is that at least Claude models struggle immensely to avoid their typical way of speaking (which I find annoying as hell), and I never managed to find a prompt that works to avoid that.
Firstly, I agree that this post covers many highly relevant considerations for how AI may develop over the next years, and why.
That said, it’s quite meandering and I feel like it could convey its key points in probably a third of the words. And the most likely reason for that is that it seems largely LLM-written, with LLM prose having a tendency to be very verbose, performative and full of “punchlines”, rather than just getting to the point. Imho future posts, if you decide to post more on lesswrong, would benefit a lot from having more of your own voice, which, I suspect, would automatically improve the length/focus part as well.
This resonates a lot. I once wrote about what I’d say is a specific instantiation of this concept, namely that people often ask questions that appear to have a binary return type, when that is, in fact, just the wrong return type to look for. Many binary-sounding questions actually require a return type of “function that takes a potentially large number of input values and returns a boolean [or sometimes even something more complex than a boolean]”. When looking for and debating binary answers to such questions, one sweeps the most interesting parts of the answer under the rug.
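A small sketch of the two “return types”, using a made-up question (“should we rewrite this code base?”). The `Situation` fields and the decision rule are purely hypothetical and only there to show the shape of the answer, not to suggest an actual rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Situation:
    """Hypothetical inputs the answer might depend on."""
    team_size: int
    codebase_age_years: float
    has_test_suite: bool


# Binary framing: the question is treated as if it had one global answer.
def should_rewrite_naive() -> bool:
    return False


# Function framing: the "answer" is itself a mapping from circumstances to a boolean.
Answer = Callable[[Situation], bool]


def should_rewrite(s: Situation) -> bool:
    # Illustrative rule only; the interesting debate is about this body,
    # which the binary framing hides entirely.
    return s.has_test_suite and s.team_size <= 5 and s.codebase_age_years >= 10


answer: Answer = should_rewrite
```

Arguing about whether `should_rewrite_naive` should return `True` or `False` skips exactly the part that matters, namely which inputs the function needs and how they interact.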
Two more points I’d add:
Diminishing returns in code produced. When code production is limited, one naturally works only on those pieces of code that are most useful. But if you can create 10x more code, you likely don’t reap 10x the benefits. I notice this a lot in private projects: I quickly run out of ideas of what to even code, because finding anything that has more value than the time it would take to prompt it becomes non-trivial relatively quickly. Maybe somewhat of a skill issue, but I think this effect will always be present to some degree.
Code doesn’t live in isolation. It often requires changes outside of the code to go along with it. So, often you can’t just speedrun the code to its final form, but many things around it are keeping it in place to some degree, or have to slowly advance alongside the code. Things like user expectations & knowledge, deployment, code in other projects that may depend on it and are outside of your control, general project planning / decisions, and so on. It reminds me of my observation about parameter updates during gradient descent here:
While most parameters have “settled down” after the first few hundred epochs, clearly some of them are still on a mission, though. I was also a bit surprised how smoothly many of them are moving. [...] many parameters are systematically drifting into one direction, albeit now in a more noisy fashion. It gives me “just update all the way bro!” vibes—but perhaps this is a bit of a coordination problem: multiple parameters have to move more or less in unison, and none of them can reasonably update further even though the longer term trajectory is clear.
I agree, and this seems like a special case of what I sometimes think of as “extrapolating too far”, which also occurs in reasoning of all kinds quite often and particularly when discussing the future.
An example would be the assumption that some scarce material resource eventually just “runs out” more or less suddenly, which people sometimes argue. In such cases, it’s almost always the case that scarcity is gradually increasing and plays into a feedback loop of a search for alternatives. But if one just extrapolates the “this resource will eventually run out” idea in isolation, without taking into account the changes this has on the rest of the world and the relevant process, one can get to the conclusion that it eventually just “hits 0”.
Of course, extrapolating in more reasonable ways is often extremely difficult, as the systems involved are difficult/impossible to fully predict. But when one does isolated extrapolations, it’s at least helpful to keep in mind that this necessarily comes with simplifications that won’t hold, and that the world around the particular thing in question will likely react to the changes that the extrapolation entails. Or, as in your post, the extrapolated thing itself doesn’t follow a linear trajectory to begin with.
Some related observations I’ve made over the last months:
most new projects start out fast and easy with simple prompts leading to swift progress; but after a few iterations, I usually end up in situations where just explaining my ideas for the next marginal improvement becomes more and more time-consuming, and then many projects usually stall, as even with vibe-coding I reach that point where investing the effort to think about & explain the next steps becomes so complex it’s not worth it, given the expected payoffs (this, of course, depends greatly on the project; if it’s a product that thousands or millions of people use, the equation is different than if it’s just some gimmick for personal use)
back when I actually wrote code, there was a kind of two-way communication between “product level” and “code level”: sometimes, certain abstractions in the code just made it very worthwhile to add certain features that otherwise maybe wouldn’t have occurred to me. This “code → product” communication channel has almost entirely disappeared now. I’m not sure how important this is, but it certainly makes me a little sad, because these “oh I can write the code in this way and then this opens all kinds of doors and naturally allows us to do X!” moments used to be very nice.
Even a few years ago, in my software job I think I spent maybe ~40% of my time, often less, on actual coding tasks, and the rest was happening somewhere outside of my IDE. While some of these other tasks can be done via Claude Code as well, using MCPs, that hasn’t had a big impact yet on most of my activities. So, while the 40% of my prior work is now sped up by some considerable factor, much of the rest is still just as slow as before.
Other activities have indeed grown somewhat, because other people are using Claude Code as well and increasing their output. If more happens in my organization, then more coordination is needed. To some degree that’s code reviews (although this is somewhat sped up by AI tools as well), but also just the process of having to refine & understand more tickets if we are working through them more quickly.
I suspect that some of that coordination is less efficient than before, because our developers have a less well-rounded theory of the code now, meaning they’re less quick to provide informed judgment on the fly. I don’t have much evidence of this besides personally having this impression of myself and having a reasonably strong prior that many others will have a similar experience.
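The “40% coding, rest unchanged” observation above is essentially Amdahl’s law. A minimal sketch, assuming the non-coding 60% does not get faster at all (which, given the extra coordination, may even be optimistic); the speedup factors are illustrative, not measurements.

```python
def overall_speedup(fraction_sped_up: float, factor: float) -> float:
    """Amdahl's law: total speedup when only `fraction_sped_up` of the work
    becomes `factor` times faster and the rest stays as slow as before."""
    if not 0.0 <= fraction_sped_up <= 1.0:
        raise ValueError("fraction_sped_up must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction_sped_up) + fraction_sped_up / factor)


if __name__ == "__main__":
    # 40% of the job is coding; try a few hypothetical coding speedups.
    for factor in (2.0, 5.0, 10.0, float("inf")):
        print(f"coding {factor}x faster -> job {overall_speedup(0.4, factor):.2f}x faster")
```

Even with infinitely fast coding, the overall speedup under this assumption is capped at 1 / 0.6, i.e. roughly 1.67x.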
I agree with many of your thoughts and considerations, but end up at the opposing prediction—I do think that coding agents will very likely improve fast enough that the problem of decaying vibe-coded code bases will be outpaced by their abilities in many cases. Naturally, I don’t think this is true across the board, but as a general trend, this seems likely to me. For the following reasons:
development since Opus 4.5 has been extremely fast, with many dimensions seeing improvements, from the models themselves, to their harnesses, to the UX & how people interact with them, to skills extending their capabilities in all kinds of ways; while it’s possible the progress of the recent ~5 months won’t continue at this rapid pace, there are so many axes along which improvement is happening that a severe slowdown would surprise me
I tend to think that one of the main advantages of good theory about a code base is that it allows you to make good predictions, which is e.g. useful when you’re in a meeting with other stakeholders and they need a quick assessment of how much effort different features/changes would entail, or what risks are involved. Perhaps processes adapt in ways that make this advantage less important.
Coding agents most likely will eventually get better at building persistent & sharable representations of theory[1], if this is indeed as important as one might think (which I’m not so sure about)
I think people overestimate how much theory is really present in human-built code bases in big organizations; there is so much movement/churn involved, people switch teams, leave the company, work on new things and forget their old code. How common is it that things “built as a prototype” end up making it to production, that hacky solutions are shipped and corners are cut on all sides, compromising the theory on all ends? I think it’s highly common in many places. Of course there are always some experienced people you can point at who have good theory of their part of their code, and they’ll be principled about it and often say “no we don’t do it like that”, and I suppose that’s often indeed a good thing. But still, I’d assume that 90% of many production code bases are already pretty much theory-less even without vibe coding. Vibe coding will increase this share, but it also comes with better ways of dealing with theory-less code by, e.g., making it vastly more efficient to gain knowledge about how some unknown piece of code works and how it relates to other parts of the system.
I do find it very uncomfortable, in some ways, to rely on AI tools more and more in coding. It worries me to lose my grip on the theory. It feels like a dangerous route to take, and nobody can say with certainty if it’s worth the risk. I’m also worried about my skills atrophying with every instance of asking Claude to implement something that I could also do myself, if I had a little more patience—and what am I really contributing, when the skills I’ve built over decades are not something I’m using anymore in my daily work? And maybe I’m just telling myself that “learning AI tools now is important so I should use them all the time” because that’s a convenient excuse to do less mental work myself. But even then, after the development of the past few months, I can’t help but feel that insisting on humans having to think about code at all a year from now seems to vastly underestimate the trajectory that we’re seemingly on. (But then again, it wouldn’t be the first time I overestimated a recent trend and would then be surprised by it slowing down against my expectations—so I guess I’m leaning 60:40 towards my claims here being broadly in the right direction, and feel generally highly uncertain about where things are headed over the next 1-2 years)
- ^
Admittedly, part of this theory that people maintain in their heads (and that coding agents may potentially have while working on something and having everything in context (but not carry over across sessions)) may be somewhat abstract/conceptual/tacit and difficult to just put into words in a way that any reader could fully recover, as it’s a non-trivial kind of inverse problem to reconstruct a theory based on writing that was produced by that theory. Or, a lot of a theory may be very implicit and hard to fully extract, as it might for a large part consist of “unknown knowns” rather than explicit pieces of knowledge, and these unknown knowns may only be elicited when certain situations come up (such as, someone raising a particular question to test against your theory, and then it turns out to have a piece that relates to that question that you never before thought about but which then emerges out of your theory). But even if all this is true, I don’t think improving theory sharing of coding agents is a futile endeavor, and significant progress may yet be made.
While the comments provide a lot of counterexamples, I think the post still makes a very good point. I’ve done some self experimentation, see eg my melatonin self RCT, and I’m currently running a ~150 day experiment on several mood + productivity interventions in parallel, and I have to say, the power analysis beforehand is always disappointing. Even at 150 days, I’m basically biting the bullet of low statistical robustness of my findings, as I wasn’t willing to commit to doing this for a year or two. Additionally, this experiment can’t be blinded, so I can’t even be certain I’m measuring more than reporting bias (at least for some metrics). If I’m honest, I’m probably mostly doing it because I love data analysis and just look forward to that part. Ideally, I’ll get some insights out of it, but it’s unlikely they’ll be super surprising rather than just weak evidence roughly in the direction I already expect.
I once heard someone make the argument that self experimentation is worthwhile, but if you need statistical tools to evaluate it, then you’re doing it wrong, and you should rather look for effect sizes large enough that you easily and confidently notice them without calculating p-values. Seems like a valid claim to me. As long as there are high-variance things to try that may work amazingly well for you, it surely often makes sense to prioritize these rather than your average “this may improve my mood by 3%” intervention.
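To illustrate why the power analysis tends to disappoint, here is a rough sketch of the smallest standardized effect (Cohen’s d) detectable with a given number of days per condition. It assumes a simple two-condition comparison with independent daily measurements, a two-sided test, and a normal approximation; my actual experiment has several parallel interventions, so this is only the general shape of the calculation, and the function name is a placeholder.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_effect(n_per_arm: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate smallest Cohen's d detectable in a two-sample comparison
    (normal approximation, equal group sizes, two-sided test)."""
    if n_per_arm <= 0:
        raise ValueError("n_per_arm must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2.0 / n_per_arm)


if __name__ == "__main__":
    # Hypothetical splits: 150 days or 2 years, half on and half off an intervention.
    for days in (150, 730):
        d = minimal_detectable_effect(days // 2)
        print(f"{days} days total -> minimal detectable d of about {d:.2f}")
```

Under these assumptions, 150 days only reliably detects effects of close to half a standard deviation, which is far larger than a typical “improves my mood by 3%” intervention and fits the argument for prioritizing high-variance things with obviously noticeable effects.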
True, it’s possible larger context windows aren’t even needed and 1M is sufficient for the majority of our economy to get automated.
I also think it’s easy to underestimate how much context humans actually gather over the years though. E.g. in my job there’s a huge amount of information I picked up over time. And I never fully know in advance what subset of that information I might need on any given day. It would be futile to even try to write down everything that I know, because much of that knowledge is latent/fuzzy/hard to put in words/seems irrelevant but isn’t necessarily.
To list a few such things:
Company culture and structure
Teams and responsibilities
Many dozens of co-workers, their tenure, skills, personalities, common memories, what they look like, their voices
Dozens of tools, how to use and navigate them, when and why to use them, when and why they were introduced
A huge code base, or at least many many bits and pieces of it
The product(s) including future roadmap and past development, some known issues and limitations, their design
Context about how users interact with our software
Our competition and how we relate to them
I’d assume that my visual knowledge alone (what products, tools, people, logos etc look like) could fill a significant part of a 1M context window (given the current state of the tech).
I recently tried to compile one really thorough readme for LLMs about one project I had worked on. I think it ended up at around 50k tokens, but it was very far from complete, as I have so much latent knowledge about it that I can’t just easily export on demand; it just lives somewhere in my brain, stashed away until some situation arises where I actually need it. That said, it’s possible that “the essence” of that knowledge could be compressed to, say, 10-20% of the number of tokens, which would indeed make your argument very plausible.
Often, qualitative differences turn out to be quantitative, especially in AI progress. As The Bitter Lesson pointed out in 2019, jumps in capabilities often don’t need some breakthrough or human ingenuity, but merely (much) more of the same, that is, scaling up the compute. And so we went from GPT-2, which could produce English text with mostly flawless grammar but not much more, to the multilingual GPT-3.5 that could write entire essays, to later models that are coming for most white collar jobs.
This naturally raises the question of which other limitations exist in AI that seem qualitative, but end up being pretty much solved by the same thing but bigger. I wonder about three areas in particular:
Continual learning
Reliability & hallucinations
Multi-modality much closer to the human experience (something like audio-visual with depth and time perception)
For all of these, it’s tempting to claim that they require some big breakthrough or entirely different approach than LLMs, and that the default would be that these current limitations will pose natural upper bounds to the impact of LLMs on our world. And I can well imagine that certain breakthroughs could greatly accelerate progress in these areas. But I also can’t help but suspect that even without major breakthroughs, we’ll inevitably see serious progress on these fronts anyway.
Continual learning: context window sizes didn’t see the rapid progress of some other areas & benchmarks, but even then, today’s frontier models have ~10x the context window of 2023. It’s not the primary thing labs are optimizing, but it seems overwhelmingly likely to me that algorithmic + hardware progress will lead to larger context windows over the years. And if we do reach 10M or 100M token context windows eventually, I wouldn’t be surprised if that (combined with other capability improvements) will be sufficient to make in-context learning capable enough to mostly alleviate the need for true continual learning for most economically valuable purposes. Sure, if somebody figures out true scalable & robust continual learning, then that’s an even bigger deal[1]. But I’d argue that even if this for whatever reason does not come to pass, merely scaling up context window sizes could eventually be sufficient to surpass the “context persistence advantages” of humans.[2]
Reliability & hallucinations: some people assume that LLMs will always hallucinate and it will take a fundamentally different approach to overcome this. Maybe they’re right, but at least in agentic coding we see that if you get the feedback loops right and “tether the model” to some verifiable part of reality, hallucinations mostly become a non-issue. It’s unclear to me how far this will actually work & scale in other areas, and Sam Altman’s prediction from 2023 that two years from then “we won’t still talk about” hallucinations certainly turned out to be incorrect. But I wouldn’t be surprised if some relatively marginal changes, such as forms of embodiment[3] or best-of-n style answers, or whatever other surprisingly simple strategy will be identified in the meantime, end up increasing reliability greatly.
Multi-modality: in principle, a larger context window might allow just providing an LLM with hundreds of images representing some form of livestream from a camera (or two), and appropriate training or reasoning might allow it to “perceive” movement. On the one hand, I’d think that it’s a huge disadvantage for the LLM if the “time modality” is not properly represented in the way its inputs are tokenized[4]. But on the other hand, it still seems conceivable that even such a suboptimal encoding of movement as “100 separate tokenized still images” could be handled by more advanced LLMs well enough to basically solve current limitations of LLM perception[5].
I’m not claiming that any of this is what is going to happen. Multi-modality in particular seems like something labs could expand a lot if it were a priority, but they just happen to focus on other areas that are more lucrative on the current margin. Either way, the point of this post is just that these developments may be something of a lower bound of AI progress. Even if no major breakthroughs occur, I’d still assume we eventually end up
with in-context learning capable enough to surpass humans in many areas where we would currently assume continual learning to be required
with fewer and fewer hallucinations in many areas
and with AI models that can perceive the world in very similar ways to us, in so far as that’s helpful for the area they’re deployed in (and in many ways that may go way beyond the limits of human perception)
- ^
And to be fair, my best guess is that continual learning will see some breakthroughs in the next 1-3 years and will essentially get solved.
- ^
Somewhat relatedly, I also get the impression that much of what’s currently happening in the AI coding landscape (around skills, MCPs, agents/claude.md files, memory, context management...) is to some degree “overfitting” on the current margin of AI capability and will become obsolete in future generations, once LLMs become better at dynamically building & persisting meaningful context themselves. We’re in this fun phase, where humans can still teach LLMs a lot to make them more useful, but I highly doubt this phase will last very long.
- ^
My thought here being that some form of embodiment “nails” the AI to reality and (directionally) prevents it from spiraling out of control in strange failure modes; of course, it might still turn psychotic for various reasons, but having a constant stream of “reality” would almost certainly have some grounding influence compared to its current reality that largely consists of its own thoughts, system prompts and the ramblings of its conversation partner.
- ^
E.g., CNNs seem conceptually nice in that they encode a certain prior about the modality of images, in that neighboring pixels tend to be more relevant for each other than more distant pixels. Similarly and reversed, providing frames of a video as entirely separate images just seems lacking, as the temporal connection isn’t really encoded, but just kind of “interpreted into it” after the fact.
- ^
To name one example of the limitations I mean here: if you’re working on a website and add some subtle animations to improve UX, this is something today’s coding agents have a very hard time testing. They can generally use browsers, click around, look at different screenshots, but this usually happens “one screenshot at a time” and does not include animations. They can still implement animations, and often do a good job at that, but they’re typically doing this blindly. Any human, on the other hand, who would use this website, would instantly and automatically perceive animations, and notice when they’re off in any considerable way.
I expect that 9⁄125 rate to climb quickly
DoxxBench here we come...
Are we entering the age of crime slop?