METR released a new paper with very interesting results on developer productivity effects from AI. I have copied the blogpost accompanying that paper here in full.
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation [1].
See the full paper for more detail.
Motivation
While coding/agentic benchmarks [2] have proven useful for understanding AI capabilities, they typically sacrifice realism for scale and efficiency—the tasks are self-contained, don’t require prior context to understand, and use algorithmic evaluation that doesn’t capture many important capabilities. These properties may lead benchmarks to overestimate AI capabilities. In the other direction, because benchmarks are run without live human interaction, models may fail to complete tasks despite making substantial progress, because of small bottlenecks that a human would fix during real usage. This could cause us to underestimate model capabilities. Broadly, it can be difficult to directly translate benchmark scores to impact in the wild.
One reason we’re interested in evaluating AI’s impact in the wild is to better understand AI’s impact on AI R&D itself, which may pose significant risks. For example, extremely rapid AI progress could lead to breakdowns in oversight or safeguards. Measuring the impact of AI on software developer productivity gives complementary evidence to benchmarks that is informative of AI’s overall impact on AI R&D acceleration.
Methodology
To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years. Developers provide lists of real issues (246 total) that would be valuable to the repository—bug fixes, features, and refactors that would normally be part of their regular work. Then, we randomly assign each issue to either allow or disallow use of AI while working on the issue. When AI is allowed, developers can use any tools they choose (primarily Cursor Pro with Claude 3.5/3.7 Sonnet—frontier models at the time of the study); when disallowed, they work without generative AI assistance. Developers complete these tasks (which average two hours each) while recording their screens, then self-report the total implementation time they needed. We pay developers $150/hr as compensation for their participation in the study.
Core Result
When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
Below, we show the raw average developer forecasted times, and the observed implementation times—we can clearly see that developers take substantially longer when they are allowed to use AI tools.
Given both the importance of understanding AI capabilities/risks, and the diversity of perspectives on these topics, we feel it’s important to forestall potential misunderstandings or over-generalizations of our results. We list claims that we do not provide evidence for in Table 2.
| We do not provide evidence that: | Clarification |
|---|---|
| AI systems do not currently speed up many or most software developers | We do not claim that our developers or repositories represent a majority or plurality of software development work |
| AI systems do not speed up individuals or groups in domains other than software development | We only study software development |
| AI systems in the near future will not speed up developers in our exact setting | Progress is difficult to predict, and there has been substantial AI progress over the past five years [3] |
| There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting | Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup |
Factor Analysis
We investigate 20 potential factors that might explain the slowdown, finding evidence that 5 likely contribute.
We rule out many experimental artifacts—developers used frontier models, complied with their treatment assignment, didn’t differentially drop issues (e.g. dropping hard AI-disallowed issues, reducing the average AI-disallowed difficulty), and submitted similar quality PRs with and without AI. The slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data. See the paper for further details and analysis.
Discussion
So how do we reconcile our results with impressive AI benchmark scores, and anecdotal reports of AI helpfulness and widespread adoption of AI tools? Taken together, evidence from these sources gives partially contradictory answers about the capabilities of AI agents to usefully accomplish tasks or accelerate humans. The following table breaks down these sources of evidence and summarizes the state of our evidence from these sources. Note that this is not intended to be comprehensive—we mean to very roughly gesture at some salient important differences.
| | Our RCT | Benchmarks like SWE-Bench Verified, RE-Bench | Anecdotes and widespread AI adoption |
|---|---|---|---|
| Task type | PRs from large, high-quality open-source codebases | SWE-Bench Verified: open-source PRs with author-written tests; RE-Bench: manually crafted AI research problems with algorithmic scoring metrics | Diverse |
| Task success definition | Human user is satisfied code will pass review—including style, testing, and documentation requirements | Algorithmic scoring (e.g. automated test cases) | Human user finds code useful (potentially as a throwaway prototype or ~single-use research code) |
| AI type | Chat, Cursor agent mode, autocomplete | Typically fully autonomous agents, which may sample millions of tokens, use complicated agent scaffolds, etc. | Various models and tools |
| Observations | Models slow down humans on 20min-4hr realistic coding tasks | Models often succeed at benchmark tasks that are very difficult for humans | Many people (although certainly not all) report finding AI very helpful for substantial software tasks taking them >1hr, across a wide range of applications. |
Reconciling these different sources of evidence is difficult but important, and in part it depends on what question we’re trying to answer. To some extent, the different sources represent legitimate subquestions about model capabilities—for example, we are interested in understanding model capabilities both given maximal elicitation (e.g. sampling millions of tokens or tens/hundreds of attempts/trajectories for every problem) and given standard/common usage. However, some properties can make the results invalid for most important questions about real-world usefulness—for example, self-reports may be inaccurate and overoptimistic.
Here are a few of the broad categories of hypotheses for how these observations could be reconciled that seem most plausible to us (this is intended to be a very simplified mental model):
- Summary of observed results: AI slows down experienced open-source developers in our RCT, but demonstrates impressive benchmark scores and anecdotally is widely useful.
- Hypothesis 1 (our RCT underestimates capabilities): benchmark results and anecdotes are basically correct, and there is some unknown methodological problem or property of our setting that differs from other important settings.
- Hypothesis 2 (benchmarks and anecdotes overestimate capabilities): our RCT results are basically correct, and the benchmark scores and anecdotal reports are overestimates of model capability (possibly each for different reasons).
- Hypothesis 3 (complementary evidence for different settings): all three methodologies are basically correct, but each measures a subset of the “real” task distribution that is more or less challenging for models.
In these sketches, red differences between a source of evidence and the “true” capability level of a model represent measurement error or biases that cause the evidence to be misleading, while blue differences (i.e. in the “Mix” scenario) represent valid differences in what different sources of evidence represent, e.g. if they are simply aiming at different subsets of the distribution of tasks.
Using this framework, we can consider evidence for and against various ways of reconciling these different sources of evidence. For example, our RCT results are less relevant in settings where you can sample hundreds or thousands of trajectories from models, which our developers typically do not try. It also may be the case that there are strong learning effects for AI tools like Cursor that only appear after several hundred hours of usage—our developers typically only use Cursor for a few dozen hours before and during the study. Our results also suggest that AI capabilities may be comparatively lower in settings with very high quality standards, or with many implicit requirements (e.g. relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn.
On the other hand, benchmarks may overestimate model capabilities by only measuring performance on well-scoped, algorithmically scorable tasks. And we now have strong evidence that anecdotal reports/estimates of speed-up can be very inaccurate.
No measurement method is perfect—the tasks people want AI systems to complete are diverse, complex, and difficult to rigorously study. There are meaningful tradeoffs between methods, and it will continue to be important to develop and use diverse evaluation methodologies to form a more comprehensive picture of the current state of AI, and where we’re heading.
Going Forward
We’re excited to run similar versions of this study in the future to track trends in speedup (or slowdown) from AI, particularly as this evaluation methodology may be more difficult to game than benchmarks. If AI systems are able to substantially speed up developers in our setting, this could signal rapid acceleration of AI R&D progress generally, which may in turn lead to proliferation risks, breakdowns in safeguards and oversight, or excess centralization of power. This methodology gives complementary evidence to benchmarks, focused on realistic deployment scenarios, which helps us understand AI capabilities and impact more comprehensively compared to relying solely on benchmarks and anecdotal data.
Get in touch!
We’re exploring running experiments like this in other settings—if you’re an open-source developer or company interested in understanding the impact of AI on your work, reach out.
I was one of the developers in the @METR_Evals study. Some thoughts:
1. This is much less true of my participation in the study where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run.
I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way.
2. Historically I’ve lost some of my AI speed-ups to cleaning up the same issues LLM code would introduce, often relatively simple violations of code conventions like using || instead of ?? (small sketch below).
A bunch of this is avoidable with stored system prompts, which I was lazy about writing. Cursor has now made this easier and even attempts to learn repeatable rules (“The user prefers X”) that will get reused, saving time here.
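For concreteness, a minimal, hypothetical sketch of the || vs ?? difference (not code from the study): `||` falls back on any falsy value (`0`, `""`, `false`), while `??` only falls back on `null`/`undefined`.

```typescript
// Hypothetical example: choosing a default page size.
// `||` treats every falsy value (0, "", false, NaN) as "missing";
// `??` only falls back when the value is null or undefined.

function pageSizeWithOr(configured?: number): number {
  return configured || 50; // bug: an explicit 0 is silently replaced by 50
}

function pageSizeWithNullish(configured?: number): number {
  return configured ?? 50; // 0 is respected; only null/undefined fall back to 50
}

console.log(pageSizeWithOr(0));      // 50 (probably not what the caller intended)
console.log(pageSizeWithNullish(0)); // 0
```

A codebase that standardizes on `??` will flag the first form in review, which is exactly the kind of small cleanup that eats into time saved.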
3. Regarding me specifically, I work on the LessWrong codebase which is technically open-source. I feel like calling myself an “open-source developer” has the wrong connotations, and makes it more sound like I contribute to a highly-used Python library or something as an upper-tier developer which I’m not.
4. As a developer in the study, it’s striking to me how much more capable the models have gotten since February (when I was participating in the study).
I’m trying to recall if I was even using agents at the start. Certainly the later models (Opus 4, Gemini 2.5 Pro, o3) could just do vastly more with less guidance than 3.6, o1, etc.
For me, without having gone over my own data from the study, I could buy that maybe I was being slowed down a few months ago, but it is much, much harder to believe now.
5. There was a selection effect in which tasks I submitted to the study. (a) I didn’t want to risk getting randomized to “no AI” on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study, while large open-ended greenfield stuff felt harder to legibilize, so I didn’t submit those tasks to the study even though the AI speed-up might have been larger.
6. I think if the result is valid at this point in time, that’s one thing; but if people are still citing it in another 3 months’ time, they’ll be making a mistake (and I hope METR will have published a follow-up by then).
Apologies for the impoliteness, but… man, it sure sounds like you’re searching for reasons to dismiss the study results. Which sure is a red flag when the study results basically say “your remembered experience is that AI sped you up, and your remembered experience is unambiguously wrong about that”.
Like, look, when someone comes along with a nice clean study showing that your own brain is lying to you, that has got to be one of the worst possible times to go looking for reasons to dismiss the study.
I’m not pushing against the study results so much as what I think are misinterpretations people are going to make off of this study.
If the claim is “on a selected kind of task, developers in early 2025 predominantly using models like Claude 3.5 and 3.7 were slowed down when they thought they were sped up”, then I’m not dismissing that. I don’t think the study is that clean or unambiguous given methodological challenges, but I find it quite plausible.
In the above, I do only the following: (1) offer explanation for the result, (2) point out that I individually feel misrepresented by a particular descriptor use, (3) point out and affirm points the authors also make (a) about this being point-in-time, (b) there being selection effects at play.
You can say that if I feel now that I’m being sped up, I should be less confident given the results, and to that I say yeah, I am. And I’m surprised by the result too.
There’s a claim you’re making here that “I went looking for reasons” that feels weird. I don’t take it that whenever a result is “your remembered experience is wrong”, I’m being epistemically unvirtuous if I question it or discuss details. To repeat, I question the interpretation/generalization some might make rather than the raw result or even what the authors interpret it as, and I think as a participant I’m better positioned to notice the misgeneralization than just hearing the headline result (people reading the actual paper probably end up in the right place).
FWIW, for me the calculated individual speedup was [-200%, 40%], which, while it does weight predominantly toward the negative (I got these numbers from the authors after writing my above comments), I’m not sure counts as unambiguously wrong about my remembered experience.
I think it doesn’t—our dev effects are so, so noisy!
The temptation to multitask strikes me as a very likely cause of the loss of productivity. It is why I virtually never use reasoning models except for deep research.
I (study author) responded to some of Ruby’s points on twitter. Delighted for devs including Ruby to discuss their experience publicly, I think it’s helpful for people to get a richer sense!
It sounds like both the study authors themselves and many of the comments are trying to spin this study in the narrowest possible way for some reason, so I’m gonna go ahead and make the obvious claim: this result in fact generalizes pretty well. Beyond the most incompetent programmers working on the most standard cookie-cutter tasks with the least necessary context, AI is more likely to slow developers down than speed them up. When this happens, the developers themselves typically think they’ve been sped up, and their brains are lying to them.
And the obvious action-relevant takeaway is: if you think AI is speeding up your development, you should take a very close and very skeptical look at why you believe that.
I agree. I’ve been saying for a while that LLMs are highly optimized to seem useful, and people should be very cautious about assessing their usefulness for that reason. This seems like strong and unambiguous positive evidence for that claim. And a lot of the reaction does seem like borderline cope—this is NOT what you would expect to see in AI 2027 like scenarios. It is worth updating explicitly!
I think that people are trying to say “look AI is progressing really fast, we shouldn’t make the mistake of thinking this is a fundamental limitation.” That may be, but the minimal thing I’m asking here is to actually track the evidence in favor of at least one alternative hypothesis: LLMs are not as useful as they seem.
Gemini seemed useful for research and pushed me in the other direction. But lately there have been some bearish signs for LLMs (bullish for survival). Claude Opus 4 is not solving longer-time-horizon tasks than o3. Agency on things like Pokémon, the experiment on running a vending machine, and NetHack is still not good. And Grok 3 is so toxic that I think this is best viewed as a capabilities problem which I personally would expect to be solved if AGI were very near. Also, reasoning models seem to see INCREASED hallucinations.
My P(doom) has dropped back from 45% to 40% on these events.
I’m mostly cautious about overupdating here, because it’s too pleasant (and personally vindicating) a result to see. But yeah, I would bet on this generalizing pretty broadly.
I use language models to help me design systems, not by asking them to solve problems, but by discussing my ideas with them. I have an idea of how to do something, usually vague, half-formed. I use automatic speech recognition to just ramble about it, describing the idea in messy, imprecise language. The language model listens and replies with a clearer, more structured version. I read or listen to that and immediately see what’s missing, or what’s wrong, or what’s useful. Then I refine the idea further. This loop continues until the design feels solid.
The model doesn’t invent the solution. It refines and reflects what I’m already trying to express. That’s the key. It doesn’t act as an agent; it’s not writing the code or proposing speculative alternatives. It helps me pin down what I’m already trying to do, but better, faster, and with much less friction than if I were doing it alone.
I mostly don’t use autocomplete. I don’t ask for “write this function.” (Though I think there is a correct way to use these.) Instead, I might say something like: “Right now I have this global state that stores which frame to draw for an animation. But that feels hacky. What if I want to run multiple of these at the same time? Maybe I can just make it a function of time. Like, if I have a function that, given time, tells me what to draw, then I don’t need to store any state. That would probably work. Is there any reason this wouldn’t work?” And the LM will restate the idea precisely: “You’re proposing to push side effects to the boundary and define animation as a pure function of time, like in a React-style architecture.” That clarity helps me immediately refine or correct the idea.
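A minimal sketch of the refactor described above (hypothetical names, not the actual code under discussion): the frame to draw becomes a pure function of elapsed time, so no global animation state is needed and several animations can run at once.

```typescript
// Hypothetical sketch of "animation as a pure function of time".

interface Animation {
  frameCount: number; // total frames in the loop
  msPerFrame: number; // how long each frame is shown
}

// Pure: given an animation and the elapsed time, return which frame to draw.
// There is no mutable counter to advance, so multiple animations can share this freely.
function frameAt(anim: Animation, elapsedMs: number): number {
  return Math.floor(elapsedMs / anim.msPerFrame) % anim.frameCount;
}

// Usage: side effects (the clock, the draw call) stay at the boundary.
const blink: Animation = { frameCount: 8, msPerFrame: 100 };
const startedAt = Date.now();

setInterval(() => {
  const frame = frameAt(blink, Date.now() - startedAt);
  console.log(`draw frame ${frame}`); // stand-in for the real draw call
}, 16);
```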
This changes the kind of work I can do. Without the model, I default to braindead hacking: solve local problems quickly, but end up with brittle, tangled code. Thinking structurally takes effort, and I often don’t do it. But in a conversational loop with the model, it’s fun. And because the feedback is immediate, it keeps momentum going.
This does offload cognition, but not by replacing my thinking. It’s integrated into it. The model isn’t doing the task. It’s helping me think more effectively about how to do the task. It names patterns I gestured at. It rephrases vague concepts sharply enough that I can critique them. It lets me externalize a confused internal state and get back something slightly clearer that I can then respond to. This creates an iterative improvement loop.
Maybe this works very well for me because I have ADHD. Maybe most people can just sit down and reflect in silence. For me, talking to the model lowers the activation energy and turns reflection into dialogue, which makes it very easy to do.
People say LMs slow you down. That’s true if you’re using them to write broken code from vague prompts and then patch the errors. But that’s not what I’m doing. I use them to think better, not to think less.
Similar here. For me, the greatest benefit is to have someone I can discuss the problem with. A rubber duck, Stack Exchange, peer programming—all in one. As a consequence, not only do I implement something, but I also understand what I did and why. (Yeah, in theory, as a senior developer, I should always understand what I do and why… but there is a tradeoff between deep understanding and time spent.)
So, from my perspective, this is similar to saying that writing automated tests only slows you down.
More precisely, I do find it surprising that developers were slowed down by using AI. I just think that in the longer term it is worth using it anyway.
Very interesting result; I was surprised to see an actual slowdown.
The extensive analysis of the factors potentially biasing the study’s results and the careful statements regarding what the study doesn’t show are appreciated. Seems like very solid work overall.
That said, one thing jumped out at me: the $150/hr compensation for time spent on the tasks.
That seems like misaligned incentives, no? The participants got paid more the more time they spent on tasks. A flat reward for completing a task plus a speed bonus seems like a better way to structure it?
Edit: Ah, I see it’s addressed in an appendix.
Still seems like a design flaw to me, but I suppose it isn’t as trivial to fix as I’d thought.
I agree with the paper that paying here probably has minimal effects on devs, but also even if it does have an effect it doesn’t seem likely to change the results, unless somehow the AI group was more incentivized to be slow than the non-AI group.
I was one of the devs. Granted, the money went to Lightcone and not me personally, but even if it had, I don’t see it motivating me in any particular direction. Not toward taking longer, for one thing – I’ve got too much to do to drag my feet to make a little more money. And not toward pleasing METR – I didn’t believe they wanted any particular result.
FWIW: this is my qualitative sense for other devs too.
We made this design decision because we wanted to max out on external validity, following the task length work which had fewer internal validity/more external validity concerns.
I found this review from another participant useful. I particularly resonate with the “generative AI slot machine effect.”
Same, I experienced something like this when I recently tried out LLMs for math research. I quickly got sick of it, because it never actually worked, but it did waste some of my time and I can easily imagine how one may fall into this trap hard.
I really enjoyed this study. I wish it weren’t so darn expensive, because I would love to see a dozen variations of this.
I still think I’m more productive with LLMs since Claude Code + Opus 4.0 (and have reasonably strong data points), but this does push me further in the direction of using LLMs only surgically rather than for everything, and towards recommending relatively restricted LLM use at my company.
My biggest question is “did the participants get to multitask?”
The paper suggests “yes,” but doesn’t really go into detail about how to navigate the issues I’d expect to run into there.
The way I had previously believed I got the biggest speedups from AI on more developed codebases makes heavy use of the paradigm “send one agent to take a stab at a task that I think an agent can probably handle, then go focus on a different task in a different copy/branch of my repo.” (In some cases, when all the tasks are pretty doable-by-AI, I’ll be more in a “micromanager” role, rotating between 3 agents working on 3 different tasks. In cases where there’s a task that requires basically my full attention, I’ll usually have one major task, but periodically notice small side-issues that seem like something an LLM can handle.)
It seems like that sort of workflow is technically allowed by this paper’s process but not super encouraged. (Theoretically you record your screen the whole time and “sort out afterwards” when you were working on various projects and when you were zoned out.)
I still wouldn’t be too surprised if I were slowed down on net because I don’t actually really stay proactively focused on the above workflow and instead zone out slightly, or spend a lot of time trying to get AIs to do some work that they aren’t actually good at yet, or take longer to review the result.
I’m not going to argue the study doesn’t show what it shows, but based on personal experience, I have a hard time believing the inferred claim that AI slows down programmers (vs. the more narrow claim that the study proves, which is that AI slows down programmers in situations that match the study).
I have a hard time believing this because I have seen the increased productivity on my own team.
Here’s my best theory for what’s going on:
- AI makes programming slower when the programmer otherwise knows what they’re doing
  - AI makes more mistakes and regresses to the mean, requiring human fixes on top of waiting for the AI to run
- AI helps most when a programmer doesn’t know exactly what to do, so it saves them time researching (reading and understanding code other people wrote, looking through docs for how to do things, reading through bug reports and questions to find answers to issues they encounter, etc.)
- And then obviously, if someone’s not a programmer and is vibe coding, AI helps them move infinitely faster than they would have otherwise, because they couldn’t code at all
This would explain the results the METR team got, but also explain why it seems so obvious to everyone that we should be paying a lot of money for AI tools to help programmers write code.
(I’ll admit, there’s another reason for programmers to want to use AI even if it did make them worse at their jobs: it outsources some of the most unpleasant programming labor. So even if it’s slower, it’s worth it in the eyes of a programmer, because their experience of programming feels better when they use AI: they don’t spend a lot of time on tasks they don’t enjoy, like typing out the code changes they already figured out in their head.)
We definitely do not claim that AI broadly slows down programmers! See the tweet thread about this.
I think all the points you raised are an important part of the story—we additionally go through some other factors that we think might explain the surprising result.
Yes? I’m not objecting directly to the results of the study, which are contained to what the study can show, but to the inference that many people seem to be drawing from the study.
I think “many people” is doing a lot of work here—I’ve generally found the public reception to be very nuanced, more so than I was expecting. See e.g. Gary Marcus’ post.
Basically that’s proposing to take the programmer job description and move it from hands-on “write the code yourself” to hands-off “review and adjust the code the AI agents are writing.”
Many people currently employed as programmers actually do enjoy the hands-on part and I suspect that even those doing it mostly for the money tend to like it more than code reviews. Code reviews are probably just below writing documentation and writing tests on the list of things most programmers don’t like doing.
Now, personally I don’t mind writing tests and don’t mind doing code reviews myself, probably more so than most people I’ve worked with, and yet if that’s what the job morphs into I’ll probably change careers, circumstances permitting.
What I could see is that, as the job description changes, the kind of people who get into the job also changes with it. And there’s certainly people who do think like you describe. Just not many of them in my bubble.
That’s fair. I know there are programmers who actually like writing code for its own sake rather than as a way to achieve a goal. I think you are right that the profession will change to be less about writing code and more about achieving goals (and it already is, so I just mean it will be more like this), since AI will be cheap enough to make humans writing code too expensive.
About the bit where developers thought they were more productive but were actually less so: I’ve heard people say things like “overall, using AI tools didn’t save me any time, but doing it this way cost me less mental energy than doing it all by myself”. I’ve also sometimes felt similarly. I wonder if people might be using something like “how good do I feel at the end of the day” as a proxy for “how productive was I today”.
How much AI do these developers use in their normal work? Is your hypothesis that these people are 20% less productive now in their real work because they think AI gives them a big productivity use, so they use it a lot, but it actually hinders them? Or were they relatively unfamiliar with AI use, tried to use them an unusual amount for the experiment, and it backfired? Or is there some other important difference between their normal work and this experiment?
When I was first told about this study, I was asked to make a prediction before they revealed the results. I predicted wrong. I don’t know how to update exactly, but it feels bad to try to explain away the results (which I feel myself want to do).
This is about what I expected.
Future work ought to try this with inexperienced programmers, or experienced programmers working on unfamiliar codebases. A theory we might have at this point is that it’s harder for AI to help more experienced people.
The question is, though, does it really matter if AI can speed up inexperienced programmers? Inasmuch as what we care about is AI effects on AI progress or AI effects on the economy, evaluating how useful they are to professional, highly experienced developers is precisely the thing to do. The boosts to inexperienced people are mostly irrelevant, since (1) their contributions to research/economy would presumably be very small, (2) they’re a self-solving problem, in that they’d either not do much development, or quickly turn into experienced developers.
One thing it’d be interesting to test is whether, over time, LLMs become useful to ever-more-experienced people. Which can be tested either by re-running the study every 4-6 months, or by running a “postdictive” study where we re-test several times while restricting the available models to the models that were available at different past dates. It’d be pretty expensive, though.
Do you have a model to back up the claim of inexperienced software engineers not contributing? Like an economic modelling claim or similar?
I’m just curious, as I’m not sure whether this is the case or not. (Like, if a really good manager who is shit at coding learns how to understand code, this can for example give large speedups.)
Kinda sorta? I was basing it off a vague intuition that productivity is probably distributed as a power law, so the top few percent of people would account for most of the useful (economic/research) output. Looking it up, if we use income as a proxy for productivity, that seems to be the case.
Now, granted, what this says is “the few most productive developers account for most of the value”, not “the few most competent developers account for most of the value”. But I think it’s reasonable to assume that the two are strongly correlated. Alternative models would imply a software industry in which the bulk of the gains is generated by superstar novices in their first years of programming, who then burn out and stop contributing much. Pretty sure that’s not how it works.
Another intuition I had is that the bulk of programmer-hours is probably experienced-programmer-hours, because, again, novices either quit or quickly become experienced programmers. So again, unless we assume that the value-generation is skewed towards a person’s first months/years of programming, we have to assume that most of the value is generated by experienced people.
But that’s all admittedly pure intuitive theorizing. I did say “presumably” in my initial statement.
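For what it’s worth, a rough way to make that intuition concrete (purely illustrative parameters, not data from the study or from income statistics): if productivity were Pareto-distributed with shape alpha, the top fraction p of people would account for a share p^(1 - 1/alpha) of total output.

```typescript
// Illustrative only: share of total output produced by the top fraction `p`
// of people, assuming productivity follows a Pareto distribution with shape
// `alpha` (formula: p^(1 - 1/alpha), valid for alpha > 1).

function topShare(p: number, alpha: number): number {
  return Math.pow(p, 1 - 1 / alpha);
}

// With a moderately heavy tail (alpha = 1.5) the top 1% produce ~22% of the
// total and the top 10% produce ~46%; with a heavier tail (alpha = 1.1) the
// top 1% alone produce ~66%. How much of the value the "top few percent"
// capture depends entirely on the assumed tail.
console.log(topShare(0.01, 1.5).toFixed(2)); // ≈ 0.22
console.log(topShare(0.10, 1.5).toFixed(2)); // ≈ 0.46
console.log(topShare(0.01, 1.1).toFixed(2)); // ≈ 0.66
```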
I mean you’re obviously correct about the value distribution of experienced software engineers.
I should have made it more clear but I was more considering things like upskilling management or like more direct operational research on how it affects other parts of the firm itself.
Operational research can be quite complex, but power laws are a thing, and as a first approximation I would agree. It’s just that I think it might be a bit more complex than that in reality, since a manager without coding experience might still be helped by it.
Why would we use income as a proxy for productivity, given that
a) companies’ pay grades only half match each other,
b) there exists an open source community?
I don’t think that holds either. Say, the existence of Windows is a large chunk of value (it enables other software and so on), but Windows is not written competently—e.g. judging from what we see when an update crashes a bunch of computers.
What I’m saying is that ‘experienced’ is not precisely equal to ‘competent’; as long as your code works somehow, you are not under large pressure to make it maintainable or even valid for all cases.
Well, the top labs pretty much only hire really cracked coders, and it seems like the top labs are primarily responsible for pushing the frontier.
I do not know if Thane had a more rigorous argument, but mine seems pretty likely to work.
So, how does that relate to the general productivity of the firm? If I look at this from the perspective of someone like Stafford Beer or other types of operational research, then I could say that the smoothness of the delegation between the top level and the bottom level defines how good the operations are.
For example, you can have lots of cracked engineers but that doesn’t matter if management doesn’t know what to do?
What does this have to do with inexperienced software engineers?
I don’t think I understand what you’re getting at anymore.
One can think of a manager as an inexperienced software engineer for example. Sorry if I didn’t make that clear before.
I know a bunch of people with more experience in other areas who now have a lot easier time understanding code and that literacy might then lead to increases in precision at management level.
You reminded me of a part of Rudolf’s story.
Really interesting paper. Granting the results, it seems plausible that AI still boosts productivity overall by easing the cognitive burden on developers and letting them work more hours per day.
Great study!
A strong motivating aspect of the study is measuring AI R&D acceleration. I am somewhat wary of using this methodology to find negative evidence for this kind of acceleration happening at labs:
- I must believe that using AI agents productively is a skill question, despite the graphs in the paper showing no learning effects. One kind of company filled with people who know a lot about how to prompt AIs, and about their limitations, is AI labs. Even if most developers
- The mean speedup/slowdown can be a difficult metric: the heavy tail of research impact plus the feedback loops around AI R&D mean that just one subgroup with high positive speedup could have a big impact (a toy illustration below).
- Reading a recent account of an employee who left OpenAI, the dev experience also sounds pretty dissimilar. Summarizing, OAI repos are large (which matches the study setting), but people don’t have a great understanding of the full repo (since it’s a large monorepo with a lot of new people joining), and there do not seem to be uniform code guidelines.
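As a toy illustration of the heavy-tail point above (made-up numbers, not study data): even if the typical developer is slowed down, the impact-weighted aggregate can still come out positive when a small, high-leverage subgroup is strongly accelerated.

```typescript
// Toy calculation with made-up numbers: impact-weighted speedup can be
// positive even when most individual developers are slowed down.

interface Group {
  shareOfImpact: number;    // fraction of total research impact this group produces at baseline
  outputMultiplier: number; // 0.8 = 20% less output with AI, 2.0 = twice as much
}

const groups: Group[] = [
  { shareOfImpact: 0.5, outputMultiplier: 0.8 }, // typical devs: slowed down, roughly the direction the RCT found
  { shareOfImpact: 0.5, outputMultiplier: 2.0 }, // hypothetical high-leverage subgroup: 2x faster
];

// Aggregate output relative to the no-AI baseline.
const aggregate = groups.reduce((acc, g) => acc + g.shareOfImpact * g.outputMultiplier, 0);
console.log(aggregate.toFixed(2)); // 1.40: a ~40% aggregate gain despite the typical dev being slower
```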