The uplift equation:
What is required for AI to provide net speedup to a software engineering project, when humans write higher quality code than AIs? It depends how it’s used.
Cursor regime
In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write:
$$\text{Speedup factor} = \frac{t_H}{t_{AI+H}} = \frac{t_H}{t_{prompt} + t_{AIgen} + t_{check} + p_{fail} \cdot t_H}$$
Where
tH is the time for the human to write the code, either from scratch or after rejecting an AI suggestion
tprompt is the time for the human to write the prompt specifying the task to the AI
tAIgen is the time for the AI to generate the code, determined by output length and generation speed in tokens per second
tcheck is the time for the human to check the code, accept or reject it, and make any minor revisions, in order to bring the code quality and probability of bugs level with human-written code (Δpbug = 0)
pfail is the fraction of AI suggestions that are rejected entirely
Note this neglects other factors like code review time, code quality, bugs that aren’t caught by the human, or enabling things the human can’t do.
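As a rough numerical sketch of the Cursor-regime formula (the times below are illustrative assumptions in minutes, not measurements):

```python
def cursor_speedup(t_h, t_prompt, t_aigen, t_check, p_fail):
    """Speedup = t_H / (t_prompt + t_AIgen + t_check + p_fail * t_H)."""
    return t_h / (t_prompt + t_aigen + t_check + p_fail * t_h)

# Hypothetical task: 30 min to write by hand; 2 min to prompt, 1 min for the
# AI to generate, 5 min to check and revise, 20% of suggestions rejected.
print(cursor_speedup(t_h=30, t_prompt=2, t_aigen=1, t_check=5, p_fail=0.2))
# → 30 / (2 + 1 + 5 + 0.2*30) = 30/14 ≈ 2.14x
```

Note that with these numbers, checking time and the rejected-suggestion tax dominate the denominator, not generation time.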
Autonomous regime
In this regime the AI is reliable enough that the human doesn’t check all the code for bugs, and instead eats the chance of costly bugs entering the codebase.
$$\text{Speedup factor} = \frac{t_H}{t_{AI}} = \frac{t_H}{t_{prompt} + t_{AIgen} + \Delta p_{bug} \cdot t_{bugcost} + p_{fail} \cdot t_H}$$
Δpbug is the added probability of a bug from the AI agent compared to a human
tbugcost is the expected cost of a bug, including revenue loss, compromised other projects, and the time required to fix it. This can be higher than tH.
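A sketch of the autonomous-regime formula, with the checking term replaced by the expected bug cost (again, all numbers are illustrative assumptions):

```python
def autonomous_speedup(t_h, t_prompt, t_aigen, dp_bug, t_bugcost, p_fail):
    """Speedup = t_H / (t_prompt + t_AIgen + dp_bug * t_bugcost + p_fail * t_H)."""
    return t_h / (t_prompt + t_aigen + dp_bug * t_bugcost + p_fail * t_h)

# Hypothetical task: 30 min by hand; 2 min prompt, 1 min generation,
# 5% added bug probability, 60 min expected bug cost, 10% outright failures.
print(autonomous_speedup(t_h=30, t_prompt=2, t_aigen=1,
                         dp_bug=0.05, t_bugcost=60, p_fail=0.1))
# → 30 / (2 + 1 + 3 + 3) = 30/9 ≈ 3.33x
```

Because tbugcost can exceed tH, even a small Δpbug can erase the time saved by skipping the check.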
Verifiable regime
If task success is cheaply verifiable, e.g. if comprehensive tests are already written or other AIs can verify the code, then bugs are impossible except through reward hacking.
$$\text{Speedup factor} = \frac{t_H}{t_{AI}} = \frac{t_H}{t_{prompt} + t_{AIgen} + p_{rewardhack} \cdot t_{bugcost} + p_{fail} \cdot t_H}$$
Here tprompt is lower than in the other regimes because the verifier can help the generator AI understand the task, and tAIgen can be faster too because you can parallelize.
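One way to sketch the parallelization benefit: generate N candidates at once and let the verifier pick a passing one. This assumes (hypothetically) that failures are independent across samples and that wall-clock generation time is unchanged by parallelism; neither assumption is from the original model.

```python
def verifiable_speedup(t_h, t_prompt, t_aigen, p_rewardhack, t_bugcost,
                       p_fail, n_parallel=1):
    """Verifiable regime with N parallel samples.

    Assumed model: the human falls back to writing the code only if *all*
    N candidates fail verification, so the failure term is p_fail ** N.
    """
    p_all_fail = p_fail ** n_parallel
    return t_h / (t_prompt + t_aigen + p_rewardhack * t_bugcost
                  + p_all_fail * t_h)

# Illustrative numbers: one sample vs. four parallel samples.
print(verifiable_speedup(30, 1, 1, 0.01, 60, 0.3, n_parallel=1))  # ≈ 2.59x
print(verifiable_speedup(30, 1, 1, 0.01, 60, 0.3, n_parallel=4))  # ≈ 10.55x
```

Under these assumptions, driving the failure term toward zero is what lets the verifiable regime escape the pfail·tH tax that limits the other two.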
Observations
Although the AI is always faster at writing the code than the human, overall speedup is subject to Amdahl's law, i.e. limited by the slowest component: if AI generation is only 3x as fast as the human per line of code, overall speedup can never exceed 3x. Even when the AI mistake rate is low enough that we move to the autonomous regime, we still have tprompt and tAIgen, both of which can currently be significant fractions of tH, especially for expert humans who write code quickly and for projects that are only a few lines of code. Therefore I expect >5x speedups to overall projects only when (a) AIs write higher-quality code than humans, or (b) tasks are cheaply verifiable and reward hacking is rare, or (c) both tprompt and tAIgen are much faster than in my current work.
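The Amdahl's-law cap can be seen numerically by plugging the Cursor-regime formula in with overheads shrinking to zero (illustrative numbers, with generation 3x faster than the human):

```python
def cursor_speedup(t_h, t_prompt, t_aigen, t_check, p_fail):
    """Cursor-regime formula: t_H / (t_prompt + t_AIgen + t_check + p_fail * t_H)."""
    return t_h / (t_prompt + t_aigen + t_check + p_fail * t_h)

# t_aigen = t_h / 3, so the speedup is capped at 3x no matter how small
# checking, prompting, and failure overheads become.
for t_check in [10, 1, 0.1, 0]:
    print(t_check, cursor_speedup(t_h=30, t_prompt=0, t_aigen=10,
                                  t_check=t_check, p_fail=0))
# The printed speedups climb toward, but never pass, 3.0.
```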
I made a simple Desmos model here.
Unless the work can be parallelized.
This calculus changes when you can work on many things at once (similar to @faul_sname’s comment but this might be that you can work on many projects at once, even if they can’t each be parallelized well).