I think there’s a hidden variable in this framework: the effective branching factor of the task. For well-specified tasks (prove this theorem, implement this API to spec, etc.), AI cost probably does scale roughly linearly with task size, matching your model. But most real-world engineering tasks aren’t like that. The bottleneck isn’t execution speed; it’s the iterative discovery of what the task actually is.
A senior engineer doing a 4-hour task isn’t typing for 4 hours. They’re making dozens of micro-decisions informed by tacit knowledge about the codebase, the product, the users, the org. They’re constantly pruning the search space. An AI working autonomously on the same task produces tokens faster but explores a much larger space of wrong approaches, because it lacks that contextual judgment. So the cost might scale superlinearly with task complexity, not because of compute limits but because of specification limits.
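To make the branching-factor intuition concrete, here’s a toy model (my own illustration, not anything from your post): suppose a task consists of d sequential decisions, a human’s tacit context prunes each decision down to roughly one viable option, and an autonomous agent instead faces an effective branching factor b > 1 and only learns a branch was wrong after exploring it. Counting explored nodes in that crude search tree gives linear cost for the human and roughly geometric cost for the agent.

```python
# Toy model (illustrative assumptions only): a task as d sequential
# decision points. A human's context prunes each decision to ~1 viable
# option, so cost is ~d. An agent with effective branching factor b
# explores wrong branches before finding the right one; a crude proxy
# for its cost is the number of non-root nodes in a b-ary tree of depth d.

def human_cost(d: int) -> float:
    """One pruned decision per step, no backtracking: linear in task size."""
    return float(d)

def agent_cost(d: int, b: float) -> float:
    """Explored nodes b + b^2 + ... + b^d, i.e. geometric in d for b > 1."""
    if b == 1.0:
        return float(d)
    return b * (b**d - 1) / (b - 1)

for d in (2, 4, 8, 16):
    print(f"d={d:2d}  human={human_cost(d):5.1f}  agent(b=1.3)={agent_cost(d, 1.3):7.1f}")
```

Even a modest effective branching factor makes the explored work blow up quickly relative to the pruned, linear path, which is the sense in which under-specified tasks scale superlinearly.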
This suggests the linear regime on your Pareto frontier might be shorter for real-world task distributions than for benchmark-style tasks, and the “fraction of human cost” metric could look very different depending on how well-specified the task is. The tasks where AI efficiency looks great (tight feedback loops, clear success criteria) are not the same tasks that fill most of an engineer’s day.
(Though this could of course change if AI starts to have more context and better judgment, which would basically mean training a better model.)