I think this might be a case where, for each codebase, there is a particular model that crosses the line from “not reliable enough to be useful” to “reliable enough to sometimes be useful”. At my workplace, this first happened with Sonnet 3.6 (then called Claude 3.5 Sonnet New), and the jump from 3.5 to 3.6 felt like a step change. Progress before that point felt less impactful because the models were still “unable to reliably handle the boilerplate”, and progress since has felt less impactful because once a model can reliably write the boilerplate, there isn’t much alpha in writing it better, and none of the models are reliable enough that we trust them with bits of core business logic where bugs or poor choices can cause subtle data integrity issues years down the line.
I suspect the same is true of, e.g., trying to use LLMs for major version upgrades of frameworks. A team may have a looming Django 4 → Django 5 migration and try every new model on that task. Once one of them is good enough, the upgrade gets done, and the tasks that remain are mostly easier ones like minor version updates. So the most impressive thing they’ve seen a model do will be that major version upgrade, and it will take some time before another task that is well-scoped, hard to do, and easy to verify comes along.