if AIs were completing 1 month long self contained software engineering tasks (e.g. what a smart intern might do in the first month)
This doesn’t seem like a good example to me.
The sort of tasks we’re talking about are extrapolations of current benchmark tasks, so it’s more like: what a programming savant with almost no ability to interact with colleagues or search out new context might do in a month given a self-contained, thoroughly specced and vetted task.
I expect current systems will naively scale to that, but not to the abilities of an arbitrary intern, because matching an intern requires skills that aren’t tested in the benchmarks.