Inevitably, some of these activities will be harder to automate than others, delaying the overall timeline. It seems difficult to route around this problem. For instance, if it turns out to be difficult to evaluate the quality of model outputs for fuzzy or subjective tasks, it’s not clear how an R&D organization (however much automation it has incorporated) could rapidly improve model capabilities on those tasks, regardless of how much progress is being made in other areas.
One reason I expect less jagged progress than you is my intuition that even tasks that are theoretically easy to verify/check will often be hard to automate if they take humans a long time, are very valuable, and lack easily verifiable intermediate outputs. For example, perhaps it’s much easier to automate few-hour coding tasks than few-hour tasks in less verifiable domains. But for coding tasks that take humans months, it’s not clear that there’s a much better training signal for intermediate outputs than there is for tasks with a less verifiable end state. And without easily verifiable intermediate outputs, you face similar challenges to short-horizon non-verifiable tasks in getting a good training signal. Furthermore, the sorts of long-horizon coding tasks humans do are often inherently vague and fuzzy as well, more so at longer horizons than shorter ones. It’s less clear how much of an issue this is for math, but for coding this consideration points me toward expecting automation of coding not that much before other, fuzzier skills.