I find myself in agreement with basically everything in this comment, and yet I observe that most of the benefits of AI so far have looked like moving rarer and rarer classes of tasks from “the long tail” to “approximately solved”. I suppose it’s worth clarifying exactly what we mean by “the long tail” though.
I observe that, for most common low-level tasks (e.g. write a patch and apply it to a file) that scaffolded LLMs can do at all, they can do the task quite reliably given multiple attempts but no human feedback. Likewise for most tasks which are simple combinations of tasks the LLM already knows how to do and already knows how to recognize as successfully done.
Where they get into trouble, in my experience, is in cases where they have a task that is not too far from things they’re good at, but is just far enough out of distribution that they’re not sure what success even looks like, and where the output of that task is not natural language text.
Basically, when I consider the following two types of failures:
A: The AI failed to complete the task because it failed to exhibit long-term strategic goal-directed behavior
B: The AI failed to complete the task because it ran into a subproblem where something went wrong, and failures in perception meant it could not figure out what was wrong, so it ended up spinning its wheels or otherwise failed to solve the subproblem
I expect that B >> A in terms of frequency. As long as that’s the case, there’s not that much benefit from solving the failures of type A, because they’re almost never the bottleneck. If you need access to a human anyway because type B failures are prevalent, you might as well offload onto that same human the rare type A problems, where training data is scarce and reliability matters more than speed (assuming you can give that human enough context to be more reliable than the model, which I expect you can).
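To make the “B >> A” point concrete, here is a minimal sketch with made-up failure rates: if perception failures already force a human into the loop on a sizable fraction of tasks, eliminating the much rarer strategy failures barely changes how often a human is needed at all. Both rates below are assumptions for illustration, not measurements.

```python
# Toy numbers illustrating the "B >> A" point: if perception failures (B)
# already force a human into the loop on a sizable fraction of tasks,
# eliminating the much rarer strategy failures (A) barely changes how often
# a human is needed at all. Both rates are assumptions, not measurements.

p_a = 0.01  # assumed rate of type-A (long-horizon strategy) failures
p_b = 0.20  # assumed rate of type-B (perception / stuck-subproblem) failures

def human_needed(p_a: float, p_b: float) -> float:
    """Fraction of tasks on which at least one failure type forces human help."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)

print(f"human needed with both failure modes: {human_needed(p_a, p_b):.1%}")
print(f"human needed if type A were solved:   {human_needed(0.0, p_b):.1%}")
```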
The short answer is that they do help, but they are not a One Weird Trick to remove the bottleneck of humans being the main cost.
There is One Weird Trick to remove the bottleneck of humans being the main cost. That trick is to spend more and more money on inference until you reach the breakeven point where a human is cheaper. For example, for a problem that would take 1M input / 100k output tokens of GPT-5.4-pro to solve, a human wins on cost if they can solve it for $50 (or improve token efficiency by a corresponding amount). One quite easy way to burn a ton of tokens is trying to improve reliability by doing best-of-n on some task where the reliability isn’t quite there yet. If you can get the outputs into a human-evaluatable format there, the human is likely to be cheaper (and can likely notice trends in the failure cases).
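For concreteness, here is a sketch of that breakeven arithmetic. GPT-5.4-pro is hypothetical, so the per-token prices below are pure assumptions, chosen only so that the 1M-input / 100k-output example lands at roughly the $50 figure above; it also shows how best-of-n multiplies the bill.

```python
# A minimal sketch of the breakeven arithmetic. GPT-5.4-pro is hypothetical,
# so the per-token prices are assumptions, chosen only so that the
# 1M-input / 100k-output example lands at roughly the $50 figure.

INPUT_PRICE_PER_MTOK = 30.0    # assumed $ per 1M input tokens (hypothetical)
OUTPUT_PRICE_PER_MTOK = 200.0  # assumed $ per 1M output tokens (hypothetical)

def inference_cost(input_tokens: int, output_tokens: int, attempts: int = 1) -> float:
    """Dollar cost of running the model `attempts` times (e.g. for best-of-n)."""
    per_attempt = (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK \
                + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK
    return attempts * per_attempt

model_cost = inference_cost(1_000_000, 100_000)  # the example from the text
human_cost = 50.0                                # what the human charges
print(f"model: ${model_cost:.2f}, human: ${human_cost:.2f}")
print(f"best-of-4 on the same task: ${inference_cost(1_000_000, 100_000, attempts=4):.2f}")
```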
On tasks where delegation to a human is possible at all, once you’re on the Pareto frontier of “increase n in best-of-n” vs “delegate a higher fraction to a human”, there is no longer a threshold / discontinuity in value that comes from higher reliability. I expect many of the tasks that only come up once every 100 hours or once every 1000 hours to be this kind of thing, where the juice of aiming for full automation is just not worth the squeeze. And I expect many of the things we think of as “long-horizon goal-directedness” to come up about this rarely.
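As a rough illustration of that Pareto frontier, here is a sketch of expected cost as you raise n in best-of-n with a human fallback. Every number in it (per-attempt success rate, per-attempt cost, human cost) is an assumption; the point is only that reliability and expected cost trade off smoothly with n, with no threshold where extra reliability suddenly becomes decisive once handing the residue to a human is an option.

```python
# Rough illustration of the "increase n in best-of-n vs delegate to a human"
# tradeoff. All numbers (per-attempt success rate, per-attempt cost, human
# cost) are assumed; the point is only that expected cost and reliability
# vary smoothly with n, with no discontinuity in value from extra reliability
# once a human fallback is available.

def expected_cost(p_single: float, n: int, attempt_cost: float, human_cost: float) -> float:
    """Pay for n model attempts; if all of them fail, pay a human to finish the task."""
    p_all_fail = (1.0 - p_single) ** n
    return n * attempt_cost + p_all_fail * human_cost

if __name__ == "__main__":
    p_single, attempt_cost, human_cost = 0.6, 2.0, 50.0  # assumed values
    for n in range(1, 9):
        reliability = 1.0 - (1.0 - p_single) ** n
        cost = expected_cost(p_single, n, attempt_cost, human_cost)
        print(f"n={n}: P(model succeeds)={reliability:.3f}, E[cost]=${cost:.2f}")
```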