In the conversations about this that I’ve seen, the crux is usually:
Do you expect greater capability gains per dollar from [continued scaling of ~the current* techniques] or from some [scaffolding/orchestration scheme/etc.]?
The latter just isn’t very dollar efficient, so I think we’d have to see the existing [ways to spend money to get better performance] get more expensive / hit a serious wall before sufficient resources are put into this kind of approach. It may be cheap to try, but verifying performance on relevant tasks and iterating on the design gets really expensive really quickly. On the scale-to-schlep spectrum, this is closer to schlep. I think you’re right that something like this could be important at some point in the future, conditional on much less efficient returns from other methods.
This is a bit of a side note, but I think your human analogy for time horizons doesn’t quite work, as Eli said. The question is ‘how much coherent and purposeful person-minute-equivalent work can an LLM execute before it fails n percent of the time?’ Many person-years of human labor can be coherently oriented toward a single outcome (whether one person or many are involved). That the humans get sleepy or distracted in between is an efficiency concern, not a coherence concern; it affects the rate at which the work gets done, but doesn’t put an upper bound on the total amount of purposeful labor that can hypothetically be directed. Humans just pick up where they left off, pursuing the same goals for years and years at a time, while LLMs seem to lose the plot once they’ve started to nod off.