If anyone is game for creating an agentic research scaffold like the one Thane describes:
Here’s the basic structure as envisioned after 5 minutes’ thought (a couple of rough code sketches follow the walkthrough):
You feed a research prompt to the “Outer Loop” of a model, maybe have a back-and-forth fleshing out the details.
The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
Each research direction/subproblem is handed off to a “Subagent” instance of the model.
Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it’s prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there’s enough space in the context window), or it’s prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
This is allowed up to some iteration count.
Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent’s efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
The Evaluator’s final conclusions are dumped into the Outer Loop’s context (without the source documents, to not overload the context window).
If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
Iterate, spawning further Evaluator and Subagent instances as needed.
Once the Evaluator chooses to terminate, or some compute upper-bound is reached, the Outer Loop instantiates a Summarizer, into which it dumps all of the Evaluator’s analyses + all of the most important search results.
The Summarizer is prompted to generate a high-level report outline, then write out each subsection, then the subsections are patched together into a final report.
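To make the Subagent step concrete, here’s a minimal Python sketch of the loop from the walkthrough above. It assumes a generic llm(prompt) -> str completion call and a web_search(query) -> list[str] tool; both are hypothetical stand-ins rather than any particular API, and the reply-parsing is deliberately naive:

```python
# Rough sketch of a single Subagent, under two assumptions that are NOT real APIs:
#   llm(prompt) -> str            any chat-completion call you care to plug in
#   web_search(query) -> [str]    any search tool returning result snippets

def section(text: str, name: str) -> str:
    """Naive parser: return whatever follows 'NAME:' on its line, else ''."""
    for line in text.splitlines():
        if line.strip().upper().startswith(name + ":"):
            return line.split(":", 1)[1]
    return ""

def run_subagent(direction, llm, web_search, max_iterations=3):
    """Search, analyse, distill; then optionally 'replace itself' with a next-iteration
    Subagent that sees only the distilled notes + the follow-up question."""
    notes, question, kept_sources = "", direction, []
    for _ in range(max_iterations):
        results = web_search(question)
        numbered = "\n".join(f"[{i}] {r}" for i, r in enumerate(results))
        analysis = llm(
            f"Research direction: {direction}\n"
            f"Distilled notes so far: {notes}\n"
            f"New search results:\n{numbered}\n\n"
            "Reply with three lines:\n"
            "KEEP: indices of results worth keeping\n"
            "VERDICT: is this direction promising?\n"
            "FOLLOW_UP: one follow-up question, or NONE"
        )
        keep = section(analysis, "KEEP").replace(",", " ").split()
        kept_sources += [r for i, r in enumerate(results) if str(i) in keep]
        # Self-replacement: only a dense distillation is carried forward,
        # not the raw search results, to stay inside the context window.
        notes = llm(f"Distill into a dense research summary:\n{notes}\n{analysis}")
        follow_up = section(analysis, "FOLLOW_UP").strip()
        if not follow_up or follow_up.upper().startswith("NONE"):
            break
        question = follow_up
    return {"direction": direction, "notes": notes, "sources": kept_sources}
```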
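And a similarly hedged sketch of the Outer Loop / Evaluator / Summarizer wiring, reusing run_subagent() from the sketch above; again, llm and web_search are whatever backends you plug in, not a real library:

```python
def deep_research(prompt, llm, web_search, max_rounds=4):
    # Outer Loop: decompose the prompt into parallel research directions.
    directions = [d.strip() for d in llm(
        "Decompose this research prompt into 3-5 promising, parallel research "
        f"directions, one per line:\n{prompt}"
    ).splitlines() if d.strip()]

    evaluator_reports = []
    for _ in range(max_rounds):  # compute upper bound
        findings = [run_subagent(d, llm, web_search) for d in directions]
        # Evaluator: integrates the distilled outputs of every Subagent.
        digest = "\n\n".join(f"Direction: {f['direction']}\nNotes: {f['notes']}"
                             for f in findings)
        evaluation = llm(
            f"Original research prompt: {prompt}\n\n"
            f"Parallel findings:\n{digest}\n\n"
            "Has the prompt been satisfactorily addressed? If yes, reply DONE "
            "followed by your conclusions. If not, list the follow-up directions "
            "worth pursuing, one per line."
        )
        # Only the Evaluator's conclusions go back to the Outer Loop, not the sources.
        evaluator_reports.append(evaluation)
        if evaluation.strip().upper().startswith("DONE"):
            break
        directions = [d.strip() for d in evaluation.splitlines() if d.strip()]

    # Summarizer: outline first, then each subsection, then patch them together.
    # (The walkthrough also feeds it the most important search results; omitted here.)
    analyses = "\n\n".join(evaluator_reports)
    outline = llm(f"Write a high-level report outline for: {prompt}\nBased on:\n{analyses}")
    report_sections = [
        llm(f"Write the report section '{item.strip()}' for the prompt: {prompt}\n"
            f"Using these analyses:\n{analyses}")
        for item in outline.splitlines() if item.strip()
    ]
    return "\n\n".join(report_sections)
```

All of the actual difficulty (context-budget checks, distilling Subagent outputs before the Evaluator sees them, carrying the most important sources forward to the Summarizer) lives in the prompts and the plumbing, not in this control flow.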
Here’s what this complicated setup is supposed to achieve:
Do a pass over an actually large, diverse set of sources. Most such web-search tools (Google’s, DeepSeek’s, or this shallow replication) are basically only allowed to make one search query, and then they have to contend with whatever got dumped into their context window. If the first search query turns out poorly targeted, in a way that only becomes obvious after looking through the results, the AI’s out of luck.
Avoid falling into “rabbit holes”, i.e. arcane-and-useless subproblems the model becomes obsessed with. Subagents would be allowed to fall into them, but presumably the Evaluator steps, with a bird’s-eye view of the picture, would recognize that as a failure mode and not recommend that the Outer Loop pursue it.
Attempt to patch together a “bird’s-eye view” of the entirety of the results given the limitations of the context window. Subagents and Evaluators would use their limited scope to figure out which results are most important, then provide summaries + recommendations based on what’s visible from their limited vantage points. The Outer Loop and then the Summarizer instances, prompted with information-dense distillations of what’s visible from each lower-level vantage point, should effectively be looking at (a decent approximation of) the full scope of the problem.
Has anything like this been implemented in the open-source community or one of the countless LLM-wrapper startups? I would expect so, since it seems like an obvious thing to try + the old AutoGPT scaffolds worked in a somewhat similar manner… But it’s possible the market’s totally failed to deliver.
It should be relatively easy to set up using Flowise plus e.g. Firecrawl. If nothing like this has been implemented, I’m putting a $500 $250 bounty on it.
Edit: Alright, this seems like something very close to it.