I find the part about extreme specialization very interesting, and potentially applicable (from an outsider’s perspective) to training AI agent systems. Today’s instruction-following LLMs could in principle cooperate, since they don’t yet pursue goals beyond their prompt: we can simply prompt them to work together and they will do so without hesitation. So it sounds like we could get a lot of benefit from specialization if we can train them to cooperate effectively.
Today’s frontier LLMs are quite general-purpose and benefit from being so. I would guess that’s both for economic reasons during training (one big frontier model outperforms many smaller specialized models for the same training cost) and for performance on interdisciplinary tasks. But all our training evaluations and most real-world production workloads run a single LLM inside a scaffold. That single model might contain many experts, but they are tightly coupled. What if that weren’t the case?
Could we train a system of separate LLMs, each with a narrow use case but natively designed to talk to one another? We could run them on different machines and train them to communicate rapidly through a predefined agentic scaffold (or some other communication channel embedded more deeply in the model architecture itself). The objective function would then be some function of the system’s performance as a whole and of each individual model’s contribution to it, rather than the training process running and evaluating only a single model.
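To make the objective concrete, here is a minimal sketch of what such a blended training signal might look like. Everything here is an illustrative assumption (the function name, the weighting scheme, and the idea of per-model contribution scores, e.g. from counterfactual ablation), not an existing API:

```python
# Hypothetical sketch: blend system-wide task performance with per-model
# credit assignment. All names and the weighting scheme are assumptions.

def system_reward(task_score: float,
                  contributions: list[float],
                  alpha: float = 0.7) -> list[float]:
    """Return one training reward per specialized model.

    task_score:    how well the whole multi-agent system solved the task
    contributions: one credit score per model (e.g. measured by ablating
                   that model's messages and re-scoring the system)
    alpha:         weight on the shared term vs. the individual term
    """
    return [alpha * task_score + (1 - alpha) * c for c in contributions]

# Example: the system scored 0.9 overall, with three specialists
# contributing unevenly; each gets a mix of shared and individual credit.
rewards = system_reward(0.9, [0.5, 0.8, 0.2])
```

The shared term keeps the models aligned on overall system success, while the individual term gives each specialist a gradient signal it can actually act on; the right balance between the two would itself be a research question.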
That seems like it could unlock a lot of benefits, akin to the analogy with multicellularity: each LLM would be an expert in a certain field and know just enough about other fields to delegate to the other experts when needed. Sort of like MoE, but at the agent-scaffold level instead of the LLM level. Compared to regular MoE, it could at least be much more memory-efficient when hosted in a large-scale datacenter, or the system as a whole might even reach new levels of intelligence without increasing the size of any individual LLM.
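The "knows just enough to delegate" part could be as simple as agent-level routing, the scaffold analogue of an MoE gating network. A toy sketch, with all endpoint names and the classifier purely hypothetical:

```python
# Illustrative sketch of MoE-style routing between whole specialized models
# rather than between experts inside one model. All names are assumptions.

SPECIALISTS = {
    "law": "legal-model-endpoint",
    "medicine": "medical-model-endpoint",
    "code": "coding-model-endpoint",
}

def route(query: str, classify) -> str:
    """Pick which specialist model should handle the query.

    classify: a cheap model or heuristic that labels the query's domain --
    the "knows just enough about other fields to delegate" step.
    """
    domain = classify(query)
    return SPECIALISTS.get(domain, "generalist-fallback-endpoint")

# Toy heuristic classifier standing in for a small routing model:
endpoint = route("Write a Python parser",
                 lambda q: "code" if "Python" in q else "law")
```

In a trained system the classifier would presumably be learned jointly with the specialists, so that routing errors show up in the system-level objective rather than being hand-tuned.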