One of my takeaways from EA Global this year was that most alignment people aren’t focused on LLM-based agents (LMAs)[1] as a route to takeover-capable AGI.
I was at EA Global, and this statement is surprising to me. My impression is that most people do think that LMAs are the primary route to takeover-capable AGI.
What would a non-LLM-based takeover-capable agent even look like, concretely?
Would it be something like SIMA, trained primarily on real-time video data rather than text? Even SIMA has to be trained to understand natural-language instructions, and it seems like natural-language understanding will continue to be important for many of the tasks we'd want such agents to do in the future.
Or would it be something like AlphaProof, which operates entirely in a formal environment? In this case, it seems unlikely to have any desire to take over, since everything it cares about is localized within the “box” of the formal problem it’s solving. That is, unless you start mixing formal problems with real-world problems in its training data, but if so you’d probably be using natural-language text for the latter. In any case, AlphaProof was trained with an LLM (Gemini) as a starting point, and it seems like future formal AI systems will also benefit from a “warm start” where they are initialized with LLM weights, allowing a basic understanding of formal syntax and logic.
My question isn’t just whether people think LMAs are the primary route to dangerous AI; it’s also why they’re not addressing the agentic part in their alignment work if they do think that.
I think the most common likely answer is “aligning LLMs should help a lot with aligning agents driven by those LLMs”. That’s a reasonable position. I’m just surprised and a little confused that so little work explicitly addresses the new alignment challenges that arise if an LLM is part of a more autonomous agentic system.
The alternative I was thinking of is some new approach that doesn't rely much on training on a language corpus. Or there are other schemes for AI and AGI that aren't based on neural networks at all.
The other route is LLMs/foundation models that are not really agentic, but relatively passive and working only step-by-step at human direction, like current systems. I hear people talk about the dangers of "transformative AI" in deliberately broad terms that don't assume we design it to be agentic.