Good post.

Other points aside, the proposition “LM agents are an unusually safe way to build powerful AI systems” seems really important; it would be great to see more research/intuitions on this + clarification on the various flavors of “LM agents.”
I guess one crux for sharing research on LM agents is whether there are viable alternative paths to powerful AI systems. If LM agents are clearly the easiest path, there’s less reason to share research on them; if a less-safe path looks similarly easy, we should differentially advance LM agents.
I’m not aware of alternative paths that look anywhere near as easy as LM agents. Or: I don’t know what viable alternative paths LM agents are supposed to be safer than. (Edit: some alignment-researcher friends mention old-fashioned RL agents as a possible path to powerful AI that’s less safe than LM agents, but say that path looks substantially harder than LM agents, such that we don’t need to boost LM agents more.)
Maybe rather than ‘different paths’, Paul just means that capabilities can come from more powerful LMs or from more sophisticated agent scaffolding. He says:
“at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.”
I buy something like this, at least. But (I weakly intuit) by default we’ll almost exclusively be relying on LM agents rather than mere next-token predictors anyway; there’s no need to boost LM agents. And even if relying on LM agents is good, that doesn’t mean that marginal improvements in LM agents’ sophistication/complexity are safer than marginal improvements in underlying-LM capability. (I don’t have a take on this; just flagging it as a crux.)
Paul replies:

“My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.”
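To make the ‘agent scaffolding’ side of that tradeoff concrete: below is a minimal sketch of an LM-agent loop in Python. Everything here is a hypothetical illustration (the `call_lm` stand-in, the stub tools), not anyone’s actual system.

```python
# A toy "LM agent" loop, to make the scaffolding-vs-model-size tradeoff
# concrete. All names are hypothetical: `call_lm` stands in for a call to
# some fixed underlying model, and the tools are stubs, not a real API.

def call_lm(prompt: str) -> str:
    # Stand-in for the fixed underlying LM. Returns a canned action so the
    # sketch runs end to end; a real agent would query a model here.
    return "FINAL: (model output would go here)"

TOOLS = {
    # Stub tool; the point is that tool choice and execution live in
    # ordinary, inspectable code outside the model.
    "search": lambda query: f"<results for {query!r}>",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # The scaffolding, not the model, parses each output, runs tools,
        # and decides when to halt; the intermediate state is legible text.
        action = call_lm(transcript + "Next step ('tool: args' or 'FINAL: answer')?")
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:").strip()
        tool, _, args = action.partition(":")
        handler = TOOLS.get(tool.strip(), lambda a: "unknown tool")
        transcript += f"{action}\nObservation: {handler(args.strip())}\n"
    return "no answer within step budget"

print(run_agent("example task"))
```

The safety-relevant feature is that the agent’s intermediate state is human-readable text and its control flow is ordinary code; shifting capability into scaffolding like this keeps more of the system inspectable than shifting the same capability into a larger, more opaque model.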