Agentized LLMs will change the alignment landscape
Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.
I’m following the suggestions in “10 reasons why lists of 10 reasons might be a winning strategy” in order to get this out quickly (reason 10 will blow your mind!). I’m hoping to prompt some discussion rather than trying to write the definitive piece on a technique that was introduced so recently.
Ten reasons why agentized LLMs will change the alignment landscape:
Agentized LLMs like Auto-GPT and Baby AGI may fan the sparks of AGI in GPT-4 into a fire. These techniques use an LLM as a central cognitive engine, within a recursive loop of breaking a task goal into subtasks, working on those subtasks (including calling other software), and using the LLM to prioritize subtasks and decide when they’re adequately well done. They recursively check whether they’re making progress on their top-level goal.
While it remains to be seen what these systems can actually accomplish, I think it’s very likely that they will dramatically enhance the effective intelligence of the core LLM. I think this type of recursion and breaking problems into separate cognitive tasks is central to human intelligence. This technique adds several key aspects of human cognition: executive function; reflective, recursive thought; and episodic memory for tasks, despite using non-brainlike implementations. To be fair, the existing implementations seem pretty limited and error-prone. But they were implemented in days. So this is a prediction of near-future progress, not a report on amazing new capabilities.
This approach appears to be easier than I’d thought. I’ve been expecting this type of self-prompting to imitate the advantages of human thought, but I didn’t expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like 3 days for Baby AGI, with all of the code also written by GPT-4) implies that improvements may also be easier than we would have guessed.
Integration with HuggingGPT and similar approaches can provide these cognitive loops with more cognitive capacities. This integration was also easier than I’d have guessed, with GPT-4 learning how to use other software tools from a small number of examples (e.g., 40). Those tools will include both sensory capacities, with vision models and other sensory models of various types, and the equivalent of a variety of output capabilities.
Integration of recursive LLM self-improvement like “Reflexion” can utilize these cognitive loops to make the core model better at a variety of tasks.
That LLMs are so easily agentized is terrible news on the capabilities front. I think we’ll have an internet full of LLM-bots “thinking” up and doing stuff within a year.
This is absolutely bone-chilling for the urgency of the alignment and coordination problems. Some clever chucklehead already created ChaosGPT, an instance of Auto-GPT given the goal to destroy humanity and create chaos. You are literally reading the thoughts of something thinking about how to kill you. It’s too stupid to get very far, but it will get smarter with every LLM improvement, and every improvement to the recursive self-prompting wrapper programs. This gave me my very first visceral fear of AGI destroying us. I recommend it, unless you’re already plenty viscerally freaked out.
Watching agents think is going to shift public opinion. We should be ready for more AI scares and changing public beliefs. I have no idea how this is going to play out in the political sphere, but we need to figure this out to have a shot at successful alignment, because
We will be in a multilateral AGI world. Anyone can spawn a dumb AGI and have it either manage their social media, or try to destroy humanity. And over the years, those commercially available AGIs will get smarter. Because defense is harder than offense, it is going to be untenable to indefinitely defend the world against out-of-control AGIs. But
Important parts of alignment and interpretability might be a lot easier than most of us have been thinking. These agents take goals as input, in English. They reason about those goals much as humans do, and this will likely improve with model improvements. This does not solve the outer alignment problem; one existing suggestion is to include a top-level goal of “reducing suffering.” No! No! No! This also does not solve the alignment stability problem. Starting goals can be misinterpreted or lost to recursive subgoals, and if any type of continued learning is included, behavior will shift over time. It doesn’t even solve the inner alignment problem if recursive training methods create mesa-optimizers in the LLMs. But it also provides incredibly easy interpretability, because these systems think in English.
If I’m right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others’ thoughts.
I’m using a neologism, and a loose definition of agency as things that flexibly pursue goals. That’s similar to this more rigorous definition.
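For readers who want the loop from the first reason made concrete, here is a minimal sketch in Python. To be clear, this is not how Auto-GPT or Baby AGI are actually implemented. The names `run_agent` and `toy_llm` and the prompt prefixes (`DECOMPOSE:`, `EXECUTE:`, `DONE?`, `PRIORITIZE:`) are all made up for illustration, and the "LLM" is a canned stub so the code runs without any API. It only shows the shape of the idea: break a goal into subtasks, work on one, check progress on the top-level goal, then reprioritize.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical interface: any function from a prompt string to a reply string.
LLM = Callable[[str], str]


def _lines(text: str) -> List[str]:
    """Split a reply into non-empty, stripped lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def run_agent(goal: str, llm: LLM, max_steps: int = 10) -> List[Tuple[str, str]]:
    """Pursue `goal` by decomposing it, executing subtasks, and reprioritizing.

    Returns a plain-English trace of (subtask, result) pairs.
    """
    # Ask the model to break the goal into subtasks, one per line.
    tasks: Deque[str] = deque(_lines(llm(f"DECOMPOSE: {goal}")))
    trace: List[Tuple[str, str]] = []

    for _ in range(max_steps):  # hard cap so the loop always terminates
        if not tasks:
            break
        task = tasks.popleft()
        trace.append((task, llm(f"EXECUTE: {task}")))

        # Check whether the top-level goal is adequately done.
        history = "; ".join(f"{t} -> {r}" for t, r in trace)
        verdict = llm(f"DONE? goal={goal} | history={history}")
        if verdict.strip().upper().startswith("YES"):
            break

        # Let the model reorder the remaining subtasks. Ignore lines that are
        # not existing subtasks, and keep any subtask the model left out.
        if tasks:
            remaining = list(tasks)
            reply = _lines(llm("PRIORITIZE: " + "\n".join(remaining)))
            ranked = list(dict.fromkeys(t for t in reply if t in remaining))
            ranked += [t for t in remaining if t not in ranked]
            tasks = deque(ranked)

    return trace


def toy_llm(prompt: str) -> str:
    """Canned stand-in for a real model (an assumption, purely for the demo)."""
    if prompt.startswith("DECOMPOSE:"):
        return "gather sources\nwrite summary\ncheck summary"
    if prompt.startswith("EXECUTE:"):
        return "finished " + prompt[len("EXECUTE: "):]
    if prompt.startswith("DONE?"):
        return "YES" if "check summary ->" in prompt else "NO"
    if prompt.startswith("PRIORITIZE:"):
        return prompt[len("PRIORITIZE: "):]  # keep the current order
    return ""


if __name__ == "__main__":
    for task, result in run_agent("summarize a paper", toy_llm):
        print(f"{task}: {result}")
```

Note that the returned trace is just English text, which is the property the interpretability point above leans on: everything the loop "thought" can be read directly. The `max_steps` cap is a design choice of this sketch, not something taken from the real projects.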