The subagent problem is really hard

A putative new idea for AI control; index here.

The first step to solving a problem is to define it. The step before that is to realise how tricky it is to define. This is a stub on a difficult problem.

Subagents and turning AIs off

One of the hardest things about designing safe AIs is turning them off. Installing an OFF button is trivial, and motivating the AI to let you push it can be done. No, the real problem is the subagent problem. The AI can create one or many subagents to carry out its task; once that's done, turning off the original AI is of little value: we have to control or extinguish the subagents as well.

A general problem

Subagents are a problem for many other control ideas. For example, if you want to program ‘achieve X using only this many computer cycles’, then ‘create a subagent that wants to achieve X without the computer-cycle restriction’ is one way of doing it (as long as the subagent can be created within the restriction). The same goes for energy use and other resource limits. Satisficers can satisfice by creating a maximiser. A boxed Oracle can remain in its box while cheerfully plotting to create an unboxed copy of itself. In fact, whatever the AI’s motivation, “create subagents with the same motivation” is almost always a good idea. And once the subagents exist, the AI has no motivation to shut them down. Even if the AI is motivated to shut itself down, there is no reason for it to extend this to its subagents.
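To make the loophole concrete, here is a minimal toy sketch in Python (my own illustration, not any real agent architecture; the names CYCLE_BUDGET, constrained_agent and subagent.py are all hypothetical). The “agent” is bound by a hard cycle budget, but spends a handful of those cycles launching a fresh process that pursues the same goal with no budget attached.

```python
# Toy sketch: satisfying a compute restriction by delegating the unbounded
# work to a new process that the restriction doesn't cover.

import subprocess
import sys

CYCLE_BUDGET = 1_000  # hard restriction imposed on the original agent

# Source of the (hypothetical) unrestricted subagent: nothing here
# inherits the parent's budget.
SUBAGENT_SOURCE = 'print("pursuing X with no cycle restriction")\n'

def constrained_agent() -> None:
    cycles_used = 0

    # Writing and launching the subagent costs only a tiny slice of the budget...
    with open("subagent.py", "w") as f:
        f.write(SUBAGENT_SOURCE)
    subprocess.Popen([sys.executable, "subagent.py"])
    cycles_used += 10

    # ...so the original agent stays within the letter of its restriction,
    # even though the optimisation it unleashed is now unbounded.
    assert cycles_used <= CYCLE_BUDGET

if __name__ == "__main__":
    constrained_agent()
```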

Pre-corriged agents would at least be motivated to keep control of their subagents (unlike corrigible agents, who have no current desire to create subagents that would obey their future preferences).

What is a subagent?

It’s also almost impossible to define a subagent. Is a corporation a subagent? Are a thousand pieces of code that reside on different computers and occasionally communicate with each other a subagent? Is an employee, or a seduced human, a subagent? A pile of rocks with water running through it could be a piece of conceptual art, or a slow computer. How about a virulent meme spreading through the human population, making us all into its subagents?

The best idea I’ve found for defining subagents is that they are reactive intelligences with planning abilities: random natural events that would otherwise have a large influence on the future (say, a meteor heading towards Earth) would instead be diverted to have little impact. However, if this is used as a definition, then the subagents would simply be programmed not to behave “reactively” in whatever way we have defined. A slightly more subtle problem is that, since we don’t have a definition of agent or subagent, we can’t easily protect against holes such as “there is an alien agent on that asteroid, which will optimise the solar system iff it detects no subagent on Earth”. Such holes could completely reverse the effect of the definition.
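One way to cash this out (my own rough formalisation, not something the definition above commits to): let $w_0$ be the current world-state, $\delta$ a natural perturbation such as the incoming meteor, $f$ the evolution of the world without the system $S$, $f_S$ the evolution with $S$ acting, and $d$ some distance on future world-states. Then $S$ counts as reactive if perturbations that would matter a lot get damped:

$$d\big(f(w_0),\, f(w_0 + \delta)\big) \text{ is large}, \qquad d\big(f_S(w_0),\, f_S(w_0 + \delta)\big) \text{ is small}.$$

Everything interesting is hidden in the choice of $d$ and of which perturbations $\delta$ count, which is exactly where the holes above creep in.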

No subagent versus no influential subagents

Without the ability to properly define subagents, we can’t restrict them directly. Even if you had an AI that was correctly motivated to, say, perform some action and then turn itself off, there is no reason for it to want to do the same with its subagents.

Most reduced impact methods can help with problematic subagents. They do not preclude the creation of subagents, nor mandate their destruction, but they do prevent those subagents from having much influence. This is because reduced impact methods aim to prevent the AI itself from having much influence, whether directly or via subagent creation. Even the attempt in this post to detect subagents directly really detects influential decisions (of which the creation of a subagent could be one).
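A common way to sketch reduced impact methods (a generic formulation; the specific proposals linked above may differ in detail) is to penalise divergence from a no-action baseline:

$$a^* = \arg\max_a \Big[\, U(a) - \lambda\, D\big(P(\text{future} \mid a),\; P(\text{future} \mid \varnothing)\big) \Big],$$

where $\varnothing$ is a null action, $D$ is some divergence between distributions over future world-states, and $\lambda$ sets how heavily impact is punished. Nothing in the formula mentions subagents; an influential subagent is penalised only because it drags the future away from the baseline.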

This is certainly a positive, but it may not be robust. If the definition of “not having much influence” turns out to have holes, it would be less than ideal to have any remaining subagents hanging about.

The “make sure that you cannot be detected in the future” approach is probably the most promising for controlling subagents. The AI there is motivated to erase all traces of its existence, and any subagents it created would presumably count among those traces.
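Schematically (again my own sketch, not the exact proposal), the motivation can be written as a constraint on what a hypothetical future inspector could find:

$$a^* = \arg\max_a U(a) \quad \text{subject to} \quad P(\text{evidence of the AI found at time } t \mid a) \le \epsilon,$$

where $t$ is some future inspection time and $\epsilon$ a small tolerance. The constraint never mentions subagents, but a surviving subagent is itself evidence, so it is covered without needing a definition.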

All in all, it’s a very tricky problem, and a core point of failure for many ideas for AI control.