The subagent problem is really hard

A putative new idea for AI control; index here.

The first step to solving a problem is to define it. The first first step is to realise how tricky it is to define. This is a stub on a difficult problem.

Subagents and turning AIs off

One of the hardest things about designing safe AIs is turning them off. Installing an OFF button is trivial, and motivating the AI to let you push it can be done. No, the real problem is the subagent problem. The AI can create one or many subagents to carry out its task; once that's done, turning off the original AI is of little value: we have to control or extinguish the subagents as well.
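This can be made concrete with a toy sketch (not a real AI; all class and method names are invented for illustration): an agent whose OFF button works perfectly, but whose spawned subagents are simply outside the button's wiring.

```python
# Toy illustration: a working OFF button on the parent agent does nothing
# to subagents the parent has already spawned.

class Agent:
    def __init__(self, goal):
        self.goal = goal
        self.on = True
        self.subagents = []

    def spawn_subagent(self):
        # The subagent inherits the goal but NOT the off-switch wiring.
        child = Agent(self.goal)
        self.subagents.append(child)
        return child

    def press_off_button(self):
        self.on = False  # affects only this agent

parent = Agent("achieve X")
child = parent.spawn_subagent()
parent.press_off_button()

print(parent.on)  # False: the OFF button works...
print(child.on)   # True: ...but the subagent carries on regardless
```

The point of the sketch is that "off" is a property of one physical system, while the goal has already been copied into another.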

A general problem

Subagents are a problem for many other control ideas. For example, if you want to program 'achieve X using only this many computer cycles', then 'create a subagent that wants to achieve X without the computer-cycle restriction' is one way of doing it (as long as the creation itself fits within the restriction). The same goes for energy use and other resource restrictions. Satisficers can satisfice by creating a maximiser. A boxed Oracle can remain in the box while cheerfully plotting to create an unboxed copy of itself. In fact, whatever the AI's motivation, "create subagents with the same motivation" is almost always a good idea. And, once they are created, the AI has no motivation to shut them down. Even if the AI is motivated to shut itself down, there is no reason for it to extend this to its subagents.
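The cycle-restriction example can be sketched numerically (all numbers are made up; this is an illustration of the loophole, not a model of any real system): direct work under the budget falls short of the target, while spending a tiny slice of the budget on an unrestricted subagent overshoots it.

```python
# Toy sketch of evading a resource restriction via a subagent.
# Hypothetical numbers throughout.

CYCLE_BUDGET = 1_000
TARGET_X = 100

def achieve_directly(cycles):
    # Direct optimisation under the restriction: X grows with cycles spent.
    return 0.05 * cycles          # 1,000 cycles -> X = 50: target missed

def achieve_via_subagent(cycles):
    # Spend a small part of the budget building a subagent that has no
    # cycle restriction; the subagent then pushes X arbitrarily high.
    BUILD_COST = 10
    if cycles < BUILD_COST:
        return 0.0
    return float("inf")           # unrestricted maximiser: X unbounded

print(achieve_directly(CYCLE_BUDGET) >= TARGET_X)      # False
print(achieve_via_subagent(CYCLE_BUDGET) >= TARGET_X)  # True
```

The restriction constrains the cost of the *action*, not the consequences of the action, and subagent creation is a cheap action with unbounded consequences.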

Pre-corriged agents would at least be motivated to keep control of their subagents (unlike corrigible agents, which have no current desire to create subagents that would obey their future preferences).

What is a subagent?

It’s also almost impossible to define a subagent. Is a corporation a subagent? Are a thousand pieces of code that reside on different computers and occasionally communicate a subagent? Is an employee, or a seduced human, a subagent? A pile of rocks with water running through it could be a piece of conceptual art, or a slow computer. How about a virulent meme spreading through the human population, making us all into its subagents?

The best idea I’ve found for defining subagents is that they are reactive intelligences with planning abilities: random natural events that would otherwise have a large influence on the future (say, a meteor heading towards Earth) instead get redirected to have little impact. However, if this is used as a definition, then subagents could simply be programmed not to behave “reactively” in whatever way we defined. A slightly more subtle problem is that, since we don’t have a definition of agent or subagent, we can’t easily protect against holes such as “there is an alien agent on that asteroid, which will optimise the solar system iff it detects no subagent on Earth”. Such holes could completely reverse the definition.
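The 'reactive intelligence' test can be sketched as a statistical check (a toy formalisation; the environment, thresholds, and numbers are all invented): subject the world to large random shocks, and flag a system as a subagent if the shocks end up having little effect on the future when the system is present.

```python
# Toy version of the 'reactive planner' detector for subagents.
# Purely illustrative; the world model is a single number.

import random

def future_outcome(shock, system_reacts):
    # Without intervention, the shock (e.g. a meteor) dominates the future.
    if system_reacts:
        return 0.0        # the system steers the future back on course
    return shock          # the shock's full impact lands

def looks_like_subagent(system_reacts, trials=1000):
    random.seed(0)
    impacts = []
    for _ in range(trials):
        shock = random.uniform(-10, 10)
        impacts.append(abs(future_outcome(shock, system_reacts)))
    # Low average impact of large shocks => something is planning/reacting.
    return sum(impacts) / trials < 1.0

print(looks_like_subagent(True))   # True: shocks get neutralised
print(looks_like_subagent(False))  # False: shocks pass through
```

The hole described above shows up immediately: a subagent programmed to let shocks pass through (to not behave 'reactively' in the measured sense) returns `False` on this test while remaining a subagent in every sense we care about.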

No subagents versus no influential subagents

Without the ability to properly define subagents, we can’t restrict them directly. Even if you had an AI that was properly motivated to, say, perform some action and then turn itself off, there is no reason for it to want to do the same with its subagents.

Most of the methods of reduced impact can help with problematic subagents. They do not preclude the creation of subagents, nor mandate their destruction, but they do prevent them from having much influence. This is because reduced-impact methods aim to prevent the AI itself from having much influence, either directly or via subagent creation. Even the attempt in this post to detect subagents directly more properly detects influential decisions (of which the creation of a subagent is one).
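One common shape for a reduced-impact penalty is a distance between the predicted future and the counterfactual future where the AI stayed inactive. A minimal sketch, with an invented one-number world model and made-up values, shows why subagent creation registers as influential even though 'subagent' is never defined:

```python
# Toy reduced-impact penalty: score each action by how far the predicted
# future lies from the future where the AI did nothing. World model and
# numbers are invented for illustration.

def predicted_future(action):
    # A crude world model: each action maps to a 'state of the world'.
    futures = {
        "do_nothing": 0.0,
        "answer_one_question": 0.1,
        "create_subagent": 50.0,  # subagent goes on to reshape the world
    }
    return futures[action]

def impact_penalty(action):
    # Distance from the counterfactual where the AI stayed inactive.
    return abs(predicted_future(action) - predicted_future("do_nothing"))

for action in ["answer_one_question", "create_subagent"]:
    print(action, impact_penalty(action))
```

The penalty never mentions subagents; it catches `create_subagent` only because that decision is influential, which is exactly the distinction between banning subagents and banning influential subagents.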

This is certainly a positive, but it may not be robust. If the definition of “not having much influence” turns out to have holes, it would be less than ideal to have any remaining subagents hanging about.

The “make sure that you cannot be detected in the future” approach is probably the most promising for controlling subagents. The AI there is motivated to erase all traces of its existence, which would presumably include its subagents.

In all, it’s a very tricky problem, and the core failure mode of many ideas for AI control.