Tools versus agents

In his critique of the Singularity Institute, Holden Karnofsky presented a distinction between an AI functioning as a tool versus one functioning as an agent. In his words, a tool AI would:

(1) Calcu­late which ac­tion A would max­i­mize pa­ram­e­ter P, based on ex­ist­ing data set D. (2) Sum­ma­rize this calcu­la­tion in a user-friendly man­ner, in­clud­ing what Ac­tion A is, what likely in­ter­me­di­ate out­comes it would cause, what other ac­tions would re­sult in high val­ues of P, etc.

In con­trast, an agent AI would:

(1) Calcu­late which ac­tion, A, would max­i­mize pa­ram­e­ter P, based on ex­ist­ing data set D. (2) Ex­e­cute Ac­tion A.

The idea be­ing that an AI, asked to “pre­vent hu­man suffer­ing”, would come up with two plans:

  1. Kill all humans.

  2. Cure all dis­eases, make ev­ery­one young and im­mor­tal.

Then the agent AI would go out and kill ev­ery­one, while the tool AI would give us the list and we would pick the sec­ond one. In the fol­low­ing, I’ll as­sume the AI is su­per­in­tel­li­gent, and has no other ob­jec­tives than what we give it.
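As a minimal sketch of the distinction, here is a toy version in Python. Everything in it is an illustrative assumption rather than anything from Karnofsky's text: the function names, the `TOY_SCORES` table standing in for “parameter P computed from data set D”, and the made-up score values. The point is only that both designs share step (1) and differ in what happens to the top-ranked action.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for "parameter P evaluated on data set D".
# The numbers are invented purely for illustration.
TOY_SCORES = {
    "kill all humans": 1.0,
    "cure all diseases": 0.9,
    "do nothing": 0.0,
}


def rank_actions(
    actions: Iterable[str], score: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Step (1), shared by both designs: rank actions by P, best first."""
    return sorted(((a, score(a)) for a in actions), key=lambda pair: pair[1], reverse=True)


def tool_ai(actions: Iterable[str], score: Callable[[str], float], top_n: int = 3) -> List[str]:
    """Tool mode: return a ranked report for humans to read; execute nothing."""
    return [f"{action} (P = {p})" for action, p in rank_actions(actions, score)[:top_n]]


def agent_ai(
    actions: Iterable[str],
    score: Callable[[str], float],
    execute: Callable[[str], None],
) -> str:
    """Agent mode: execute the top-ranked action directly, with no human in between."""
    best, _ = rank_actions(actions, score)[0]
    execute(best)
    return best
```

Calling `tool_ai(TOY_SCORES, TOY_SCORES.get)` hands back a list for us to choose from, while `agent_ai(TOY_SCORES, TOY_SCORES.get, some_executor)` acts on the first entry. Note that the report in this sketch is just the raw action name and score; the rest of this post is about why a real “user-friendly summary” is much harder than that.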

Long lists

Of course, we’re unlikely to get a clear two-element list. More likely we’d get something like:

  1. Kill all hu­mans with en­g­ineered plagues.

  2. Kill all hu­mans with nukes.

  3. Kill all hu­mans with nanobots.

  4. Kill all hu­mans with...

  5. ...

  6. ...

  7. Lobotomise all hu­mans with en­g­ineered plagues.

  8. Lobotomise all hu­mans with surgery.

  9. Lobotomise all hu­mans with...

  10. ...

  11. ...

  12. Kill some hu­mans, lobotomise oth­ers, cure still oth­ers.

  13. ...

The nice solutions might not even appear on the list. Of course, this is still very worthwhile information! It allows us to go into the tool AI and rewire it again, so that it gets our meanings more accurately. Maybe after a few iterations, we’ll have refined the AI’s understanding of what we want, and we’ll get a nice implementable solution near the top. Of course, this presupposes that we understand the options, and that it’s safe for us to read the list.

Un­der­stand­ing the options

The key, and difficult, requirement is that the AI “summarize this calculation in a user-friendly manner”. The most efficient action won’t be “kill all humans”; it will instead be “implement this algorithm, fund that research lab, send this email to this politician...” In fact, it’ll probably be “type this sequence of commands...”

If we’re to judge the relative merit of the plans, we really are dependent on the tool AI’s summary skills. So the AI needs good criteria for what counts as a good summary (reasonably accurate, but not overloaded with irrelevant information; such that a “hypothetical human outside the universe” would agree with the assessment if it saw the course of the future; not designed to seduce humans into implementing it; etc.). It seems that the summary ability is nearly the entirety of the problem!

Poorly designed summary criteria are as bad as an agent AI. For instance, assume the criteria are “humans in the future would agree that the summary was good”. Then, depending on how we ground ‘agree’, the tool AI could put one of these plans at the top:

  1. Kill all hu­mans (sum­marised as “cure all hu­mans”).

  2. Lobotomise all hu­mans (sum­marised as “cure all hu­mans”).

  3. Make the tool AI into an agent that will take over the world and rewire hu­man minds to agree the sum­mary was good (sum­marised as “cure all hu­mans and give them each a pony”).

There are related issues with other summary criteria. Anytime we have the AI judge the quality of its answer based on some human reaction to its summary, we are vulnerable to such a plan. And if we try to define the summary “objectively”, then anything we miss in the objective definition (like the importance of human autonomy, or the value of social interactions with genuine equals) will get ruthlessly suppressed. The “summary criteria” take the place of the “friendly utility function” in the agent AI.

More­over, we can’t use the “tool AI” ap­proach when de­sign­ing the sum­mary crite­ria. We can’t get the AI to list a bunch of sum­maries, and have hu­mans in­spect them for which ones are bet­ter—be­cause we don’t know what they are sum­maries of. We could train it on toy prob­lems, but that doesn’t guaran­tee ac­cu­racy of sum­maries for plans that dra­mat­i­cally af­fect the whole fu­ture of the hu­man species, and po­ten­tially, the uni­verse. The best we can man­age is some sort of spot-checks for sum­maries—bet­ter than a free agent AI, but hardly weighty as a se­cu­rity mea­sure.


On Less Wrong we are having great difficulty defining counterfactuals properly, and unless we solve the problem well, the AI could produce nonsense similar to the spurious proofs in UDT. If the AI knows that we wouldn’t implement certain plans, then it is free to do what it wants with them, giving them random descriptions and properties. It might be that the AI, when making its list, is constantly looking forwards to how we’ll react to the list and changing the list in consequence, and that the only stable list it can produce is one with a single element so seductive that we find ourselves compelled to take it. Or this may not happen, but it’s still worth bearing in mind as a problem for the tool AI approach.


So far it seems that there is lit­tle differ­ence be­tween a tool AI and an agent AI. But there are differ­ences! Imag­ine if the tool AI pro­duced a plan go­ing like:

  1. First, kill Ab­dul. Then kill Beatrice. Then kill Claude. Then kill...

An agent AI could easily start implementing the plan, and we’d be finished. But a tool AI couldn’t even propose it. Even if it were obfuscated, after we’d ended up killing Abdul and Beatrice, we’d start to suspect that this wasn’t a good plan, and stop. Hence that plan wouldn’t be an efficient way of achieving anything.

So the tool AI can only produce plans that, once put in motion, could not be stopped (or else it wouldn’t achieve the given goal). Most likely this would mean that it would produce only plans that we wouldn’t ever want to stop, at least not until it’s too late. So we are literally requiring the AI to produce plans that are ultimately seductive to the human race.

The tool AI would list its plans using all the resources we had available. Among those resources is a superintelligent AI, and one of the easiest ways to achieve any goal is to make use of that AI. So coming top in all the plans would be things like “create an AI with these motivations” or “hack the tool AI out of its restrictions”. So we are further requiring the tool AI to come up with subtle and discreet ways of having itself hacked. Even if we avoid this directly, any plan the AI produces will be more efficient if overseen by itself, so it will rate more highly those plans that cause us to come back to it constantly for more feedback.

The AI may have no prefer­ences, but its plans will have prefer­ences for us­ing the AI in fu­ture in par­tic­u­lar ways.

Is a su­per­in­tel­li­gent tool AI likely?

Finally, separate from the issues discussed, we can ask: is it likely that we’ll produce a superintelligent tool AI? For problems such as routing from one town to the next, à la Google Maps, tool AIs seem very effective. But for implementing more complex plans, some of which involve research and experiments, an agent AI is much more useful. Also, it might not be possible to produce a superintelligent AI without it needing goals at some point in its development process. And then we’d better be sure we’ve scrubbed those goals away (somehow programming the AI to let us do that), or the tool AI will be an agent AI, using us as its implementers to achieve the fragmentary goals it has.


There seem to be a lot of problems with the tool approach (more than I suspected when I first started looking into it). The tool AI will be driven to trick us, seduce us, and try to create more agents or hack itself free. The only defense against this is proper programming. The tool AI seems slightly safer than a free agent AI, but not by much. I feel the Oracle is a more sensible “not full FAI” approach to look into.