The Steering Problem

Most AI re­search fo­cuses on re­pro­duc­ing hu­man abil­ities: to learn, in­fer, and rea­son; to per­ceive, plan, and pre­dict. There is a com­ple­men­tary prob­lem which (un­der­stand­ably) re­ceives much less at­ten­tion: if you had these abil­ities, what would you do with them?

The steer­ing prob­lem: Us­ing black-box ac­cess to hu­man-level cog­ni­tive abil­ities, can we write a pro­gram that is as use­ful as a well-mo­ti­vated hu­man with those abil­ities?

This post ex­plains what the steer­ing prob­lem is and why I think it’s worth spend­ing time on.


A ca­pa­ble, well-mo­ti­vated hu­man can be ex­tremely use­ful: they can work with­out over­sight, pro­duce re­sults that need not be dou­ble-checked, and work to­wards goals that aren’t pre­cisely defined. Th­ese ca­pa­bil­ities are crit­i­cal in do­mains where de­ci­sions can­not be eas­ily su­per­vised, whether be­cause they are too fast, too com­plex, or too nu­mer­ous.

In some sense “be as use­ful as pos­si­ble” is just an­other task at which a ma­chine might reach hu­man-level perfor­mance. But it is differ­ent from the con­crete ca­pa­bil­ities nor­mally con­sid­ered in AI re­search.

We can say clearly what it means to “pre­dict well,” “plan well,” or “rea­son well.” If we ig­nored com­pu­ta­tional limits, ma­chines could achieve any of these goals to­day. And be­fore the ex­ist­ing vi­sion of AI is re­al­ized, we must nec­es­sar­ily achieve each of these goals.

For now, “be as use­ful as pos­si­ble” is in a differ­ent cat­e­gory. We can’t say ex­actly what it means. We could not do it no mat­ter how fast our com­put­ers could com­pute. And even if we re­solved the most salient challenges in AI, we could re­main in the dark about this one.

Con­sider a ca­pa­ble AI tasked with run­ning an aca­demic con­fer­ence. How should it use its ca­pa­bil­ities to make de­ci­sions?

  • We could try to spec­ify ex­actly what makes a con­fer­ence good or bad. But our re­quire­ments are com­plex and varied, and so spec­i­fy­ing them ex­actly seems time-con­sum­ing or im­pos­si­ble.

  • We could build an AI that imi­tates suc­cess­ful con­fer­ence or­ga­niz­ers. But this ap­proach can never do any bet­ter than the hu­mans we are imi­tat­ing. Real­is­ti­cally, it won’t even match hu­man perfor­mance un­less we some­how com­mu­ni­cate what char­ac­ter­is­tics are im­por­tant and why.

  • We could ask an AI to max­i­mize our satis­fac­tion with the con­fer­ence. But we’ll get what we mea­sure. An ex­ten­sive eval­u­a­tion would greatly in­crease the cost of the con­fer­ence, while a su­perfi­cial eval­u­a­tion would leave us with a con­fer­ence op­ti­mized for su­perfi­cial met­rics.
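The failure mode in the third option can be shown with a toy sketch. Everything here is hypothetical and invented for illustration: the three plans, their numbers, and the field names `true_value` and `survey_score`. In practice `true_value` is exactly the quantity we do not know how to write down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConferencePlan:
    name: str
    true_value: float    # what the organizers actually care about (hypothetical, not observable in practice)
    survey_score: float  # what a cheap exit survey would report (hypothetical)


# Made-up candidate plans; the numbers carry no empirical meaning.
PLANS = [
    ConferencePlan("strong program, plain venue", true_value=9.0, survey_score=6.5),
    ConferencePlan("weak program, lavish catering", true_value=3.0, survey_score=8.5),
    ConferencePlan("balanced", true_value=7.0, survey_score=7.0),
]


def optimize(plans, metric):
    """Return the plan that maximizes the given metric."""
    return max(plans, key=metric)


# An optimizer that only sees the superficial metric picks a different plan
# from one that could see what we actually care about.
chosen_by_proxy = optimize(PLANS, lambda p: p.survey_score)
chosen_by_value = optimize(PLANS, lambda p: p.true_value)
```

The optimizer is equally capable in both calls; only the quantity it is pointed at differs.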

Every­day ex­pe­rience with hu­mans shows how hard del­e­ga­tion can be, and how much eas­ier it is to as­sign a task to some­one who ac­tu­ally cares about the out­come.

Of course there is already pres­sure to write use­ful pro­grams in ad­di­tion to smart pro­grams, and some AI re­search stud­ies how to effi­ciently and ro­bustly com­mu­ni­cate de­sired be­hav­iors. For now, available solu­tions ap­ply only in limited do­mains or to weak agents. The steer­ing prob­lem is to close this gap.


A sys­tem which “merely” pre­dicted well would be ex­traor­di­nar­ily use­ful. Why does it mat­ter whether we know how to make a sys­tem which is “as use­ful as pos­si­ble”?

Our ma­chines will prob­a­bly do some things very effec­tively. We know what it means to “act well” in the ser­vice of a given goal. For ex­am­ple, us­ing hu­man cog­ni­tive abil­ities as a black box, we could prob­a­bly de­sign au­tonomous cor­po­ra­tions which very effec­tively max­i­mized growth. If the black box was cheaper than the real thing, such au­tonomous cor­po­ra­tions could dis­place their con­ven­tional com­peti­tors.

If ma­chines can do ev­ery­thing equally well, then this would be great news. If not, so­ciety’s di­rec­tion may be profoundly in­fluenced by what can and can­not be done eas­ily. For ex­am­ple, if we can only max­i­mize what we can pre­cisely define, we may in­ad­ver­tently end up with a world filled with ma­chines try­ing their hard­est to build big­ger fac­to­ries and bet­ter wid­gets, un­in­ter­ested in any­thing we con­sider in­trin­si­cally valuable.

All tech­nolo­gies are more use­ful for some tasks than oth­ers, but ma­chine in­tel­li­gence might be par­tic­u­larly prob­le­matic be­cause it can en­trench it­self. For ex­am­ple, a ra­tio­nal profit-max­i­miz­ing cor­po­ra­tion might dis­tribute it­self through­out the world, pay peo­ple to help pro­tect it, make well-crafted moral ap­peals for equal treat­ment, or cam­paign to change policy. Although such cor­po­ra­tions could bring large benefits in the short term, in the long run they may be difficult or im­pos­si­ble to up­root, even once they serve no one’s in­ter­ests.

Why now?

Re­pro­duc­ing hu­man abil­ities gets a lot of de­served at­ten­tion. Figur­ing out ex­actly what you’d do once you suc­ceed feels like plan­ning the cel­e­bra­tion be­fore the vic­tory: it might be in­ter­est­ing, but why can’t it wait?

  1. Maybe it’s hard. Prob­a­bly the steer­ing prob­lem is much eas­ier than the AI prob­lem, but it might turn out to be sur­pris­ingly difficult. If it is difficult, then learn­ing that ear­lier will help us think more clearly about AI, and give us a head start on ad­dress­ing it.

  2. It may help us un­der­stand AI. The difficulty of say­ing ex­actly what you want is a ba­sic challenge, and the steer­ing prob­lem is a nat­u­ral per­spec­tive on this challenge. A lit­tle bit of re­search on nat­u­ral the­o­ret­i­cal prob­lems is of­ten worth­while, even when the di­rect ap­pli­ca­tions are limited or un­clear. In sec­tion 4 we dis­cuss pos­si­ble ap­proaches to the steer­ing prob­lem, many of which are new per­spec­tives on im­por­tant prob­lems.

  3. It should be de­vel­oped alongside AI. The steer­ing prob­lem is a long-term goal in the same way that un­der­stand­ing hu­man-level pre­dic­tion is a long-term goal. Just as we do the­o­ret­i­cal re­search on pre­dic­tion be­fore that re­search is com­mer­cially rele­vant, it may be sen­si­ble to do the­o­ret­i­cal re­search on steer­ing be­fore it is com­mer­cially rele­vant. Ideally, our abil­ity to build use­ful sys­tems will grow in par­allel with our abil­ity to build ca­pa­ble sys­tems.

  4. Nine women can’t make a baby in one month. We could try to save resources by postponing work on the steering problem until it seems important. At that point it will be easier to work on the steering problem, and if the steering problem turns out to be unimportant then we can avoid thinking about it altogether.
    But at large scales it be­comes hard to speed up progress by in­creas­ing the num­ber of re­searchers. Fewer peo­ple work­ing for longer may ul­ti­mately be more effi­cient (even if ear­lier re­searchers are at a dis­ad­van­tage). This is par­tic­u­larly press­ing if we may even­tu­ally want to in­vest much more effort in the steer­ing prob­lem.

  5. AI progress may be sur­pris­ing. We prob­a­bly won’t re­pro­duce hu­man abil­ities in the next few decades, and we prob­a­bly won’t do it with­out am­ple ad­vance no­tice. That said, AI is too young, and our un­der­stand­ing too shaky, to make con­fi­dent pre­dic­tions. A mere 15 years is 20% of the his­tory of mod­ern com­put­ing. If im­por­tant hu­man-level ca­pa­bil­ities are de­vel­oped sur­pris­ingly early or rapidly, then it would be worth­while to bet­ter un­der­stand the im­pli­ca­tions in ad­vance.

  6. The field is sparse. Because the steering problem and similar questions have received so little attention, individual researchers are likely to make rapid headway. Basic research on AI is perhaps three to four orders of magnitude larger in scale than research directly relevant to the steering problem, which lowers the bar for arguments 1-5.

In sec­tion 3 we dis­cuss some other rea­sons not to work on the steer­ing prob­lem: Is work done now likely to be rele­vant? Is there any con­crete work to do now? Should we wait un­til we can do ex­per­i­ments? Are there ad­e­quate in­cen­tives to re­solve this prob­lem already?

Defin­ing the prob­lem precisely

Re­call our prob­lem state­ment:

The steer­ing prob­lem: Us­ing black-box ac­cess to hu­man-level cog­ni­tive abil­ities, can we write a pro­gram that is as use­ful as a well-mo­ti­vated hu­man with those abil­ities?

We’ll adopt a particular human, Hugh, as our “well-motivated human”: we’ll assume that we have black-box access to Hugh-level cognitive abilities, and we’ll try to write a program which is as useful as Hugh.


In re­al­ity, AI re­search yields com­pli­cated sets of re­lated abil­ities, with rich in­ter­nal struc­ture and no sim­ple perfor­mance guaran­tees. But in or­der to do con­crete work in ad­vance, we will model abil­ities as black boxes with well-defined con­tracts.

We’re par­tic­u­larly in­ter­ested in tasks which are “AI com­plete” in the sense that hu­man-level perfor­mance on that task could be used as a black box to achieve hu­man-level perfor­mance on a very wide range of tasks. For now, we’ll fur­ther fo­cus on do­mains where perfor­mance can be un­am­bigu­ously defined.

Some ex­am­ples:

  • Boolean ques­tion-an­swer­ing. A ques­tion-an­swerer is given a state­ment and out­puts a prob­a­bil­ity. A ques­tion-an­swerer is Hugh-level if it never makes judg­ments pre­dictably worse than Hugh’s. We can con­sider ques­tion-an­swer­ers in a va­ri­ety of lan­guages, rang­ing from nat­u­ral lan­guage (“Will a third party win the US pres­i­dency in 2016?”) to pre­cise al­gorith­mic speci­fi­ca­tions (“Will this pro­gram out­put 1?”).

  • Online learning. A function-learner is given a sequence of labelled examples (x, y) and predicts the label of a new data point, x’. A function-learner is Hugh-level if, after training on any sequence of data (x_i, y_i), the learner’s guess for the label of the next point x_{i+1} is, on average, at least as good as Hugh’s.

  • Em­bod­ied re­in­force­ment learn­ing. A re­in­force­ment learner in­ter­acts with an en­vi­ron­ment and re­ceives pe­ri­odic re­wards, with the goal of max­i­miz­ing the dis­counted sum of its re­wards. A re­in­force­ment learner is Hugh-level if, fol­low­ing any se­quence of ob­ser­va­tions, it achieves an ex­pected perfor­mance as good as Hugh’s in the sub­se­quent rounds. The ex­pec­ta­tion is taken us­ing our sub­jec­tive dis­tri­bu­tion over the phys­i­cal situ­a­tion of an agent who has made those ob­ser­va­tions.
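The three contracts above can be sketched as interfaces. This is a minimal illustration and not a specification from the post: the class and method names (`QuestionAnswerer.probability`, `FunctionLearner.predict`, `ReinforcementLearner.act`) are assumptions chosen here, and `MajorityLearner` is a deliberately weak stand-in that is nowhere near Hugh-level. The "Hugh-level" guarantee itself cannot be expressed in a type signature; it is an assumption about whatever sits behind the interface.

```python
from typing import Hashable, Optional, Protocol, Sequence, Tuple


class QuestionAnswerer(Protocol):
    def probability(self, statement: str) -> float:
        """Return a probability in [0, 1] that the statement is true."""
        ...


class FunctionLearner(Protocol):
    def predict(self, examples: Sequence[Tuple[object, Hashable]], x_new: object) -> Optional[Hashable]:
        """Given labelled examples (x, y), guess the label of x_new."""
        ...


class ReinforcementLearner(Protocol):
    def act(self, observation: object, reward: float) -> object:
        """Receive the latest observation and reward, return the next action."""
        ...


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum r_0 + gamma * r_1 + gamma**2 * r_2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


class MajorityLearner:
    """Toy function-learner: predicts the most common label seen so far."""

    def predict(self, examples, x_new):
        labels = [y for _, y in examples]
        if not labels:
            return None
        return max(set(labels), key=labels.count)


# Structural typing: MajorityLearner satisfies FunctionLearner without inheriting from it.
learner: FunctionLearner = MajorityLearner()
guess = learner.predict([(1, "a"), (2, "b"), (3, "a")], 4)
```

A program that solves the steering problem would be written against interfaces like these, without access to the internals of the boxes.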

When talk­ing about Hugh’s pre­dic­tions, judg­ments, or de­ci­sions, we imag­ine that Hugh has ac­cess to a rea­son­ably pow­er­ful com­puter, which he can use to pro­cess or dis­play data. For ex­am­ple, if Hugh is given the bi­nary data from a cam­era, he can ren­der it on a screen in or­der to make pre­dic­tions about it.

We can also con­sider a par­tic­u­larly de­gen­er­ate abil­ity:

  • Unlimited computation. A box that can run any algorithm in a single time step is, in some sense, Hugh-level at every precisely stated task.

Although un­limited com­pu­ta­tion seems ex­cep­tion­ally pow­er­ful, it’s not im­me­di­ately clear how to solve the steer­ing prob­lem even us­ing such an ex­treme abil­ity.
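A small sketch may make the difficulty concrete. With unlimited computation, exhaustive search over plans is trivial, but the search still needs a precisely stated objective as input. The function `exhaustive_argmax` and the toy objective `widgets_built` are hypothetical names introduced here for illustration only.

```python
from itertools import product


def exhaustive_argmax(options_per_step, score):
    """Try every sequence of choices and return the highest-scoring one.

    With unlimited computation this brute-force search costs nothing,
    but `score` still has to be supplied by someone.
    """
    return max(product(*options_per_step), key=score)


def widgets_built(plan):
    """A precisely defined toy objective: count the 'build' steps in the plan."""
    return sum(1 for step in plan if step == "build")


# Three time steps, two possible actions at each step.
best = exhaustive_argmax([("build", "rest")] * 3, widgets_built)
```

The search maximizes whatever `score` it is handed. Nothing in it helps us write a `score` that means "be as useful as possible", which is the part of the problem that unlimited computation leaves open.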

Mea­sur­ing usefulness

What does it mean for a pro­gram to be “as use­ful” as Hugh?

We’ll start by defin­ing “as use­ful for X as Hugh,” and then we will in­for­mally say that a pro­gram is “as use­ful” as Hugh if it’s as use­ful for the tasks we care most about.

Con­sider H, a black box that simu­lates Hugh or per­haps con­sults a ver­sion of Hugh who is work­ing re­motely. We’ll sup­pose that run­ning H takes the same amount of time as con­sult­ing our Hugh-level black boxes. A pro­ject to ac­com­plish X could po­ten­tially use as many copies of H as it can af­ford to run.

A program P is at least as useful as Hugh for X if, for every project using H to accomplish X, we can efficiently transform it into a new project which uses P to accomplish X. The new project shouldn’t be much more expensive: it shouldn’t take much longer, use much more computation or many additional resources, involve much more human labor, or have significant additional side-effects.
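One way to write the usefulness relation defined above more formally is sketched below. The notation is an assumption introduced here, not the post's own: \Pi_X(B) is the set of projects that use black box B and accomplish X, T is the transformation between projects, cost aggregates time, computation, other resources, human labor, and side-effects, and a small \varepsilon > 0 stands in for "not much more expensive". Placing the quantifier on T before the quantifier on projects is also an assumption about how "efficiently transform" should be read.

```latex
P \succeq_X H
\;\iff\;
\exists\, T \text{ efficiently computable such that }
\forall\, \pi \in \Pi_X(H):\quad
T(\pi) \in \Pi_X(P)
\;\text{ and }\;
\mathrm{cost}\bigl(T(\pi)\bigr) \le (1 + \varepsilon)\,\mathrm{cost}(\pi)
```

Collapsing the several kinds of expense into a single cost function is a simplification; a more careful version might bound each resource separately.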


What does it mean for Hugh to be well-motivated?

The eas­iest ap­proach is uni­ver­sal quan­tifi­ca­tion: for any hu­man Hugh, if we run our pro­gram us­ing Hugh-level black boxes, it should be as use­ful as Hugh.

Alter­na­tively, we can lev­er­age our in­tu­itive sense of what it means for some­one to be well-mo­ti­vated to do X, and define “well-mo­ti­vated” to mean “mo­ti­vated to help the user’s pro­ject suc­ceed.”

Scal­ing up

If we are given bet­ter black boxes, we should make a bet­ter pro­gram. This is cap­tured by the re­quire­ment that our pro­gram should be as use­ful as Hugh, no mat­ter how ca­pa­ble Hugh is (as long as the black boxes are equally ca­pa­ble).

Ideally, our solu­tions should scale far past hu­man-level abil­ities. This is not a the­o­ret­i­cal con­cern—in many do­mains com­put­ers already have sig­nifi­cantly su­per­hu­man abil­ities. This re­quire­ment is harder to make pre­cise, be­cause we can no longer talk about the “hu­man bench­mark.” But in gen­eral, we would like to build sys­tems which are (1) work­ing to­wards their owner’s in­ter­ests, and (2) nearly as effec­tive as the best goal-di­rected sys­tems that can be built us­ing the available abil­ities. The ideal solu­tion to the steer­ing prob­lem will have these char­ac­ter­is­tics in gen­eral, even when the black-box abil­ities are rad­i­cally su­per­hu­man.

This is an abridged ver­sion of this doc­u­ment from 2014; most of the doc­u­ment is now su­per­seded by later posts in this se­quence.
