Capability amplification

(Note: In the past I have referred to this pro­cess as ‘boot­strap­ping’ or ‘policy am­plifi­ca­tion,’ but those terms are too broad — there are other di­men­sions along which poli­cies can be am­plified, and ‘boot­strap­ping’ is used all over the place.)

Defin­ing the “in­tended be­hav­ior” of a pow­er­ful AI sys­tem is a challenge.

We don’t want such sys­tems to sim­ply imi­tate hu­man be­hav­ior — we want them to im­prove upon hu­man abil­ities. And we don’t want them to only take ac­tions that look good to hu­mans — we want them to im­prove upon hu­man judg­ment.

We also don’t want them to pur­sue sim­ple goals like “min­i­mize the prob­a­bil­ity that the bridge falls down” or “pick the win­ning move.” A pre­cise state­ment of our real goals would be in­cred­ibly com­pli­cated, and ar­tic­u­lat­ing them pre­cisely is it­self a mas­sive pro­ject. More­over, we of­ten care about con­se­quences over years or decades. Such long-term con­se­quences would have lit­tle use as a prac­ti­cal prob­lem defi­ni­tion in ma­chine learn­ing, even if they could serve as a philo­soph­i­cal prob­lem defi­ni­tion.

So: what else can we do?

In­stead of defin­ing what it means for a policy to be “good,” we could define a trans­for­ma­tion which turns one policy into a “bet­ter” policy.

I call such a trans­for­ma­tion ca­pa­bil­ity am­plifi­ca­tion — it “am­plifies” a weak policy into a strong policy, typ­i­cally by us­ing more com­pu­ta­tional re­sources and ap­ply­ing the weak policy many times.


I am in­ter­ested in ca­pa­bil­ity am­plifi­ca­tion be­cause I think it is the most plau­si­ble route to defin­ing the goals of pow­er­ful AI sys­tems, which I see as a key bot­tle­neck for build­ing al­igned AI. The most plau­si­ble al­ter­na­tive ap­proach is prob­a­bly in­verse RL, but I think that there are still hard philo­soph­i­cal prob­lems to solve, and that in prac­tice IRL would prob­a­bly need to be com­bined with some­thing like ca­pa­bil­ity am­plifi­ca­tion.

More di­rectly, I think that ca­pa­bil­ity am­plifi­ca­tion might be a work­able ap­proach to train­ing pow­er­ful RL sys­tems when com­bined with semi-su­per­vised RL, ad­ver­sar­ial train­ing, and in­formed over­sight (or an­other ap­proach to re­ward en­g­ineer­ing).

Ex­am­ple of ca­pa­bil­ity am­plifi­ca­tion: an­swer­ing questions

Suppose that we would like to amplify one question-answering system A into a “better” question-answering system A⁺.

We will be given a ques­tion Q and an im­ple­men­ta­tion of A; we can use A, or any other tools at our dis­posal, to try to an­swer the ques­tion Q. We have some time limit; in re­al­ity it might be eight hours, but for the pur­pose of a sim­ple ex­am­ple sup­pose it is twenty sec­onds. The am­plifi­ca­tion A⁺(Q) is defined to be what­ever an­swer we come up with by the end of the time limit. The goal is for this an­swer to be “bet­ter” than the an­swer that A would have given on its own, or to be able to an­swer harder ques­tions than A could have an­swered di­rectly.

For ex­am­ple, sup­pose that Q = “Which is more wa­ter-sol­u­ble, table salt or table sugar?” Sup­pose fur­ther that A can’t an­swer this ques­tion on its own: A(“Which is more wa­ter-sol­u­ble…”) = “I don’t know.”

I could start by com­put­ing A(“How do you quan­tify wa­ter-sol­u­bil­ity?”); say this gives the an­swer “By mea­sur­ing how much of the sub­stance can dis­solve in a fixed quan­tity of wa­ter.” Then I ask A(“How much table salt will dis­solve in a liter of wa­ter?”) and get back the an­swer “360 grams.” Then I ask A(“How much sugar will dis­solve in a liter of wa­ter?”) and get back the an­swer “2 kilo­grams.” Then I re­ply “Su­gar is about six times more sol­u­ble than salt.”

Thus A⁺(“Which is more water-soluble, table salt or table sugar?”) = “Sugar is about six times more soluble than salt.” This is better than the answer that A gave — in some sense, we’ve successfully amplified A into something smarter.
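As a toy sketch, the exchange above can be written in code. This is illustrative only: `weak_qa` and `amplified_qa` are hypothetical stand-ins for A and A⁺, and the decomposition is hard-coded for this one question.

```python
def weak_qa(question):
    """A stand-in for the weak policy A, with a small fixed set of answers."""
    answers = {
        "How much table salt will dissolve in a liter of water?": "360 grams",
        "How much sugar will dissolve in a liter of water?": "2 kilograms",
    }
    return answers.get(question, "I don't know")

def amplified_qa(question):
    """A stand-in for A+: within the time limit, we break the question into
    subquestions that A can answer, then combine the subanswers."""
    if question == "Which is more water-soluble, table salt or table sugar?":
        salt = weak_qa("How much table salt will dissolve in a liter of water?")
        sugar = weak_qa("How much sugar will dissolve in a liter of water?")
        # 2 kilograms vs. 360 grams: sugar dissolves about six times as much.
        return "Sugar is about six times more soluble than salt."
    return weak_qa(question)
```

Note that A⁺ answers a question that A alone could not, using only calls to A plus a little glue.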

The gen­eral problem

The ca­pa­bil­ity am­plifi­ca­tion prob­lem is to use one policy A to im­ple­ment a new policy A⁺ which is strictly “bet­ter” than A. (Re­call that a policy is a map­ping from in­puts to out­puts.)

We’ll dis­cuss the defi­ni­tion of “bet­ter” in the next sec­tion, but for now you can use an in­tu­itive defi­ni­tion. Note that “bet­ter” does not mean that we can im­ple­ment A⁺ us­ing fewer com­pu­ta­tional re­sources than A — in fact we will im­ple­ment A⁺ by us­ing a huge amount of com­pu­ta­tion and time.

What does it mean to “implement” the amplified policy A⁺? It means that we have some process that takes as input an observation o[1] and produces an action a[1]. It then takes as input the next observation o[2] and produces the next action a[2], and so on.

The process that implements A⁺ may instantiate any number of agents who use the policy A and interact with them. The process might make copies of any of these agents. And the process can involve us personally thinking about the problem, or using any other tools that we have available — having access to A may be a useful resource, but we can also do things from scratch if that’s easier.

The ca­pa­bil­ity am­plifi­ca­tion prob­lem comes with a time limit — we need to provide an im­ple­men­ta­tion that runs within that time limit. (When we sub­mit one ac­tion a[k], we im­me­di­ately see the next ob­ser­va­tion o[k+1].) Once the time limit runs out, we au­to­mat­i­cally out­put a nil ac­tion in re­sponse to each ad­di­tional ob­ser­va­tion. One way to be “bet­ter” is to be able to han­dle longer se­quences of ob­ser­va­tions.

The time limit could be ar­bi­trary, but I’ll pick one day for con­crete­ness. I’ll as­sume that we have an im­ple­men­ta­tion of A that runs in one sec­ond per epi­sode.

Note that we can start from the trivial policy ∅, which always outputs nil. In this case, ∅⁺ is a policy that we can implement “from scratch.”
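The observation–action loop with a time limit might be sketched as follows. The function `run_amplified` and its per-observation timing are assumptions for the sketch, not the literal protocol:

```python
import time

NIL = None  # the nil action

def run_amplified(implementation, observations, time_limit):
    """Drive a process implementing A+ over a stream of observations.
    Within the time limit the process may do anything; once time runs
    out, every further observation automatically receives nil."""
    deadline = time.monotonic() + time_limit
    actions = []
    for obs in observations:
        if time.monotonic() >= deadline:
            actions.append(NIL)              # time limit exhausted
        else:
            actions.append(implementation(obs))
    return actions
```

With a zero time limit this reduces to the trivial policy ∅, which outputs nil for every observation.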


To mea­sure how well we can solve ca­pa­bil­ity am­plifi­ca­tion, we’ll in­tro­duce the con­cept of reach­a­bil­ity.

Reach­a­bil­ity is defined with re­spect to a class of poli­cies 𝒜 and a prefer­ence or­der­ing ⪰. In­tu­itively:

  • 𝒜 is a class of poli­cies that we are ca­pa­ble of im­ple­ment­ing effi­ciently. For ex­am­ple, 𝒜 might be the set of poli­cies that can be im­ple­mented by a ten layer neu­ral net­work.

  • We say that A ⪰ B if we are at least as happy with policy A as with policy B (in any situation that we think might arise in practice).

We say that C is reachable from A if:

  • A⁺ ⪰ C, where A⁺ is the amplification of A as described in the last section; or

  • There is an intermediate B ∈ 𝒜 which is reachable from A and from which C is reachable.

Equivalently: C is reachable from A if there is a chain of policies in 𝒜 which starts at A and ends at C, where each policy in the chain is no better than the amplification of the previous policy.

The better we are at capability amplification, the more policies will be reachable from any given starting point. Our goal is to have as many policies as possible be reachable from the trivial policy ∅ — ideally, every policy in 𝒜 would be reachable from ∅.
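In a toy model where policies are totally ordered by a numeric capability score, reachability is just a transitive closure. All names here (`reachable_set`, `score`, `amplified_score`) are hypothetical, introduced for this sketch:

```python
def reachable_set(start, policies, amplified_score, score):
    """Compute every policy reachable from `start`. In this toy model,
    C is reachable from A if there is a chain through the class where
    each policy scores no higher than the amplification of the previous
    one; `amplified_score(a)` rates the amplification of a."""
    found = {start}
    frontier = {start}
    while frontier:
        new = {c for a in frontier for c in policies
               if c not in found and score(c) <= amplified_score(a)}
        found |= new
        frontier = new
    return found
```

If amplification always gains some capability, everything is reachable from the trivial policy; if it gains nothing, nothing new is.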


An obstruction to capability amplification is a partition of the policy class 𝒜 into two parts 𝓛 and 𝓗, such that we cannot amplify any policy in 𝓛 to be at least as good as any policy in 𝓗.

Obstructions are dual to reachability in a natural sense. If there are any non-reachable policies, then there is some corresponding obstruction. The desired output of research on capability amplification is a matching amplification strategy and obstruction — a way to reach many policies, and an obstruction that implies that we can’t reach any more.

Analogously, we say that a function L : 𝒜 → ℝ is an obstruction if our amplification procedure cannot always increase L. That is, L is an obstruction if there exists a threshold ℓ such that the two sets { A ∈ 𝒜 : L(A) ≤ ℓ } and { A ∈ 𝒜 : L(A) > ℓ } are an obstruction, or such that { A ∈ 𝒜 : L(A) < ℓ } and { A ∈ 𝒜 : L(A) ≥ ℓ } are an obstruction.
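A minimal sketch of the threshold version, assuming a toy model in which policies and L are numeric; `is_obstruction` checks only the { L ≤ ℓ } / { L > ℓ } split:

```python
def is_obstruction(policies, amplify, L, threshold):
    """Check whether the partition {A : L(A) <= threshold} versus
    {A : L(A) > threshold} is an obstruction: amplification never
    moves a policy from the low part into the high part."""
    low = [a for a in policies if L(a) <= threshold]
    return all(L(amplify(a)) <= threshold for a in low)
```

For example, if amplification can never push capability past some ceiling, thresholding at that ceiling yields an obstruction, while lower thresholds do not.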

If we could find a con­vinc­ing ar­gu­ment that some par­ti­tion was an ob­struc­tion, then that would help fur­ther our un­der­stand­ing of value al­ign­ment. The next step would be to ask: can we sen­si­bly define “good be­hav­ior” for poli­cies in the in­ac­cessible part 𝓗? I sus­pect this will help fo­cus our at­ten­tion on the most philo­soph­i­cally fraught as­pects of value al­ign­ment.

In the ap­pen­dices I give an ex­am­ple of an ob­struc­tion in a par­tic­u­lar sim­ple model.

Re­la­tion­ship to value alignment

Why ca­pa­bil­ity am­plifi­ca­tion seems feasible

Ca­pa­bil­ity am­plifi­ca­tion is a spe­cial case of the gen­eral prob­lem of “build­ing an AI that does the right thing.” It is eas­ier in two re­spects:

  1. In the gen­eral prob­lem we need to con­struct a “good” policy from scratch. In ca­pa­bil­ity am­plifi­ca­tion we need to con­struct a good policy A⁺ start­ing from a slightly weaker policy A.

  2. In the gen­eral prob­lem we must effi­ciently im­ple­ment a good policy. In ca­pa­bil­ity am­plifi­ca­tion our im­ple­men­ta­tion of A⁺ is al­lowed to take up to a day, even though the goal is to im­prove upon a policy A that runs in one sec­ond.

In­tu­itively, these seem like large ad­van­tages.

Nev­er­the­less, it may be that ca­pa­bil­ity am­plifi­ca­tion con­tains the hard­est as­pects of value al­ign­ment. If true, I think this would change our con­cep­tion of the value al­ign­ment prob­lem and what the core difficul­ties are. For ex­am­ple, if ca­pa­bil­ity am­plifi­ca­tion is the “hard part,” then the value al­ign­ment prob­lem is es­sen­tially or­thog­o­nal to the al­gorith­mic challenge of build­ing an in­tel­li­gence.

Why ca­pa­bil­ity am­plifi­ca­tion seems useful

Ca­pa­bil­ity am­plifi­ca­tion can be com­bined with re­ward en­g­ineer­ing in a nat­u­ral way:

  • Define A0 = ∅

  • Ap­ply ca­pa­bil­ity am­plifi­ca­tion to ob­tain A0⁺

  • Ap­ply re­ward en­g­ineer­ing to define a re­ward func­tion, and use this to train an agent A1 which is bet­ter than A0

  • Ap­ply ca­pa­bil­ity am­plifi­ca­tion to ob­tain A1⁺

  • Re­peat to ob­tain a se­quence of in­creas­ingly pow­er­ful agents

This is very in­for­mal, and ac­tu­ally car­ry­ing out such a pro­cess re­quires re­solv­ing many tech­ni­cal difficul­ties. But it sug­gests that ca­pa­bil­ity am­plifi­ca­tion and re­ward en­g­ineer­ing might provide a foun­da­tion for train­ing an al­igned AI.
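The bullets above can be sketched as a loop. This is purely schematic: `amplify` and `distill` are placeholders for capability amplification and the reward-engineering training step, and “capability” is collapsed into a single number.

```python
def train_by_iteration(initial, amplify, distill, rounds):
    """Alternate amplification and distillation: amplify the current
    agent (slow but more capable), then train a fast agent to
    approximate the amplified one."""
    agent = initial
    trajectory = [agent]
    for _ in range(rounds):
        amplified = amplify(agent)     # e.g. many copies collaborating
        agent = distill(amplified)     # reward engineering / training step
        trajectory.append(agent)
    return trajectory
```

As long as distillation loses less capability than amplification gains, the sequence of agents grows steadily stronger.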

What to do?


The best ap­proach seems to be to work from both sides, si­mul­ta­neously search­ing for challeng­ing ob­struc­tions and search­ing for am­plifi­ca­tion pro­ce­dures that ad­dress those ob­struc­tions.

There are at least two very differ­ent an­gles on ca­pa­bil­ity am­plifi­ca­tion:

  • Col­lab­o­ra­tion: figure out how a bunch of agents us­ing A can break a prob­lem down into smaller pieces and at­tack those pieces sep­a­rately, al­low­ing them to solve harder prob­lems than they could solve in­de­pen­dently.

  • Philos­o­phy: try to bet­ter un­der­stand what “good” rea­son­ing is, so that we can bet­ter un­der­stand how good rea­son­ing is com­posed of sim­pler steps. For ex­am­ple, math­e­mat­i­cal proof is a tech­nique which re­lates hard prob­lems to long se­quences of sim­ple steps. There may be more gen­eral ideas along similar lines.

In the ap­pen­dices, I de­scribe some pos­si­ble am­plifi­ca­tion schemes and ob­struc­tions, along with some early ideas about ca­pa­bil­ity am­plifi­ca­tion in gen­eral.


To­day, it is prob­a­bly most worth­while to study ca­pa­bil­ity am­plifi­ca­tion when A is a hu­man’s policy.

In this set­ting, we are given some weak hu­man policy A — say, a hu­man think­ing for an hour. We would like to am­plify this to a strong col­lab­o­ra­tive policy A⁺, by in­vok­ing a bunch of copies of A and hav­ing them in­ter­act with each other ap­pro­pri­ately.

In some sense this is the fully gen­eral prob­lem of or­ga­niz­ing hu­man col­lab­o­ra­tions. But we can fo­cus our at­ten­tion on the most plau­si­ble ob­struc­tions for ca­pa­bil­ity am­plifi­ca­tion, and try to de­sign col­lab­o­ra­tion frame­works that let us over­come those ob­struc­tions.

In this con­text, I think the most in­ter­est­ing ob­struc­tion is work­ing with con­cepts that are (slightly) too com­pli­cated for any in­di­vi­d­ual copy of A to un­der­stand on its own. This looks like a hard prob­lem that is mostly un­ad­dressed by usual ap­proaches to col­lab­o­ra­tion.

This post lays out a closely re­lated prob­lem — quickly eval­u­at­ing ar­gu­ments by ex­perts — which gets at most of the same difficul­ties but may be eas­ier to study. Su­perfi­cially, eval­u­at­ing ar­gu­ments may seem eas­ier than solv­ing prob­lems from scratch. But be­cause it is so much eas­ier to col­lab­o­ra­tively cre­ate ar­gu­ments once you have a way to eval­u­ate them, I think the gap is prob­a­bly only su­perfi­cial.


The ca­pa­bil­ity am­plifi­ca­tion prob­lem may effec­tively iso­late the cen­tral philo­soph­i­cal difficul­ties of value al­ign­ment. It’s not easy to guess how hard it is — we may already have “good enough” solu­tions, or it may effec­tively be a restate­ment of the origi­nal prob­lem.

Ca­pa­bil­ity am­plifi­ca­tion asks us to im­ple­ment a pow­er­ful policy that “be­haves well,” but it is eas­ier than value al­ign­ment in two im­por­tant re­spects: we are given ac­cess to a slightly weaker policy, and our im­ple­men­ta­tion can be ex­tremely in­effi­cient. It may be that these ad­van­tages are not sig­nifi­cant ad­van­tages, but if so that would re­quire us to sig­nifi­cantly change our un­der­stand­ing of what the value al­ign­ment prob­lem is about.

Ca­pa­bil­ity am­plifi­ca­tion ap­pears to be less tractable than the other re­search prob­lems I’ve out­lined. I think it’s un­likely to be a good re­search di­rec­tion for ma­chine learn­ing re­searchers in­ter­ested in value al­ign­ment. But it may be a good topic for re­searchers with a philo­soph­i­cal fo­cus who are es­pe­cially in­ter­ested in at­tack­ing prob­lems that might oth­er­wise be ne­glected.

(This re­search was sup­ported as part of the Fu­ture of Life In­sti­tute FLI-RFP-AI1 pro­gram, grant #2015–143898.)

Ap­pendix: iter­at­ing amplification

Let H be the in­put-out­put be­hav­ior of a hu­man + all of the non-A tools at their dis­posal. Then an am­plifi­ca­tion pro­ce­dure defines A⁺ as a sim­ple com­pu­ta­tion that uses H and A as sub­rou­tines.

In particular, ∅⁺ is a computation that uses H as a subroutine. If we amplify again, we obtain ∅⁺⁺, which is a computation that uses H and ∅⁺ as subroutines. But since ∅⁺ is a simple computation that uses H as a subroutine, we can rewrite ∅⁺⁺ as a simple computation that uses only H as a subroutine.

We can go on in this way, reaching ∅⁺⁺⁺, ∅⁺⁺⁺⁺, and so on. By induction, all of these policies are defined by simple computations that use H as a subroutine. (Of course these “simple computations” are exponentially expensive, even though they are easy to specify. But they have a simple form and can be easily written down in terms of the amplification procedure.)

Un­der some sim­ple er­god­ic­ity as­sump­tions, this se­quence con­verges to a fixed point Ω (very similar to HCH). So a ca­pa­bil­ity am­plifi­ca­tion pro­ce­dure es­sen­tially uniquely defines an “op­ti­mal” policy Ω; this policy is un­com­putable, but has a con­cise rep­re­sen­ta­tion in terms of H.
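A sketch of this rewriting, under the simplifying assumption that each amplification step lets H ask A one subquestion; `amplify`, `nil_policy`, and `iterate` are illustrative names. Each iterate is a computation whose only non-trivial subroutine is H:

```python
def amplify(H, A):
    """One toy amplification step: the amplified policy consults H,
    delegates a single subquestion to A, then asks H for the answer."""
    def amplified(question):
        sub = H("subquestion for: " + question)
        hint = A(sub)
        return H("answer " + question + " given hint: " + hint)
    return amplified

def nil_policy(question):
    return "nil"

def iterate(H, n):
    """The iterates of the trivial policy under amplification; each is
    a simple computation built from H alone."""
    policy = nil_policy
    for _ in range(n):
        policy = amplify(H, policy)
    return policy
```

Expanding `iterate(H, n)` shows the exponential structure: each extra level of amplification replaces the inner call to A with another tree of calls to H.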

If there is any­thing that Ω can’t do, then we have found an un­reach­able policy. This per­spec­tive seems use­ful for iden­ti­fy­ing the hard part of the ca­pa­bil­ity am­plifi­ca­tion prob­lem.

Spec­i­fy­ing an am­plifi­ca­tion strat­egy also speci­fies a way to set up an in­ter­ac­tion be­tween a bunch of copies of H such that they im­ple­ment Ω. In­deed, de­sign­ing such an in­ter­ac­tion is eas­ier than de­sign­ing an am­plifi­ca­tion pro­ce­dure that con­verges to Ω. So if we can’t de­sign a pro­ce­dure for a bunch of copies of H to col­lab­o­ra­tively ac­com­plish some task T, then we also can’t de­sign an am­plifi­ca­tion pro­ce­dure that can reach poli­cies that can ac­com­plish T.

But in some cases this seems ex­tremely challeng­ing. For ex­am­ple, sup­pose that we would like to write a very good text­book about gen­eral rel­a­tivity. Re­call that H is the policy im­ple­mented by a hu­man over one day. To make the game hard, sup­pose that H has never en­coun­tered calcu­lus.

The most natural approaches require at least one copy of H to understand general relativity. But H only lasts a single day, and there may not be any sequence of observations that makes H understand general relativity within a day.

So in or­der for Ω to suc­ceed at this task, it needs to be able to write a text­book with­out any sin­gle copy of H un­der­stand­ing gen­eral rel­a­tivity. The un­der­stand­ing must in­stead be dis­tributed across a large num­ber of copies of H. Yet they must some­how col­lab­o­rate to write in­di­vi­d­ual sen­tences that cor­rectly com­mu­ni­cate the the­ory.

This sounds like a hard prob­lem.

Ap­pendix: ex­am­ples of ca­pa­bil­ity amplification

ETA: my cur­rent best pro­posal is meta-ex­e­cu­tion, which is es­sen­tially a for­mal­iza­tion of the “uni­ver­sal” scheme be­low.

Warmup I: Preprocessing

We in­stan­ti­ate a “leader” agent who uses the policy A. For each ob­ser­va­tion o[k]:

  • We instantiate a new agent using A. We give them the observation: “think about the observation {o[k]} and output a useful hint for a copy of yourself who has to respond to it.” Let h[k] be the agent’s action. Then we destroy the agent.

  • We give the leader the ob­ser­va­tion “The ob­ser­va­tion is {o[k]}, and a copy of you pre­pared the hint {h[k]}.” Let a[k] be its ac­tion.

We might hope that this com­pos­ite is smarter than A. After all, it has twice as long to think — even if the copy can’t come up with a very helpful hint, it doesn’t seem so hard to come up with a hint that is bet­ter than noth­ing.
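As a sketch, one preprocessing step might look like this, with A modeled as a plain function from messages to messages:

```python
def preprocess_step(A, observation):
    """Warmup I: a throwaway copy of A writes a hint, then the leader
    (also running A) responds to the observation with the hint attached."""
    hint = A("think about the observation {" + observation + "} and output"
             " a useful hint for a copy of yourself who has to respond to it.")
    # The throwaway agent is now discarded; only the hint survives.
    return A("The observation is {" + observation + "}, and a copy of you"
             " prepared the hint {" + hint + "}.")
```

The composite spends two invocations of A per observation, which is the sense in which it has “twice as long to think.”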

Warmup II: Ask­ing questions

We in­stan­ti­ate a “leader” agent who uses the policy A. We give each ob­ser­va­tion o[k] to the leader, and ask them to re­ply with an ac­tion a[k].

But we also al­low the leader to in­stan­ti­ate sub­agents and com­mu­ni­cate with those agents.

That is, if the leader takes an ac­tion of the spe­cial form “In­stan­ti­ate a new agent” then we cre­ate a new agent us­ing A, as­sign it a unique ID, and give the leader the ob­ser­va­tion: “A new agent has been cre­ated with ID [X].” Similarly, the leader can take the spe­cial ac­tion “Copy the agent with ID [X].”

If the leader takes an ac­tion of the spe­cial form “Ask [Q] to the agent with ID [X]” then we give Q as an ob­ser­va­tion to the agent with ID X. That agent will re­ply with some ac­tion A, and we give the leader the next ob­ser­va­tion: “The agent replied with [A].”

Even­tu­ally, the leader takes an ac­tion of the spe­cial form “Take ac­tion [A]”. Then we set a[k] = A, and move on to the next ob­ser­va­tion.

The scheme from the last sec­tion is es­sen­tially the spe­cial case where the leader in­stan­ti­ates a sin­gle sub­agent and asks the sin­gle ques­tion “What is a use­ful hint for re­spond­ing to ob­ser­va­tion o[k]?”
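A minimal interpreter for this scheme, with actions modeled as tagged tuples and a hypothetical policy `toy_A` that delegates once before acting. None of this is the literal protocol, just an executable sketch:

```python
def run_leader(A, observations):
    """Warmup II: a leader running A answers each observation, but may
    take special actions to instantiate subagents (also running A) and
    ask them questions before committing to an action."""
    agents = {}
    next_id = 0
    actions = []
    for obs in observations:
        msg = obs
        while True:
            act = A(msg)
            if act[0] == "instantiate":
                agents[next_id] = A
                msg = f"A new agent has been created with ID [{next_id}]."
                next_id += 1
            elif act[0] == "ask":
                _, agent_id, question = act
                reply = agents[agent_id](question)
                msg = f"The agent replied with [{reply}]."
            else:                          # ("take", a): commit action a[k]
                actions.append(act[1])
                break
    return actions

def toy_A(msg):
    """A hypothetical policy: spawn one subagent, ask it, then act."""
    if msg.startswith("obs:"):
        return ("instantiate",)
    if msg.startswith("A new agent"):
        return ("ask", 0, "any hints?")
    if msg.startswith("The agent replied"):
        return ("take", "acted after: " + msg)
    return "here is a hint"                # replying when asked as a subagent
```

The preprocessing scheme falls out as the special case where the leader spawns exactly one subagent and asks one question.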

Warmup III: Recursion

We im­prove the scheme from the last sec­tion by al­low­ing ev­ery agent to in­stan­ti­ate new sub­agents, rather than re­strict­ing this ca­pa­bil­ity to the leader.

The mes­sages be­tween agents can con­tain not only words, but poin­t­ers to other agents. This al­lows the agents to pass ar­bi­trar­ily large mes­sages. For ex­am­ple, they can build a data struc­ture out of agents, and just pass a “poin­ter” to the root of that data struc­ture.
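A sketch of pointer-passing, with “agents” reduced to plain nodes that each hold a small chunk of a message plus pointers to children; the names are illustrative:

```python
class Node:
    """A toy "data structure built out of agents": each node holds a
    small chunk of text and pointers to child nodes. Passing the root
    pointer passes an arbitrarily large message."""
    def __init__(self, chunk, children=()):
        self.chunk = chunk
        self.children = list(children)

def expand(pointer):
    """A recipient can expand a pointer on demand rather than receiving
    the full message up front; here we expand everything at once."""
    return pointer.chunk + "".join(expand(c) for c in pointer.children)
```

The point is that each individual message stays small even when the structure it points to is large.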

We can also slightly im­prove the scheme by al­low­ing the leader to copy it­self. This may be helpful if the leader has built up some use­ful con­text for the cur­rent situ­a­tion.

Univer­sal amplification

We im­prove the scheme fur­ther by giv­ing the agents a large ex­ter­nal mem­ory. Rather than di­rectly giv­ing the leader the ob­ser­va­tion o[k], we write the ob­ser­va­tion to the ex­ter­nal mem­ory and tell the leader where the ob­ser­va­tion is writ­ten.

In ad­di­tion to al­low­ing agents to in­stan­ti­ate new agents and com­mu­ni­cate with them, any agent can take the ac­tion “ex­e­cute [P]”, where P is a pro­gram to be run against the ex­ter­nal mem­ory; the agent sees the ob­ser­va­tion “the pro­gram re­turned [x]” where x is P’s re­turn value (or “noth­ing”).

Rather than hav­ing the leader di­rectly re­turn an ac­tion, the leader can take the spe­cial ac­tion “out­put the ac­tion re­turned by pro­gram [P].”

Mes­sages be­tween agents can now con­tain poin­t­ers to this ex­ter­nal mem­ory. For ex­am­ple, the leader could in­stan­ti­ate a sub­agent and ask it the ques­tion “Can you dis­t­in­guish [x] from an ar­ray of ran­dom bytes?” where x is a poin­ter to an ar­ray in ex­ter­nal mem­ory.

We can eas­ily gen­er­al­ize this setup to a par­allel model of com­pu­ta­tion. We can also re­place the shared mem­ory by a more nat­u­ral model for in­ter­pro­cess com­mu­ni­ca­tion.
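A sketch of the external memory, assuming a simple key-value store; in this toy, `write` stores a value and returns a short pointer (cheap to include in a message), and `execute` runs a program P against the store, mirroring the “execute [P]” action:

```python
def make_memory():
    """A toy shared external memory for universal amplification."""
    store = {}
    def write(value):
        key = "ptr%d" % len(store)   # a fresh pointer into the memory
        store[key] = value
        return key
    def execute(program):
        return program(store)        # run P against the external memory
    return write, execute
```

Agents can then pass around pointers like `ptr0` instead of the values themselves, and delegate bulk computation to programs rather than reading large objects observation by observation.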

Ap­pendix: knowl­edge about humans

Hu­man val­ues are com­plex. If you are only able to in­ter­act with a hu­man for a day, it may be com­pletely im­pos­si­ble to figure out what they value, no mat­ter how smart you are. Un­der­stand­ing what some­one val­ues may re­quire giv­ing them a large amount of time to re­flect on their val­ues, do­ing neu­ro­science, or car­ry­ing out other pro­cesses that take longer than a day.

This may im­ply an ob­struc­tion to ca­pa­bil­ity am­plifi­ca­tion — we can’t reach poli­cies that have more knowl­edge about hu­mans than can be ac­quired by in­ter­act­ing with H.

How­ever, even if this is a real ob­struc­tion, it does not seem to be an im­por­tant one, for the fol­low­ing rea­son.

Suppose that we are able to train a very good policy which does not reflect any complex facts about human values-upon-reflection. This optimal policy can still reflect many basic facts about human preferences:

  1. We don’t want any­thing ter­rible to hap­pen.

  2. We want to “stay in con­trol” of the agents we build.

  3. We don’t want our agent to get left be­hind by its com­peti­tors; it should fight as hard as it can to re­tain in­fluence over the world, sub­ject to #1 and #2.

More­over, all of these con­cepts are rel­a­tively easy to un­der­stand even if you have min­i­mal un­der­stand­ing of hu­man val­ues.

So an ex­cel­lent agent with a min­i­mal un­der­stand­ing of hu­man val­ues seems OK. Such an agent could avoid get­ting left be­hind by its com­peti­tors, and re­main un­der hu­man con­trol. Even­tu­ally, once it got enough in­for­ma­tion to un­der­stand hu­man val­ues (say, by in­ter­act­ing with hu­mans), it could help us im­ple­ment our val­ues.

In the worst case the agent would lack a nu­anced un­der­stand­ing of what we con­sider ter­rible, and so would have to ei­ther be es­pe­cially con­ser­va­tive or else risk do­ing ter­rible things in the short term. In the scheme of things, this is not a catas­trophic prob­lem.

Ap­pendix: an ex­am­ple obstruction

Suppose that my brain encodes a random function f: {0, 1}* → {0, 1} in the following sense: you can give me a sequence of bits, one per second, and then I can tell you the value of f on that sequence. There is no way to evaluate f other than to ask me.

Let N be the length of our ca­pa­bil­ity am­plifi­ca­tion pro­ce­dure, in sec­onds.

Let 𝓛 ⊆ 𝒜 be the set of policies that can be implemented using an oracle for f, restricted to inputs of length N.

Then it’s easy to see that 𝓛 forms an ob­struc­tion:

  • We can simu­late ac­cess to any policy in 𝓛 us­ing an or­a­cle for f re­stricted to in­puts of length N. And we can simu­late my role in the am­plifi­ca­tion pro­ce­dure us­ing an or­a­cle for f re­stricted to in­puts of length N. So poli­cies in 𝓛 can only be am­plified to other poli­cies in 𝓛.

  • We cannot evaluate f on even a single input of length N+1 using an oracle for f on inputs of length N. Most interesting classes 𝒜 will contain some policies not in 𝓛.
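A sketch of the setup, with the random function f replaced by a fixed pseudorandom function (an assumption made purely so the example is reproducible) and the length restriction enforced by the oracle:

```python
import hashlib

def make_f(seed):
    """A fixed pseudorandom stand-in for f: {0,1}* -> {0,1}."""
    def f(bits):
        data = (str(seed) + "".join(str(b) for b in bits)).encode()
        return hashlib.sha256(data).digest()[0] % 2
    return f

def restricted_oracle(f, max_len):
    """An oracle for f on inputs of length at most N = max_len. Policies
    implementable with this oracle form the low part 𝓛 of the partition;
    nothing they do reveals f on longer inputs."""
    def oracle(bits):
        if len(bits) > max_len:
            raise ValueError("query longer than N")
        return f(bits)
    return oracle
```

Any amplification procedure running for N seconds can be simulated given the restricted oracle, so it can never produce a policy that knows f on longer inputs.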

Whether this is a real ob­struc­tion de­pends on what the in­for­ma­tion is about:

  • If it’s just ran­dom bits, then we don’t care at all — any other ran­dom bits would be “just as good.”

  • If the ran­dom func­tion en­codes im­por­tant in­for­ma­tion about my val­ues, then we are in the situ­a­tion de­scribed in the pre­vi­ous sec­tion, which doesn’t seem so bad.

  • The worst case is when the func­tion f en­codes im­por­tant in­for­ma­tion about how to be­have effec­tively. For ex­am­ple, it en­codes in­for­ma­tion about how to make ac­cu­rate pre­dic­tions. In this case we may ac­tu­ally be in trou­ble, since a policy that doesn’t know f may be out­com­peted by one which does.

This was origi­nally posted here on 2nd Oc­to­ber 2016.

The next post in this se­quence will be ‘Learn­ing with catas­tro­phes’ by Paul Chris­ti­ano.
