AlphaGo Zero and capability amplification

AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy.

How AlphaGo Zero works

AlphaGo Zero learns two functions (which take as input the current board):

  • A prior over moves p is trained to predict what AlphaGo will eventually decide to do.

  • A value function v is trained to predict which player will win (if AlphaGo plays both sides).

Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by using 1600 steps of Monte Carlo tree search (MCTS), with p and v guiding the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target.
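To make the structure concrete, here is a rough sketch of that loop in code. It is only a sketch under assumed interfaces: the network, the search routine, the game object, and the update step are hypothetical stand-ins, not DeepMind’s implementation.

    # Schematic self-play / training round in the spirit of the description above.
    # `net`, `mcts_search`, `new_game`, and `update` are hypothetical stand-ins
    # supplied by the caller; only the overall shape of the loop is the point.
    def alphago_zero_round(net, mcts_search, new_game, update, num_games=25):
        examples = []
        for _ in range(num_games):
            game, history = new_game(), []
            while not game.is_over():
                pi = mcts_search(net, game, num_steps=1600)  # search guided by p and v
                history.append((game.observation(), pi))
                game.play(pi.sample())
            z = game.winner()  # outcome of the self-play game
            examples += [(obs, pi, z) for obs, pi in history]
        # Supervised learning: train p towards the search's move distribution pi,
        # and v towards the observed outcome z.
        update(net, examples)
        return net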

Iterated capability amplification

In the simplest form of iterated capability amplification, we train one function:

  • A “weak” policy A, which is trained to predict what the agent will eventually decide to do in a given situation.

Just like AlphaGo doesn’t use the prior p directly to pick moves, we don’t use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target.
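Stripped of the Go-specific details, the same structure is a short loop. In the sketch below, amplify and distill are placeholders for whatever amplification scheme and training procedure we choose: amplify calls A many times to reach better decisions, and distill trains A to predict those decisions directly.

    # Minimal sketch of iterated capability amplification. `amplify` makes many
    # calls to the current weak policy A to produce more intelligent judgments;
    # `distill` trains A to predict those judgments directly. Both are passed in,
    # since the scheme is agnostic about what they are (MCTS, decomposition, ...).
    def iterated_amplification(A, amplify, distill, num_rounds):
        for _ in range(num_rounds):
            amplified_policy = amplify(A)     # expensive: many calls to A
            A = distill(A, amplified_policy)  # cheap: A chases the moving target
        return A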

In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.)

Outside of Go, A might be a question-answering system, which can be applied several times in order to first break a question down into pieces and then separately answer each component. Or it might be a policy that updates a cognitive workspace, which can be applied many times in order to “think longer” about an issue.
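For the question-answering version, the amplification step might look roughly like the recursive sketch below. The decompose, answer, and combine methods are hypothetical, there only to illustrate calling A several times on a single question.

    # Hypothetical amplification-by-decomposition for a question-answering policy A.
    # A is consulted several times: once to split the question into sub-questions,
    # once per sub-question, and once to combine the partial answers.
    def amplified_answer(A, question, depth=2):
        if depth == 0:
            return A.answer(question)
        sub_questions = A.decompose(question)
        sub_answers = [amplified_answer(A, q, depth - 1) for q in sub_questions]
        return A.combine(question, list(zip(sub_questions, sub_answers)))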

The significance

Reinforcement learners take a reward function and optimize it; unfortunately, it’s not clear where to get a reward function that faithfully tracks what we care about. That’s a key source of safety concerns.

By contrast, AlphaGo Zero takes a policy improvement operator (like MCTS) and converges towards a fixed point of that operator. If we can find a way to improve a policy while preserving its alignment, then we can apply the same algorithm in order to get very powerful but aligned strategies.
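In symbols (with Amplify standing for the policy improvement operator and Distill for the supervised training step; the operator names here are illustrative, not from the original), each round is roughly

    A_{t+1} = Distill(Amplify(A_t))

and, if training converges, the limit A* approximately satisfies A* = Distill(Amplify(A*)), i.e. it is a fixed point of the combined operator.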

Using MCTS to achieve a simple goal in the real world wouldn’t preserve alignment, so it doesn’t fit the bill. But “think longer” might. As long as we start with a policy that is close enough to being aligned — a policy that “wants” to be aligned, in some sense — allowing it to think longer may make it both smarter and more aligned.

I think designing alignment-preserving policy amplification is a tractable problem today, which can be studied in the context of either existing ML or human coordination. So I think it’s an exciting direction in AI alignment. A candidate solution could be incorporated directly into the AlphaGo Zero architecture, so we can already get empirical feedback on what works. If by good fortune powerful AI systems look like AlphaGo Zero, then that might get us much of the way to an aligned AI.


This was originally posted here on 19th October 2017.

Tomorrow’s AI Alignment Forum sequences will continue with a pair of posts, ‘What is narrow value learning’ by Rohin Shah and ‘Ambitious vs. narrow value learning’ by Paul Christiano, from the sequence on Value Learning.

The next post in this sequence will be ‘Directions for AI Alignment’ by Paul Christiano on Thursday.