A comment on the IDA-AlphaGo Zero metaphor: capabilities versus alignment

Iterated Distillation and Amplification has often been compared to AlphaGo Zero. In the case of a Go-playing AI, however, people often don't see a distinction between alignment and capability, instead measuring success on a single axis: ability to win games of Go. I think a useful analogy can still be drawn, though, one that separates out the ways in which a Go-playing AI is aligned from the ways in which it is capable.

A well-aligned but not very capable Go-playing agent would be one that is best modeled as trying to win at Go, as opposed to trying to optimize some other feature of the sequence of board positions, but that still does not win against moderately skilled opponents. A very capable but poorly aligned Go-playing agent would be one that is very good at causing the Go games it plays to have certain properties other than the agent winning. One way to create such an agent would be to take a value network that rates board positions somewhat arbitrarily, and then use Monte Carlo tree search to find policies that cause the game to progress through board states that the value network rates highly.
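That construction can be sketched in miniature. Nothing below is real Go: the subtraction game, the `arbitrary_value` function, and the depth-limited exhaustive search (a crude stand-in for Monte Carlo tree search) are all invented for illustration. The point is the structure: plug an arbitrary state-rating into a capable search, and you get an agent that competently steers play toward highly rated states with no regard for winning.

```python
# Toy stand-in for the construction above (not real Go): a state is a pile
# of stones, a move removes 1-3 stones, and the game ends at 0. A
# winning-oriented agent would care who takes the last stone; this one doesn't.
def moves(state):
    return [state - k for k in (1, 2, 3) if state - k >= 0]

# An arbitrary "value network": it happens to rate multiples of 5 highly,
# which has nothing to do with winning.
def arbitrary_value(state):
    return 1.0 if state % 5 == 0 else 0.0

# Depth-limited search (a crude stand-in for MCTS): pick the move whose
# subtree accumulates the most value under arbitrary_value. The agent is
# capable at reaching highly rated states, but misaligned with winning.
def search(state, depth=4):
    if depth == 0 or not moves(state):
        return arbitrary_value(state), None
    scored = [(search(nxt, depth - 1)[0], nxt) for nxt in moves(state)]
    best_score, best_move = max(scored)
    return arbitrary_value(state) + best_score, best_move

state, trajectory = 17, [17]
while moves(state):
    _, state = search(state)
    trajectory.append(state)
print(trajectory)  # play passes through the highly rated states 15, 10, 5
```

The search here is deliberately blind to who wins the underlying game; its competence shows up only as reliably routing play through states the arbitrary value function favors.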

To a certain extent, this could actually happen to something like AlphaGo Zero. If some bad board positions are mistakenly assigned high values, AlphaGo Zero will use Monte Carlo tree search to find ways of reaching those board positions (and similarly, ways of avoiding good board positions that are mistakenly assigned low values). But it will simultaneously be correcting the mistaken values of those board positions, as it notices that its predictions of the values of the board positions that follow from them are biased. Thus the Monte Carlo tree search stage increases both capability and alignment. Meanwhile, the policy-network retraining stage should be expected to decrease both capability and alignment, since it merely approximates the results of the Monte Carlo tree search. The fact that AlphaGo Zero works shows that the extent to which Monte Carlo tree search increases alignment and capability exceeds the extent to which policy-network retraining decreases them.
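The self-correction dynamic can be illustrated with a toy tabular example, which is my own invention (bootstrapped one-step lookahead on a line of states, not AlphaGo Zero's actual training loop). One state is mistakenly assigned a high value; because that value is inconsistent with the values of the states that follow it, repeatedly retraining toward search-derived targets drives the mistake out:

```python
# Toy tabular sketch (not AlphaGo Zero itself): states 0..N on a line,
# moves go one step left or right, and only reaching state N wins.
N = 10
values = {s: 0.0 for s in range(N + 1)}
values[N] = 1.0  # true terminal value: state N is a win
values[3] = 0.9  # mistake: state 3 is rated highly despite leading nowhere

def lookahead_target(state):
    """One-step search target: bootstrap from the best successor."""
    if state == N:
        return 1.0
    successors = [t for t in (state - 1, state + 1) if 0 <= t <= N]
    return 0.9 * max(values[t] for t in successors)  # 0.9 = discount factor

# Repeatedly retrain the values toward the search targets. State 3's
# successors cannot support a value of 0.9, so the targets pull it down
# toward the true discounted distance-to-win, 0.9 ** (N - 3) ~= 0.478.
for _ in range(200):
    for s in range(N + 1):
        values[s] += 0.5 * (lookahead_target(s) - values[s])

print(round(values[3], 3))
```

This mirrors the mechanism in the paragraph above: the search stage both exploits the mistaken value (play is steered toward state 3) and, by comparing it against predictions for subsequent states, supplies the training signal that corrects it.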

It seems to me that this is the model that Iterated Distillation and Amplification should be trying to emulate. The amplification stage both increases capability, by increasing the available computing power, and increases alignment, because of the human overseer. The distillation stage decreases both capability and alignment by imperfectly approximating the amplified agent. The important thing is not to drive the extent to which distillation decreases alignment down to zero (which may be impossible), but to ensure that the distillation stage does not cause alignment problems that will not be corrected in the subsequent amplification stage.
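As a purely numeric cartoon of this condition (every constant below is made up), suppose each amplification step adds a fixed gain to both capability and alignment, and each distillation step retains only a fraction of both. The loop makes net progress exactly when the per-round gain outweighs the approximation loss:

```python
# Cartoon of one IDA round: amplification adds a gain to both axes,
# distillation scales both down by an approximation loss. All constants
# are invented for illustration.
def ida_round(capability, alignment, amp_gain=0.3, distill_retention=0.9):
    capability = (capability + amp_gain) * distill_retention
    alignment = (alignment + amp_gain) * distill_retention
    return capability, alignment

c, a = 0.2, 0.2  # weak initial agent
history = [(c, a)]
for _ in range(20):
    c, a = ida_round(c, a)
    history.append((c, a))

# Because the amplification gain exceeds the 10% distillation loss at these
# scores, both climb toward the fixed point x = (x + 0.3) * 0.9, i.e. 2.7,
# instead of decaying to zero.
print(round(c, 2), round(a, 2))
```

The cartoon also reflects the closing point: distillation's loss is never driven to zero here, yet the scheme succeeds because each amplification step more than recovers what the preceding distillation step gave up.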