A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment

Iterated Distillation and Amplification has often been compared to AlphaGo Zero. In the case of a Go-playing AI, however, people often don’t see a distinction between alignment and capability, instead measuring success on a single axis: the ability to win games of Go. I think a useful analogy can still be drawn that separates out the ways in which a Go-playing AI is aligned from the ways in which it is capable.

A well-aligned but not very capable Go-playing agent would be one that is best modeled as trying to win at Go, as opposed to trying to optimize some other feature of the sequence of board positions, but that still does not win against moderately skilled opponents. A very capable but poorly aligned Go-playing agent would be one that is very good at causing the Go games it plays to have certain properties other than the agent winning. One way to create such an agent would be to take a value network that rates board positions somewhat arbitrarily, and then use Monte Carlo tree search to find policies that cause the game to progress through board states that are rated highly by the value network.
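To make the distinction concrete, here is a minimal sketch of a "capable but misaligned" toy agent of this kind. Everything in it is hypothetical and heavily simplified: the flat toy board, the `arbitrary_value` function, and the depth-limited search are stand-ins, not anything a real Go engine uses. The point that survives the simplification is that the search machinery is competent while the objective it serves has nothing to do with winning.

```python
# A toy "capable but misaligned" agent: the search is competent, but the
# value function it optimizes rewards an arbitrary board feature rather
# than winning. Everything here is hypothetical and heavily simplified.

from typing import List, Tuple

Board = Tuple[int, ...]  # flat toy board: 0 = empty, 1 / -1 = the two players


def legal_moves(board: Board) -> List[int]:
    return [i for i, v in enumerate(board) if v == 0]


def play(board: Board, move: int, player: int) -> Board:
    b = list(board)
    b[move] = player
    return tuple(b)


def arbitrary_value(board: Board) -> float:
    """Stand-in for a value network with mistaken ratings: it scores
    positions by how many stones sit on even-indexed squares, a property
    unrelated to winning."""
    return sum(1.0 for i, v in enumerate(board) if v != 0 and i % 2 == 0)


def search(board: Board, player: int, depth: int) -> float:
    """Depth-limited search maximizing arbitrary_value. (For brevity this
    ignores adversarial play; the point is only that competent search is
    being spent on the wrong objective.)"""
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return arbitrary_value(board)
    return max(search(play(board, m, player), -player, depth - 1) for m in moves)


def choose_move(board: Board, player: int, depth: int = 2) -> int:
    return max(legal_moves(board),
               key=lambda m: search(play(board, m, player), -player, depth - 1))


if __name__ == "__main__":
    empty_board: Board = (0,) * 9
    print("move chosen by the misaligned agent:", choose_move(empty_board, 1))
```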

To a certain extent, this could actually happen to something like AlphaGo Zero. If some bad board positions are mistakenly assigned high values, AlphaGo Zero will use Monte Carlo tree search to find ways of getting to those board positions (and similarly to avoid good board positions that are mistakenly assigned low values). But it will simultaneously correct the mistaken values of those board positions as it notices that its predictions of the values of the board positions that follow from them are biased. Thus the Monte Carlo tree search stage increases both capability and alignment. Meanwhile, the policy network retraining stage should be expected to decrease both capability and alignment, since it merely approximates the results of the Monte Carlo tree search. The fact that this works shows that the extent to which Monte Carlo tree search increases alignment and capability exceeds the extent to which policy network retraining decreases them.
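The self-correcting dynamic can be illustrated with a toy numerical example. Here a hand-written lookup table stands in for the value network and a one-step lookahead stands in for Monte Carlo tree search and observed game outcomes; none of this is AlphaGo Zero's actual training code, and all the numbers are made up. A position that is mistakenly rated high gets pulled back toward the values of the positions that follow it.

```python
# Toy illustration of the self-correction: a lookup table stands in for the
# value network, and a one-step lookahead stands in for Monte Carlo tree
# search / observed game outcomes. All numbers here are made up.

value_table = {"A": 0.9, "B": 0.2, "C": 0.1}  # position "A" is mistakenly rated high
successors = {"A": ["B", "C"]}                # positions reachable from "A"

LEARNING_RATE = 0.5


def lookahead_value(pos: str) -> float:
    """Estimate a position's value from its successors' values; this is the
    signal that exposes the bias in the stored value."""
    succ = successors.get(pos)
    return value_table[pos] if not succ else max(value_table[s] for s in succ)


def correct(pos: str) -> None:
    """Pull the stored value toward the lookahead estimate."""
    value_table[pos] += LEARNING_RATE * (lookahead_value(pos) - value_table[pos])


for step in range(5):
    correct("A")
    print(f"after step {step + 1}: value of A = {value_table['A']:.3f}")
# The mistakenly high value of "A" converges toward 0.2, the best value its
# successors actually support.
```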

It seems to me that this is the model that Iterated Distillation and Amplification should be trying to emulate. The amplification stage will both increase capability, by increasing the available computing power, and increase alignment, because of the human overseer. The distillation stage will decrease both capability and alignment by imperfectly approximating the amplified agent. The important thing is not to drive the extent to which distillation decreases alignment down to zero (which may be impossible), but to ensure that the distillation stage does not cause alignment problems that will not be corrected in the subsequent amplification stage.
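For concreteness, here is a minimal sketch of a single amplify-then-distill round, with hypothetical interfaces: `amplify`, `distill`, `decompose`, and `recombine` are made up for illustration and are not taken from any existing IDA implementation. Amplification lets a weak base model answer questions it could not answer on its own, while distillation approximates the amplified agent imperfectly, which is where the capability and alignment losses described above can creep in.

```python
# A minimal amplify/distill round with hypothetical interfaces (none of these
# names come from an existing IDA codebase). Amplification lets a weak base
# model answer harder questions by decomposing them; distillation approximates
# the amplified agent imperfectly.

from typing import Callable, Dict, List

Agent = Callable[[str], str]


def base_model(question: str) -> str:
    # Toy capability limit: the base model can only handle single words.
    return question.upper() if " " not in question else "too hard"


def decompose(question: str) -> List[str]:
    # Stand-in for the human overseer breaking a question into subquestions.
    return question.split()


def recombine(sub_answers: List[str]) -> str:
    # Stand-in for the overseer assembling subanswers into an answer.
    return " ".join(sub_answers)


def amplify(model: Agent) -> Agent:
    """Amplified agent: overseer-guided decomposition on top of many calls
    to the current model (more compute, plus overseer judgment)."""
    return lambda question: recombine([model(q) for q in decompose(question)])


def distill(amplified: Agent, training_questions: List[str]) -> Agent:
    """Distilled agent: a lookup table trained on the amplified agent's
    answers, standing in for supervised learning. Questions outside the
    training set expose the approximation error distillation introduces."""
    answers: Dict[str, str] = {q: amplified(q) for q in training_questions}
    return lambda question: answers.get(question, "unknown")


amplified_agent = amplify(base_model)
distilled_agent = distill(amplified_agent, ["hello world", "align the agent"])

print(base_model("align the agent"))       # "too hard": base model lacks capability
print(amplified_agent("align the agent"))  # "ALIGN THE AGENT": amplification helps
print(distilled_agent("align the agent"))  # matches the amplified agent on training data
print(distilled_agent("anything else"))    # "unknown": distillation's approximation error
```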