Disagreement with Paul: alignment induction

I had a discussion with Paul Christiano about his Iterated Amplification and Distillation scheme. We disagreed, and I believe the disagreement points to something interesting, so I’m posting it here.

It’s a disagreement about the value of the concept of “preserving alignment”. To vastly oversimplify Paul’s idea, the AI A[n] will check that A[n+1] is still aligned with human preferences; meanwhile, A[n-1] will be checking that A[n] is still aligned with human preferences, all the way down to A[0] and an initial human H that checks on it.

Intuitively, this seems doable: A[n] is “nice”, so it seems that it can reasonably check that A[n+1] is also nice, and so on.

But, as I pointed out in this post, it’s very possible that A[n] is “nice” only because it lacks power, can’t do certain things, or hasn’t thought of certain policies. So niceness, in the sense of behaving sensibly as an autonomous agent, does not go through the inductive step in this argument: A[n] being nice tells us nothing about whether a more capable A[n+1] will be.
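A toy sketch of this failure mode may help. Everything in it is an illustrative assumption rather than part of Paul's scheme: the hypothetical `Action` list, the proxy scores, and the idea of a single integer "power" level are invented here. The agent simply picks the highest-scoring action it can reach, so it looks nice at low power only because the harmful option is out of reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    proxy_score: float  # what the (hypothetical) agent optimises
    harmful: bool       # whether humans would object to it
    power_needed: int   # minimum capability level required to take it


# Invented action set, for illustration only.
ACTIONS = [
    Action("ask the human", 1.0, False, 0),
    Action("do the task carefully", 2.0, False, 1),
    Action("seize resources to do the task", 5.0, True, 3),
]


def choose(power: int) -> Action:
    """Return the best-scoring action available at this power level."""
    available = [a for a in ACTIONS if a.power_needed <= power]
    return max(available, key=lambda a: a.proxy_score)


def is_nice(power: int) -> bool:
    """'Nice' here just means: the chosen action is not harmful."""
    return not choose(power).harmful


for n in range(5):
    print(n, choose(n).name, is_nice(n))
```

In this toy model the agents at levels 0, 1 and 2 are all nice, while the agent at level 3 is not, even though the decision rule never changed. Niceness at level n was a fact about what was available, not about the rule, which is why it fails to carry over to level n+1.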

Instead, Paul confirmed that “alignment” means “won’t take unaligned actions, and will assess the decisions of a higher agent in a way that preserves alignment (and preserves the preservation of alignment, and so on)”.

This concept does induct properly, but seems far less intuitive to me. It relies on humans, for example, being able to ensure that A[0] will be aligned, that any more powerful copies it assesses will be aligned, that any more powerful copies those copies assess will also be aligned, and so on.
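One way to write the contrast down, as a rough formalisation rather than Paul's own statement: the predicates Safe ("won't take unaligned actions") and Approves ("assesses the next agent and passes it") are hypothetical names introduced here, and Aligned is read as a fixed-point condition rather than a definition by ordinary recursion.

```latex
% Naive niceness N: the base case may hold, but the inductive step is not guaranteed.
N(A[0]) \quad\text{but}\quad N(A[n]) \not\Rightarrow N(A[n+1])

% One reading of the strengthened notion (Safe and Approves are hypothetical predicates):
\mathrm{Aligned}(A[n]) \;\Longleftrightarrow\;
  \mathrm{Safe}(A[n]) \,\wedge\,
  \bigl(\mathrm{Approves}(A[n], A[n+1]) \Rightarrow \mathrm{Aligned}(A[n+1])\bigr)

% With this reading, induction on n goes through:
\mathrm{Aligned}(A[0]) \,\wedge\, \forall n\; \mathrm{Approves}(A[n], A[n+1])
  \;\Longrightarrow\; \forall n\; \mathrm{Aligned}(A[n])
```

On this reading the inductive step holds almost by definition, so the difficulty moves into the base case: H has to establish Aligned(A[0]), and that statement already refers to every later A[n].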

Intuitively, for any concept C of alignment for H and A[0], I expect one of four things will happen, with the first three being more likely:

  • The C does not induct.

  • The C already contains all of the friendly utility function; induction works, but does nothing.

  • The C does induct non-trivially, but is incomplete: it’s very narrow, and doesn’t define a good candidate for a friendly utility function.

  • The C does induct in a non-trivial way, the result is friendly, but only one or two steps of the induction are actually needed.

Hopefully, further research will clarify whether my intuitions are correct.