Disagreement with Paul: alignment induction

I had a discussion with Paul Christiano about his Iterated Amplification and Distillation scheme. We disagreed, and I believe the disagreement points to something interesting, so I’m posting it here.

It’s a disagreement about the value of the concept of “preserving alignment”. To vastly oversimplify Paul’s idea, the AI A[n] will check that A[n+1] is still aligned with human preferences; meanwhile, A[n-1] will be checking that A[n] is still aligned with human preferences, and so on all the way down to A[0] and an initial human H who checks on it.

Intuitively, this seems doable: A[n] is “nice”, so it seems that it can reasonably check that A[n+1] is also nice, and so on.

But, as I pointed out in this post, it’s very possible that A[n] is “nice” only because it lacks power, can’t do certain things, or hasn’t thought of certain policies. So niceness, in the sense of behaving sensibly as an autonomous agent, does not go through the inductive step in this argument.

Instead, Paul confirmed that “alignment” means “won’t take unaligned actions, and will assess the decisions of a higher agent in a way that preserves alignment (and preserves the preservation of alignment, and so on)”.

This concept does induct properly, but seems far less intuitive to me. It relies on humans, for example, being able to ensure that A[0] will be aligned, that any more powerful copies it assesses will be aligned, that any more powerful copies those copies assess are also aligned, and so on.
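The contrast between the two notions can be sketched in symbols. This is only a rough formalization under stated assumptions: the predicates Nice, Safe, Approves and Al are illustrative notation introduced here, not taken from Paul’s own write-ups. Nice stands for “behaves sensibly as an autonomous agent”, Safe for “won’t take unaligned actions”, and Approves(A[n], A[n+1]) for “A[n] assesses A[n+1] and signs off on it”.

```latex
\begin{align*}
\text{Naive step (may fail):}\quad
  & \mathrm{Nice}(A[n]) \;\Rightarrow\; \mathrm{Nice}(A[n+1]) \\
\text{Recursive definition:}\quad
  & \mathrm{Al}(A[n]) \;\iff\; \mathrm{Safe}(A[n]) \,\wedge\,
    \bigl(\mathrm{Approves}(A[n], A[n+1]) \Rightarrow \mathrm{Al}(A[n+1])\bigr) \\
\text{Unfolded base case:}\quad
  & \mathrm{Al}(A[0]) \;\Rightarrow\; \mathrm{Safe}(A[k])
    \ \text{for every } k \text{ reachable through a chain of approvals}
\end{align*}
```

Read this way, the inductive step holds by construction, so the whole burden falls on the base case: for H to establish Al(A[0]), H must in effect vouch for every later link in the chain.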

Intuitively, for any concept C of alignment for H and A[0], I expect one of four things will happen, with the first three being more likely:

  • C does not induct.

  • C already contains all of the friendly utility function; induction works, but does nothing.

  • C does induct non-trivially, but is incomplete: it’s very narrow, and doesn’t define a good candidate for a friendly utility function.

  • C does induct in a non-trivial way and the result is friendly, but only one or two steps of the induction are actually needed.

Hopefully, further research will clarify whether my intuitions are correct.