[Question] What are the differences between all the iterative/recursive approaches to AI alignment?

I have been trying to understand all the iterative/recursive approaches to AI alignment. The approaches I am aware of are:

  • ALBA (I get the vague impression that this has been superseded by iterated amplification, so I haven’t looked into it)

  • HCH (weak and strong)

  • Iterated amplification (IDA)

  • Debate

  • Meta-execution

  • (Recursive) reward modeling

  • Factored cognition

  • Factored evaluation

(I think that some of these, like HCH, aren’t strictly speaking approaches to AI alignment, but they are still iterative/recursive things discussed in the context of AI alignment, so I want to better understand them.)

One way of phrasing what I am trying to do is to come up with a “minimal set” of parameters/dimensions along which to compare these different approaches, so that I can take a basic template and then set the parameters to obtain each of the above approaches as an instance (I sketch one possible version of this template in code near the end of this post).

Here are the parameters/dimensions that I have come up with so far:

  • capability of agents: I think in HCH, the agents are human-level. In the other approaches, my understanding is that the capability of the agents increases as more rounds of amplification/distillation take place (see the loop sketch just after this list).

  • allowed communication: It seems like weak and strong HCH differ in the kind of communication allowed between the assistants (with strong HCH allowing more flexible communication). Within IDA, there is low-bandwidth vs. high-bandwidth oversight, which seems like a similar parameter. I’m not sure what the other approaches allow.

  • training method during the distillation step: I think IDA leaves the training method flexible. According to this post, factored cognition seems to use imitation learning and factored evaluation seems to use reinforcement learning. I think recursive reward modeling also uses reinforcement learning. HCH seems to be just the amplification step (?), so no training method is involved. I’m not sure about the others.

  • entity who “splits the questions”, coordinates everything during amplification, or selects the branches: In factored cognition, factored evaluation, IDA, and HCH, it seems like the human splits the questions. In Debate, the branches are chosen by the two AIs in the debate (who are in an adversarial relationship).

  • entity who does the evaluation/gives feedback (“the overseer”): It seems like in factored evaluation, the human gives feedback. In Debate, the final judgment is provided by the human. My understanding is that in IDA, the nature of the overseer is flexible (“For example, Arthur could advise Hugh on how to define a better overseer; Arthur could offer advice in real-time to help Hugh be a better overseer; or Arthur could directly act as an overseer for his more powerful successor”).

  • what the overseer does (i.e. what kind of feedback is provided): I think the overseer can be passive/active depending on the distillation method (see my comment here), so maybe this parameter isn’t required in a “minimal set”.

  • amount of human feedback required per round: In Debate, there is a single piece of feedback (the judgment) at the end of a debate. In factored evaluation, it seems like the human must provide feedback at each node in the question tree (or a separate human at each node).

  • depth of recursion: It seems like IDA limits the depth of the recursion to one step, whereas the other approaches seem to allow arbitrary depth (see my comment here).

  • separation of task performance vs. evaluation/oversight: It seems like in factored evaluation, there is an entity who does the task itself (the experts at the bottom of this diagram) and a separate entity who evaluates the work of those experts (the “factored evaluation” box in the same diagram), but in factored cognition, there is just the entity doing the task.
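To make the amplification/distillation rounds in the “capability of agents” bullet concrete, here is a minimal sketch of the iterated amplification training loop as I understand it. The names (`amplify`, `distill`, `Agent`) are placeholders I made up for illustration, not from any actual implementation, and the loop itself is just my reading of the IDA posts, so corrections are welcome.

```python
from typing import Callable

# An agent maps a question to an answer.
Agent = Callable[[str], str]

def iterated_amplification(
    human: Agent,
    amplify: Callable[[Agent, Agent], Agent],
    distill: Callable[[Agent], Agent],
    n_rounds: int,
) -> Agent:
    """Sketch of the IDA loop, assuming:
      - amplify(human, agent) returns a slow "overseer" that answers
        questions by having the human decompose them and delegate
        subquestions to copies of the current agent;
      - distill(overseer) trains a fast model to approximate the
        overseer (the training method is left flexible, as noted above).
    """
    agent = human  # round 0: the "agent" is just the unassisted human
    for _ in range(n_rounds):
        overseer = amplify(human, agent)  # human assisted by current agent
        agent = distill(overseer)         # train a faster model to imitate it
    return agent
```

If this picture is right, the “capability of agents” parameter is roughly a function of `n_rounds`, while “depth of recursion” is about how deeply `amplify` lets the human delegate within a single round.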

I would appreciate hearing about more parameters/dimensions that I have missed, and also any help filling in some of the values for the parameters (including corrections to any of my speculations above).
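To make the “basic template” idea from the top of the post concrete, here is one possible encoding: each parameter/dimension becomes a field, and each approach becomes an instance. The field names and the two example instances just transcribe my speculations above (with `None` marking values I don’t know), so treat this as a sketch of the shape of the comparison rather than settled facts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Approach:
    """One row of the hoped-for comparison table; None = unknown."""
    name: str
    agent_capability: Optional[str] = None       # e.g. "human-level", "grows each round"
    allowed_communication: Optional[str] = None  # e.g. "low bandwidth", "high bandwidth"
    distillation_method: Optional[str] = None    # e.g. "imitation learning", "RL", "none"
    question_splitter: Optional[str] = None      # who decomposes questions / picks branches
    overseer: Optional[str] = None               # who evaluates / gives feedback
    feedback_per_round: Optional[str] = None     # how much human feedback each round needs
    recursion_depth: Optional[str] = None        # e.g. "one step", "arbitrary"
    separate_evaluator: Optional[bool] = None    # is oversight split from task performance?

# Two example instances, filled in from the speculations above.
debate = Approach(
    name="Debate",
    question_splitter="the two adversarial AIs",
    overseer="human judge",
    feedback_per_round="one judgment at the end of a debate",
)

factored_evaluation = Approach(
    name="Factored evaluation",
    distillation_method="reinforcement learning",
    question_splitter="human",
    overseer="human",
    feedback_per_round="feedback at each node of the question tree",
    separate_evaluator=True,
)
```

Filling in the remaining `None` values (or telling me which fields are wrong or missing) is exactly the kind of help I am hoping for; a filled-in set of instances would translate directly into the table described below.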

Ideally, there would be a table with the parameters as columns and each of the approaches as rows (possibly transposed), like the table in this post. I would be willing to produce such a write-up, assuming I am able to fill in enough of the values that it becomes useful.

If anyone thinks this kind of comparison is useless/framed in the wrong way, please let me know. (I would also want to know what the correct framing is!)