HCH is not just Mechanical Turk

HCH, introduced in Humans consulting HCH, is a computational model in which a human answers questions using questions answered by another human, who can call other humans, who can call other humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more subquestions of other humans, and returning an answer based on the answers to those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question “is Iterated Amplification safe” into “is HCH safe” and “if HCH is safe, will Iterated Amplification approximate the behaviour of HCH in a way that is also safe”.
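As a rough illustration of this recursive structure, here is a minimal sketch in Python. The names (`hch_answer`, `human`, `decompose`, `respond`) are hypothetical and purely illustrative; in the actual model each node is a real human, and in the infinite-compute limit the recursion is unbounded.

```python
from typing import List


def hch_answer(question: str, human) -> str:
    """One HCH node: a human reads a question, optionally asks subquestions
    (each answered by another copy of the same procedure), and combines the
    subanswers into an answer.

    `human` is a hypothetical stand-in for a person, assumed to provide:
      decompose(question) -> list of subquestions (possibly empty)
      respond(question, subanswers) -> answer
    """
    subquestions: List[str] = human.decompose(question)
    # Each subquestion is handed to another human running the same procedure.
    subanswers = [hch_answer(sq, human) for sq in subquestions]
    return human.respond(question, subanswers)
```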

I think there is a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe. Here, I describe three models of how one could think HCH would work, and why we might expect each of them to be safe.

Mechanical Turk: A human, Bob, is hired on Mechanical Turk to act as a component of HCH. Bob takes in some reasonable-length natural-language question, formulates subquestions to ask other Turkers, and turns the responses from those Turkers into an answer to the original question. Bob only sees the question he is asked and thinks for a short period of time before asking subquestions or returning an answer. Here, the question “is HCH corrigible” becomes “how does the corrigibility of Bob translate into corrigibility of the overall system?” To claim that HCH is safe in this scenario, we could point to Bob being well-intentioned, having human-like concepts, and reasoning in a human-like way. Also, since Bob has to communicate in natural language with other humans, those communications could be monitored or reflected upon. We could claim that this keeps the reasoning that produces the answer within the space of reasoning that humans use, making it more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.

Lookup Table: An AI safety research team led by Alice writes down a set of 100 million possible queries that they claim capture all human reasoning. For each of these queries, they then write out the subquestions that would need to be asked, along with simple computer code that combines the answers to the subquestions into an answer to the original question. This produces a large lookup table, and the “human” in HCH is just a call to this lookup table. Here, the question “is HCH corrigible” becomes “has Alice’s team successfully designed a set of rules that performs corrigible reasoning?” To justify this, we point to Alice’s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.
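A minimal sketch of this variant, assuming the table maps each query to its subquestions plus a small combining function (the table contents and function names here are hypothetical toy examples, not anything from the scenario itself):

```python
# Hypothetical illustration of the Lookup Table scenario: the "human" is
# replaced by a (very large) table mapping each query to its subquestions
# and a simple function that combines the subanswers.
LOOKUP_TABLE = {
    "What is the population of France plus Germany?": (
        ["What is the population of France?",
         "What is the population of Germany?"],
        lambda subanswers: str(int(subanswers[0]) + int(subanswers[1])),
    ),
    "What is the population of France?": ([], lambda _: "68000000"),
    "What is the population of Germany?": ([], lambda _: "84000000"),
}


def lookup_hch(question: str) -> str:
    """Answer a question by recursively consulting the lookup table."""
    subquestions, combine = LOOKUP_TABLE[question]
    subanswers = [lookup_hch(sq) for sq in subquestions]
    return combine(subanswers)


print(lookup_hch("What is the population of France plus Germany?"))  # "152000000"
```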

Overseer’s Manual: An AI safety research team led by Alice has written a manual on how to corrigibly answer questions by decomposing them into subquestions. This manual is handed to Bob, who was hired to decompose tasks. Bob carefully studies the manual and applies its rules when performing his task (and the quality of his work is monitored by the team). Alice’s team has carefully thought about how to decompose tasks, and has performed many experiments with people like Bob trying to decompose tasks, so they understand the space of strategies and outputs that Bob will produce given the manual. The “human” in HCH is actually a human (Bob), but in effect Bob is acting as a compressed lookup table, and is only necessary because the lookup table is too large to write down. An analogy: it would take too much space and time to write down translations of all possible 10-word sentences from English to German, but it is possible to train humans who, given any 10-word English sentence, can produce the German translation. The safety properties come from Alice’s team’s preparations, which include modelling how Bob would produce answers after reading the manual. To justify the safety of the system, we again point to Alice’s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.

I claim that the Mechanical Turk scenario gives an incomplete picture of why we might hope for an HCH system to be safe. Though it might be safer than a computation without human involvement, I would find it hard to trust that this system would continue to scale without running into problems, such as deliberately or accidentally handing over control to some unsafe computational process. The Mechanical Turk scenario leaves out the design process that Alice’s team carries out in the Lookup Table and Overseer’s Manual scenarios, which can include at least some consideration of AI safety issues (though how much of this is necessary is an open question). I think this design process, if done right, is the thing that could give the system the ability to avoid these problems as it scales. I think we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.

(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)