Deliberation as a method to find the “actual preferences” of humans

Some recent discussion about what Paul Christiano means by “short-term preferences” got me thinking more generally about deliberation as a method of figuring out the human user’s or users’ “actual preferences”. (I can’t give a definition of “actual preferences” because we have such a poor understanding of meta-ethics that we don’t even know what the term should mean or if they even exist.)

To set the framing of this post: We want good outcomes from AI. To get this, we probably want to figure out the human user’s or users’ “actual preferences” at some point. There are several options for this:

  • Directly solve meta-ethics. We figure out whether there are normative facts about what we should value, and use this solution to clarify what “actual preferences” means and to find the human’s or humans’ “actual preferences”.

  • Solve meta-philosophy. This is like solving meta-ethics, but there is an extra level of meta: we figure out what philosophy is or what human brains are doing when they make philosophical progress, then use this understanding to solve meta-ethics. Then we proceed as in the “directly solve meta-ethics” approach.

  • Deliberate for a long time. We actually get a human or group of humans to think for a long time under idealized conditions, or approximate the output of this process somehow. This might reduce to one of the above approaches (if the humans come to believe that solving meta-ethics/meta-philosophy is the best way to find their “actual prefer­ences”) or it might not.

The third option is the focus of this post. The first two are also very worthy of consideration; they just aren’t the focus here. Also, the list isn’t meant to be comprehensive; I would be interested to hear about any other approaches.

In terms of Paul’s recent AI alignment landscape tree, I think this discussion fits under the “Learn from teacher” node, but I’m not sure.

Terminological note: In this post, I use “deliberation” and “reflection” interchangeably. I think this is standard, but I’m not sure. If anyone uses these terms differently, I would like to know how they distinguish between them.

Approaches to deliberation that have been suggested so far

In this section, I list some concrete-ish approaches to deliberation that have been considered so far. I say “concrete-ish” rather than “concrete” because each of these approaches seems underdetermined in many ways, e.g. for “humans sitting down”, it’s not clear if we split the humans up in some way, which humans we use, how much time we allow, what kind of “voting”/parliamentary system we use, and so on. Later on in this post I will talk about properties for deliberation, so the “concrete-ish” approaches here are concrete in two senses: (a) they have some of these properties filled in (e.g. “humans sitting down” says the computation happens primarily inside human brains); and (b) within a single property, they might specify a specific mechanism (e.g. saying “use counterfactual oracles somehow” is more concrete than saying “use an approach where the computation doesn’t happen inside human brains”).

  • Humans sitting down. A human or group of humans sitting down and thinking for a long time (a.k.a. Long Reflection/Great Deliberation).

  • Uploads sitting down. The above, but with whole brain emulations (uploads) instead. This would speed up the reflection in calendar time. There are probably other benefits and drawbacks as well.

  • Counterfactual oracles. Certain uses of counterfactual oracles allow humans to speed up reflection. For example, if we ask a counterfactual oracle to predict what we will say in a week, then we can get the answer now instead of waiting a week to find out what we would have said. See these two comments for a more detailed proposal.

  • Imitation-based IDA (iterated distillation and amplification). The human can break apart the question of “What are my actual preferences?” into sub-queries and use AI assistants to help answer the question. Alternatively, the human can ask more “concrete” questions like “How do I solve this math problem?” or specify more concrete tasks like “Help me schedule an appointment”, where the output of deliberation is implicit in how the AI system behaves.

  • RL-based IDA. This is like imitation-based IDA, but instead of distilling the overseer via imitation, we use reinforcement learning.

  • Debate. This is probably a dumb idea, but we can imagine getting the two AIs in Debate to argue for what the human should think about their values. Instead of exploring the whole tree of arguments and counter-arguments, the human can just process a single path through the tree, which will speed up the reflection.

  • CEV (coherent extrapolated volition), or more specifically what Eliezer Yudkowsky calls the “initial dynamic” in the CEV paper. An AI tries to figure out what a group of humans would think about their values if they knew more, thought faster, etc.

  • Ambitious value learning. Somehow use lots of data and compute to learn the human utility function.
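To give a rough sense of the “single path through the tree” speedup mentioned under Debate, here is a toy back-of-the-envelope sketch (my own illustration, not from any actual Debate implementation): if arguments form a tree with a given depth and branching factor, a judge who reads the whole tree processes exponentially many argument nodes, while a judge who follows one debater-chosen path reads only one node per level.

```python
# Toy illustration (hypothetical) of why following a single debate path
# is much cheaper for the human judge than reading the full argument tree.

def full_tree_size(depth: int, branching: int) -> int:
    """Total argument nodes in a complete tree of the given depth."""
    return sum(branching ** d for d in range(depth + 1))

def debate_path_size(depth: int) -> int:
    """Nodes read when the judge follows a single root-to-leaf path."""
    return depth + 1

# For a binary argument tree of depth 10:
print(full_tree_size(10, 2))   # 2047 nodes without Debate
print(debate_path_size(10))    # 11 nodes along a single debate path
```

The exponential gap is the whole point: the debaters, not the judge, bear the cost of searching the tree for the most relevant line of argument.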

Properties of deliberation

With the examples above in hand, I want to step back and abstract out some properties/axes/dimensions they have.

  • Human vs non-human computation. Does the computation happen primarily inside human brains?

  • Human-like vs non-human-like cognition. Does the cognition resemble human thought? If the computation happens outside human brains, we can try to mimic the low-level steps in the reasoning of the deliberation (i.e. simulating human thought), or we can just try to predict the outward result without going through the same internal mechanics (non-simulated). There are intermediate cases where we predict-without-simulating over short time periods (like one hour) but then simulate as we glue together these short-term reflections. One can also think of human-like vs non-human-like cognition as whether the process (rather than output; see below) of deliberation is explicit (human-like) vs implicit (non-human-like).

    • In the non-simulated case, there is the further question of whether any consequentialist reasoning takes place (it might be impossible to predict a long-term reflection without any consequentialist reasoning, so this might only apply to short-term reflections). Further discussion: Q5 in the CEV paper, Vingean reflection, this comment by Eliezer and Paul’s reply to it, this post and the resulting discussion.

    • When I was originally thinking about this, I conflated “Does the computation happen primarily inside human brains?” and “Does the cognition resemble human thought?”, but these two can vary independently. For instance, whole brain emulation does the computation outside of human brains even though the cognition resembles human thought, and an implementation of HCH could hypothetically be done involving just humans but its cognition would not resemble human thought.

  • Implicit vs explicit output. Is the output of the deliberation explicitly represented, or is it just implicit in how the system behaves? From the IDA paper: “The human must be involved in this process because there is no external objective to guide learning—the objective is implicit in the way that the human coordinates the copies of . For example, we have no external measure of what constitutes a ‘good’ answer to a question, this notion is only implicit in how a human decides to combine the answers to subquestions (which usually involves both facts and value judgments).”

    • I also initially conflated implicit vs explicit process and implicit vs explicit output. Again, these can vary independently: RL-based IDA would have an explicit representation of the reward function but the deliberation would not resemble human thought (explicit output, implicit process), and we can imagine some humans who end up refusing to state what they value even after reflection, saying something like “I’ll just do whatever I feel like doing in the moment” (implicit output, explicit process).

  • Human intermediate integration. Some kinds of approaches (like counterfactual oracles and Debate) seem to speed up the deliberation by “offloading” parts of the work to AIs and having the humans integrate the intermediate results.

  • Human understandability of output. The output of deliberation could be simple enough that a human could understand it and integrate it into their worldviews, or it could be so complicated that this is not possible. It seems like there is a choice as to whether to allow non-understandable outputs. This was called “understandability” in this comment. Whereas human vs non-human computation is about whether the process of deliberation takes place in the human brain, and human-like vs non-human-like cognition is about whether the process is humanly understandable, human understandability is about whether the output eventually makes its way into the human brain. See the table below for a summary.

  • Speed. In AI takeoff scenarios where a bunch of different AIs are competing with each other, the deliberation process must produce some answer quickly or produce successive answers as time goes on (in order to figure out which resources are worth acquiring). On the other hand, in takeoff scenarios where the first successful project achieves a decisive strategic advantage, the deliberation can take its time.

  • Number of rounds (satisficing vs maximizing). The CEV paper talks about CEV as a way to “apply emergency first aid to human civilization, but not do humanity’s work on our behalf, or decide our futures for us” (p. 36). This seems to imply that in the “CEV story”, humanity itself (or at least some subset of humans) will do even more reflection after CEV. To use the terminology from the CEV paper, the first round is to satisfice, and the second round is to maximize.[1] Paul also seems to envision the AI learning human values in two rounds: the first round to gain a minimal understanding for the purpose of strategy-stealing, and the second round to gain a more nuanced understanding to implement our “actual preferences”.[2]

  • Individual vs collective reflection. The CEV paper argues for extrapolating the collective volition of all currently-existing humans, and says things like “You can go from a collective dynamic to an individual dynamic, but not the other way around; it’s a one-way hatch” (p. 23). As far as I know, other people haven’t really argued one way or the other (in some places I’ve seen people restricting discussion to a single human for the sake of simplicity).

  • Peeking at the output. In the CEV paper and Arbital page, Eliezer talks about giving a single human or group of humans the ability to peek at the output of the reflection and allowing them to “veto” it. See “Moral hazard vs. debugging” on the Arbital CEV page and also discussions of the Last Judge in the CEV paper.

I’m not sure that these dimensions cleanly separate or how important they are. There are also probably many other dimensions that I’m missing.

Since I had trouble distinguishing between some of the above properties, I made the following table:

|  | Output | Process |
| --- | --- | --- |
| Implicit vs explicit | Implicit vs explicit output | Human-like vs non-human-like cognition |
| Understandable vs not understandable | Human understandability of output (human intermediate integration also implies understandability of intermediate results and thus also of the output) | Human-like vs non-human-like cognition (there might also be non-human-like approaches that are understandable) |
| Inside vs outside human brain | (Reduces to understandable vs not understandable) | Human vs non-human computation |

Comparison table

The following table summarizes my understanding of where each of the concrete-ish approaches stands on a subset of the above properties. I’ve restricted the comparison to a subset of the properties because many approaches leave certain questions unanswered and also because if I add too many columns the table will become difficult to read.

In addition to the approaches listed above, I’ve included HCH since I think it’s an interesting theoretical case to look at.

| Approach | Inside human brain? | Human-like cognition? | Implicit vs explicit output | Intermediate integration | Understandable output? |
| --- | --- | --- | --- | --- | --- |
| Humans sitting down | yes | yes | explicit (hopefully) | yes | yes |
| Uploads sitting down | no | yes | explicit | maybe | yes |
| Counterfactual oracle | no | no | explicit | yes | yes |
| Imitation-based IDA | no | no | implicit/depends on question* | no | no |
| RL-based IDA | no | no† | explicit† | no | no† |
| HCH | yes | no | implicit/depends on question* | no | n.a. |
| Debate | no | no | explicit | yes | yes |
| CEV | no | ?‡ | explicit | no | yes |
| Ambitious value learning | no | no | explicit | no | maybe |

* We could imagine asking a question like “What are my actual preferences?” to get an explicit answer, or just ask AI assistants to do something (in which case the output of deliberation is not explicit).

† Paul says “Rather than learning a reward function from human data, we also train it by amplification (acting on the same representations used by the generative model). Again, we can distill the reward function into a neural network that acts on sequences of observations, but now instead of learning to predict human judgments it’s predicting a very large implicit deliberation.” The “implicit” in this quote seems to refer to the process (rather than output) of deliberation. See also the paragraph starting with “To summarize my own understanding” in this comment (which I think is talking about RL-based IDA), which suggests that maybe we should distinguish between “understandable in theory if we had the time” vs “understandable within the time constraints we have” (in the table I went with the latter). There is also the question of whether a reward function is “explicit enough” as a representation of values.

‡ Q5 (p. 32) in the CEV paper clarifies that the computation to find CEV wouldn’t be sentient, but I’m not sure if the paper says whether the cognition will resemble human thought.


  • We can imagine a graph where the horizontal axis is “quality of deliberation” and the vertical axis is “quality of outcome (overall value of the future)”. If your intuition says that the overall value of the future is sensitive to the quality of deliberation, it seems good to pay attention to how different “success stories” incorporate deliberation, and to understand the quality of deliberation for each approach. It might turn out that there is a certain threshold above which outcomes are “good enough” and that all the concrete approaches pass this threshold (the threshold could exist on either axis: we might stop caring about how good the outcomes are above a certain point, or all approaches to deliberation above a certain point produce basically the same outcome); in that case, understanding deliberation might not be so interesting. However, if there are no such thresholds (so that better deliberation continually leads to better outcomes), or if some of the approaches do not pass the threshold, then it seems worth being picky about how deliberation is implemented (potentially rejecting certain success stories for lack of satisfactory deliberation).

  • Thinking about deliberation is tricky because it requires mentally keeping track of the strategic background/assumptions for each “success story”, e.g. talking about speed of deliberation only makes sense in a slow takeoff scenario, and peeking at the output only makes sense under types of deliberation where the humans aren’t doing the work. See my related comment about a similar issue with success stories. It’s also tricky because there turn out to be a bunch of subtle distinctions that I didn’t realize existed.

  • One of my original motivations for thinking about deliberation was to try to understand what kind of deliberation Paul has in mind for IDA. Having gone through the above analysis, I feel like I understand each approach (e.g. RL-based IDA, counterfactual oracles) better, but I’m not sure I understand Paul’s overall vision any better. I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it’s not clear which approach he thinks is most plausible.


Thanks to Wei Dai for suggesting the point about solving meta-ethics. (However, I may have misrepresented his point, and this acknowledgment should not be seen as an endorsement by him.)

  1. From the CEV paper: “Do we want our coherent extrapolated volition to satisfice, or maximize? My guess is that we want our coherent extrapolated volition to satisfice […]. If so, rather than trying to guess the optimal decision of a specific individual, the CEV would pick a solution that satisficed the spread of possibilities for the extrapolated statistical aggregate of humankind.” (p. 36)

    And: “This is another reason not to stand in awe of the judgments of a CEV—a solution that satisfices an extrapolated spread of possibilities for the statistical aggregate of humankind may not correspond to the best decision of any individual, or even the best vote of any real, actual adult humankind.” (p. 37) ↩︎

  2. Paul says “So an excellent agent with a minimal understanding of human values seems OK. Such an agent could avoid getting left behind by its competitors, and remain under human control. Eventually, once it got enough information to understand human values (say, by interacting with humans), it could help us implement our values.” ↩︎