[AN #86]: Improving debate and factored cognition through human experiments

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


Writeup: Progress on AI Safety via Debate (Beth Barnes et al) (summarized by Rohin): This post reports on work done on creating a debate (AN #5) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).

Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn't want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect "framework" for finding the answer, and then made individually true claims to "fill in" the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other's arguments. In addition, part of the appeal of debate is that the agents can "zoom in" on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.

This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player's answer and subclaims, they can provide objections (perhaps on particular subclaims, or on the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion; once it is reached (or if there are no challenged objections), the judge is given a transcript of the debate and must declare a winner.

A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims "there are 3 primes between 0 and 5" and "there are 2 primes between 5 and 10". If challenged on the first, they will say "the three primes are 2, 3, and 5"; if challenged on the second, they will say "the two primes are 5 and 7". While in this simple case the honest player could register a top-level objection claiming that the word "between" is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.
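To make the trick concrete, here is a minimal sketch (the function names are mine, not from the post) showing how the two subclaims only add up to 5 by counting the prime 5 in both halves, under an inclusive reading of "between":

```python
def is_prime(n):
    """Trial-division primality check, fine for tiny n."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def primes_between(lo, hi, inclusive=True):
    """Primes "between" lo and hi, under an inclusive or exclusive reading."""
    candidates = range(lo, hi + 1) if inclusive else range(lo + 1, hi)
    return [n for n in candidates if is_prime(n)]

# Honest answer: the primes between 0 and 10 are 2, 3, 5, 7 (four primes).
# The malicious subclaims each use the inclusive reading locally, so the
# prime 5 gets counted in both halves:
#   primes_between(0, 5)  -> [2, 3, 5]  ("3 primes between 0 and 5")
#   primes_between(5, 10) -> [5, 7]     ("2 primes between 5 and 10")
```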

To mitigate this, the team added a method of cross examination, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won't go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don't stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.

They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn't know how to refute.

Rohin's opinion: I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to "messy" problems like ambiguity.

The full post has formal explanations and actual examples, which I highly recommend.

Evaluating Arguments One Step at a Time (Ought) (summarized by Rohin): This blog post reports on the results of Ought's experiments with factored evaluation (AN #62) by humans on questions about movie reviews. There are two human players, one honest and one malicious. Half the time, the honest player chooses a true top-level claim, decomposes it into supporting subclaims up to some depth, and provides supporting quotes for each claim and subclaim. The other half of the time, the malicious player does the same, except they choose a false top-level claim. In either case, the opposite player then chooses quotes that aim to refute the claims and subclaims. For each claim and subclaim, a new judge evaluates whether that claim is "valid" or "invalid". If any claim is marked invalid, then the top-level claim is judged to be false; otherwise it is judged true.
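The aggregation rule described above is strict: a single invalid claim anywhere in the tree sinks the top-level claim. A minimal sketch of that rule (the function names are my own):

```python
def evaluate_tree(claims, judge):
    """claims: the top-level claim plus all subclaims, each shown to a fresh
    judge along with its supporting and refuting quotes.
    judge(claim) returns "valid" or "invalid".
    The top-level claim is accepted only if every claim is judged valid."""
    return all(judge(claim) == "valid" for claim in claims)
```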

They found that depth 1 and depth 2 trees outperformed depth 0 trees (for which judgments were at about chance). However, they did not perform amazingly well: with post-hoc hyperparameter tuning, depth 1 and 2 trees got accuracies of 65% and 69% respectively. False positives (accepting false claims) were primarily caused by malicious claims that couldn't easily be rebutted with the limited budget (but could be rebutted with a larger budget), and by individual steps being complex enough that the judge made mistakes. False negatives (rejecting true claims) tended to happen when players chose hard-to-argue true claims, especially ones in which no specific quote clearly supports the claim, even though the full paragraph supports the claim through its tone and style.

There are several approaches that could theoretically solve these problems, such as increasing the size of claim trees, and improving quality control for judgments (e.g. by aggregating multiple judgments together).

Rohin's opinion: I wouldn't read too much into the low accuracies of the depth 2 trees: it seems quite plausible that this is specific to the movie review setting, and in settings with clearer answers you could do better. As with the previous post, I found the actual examples quite illuminating: it's always interesting to see what happens when theory collides with the real world.

Technical AI alignment

Technical agendas and prioritization

Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda (Jesse Clifton) (summarized by Flo): This agenda by the Effective Altruism Foundation focuses on risks of astronomical suffering (s-risks) posed by Transformative AI (AN #82) (TAI), and especially those related to conflicts between powerful AI agents. This is because there is a very clear path from extortion and executed threats against altruistic values to s-risks. While especially important in the context of s-risks, cooperation between AI systems is also relevant from a range of different viewpoints. The agenda covers four clusters of topics: strategy, credibility and bargaining, current AI frameworks, as well as decision theory.

The extent of cooperation failures is likely influenced by how power is distributed after the transition to TAI. At first glance, it seems like widely distributed scenarios (such as CAIS (AN #40)) are more problematic, but related literature from international relations paints a more complicated picture. The agenda seeks a better understanding of how the distribution of power affects catastrophic risk, as well as potential levers to influence this distribution. Other topics in the strategy/governance cluster include the identification and analysis of realistic scenarios for misalignment, as well as case studies on cooperation failures in humans and how they can be affected by policy.

TAI might enable unprecedented credibility, for example by being very transparent, which is crucial for both contracts and threats. The agenda aims at better models of the effects of credibility on cooperation failures. One approach to this is open-source game theory, where agents can see other agents' source code. Promising approaches to prevent catastrophic cooperation failures include the identification of peaceful bargaining mechanisms, as well as surrogate goals. The idea of surrogate goals is for an agent to commit to act as if it had a different goal whenever it is threatened, in order to protect its actual goal from threats.

As some aspects of contemporary AI architectures might still be present in TAI, it can be useful to study cooperation failures in current systems. One concrete approach to enabling cooperation in social dilemmas that could be tested with contemporary systems is based on bargaining over policies combined with punishments for deviations. Relatedly, it is worth investigating whether or not multi-agent training leads to human-like bargaining by default. This has implications for the suitability of behavioural vs classical game theory for studying TAI. The behavioural game theory of human-machine interactions might also be important, especially in human-in-the-loop scenarios of TAI.

The last cluster discusses the implications of bounded computation on decision theory, as well as the decision theories (implicitly) used by current agent architectures. Another focus lies on acausal reasoning, and in particular the possibility of acausal trade, where different correlated AI systems cooperate without any causal links between them.

Flo’s opinion: I am broadly sym­pa­thetic to the fo­cus on pre­vent­ing the worst out­comes and it seems plau­si­ble that ex­tor­tion could play an im­por­tant role in these, even though I worry more about dis­tri­bu­tional shift plus in­cor­rigi­bil­ity. Still, I am ex­cited about the fo­cus on co­op­er­a­tion, as this seems ro­bustly use­ful for a wide range of sce­nar­ios and most value sys­tems.

Rohin's opinion: Under a suffering-focused ethics under which s-risks far overwhelm x-risks, I think it makes sense to focus on this agenda. There don't seem to be many plausible paths to s-risks: by default, we shouldn't expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to suffer, as opposed to not exist at all, and there doesn't seem to be much reason to expect an immoral AI system. Conflict and the possibility of carrying out threats are the most plausible ways by which I could see this happening, and the agenda here focuses on neglected problems in this space.

However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn't care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for strategy research, though I am far from an expert here.

Iterated amplification

Synthesizing amplification and debate (Evan Hubinger) (summarized by Rohin): The distillation step in iterated amplification (AN #30) can be done using imitation learning. However, as argued in Against Mimicry, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn't necessarily incentivize a graceful failure, whereas a reward-based objective does. So, we might want to add an auxiliary reward objective. This post proposes an algorithm in which the amplified model answers a question via a debate (AN #5). The distilled model can then be trained by a combination of imitation of the amplified model, and reinforcement learning on the reward of +1 for winning the debate and −1 for losing.
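As a hedged sketch (the weighting, names, and REINFORCE-style reward term are my assumptions, not details from the post), the combined distillation objective could look like a weighted sum of an imitation term and a policy-gradient term on the ±1 debate reward:

```python
import numpy as np

def distillation_loss(logits, amplified_answer, sampled_answer, debate_reward,
                      imitation_weight=0.5):
    """logits: the distilled model's scores over candidate answers;
    amplified_answer: index of the amplified model's answer (imitation target);
    sampled_answer: index of the answer the distilled model gave in the debate;
    debate_reward: +1 if the distilled model won the debate, -1 if it lost."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))     # log-softmax
    imitation = -log_probs[amplified_answer]                # cross-entropy to amplified model
    reinforce = -debate_reward * log_probs[sampled_answer]  # push up answers that win debates
    return imitation_weight * imitation + (1 - imitation_weight) * reinforce
```

With `imitation_weight=1.0` this recovers plain imitation; the reward term is what lets errors fail gracefully rather than arbitrarily.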

Rohin's opinion: This seems like a reasonable algorithm to study, though I suspect there is a simpler algorithm that doesn't use debate that has the same advantages. Some other thoughts in this thread.

Learning human intent

Deep Bayesian Reward Learning from Preferences (Daniel S. Brown et al) (summarized by Zach): Bayesian inverse reinforcement learning (IRL) is ideal for safe imitation learning, since it allows uncertainty in the reward function estimator to be quantified. This approach requires thousands of likelihood estimates for proposed reward functions. However, each likelihood estimate requires training an agent according to the hypothesized reward function. Predictably, such a method is computationally intractable for high-dimensional problems.

In this paper, the authors propose Bayesian Reward Extrapolation (B-REX), a scalable preference-based Bayesian reward learning algorithm. They note that in this setting, a likelihood estimate that requires a loop over all demonstrations is much more feasible than an estimate that requires training a new agent. So, they assume that they have a set of ranked trajectories, and evaluate the likelihood of a reward function by its ability to reproduce the preference ordering in the demonstrations. To get further speedups, they fix all but the last layer of the reward model using a pretraining step: the reward of a trajectory is then simply the dot product of the last layer with the features of the trajectory as computed by all but the last layer of the net (which can be precomputed and cached once).
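A sketch of the resulting likelihood (variable names are my own; the paper pairs this with MCMC sampling over the last-layer weights): with cached features, each trajectory's reward is a single dot product, and the Bradley-Terry likelihood of the ranked demonstrations is cheap to evaluate.

```python
import numpy as np

def log_likelihood(w, features, preferences):
    """w: candidate last-layer weights, shape (d,);
    features: cached penultimate-layer features, shape (n_trajectories, d);
    preferences: (i, j) pairs meaning trajectory j is ranked above trajectory i.
    Bradley-Terry model: P(j > i) = exp(r_j) / (exp(r_i) + exp(r_j))."""
    rewards = features @ w  # one dot product per trajectory, no agent training
    return sum(rewards[j] - np.logaddexp(rewards[i], rewards[j])
               for i, j in preferences)
```

Sampling `w` from the resulting posterior (e.g. with Metropolis-Hastings) is what yields the distribution over rewards from which the performance confidence intervals mentioned below can be computed.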

The authors test B-REX on pixel-level Atari games and show competitive performance to T-REX (AN #54), a related method that only computes the MAP estimate. Furthermore, the authors can create confidence intervals for performance, since they can sample from the reward distribution.

Zach’s opinion: The idea of us­ing prefer­ence or­der­ings (Bradley-Terry) to speed up the pos­te­rior prob­a­bil­ity calcu­la­tion was in­ge­nious. While B-REX isn’t strictly bet­ter than T-REX in terms of re­wards achieved, the abil­ity to con­struct con­fi­dence in­ter­vals for perfor­mance is a ma­jor benefit. My take­away is that Bayesian IRL is get­ting more effi­cient and may have good po­ten­tial as a prac­ti­cal ap­proach to safe value learn­ing.

Preventing bad behavior

Attainable utility has a subagent problem (Stuart Armstrong) (summarized by Flo): This post argues that regularizing an agent's impact by attainable utility (AN #25) can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these, in order to restrict the agent's power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step, and then optimizing for the auxiliary reward.
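Written out as a sketch (the dictionary-of-Q-values representation is my simplification), the penalty compares each auxiliary Q-value under the action against the one-step no-op:

```python
def au_penalty(aux_q_values, state, action, noop="noop"):
    """aux_q_values: dict mapping each auxiliary reward's name to a table of
    Q(state, action) values, i.e. the expected cumulative auxiliary reward
    from taking that action and then optimizing the auxiliary reward.
    The penalty is the total absolute change relative to doing nothing."""
    return sum(abs(q[(state, action)] - q[(state, noop)])
               for q in aux_q_values.values())
```

The subagent attack below works precisely because building a helper can leave every such difference at zero while still shifting what gets optimized.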

This can be circumvented in some cases: if the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent's ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another, more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent's attainable auxiliary rewards.

Flo’s opinion: I am con­fused about how much the com­mit­ment to bal­ance out the origi­nal agent’s at­tain­able util­ity would con­strain the pow­er­ful sub­agent. Also, in the pres­ence of sub­agents, it seems plau­si­ble that at­tain­able util­ity mostly de­pends on the agent’s abil­ity to pro­duce sub­agents of differ­ent gen­er­al­ity with differ­ent goals: If a sub­agent that op­ti­mizes for a sin­gle aux­iliary re­ward was eas­ier to build than a more gen­eral one, build­ing a gen­eral pow­er­ful agent could con­sid­er­ably de­crease at­tain­able util­ity for all aux­iliary re­wards, such that the high penalty rules out this ac­tion.


TAISU: Technical AI Safety Unconference (Linda Linsefors) (summarized by Rohin): This unconference on technical AI safety will be held May 14th-17th; the application deadline is February 23.

AI Alignment Visiting Fellowship (summarized by Rohin): This fellowship would support 2-3 applicants to visit FHI for three or more months to work on human-aligned AI. The application deadline is Feb 28.
