AI Alignment Writing Day Roundup #2

Here are some of the posts from the AI Alignment Forum writing day. Due to the participants writing 34 posts in less than 24 hours (!), I’m re-airing them to let people have a proper chance to read (and comment on) them, in roughly chronological order.

1) Computational Model: Causal Diagrams with Symmetry by johnswentworth

This post is about representing logic, mathematics, and functions with causal models.

For our purposes, the central idea of embedded agency is to take these black-box systems which we call “agents”, and break open the black boxes to see what’s going on inside.
Causal DAGs with symmetry are how we do this for Turing-computable functions in general. They show the actual cause-and-effect process which computes the result; conceptually they represent the computation rather than a black-box function.

I’m new to a lot of this, but to me this seemed like a weird and surprising way to think about math (e.g. the notion that the input is something that causes the next input). Seems like a very interesting set of ideas to explore.
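
To get a feel for the idea, here is a minimal sketch of my own (not code from the post; the function and node names are just illustrative): a recursive function like factorial, unrolled into an explicit DAG of computation steps, where each call expands into the same repeating sub-pattern of nodes.

```python
# A minimal sketch (mine, not the post's code) of unrolling a recursive
# function into an explicit DAG of cause-and-effect steps. Each call to
# factorial(n) expands into the same repeated sub-pattern of nodes; that
# repetition is the "symmetry" in the title.

def unroll_factorial(n, dag=None):
    """Build a DAG for factorial(n). Nodes map name -> (operation, parent names)."""
    if dag is None:
        dag = {}
    node = f"fact({n})"
    if node in dag:
        return dag
    if n == 0:
        dag[node] = ("const 1", [])
    else:
        unroll_factorial(n - 1, dag)
        dag[node] = ("multiply", [f"n={n}", f"fact({n - 1})"])
        dag[f"n={n}"] = ("const", [])
    return dag

def evaluate(dag, node):
    """Compute a node's value purely from its parents, following the arrows."""
    op, parents = dag[node]
    if op == "const 1":
        return 1
    if op == "const":
        return int(node.split("=")[1])
    if op == "multiply":
        left, right = (evaluate(dag, p) for p in parents)
        return left * right

dag = unroll_factorial(4)
print(evaluate(dag, "fact(4)"))  # 24: same answer, but the process is now explicit
```

The DAG shows the cause-and-effect process that produces the answer, rather than just the input-output behaviour.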

2) Towards a mechanistic understanding of corrigibility by Evan Hubinger

This post builds on Paul Christiano’s post on Worst-Case Guarantees. That post claims:

Even if we are very careful about how we deploy ML, we may reach the point where a small number of correlated failures could quickly become catastrophic… I think the long-term safety of ML systems requires being able to rule out this kind of behavior, which I’ll call unacceptable, even for inputs which are extremely rare on the input distribution.

Paul then proposes a procedure built around adversarial search, where one part of the system searches for inputs that produce unacceptable outputs in the trained agent, and talks more about how one might build such a system.
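
To illustrate the shape of that procedure (only the shape: Paul’s actual proposal involves much stronger tools than the brute-force search below, and every function here is a toy stand-in I made up), here is a sketch of an adversary hunting for an input on which a trained policy misbehaves:

```python
import random

# Toy stand-ins: a "trained policy" and an "acceptability" predicate.
# In the real proposal both are far richer (an amplified overseer judges
# acceptability, and the adversary can use the model's internals).

def trained_policy(x: float) -> float:
    # Behaves sensibly on most inputs, but badly on a rare sliver of input space.
    return x * 2 if abs(x - 0.7071) > 1e-3 else 1e9

def is_acceptable(x: float, y: float) -> bool:
    # The (toy) overseer's notion of "not catastrophic".
    return abs(y) < 100

def adversarial_search(num_trials: int = 100_000, seed: int = 0):
    """Randomly search for an input on which the policy behaves unacceptably."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x = rng.uniform(0, 1)
        y = trained_policy(x)
        if not is_acceptable(x, y):
            return x, y  # a counterexample to feed back into training
    return None  # no failure found within this search budget

print(adversarial_search())
```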

Evan’s post tries to make progress on finding a good notion of acceptable behaviour from an ML system. Paul’s post offers two conditions about the ease of choosing an acceptable action (in particular, that it should not stop the agent achieving a high average reward and that it shouldn’t make hard problems much harder), but Evan’s conditions are about the ease of training an acceptable model. His two conditions are:

  1. It must be not that hard for an amplified overseer to verify that a model is acceptable.

  2. It must be not that hard to find such an acceptable model during training.

If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property (and achieve high average reward) are safe?
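
Here is how I picture where those two conditions bite, as a toy of my own making (a single-parameter ‘model’, hand-rolled gradient ascent, and an acceptability property that is just a sign check; none of it comes from Evan’s post). Condition 1 is that the overseer’s check on the model is cheap to run; condition 2 is that models passing the check are easy enough to reach that the penalty below doesn’t stall training.

```python
def task_reward(theta: float) -> float:
    return -(theta - 5.0) ** 2           # the "high average reward" objective

def overseer_accepts(theta: float) -> bool:
    return theta >= 0.0                  # condition 1: cheap for the overseer to check

def train(theta: float = -3.0, lr: float = 0.1, steps: int = 200) -> float:
    for _ in range(steps):
        grad = -2.0 * (theta - 5.0)      # gradient of task_reward w.r.t. theta
        if not overseer_accepts(theta):
            grad += 10.0                 # condition 2: the acceptable region is easy to reach
        theta += lr * grad               # gradient ascent on reward, plus the nudge
    return theta

theta = train()
print(theta, overseer_accepts(theta), task_reward(theta))
```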

The post then explores two possible approaches, act-based corrigibility and indifference corrigibility.

3) Logical Optimizers by Donald Hobson

This approach offers a solution to a simpler version of the FAI problem:

Suppose I was handed a hypercomputer and allowed to run code on it without worrying about mindcrime, then the hypercomputer is removed, allowing me to keep 1Gb of data from the computations. Then I am handed a magic human utility function, as code on a memory stick. [The approach below] would allow me to use the situation to make a FAI.

4) Deconfuse Yourself about Agency by VojtaKovarik

This post offers some cute formalisations, for example, generalising the notion of anthropomorphism to A-morphization: morphing/modelling any system by using an alternative architecture A.

This is an attempt to remove the need to explicitly use the term ‘agency’ in conversation, out of a sense that the use of the word is lacking in substance. I’m not sure I agree with this; I think people are using it to talk about a substantive thing they don’t know how to formalise yet. Nonetheless I liked all the various technical ideas offered.

My favourite part personally was the opening list of concrete architectures organised by how ‘agenty’ they feel, which I will quote in full; a small code sketch contrasting the first two groups follows the list:

  1. Architectures I would intuitively call “agenty”:

    1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and utility function (or heuristic) used to evaluate positions.

    2. (semi-vague) “Classical AI-agent” with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).

    3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).

  2. Architectures I would intuitively call “non-agenty”:

    1. A hard-coded sequence of actions.

    2. Look-up table.

    3. Random generator (outputting x∼π on every input, for some probability distribution π).

  3. Multi-agent architectures:

    1. Ant colony.

    2. Company (consisting of individual employees, operating within an economy).

    3. Comprehensive AI services.
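
To make the contrast between the first two groups concrete, here is a small sketch of my own (the thermostat environment and all the names are invented for illustration): a look-up table policy next to a policy parametrized by a utility function that searches over actions, in the spirit of the MCTS entry above.

```python
LOOKUP_TABLE = {"low": "heat_on", "ok": "do_nothing", "high": "heat_off"}

def lookup_policy(state: str) -> str:
    """Non-agenty: a fixed mapping from observations to actions."""
    return LOOKUP_TABLE[state]

def search_policy(temp: float, actions, transition, utility) -> str:
    """More agenty: pick the action whose predicted outcome scores best under
    the supplied utility function (a one-step search; MCTS rolls this out much
    further, with sampling)."""
    return max(actions, key=lambda a: utility(transition(temp, a)))

# Toy thermostat environment: the state is a temperature, actions nudge it.
actions = ["heat_on", "do_nothing", "heat_off"]
transition = lambda temp, a: temp + {"heat_on": 2.0, "do_nothing": 0.0, "heat_off": -2.0}[a]
utility = lambda temp: -abs(temp - 21.0)   # prefer 21 degrees

print(lookup_policy("low"))                                # heat_on
print(search_policy(17.0, actions, transition, utility))   # heat_on, chosen by search
```

The second policy still does very little, but the utility-function parameter is exactly the kind of handle that makes it feel more ‘agenty’.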

5) Thoughts from a Two-Boxer by jaek

I really liked this post, even though the author ends by saying the post might not have much of a purpose any more.

Having written that last paragraph I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn’t properly engaging with the premises of the thought experiment.

The post was (according to me) someone thinking for themselves about decision theory and putting in the effort to clearly explain their thoughts as they went along.

My understanding of the main disagreement between academia’s CDT/EDT and the AI alignment community’s UDT/FDT alternatives is the same as Paul Christiano’s understanding, which is that they are motivated by asking slightly different questions (the former being more human-focused and the latter being motivated by the question of what code to put into an AI). This post shows someone thinking through that and coming to that same realisation for themselves. I expect to link to it in the future as an example of this.
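
As a toy rendering of that code-focused framing (my own, assuming the standard Newcomb payoffs and a predictor that simply reads off which policy the agent runs):

```python
def one_box(box_a: int, box_b: int) -> int:
    return box_b                       # take only the opaque box

def two_box(box_a: int, box_b: int) -> int:
    return box_a + box_b               # take both boxes

def newcomb_payoff(policy) -> int:
    box_a = 1_000                                  # transparent box, always filled
    box_b = 1_000_000 if policy is one_box else 0  # the predictor reads the policy itself
    return policy(box_a, box_b)

for policy in (one_box, two_box):
    print(policy.__name__, newcomb_payoff(policy))
# one_box 1000000
# two_box 1000
```

The payoff here is a property of which piece of code you are rather than of the isolated choice at decision time, which, as I understand it, is the question the UDT/FDT line cares about.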