Alignment Newsletter #40

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

The Alignment Forum sequences have started again! As a reminder, treat them as though I had highlighted them.


Highlights
Reframing Superintelligence: Comprehensive AI Services as General Intelligence (Eric Drexler): This is a huge document; rather than summarize it all in this newsletter, I wrote up my summary in this post. For this newsletter, I’ve copied over the description of the model, but left out all of the implications and critiques.

The core idea is to look at the pathway by which we will develop general intelligence, rather than assuming that at some point we will get a superintelligent AGI agent. To predict how AI will progress in the future, we can look at how AI progresses currently—through research and development (R&D) processes. AI researchers consider a problem, define a search space, formulate an objective, and use an optimization technique in order to obtain an AI system, called a service, that performs the task.

A service is an AI system that delivers bounded results for some task using bounded resources in bounded time. Superintelligent language translation would count as a service, even though it requires a very detailed understanding of the world, including engineering, history, science, etc. Episodic RL agents also count as services.

While each of the AI R&D subtasks is currently performed by a human, as AI progresses we should expect that we will automate these tasks as well. At that point, we will have automated R&D, leading to recursive technological improvement. This is not recursive self-improvement, because the improvement comes from R&D services creating improvements in basic AI building blocks, and those improvements feed back into the R&D services. All of this should happen before we get any powerful AGI agents that can do arbitrary general reasoning.

Rohin’s opinion: I’m glad this has finally been published—it’s been informing my views for a long time now. I broadly buy the general view put forward here, with a few nitpicks that you can see in the post. I really do recommend you read at least the post—that’s just the summary of the report, so it’s full of insights, and it should be interesting to technical safety and strategy researchers alike.

I’m still not sure how this should affect what research we do—techniques like preference learning and recursive reward modeling seem applicable to CAIS as well, since they allow us to more accurately specify what we want each individual service to do.

Technical AI alignment

Iterated amplification sequence

Supervising strong learners by amplifying weak experts (Paul Christiano): This was previously covered in AN #30; I’ve copied over the summary and opinion. This paper introduces iterated amplification, focusing on how it can be used to define a training signal for tasks that humans cannot perform or evaluate, such as designing a transit system. The key insight is that humans are capable of decomposing even very difficult tasks into slightly simpler tasks. So, in theory, we could provide ground truth labels for an arbitrarily difficult task via a huge tree of humans, each decomposing their own question and handing off new subquestions to other humans, until the questions are easy enough that a human can directly answer them.

We can turn this into an efficient algorithm by having the human decompose the question only once, and using the current AI system to answer the generated subquestions. If the AI isn’t able to answer the subquestions, then the human will get nonsense answers. However, as long as there are questions that the human + AI system can answer but the AI alone cannot answer, the AI can learn from the answers to those questions. To reduce the reliance on human data, another model is trained to predict the decomposition that the human performs. In addition, some tasks could refer to a large context (e.g. evaluating safety for a specific rocket design), so they model the human as being able to access small pieces of the context at a time.
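The scheme above can be sketched in a few lines. This is a minimal, hypothetical sketch of the amplification and distillation steps; all names and interfaces here are invented for illustration, not taken from the paper's implementation:

```python
# Hypothetical sketch of iterated amplification's two steps; none of
# these names come from the paper's actual code.

def amplify(question, human_decompose_once, model):
    """Answer a question with a single round of human decomposition:
    the human splits the question into subquestions and supplies a
    recombination rule; the current model answers each subquestion."""
    subquestions, recombine = human_decompose_once(question)
    return recombine([model(q) for q in subquestions])

def distillation_step(questions, human_decompose_once, model, train_step):
    """Train the model to imitate the amplified human + model system.

    As long as the amplified system answers some questions better than
    the model alone, these targets teach the model something new."""
    for q in questions:
        target = amplify(q, human_decompose_once, model)
        train_step(model, q, target)  # supervised update toward target
```

In the paper the model is a neural network and the training step is a gradient update; both are left abstract here.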

They evaluate on simple algorithmic tasks like distance between nodes in a graph, where they can program an automated human decomposition for faster experiments, and there is a ground truth solution. They compare against supervised learning, which trains a model on the ground truth answers to questions (which iterated amplification does not have access to), and find that they can match the performance of supervised learning with only slightly more training steps.
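As a toy version of that experiment, the "human" decomposition for graph distance can be scripted directly: distance(s, t) is 0 if s == t, and otherwise 1 plus the minimum over neighbors n of s of distance(n, t). Below is a hypothetical miniature of the setup, with a lookup table standing in for the learned model; the graph, names, and training loop are all my own illustration, not the paper's code:

```python
# Toy illustration of iterated amplification on graph distance. The
# graph, the lookup-table "model", and the loop below are hypothetical
# stand-ins for the paper's neural-network setup.

GRAPH = {  # undirected square: a-b, a-c, b-d, c-d
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
UNKNOWN = 99  # stand-in answer for "don't know yet"

def scripted_decomposition(question, answer_subquestion):
    """The automated 'human': distance(s, t) is 0 if s == t, else
    1 + min over neighbors n of s of distance(n, t)."""
    s, t = question
    if s == t:
        return 0
    return 1 + min(answer_subquestion((n, t)) for n in GRAPH[s])

model_table = {}

def model(question):
    return model_table.get(question, UNKNOWN)

# Iterated amplification: compute amplified answers with the current
# model, then "distill" them back into the model, and repeat.
questions = [(s, t) for s in GRAPH for t in GRAPH]
for _ in range(len(GRAPH)):  # enough rounds for answers to propagate
    amplified = {q: scripted_decomposition(q, model) for q in questions}
    model_table.update(amplified)

print(model(("a", "d")))  # → 2 (shortest path a-b-d)
```

After a few distillation rounds, correct distances propagate outward from the trivial s == t cases, mirroring how the amplified human + AI system bootstraps answers the model alone could not produce.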

Rohin’s opinion: This is my new favorite post/paper for explaining how iterated amplification works, since it very succinctly and clearly makes the case for iterated amplification as a strategy for generating a good training signal. I’d recommend reading the paper in full, as it makes other important points that I haven’t included in the summary.

Note that it does not explain a lot of Paul’s thinking. It explains one particular training method that allows you to train an AI system with a more intelligent and informed overseer.

Value learning sequence

Will humans build goal-directed agents? (Rohin Shah): The previous post argued that coherence arguments do not mean that a superintelligent AI must have goal-directed behavior. In this post, I consider other arguments suggesting that we’ll build goal-directed AI systems.

- Since humans are goal-directed, they will build goal-directed AI to help them achieve their goals. Reaction: Somewhat agree, but this only shows that the human + AI system should be goal-directed, not the AI itself.

- Goal-directed AI can exceed human performance. Reaction: Mostly agree, but there could be alternatives that still exceed human performance.

- Current RL agents are goal-directed. Reaction: While the math says this, in practice it doesn’t seem true, since RL agents learn from experience rather than planning over the long term.

- Existing intelligent agents are goal-directed. Reaction: Seems like a good reason not to build AI using evolution.

- Goal-directed agents are more interpretable and so more desirable. Reaction: Disagree; this amounts to arguing that we should build goal-directed AI so that we can more easily predict that it will cause catastrophe.

AI safety without goal-directed behavior (Rohin Shah): The main thrust of the second chapter of the sequence is that it is not required for a superintelligent AI system to be goal-directed. While there are certainly economic arguments suggesting that we will build goal-directed AI, these do not have the force of a theorem. Given the strong arguments we’ve developed that goal-directed AI would likely be dangerous, it seems worth exploring other options. Some possibilities are AI systems that infer and follow norms, corrigible AI, and bounded and episodic AI services.

These other possibilities can be cast in a utility-maximization framework. However, if you do that then you are once again tempted to say that you are screwed if you get the utility function slightly wrong. Instead, I would want to build these systems in such a way that the desirable properties are inherent to the way that they reason, so that it isn’t even a coherent question to ask “what if we get it slightly wrong”.


Imitation learning considered unsafe? (capybaralet): We might hope that using imitation learning to mimic a corrigible human would be safe. However, this would involve mimicking the human’s planning process. It seems fairly likely that slight errors in the imitation of this process could lead to the creation of a goal-directed planning process that does dangerous long-term optimization.

Rohin’s opinion: This seems pretty similar to the problem of inner optimizers, in which while searching for a good policy for some task T on training distribution D, you end up finding a consequentialist agent that is optimizing some utility function that leads to good performance on D. That agent will have all the standard dangers of goal-directed optimization out of distribution.

Two More Decision Theory Problems for Humans (Wei Dai): The first problem is that any particular human’s values only make sense for the current environment. When considering different circumstances (e.g. an astronomically large number of very slightly negative experiences, like getting a dust speck in your eye), many people will not know how to evaluate the value of such a situation.

The second problem is that for most formalizations of values or utility functions, the values are defined relative to some way of making decisions in the world, or some ontology through which we understand the world. If this decision theory or ontology changes, it’s not clear how to “transfer” the values to the new version.

Predictors as Agents

Technical agendas and prioritization

Reframing Superintelligence: Comprehensive AI Services as General Intelligence (Eric Drexler): Summarized in the highlights!

Agent foundations

Failures of UDT-AIXI, Part 1: Improper Randomizing (Diffractor)

Preventing bad behavior

Optimization Regularization through Time Penalty (Linda Linsefors)