Alignment Newsletter

I publish the Alignment Newsletter, a weekly publication with recent content relevant to AI alignment. See here for more details. Quick links: email signup form, RSS feed, spreadsheet of all summaries.

The Alignment Newsletter #1: 04/09/18

The Alignment Newsletter #2: 04/16/18

The Alignment Newsletter #3: 04/23/18

The Alignment Newsletter #4: 04/30/18

The Alignment Newsletter #5: 05/07/18

The Alignment Newsletter #6: 05/14/18

The Alignment Newsletter #7: 05/21/18

The Alignment Newsletter #8: 05/28/18

The Alignment Newsletter #9: 06/04/18

The Alignment Newsletter #10: 06/11/18

The Alignment Newsletter #11: 06/18/18

The Alignment Newsletter #12: 06/25/18

Alignment Newsletter #13: 07/02/18

Alignment Newsletter #14

Alignment Newsletter #15: 07/16/18

Alignment Newsletter #16: 07/23/18

Alignment Newsletter #17

Alignment Newsletter #18

Alignment Newsletter #19

Alignment Newsletter #20

Alignment Newsletter #21

Alignment Newsletter #22

Alignment Newsletter #23

Alignment Newsletter #24

Alignment Newsletter #25

Alignment Newsletter #26

Alignment Newsletter #27

Alignment Newsletter #28

Alignment Newsletter #29

Alignment Newsletter #30

Alignment Newsletter #31

Alignment Newsletter #32

Alignment Newsletter #33

Alignment Newsletter #34

Alignment Newsletter #35

Alignment Newsletter #36

Alignment Newsletter #37

Alignment Newsletter #38

Alignment Newsletter #39

Alignment Newsletter #40

Alignment Newsletter #41

Alignment Newsletter #42

Alignment Newsletter #43

Alignment Newsletter #44

Alignment Newsletter #45

Alignment Newsletter #46

Alignment Newsletter #47

Alignment Newsletter #48

Alignment Newsletter #49

Alignment Newsletter #50

Alignment Newsletter #51

Alignment Newsletter #52

Alignment Newsletter One Year Retrospective

Alignment Newsletter #53

[AN #54] Boxing a finite-horizon AI system to keep it unambitious

[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI

[AN #56] Should ML researchers stop running experiments before making hypotheses?

[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

[AN #58] Mesa optimization: what it is, and why we should care

[AN #59] How arguments for AI risk have changed over time

[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode

[AN #61] AI policy and governance, from two people in the field

[AN #62] Are adversarial examples caused by real but imperceptible features?

[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence

[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning

[AN #65]: Learning useful skills by watching humans “play”

[AN #66]: Decomposing robustness into capability robustness and alignment robustness

[AN #67]: Creating environments in which to study inner alignment failures

[AN #68]: The attainable utility theory of impact

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

[AN #70]: Agents that help humans who are still learning about their own preferences

[AN #71]: Avoiding reward tampering through current-RF optimization

[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety

[AN #73]: Detecting catastrophic failures by learning how agents tend to break

[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts

[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee

[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations

[AN #77]: Double descent: a unification of statistical theory and modern ML practice

[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison

[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

[AN #82]: How OpenAI Five distributed their training computation

[AN #83]: Sample-efficient deep learning with ReMixMatch

[AN #84] Reviewing AI alignment work in 2018-19

[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot

[AN #86]: Improving debate and factored cognition through human experiments

[AN #87]: What might happen as deep learning scales even further?

[AN #88]: How the principal-agent literature relates to AI risk

[AN #89]: A unifying formalism for preference learning algorithms

[AN #90]: How search landscapes can contain self-reinforcing feedback loops

[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement

[AN #92]: Learning good representations with contrastive predictive coding

[AN #93]: The Precipice we’re standing at, and how we can back away from it

[AN #94]: AI alignment as translation between humans and machines

[AN #95]: A framework for thinking about how to make AI go well

[AN #96]: Buck and I discuss/argue about AI Alignment

[AN #97]: Are there historical examples of large, robust discontinuities?

[AN #98]: Understanding neural net training by seeing which gradients were helpful

[AN #99]: Doubling times for the efficiency of AI algorithms

[AN #100]: What might go wrong if you learn a reward function while acting

[AN #101]: Why we should rigorously measure and forecast AI progress

[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment

[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL

[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID

[AN #105]: The economic trajectory of humanity, and what we might mean by optimization

[AN #106]: Evaluating generalization ability of learned reward models

[AN #107]: The convergent instrumental subgoals of goal-directed agents

[AN #108]: Why we should scrutinize arguments for AI risk

[AN #109]: Teaching neural nets to generalize the way humans would

[AN #110]: Learning features from human feedback to enable reward learning

[AN #111]: The Circuits hypotheses for deep learning

[AN #112]: Engineering a Safer World

[AN #113]: Checking the ethical intuitions of large language models

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

[AN #115]: AI safety research problems in the AI-GA framework

[AN #116]: How to make explanations of neurons compositional

[AN #117]: How neural nets would fare under the TEVV framework

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

[AN #119]: AI safety when agents are shaped by environments, not rewards

[AN #120]: Tracing the intellectual roots of AI and AI alignment

[AN #121]: Forecasting transformative AI timelines using biological anchors

[AN #122]: Arguing for AGI-driven existential risk from first principles

[AN #123]: Inferring what is valuable in order to align recommender systems

[AN #124]: Provably safe exploration through shielding

[AN #125]: Neural network scaling laws across multiple modalities

[AN #126]: Avoiding wireheading by decoupling action feedback from action effects

[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment

[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands

[AN #129]: Explaining double descent by measuring bias and variance

[AN #130]: A new AI x-risk podcast, and reviews of the field

[AN #131]: Formalizing the argument of ignored attributes in a utility function

[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate

[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines)

[AN #134]: Underspecification as a cause of fragility to distribution shift

[AN #135]: Five properties of goal-directed systems

[AN #136]: How well will GPT-N perform on downstream tasks?

[AN #137]: Quantifying the benefits of pretraining on downstream task performance

[AN #138]: Why AI governance should find problems rather than just solving them

[AN #139]: How the simplicity of reality explains the success of neural nets

[AN #140]: Theoretical models that predict scaling laws

[AN #141]: The case for practicing alignment work on GPT-3 and other large models

[AN #142]: The quest to understand a network well enough to reimplement it by hand

[AN #143]: How to make embedded agents that reason probabilistically about their environments

[AN #144]: How language models can also be finetuned for non-language tasks

Alignment Newsletter Three Year Retrospective

[AN #145]: Our three year anniversary!

[AN #146]: Plausible stories of how we might fail to avert an existential catastrophe

[AN #147]: An overview of the interpretability landscape

[AN #148]: Analyzing generalization across more axes than just accuracy or loss

[AN #149]: The newsletter’s editorial policy

[AN #150]: The subtypes of Cooperative AI research

[AN #151]: How sparsity in the final layer makes a neural net debuggable

[AN #152]: How we’ve overestimated few-shot learning capabilities

[AN #153]: Experiments that demonstrate failures of objective robustness

[AN #154]: What economic growth theory has to say about transformative AI

[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions

[AN #156]: The scaling hypothesis: a plan for building AGI

[AN #157]: Measuring misalignment in the technology underlying Copilot

[AN #158]: Should we be optimistic about generalization?

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

[AN #160]: Building AIs that learn and think like people

[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity

[AN #162]: Foundation models: a paradigm shift within AI

[AN #163]: Using finite factored sets for causal and temporal inference

[AN #164]: How well can language models write code?

[AN #165]: When large models are more likely to lie

[AN #166]: Is it crazy to claim we’re in the most important century?

[AN #167]: Concrete ML safety problems and their relevance to x-risk

[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals

[AN #169]: Collaborating with humans without human data

[AN #170]: Analyzing the argument for risk from power-seeking AI

[AN #171]: Disagreements between alignment “optimists” and “pessimists”

[AN #172] Sorry for the long hiatus!

[AN #173] Recent language model results from DeepMind