Posts either linking to, or summarizing, formal papers published elsewhere.

Some AI research areas and their relevance to existential safety

Striking Implications for Learning Theory, Interpretability — and Safety?

How to Control an LLM’s Behavior (why my P(DOOM) went down)

Thirty-three randomly selected bioethics papers

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Notes/blog posts on two recent MIRI papers

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

New paper: Long-Term Trajectories of Human Civilization

Study on what makes people approve or condemn mind upload technology; references LW

AGI Safety Literature Review (Everitt, Lea & Hutter 2018)

Some conceptual highlights from “Disjunctive Scenarios of Catastrophic AI Risk”

Papers for 2017

Paper: Superintelligence as a Cause or Cure for Risks of Astronomical Suffering

Social Choice Ethics in Artificial Intelligence (paper challenging CEV-like approaches to choosing an AI’s values)

[link] Why Self-Control Seems (but may not be) Limited

Kurzban et al. on opportunity cost models of mental fatigue and resource-based models of willpower

Fallacies as weak Bayesian evidence

I Was Not Almost Wrong But I Was Almost Right: Close-Call Counterfactuals and Bias

[Preprint for commenting] Digital Immortality: Theory and Protocol for Indirect Mind Uploading

IJMC Mind Uploading Special Issue published

Bad news for uploading

“Personal Identity and Uploading”, by Mark Walker

“Ray Kurzweil and Uploading: Just Say No!”, Nick Agar

Publication of “Anthropic Decision Theory”

SSC Journal Club: AI Timelines

Computerphile discusses MIRI’s “Logical Induction” paper

New paper from MIRI: “Toward idealized decision theory”

[LINK] International variation in IQ – the role of parasites

IQ Scores Fail to Predict Academic Performance in Children With Autism

[LINK] Neuroscientists Find That Status within Groups Can Affect IQ

New report: Intelligence Explosion Microeconomics

The Chromatic Number of the Plane is at Least 5 - Aubrey de Grey

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

Stanford Encyclopedia of Philosophy on AI ethics and superintelligence

Multiverse-wide Cooperation via Correlated Decision Making

A technical note on bilinear layers for interpretability

Papers, Please #1: Various Papers on Employment, Wages and Productivity

Aumann Agreement by Combat

“A Definition of Subjective Probability” by Anscombe and Aumann

Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare

[Paper] The Global Catastrophic Risks of the Possibility of Finding Alien AI During SETI

Comment on “Endogenous Epistemic Factionalization”

Optimized Propaganda with Bayesian Networks: Comment on “Articulating Lay Theories Through Graphical Models”

Formal Solution to the Inner Alignment Problem

Deep limitations? Examining expert disagreement over deep learning

Entropic boundary conditions towards safe artificial superintelligence

Comment on “Deception as Cooperation”

2021 AI Alignment Literature Review and Charity Comparison

Reading the ethicists: A review of articles on AI in the journal Science and Engineering Ethics

Paper: Forecasting world events with neural nets

Poster Session on AI Safety

How to Read Papers Efficiently: Fast-then-Slow Three pass method

Learning preferences by looking at the world

[Question] How Old is Smallpox?

Is Caviar a Risk Factor For Being a Millionaire?

[Link] Computer improves its Civilization II gameplay by reading the manual

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

A Summary Of Anthropic’s First Paper

Generalizing Experimental Results by Leveraging Knowledge of Mechanisms

New paper: Corrigibility with Utility Preservation

Memory, nutrition, motivation, and genes

Human-AI Collaboration

“Everything is Correlated”: An Anthology of the Psychology Debate

Skepticism About DeepMind’s “Grandmaster-Level” Chess Without Search

A discussion of the paper, “Large Language Models are Zero-Shot Reasoners”

David Chalmers’ “The Singularity: A Philosophical Analysis”

Let’s Discuss Functional Decision Theory

Introducing Corrigibility (an FAI research subfield)

Counterfactual outcome state transition parameters

How to escape from your sandbox and from your hardware host

Oracle paper

New paper: The Incentives that Shape Behaviour

Dissolving the Fermi Paradox, and what reflection it provides

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Summary: Surreal Decisions

How Big a Deal are MatMul-Free Transformers?

To Learn Critical Thinking, Study Critical Thinking

An Overview of Sparks of Artificial General Intelligence: Early experiments with GPT-4

Paper digestion: “May We Have Your Attention Please? Human-Rights NGOs and the Problem of Global Communication”

The Physiology of Willpower

Experts vs. parents

The Mind Is Not Designed For Thinking

[Link] Persistence of Long-Term Memory in Vitrified and Revived C. elegans worms

[Question] Can this model grade a test without knowing the answers?

Implications of Quantum Computing for Artificial Intelligence Alignment Research

The theory of Proximal Policy Optimisation implementations

Citability of Lesswrong and the Alignment Forum

Link: Writing exercise closes the gender gap in university-level physics

Donohue, Levitt, Roe, and Wade: T-minus 20 years to a massive crime wave?

FHI paper published in Science: interventions against COVID-19

VLM-RM: Specifying Rewards with Natural Language

NeurIPS ML Safety Workshop 2022

[Question] How can we secure more research positions at our universities for x-risk researchers?

That one apocalyptic nuclear famine paper is bunk

Hope Function

Rawls’s Veil of Ignorance Doesn’t Make Any Sense

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

How You Can Gain Self Control Without “Self-Control”

Functional Trade-offs

“Are Experiments Possible?” Seeds of Science call for reviewers

Characterizing Intrinsic Compositionality in Transformers with Tree Projections

How truthful is GPT-3? A benchmark for language models

Walkthrough of the Tiling Agents for Self-Modifying AI paper

Doing your good deed for the day

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Demanding and Designing Aligned Cognitive Architectures

Even if you have a nail, not all hammers are the same

Less Competition, More Meritocracy?

A New Interpretation of the Marshmallow Test

Good News for Immunostimulants

Let’s Read: Superhuman AI for multiplayer poker

Tiling Agents for Self-Modifying AI (OPFAI #2)

The Vulnerable World Hypothesis (by Bostrom)

DeepMind article: AI Safety Gridworlds

Claims & Assumptions made in Eternity in Six Hours

[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | Arxiv

Effect heterogeneity and external validity in medicine

Learning biases and rewards simultaneously

Reasoning isn’t about logic (it’s about arguing)

