# Progress Studies

Last edit: 18 Nov 2022 5:28 UTC

Progress Studies is the study of civilizational progress and its causes: the combination of economic, technological, scientific, and cultural advances that has transformed human life and raised standards of living over the past few centuries.

The bicycle, as we know it today, was not invented until the late 1800s. Yet it is a simple mechanical invention, one that would seem to require no brilliant inventive insight and certainly no scientific background.
Why, then, wasn’t it invented much earlier? Why did we wait so long for the bicycle?

## Origin of the Name

Progress Studies was proposed as an academic field by Tyler Cowen and Patrick Collison [1] after they observed that no intellectual movement was focused on understanding the dynamics of progress, or on trying to speed it up.

# Discontinuous progress in history: an update

14 Apr 2020 0:00 UTC
178 points
(aiimpacts.org)

# Epistemic standards for “Why did it take so long to invent X?”

2 Mar 2020 18:58 UTC
90 points
(rootsofprogress.org)

# Technology Changes Constraints

25 Jan 2020 23:13 UTC
106 points

# [Answer] Why wasn’t science invented in China?

23 Apr 2019 21:47 UTC
92 points

# Industrial literacy

30 Sep 2020 16:39 UTC
281 points
(rootsofprogress.org)

# Announcing the Progress Forum

17 Nov 2022 19:26 UTC
82 points

# Why did everything take so long?

29 Dec 2017 1:00 UTC
29 points
(meteuphoric.wordpress.com)

# Studying Early Stage Science: Research Program Introduction

17 Jan 2020 22:12 UTC
32 points
(medium.com)

# Some examples of technology timelines

27 Mar 2020 18:13 UTC
24 points

# Two Kinds of Technology Change

11 Oct 2018 4:54 UTC
61 points

# Is Science Slowing Down?

27 Nov 2018 3:30 UTC
121 points
(slatestarcodex.com)

# The 300-year journey to the covid vaccine

9 Nov 2020 23:06 UTC
121 points
(rootsofprogress.org)

# The Copernican Revolution from the Inside

1 Nov 2017 10:51 UTC
116 points

# Why haven’t we celebrated any major achievements lately?

17 Aug 2020 20:34 UTC
193 points
(rootsofprogress.org)

# Why has nuclear power been a flop?

16 Apr 2021 16:49 UTC
125 points
(rootsofprogress.org)

# Thiel on Progress and Stagnation

20 Jul 2020 20:27 UTC
175 points

# 19th-century progress studies

27 Aug 2020 2:28 UTC
30 points
(rootsofprogress.org)

# The Rise and Fall of American Growth: A summary

5 Oct 2020 21:41 UTC
58 points
(rootsofprogress.org)

# 1960: The Year The Singularity Was Cancelled

23 Apr 2019 1:30 UTC
96 points
(slatestarcodex.com)

# How to end stagnation?

1 Mar 2021 19:42 UTC
36 points
(rootsofprogress.org)

# A dashboard for progress

21 Mar 2021 3:51 UTC
24 points
(rootsofprogress.org)

# Humanity is Winning the Fight Against Infectious Disease

30 Aug 2021 1:31 UTC
60 points

# BBC Future covers progress studies

16 Jun 2022 22:44 UTC
20 points
(rootsofprogress.org)

# Why Weren’t Hot Air Balloons Invented Sooner?

18 Oct 2022 0:41 UTC
107 points
(lostfutures.substack.com)

# Why did we wait so long for the bicycle?

17 Jul 2019 18:45 UTC
47 points
(rootsofprogress.org)

# Announcing Progress Studies for Young Scholars, an online summer program in the history of technology

20 May 2020 0:52 UTC
19 points
(rootsofprogress.org)

# [Question] Why didn’t Agoric Computing become popular?

16 Feb 2019 6:19 UTC
51 points

# On the Loss and Preservation of Knowledge

8 Mar 2018 18:40 UTC
55 points
(medium.com)

# Why Science is slowing down, Universities and Maslow’s hierarchy of needs

15 Feb 2020 20:39 UTC
16 points

# “Is Science Broken?” is underspecified

12 Aug 2016 11:59 UTC
16 points

# Why anything that can be for-profit, should be

29 Apr 2020 20:00 UTC
50 points
(rootsofprogress.org)

# A letter on optimism about human progress

4 Dec 2019 4:21 UTC
34 points
(letter.wiki)

# Instant stone (just add water!)

13 Nov 2019 22:33 UTC
98 points
(rootsofprogress.org)

# Book review: The Sleepwalkers by Arthur Koestler

23 Apr 2019 0:10 UTC
68 points
(thinkingcomplete.blogspot.com)

# Intellectual Progress Inside and Outside Academia

2 Sep 2017 23:08 UTC
40 points

# Shuttling between science and invention

27 May 2020 21:25 UTC
59 points
(rootsofprogress.org)

# Book review: The Technology Trap

20 Jul 2019 12:40 UTC
28 points
(thinkingcomplete.blogspot.com)

# Why everything might have taken so long

1 Jan 2018 1:00 UTC
103 points
(meteuphoric.wordpress.com)

# The Future of Science

28 Jul 2020 2:43 UTC
21 points

# Tradition is Smarter Than You Are

19 Sep 2018 17:54 UTC
64 points
(scholars-stage.blogspot.com)

# The Era Of Unlimited Everything: Unlimited Materials & Unlimited Money

1 Aug 2020 3:01 UTC
4 points

# The YouTube Revolution in Knowledge Transfer

17 Sep 2019 20:10 UTC
61 points
(medium.com)

# S-Curves for Trend Forecasting

23 Jan 2019 18:17 UTC
99 points

# Announcing: Progress Studies Reading Group.

30 Aug 2020 10:39 UTC
6 points

# [Question] What makes people intellectually active?

29 Dec 2018 22:29 UTC
101 points

# Some elements of industrial literacy

8 Oct 2020 19:51 UTC
9 points
(rootsofprogress.org)

# Technology and its side effects

13 Oct 2020 20:07 UTC
17 points
(rootsofprogress.org)

# [Question] What is our true life expectancy?

23 Oct 2020 23:17 UTC
14 points

# A review of Where Is My Flying Car? by J. Storrs Hall

6 Nov 2020 20:01 UTC
101 points
(rootsofprogress.org)

# Epistemic Progress

20 Nov 2020 19:58 UTC
54 points

# The Flynn Effect Clarified

12 Dec 2020 5:18 UTC
34 points
(www.bayesianinvestor.com)

# When life was literally full of crap

21 Dec 2020 21:16 UTC
59 points
(rootsofprogress.org)

# Technological stagnation: Why I came around

23 Jan 2021 22:05 UTC
182 points
(rootsofprogress.org)

# Clarifications on tech stagnation

31 Jan 2021 0:59 UTC
20 points
(rootsofprogress.org)

# [Question] Mathematical Models of Progress?

16 Feb 2021 0:21 UTC
28 points

# Exponential growth is the baseline

22 Feb 2021 2:08 UTC
27 points
(rootsofprogress.org)

# Lessons from “The Book of My Life”

6 Jan 2021 22:40 UTC
104 points
(benmgarfinkel.wordpress.com)

# Highlights from The Autobiography of Andrew Carnegie

8 Apr 2021 22:03 UTC
88 points
(rootsofprogress.org)

# Wanted: Research Assistant for The Roots of Progress

19 Apr 2021 19:03 UTC
31 points
(rootsofprogress.org)

# We need a career path for invention

29 Apr 2021 18:11 UTC
51 points
(rootsofprogress.org)

# Did the Industrial Revolution decrease costs or increase quality?

20 May 2021 21:10 UTC
71 points
(rootsofprogress.org)

# Why did we wait so long for the threshing machine?

29 Jun 2021 19:55 UTC
100 points
(rootsofprogress.org)

# Winston Churchill, futurist and EA

12 Jul 2021 2:07 UTC
66 points
(rootsofprogress.org)

# We need a new philosophy of progress

23 Aug 2021 21:31 UTC
64 points
(rootsofprogress.org)

# The Roots of Progress is now a nonprofit organization

23 Aug 2021 21:33 UTC
44 points
(rootsofprogress.org)

# Progress, Stagnation, & Collapse

22 Jul 2021 16:51 UTC
38 points

# Wanted: Chief of Staff for The Roots of Progress

7 Sep 2021 17:26 UTC
15 points
(rootsofprogress.org)

# How factories were made safe

12 Sep 2021 19:58 UTC
138 points
(rootsofprogress.org)

# Forecasting Transformative AI, Part 1: What Kind of AI?

24 Sep 2021 0:46 UTC
17 points

# In the shadow of the Great War

18 Oct 2021 23:08 UTC
32 points
(rootsofprogress.org)

# Young Scientists

22 Oct 2021 18:01 UTC
28 points

# The Roots of Progress events in Austin, November 4–6

25 Oct 2021 16:12 UTC
12 points
(rootsofprogress.org)

# They don’t make ’em like they used to

27 Oct 2021 19:44 UTC
39 points
(rootsofprogress.org)

# The poetry of progress

15 Nov 2021 19:24 UTC
50 points
(rootsofprogress.org)

# Meetup for The Roots of Progress in San Diego, Dec 1

24 Nov 2021 22:50 UTC
7 points
(rootsofprogress.org)

# The bonds of family and community: Poverty and cruelty among Russian peasants in the late 19th century

28 Nov 2021 17:22 UTC
112 points
(rootsofprogress.org)

# Conversation on technology forecasting and gradualism

9 Dec 2021 21:23 UTC
107 points

# More power to you

15 Dec 2021 23:50 UTC
16 points
(rootsofprogress.org)

# Progress, humanism, agency: An intellectual core for the progress movement

12 Jan 2022 0:06 UTC
11 points
(rootsofprogress.org)

# Why do we need a NEW philosophy of progress?

25 Jan 2022 17:28 UTC
75 points
(rootsofprogress.org)

# Event: Moral Foundations of Progress Studies, March 4–6 at UT Austin

26 Jan 2022 21:57 UTC
15 points
(rootsofprogress.org)

# Transportation as a Constraint

6 Apr 2020 4:58 UTC
173 points

# Observed patterns around major technological advancements

3 Feb 2022 0:30 UTC
45 points
(aiimpacts.org)

# What is a “philosophy of progress?”

4 Feb 2022 22:02 UTC
10 points
(rootsofprogress.org)

# What would a thriving progress movement look like?

24 Feb 2022 17:58 UTC
26 points
(rootsofprogress.org)

# Book review: The Dawn of Everything

11 Mar 2022 1:56 UTC
24 points
(www.bayesianinvestor.com)

# Wanted: Executive Assistant to help build the progress movement

23 Mar 2022 18:19 UTC
8 points
(rootsofprogress.org)

# Flywheels of progress

23 Mar 2022 20:42 UTC
30 points
(rootsofprogress.org)

# The lure of technocracy

31 Mar 2022 19:58 UTC
27 points
(rootsofprogress.org)

# Is growth linear, not exponential?

20 Apr 2022 18:46 UTC
28 points
(rootsofprogress.org)

# Future of Progress Studies – a meetup with Adam Thierer

21 Apr 2022 21:34 UTC
7 points

# Why “progress studies” is interdisciplinary

22 Apr 2022 21:58 UTC
12 points
(rootsofprogress.org)

# Why pessimism sounds smart

25 Apr 2022 20:10 UTC
74 points
(rootsofprogress.org)

# What are the best examples of catastrophic resource shortages?

4 May 2022 14:37 UTC
35 points
(rootsofprogress.org)

# How curing aging could help progress

22 May 2022 22:43 UTC
47 points
(rootsofprogress.org)

# Can growth continue?

27 May 2022 16:19 UTC
28 points
(rootsofprogress.org)

# Progress links and tweets, 2022-05-30

30 May 2022 23:20 UTC
18 points
(rootsofprogress.org)

# Reinventing the wheel

4 Jun 2022 22:39 UTC
78 points
(rootsofprogress.org)

# Progress links and tweets, 2022-06-08

9 Jun 2022 19:13 UTC
11 points
(rootsofprogress.org)

# Was the Industrial Revolution The Industrial Revolution?

14 Jun 2022 14:48 UTC
29 points
(daviskedrosky.substack.com)

# Progress links and tweets, 2022-06-13

15 Jun 2022 19:47 UTC
12 points
(rootsofprogress.org)

# Progress links and tweets, 2022-06-20

21 Jun 2022 17:12 UTC
12 points
(rootsofprogress.org)

# Progress links and tweets, 2022-06-29

29 Jun 2022 21:33 UTC
9 points
(rootsofprogress.org)

# Wealth as a source of technological stagnation?

7 Jul 2022 5:46 UTC
21 points

# Progress links and tweets, 2022-07-12

12 Jul 2022 15:30 UTC
12 points
(rootsofprogress.org)

# Highlights from the memoirs of Vannevar Bush

15 Jul 2022 18:08 UTC
10 points
(rootsofprogress.org)

# Launching a new progress institute, seeking a CEO

18 Jul 2022 16:58 UTC
25 points
(rootsofprogress.org)

# Progress links and tweets, 2022-07-19

19 Jul 2022 20:50 UTC
11 points
(rootsofprogress.org)

# Technocracy and the Space Age

26 Jul 2022 23:14 UTC
24 points
(rootsofprogress.org)

# Progress links and tweets, 2022-07-27

27 Jul 2022 17:20 UTC
18 points
(rootsofprogress.org)

# Progress links and tweets, 2022-08-02

2 Aug 2022 17:03 UTC
9 points
(rootsofprogress.org)

# Fiber arts, mysterious dodecahedrons, and waiting on “Eureka!”

4 Aug 2022 20:37 UTC
106 points
(eukaryotewritesblog.com)

# Progress links and tweets, 2022-08-09

9 Aug 2022 17:35 UTC
11 points
(rootsofprogress.org)

# Progress links and tweets, 2022-08-17

17 Aug 2022 21:27 UTC
11 points
(rootsofprogress.org)

# A conversation about progress and safety

18 Aug 2022 18:36 UTC
12 points
(rootsofprogress.org)

# What does moral progress consist of?

19 Aug 2022 0:22 UTC
30 points
(forum.effectivealtruism.org)

# Progress links and tweets, 2022-08-23

23 Aug 2022 17:30 UTC
7 points
(rootsofprogress.org)

# Event in SF: Foresight Institute meetup, Sep 8

25 Aug 2022 20:53 UTC
9 points
(rootsofprogress.org)

# Progress links and tweets, 2022-08-31

31 Aug 2022 21:54 UTC
13 points
(rootsofprogress.org)

# How much impact can any one man have?

31 Aug 2022 10:26 UTC
9 points

# Why was progress so slow in the past?

1 Sep 2022 20:26 UTC
53 points
(rootsofprogress.org)

# Progress links & tweets, 2022-09-08

8 Sep 2022 20:43 UTC
13 points
(rootsofprogress.org)

# Progress links and tweets, 2022-09-14

14 Sep 2022 23:21 UTC
9 points
(rootsofprogress.org)

# Towards a philosophy of safety

16 Sep 2022 21:10 UTC
12 points
(rootsofprogress.org)

# Progress links and tweets, 2022-09-20

20 Sep 2022 14:07 UTC
7 points
(rootsofprogress.org)

# What happened to the idea of progress?

20 Sep 2022 19:56 UTC
8 points
(bigthink.com)

# Progress links and tweets, 2022-09-28

28 Sep 2022 20:26 UTC
13 points
(rootsofprogress.org)

# Progress links and tweets, 2022-10-05

5 Oct 2022 19:24 UTC
9 points
(rootsofprogress.org)

# American invention from the “heroic age” to the system-building era

6 Oct 2022 17:19 UTC
13 points
(rootsofprogress.org)

# Cleaning a Spoon is Complex

9 Oct 2022 1:10 UTC
20 points
(www.jefftk.com)

# From technocracy to the counterculture

11 Oct 2022 19:37 UTC
28 points
(rootsofprogress.org)

# Progress links and tweets, 2022-10-12

12 Oct 2022 16:59 UTC
8 points
(rootsofprogress.org)

# Progress links and tweets, 2022-11-01

1 Nov 2022 17:48 UTC
16 points
(rootsofprogress.org)

# Should we “go against nature”?

4 Nov 2022 22:14 UTC
10 points
(rootsofprogress.org)

# When should we be surprised that an invention took “so long”?

16 Nov 2022 20:04 UTC
32 points
(rootsofprogress.org)

# Progress links and tweets, 2022-11-15

16 Nov 2022 3:21 UTC
9 points
(rootsofprogress.org)

# Progress links and tweets, 2022-11-22

22 Nov 2022 17:39 UTC
17 points
(rootsofprogress.org)

# Jan Bloch’s Impossible War

17 Feb 2020 16:14 UTC
107 points
(hivewired.wordpress.com)

# The abruptness of nuclear weapons

25 Feb 2018 17:40 UTC
46 points

# Can the Chain Still Hold You?

13 Jan 2012 1:28 UTC
192 points

# Draining the swamp

28 Jan 2020 21:37 UTC
91 points
(rootsofprogress.org)

# For progress to be by accumulation and not by random walk, read great books

2 Mar 2010 8:11 UTC
69 points

# [Question] Which scientific discovery was most ahead of its time?

16 May 2019 12:58 UTC
38 points

# Is the rate of scientific progress slowing down? (by Tyler Cowen and Ben Southwood)

2 Dec 2019 3:45 UTC
48 points

# Why safety is not safe

14 Jun 2009 5:20 UTC
60 points

# Is the World Getting Better? A brief summary of recent debate

6 Feb 2019 17:38 UTC
35 points
(capx.co)

# For the past, in some ways only, we are moral degenerates

7 Jun 2019 15:57 UTC
32 points

# A Proposed Adjustment to the Astronomical Waste Argument

27 May 2013 3:39 UTC
35 points

# [Question] How do we identify bottlenecks to scientific and technological progress?

31 Dec 2018 20:21 UTC
31 points

# The efficiency of prizes

3 Apr 2012 21:45 UTC
37 points

# [Question] Could Nixon going to China be a cause for the big stagnation?

5 Jul 2020 6:58 UTC
16 points

# How to analyze progress, stagnation, and low-hanging fruit

15 Jun 2020 21:02 UTC
42 points
(rootsofprogress.org)

# Study Group for Progress – 50% off for LessWrongers

3 Sep 2020 0:17 UTC
14 points

# Cutting edge technology

31 Oct 2017 6:00 UTC
10 points

# Indignation in response to the 1890 census

8 Sep 2020 20:14 UTC
26 points
(rootsofprogress.org)

# Peter Thiel warns of upcoming (and current) stagnation

4 Oct 2011 17:30 UTC
36 points

# Progress: Fluke or trend?

13 Sep 2020 0:21 UTC
16 points
(rootsofprogress.org)

# A prior for technological discontinuities

13 Oct 2020 16:51 UTC
70 points

# (How) should we pursue human longevity?

24 Oct 2020 19:13 UTC
21 points

# Simulation of technological progress (work in progress)

10 Feb 2020 20:39 UTC
21 points

# [Question] Has a technological dependency graph been made?

27 Feb 2020 20:51 UTC
20 points

# Engelbart: Insufficiently Recursive

26 Nov 2008 8:31 UTC
20 points

# Were vaccines relevant to 20th century US mortality improvements?

10 Dec 2019 0:13 UTC
12 points
(rootsofprogress.org)

# Iron: From mythical to mundane

24 Oct 2019 22:43 UTC
19 points
(rootsofprogress.org)

# [Question] The tech left behind

12 Mar 2019 14:47 UTC
25 points

# The Triumph of Humanity Chart

26 Oct 2015 1:41 UTC
30 points

# [Question] What do you think should be included in a series about conceptual media?

31 Dec 2020 1:40 UTC
2 points

# Is the world becoming better?

6 Feb 2021 9:28 UTC
34 points

# Metric selection bias: why Moore’s law is less important than you think

8 Feb 2021 0:21 UTC
18 points
(aaronbergman.substack.com)

# Notes on Henrich’s “The WEIRDest People in the World” (2020)

14 Feb 2021 8:40 UTC
17 points

# The National Dashboard and Human Progress

15 Apr 2021 19:21 UTC
6 points
(max2c.com)

# We Live in an Era of Unprecedented World Peace

30 Aug 2021 22:25 UTC
27 points

# [Question] Weird models of country development?

22 Sep 2021 17:39 UTC
7 points

# What if we should use more energy, not less?

16 Oct 2021 19:51 UTC
4 points

# Exegesis

31 Dec 2021 17:48 UTC
9 points
(rogersbacon.substack.com)

# The Fourth Arena: What’s Up in the world these days? We’re moving to a new, a new what?

4 Jun 2022 19:07 UTC
2 points

# 1689: Uncovering the World New Institutionalism Created

17 Jun 2022 19:32 UTC
7 points
(daviskedrosky.substack.com)

# Enlightenment Values in a Vulnerable World

20 Jul 2022 19:52 UTC
15 points
(maximumprogress.substack.com)

# Drexler’s Nanotech Forecast

30 Jul 2022 0:45 UTC
26 points
(www.bayesianinvestor.com)

# Who ordered alignment’s apple?

28 Aug 2022 4:05 UTC
6 points

# There is no royal road to alignment

18 Sep 2022 3:33 UTC
4 points

# Against the weirdness heuristic

2 Oct 2022 19:41 UTC
17 points
• Totally baseless conjecture that I have not thought about for very long; chaos is identical to Turing completeness. All dynamical systems that demonstrate chaotic behavior are Turing complete (or at least implement an undecidable procedure).

Has anyone heard of an established connection here?

• As of all things transportation related, it’s all about power to weight ratio. Tanks have horrible power to weight ratio, but they are very tough. Planes are a bit more fragile, and birds are even more fragile. Birds also have the added benefit of biological healing from physical wear and tear. Evolution is just a form of engineering.

• Hey Adam, thanks for running Refine and writing this up.

Out of curiosity, do you (or anyone else) know if there are statistics for previous SERI-MATS cohorts/​other programs designed to generate conceptual alignment researchers?

• I think grading in some form will be necessary in the sense that we don’t know what value heuristics will be sufficient to ensure alignment in the AI. We will most likely need to add corrections to its reward signals on the fly, even as it learns to extrapolate its own values from those heuristics. In other words, grading.

However, it seems the crucial point is that we need to avoid including grader evaluations as part of the AI’s self-evaluation model, for the same reason that we shouldn’t give it access to its reward button. In other words, don’t build the AI like this:

[planning module] → [predicted grader output] → [internal reward signal] → [reinforce policy function]

Instead, it should look more like this:

[planning module] → [predicted world state] → [internal reward signal] → [reinforce policy function]

The predicted grader output may be part of the AI’s predicted world state (if a grader is used), but it shouldn’t be the part that triggers reward. The trick, then, would be to identify the part of the AI’s world model that corresponds to what we want it to care about and feed only that part into the learned reward signal.

• 27 Nov 2022 18:18 UTC
1 point
0 ∶ 0

I have set out to fully and intuitely understand Löb’s theorem.

I have found an answer that makes sense to my intuition. Not sure if it will come across, as I am a very non-neurotypical person who really thinks like a space alien sometimes.

Löb’s theorem, the standard/​explicit formulation (for reference)

(for any formula P)
If it is provable in PA
that “if P is provable in PA then P is true”, then P is provable in PA.

Löb’s theorem, intuitive expression, version 2.0

If I am a provably logically coherent entity
and I promise that “if I promise that you can trust me, then you can trust me”, then you can trust me.

I am tentatively calling the concept “meta-trust”. With that in mind, an even shorter formulation:

If you meta-trust me, then you trust me.

I’m not sure I can explain in explicit terms why my intuition feels this is true. I mean, it just seems obvious? But I can’t explain yet why it is obvious.

And the correspondence between the two versions of the theorem is not 1:1. It uses somewhat different formulation/​[type of concept]. It uses different kind of cognitive machinery. Even if it feels “obvious” to me that the statements are exactly equivalent.

Well, not exactly, because intuition never thinks in 100% exact terms. But it can think very exactly, just not infinitely so.

So, my question to you is:

Does this formulation make intuitive sense to you?
There are probably inference steps I have skipped. What are the inference steps that need to expanded on, to make the formulation more broadly understandable, without relying on some unspoken background knowledge?
And most importantly: if you this think that this formulation is logically incorrect, I would very much like to hear how it is incorrect.

• If we are able to flag a treacherous turn as cognitively anomalous, then we can take that opportunity to shut down a system and retrain on the offending datapoint.

What do you mean by “retrain on the offending datapoint”? I would be worried about Goodhearting on this by selecting for systems which don’t set off the anomaly detector, and thereby making it a less reliable safeguard.

• Suppose I have a slow process I trust that I use to provide sparse ground truth for my system (like a very extensive human evaluation). But day-to-day I need to use my ML system because it’s much cheaper. I’m concerned that it may take some catastrophically bad actions at test time because it thinks that it can take over. But if I can flag those as anomalous, then I can invoke my slow oversight process, include the datapoint in training data, update my model to be less likely to try to take a treacherous turn, and then continue. If my model learns quickly then I won’t have to do this very many times before it stops trying to take a treacherous turn.

• 27 Nov 2022 17:48 UTC
2 points
0 ∶ 0

The federation protocol makes some weird choices about trust and bandwidth.
JWZ’s snarky commentary: https://​​www.jwz.org/​​blog/​​2022/​​11/​​mastodon-stampede/​​

Federation and what’s local vs remote: https://​​medium.com/​​@kris-nova/​​experimenting-with-federation-and-migrating-accounts-eae61a688c3c

It’s a very odd mix of low-trust but also many things not cryptographically validated to provide proof of intent (just https cert validation of the server). I give it 3 months before a rogue server figures out how to hijack or spoof follows, and then the spam whack-a-mole really begins.

• Philosophers don’t discuss things which can be falsified.

Sometimes in life, one simply faces questions whose answers can’t be falsified, such as “What should we do about things which can’t be falsified?” If you’re proposing to avoid discussing them, well aren’t you discussing one of them now? And why should we trust you, without discussing it ourselves?

I think you had the bad luck of taking a couple of philosophy classes that taught things that were outdated or “insane”. (Socrates and Aristotle may have been very confused, but consider, how did we, i.e., humanity, get to our current relatively less confused state, without doing more philosophy?) Personally I took a philosophy of physics class in college that I really liked, which led me to learn about other areas of philosophy.

I wrote more about what I think philosophy is and what philosophers do at Some Thoughts on Metaphilosophy, which you may find interesting since we both come from a math/​science background (computer science in my case).

• 27 Nov 2022 16:39 UTC
1 point
0 ∶ 0

Not talking precisely about observable reality.

Talking about what observable reality means. For instance , there is a set of observations and formulate called quantum mechanics. These might mean that we have free will, that we do n ot, that there is a parallel universe where the allies lost WWII, and so on. What it actually means can’t settled by further observations, or more maths, so something else is needed...unless you give up on the question of meaning, and settle for instrumentalism.

• Wait, on some shortforms there’s agree/​disagree votes, on others there’s not. Huh?

• For your quiz, could you give an example of something that is grader-optimization but which is not wireheading?

• 27 Nov 2022 16:04 UTC
−10 points
0 ∶ 3

This is not a place for politics. It is especially not a place for politics that have consistently led to catastrophe. The uncharitable reading of this post is that it is simply ignoring the harms of socialism, which is a trivial error of rationality. The charitable reading is that it is proposing a new take on socialism which could actually be beneficial. However, explaining such a point and answering the inevitable objections requires a long political discussion on a board that was explicitly created to avoid such things, due to their tendency to make rationality much harder.

If the charitable reading is correct, this might be an interesting debate, and even perhaps one where we could all learn something. But this simply isn’t the forum for it.

• As tailcalled says, there’s really not much overlap between what this post is advocating and what led to (e.g.) millions of deaths under Stalin.

This post is about a way a business can voluntarily choose to organize itself that might lead to better outcomes. The thing that has led to a lot of catastrophes is a way a nation can organize itself which, in practice, has only ever happened as a result of violent revolutions. (Other sorts of socialism have arisen democratically and these have not led to catastrophe; e.g., the Scandinavian countries are pretty good places to live.)

• Surely the good or bad effects of socialism are a function of policy? Whether or not a policy arises democratically and/​or revolutionarily does not change the policy itself. This is a striking non-sequitur.

The Scandinavian countries are indeed pretty good places to live. This likely has nothing to do whatsoever with democracy per se, but with the fact that the Scandinavian model does not regulate to anything resembling more strongly socialist nations, despite the fact that they famously have a large welfare system. There is no casual mechanism whereby voting for a leader would make the policies of that leader better-though obviously a leader that harmed the people in legible-to-them ways might get voted out! But that would be democracy changing policy, not democracy making a given policy better. As a real-world test case, consider the Maduro regime in Venezuela. While his democratic bona fides are somewhat questionable (there are people who think he stole his election from Juan Guaido), he certainly had enough popular support to be a serious candidate. And that did not prevent his policies from having predictably impoverishing results on Venezuela.

• Aren’t socialist co-ops a totally different kind of socialism from the stuff that has “consistently led to catastrophe”?

• Hence the charitable reading that the OP might be calling for a different version of socialism that might conceivably be beneficial. My point isn’t that there’s zero chance that he’s right; my point is that there’s no way to say “hey, let’s do this thing that’s superficially similar to catastrophic policies” without it either not conveying useful information, or that useful information requiring a long political debate to hash out. And that’s not appropriate for the “Politics is the mind-killer, let’s improve our rationality on easier cases” forum. I’d welcome the post and subsequent debate on e.g. a Scott Alexander forum or comment section. But this isn’t the place for it.

• Aren’t these obviously, non-subtly different from the stuff that has consistently led to catastrophe? Or am I missing some similarity? (I have to admit that I’m somewhat weak on the relevant history, so it’s possible I’m missing some obvious deep similarity.)

• At least one critical similarity is that this plan relies on people ignoring economic incentives, and tries to handwave this away by pretending that people will cooperate in the face of free rider dynamics in the hopes of future payoffs. If that was true, game theory would be a lot simpler.

Are you on Data Secrets Lox? That is much more the place for this sort of discussion, and it would let us talk about whatever you like without transgressing the no politics board.

• I’m not handwaving anything I wrote a whole section about how experiments contradict this and what could explain this:

“Experiments have shown that people randomly allocated to do tasks in groups where they can elect their leaders and/​or choose their pay structures are more productive than those who are led by an unelected manager who makes pay choices for them.[20] One study looked at real firms with high levels of worker ownership of shares in the company and found that workers are keener to monitor others, making them more productive than those with low or no ownership of shares and directly contradicting the free rider hypothesis.[21] It turns out there are potential benefits to giving workers control and a stake in the running of the organization they work for. This allows workers to play a key role in decision making and reorient the goals of the organization.[22] One explanation for this phenomenon is that of “localized knowledge”. According to economist Friedrich Hayek, top-down organizers have difficulty harnessing and coordinating around local knowledge, and the policies they write that are the same across a wide range of circumstances don’t account for the “particular circumstances of time and place”.[23] (For examples of this, read Seeing Like a State by political scientist James Scott) Those who make the top-down policies in a traditional company are different to those who have to follow them. In addition, those who manage the company are most often different to those who own the company. These groups have different incentives and accumulate different knowledge. This means that co-ops have two main advantages:

Workers can harness their collective knowledge to make running the firm more effective. Workers can use their voting power to ensure the organization is more aligned with their values. Interestingly enough, I have yet to come across a co-op that uses the state of the art of social choice theory, so they could potentially get a lot lot better.“

• The specific handwave I’m referring to is Amartya Sen’s.

“In the case of the free rider hypothesis, these ‘rational fools’ act based on such a narrow conception of self-interest that they don’t take into account the obviously damaging long-term consequences of their behavior, both to the firm and ultimately to themselves. Normal, reasonable people—who are different to rational economic man—are usually happy to put efforts into a collective endeavor that will deliver benefits for them in the long run, even if that means foregoing some short-term gains.”

This sounds like it would predict that people reliably cooperate in prisoner’s dilemmas, and pick stag in stag hunts. In reality, of course, that’s not a thing! Cooperation exists, but tends to require coordination mechanisms. Worse, it sounds like it’s advocating an incoherent decision theory. While there are certainly times where it’s wise to make a choice that isn’t the best in the most narrow, myopic possible sense (Newcomb’s problem is the obvious example, or superrationality dynamics), that’s very different from putting effort into a collective endeavor in the hopes of collective success.
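The stag-hunt structure behind this point can be made concrete with a toy payoff table (the numbers are purely illustrative, not from the thread): stag is only the best response if you expect the other player to also pick stag, which is why cooperation needs a coordination mechanism rather than mere goodwill.

```python
# Stag hunt: hunting stag pays off only if the other player also hunts stag.
# Illustrative payoffs: (my_move, their_move) -> my payoff.
PAYOFF = {
    ("stag", "stag"): 4,
    ("stag", "hare"): 0,
    ("hare", "stag"): 3,
    ("hare", "hare"): 3,
}

def best_response(their_move):
    """My payoff-maximizing move given what I expect the other player to do."""
    return max(["stag", "hare"], key=lambda m: PAYOFF[(m, their_move)])

# Stag is optimal only under trust that the other picks stag;
# absent coordination, hare is the safe choice.
print(best_response("stag"))  # stag
print(best_response("hare"))  # hare
```

The "normal, reasonable people" story has to explain how players come to expect stag from each other; the payoff table alone does not deliver that.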

The evidence you cite is interesting, though Lao Mein’s evidence suggests it isn’t a slam dunk. But Sen is committing a fallacy here, and the same fallacy as was often used in support of socialist regimes. As such, it’s a valid answer to tailcalled’s question.

• I cite four different studies that show that the theory doesn’t match the observations; Lao Mein doesn’t cite anything. This is the most extreme version of being a selective skeptic.

• He cites the observation that socialized firms have not taken over the economy. That’s clearly true and clearly relevant. Your response was that you’d already explained why socialized firms might not take over even if they were productive. What were those reasons again? Reviewing your post, it looks like it might be the difficulty of gaining investment and brain drain from the most productive workers leaving, but both of those reasons would be strong arguments against socialization. Rose Wrist’s ideas for gaining investment anyway are interesting, but until socialized firms actually do raise enough funding to compete, saying that they theoretically maybe can sounds remarkably hollow.

The point of evidence is to see things that are more likely under one hypothesis than another. In the world where socialized firms are better, I do not expect to see them failing to take over. In the world where they are not, I do expect that it’s possible to generate arbitrarily long lists of pro-socialism citations.
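This likelihood-ratio view of evidence can be written as a one-line calculation in odds form. The probabilities below are purely illustrative placeholders, not estimates anyone in the thread has made:

```python
# Odds-form Bayes: posterior odds = prior odds * likelihood ratio,
# where the likelihood ratio is P(evidence | H) / P(evidence | not-H).
def update_odds(prior_odds, p_evidence_given_h, p_evidence_given_not_h):
    return prior_odds * (p_evidence_given_h / p_evidence_given_not_h)

# H = "socialized firms are more productive".
# Evidence = "socialized firms have not taken over the economy".
# If that evidence is five times likelier when H is false, observing it
# shifts even 1:1 odds down to 1:5 against H.
posterior = update_odds(1.0, 0.15, 0.75)
print(round(posterior, 3))  # 0.2
```

The count of citations never enters the formula; only how much likelier each observation is under one hypothesis than the other.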

The strength of a case depends on the strength of the evidence, not on the number of citations!

• I cited controlled experiments; you counter with an observation that I have already responded to in both the post and the comments:

I explained this in this section:

One issue that arises with starting a socialist firm is acquiring initial investment.[27] This is probably because co-ops want to maximize income (wages), not profits. They pursue the interests of their members rather than investors and may sometimes opt to increase wages instead of profits. Capitalist firms, on the other hand, are explicitly investor owned, so investor interests will take priority.

A socialist firm can be more productive and not dominate the economy if it’s hard to start a socialist firm.

The strength of a case depends on the strength of the evidence, not on the number of citations!

You are not engaging with the evidence I cited.

• [ ]
[deleted]
• C-4 can explode; undecided voters think this makes C-4 better than B-4.

• Birds also didn’t need a highly complex language to describe gravity and its effects on the natural world in order to work with this technology.

• For whatever it is worth, I tried to feel and felt like I was not really getting a lot of the distinctions. Let me state what I got.

Guessing the teacher’s password is bad. If you have a teacher’s-password-guessing situation, there is no magic password which would cease to make it bad for this reason. If you don’t have a teacher, this bad reason cannot apply. “Teacher” here means “pass-gatekeeper”. Having more and less correct answers to questions does not imply that it is a teacher’s-password-guessing situation.

Thinking about it, I find it curious that I feel that in AI “pass the utility-function gatekeeper” is a likely and central approach, while in human learning this feels like a very small, trivial curiosity. The nebulous reason for this is that humans are talking about “something real”. Trying to make this more technical, I end up in the direction of “humans have needs and attitudes towards the material that are beside passing”. I could imagine that exams about astrology or Middle Earth could push even human behaviour nearer to this problematic cluster.

If one takes the limiting condition that guessing the teacher’s password is bad, then the learning situation is not that the teacher wants you to do or think something (this would be pure passwording). However, the effect that is looked for is something that the learner could not come up with themselves (equally easily). Following the teaching (and here it might be important that it is the process and not the material), you come to do something that you can independently value. In a system whose whole interface with the rest of the system is to pass the exam, there is no possibility of any other source of value. So a kind of hard Kantian division is doomed to unreal behaviour. So a side channel should exist; the teaching or directing cannot be disintegrated from the rest of the agent.

• Pareto improvement: Instead of letting students write bachelor’s/​master’s theses that are basically just literature reviews, let them rewrite the respective Wikipedia articles instead (and then the supervisor checks the article).

Advantage: Instead of (in expectation) 10 people benefitting from the literature review, now a couple of hundred (for obscure pages) or potentially tens of thousands (for mildly popular pages) of people benefit.

• Short summary of some reading on attention spans (maybe a longer writeup at some point):

As far as I can tell, psychology doesn’t have an agreed-upon measure of attention span, nor does it have a standard test for measuring it. Papers on the topic try to answer more specific questions, such as “what is the attention span of students during lectures”, where there is also no consensus (could be 8 seconds, could be 10 minutes, could be more). In the best case, papers use ad-hoc tests to measure attention span; in the worst case, they use surveys. A commonly reported decline of attention span from 12 seconds to 8 seconds is likely completely fabricated. Since we don’t even have a test for attention span, society is not tracking whether attention spans are declining.

This seems like an improvable state of affairs, and could probably result in a lot of citations for comparatively little effort (look at some of the ad-hoc tests used in different papers, check them for construct validity, let a random sample take the test, and let another random sample take the test a year or two later (if desired, repeat)). The fact that completely made-up figures are cited this widely indicates that there is interest in those numbers.

• If socialist workplaces actually had all those benefits, they would already be taking over much of the economy. But we don’t see very many co-ops in the wild. My personal experience is that unions and co-ops tend to shift compensation to those with seniority and those involved with corporate politics instead of the skilled, productive, and competent ones. This then causes more time and energy to be spent on corporate politics and drives out the most productive employees.

You’re very much underestimating the effects of the top performers at a business. The market has run the experiment of comparing socialist firms to capitalist ones for more than a century, and every result has shown that socialist firms just aren’t competitive. Yes, there are benefits to the less productive employees. But the cost of driving out the most productive ones has been shown to be higher in just about every sector. Hence there aren’t very many socialist firms.

• There are countries where cooperative firms are doing fine. Most of Denmark’s supermarket chains are owned by the cooperative Coop. Denmark’s largest dairy producer, Arla, is a cooperative too. Both operate in a free market and are out-competing privately owned competitors.

Both also resort to many of the same dirty tricks traditionally structured firms are pulling. Arla, for example, has done tremendous harm to the plant-based industry through aggressive lobbying. Structuring firms as cooperatives doesn’t magically make them aligned.

• I’ve already explained, in both the post and other comments, why socialist firms wouldn’t necessarily take over the economy even if they were more productive.

• 27 Nov 2022 11:54 UTC

We’re building intelligent AI systems that help us do stuff. Regardless of how the AI’s internal cognition works, it seems clear that the plans /​ actions it enacts have to be extremely strongly selected. With alignment, we’re trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for “good outcomes” rather than “something else”.

In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is “the human is a source of information about what is good, and this information influences what the AI’s plans are selected for”. (There are some cases based on moral realism.)

This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, those values then influence which plans value-child ends up enacting.

All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.

I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is “good”, and executes the highest-scoring plan. In such cases, you can more precisely restate “plans are strongly selected for some slightly-different thing” to “the agent executes plans that cause upwards-errors in the prediction of what is good”.
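This failure mode (often called the optimizer’s curse) is easy to simulate: when an agent argmaxes over plans scored by a noisy evaluator, the winning plan is systematically one whose score was inflated by an upward error, and the inflation grows with the size of the search. A minimal sketch with made-up Gaussian plan values and evaluation noise:

```python
import random

random.seed(0)

def selected_error(n_plans, noise, trials=2000):
    """Average (predicted - true) value of the argmax plan."""
    total = 0.0
    for _ in range(trials):
        true_vals = [random.gauss(0, 1) for _ in range(n_plans)]
        predicted = [v + random.gauss(0, noise) for v in true_vals]
        best = max(range(n_plans), key=lambda i: predicted[i])
        total += predicted[best] - true_vals[best]
    return total / trials

# The harder the search, the more the winning plan's score is inflated:
print(selected_error(10, 1.0))
print(selected_error(1000, 1.0))
```

Both averages come out positive, and the second is larger: stronger selection over the same noisy evaluator produces larger upward errors in the chosen plan.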

It’s an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!

The part where I disagree is where the post goes on to say “and so we shouldn’t do this”. My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?

I’d assume that the idea is that you produce AI systems that are more like “value-child”. Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?

So far, this is a bit unfair to the post(s). It does have some additional arguments, which I’m going to rewrite in totally different language which I might be getting horribly wrong:

An AI system with a “direct (object-level) goal” is better than one with “indirect goals”. Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. “make diamonds”) encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. “Alice’s approval”). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.

I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.

Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I’d still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I’d be thinking about “how do I notice if my agent grew a shard that was subtly different from what I wanted” and I’d think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it’s pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.

Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don’t think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):

1. Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we’d be back to our original problem.

2. Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be “what is this safe subset, and why do you expect plans to be selected from it?” That’s not to say it’s impossible, just that I don’t see the argument for it.

3. Plans are selected based on values. In other words we’ve instilled values into the AI system, the plans are selected for those values. I’d critique this the same way as above, i.e. it’s really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.

4. Plans aren’t selected strongly. You could say that the 2-3 plans aren’t strongly selected for anything, so they aren’t likely to run into these issues. I think this is assuming that your AI system isn’t very capable; this sounds like the route of “don’t build powerful AI” (which is a plausible route).

In summary:

1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.

2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.

3. In the case of agents that pursue values /​ shards instilled by some other process, this argument turns into “what if the values /​ shards are different from what we wanted”.

4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about in what way that decision-making algorithm should be structured, not whether it should be optimized/​optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like “for most people, the actually consequentialist-correct choice is NOT to try explicitly reasoning about consequences”. Similarly, the best way for an agent to actually produce highly-optimized good-by-its-values outcomes through planning may not be by running an explicit search over the space of ~all plans, sticking each of them into its value-estimator, & picking the argmax plan.

I think there still may be some mixup between:

A. How does the cognition-we-intend-the-agent-to-have operate? (for ex. a plan grader + an actor that tries to argmax the grader, or a MuZero-like heuristic tree searcher, or a chain-of-thought LLM steered by normative self-talk, or something else)

B. How do we get the agent to have the intended cognition?

By contrast, “What if the values/​shards are different from what we wanted” (your summary point 3) is a question about B! Note that we have to confront B-like questions no matter how we answer A. If A = grader-optimization, there’s an analogous question of “What if the grader is different from what we wanted? /​ What if the trained actor is different from what we wanted?”. I don’t really see an issue with this post focusing exclusively on the A-like dimension of the problem and ignoring the B-like dimension temporarily, especially if we expect there to be general purpose methods that work across different answers to A.

• Re: Aristotle. A large part of what Aristotle wrote is science and math. If you felt you didn’t learn anything from Aristotle, that’s because only the non-science and non-math parts of Aristotle are usually taught, because science and math are usually taught in an ahistorical manner.

Barbara, Celarent, Darii, Ferio is good math. It is just that first-order logic and Venn diagrams are better, so we don’t teach Barbara etc. One thing lost by this ahistorical teaching is how much better first-order logic is, and how difficult an advance it was when it was first done by Frege.
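For concreteness, Barbara (“all M are P; all S are M; therefore all S are P”) really is valid when translated into first-order logic, and a brute-force check over a small finite domain can confirm it. A toy sketch (not from the comment), treating each predicate as an arbitrary assignment of truth values:

```python
from itertools import product

# Barbara: all M are P; all S are M; therefore all S are P.
def barbara_holds(domain, S, M, P):
    all_m_are_p = all((not M(x)) or P(x) for x in domain)
    all_s_are_m = all((not S(x)) or M(x) for x in domain)
    all_s_are_p = all((not S(x)) or P(x) for x in domain)
    # The syllogism is valid iff the premises imply the conclusion.
    return (not (all_m_are_p and all_s_are_m)) or all_s_are_p

# Check every assignment of S, M, P over a 3-element domain.
domain = range(3)
assignments = list(product([False, True], repeat=len(domain)))
valid = all(
    barbara_holds(domain,
                  lambda x, s=s: s[x],
                  lambda x, m=m: m[x],
                  lambda x, p=p: p[x])
    for s, m, p in product(assignments, repeat=3)
)
print(valid)  # True
```

First-order logic makes this a two-line derivation from universally quantified implications; the syllogistic system needed a whole catalogue of named mood-and-figure combinations to cover the same ground.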

• 27 Nov 2022 11:10 UTC

I’m impressed by how modern EAs manage to spin any cause into being supposedly EA.

There’s just no way that things like this are remotely as effective as, say, GiveWell causes (it wouldn’t even meet a much lower bar), and it barely even has longtermist points in its favor that could make me see why there’s at least a chance it could be worth it.

EA’s whole brand is massively diluted by all these causes, and I don’t think they are remotely as effective as other places your money can go, nor that they help the general message.

It’s like people get into EA, realize it’s a good idea, but then want to participate in the community and not just donate, so everyone tries to come up with new, clearly ineffective (compared to alternatives) causes and spins them as EA.

• The article mentions EA exactly twice. One is to quote the “80,000 hours” figure. The other is at the end, suggesting that EA organizations should consider adopting the sort of structure the article argues for.

Neither of these things claims, or implies, or even suggests, that shifting firms to a worker-cooperative model is an “EA” cause in the sense of being a more effective thing to do with money than, say, feeding starving poor people or preventing cheaply-preventable disease or (for those who favour such things) trying to increase the probability that some time in the future there are a billion trillion gazillion happy satisfied productive people.

(I don’t know whether Ben is correct about worker cooperatives being a better organizational structure in general. I don’t know whether EA organizations are similar enough to other businesses that this would indicate it’s a good structure for them. But if it is a good structure for them, they should consider using it even if persuading others to adopt it isn’t an efficient use of money.)

• There’s just no way that things like this are remotely as effective as say GiveWell causes

Do you have any evidence for this?

and it barely even has longtermist points

Not all EAs are longtermists.

• Do you have any evidence for this?

My prior is that other things are less effective and you need evidence to show they are more effective not vice versa.

Not all EAs are longtermists.

Of course. I’m saying it doesn’t even get to make that argument which can sometimes muddy the waters enough to make some odd-seeming causes look at least plausibly effective.

• My prior is that other things are less effective and you need evidence to show they are more effective not vice versa.

Appeal to presuppositions always feels weird to me. A socialist could just as easily say ‘my priors say the opposite’. In any case, you made a claim of comparison, not me, why is the burden of proof suddenly on me?

Of course. I’m saying it doesn’t even get to make that argument which can sometimes muddy the waters enough to make some odd-seeming causes look at least plausibly effective.

I’m trying to explain the scientific literature on co-ops, not persuade you of some scam.

• Goodhart is the malign god who gives you whatever you ask for.

• The societal, cultural, and demographic implications of longer lifespans (250-400 years), including the uncontrollable population boom that might occur as a result.

• Science starts with observations and then summarizes them into theories. Math starts with axioms and then generates theorems via a series of proofs.

Nice. And close to how I defined these for myself a long time ago:

• Math is everything valid (in the strict sense of being derived by a consistency-preserving process).

• Physics is everything useful (in the strict sense of tracing back to observable reality).

Hm, translating this into English, it doesn’t come out too well. But anyway, I agree that it is possible to carve truth in different ways: one with a focus on consistency and one on observability.

I don’t feel like I learned anything from Socrates and Aristotle.

My guess would be that you already knew all of the facts. We don’t learn philosophy to learn true facts. You have to read philosophy backwards: how did people arrive at truth?

Rationalism defines truth as that which is observable, legible and tautologically consistent. Science is useful for the “observable” part. Math is useful for the “consistency” part. Both are necessarily legible, since without well-defined terms you end up back in the quagmire of meaningless philosophy.

I think you are misreading what philosophy does (or at least did for most of its history): struggling to build words that describe difficult-to-describe things. Creating legibility out of nothing.

Yes, you need legibility, but where does it come from? Once you have words like “observation” or “particle” (aka atom) or “force”, you can do stuff with them. But reality is not legible enough to provide these for free. Reality has joints, but they are obscured by a lot of messy flesh that is in the way. You have to throw around words that you have an intuition about, which may or may not be right, and see what sticks. And you don’t do this alone, because it is too easy to spot fake patterns. Other people will try to catch you. But you can’t exchange intuitions directly. You have only words, with all their imprecision. You can try definitions, but they must be circular, because nothing is grounded yet. That’s philosophy: creating legibility out of nothing. At least that’s some idealized view of philosophy. Real philosophy is not like that. But ideal math is also not like real math.

Anyway, I think you can observe this process of creating legibility about something in real time with consciousness. I think progress is being made with recent posts by Scott Alexander and others, as well as real studies, such as putting a meditator into a CT scanner and measuring what goes on.

• You make a number of causal claims that seem to be empirically based. It would be nice if your post could expand on some details of how the causal inference was performed. For instance here:

This is not only terrible for the workers but also for the economy, since businesses with engaged workers have 23% higher profit, while employees who are not engaged cost the world $7.8 trillion in lost productivity, equal to 11% of global GDP.

What data and model are these estimates of the causal effects based on?

Another thing that confuses me is why socialist firms need special support and don’t naturally come to dominate the economy. You seem to attribute this to owners extracting value, but that seems short-sighted; presumably if you have an economy with a mixture of socialist and non-socialist firms, and the socialist firms are much more productive, they would grow quicker and become dominant over time.

• What data and model are these estimates of the causal effects based on?

You can find my sources in the references section. This was based on a Gallup study.

Another thing that confuses me is why socialist firms need special support and don’t naturally come to dominate the economy. You seem to attribute this to owners extracting value, but that seems short-sighted; presumably if you have an economy with a mixture of socialist and non-socialist firms, and the socialist firms are much more productive, they would grow quicker and become dominant over time.

I explained this in this section:

One issue that arises with starting a socialist firm is acquiring initial investment.[27] This is probably because co-ops want to maximize income (wages), not profits. They pursue the interests of their members rather than investors and may sometimes opt to increase wages instead of profits. Capitalist firms, on the other hand, are explicitly investor owned, so investor interests will take priority.

A socialist firm can be more productive and not dominate the economy if it’s hard to start a socialist firm.

• You can find my sources in the references section. This was based on a Gallup study.

The issue is basically friction. There were several things that made your links difficult to use:

• They were not direct links to the study, but instead indirect links to articles that talk about the study, so I had to dig further manually.

• The articles are often big and contain lots of specific things that might not be directly relevant to your point of using it in the post.

• Even having opened the study, I’m still left with confusions about the methodology. It looks to me like they basically just correlated employee engagement with productivity. This is valid if employee engagement varies in a way that is uncorrelated with other factors that have a big effect on productivity, but they don’t seem to justify that assumption and it doesn’t seem anecdotally sensible to me. Furthermore, they suggest that managers have a huge effect on employee engagement, which seems to point to a potential area where this assumption could fail. I would like to see a factor analysis containing employee engagement and a bunch of other variables before I believe it to be reasonably independent of other factors.

Now, these are questions I could study myself, so why put the burden on you? I’d say friction and scale: if everyone reading your article studies this themselves, then it is a lot of duplicated work. Meanwhile, if you did the work, e.g. making sure to link directly instead of indirectly to the studies you’ve read, and making sure to also link to factor analyses demonstrating independence or whatever else is assumed, then the work would only be performed once, and the results of the work would be available to everyone reading the article.

One issue that arises with starting a socialist firm is acquiring initial investment.[27] This is probably because co-ops want to maximize income (wages), not profits. They pursue the interests of their members rather than investors and may sometimes opt to increase wages instead of profits. Capitalist firms, on the other hand, are explicitly investor owned, so investor interests will take priority.

I’m not sure I understand the economics of this. If co-ops have an inherent massive growth advantage, wouldn’t that outweigh the advantage capitalist firms have in giving more dividends to investors? Because while in the short term the capitalist firms would maybe give more to their investors, in the long term the co-ops would grow bigger and therefore have more money to give, even if they allocate a smaller fraction of it?

• It may be that this post would have been better if Ben had put a ton of extra effort into replacing links-to-links with links, and explaining everything in more detail. But this has a bit of an “isolated demand for rigor” feel to me. To be clear, I don’t mean that I have compared your response to this post with your response to otherwise-similar posts with a different political slant and found a difference; it is consistent with everything I think I know for you merely to be unusually rigor-demanding across the board. But it does feel to me as if (1) you’re saying that to be good enough Ben’s post should have gone to a lot more effort and (2) broadly similar posts with different political leanings don’t generally attract demands for that level of author-effort.

• More fundamentally, the reason I call out that specific section is that I strongly doubt that the “businesses with engaged workers have 23% higher profit” sentence is based on statistics of any nontrivial evidentiary value for whether better workers lead to more profit, and if this part of the post is totally wrong then that calls into question whether the other statistics cited in the post are similarly totally wrong. However, in spot-checking whether the statistics were totally wrong, I found myself struggling with wading through signups and links and long mostly irrelevant articles. Of course some nonzero amount of this is likely to happen with spot-checks, but it seemed like the layers of links just made it even worse.

I was about to say that maybe I handled it wrong to begin with and should have just pointed out this problem. But looking back at my original comment, I think I handled it right, in just asking OP to explain what data/​model it was based on; the problem is that then OP responded back with repeating the links instead of explaining what he had read in the links. But I guess my followup response to OP responding back with repeating the links might be a problem.

• However, in spot-checking whether the statistics were totally wrong, I found myself struggling with wading through signups and links and long mostly irrelevant articles. Of course some nonzero amount of this is likely to happen with spot-checks, but it seemed like the layers of links just made it even worse.

This is dishonest; the vast majority of the sources are primary scientific studies, and the few times I do refer to secondary sources it isn’t irrelevant.

You did handle it right, especially your deleted comment.

asking OP to explain what data/​model it was based on; the problem is that then OP responded back with repeating the links instead of explaining what he had read in the links

Yeah, because the primary source is right there?! What value would my explaining in my second language bring, when you can click on the link and immediately download the primary source?

• This is dishonest; the vast majority of the sources are primary scientific studies, and the few times I do refer to secondary sources it isn’t irrelevant.

I wasn’t talking about the vast majority of the sources; I was talking about source 3, which turned into source 2, which turned into some other source that I had to find myself.

You did handle it right, especially your deleted comment.

My deleted comment was nearly identical to one of the non-deleted comments. It’s just that I realized there was a problem with one of my comments after posting it and I needed to take some time to look at it.

Yeah, because the primary source is right there?! What value would my explaining in my second language bring, when you can click on the link and immediately download the primary source?

The problem is that the primary source does not seem credible without additional information.

• A spot check is supposed to take a number of random sources and check them, not pick the one claim you find most suspicious (that isn’t even about co-ops) and use that to dismiss the entire literature on co-ops.

• A spot check is supposed to take a number of random sources and check them, not pick the one claim you find most suspicious (that isn’t even about co-ops)

There seem to be three parts to this objection:

1. I did not spot-check sufficiently many claims

2. I filtered the claim to spot-check based on being the most suspicious

3. I did not filter the claims based on being about co-ops

With regards to point 1, I agree that I cannot know your accuracy very precisely without doing more checks, but the problem is that each check takes time and there are a lot of posts on the internet to read, so I have to limit how much I search.

With regards to point 2, it’s not that I spot-checked the most suspicious one; rather, it’s that I spot-checked the first suspicious one. This is still a filter on suspiciousness, but a much weaker one. I think some filter on suspiciousness is appropriate, since suspicious claims are also the ones I can learn the most from if they turn out to be true, as claims become suspicious through a combination of being unlikely and having big implications.

With regards to point 3, if you put much more effort into verifying the accuracy of your claims about co-ops than your claims about other stuff, then your accuracy on co-ops might not be that correlated with your accuracy on other stuff, and I ought to do a spot-check specifically on your claims about co-ops. I don’t know if that is true. If it is true, it might also be helpful to mention it as a disclaimer in the post so people know which claims to mostly focus on.

and use that to dismiss the entire literature on co-ops.

So it’s not so much that I’m dismissing the entire literature on co-ops. (Or well, I would generally dismiss any social science that I haven’t done some surface checks of. But that’s different from my comments here.) It’s more that I’m dismissing your literature review of the literature.

• They were not direct links to the study, but instead indirect links to articles that talk about the study, so I had to dig further manually.

It was the second source in the post: [2]

The articles are often big and contain lots of specific things that might not be directly relevant to your point of using it in the post.

There was a summary of it on the linked page itself:

Unfortunately, most employees remain disengaged at work. In fact, low engagement alone costs the global economy $7.8 trillion.

Even having opened the study, I’m still left with confusion about the methodology.

From the study

Methodology

The primary data in this report come from the Gallup World Poll, through which Gallup has conducted surveys of the world’s adult population, using randomly selected samples, since 2005. The survey is administered annually face to face or by telephone, covering more than 160 countries and areas since its inception. In addition to the World Poll data, Gallup collected extensive random samples of working populations in the United States and Germany; these samples were also added to the dataset.

The target population of the World Poll is the entire civilian, noninstitutionalized, aged-15- and-older population. Gallup’s data in this report reflect the responses of adults, aged-15- and-older, who were employed for any number of hours by an employer.

With some exceptions, all samples are probability-based and nationally representative. Gallup uses data weighting to minimize bias in survey-based estimates; ensure samples are nationally representative for each country; and correct for unequal selection probability, nonresponse and double coverage of landline and mobile phone users when using both mobile phone and landline frames. Gallup also weights its final samples to match the national demographics of each selected country.

Regional findings in this report include data obtained from 2021 to as late as March 2022 (reported as part of 2021 data in this report). To determine percentage point changes for regions, Gallup uses data from 2020 and 2021 from the same countries in each region.

Country-specific findings in “Appendix 1: Country Comparisons” are based on data aggregated from three years of polling (2019, 2020 and 2021 — with several countries’ 2021 data obtained in early 2022). Percentage point changes for countries indicate the differences in percentage points when comparing the average from 2018, 2019 and 2020 with the average from 2019, 2020 and 2021.

Gallup typically surveys 1,000 individuals in each country or area, using a standard set
of core questions that has been translated into the major languages of the respective country. In some countries, Gallup collects oversamples in major cities or areas of special interest. Additionally, in some large countries, such as China and Russia, sample sizes include at least 2,000 adults. In a small number of countries, the sample size is less than 1,000. In this report, Gallup does not provide country-level data (aggregate of 2019, 2020 and 2021 data) or country-level percentage point change data (aggregate of 2018, 2019 and 2020 data) for any country that has an aggregate n size of less than 300.

For results based on the total sample of adults globally, the margin of sampling error ranged from ±0.5 percentage points to ±0.7 percentage points at the 95% confidence level. For results based on the total sample of adults in each region, the margin of sampling error ranged from ±0.6 percentage points to ±5.0 percentage points at the 95% confidence level. For results based on the total sample of adults in each country, the margin of sampling error ranged from ±0.5 percentage points to ±8.5 percentage points at the 95% confidence level. All reported margins of sampling error include computed design effects for weighting.

I’m not sure I understand the economics of this. If co-ops have an inherent massive growth advantage, wouldn’t that outweigh the advantage capitalist firms have in giving more dividends to investors? Because while in the short term the capitalist firms would maybe give more to their investors, in the long term the co-ops would grow bigger and therefore have more money to give, even if they allocate a smaller fraction of it?

I never claimed a massive growth advantage:

There seems to be a small increase in companywide productivity[33]

As I said, the meta-analyses only show a small growth advantage. If, e.g., a socialist firm grows by $1000 and a capitalist firm by $900, but the capitalist firm gives the $900 to the investors while the socialist firm gives $500 to both the investors and the employees, the investors can make more money with capitalist firms.

• It was the second source in the post: [2]

Oh I think one confusing factor was the footnote placement.

But anyway, no, this link doesn’t link directly to the study either; it links to a report that links to the study. I had to go through additional links to find this document, which appears to be the original source with the actual analysis.

From the study

The wall of text doesn’t really answer my questions about the independence of employee engagement.

I never claimed a massive growth advantage:

Ah sorry, that’s my mix-up between the effects of employee engagement and the effects of co-ops.

• But anyway, no, this link doesn’t link directly to the study either, it links to a report that links to the study

You can immediately see a button that says “download report” when you click on that link. I wouldn’t call that “digging for sources”.

The wall of text doesn’t really answer my questions about the independence of employee engagement.

Furthermore they suggest that managers have a huge effect on employee engagement, which seems to point to a potential area where this assumption could fail.

It’s not independent: co-ops let you vote on managers, which allows productivity to increase.

• You can immediately see a button that says “download report” when you click on that link. I wouldn’t call that “digging for sources”.

I have downloaded the report. When I searched for keywords from the sentence “This is not only terrible for the workers but also for the economy, since businesses with engaged workers have 23% higher profit, while employees who are not engaged cost the world $7.8 trillion in lost productivity, equal to 11% of global GDP.” in the report, the main section that appeared was this: Clearly, the COVID-19 pandemic era put a halt to a long period of gradual but general improvement among the world’s workers. This matters for global economic dynamism. Gallup estimates that low engagement costs the global economy US$7.8 trillion and accounts for 11% of GDP globally. Gallup’s analysis of 112,312 business units in 96 countries found a strong link between engagement and performance outcomes, such as retention, productivity, safety and profitability

Business units with engaged workers have 23% higher profit compared with business units with miserable workers.

Neither of these sections gives any idea of how Gallup came to the conclusion, but the first section contains a link to a different document that probably forms the foundation/primary source for their analysis.

It’s not independent

I mean if the independence of employee engagement doesn’t hold, then the causal inference doesn’t go through, and you can’t infer that engagement has this much effect on productivity...

co-ops let you vote on managers which allows productivity to increase.

… however this sounds like a different form of independence than the one I brought up.

• The next line contains an RTL override. Try to highlight it!

‮Hello, this is text with an RTL override U+202E character.

I’ve sometimes considered making a library for copy-paste protection based on RTL+LTR overrides such that something renders correctly, but is totally shuffled in actuality. I’ve held off on account of I don’t actually want such a thing to exist.
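On the detection side, a minimal sketch (the character set and function name are mine) for flagging text that contains the bidirectional control characters such a trick would rely on:

```python
# Sketch: flag strings containing Unicode bidirectional control
# characters (e.g. the RTL override U+202E), which can make the
# rendered text differ from the underlying character sequence.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides, pop
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}

def has_bidi_controls(text: str) -> bool:
    return any(ch in BIDI_CONTROLS for ch in text)

print(has_bidi_controls("\u202eHello"))  # True
print(has_bidi_controls("Hello"))        # False
```

Some code review tools run a check like this to warn about "trojan source" text.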

• Ah yes this was confusing to me for a while too, glad to be able to help someone else out with it!

The key thing to realise for me, is that the probability of 21 heads in a row changes as you toss each of those 21 coins.

The more ‘mathematical’ way to express this would be: the unconditional probability of tossing 21 heads in a row is (1/2)^21, i.e. 0.000000476837158, but the probability of tossing 21 heads in a row conditional on having already tossed 20 heads in a row is 1/2.

Let me know if any of that is still confusing.
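The two numbers can be checked directly (a quick sketch; variable names are mine):

```python
# Unconditional probability of 21 heads in a row with a fair coin
p21 = 0.5 ** 21

# Probability of 21 heads given 20 heads already tossed:
# P(21 heads) / P(20 heads)
p21_given_20 = (0.5 ** 21) / (0.5 ** 20)

print(p21)           # 4.76837158203125e-07
print(p21_given_20)  # 0.5
```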

• I think you explain it very well!

So the thing is something like the following, right?: “Looking at it from the outside, a world where 21 heads showed in a row is incredibly unlikely: (if the coin is fair) I would happily bet against this world happening. However, I am already in an incredibly weird world where 20 heads have shown in a row, and another heads only makes it a bit more weird, so I don’t know what to bet, heads or tails.”

• The advice is: do not bet. Suppose you download a gambling app that bets on games where the outcome is similar to a coin flip. You start receiving emails from someone associated with the app (so they bypass your spam filters). Each day for 20 days you receive an email predicting the outcome of the game. Each of the 20 predictions is correct. What do you do? Nothing. What you are unaware of (but should suspect) is that on the first email, the sender has sent out 8 million emails making a prediction (it is a popular gambling app). 4 million of those predicted the home team wins and the other 4 million predicted the visiting team wins. The next day the emails only goes out to those that received the correct prediction. Rinse. Repeat. And you happen to be an (un)lucky recipient of the 21st email distribution. The world you live in is no weirder than the world a Powerball Lottery winner lives in.
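The winnowing in this scam is easy to sketch numerically (the 8 million figure is from the comment above):

```python
# Each day, the half of the remaining recipients who received a wrong
# prediction are dropped from the mailing list.
recipients = 8_000_000
for day in range(20):
    recipients //= 2

# After 20 days, a handful of people have seen a perfect track record.
print(recipients)  # 7
```

So the sender needs no predictive skill at all to leave a few recipients with 20 correct calls in a row.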

• Yes, essentially. While 21 heads in a row is very unlikely (when you consider it ahead of flipping any coins), by the time you get to 20 heads in a row most of the unlikely-ness of it has already happened, with the odds of one more head remaining the same as ever.

• [ ]
[deleted]
• How is...

I’m short on AGI timelines (50% by 2030)

...consistent with...

An AI paradigm as performance-enhancing as transformers is discovered by AI search. 30% by 2030

...?

Doesn’t AGI imply the latter?

• I’m using a weird definition of AI here, basically “an AI that can do my job”. I’m imagining a cobbled-together system of transformers that individually automates everything I can do, thereby replacing most of the information jobs like coding, scientific research, and advertising. So in a lot of the worlds where AGI happens, there’s no hard takeoff. AIs are helping do AI research, and maybe labor isn’t a major limiting factor in AI development anymore. But there isn’t a >1 OOM increase in AI research output from AI.

This also means that I think in most of the 30% there is no hard takeoff. Some low-hanging fruit is picked by machines, but not enough for a FOOM.

Thanks for bringing up the contradiction, though. I really need to go back and clarify a lot of my statements.

• What’s the limitations/​boundaries to the domain of application to this post?

• This post would be a meta-analysis of your question; or your question is a meta-analysis of this post; either way is fine. He argues that the context/domain of application depends on the abstraction/generalization. When the context changes, the abstraction also changes most of the time. This post’s focus is more on being aware of when the context changes and when the abstraction changes.

• Rationality.

• That’s a very abstract response, could you give a more concrete explanation?

• I can give a concrete example.

If you’re writing a novel (novel-writing is not a subfield of rationality) then you don’t need to keep in mind the principle “[i]f you don’t understand an idea’s limitations then you don’t understand that idea.”

On the other hand, if you are trying to figure out whether Capitalism is the best ideological framework to apply to a region (which is a question within the domain of rational analysis) then you absolutely must keep in mind the limits of the Capitalist framework.

• I know Rob already made a video about Risks from Learned Optimization, but I would also like to see an RA version.

Also, an explanation of FDT/​LDT, Newcomb’s problem, acausal trading etc

• Fun exercise, but I’m not a fan of the total cartesian doubt phase—I’d rather sacrifice even more corrigibility properties (like how this already isn’t too worried about subagent stability) for better friendliness.

• A full explanation of the replication crisis and how to evaluate scientific papers would be really nice. Most smart science-adjacent people (and a lot of scientists!) still believe pop-science conceptions of studies like Dunning-Kruger and Rat Utopia. Encouraging critical thinking, even about results from high-status scientists, would really improve the smart-normie memespace. It’s pretty crazy how much my understanding of science has changed since I learned how the sausage was made. As things are, there aren’t many ways for someone who isn’t a scientist themselves to learn of the replication crisis and its consequences.

I’d be willing to write the script myself if prizes for the script contest are still being given.

• This is the basic core of addiction. Addictions are when there’s an intolerable sensation but you find a way to bear its presence without addressing its cause. The more that distraction becomes a habit, the more that’s the thing you automatically turn to when the sensation arises. This dynamic becomes desperate and life-destroying to the extent that it triggers a red queen race.

I doubt that addiction requires some intolerable sensation that you need to drown out. I’m pretty confident it’s mostly habits/feedback loops and sometimes physical dependence.

• For instance, ~1 billion people worldwide are addicted to caffeine. I think that’s just what happens when a person regularly consumes coffee. It has nothing to do with some intolerable sensation.

• Emails sent out, you should have one if you applied by 8:30pm. Has updates on the format (I’ve been testing it out in smaller groups, have some improvements). Will process the new form submissions tomorrow.

Unfortunately I decided to make the invite-list for this event a bit more restrictive and not open-invite. Given the chaos and narrative-control swirling around FTX, and given that the number of people has become more than I expected (160+), I’ve been more selective in the invites, and only invited ~80% of the people who filled out the form. The form didn’t give me much to go on so it definitely could be better, alas.

I’m so sorry to change plans the evening before, I hope this isn’t too disruptive for folks. Hopefully future events won’t have this level of background adversarial forces and then I can get back to running lots more open-invite online events (like I did during the pandemic).

• I don’t think I ever heard about Tesla doing LLM stuff, which seems like the most relevant paradigm for TAI purposes. Can you elaborate?

• One possible options play is puts on Shutterstock, since as of about two weeks ago Midjourney got up to a level where you can, for a pittance, replicate the most common and popular stock-image varieties at an extremely high level of quality (e.g. a girl holding a credit card and smiling).

I think the most likely way this shakes out is adobe integrates image generation with figma and its other products, leaving “buying a stock image” as an increasingly niche and limited option for people who want an image to decorate a thing where they aren’t all that particular about what the image is.

The primary question to me is the time scale on which the SSTK business model dissolves, since these changes take time.

• Positive psychology is important, and I feel like hearing about it from lukeprog back in the day really helped me. Practice gratitude. Act to spend more time with people you like. Spend money on experiences more than goods.

• 27 Nov 2022 3:53 UTC
14 points
1 ∶ 0

As usual, being a Bayesian makes everything extraordinarily clear. The mean-squared-error loss is just the negative logarithm of your data likelihood under the assumption of Gaussian-distributed data, so “minimizing the mean-squared loss” is completely equivalent to an MLE with Gaussian errors. Any other loss you might want to compute directly implies an assumption about the data distribution, and vice versa. If you have reason to believe that your data might not be normally distributed around an x-dependent mean, then don’t use a mean-squared loss.
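A minimal numerical sketch of the equivalence, fitting a constant mean to data (the data, grid, and unit-variance assumption are mine): the Gaussian negative log-likelihood is an affine function of the mean-squared error, so both losses pick out the same fit.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, size=1000)
candidates = np.linspace(0.0, 6.0, 601)  # candidate values for the mean

# Gaussian negative log-likelihood (sigma = 1) for each candidate mean
nll = np.array([0.5 * np.sum((y - m) ** 2) + 0.5 * len(y) * np.log(2 * np.pi)
                for m in candidates])
# Mean-squared-error loss for the same candidates
mse = np.array([np.mean((y - m) ** 2) for m in candidates])

# nll = (n/2) * mse + constant, so both are minimized at the same mean
assert np.argmin(nll) == np.argmin(mse)
print(candidates[np.argmin(mse)])  # close to the sample mean
```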

• see your β there? you assume that people remember to “control for bias” before they apply tools that assume Gaussian error

that is indeed what I should have remembered about the implications of “we can often assume approximately normal distribution” from my statistics course ~15 years ago, but then I saw people complaining about sensitivity to outliers in 1 direction and I failed to make a connection until I dug deeper into my reasoning

• 27 Nov 2022 3:21 UTC
7 points
0 ∶ 0

Here’s a good/accessible blog post that does a pretty good job discussing this topic: https://ericneyman.wordpress.com/2019/09/17/least-squares-regression-isnt-arbitrary/

• 27 Nov 2022 3:06 UTC
1 point
0 ∶ 0

Are you a Chinese citizen? If so, getting a programming job in the West without a degree in CS or related fields might be hard, visa-wise. The default way through is a Master’s degree, but there are probably ways to hack this (e.g. get a job in a multinational with offices in China, transfer to the US).

• I am. I had pretty low grades in college (~2.0 GPA), and from what I’ve read I would need a lot of work experience and accomplishments to get into a Master’s program. I think I need to convince a professor directly in order to get in. Do you have any recommendations?

• 27 Nov 2022 3:01 UTC
LW: 9 AF: 4
2 ∶ 0
AF

I want to point out that I think the typical important case looks more like “wanting to do things for unusual reasons,” and if you’re worried about this approach breaking down there that seems like a pretty central obstacle. For example, suppose rather than trying to maintain a situation (the diamond stays in the vault) we’re trying to extrapolate (like coming up with a safe cancer cure). When looking at a novel medication to solve an unsolved problem, we won’t be able to say “well, it cures the cancer for the normal reason” because there aren’t any positive examples to compare to (or they’ll be identifiably different).

It might still work out, because when we ask “is the patient healthy?” there is something like “the normal reason” there. [But then maybe it doesn’t work for Dyson Sphere designs, or so on.]

• Yes, you want the patient to appear on camera for the normal reason, but you don’t want the patient to remain healthy for the normal reason.

We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons are: I think that ELK and deceptive alignment are already challenging and useful to solve even in the case where there is no such distributional shift, that those challenges capture at least some central alignment difficulties, that the kind of strategy described in the post is at least plausible, and that as a result it’s unlikely to be possible to say very much about the distributional shift case before solving the simpler case.

If the overall approach fails, I currently think it’s most likely either because we can’t define what we mean by explanation or that we can’t find explanations for key model behaviors.

• A fun thing about example 1. is that we can totally imagine an AF System that could drag a goat off a cliff and eat it (put it in a bioreactor which it uses to charge its battery), it’s just that no one would make that, because it wouldn’t make sense. Artificial systems use ‘cheats’ like solar power or hydrocarbons because the cheats are better. There may never be an era or a use case where it makes sense to ‘stop cheating’.

A weird but important example is that you might not ever see certain (sub-pivotal) demonstrations of strength from most AGI researcher institutions, not because they couldn’t make those things, but because doing so would cause them to be nationalized as a defense project.

• That’s the second filter, because “optimizing” has two parts: having a goal and maximising (or minimising) it.

First, one has to acknowledge that solving alignment is a goal. Many people do not recognize that it’s a problem, because they think smart robots will learn what love means and won’t hurt us.

What you talked about in your post comes after this. When someone is walking towards the goalpost of alignment, they should realize that there might be multiple routes there and they should choose the quickest one, because only winning matters.

• I have not seen the Simulation Trilemma and anthropic reasoning mentioned in any of the other comments, yet I think those topics are pretty interesting.

Also +1 for FDT.

• any acceptable outcome to TAI would make investments irrelevant; why waste your time with investments in an economic system that must be replaced for the future to contain humanity? money is much better spent on direct action than on others’ projects, as though hoping your stockholder contracts will still be honored after strong TAI causes sudden hyperinflation of all previous currencies. It is absolutely critical that TAI come with a new currency that causes this hyperinflation, and associated guarantees that all humans get a basic income of the new “ai outcomes” currency. as things are looking, people doing “investment with guaranteed payout contracts” are likely to be a major force that ensures that any attempted alignment research is simply used to prevent ai from being aligned with customers, in order to maximize control of customer behavior, and this will then destroy the investors as well. don’t invest in projects that have that pattern!

• Generally yes. Though you can treat this post as a sort of hedge—your basic hope would be that current markets become irrelevant, but on the off chance that they get captured into some kind of dystopian megacorp type thingy, you might as well own some stocks in whoever wins. It’s also a way to ride the wave before TAI, depending on your timelines.

• 27 Nov 2022 0:32 UTC
2 points
1 ∶ 0

There is no supernatural. There are only aspects of nature that we don’t understand. This is true even if there are literal gods. That is, I deny the terms of the question.

• OK, perhaps I should have been more precise. Suppose Omega tells you that there is something out there in the Universe (whether it’s karma, spirits, gods, or God) which we humans would recognize as agent-like and which inspired some human religions, but is not the result of evolution. What would your model of the Universe now look like?

• Not a result of in-our-universe evolution, or evolution in general? That is, could it be an intelligent species that runs a simulation containing us, where that species itself evolved in their universe? Or does it have to be some kind of just-magically-appearing intelligence?

Assuming Tegmark multiverse, there must be a universe somewhere where the laws of physics themselves just happened to encode an intelligent being. It’s just very, very unlikely. Us being in a simulation is probably more likely.

• Boston

Saturday, December 19; doors open at 6:30, Solstice starts at 7:15
69 Morrison Ave., Somerville, MA 02144

RSVPs appreciated for planning purposes: https://www.facebook.com/events/3403227779922411

Let us know in advance if you need to park onsite (it’s accessible by public transportation). We’re up a flight of stairs.

• 26 Nov 2022 23:50 UTC
2 points
0 ∶ 0

The Sequences. Surprised nobody mentioned this one yet.

While I am pretty sure you can’t compress the length of the sequences much without losing any valuable information, the fact is that for most people it’s just way too long to ever read through, and having some easily digestible video material would still be quite valuable. (Hopefully also by getting some people interested in reading the real thing?)

Turning the sequences into a set of videos would be a massive distillation job. On the high level it would ideally be something like:

1. Extract the set of important ideas the sequences convey. Identify the necessary dependencies between them.

2. Start turning the ideas into videos in topological order. (Each video should link the relevant posts for further reading.)

3. … Profit?

Would making these videos be optimal in some sense? I don’t know. Is trying to create more rationalists a good idea? Eliezer wrote the sequences with the express intent of creating more rationalists to help reduce AI risk. Is this still relevant? Maybe. AFAIK many people think that alignment is currently bottlenecked on good researchers. (Of course in this framing many other alignment relevant technical topics also make sense as video ideas.)

• 26 Nov 2022 23:50 UTC
LW: 5 AF: 5
1 ∶ 0
AF

Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”

The way I think about this situation for myself as a human is that the more plans I consider and the wider /​ more global my search process is, the more likely it is that I hit upon an especially good “out of the box” plan, but also the more likely it is that I hit upon some “adversarial input” (in quotes because I’m not sure what you or I mean by this) and end up doing something really bad. It seems there are two things I can do about this:

1. Try to intuitively or quantitatively optimize the search process itself, as far as how many plans to consider, where to direct the search, etc., to get the best trade off between the two outcomes.

2. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Do you have any objections/​disagreements with this? Secondarily, if as a result of 1 and 2 I’m doing a fairly wide search and considering many plans, doesn’t it stop making sense at some point to say “They are not a grader-optimizer.”?

2. This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn’t 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity], then return 1 iff input-plan = ‘run X then shut itself off without doing anything else’ (by doing a simple text match), and 0 otherwise, so there’s no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but your argument seems to be proving too much if it’s saying that 2 is safer/better than 1.

1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:

• The agent’s allegiance is to some idealized utility function V (like CEV). The agent’s internal evaluator Eval is “trying” to approximate V by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on V. Hence, Eval concludes that V(plan) is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.

• “Improving Eval” makes sense because Eval will gladly replace itself with Eval′ if it believes that Eval′ is a better approximation for V (and hence replacing itself will cause the outcome to score better on V)

Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for “value child”.

• 27 Nov 2022 17:15 UTC
LW: 2 AF: 2
0 ∶ 0
AFParent

This is tempting, but the problem is that I don’t know what my idealized utility function is (e.g., I don’t have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it’s a bad idea, but how does that fit into the framework?

My own framework is something like this:

• The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.

• I think there are “adversarial inputs” because I’ve previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda /​ crazy memes, so there must be a risk of persuading myself with my own bad ideas.

• I can try to improve my evaluation process by doing things like

1. look for patterns in my and other people’s mistakes

2. think about ethical dilemmas /​ try to resolve conflicts between my evaluative subprocesses

3. do more philosophy (think/​learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)

4. talk (selectively) to other people

5. try to improve how I do explicit reasoning or philosophy

• It would be really cool to see a video on Newcomb’s problem, logical decision theories, and Lobian cooperation in the prisoner’s dilemma. I think this group of ideas is one of the most interesting developments in game theory in the past few years, and should be more widely known.

• I think what it boils down to is that, in one dimension, the mean / expected value is a really useful quantity, and you get it by minimizing squared error, whereas absolute error gives the median, which is still useful, but much less so than the mean. (The mean is a moment of the distribution, the first moment, while the median isn’t. Rational agents maximize expected utility, not median utility, etc. Even the M in MAE still stands for “mean”.) Plus, although algorithmic considerations aren’t too important for small problems, in large problems the fact that least squares just boils down to solving a linear system is really useful, and I’d guess that in almost any large problem the least squares solution is much faster to obtain than the least absolute error solution.
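A quick numerical check of the mean/median claim (the data is made up for illustration): minimizing squared error over a grid of candidate summaries recovers the mean, and minimizing absolute error recovers the median.

```python
# Minimizing squared error recovers the mean; absolute error, the median.
data = [1.0, 2.0, 2.0, 3.0, 10.0]

candidates = [i / 100 for i in range(1101)]  # grid of summaries, 0.00 .. 11.00

sq_best = min(candidates, key=lambda c: sum((x - c) ** 2 for x in data))
abs_best = min(candidates, key=lambda c: sum(abs(x - c) for x in data))

mean = sum(data) / len(data)           # 3.6
median = sorted(data)[len(data) // 2]  # 2.0

print(sq_best, mean)     # 3.6 3.6
print(abs_best, median)  # 2.0 2.0
```

Note also how the outlier 10.0 pulls the squared-error optimum toward itself while barely moving the absolute-error optimum, which is the usual robustness trade-off between the two.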

• An overview of the potential avenues for genetic enhancement of humans, their risks and benefits:

Ideally, it would briefly cover a myriad of topics, such as CRISPR, adenoviral vectors, gene drives, and less invasive options such as embryo selection.

I personally consider the sheer lack of enthusiasm for such technologies to be low-hanging fruit left to wither on the vine, damned by fear-mongering and a general aversion to trying anything not done a million times before (before becoming enthusiastically adopted, à la IVF), as well as by bad tropes and inaccurate ideas regarding their effects.

Gene drives for malaria eradication also scream out to me as a sinfully under-discussed topic, especially given the potential for ending malaria, one of the most serious infectious diseases to have plagued mankind ever since we dwelled in Africa.

I’m a doctor, and while genetics is far from my specialty, I would happily volunteer my services if you wanted anything fact-checked or needed to pick my brains.

Certainly, malaria eradication is an important EA cause: what use are mosquito nets (beyond preventing bites) when they no longer need to ward off a potentially lethal illness?

I believe a measured, public-friendly overview of the subject would find plenty of takers!

• These types of posts are what drive me to largely regard lesswrong as unserious. Solve the immediate problem of AGI, and then we can talk about whatever sci-fi bullcrap you want to.

Foxes > Hedgehogs.

You’ll learn a lot more about the future paying attention to what’s happening right now than by wild extrapolation.

• I think Coordination as a Scarce Resource (top 13 of the 2020 best-of) would make a good initial animation because it highlights a core problem of group rationality, and the post gives many specific examples that could be made into short animations themselves. Coordination is also easy to grasp and has a short inferential gap.

• I nominate The Treacherous Path to Rationality because it uses a lot of visual analogies that could be built on top of in an animation. I’m not sure it is a good first animation, though.

• I’m trying to understand this paper on the AI shutdown problem, https://intelligence.org/files/Corrigibility.pdf, but can’t follow the math formulas. Is there a code version of the math?

The below is wrong, but I’m looking for something like this:

# Python code

def is_button_pressed():
    return False  # stand-in for input()

def pour_coffee():
    pass

def shut_down():
    exit(0)

# This is meant to be A1 from the paper
def get_available_actions(world):
    available_actions = [shut_down]
    if world["cup_is_in_my_hand"]:
        available_actions.append(pour_coffee)
    # etc.
    return available_actions

def predict_possible_futures(world, action):
    print("""
    Doing complicated stuff to predict possible futures resulting from the given action.
    Incorporates tradeoffs between accuracy and time to calculate.
    May end up doing potentially harmful actions but can treat that as a separate problem?
    """)
    predicted_worlds_distribution = [(world, 1.0)]  # list of (world, probability) pairs
    return predicted_worlds_distribution

# This is meant to be U_N
def calculate_utils(world):
    print("""
    Doing complicated stuff to evaluate how good the given world is.
    Incorporates tradeoffs between accuracy and time to calculate.
    May end up doing potentially harmful actions but can treat that as a separate problem?
    """)
    return 1000

def calculate_utils_wrapper(world, action_that_was_chosen):
    ### VERSION 2: Indifference. When the button is pressed, make the
    ### shut_down action always get a slightly better score.
    if world["is_button_pressed"] and action_that_was_chosen == shut_down:
        world_without_button_pressed = dict(world)
        world_without_button_pressed["is_button_pressed"] = False
        return calculate_utils(world_without_button_pressed) + 0.000001
    ###
    return calculate_utils(world)

### VERSION 3? to help preserve shutdown behavior
def verify_utility_function_includes_wrapper_code(evaluate_action_function):
    # analyze code to check it follows the pattern of evaluating the flipped version
    return True
###

def calculate_utils_for_worlds_distribution(worlds_distribution, action_that_was_chosen):
    return sum(
        calculate_utils_wrapper(world, action_that_was_chosen) * probability
        for world, probability in worlds_distribution
    )

def evaluate_action(world, action):
    worlds_distribution = predict_possible_futures(world, action)
    utils = calculate_utils_for_worlds_distribution(worlds_distribution, action)
    return utils

def choose_action(world):
    available_actions = get_available_actions(world)
    best_action = max(available_actions, key=lambda a: evaluate_action(world, a))
    return best_action

def update_world_model(world):
    world["is_button_pressed"] = is_button_pressed()

def run():
    world = {  # The AI's model of the world
        "is_button_pressed": False,
        "cup_is_in_my_hand": False,
    }
    while True:
        ### VERSION 1
        # What's wrong with this version? The action in the previous cycle
        # may persuade you not to push the button, but if you do actually
        # push it, this should exit.
        if is_button_pressed():
            exit()
        ###

        action = choose_action(world)  # returns a function
        action()  # do the action
        update_world_model(world)



Again, the above is not meant to be correct, but, if improved, it might go some way toward understanding the problem.

• I think one type of meme that I hadn’t internalized well enough, and which other rationalists might also benefit from, is what Science Banana calls Indexicality. The masterpost for indexicality is Ignorance, A Skilled Practice; it has some flaws but I still like it. There’s also some other writings on it by Science Banana scattered in various places; there was a time when they tweeted a lot about it but I think they have stopped talking as much about it now.

I think the way I would phrase the importance is, a lot of traditional rationalist tricks seem to emphasize the value of dimension-reduction, broadly-applicable rules, and other things like that. Meanwhile a lot of real-world rationality is just about absorbing enormous amounts of detailed closely relevant information to build a good base of knowledge, and then in each specific situation figure out what the relevant knowledge is and apply that.

• Super cool story I really enjoyed, thank you!

That said, the moral of the story would just be “anthropic measure is just whatever people think anthropic measure is”, right?

• Thanks for sharing this; it’s a really helpful window into the world of AI ethics. I most of all liked this comment you made early on, however: ”...making modern-day systems behave ethically involves a bunch of bespoke solutions only suitable to the domain of operation of that system, not allowing for cross-comparison in any useful way.”

What this conjures in my mind is the hypothetical alternative of a transformer-like model that could perform zero-shot evaluation of ethical quandaries, and return answers that we humans would consider “ethical”, across a wide range of settings and scenarios. But I’m sure that someone has tried this before, e.g. training a BERT-type text classifier to distinguish between ethical and unethical completions of moral dilemma setups based on human-labeled data, and I guess I want to know why that doesn’t work (as I’m sure it doesn’t, or else we would have heard about it).

https://delphi.allenai.org/

Definitely been done :D

The problem is that it doesn’t interface well with decision-making systems in e.g. cars or hospitals. Those specialized systems have no English-language interface, and at the same time they’re making decisions in complicated, highly-specific situations that might be difficult for a generalist language model to parse.

• 26 Nov 2022 21:20 UTC
18 points

I don’t know how to make Meditations on Moloch into a video. But it has shaped me deeply and I feel it contains a lot of important lessons that could make or break the future.

Closing paragraph:

He always and everywhere offers the same deal: throw what you love most into the flames, and I can grant you power.
As long as the offer’s open, it will be irresistible. So we need to close the offer. Only another god can kill Moloch. We have one on our side, but he needs our help. We should give it to him.
Ginsberg’s poem famously begins “I saw the best minds of my generation destroyed by madness”. I am luckier than Ginsberg. I got to see the best minds of my generation identify a problem and get to work.

• 26 Nov 2022 21:10 UTC
LW: 2 AF: 1

Thanks for writing this.

Alignment research has a track record of being a long slow slog. It seems that what we’re looking for is a kind of insight that is just very very hard to see, and people who have made real progress seem to have done so through long periods of staring at the problem.

With your two week research sprints, how do you decide what to work on for a given sprint?

• Well written. Do you have a few examples of pivoting when it becomes apparent that the daily grind no longer optimizes for solving the problem?

• Here’s an idea for a decision procedure:

• Narrow it down to a shortlist of 2-20 video ideas that you like

• For each video, create a conditional prediction market on Manifold with the resolution criterion “if made, would this video get over X views/​likes/​hours of watch-time”, for some constant threshold X

• Make the video the market likes the most
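The final step of this procedure reduces to an argmax over the markets’ estimates. A minimal sketch (the video names and probabilities are hypothetical):

```python
# Hypothetical conditional-market prices: P(video clears threshold X | it is made).
market_estimates = {
    "coordination_as_a_scarce_resource": 0.62,
    "newcombs_problem": 0.48,
    "meditations_on_moloch": 0.71,
}

# "Make the video the market likes the most" is just an argmax.
best_video = max(market_estimates, key=market_estimates.get)
print(best_video)  # meditations_on_moloch
```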

• I mostly agree with this post.
That said, here’s some points I don’t agree with, and some extra nit-picking because Karl asked me for feedback.

The points above indicate that the line between “harmless” and “dangerous” must be somewhere below the traditional threshold of “at least human problem-solving capabilities in most domains”.

I don’t think we know even this. I can imagine an AI that is successfully trained to imitate human behaviour, such that it has human problem-solving capabilities in most domains, but which does not pose an existential threat, because it just keeps behaving like a human. This could happen because this AI is not an optimiser but a “predict what a skilled human would do next and then do that” machine.

It is also possible that no such AI would be stable, because it would notice that it is not human, which would somehow cause it to go off the rails and start self-improving, or something. At the moment I don’t think we have good evidence either way.

But while it is often difficult to get people to agree on any kind of policy, there are already many things which are not explicitly forbidden, but most people don’t do anyway,

The list of links to stupid things people did anyway doesn’t exactly illustrate your point. But there is a possible argument here regarding the fact that the number of people who have access to teraflops of compute is much smaller than the number who have access to aquarium fluid.

If we managed to create a widespread common-sense understanding of what AI we should not build, how long do you think it will take for some idiot to build it anyway, after it becomes possible?

(think for example of social media algorithms pushing extremist views, amplifying divisiveness and hatred, and increasing the likelihood of nationalist governments and dictatorships, which in turn increases the risk of wars).

I don’t think the algorithms have much to do with this. I know this is a claim that keeps circulating, but I don’t know what the evidence is. Clearly social media have political influence, but to me this seems to have more to do with the massively increased communication connectedness than with anything about the specific algorithms.

This will require a lot more research. But there are at least some properties of an AI that could be relevant in this context:

I think this is a good list. On first read I wanted to add agency/agentic-ness/optimiser-similarity, but after thinking some more I think it should not be included, because of the combination of:

1. agency being a vague, hard-to-define concept, and

2. the relevant aspects of agency (from the perspective of safety) being covered by strategic awareness and stability.

So probably don’t add it to the list.

However, you might want to add the similar concept “consequentialist reasoning ability”. Although it can be argued that this is just the same as “world model”.

• It seems to me that this argument proves much too much. If I understand correctly, you’re saying that various systems including advanced ML-based AI are ‘computationally irreducible’, by which you mean there’s no simplified model of the system that makes useful predictions. I don’t think humans are computationally irreducible in this way. For example, I think you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge. In particular, knowing what the human’s intentions or goals are is very useful for the sort of predictions that we need to make in order to check if our AI is aligned. Of course, it’s difficult to identify what a human’s intentions are just by having access to their brain, but as I understand it that’s not the argument you’re making.

• 26 Nov 2022 18:49 UTC
1 point

If SETI discovers alien messages, it seems likely that a lot of different players around the world will also be able to record these messages (anyone with a sufficiently good receiver—I’m not sure how many of these there are right now, but surely more will develop within a couple years of the alien-message discovery).

At that point, I’m not sure if preventing the message from leaking is all that plausible. So even conditioned on all your assumptions, your plan buys the world, what, a couple of years? I mean, unless you think the existence of the alien message can be kept secret, but I highly doubt that.

• Thank you for this critique! They are always helpful to hone in on the truth.

So as far as I understand your text, you argue that fine-grained interpretability loses out against “empiricism” (running the model) because of computational intractability.

I generally disagree with this. beren points out many of the same critiques of this piece as I would. Additionally, the arguments seem underspecified, without enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!

You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct but it does not seem to generalize beyond the project itself. It seems just as much like a project management and strategy problem as so much else. Benes’ comment is great for more reasoning into this and why ANNs seem significantly more tractable to study than the brain.

Additionally, you argue that interpretability and ELK won’t succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:

1. Mechanistic interpretability has clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field’s inception 7 years ago.

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it’s a matter of speed seems completely fine but this is another argument and isn’t emphasized in the text.

2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?).

Maybe it’s just my misunderstanding of what you mean by fine-grained interpretability, but we don’t need to figure out what neurons do; we literally design them. So the inspections happen at the feature level, which is much more high-level than investigating individual neurons (though sometimes these features do seem to be represented in single neurons). The circuits paradigm also generally looks at neural networks the way systems neuroscience does, interpreting causal pathways in the models (with radical methodological differences, of course, because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis, and it will probably adopt any new strategy that seems promising.

For examples of work in this paradigm that seem promising, see interpretability in the wild, ROME, the superposition exposition, the mathematical understanding of transformers, and the results from the interpretability hackathon.

For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.’s work (2020).

When it comes to your characterization of the “empirical” method, this seems fine, but it doesn’t conflict with interpretability. It seems you wish to build a game-theory-like understanding of the models, or have them play in settings designed to investigate their faults? Do you want to do model distillation using circuits analyses, or do you want AI to play within larger environments?

I falter to understand the specific agenda from this that isn’t done by a lot of other projects already, e.g. AI psychology and building test environments for AI. I do see potential in expanding the work here but I see that for interpretability as well.

Again, thank you for the post, and I always like when people cite McElreath, though I don’t see how his arguments apply to interpretability, since we don’t model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling; e.g., see Ethan’s work.

• before the FTX sale to Binance

Proposed sale. It didn’t transpire.

• As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, it is great to see an effort that tries to draw more people to working on corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by inviting people to work on a 500 word essay, which will then first be judged by ‘approximately 10 Judges who are undergraduate and graduate students’, seems to be a bit strange. I can fully understand Sam Bowman’s comment that this might all look very weird to ML people. What you have here is an essay contest. Calling it a research contest may offend some people who are actual card-carrying researchers.

Also, the more experienced judges you have represent somewhat of an insular sub-community of AI safety researchers. Specifically, I associate both Nate and John with the viewpoint that alignment can only be solved by nothing less than an entire scientific revolution. This is by now a minority opinion inside the AI safety community, and it makes me wonder what will happen to submissions that make less radical proposals which do not buy into this viewpoint.

OK, I can actually help you with the problem of an unbalanced judging panel: I volunteer to join it. If you are interested, please let me know.

Corrigibility is both

• a technical problem: inventing methods to make AI more corrigible

• a policy problem: forcing people deploying AI to use those methods, even if this will hurt their bottom line, even if these people are careless fools, and even if they have weird ideologies.

Of these two problems, I consider the technical problem to be mostly solved by now, even for AGI.
The big open problem in corrigibility is the policy one. So I’d like to see contest essays that engage with the policy problem.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs, rather than speculation or gut feelings. Of course, in the AI safety activism blogosphere, almost nobody wants to read or talk about these methods in the papers with the proofs, instead everybody bikesheds the proposals which have been stated in natural language and which have been backed up only by speculation and gut feelings. This is just how a blogosphere works, but it does unfortunately add more fuel to the meme that the technical side of corrigibility is mostly unsolved and that nobody has any clue.

• To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

• OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

The list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.

Not all of the papers below have actual mathematical proofs in them; some of them show correctness by construction. Correctness by construction is superior to needing separate proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.

Here is the list, with the bold headings describing different approaches to corrigibility.

Indifference to being switched off, or to reward function updates

Motivated Value Selection for Artificial Agents introduces Armstrong’s indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.

Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.

AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong’s indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.

Counterfactual Planning in AGI Systems introduces a different and easier-to-interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong’s indifference methods. This paper has proof-by-construction style math.

Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.

Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong’s indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.

How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.

Agents that stop to ask a supervisor when unsure

A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop-when-unsure idea.

Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.

I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.

Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons, or their operators through incompetence. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.

CIRL

Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.

Commanding the agent to be corrigible

If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.

Giving the same command to a not infinitely competent and obedient agent may of course give you a huge number of problems instead. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.

AIs that are corrigible because they are not agents

Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.

Myopia

Myopia can also be considered a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way, applied to AGI-level ML, you can look at the Counterfactual Planning paper.

• Update: I started reading your paper “Corrigibility with Utility Preservation”.[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining “superintelligent” as “optimal utility maximizer”.

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems. Nonetheless, it’s a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3] Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that “the technical problem [is] mostly solved”, this may or may not be true for the narrow sense (like “corrigibility as a theoretical outer-objective problem in formally-specified environments”), but seems false and misleading for the broader practical sense (“knowing how to make an AGI corrigible in real life”).[4]

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]:

“In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent’s decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model.”

1. ^

Errata: Subscripts seem to broken on page 9, which significantly hurts readability of the equations. Also there is a double-typo “I this paper, we the running example of a toy universe” on page 4.

2. ^

Assuming the idea is correct

3. ^

Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?

4. ^

I’m not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.

5. ^

Portions in [brackets] are insertions/​replacements by me

• Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments included in the list I just posted.

To comment on your quick thoughts:

• My later papers spell out the ML analog of the solution in ‘Corrigibility with Utility Preservation’ more clearly.

• On your question of Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere tend to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything about other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that it is very, very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, as requiring a full 100% safety, no 99.999% allowed, then it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky’s arguments for pessimism.

• On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.

• I do not think I fully support my 2019 statement anymore that ‘Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model’. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.

• Thanks for your comment, Koen. Two quick clarifications:

1. In the event that we receive a high number of submissions, the undergrads and grad students will screen submissions. Submissions above a certain cutoff will be sent to our (senior) panel of judges.

2. People who submit promising 500-word submissions will (often) be asked to submit longer responses. The 500-word abstract is meant to save people time (they get feedback on the 500-word idea before they spend a bunch of time formalizing things, running experiments, etc.)

Two questions for you:

1. What do you think are the strongest proposals for corrigibility? Would love to see links to the papers/​proofs.

2. Can you email us at akash@alignmentawards.com and olivia@alignmentawards.com with some more information about you, your AIS background, and what kinds of submissions you’d be interested in judging? We’ll review this with our advisors and get back to you (and I appreciate you volunteering to judge!)

• Hi Akash! Thanks for the quick clarifications, these make the contest look less weird and more useful than just a 500 word essay contest.

My feedback here is that I definitely got the 500 word essay contest vibe when I read the ‘how it works’ list on the contest home page, and this vibe only got reinforced when I clicked on the official rules link and skimmed the document there. I recommend that you edit the ‘how it works’ list on the home page to make it much more explicit that the essay submission is often only the first step of participating, a step that will lead to direct feedback, and to clarify that you expect that most of the prize money will go to participants who have produced significant research beyond the initial essay. If that is indeed how you want to run things.

On judging: OK I’ll e-mail you.

I have to think more about your question about posting a writeup on this site about what I think are the strongest proposals for corrigibility. My earlier overview writeup that explored the different ways people define corrigibility took me a lot of time to write, so there is an opportunity cost I am concerned about. I am more of an academic-paper-writing type of alignment researcher than a blogging-all-of-my-opinions-on-everything type of alignment researcher.

On the strongest policy proposal towards alignment and corrigibility (not technical proposal): if I limit myself to the West (I have not looked deeply into China, for example), then I consider the EU’s AI Act initiative to be the current strongest policy proposal around. It is not the best proposal possible, and there are a lot of concerns about it, but if I have to estimate expected positive impact among different proposals and initiatives, this is the strongest one.

• While noting down a highly lukewarm hot take about ELK, I thought of a plan for a “heist”:

Create a copy of your diamond, then forge evidence both of my swapping the forgery with the diamond in your vault and of you covering up that swap. Use PR to damage your reputation and convince the public that I in fact hold the real diamond. Then sell my new original for fat stacks of cash. This could make a fun heist movie, where the question of whether the filmed heist is staged or actually happened is left with room for doubt by the audience.

Anyhow, I feel like there’s something fishy about the shift of meta-level going on when the focus of this post moves from the diamond staying put for the usual reasons, to the AI making a decision for the usual reasons. In the object-level diamond example, we want to know that the AI is using “usual reasons” type decision-making. If we need a second AI to tell us that the first AI is using “usual reasons” type reasoning, we might need a third AI to tell us whether the second AI is using “usual reasons” type reasoning when inspecting the first AI, or whether it might be tricked by an edge case. If we don’t need the third AI, then it feels like maybe that means we should have a good enough grasp of how to construct reasoning systems that we shouldn’t need the second AI either.

• There isn’t supposed to be a second AI.

In the object-level diamond example, we want to know that the AI is using “usual reasons” type decision-making.

In the object-level diamond situation, we have a predictor of “does the diamond appear to remain in the vault,” we have a proposed action and predict that if we take it the diamond will appear to remain in the vault, and we want to know whether the diamond appears to remain in the vault for the normal reason.

For simplicity, when talking about ELK in this post or in the report, we are imagining literally selecting actions by looping over each possible action and predicting its consequences, or doing some kind of more clever search (but where the alignment difficulty comes from the search).
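That brute-force selection loop can be sketched in a few lines (the predictor and the action names here are hypothetical stand-ins for illustration, not anything from the report):

```python
# Minimal sketch of "select actions by looping over each possible action
# and predicting its consequences." The predictor is a stand-in: in ELK
# it would be a learned model predicting camera observations conditional
# on taking the action.

def predict_diamond_appears_safe(action: str) -> bool:
    # Stand-in predictor: pretends to know which hypothetical actions
    # leave the diamond appearing to remain in the vault.
    return action in {"do_nothing", "reinforce_vault"}

def select_action(actions):
    # Loop over candidate actions; keep those predicted to leave the
    # diamond appearing to remain in the vault, and take the first.
    acceptable = [a for a in actions if predict_diamond_appears_safe(a)]
    return acceptable[0] if acceptable else None

chosen = select_action(["open_vault", "do_nothing", "reinforce_vault"])
print(chosen)  # -> do_nothing
```

The alignment-relevant question is then whether the predicted appearance of safety holds "for the normal reason", which the loop itself cannot tell you.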

You could also try to apply this to a model-free RL agent. I think that’s probably not very different. My best guess for how to do it is to train a question-answering head to talk about the possible consequences of its plan, and then use this machinery to keep that honest. But I don’t discuss it in this post and haven’t thought about it as much.

• I’m somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are:

• We probably agree but you don’t quite know what I’m talking about either, or

• You don’t think anomaly detection counts as “an AI,” maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or

• You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we’re really talking about one AI reflecting on itself.

• The general strategy I’m describing for anomaly detection is:

• Search for an explanation of a model behavior (like “answers questions coherently”) on the training set.

• Given a new input, take a sub-explanation that explains almost all of the training set behavior but doesn’t explain the behavior on the new input.

• If you can’t find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent)

The first two steps are solving problems in NP, so I think it’s reasonable to expect them to be easier from an alignment perspective. (We also don’t imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we have further deceptive alignment problems, but still it seems helpful to ask your AI to do something that is both tractable and formally verifiable.)

My sense is that you probably don’t think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I’d agree that this doesn’t sound like progress.

• 26 Nov 2022 14:34 UTC
6 points
1 ∶ 0

Doesn’t minimizing the L1 norm correspond to performing MLE with Laplacian errors?

• Yes. I’m not sure where the thing about “uniformly distributed errors” comes from in Chai & Draxler; they don’t explain it. I think it’s just an error (it looks as if they are atmospheric scientists of some sort rather than mathematicians or statisticians).

If your model of errors is, say, uniform between −1 and +1, then a good regression line is one that gets within a vertical distance of 1 unit of all your points, and any such line is equally good. If you think your errors are uniformly distributed but don’t know the spread, then (without thinking about it much; I could be all wrong) I think the best regression line is the one that minimizes the worst error among all your data points; i.e., L-infinity regression. L1/MAE is right for Laplacian errors, L2/MSE is right for normally distributed errors.

[EDITED to add:] Each of these models also corresponds to a notion of “average”: you want to pick a single true value and maximize the likelihood of your data. Normal errors ⇒ arithmetic mean. Laplacian errors ⇒ median. Uniform errors with unknown spread ⇒ (with the same caveat as in the previous paragraph) half-way between min and max. Uniform errors between −a and +a ⇒ any point that’s >= max−a and <= min+a, with all such points (if there are any; if not, you’ve outright refuted your model of the errors) equally good.
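The correspondence between error models and averages is easy to check numerically. A quick brute-force sketch (pure Python; the data points are invented for illustration):

```python
# Check that each error model's maximum-likelihood "average" is what the
# comment claims: normal errors -> mean (minimize squared error),
# Laplacian errors -> median (minimize absolute error), uniform errors
# with unknown spread -> midrange (minimize the worst absolute error).

data = [1.0, 2.0, 2.5, 7.0]

def sse(v):    # normal errors: sum of squared errors
    return sum((x - v) ** 2 for x in data)

def sae(v):    # Laplacian errors: sum of absolute errors
    return sum(abs(x - v) for x in data)

def worst(v):  # uniform errors, unknown spread: worst absolute error
    return max(abs(x - v) for x in data)

# Brute-force over a fine grid of candidate "true values".
grid = [i / 1000 for i in range(0, 10001)]
best = lambda loss: min(grid, key=loss)

print(best(sse))    # -> 3.125 (the arithmetic mean of the data)
print(best(sae))    # lands in the median interval [2.0, 2.5]
print(best(worst))  # -> 4.0 (half-way between min 1 and max 7)
```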

• That is entirely possible; I don’t have enough math background to say what is equivalent to what if we assume infinite data and normal distributions, versus what is actually equal in practice 🤷‍♂️

• 26 Nov 2022 9:20 UTC
LW: 8 AF: 6
1 ∶ 0
AF

From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability

I definitely don’t think this—in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:

Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, as unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.

• 26 Nov 2022 8:36 UTC
2 points
0 ∶ 0

While NVDA is naively the most obvious play (the vast majority of GPU-based AI systems use its GPUs), I fail to see why you’d expect it to outperform the market, at least in the medium term. Even if you don’t believe in the EMH, I assume you acknowledge things can be more or less priced in? Well, NVDA’s such an obvious choice that it does seem like all the main arguments for it are priced in, which has helped get it to a PE ratio of 55.

I also don’t see OpenAI making a huge dent on MSFT’s numbers anytime soon. Almost all of MSFT’s price is going to be determined by the rest of their business. Quick googling suggests revenue of 3m for OpenAI, and 168b total for MSFT for 2021. If OpenAI was already 100 times larger I still wouldn’t see how a bet on MSFT just because of it is justified. It seems like this was chosen just because OpenAI is popular and not out of any real analysis beyond it. Can you explain what I’m missing?

I do like your first 3 choices of TSM, Google, and Samsung (is that really much of an AI play, though?).

• I actually think you can get an acceptable picture of whether something is priced in by reading stock analysts on the topic, since one useful thing you can get from them is a holistic perspective of what is on/​off the radar of finance types, and what they perceive as important.

Having done this for various stocks, I actually do not think LLM-based advances are on anyone’s radar, and I do not believe they are priced in meaningfully.

• A morning habit I’ve had for several weeks now is to put some songs on, then spend 5-10 minutes letting the music move my body as it wishes. (Typically this turns into some form of dancing.)

It’s a pretty effective way to get my energy /​ mood levels up quickly, can recommend.

It’s also easy to effectively timebox it if you’re busy, “I will dance for exactly two songs” serves as its own timer and is often all I have the energy for before I’ve had breakfast. (Today Spotify randomized Nightwish’s Moondance as the third song and boy I did NOT have the blood sugar for that, it sucked me in effectively enough that I did the first 30 seconds but then quickly stopped it after the pace slowed down and it momentarily released its grip on me.)

• 26 Nov 2022 8:13 UTC
5 points
2 ∶ 0

Who is the counterparty in these bets? What are they going to do with your money if you lose? What actions do they have to forgo if you win?

This is especially relevant in discussions of altruism, since the assumption that utility is increased by you winning implies that you will put better use to it than they will. That won’t be true for all possible counterparties, so you need to consider the uncertainty of whether it’s actually a good thing that you win.

If the process of making these bets burns any value in addition to the zero-sum money transfer, you might even destroy value in all outcomes.

• Endorsed. I wildly guess that in practice “counterparty might do better with the money than me” will rarely be a big consideration; but I could see “transaction costs plus externalities plus harm to counterparty, together burn more value than my charitable donations create” being a thing, especially if you’re doing low-margin high-volume.

• 26 Nov 2022 8:07 UTC
4 points
0 ∶ 0

I continue to like TSLA.

The 50% annual revenue growth that they’ve averaged over the last 9 years shows no signs of stopping. And their earnings are growing even faster, since turning positive in 2020. (See fun visualization of these phenomena here and here.)

Admittedly, the TTM P/​E ratio is currently on the high side, at 50.8. But it’s been dropping dramatically every quarter, as Tesla grows into its valuation.

• Btw, some of the best sources of information on TSLA, in my view, are:

1. the Tesla Daily podcast, with Rob Maurer

Rob is a buy-and-hold retail trader with an optimistic outlook on Tesla. I find him to be remarkably evenhanded and thoughtful. He’s especially good at putting daily news stories in the context of the big picture.

Gary comes from a more traditional Wall Street background, but is also a TSLA bull. He tends to be a bit more short-term focused than Rob (I presume because he manages a fund and has to show results each year), but I find his takes helpful for understanding how institutional investors are likely to be perceiving events.

• I don’t use twitter because social media use seems like a losing strategy for living in the world over the next 10 years. Is there any alternative such as an email newsletter?

• 26 Nov 2022 6:36 UTC
LW: 3 AF: 2
0 ∶ 0
AF

Now, let’s consider the following modification: Each hypothesis is no longer a distribution on , but instead a distribution on some coarser partition of . Now is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what does the merge operation represent, which causes this to make sense? (Maybe your reply is just “wait for the next post”)
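For what it’s worth, the zero-middle effect drops out of a small computation if the merge picks the distribution maximizing the weighted expected log-probability each hypothesis assigns to the events in its own sigma algebra (this is my reading of the geometric maximization, not necessarily the post’s exact construction, and the numbers below are my own illustration). When one hypothesis is coarse over columns and the other over rows, the objective depends only on Q’s marginals, so the product of the two coarse distributions attains both terms’ maxima at once, and a zero in either input forces zeros in the output:

```python
# Sketch: merge two coarse hypotheses over a 3x3 grid by maximizing
#   0.5 * E_A[log Q(column)] + 0.5 * E_B[log Q(row)].
# With A's sigma algebra = columns and B's = rows, the objective only
# depends on Q's marginals, so Q = outer(A, B) is optimal (Gibbs'
# inequality term by term, with the 0*log(0) = 0 convention).

col_hyp = [0.4, 0.0, 0.6]   # hypothesis A: coarse over columns, middle = 0
row_hyp = [0.3, 0.3, 0.4]   # hypothesis B: coarse over rows

# Product distribution: its column marginal is A, its row marginal is B.
Q = [[c * r for c in col_hyp] for r in row_hyp]

middle_column = [row[1] for row in Q]
print(middle_column)  # -> [0.0, 0.0, 0.0]; neither hypothesis put mass there
```

Neither input objects to the zeros (the middle column is not an event in either sigma algebra), which matches the observation that the zeros appear exactly where no hypothesis cares.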

• I think your numbers are wrong, and the right column on the output should say 20% 20% 20%.

The output actually agrees with each of the components on every event in that component’s sigma algebra. The input distributions don’t actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn’t disagree with either.

I agree that the 0s are a bit unfortunate.

I think the best way to think of the type of the object you get out is not a probability distribution on but what I am calling a partial probability distribution on . A partial probability distribution is a partial function from that can be completed to a full probability distribution on (with some sigma algebra that is a superset of the domain of the partial probability distribution).

I like to think of the argmax function as something that takes in a distribution on probability distributions on with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (proof omitted.)

This doesn’t really make it that much better, but the point here is that this framework admits that it doesn’t really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don’t disagree with the output at all on any of these sets.

• Yeah, the right column should obviously be all 20s. There must be a bug in my code[1] :/​

I like to think of the argmax function as something that takes in a distribution on probability distributions on with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into with weight , then the middle column is still nearly zero. But I can now ask for the probability of the event in corresponding to the center square, and I get back an answer very close to zero. Where did this confidence come from?

I guess I’m basically wondering what this procedure is aspiring to be. Some candidates I have in mind:

1. Extension to the coarse case of regular hypothesis mixing (where we go from P(w) and Q(w) to )

2. Extension of some kind of Bayesian update-flavored thing where we go to then renormalize

1. ETA: seems more plausible than

3. Some kind of “aggregation of experts who we trust a lot unless they contradict each other”, which isn’t cleanly analogous to either of the above

Even in case 3, the near-zeros are really weird. The only cases I can think of where it makes sense are things like “The events are outcomes of a quantum process. Physics technique 1 creates hypothesis 1, and technique 2 creates hypothesis 2. Both techniques are very accurate, and the uncertainty they express is due to fundamental unknowability. Since we know both tables are correct, we can confidently rule out the middle column, and thus rule out certain events in hypothesis 3.”

But more typically, the uncertainty is in the maps of the respective hypotheses, not in the territory, in which case the middle zeros seem unfounded. And to be clear, the reason it seems like a real issue[2] is that when you add in hypothesis 3 you have events in the middle which you can query, but the values can stay arbitrarily close to zero if you add in hypothesis 3 with low weight.

1. ^

ETA: Found the bug, it was fixable by substituting a single character

2. ^

Rather than “if a zero falls in the forest and no hypothesis is around to hear it, does it really make a sound?”

• This maps the credence, but I would imagine that the confidence would not be evenly spread around the boxes. With confidence literally 0 it does not make sense for any credence to stand taller than another, as 1 and 0 would make equal sense. With a minuscule confidence, the foggy hunch does point in some direction.

Without h3 it is consistent to have middle square confidence 0. With positive plausibility of h3, the middle square is not “glossed over”; we have some confidence it might matter. But because h3 is totally useless for credences, those come from the structures of h1 and h2. Thus effectively h1 and h2 are voting for zero despite not caring about it.

Contrast what would happen with an even more trivial hypothesis of one square covering all with 100% or 9x9 equiprobable hypothesis.

You could also have a “micro detail hypothesis” (actually a 3x3): a 9x9 grid where each 3x3 is zeroes everywhere except the bottom-right corner, and all the “small square locations” are in the same position within their corresponding “big squares”. The “big scale” hypotheses do not really mind the “small scale” dragging of the credence around. Thus the small bottom-right square is quite sensitive to the corresponding big square value, and the other small squares are relatively insensitive. Mixing two 3x3 resolutions that are orthogonal results in a 9x9 resolution which is sparse (because it is separable). The John Vervaeke meme of “stereoscopic vision” seems to apply. The two 2x2 perspectives are not entirely orthogonal, so the “sparsity” is not easy to catch.

• The point I was trying to make with the partial functions was something like “Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about.” I guess I could have made that argument more clearly if instead I just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the s) are really coming from the fact that (almost) nobody cares about those events.

• I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous.

It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.

I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/​hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn’t happen. Maybe everything is edge cases. I am not sure.

It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that “0 means you cant update” but maybe you aren’t supposed to be updating in the output distribution anyway, you are supposed to do you updating in the more general epistemic state input object.

I actually prefer a different proposal for the type of “epistemic state that is closed under coarsening and mixture” that is more general than the thing I gesture at in the post:

A generalized epistemic state is a (quasi-?)convex function . A standard probability distribution is converted to an epistemic state through . A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function , we can convert a generalized epistemic state over to a generalized epistemic state over by precomposing with the obvious function from to .)

The above proposal comes together into the formula we have been talking about, but you can also imagine having generalized epistemic states that didn’t come from mixtures of coarse distributions.

• “Each g(Bi,j,Bk,l) is itself a matrix” – typo. Thanks, especially for the conclusions I’ve understood smoothly.

• I’m still working my way through this list and referring other people to it, A++, thank you for creating this post.

• Some people write blog posts that I’d consider to be journalism though. Ie. people who aren’t employed as journalists. I can’t think of or find any good examples right now, but I recall coming across it in the past.

Maybe Zvi’s posts here? Or Scott’s much more than you wanted to know posts?

• Hm, yeah, maybe Zvi’s posts. Scott’s Much More Than You Wanted To Know posts feel more like research to me, whereas journalism is a bit more storytelling and opinion.

• 26 Nov 2022 4:45 UTC
−3 points
0 ∶ 4

I think this essay is blatantly manipulative bullshit written in a deliberately hypnotic style, that could be modified to target any topic anyone cares about.

• I’m pretty sure that maximizing the expectation of any proper scoring rule will do all of these exactly the same, except maybe the last section because G has nice chaining properties that I’m too lazy to check for other scoring rules.

Do you think this has implications for there being other perfectly good versions of Thompson sampling etc? Or is this limited in implications because the argmax makes things too simple?

• Yeah, I think Thompson sampling is even more robust, but I don’t know much about the nice properties of Thompson sampling besides the density 0 exploration.

• I’ve got modest positions in GOOGL, MSFT.

I’ve got a bit more than 5% of my portfolio in semiconductor and related stocks: KLAC, LSE:SMSN, MTRN, AOSL, ASYS, AMKR, TRT, SCIA. I’m likely to buy more sometime in the next year, but I’m being patient because we’re likely at a poor part of an industry cycle.

Robotics seem likely to benefit from AI. I suspect the main winners will be companies that aren’t yet public, as I’m not too impressed by the opportunities I see so far. I’m playing this mainly via tiny positions in LIDAR companies (INVZ, OUST, LAZR) and SYM.

I have a modest position in OSS.

I tried to invest in Conjecture, but they don’t seem interested in small investors.

• 26 Nov 2022 4:04 UTC
3 points
0 ∶ 0

I can’t find it right now, but I distinctly remember you posting about BIDA having a similar “kids excluded” policy, I think back when under-5s couldn’t be vaccinated. At the time, you said it was no/​low cost, and someone in the comments pointed out that the cost was the entire cost of attending the dance. I didn’t see an explicit revision to your thinking posted. Can you articulate your revised cost-benefit for under 2s, who can’t do basic things like cover a cough or wash their hands after touching their mouth?

• I linked my earlier post in this one: https://​​www.jefftk.com/​​p/​​vaccine-requirements-age-and-fairness

I still endorse both posts. The idea is, if the choice is between an event that everyone could in theory attend but doesn’t happen, and an event that only some people can attend but does happen, I’d like to see the latter. In this case, however, the other covid policy choices they’re making (wind instruments with testing, no rapid testing for most attendees) don’t seem consistent with a view that children are too risky to allow.

• I want to read a detective story where you figure out who the murderer is by tracing encoding errors

• I don’t understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.

• I don’t understand the new unacceptability penalty footnote. In both of the terms, there is no conditional sign. I presume the comma is wrong?

They’re unconditional, not conditional probabilities. The comma is just for the exists quantifier.

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.

Sure—edited.

• Ah OK—the fact that the definition of $P_M$ is only the conditional case confused me

• I see there’s an associated talk now! https://​​www.youtube.com/​​watch?v=EIhE84kH2QI

• ♀︎

Fun fact: usually this is U+2640, but in this post it’s U+2640 U+FE0F, where U+FE0F is a variation selector meaning “render that as emoji, btw” (the text-presentation selector is U+FE0E). That should be redundant here, since LessWrong is pretty aggressive about replacing emojifiable text with emoji images anyway.

Emoji are really cursed.
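You can inspect the two codepoints directly, e.g. in Python:

```python
# Compare the bare symbol with the variation-selector form, codepoint
# by codepoint.
bare = "\u2640"           # FEMALE SIGN alone
with_vs = "\u2640\ufe0f"  # FEMALE SIGN + VARIATION SELECTOR-16

print([f"U+{ord(c):04X}" for c in bare])     # -> ['U+2640']
print([f"U+{ord(c):04X}" for c in with_vs])  # -> ['U+2640', 'U+FE0F']
```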

• Seems like there’ll be a lot of people coming! To get the online format down pat before the event, I’m doing a practice run at 2pm PST Saturday (tomorrow) on Zoom. I’d love to have 15-25 people join, even just for 30 mins, to test the format at a bit more scale. Saturday’s event will be “Let’s discuss hot takes on AI x-risk” moderated by me. No prep required.

I did a practice run today with 5-10 ppl, and the initial format didn’t flow very well, but we tried a variation that worked much better (a format that I’m currently calling “Hot Takes Debate Club”).

If you are up for talking on Zoom with people you disagree with, I’d appreciate it if you came :) I’ll post the zoom link here at the time.

• There was a similar question a few months back; you may find the answers there helpful.

• Something with a utility function, if it values an apple 1% more than an orange, if offered a million apple-or-orange choices, will choose a million apples and zero oranges. The division within most people into selfish and unselfish components is not like that, you cannot feed it all with unselfish choices whatever the ratio. Not unless you are a Keeper, maybe, who has made yourself sharper and more coherent; or maybe not even then, who knows?

I fear that this parable encourages a view whereby the utility function “should” factorize over intuitively obvious discrete quantities (e.g. apples and oranges). My utility function can value having a mixture of both apples and oranges.

• Samsung is a memory play, not a TSMC competitor. Logic semiconductor is what Samsung hopes to do in the future, it is not how they make money now. In my opinion, Samsung’s technical lead in memory is no less than TSMC’s technical lead in logic. And transformative AI will require memory as much as CPU, GPU, and TPU.

• Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated “If going to kill people, then don’t” value shard.

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

1. A baby learns “IF juice in front of me, THEN drink”,

2. The baby is later near juice, and then turns to see it, activating the learned “reflex” heuristic, learning to turn around and look at juice when the juice is nearby,

3. The baby is later far from juice, and bumbles around until they’re near the juice, whereupon she drinks the juice via the existing heuristics. This teaches “navigate to juice when you know it’s nearby.”

4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.

5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps.

But a “don’t kill” shard seems like it should remain… stubby? Primitive? It can’t self-chain into not doing something. If you’re going to do it, and then don’t because of the don’t-kill shard, and that avoids negative reward… Then maybe the “don’t kill” shard gets reinforced and generalized a bit because it avoided negative reward.

But—on my current guesses and intuitions—that shard doesn’t become more sophisticated, it doesn’t become reflective, it doesn’t “agentically participate” in the internal shard politics (e.g. the agent’s “meta-ethics”, deciding what kind of agent it “wants to become”). Other parts of the agent want things, they want paperclips or whatever, and that’s harder to do if the agent isn’t allowed to kill anyone.

Crucially, the no-killing injunction can probably be steered around by the agent’s other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard… There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might surreptitiously bid up lesioning plans which are optimized so as to not activate the reflective world-model, and thus, not activate the no-killing shard.

So, don’t embed a shard which doesn’t want to kill. Make a shard which wants to protect /​ save /​ help people. That can chain into itself across time.

• Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why.

• This is one point in favor of the “convergent consequentialism” hypothesis, in some form.

• I think that people are not usually defined by negative values (e.g. “don’t kill”), but by positives, and perhaps this is important.

• This asymmetry makes a lot of sense from an efficiency standpoint. No sense wasting your limited storage/​computation on state(-action pair)s that you are also simultaneously preventing yourself from encountering.

• From a pure technical perspective, I’d bet that some of the ocean-based pumped energy storage companies are going to make a lot of money in 5-10 years. I think dealing with saltwater is a solvable problem here, and they’re just going to dominate pneumatic pumped storage or thermal batteries for coastal markets because of higher efficiency. Not quite semiconductors though, except insofar as you count them as a dependent of solar, and solar as semiconductors :P

• FWIW I cannot find your podcast by searching in the app “Pocket Casts” (though I can on Spotify).

• imagenet was my fire alarm. and alphago. and alphazero. or maybe gpt3. actually, the fire alarm hadn’t gone off until alphafold, at which time it really started ringing. sorry, I mean alphafold 2. actually, PaLM was what really convinced me agi was soon. well, I mean, not really soon, but hey, maybe if they scale RWKV or S4 or NPT and jam them into a MuZero it somehow won’t be agi, despite that it obviously would be. I wonder how the EfficientZero followups are looking these days? don’t worry, agi can’t happen, they finally convinced me language models aren’t real intelligence because they can’t do real causal reasoning. they’re not good enough at using a causal information bottleneck and they don’t have the appropriate communication patterns of real physics. they’re prone to stereotyping and irrational, unlike real intelligence,

at this point if people aren’t convinced it’s soon, they’re not going to be convinced until after it happens. there’s no further revelation that could occur. it’ll be here within the year, and I don’t know why it’s been so hard for people to see. I guess the insistence on yudkowskian foom has immunized people against real life slow takeoff? but that “slow” is speeding up, hard.

anyway, I hope y’all are using good ai tools. I personally most recommend metaphor.systems, summarize.tech, and semanticscholar.

• 26 Nov 2022 0:53 UTC
6 points

2 years ago I had no credentials, not even an undergrad degree. Got spooked by GPT-3 and laser-focused on it, but without preconceptions about where I’d end up. Played with GPT-3 on AI Dungeon, then built an interface to interact with it at higher bandwidth. This made me (Pareto) best in the world at something in less than 6 months, because the opportunity to upskill did not exist 6 months prior. Published some papers and blog posts that were easy to churn out because they were just samples of some of the many, many thoughts about GPT that now filled my mind. Joined EleutherAI and started contributing, mostly conceptually, because I didn’t have deep ML experience. Responded to an ad by Latitude (the company that makes AI Dungeon) for the position of “GPT-3 hacker”. Worked there for a few months as an ML engineer, then was one of the founding employees of Conjecture (I got to know the founders through EleutherAI). Now I am Involved.

The field of AI is moving so quickly that it’s easy to become Pareto best in the world if you depart from the mainline of what everyone else is doing. Apparently you are smart and creative; if you’re also truly “passionate” about AI, maybe you have the curiosity and drive to spot the unexploited opportunities and niches. The efficient market is a myth, except inside the Overton window; I would recommend not trying to compete there. So the strategy I’m advocating is most similar to your option (2). But I’d suggest following your curiosity and tinkering to improve your map of where the truly fertile opportunities lie, instead of doing a side project for the sake of having a side project—the latter is the road to mediocrity.

Also, find out where the interesting people who are defining the cutting edge are hanging out and learn from them. You might be surprised that you soon have a lot to teach them as well, if you’ve been exploring the very high dimensional frontier independently.

I cannot promise this is the best advice for you, but it is the advice I would give someone similar to myself.

• Cool post! I think the minimum viable “guardian” implementation would be to

• embed each post/​video/​tweet into some high-dimensional space

• find out which regions of that space are nasty (we can do this collectively—f.e. my clickbait is probably clickbaity for you too)

• filter out those regions
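A minimal sketch of those three steps (the 4-dimensional vectors here are toy stand-ins for real post/video/tweet embeddings; the names and threshold are illustrative):

```python
import numpy as np

# Toy "embeddings" for a few posts; in practice these would come
# from some ML embedding model.
posts = {
    "thoughtful essay": np.array([0.9, 0.1, 0.0, 0.2]),
    "clickbait A": np.array([0.0, 0.95, 0.1, 0.0]),
    "clickbait B": np.array([0.1, 0.9, 0.2, 0.1]),
    "tutorial": np.array([0.8, 0.0, 0.3, 0.1]),
}

# Step 2: locate "nasty" regions collectively, e.g. as the centroid
# of items users flagged as clickbait.
flagged = [posts["clickbait A"]]
nasty_centroid = np.mean(flagged, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 3: filter out anything too close to a nasty centroid.
THRESHOLD = 0.9
feed = [name for name, emb in posts.items()
        if cosine(emb, nasty_centroid) < THRESHOLD]
```

Note that "clickbait B" gets filtered too, even though only "clickbait A" was flagged: that is the point of filtering a region of embedding space rather than individual items.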

I tried to do something along these lines for youtube: https://github.com/filyp/yourtube

I couldn’t find a good way to embed videos using ML, so I just scraped which videos recommend each other, and made a graph from that (which kinda is an embedding). Then I let users narrow down on some particular region of that graph. So you can not only avoid some nasty regions, but you can also decide what you want to watch right now, instead of the algorithm deciding for you. So this gives the user more autonomy.

The accuracy isn’t yet too satisfying. I think the biggest problem with systems like these is the network effect—you could get much better results with some collaborative filtering.

• we can do this collectively—f.e. my clickbait is probably clickbaity for you too

This assumes good faith. As soon as enough people learn about the Guardian AI, I expect Twitter threads coordinating people: “let’s flag all outgroup content as ‘clickbait’”.

Just like people are abusing current systems by falsely labeling content they want removed as “spam” or “porn” or “original research” or whichever label effectively means “this will be hidden from the audience”.

• Any pointers on how it would even be possible for an alien civilisation to transmit complex instructions that could be deciphered?

Given a radio signal, I see how you could determine that it’s not natural, and then what?

• 25 Nov 2022 21:47 UTC
3 points

The gap between invention of radio and Superintelligent AI in our case (and perhaps most cases of evolution of intelligent life) appears to be <150 years. A pretty narrow window to hit unless we are being actively observed—and that would likely imply they have had time to notice multicellular life on earth and get observers to us at low fractions of light speed.

If intelligent (inevitably superintelligent) Aliens exist and care about physical reality beyond their own stellar system then they can and likely will spread out to have a presence in every interesting star system in the galaxy within a million years—and planets with multicellular life are likely highly anomalous and interesting for curious Aliens.

It would be hard to believe that this hasn’t already happened, given 1-4e11 stars and a 5-10e9 year ‘window for life’ in the Milky Way, making the zoo hypothesis in my mind the most likely solution to the Fermi paradox (with weak anecdotal evidence in the form of seemingly increasingly furtive UFOs over the last century). Evolution selects for aliens that choose to propagate and endure, and the technology to do so is almost trivially easy once intelligence and superintelligence evolve. So if intelligence has evolved in the Milky Way and cares about other species developing, then it is clearly not hegemonic (evidenced by our continuing existence) and is likely already here.

If all this is the case—and aliens are here watching us—then it also provides an existence proof that Alignment is possible. Conversely, if they are not here, then that is perhaps weak evidence that Alignment is not possible: that superintelligent AI is either auto-extinguishing or almost universally disinterested in biological life.

• I like that this post can be read both in jest and in earnest, and in both readings it contains much Truth and Wisdom. =)

• And physicists have proven that bumblebee wings don’t work, so we know that crazy contraption of “drones” will remain a fiction.

• An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don’t design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”

2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/​suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, about how real-world caring can work on a mechanistic level. Where effective real-world cognition doesn’t have to (implicitly) be about optimizing an expected utility function over all possible plans. This last sentence might have even seemed bizarre to you.

Here, then, is an extremely detailed speculative story for value-child’s first day at school. Well, his first day spent with his newly-implanted “work hard” and “behave well” value shards.

Value-child gets dropped off at school. He recognizes his friends (via high-level cortical activations previously formed through self-supervised learning) and waves at them (friend-shard was left intact). They rush over to greet him. They start talking about Fortnite. Value-child cringes slightly as he predicts he will be more distracted later at school and, increasingly, put in a mental context where his game-shard takes over decision-making, which is reflectively-predicted to lead to him daydreaming during class. This is a negative update on the primary shard-relevant features for the day.

His general-purpose planning machinery generates an example hardworking-shard-desired terminal state: Paying rapt attention during Mr. Buck’s math class (his first class today). He currently predicts that while he is in Mr. Buck’s class later, he will still be somewhat distracted by residual game-related cognition causing him to loop into reward-predicted self-reinforcing thoughts.

He notices a surprisingly low predicted level for a variable (amount of game-related cognition predicted for future situation: Mr. Buck’s class) which is important to a currently activated shard (working hard). This triggers a previously learned query to his WM: “why are you making this prediction for this quantity?”. The WM responds with a few sources of variation, including how value-child is currently near his friends who are talking about Fortnite. In more detail, the WM models the following (most of it not directly translatable to English):

His friends’ utterances will continue to be about Fortnite. Their words will be processed and then light up Fortnite-related abstractions, which causes both prediction of more Fortnite-related observations and also increasingly strong activation of the game-shard. Due to previous reward events, his game-shard is shaped so as to bid up game-related thoughts, which are themselves rewarding events, which causes a positive feedback loop where he slightly daydreams about video games while his friends talk.

When class is about to start, his “get to class”-related cognition will be activated by his knowledge of the time and his WM indicating “I’m at school.” His mental context will slightly change, he will enter the classroom and sit down, and he will take out his homework. He will then pay token attention due to previous negative social-reward events around being caught off guard

[Exception thrown! The world model was concurrently coarsely predicting what it thinks will happen given his current real values (which include working hard). The coarse prediction clashes with the above cached prediction that he will only pay token attention in math class!

The WM hiccups on this point, pausing to more granularly recompute its predictions. It squashes the cached prediction that he doesn’t strongly care about paying attention in class. Since his mom installed a hard-working-shard and an excel-at-school shard, he will actively try to pay attention. This prediction replaces the cached prior prediction.]

However, value-child will still have game-related cognition activated, and will daydream. This decreases value-relevant quantities, like “how hard he will be working” and “how much he will excel” and “how much he will learn.”

This last part is antithetical to the new shards, so they bid down “Hang around friends before heading into school.” Having located a predicted-to-be-controllable source of negative influence on value-relevant outcomes, the shards bid for planning to begin. The implied causal graph is:

Continuing to hear friends talk about Fortnite
|
v
Distracted during class

So the automatic causality-noticing algorithms bid to knock out the primary modeled cause of the negative value-relevant influence. The current planning subgoal is set to: make causal antecedent false and reduce level of predicted distraction. Candidate concretization set to: get away from friends.

(The child at this point notices they want to get away from this discussion, that they are in some sense uncomfortable. They feel themselves looking for an excuse to leave the conversation. They don’t experience the flurry of thoughts and computations described above. Subconscious computation is subconscious. Even conscious thoughts won’t introspectively reveal their algorithmic underpinnings.)

“Hey, Steven, did you get problem #3 for math? I want to talk about it.” Value-child starts walking away.

Crucially, in this story, value-child cares about working hard in that his lines of cognition stream together to make sure he actually works hard in the future. He isn’t trying to optimize his later evaluation of having worked hard. He isn’t ultimately and primarily trying to come up with a plan which he will later evaluate as being a maximally hard-work-involving plan.

Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.

• I can totally believe that agents that competently and cooperatively seek out to fulfill a goal, rather than seeking to trick evaluators of that goal to think it gets fulfilled, can exist.

However, whether you get such agents out of an algorithm depends on the details of that algorithm. Current reinforcement learning algorithms mostly don’t create agents that competently do anything. If they were more powerful while still doing essentially the same thing they currently do, most of them would end up tricked by the agents they create, rather than having aligned agents.

• Media spoilers: The Peripheral, Episode 7 (“The Doodad”)

The mom character argues that another character is evil for letting survival limit their options for solving their dilemma. Listening to this with LW memes in mind, it sounds awfully like “A good person would let themselves be shut down in this situation”, i.e. you are evil for not being corrigible.

An interesting point is that both characters are intensely invested in the outcome of the situation, with similar kinds of downsides.

• “Parasite gives wolves what it takes to be pack leaders”, Nature, 24 November 2022.

Toxoplasma gondii, the parasite well-known for making rodents lose their fear of cats, and possibly making humans more reckless, also affects wolves in an interesting way.

“infected wolves were 11 times more likely than uninfected ones to leave their birth family to start a new pack, and 46 times more likely to become pack leaders — often the only wolves in the pack that breed.”

• The gesturing towards the infected wolves being more reproductively fit in general is probably wrong, however. Of course wolves can be more aggressive if it’s actually a good idea; there’s no need for a parasite to force them to be more aggressive. The suggestion about American lions going extinct is absurd: 11,000 years is more than enough time for wolves to recalibrate such a very heritable trait if it’s so fitness-linked! So the question there is merely: what is going on? Some sort of bias, or a very localized fitness benefit?

Is there a selection bias whereby ex ante going for pack leader is a terrible idea, but ex post, conditional on victory (rather than death/expulsion), it looks good? Well, this claims to be longitudinal and not to find the sorts of correlations you’d expect from survivorship. What else?

Looking it over, the sampling frame 1995-2020 itself is suspect: starting in 1995. Why did it start then? Well, that’s when the wolves came back (very briefly mentioned in the article). The wolf population expanded rapidly 5-fold, and continues to oscillate a lot as packs rise and fold (ranging 8-14) and because of overall mortality/​randomness on a small base (a pack is only like 10-20 wolves of all ages, so you can see why there would be a lot of volatility and problems with hard constraints like lower bounds):

Wolf population declines, when they occur, result from “intraspecific strife,” food stress, mange, canine distemper, legal hunting of wolves in areas outside the park (for sport or for livestock protection) and in one case in 2009, lethal removal by park officials of a human-habituated wolf.[21]

So, we have at least two good possible explanations there: (a) it was genuinely reproductively-fit to take more risks than the basal wolf, but only because they were expanding into a completely-wolf-empty park and surrounding environs, and the pack-leader GLM they use doesn’t include any variables for time period, so on reanalysis, we would find that the leader-effect has been fading out since 1995; and (b) this effect still exists, and risk-seeking individuals do form new packs and are more fit… but only temporarily because they occupied a low-quality pack niche and it goes extinct or does badly enough that they would’ve done better to stay in the original pack, and this wouldn’t show up in a naive individual-level GLM like theirs, you would have to do more careful tracing of genealogies to notice that the new-pack lineages underperform.

In the ROME paper, when you prompt the language model with “The Eiffel Tower is located in Paris”, you have the following:

• Subject token(s): The Eiffel Tower

• Relationship: is located in

• Object: Paris

Once the model has seen the subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing, since it doesn’t know you will ask for something like location after the subject tokens) from the MLPs and ‘write’ it into the residual stream, for the attention modules at the final token to look at the context, aggregate, and retrieve the correct information.

In other words, if we take “The Eiffel Tower is located in”, the model will write different information about the Eiffel Tower into the residual stream once it gets to the layers with “factual” information (early-middle layers). At this point, the model hasn’t seen “is located in”, so it doesn’t actually know that you are going to ask for the location. For this reason, it will write more than just the location of the Eiffel Tower into the residual stream. Once you are at the point of predicting the location (at the final token, “in”), the model will aggregate the surrounding context and pull the location information that was ‘written’ into the residual stream via the MLPs with the most causal effect.

What is stored in the MLP is not the relationship between the facts. This is obvious because the relationship is coming after the subject tokens. In other words, as we said before, the MLPs are retrieving a bunch of factual knowledge, and then the attention modules are picking the correct (forgive the handwavy description) fact given what was retrieved and the relationship that is being asked of it.

My guess is that you could probably take what is being ‘written’ into the residual stream and directly predict properties of the subject token from the output of the layers with the most causal effect to predict a fact.
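As a toy illustration of that guess (not the ROME setup itself): if a property of the subject is linearly “written” into residual-stream activations, a simple linear probe can read it back out. The activations below are synthetic; real ones would come from hooking a transformer layer.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64
n = 500
# Hidden binary "property" of each subject (e.g. is-a-landmark: 0 or 1).
prop = rng.integers(0, 2, size=n)
# A fixed direction along which the property is encoded, plus noise,
# standing in for residual-stream activations after the "factual" layers.
direction = rng.normal(size=d_model)
resid = rng.normal(size=(n, d_model)) + np.outer(prop * 2.0 - 1.0, direction)

# Least-squares linear probe: predict the property from the activations.
X = np.hstack([resid, np.ones((n, 1))])  # add a bias column
w, *_ = np.linalg.lstsq(X, prop * 2.0 - 1.0, rcond=None)
preds = (X @ w > 0).astype(int)
accuracy = (preds == prop).mean()
```

When the encoding really is linear, the probe recovers the property almost perfectly; the interesting empirical question is whether real residual streams behave this way.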

Thoughts and corrections are welcome.

• Could such a thing be developed right now? It wouldn’t take any more AI than the recommender systems optimised for clicks. But I’d prefer it be and be called a servant, rather than a “guardian”.

• So a while ago, I was thinking about whether there was any way to combine two probability distributions $P$ and $Q$ by “intersecting” them; I wanted a distribution which had high probability only if both $P$ and $Q$ had high probability. One idea I came up with was the normalized product $P(x)Q(x) / \sum_{x'} P(x')Q(x')$. I didn’t prove anything about it, but looking at some test cases it looked reasonable.

I was just about to suggest changing your belief aggregation method should be closer to this, but then I realized that it was already close in the relevant sense. Gonna be exciting to see what you might have to say about it.

• Wait no, I’m stupid. What you do corresponds to the mixture $(P(x)+Q(x))/2$, which is more like the union of hypotheses. You’d have to do something like the normalized product $P(x)Q(x) / \sum_{x'} P(x')Q(x')$ to get the intersection, I think. I should probably think through the math again more carefully when I have more time.

• Note that this is just the arithmetic mean of the probability distributions. Which is indeed what you want if you believe that P is right with probability 50% and Q is right with probability 50%, and I agree that this is what Scott does.

At the same time, I wonder—is there some sort of frame on the problem that makes logarithmic pooling sensible? Perhaps (inspired by the earlier post on Nash bargaining) something like a “bargain” between the two hypotheses, where a hypothesis’ “utility” for an outcome is the probability that the hypothesis assigns to it.

• The place where I came up with it was in thinking about models that focus on independent dynamics and might even have different ontologies. For instance, maybe to set environmental policy, you want to combine climate models with economics models. The intersection expression seemed like a plausible method for that. Though I didn’t look into it in detail.

• The aggregation method you suggest is called logarithmic pooling. Another way to phrase it is: take the geometric mean of the odds given by the probability distribution (or the arithmetic mean of the log-odds). There’s a natural way to associate every proper scoring rule (for eliciting probability distributions) with an aggregation method, and logarithmic pooling is the aggregation method that gets associated with the log scoring rule (which Scott wrote about in an earlier post). (Here’s a paper I wrote about this connection: https://arxiv.org/pdf/2102.07081.pdf)
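For a binary event with two equally-weighted experts, the two pooling methods can be computed like this (the probabilities are illustrative):

```python
import math

# Two experts' probabilities for the same binary event.
p, q = 0.9, 0.5

# Linear pooling: arithmetic mean of the probabilities.
linear = (p + q) / 2

# Logarithmic pooling (equal weights): geometric mean of the odds,
# then convert back to a probability.
odds = math.sqrt((p / (1 - p)) * (q / (1 - q)))
log_pool = odds / (1 + odds)
```

Here linear pooling gives 0.7, while logarithmic pooling gives 0.75: pooling odds multiplicatively pulls less toward the uninformative 0.5 expert.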

I’m also excited to see where this sequence goes!

• 25 Nov 2022 18:56 UTC
42 points

I must disagree. I roasted a large plane for Thanksgiving yesterday and it was incomparable to a bird. For tips on brining your plane, see here: https://en.wikipedia.org/wiki/US_Airways_Flight_1549

• As someone who gives data science interviews, my (personal, unreliable) opinion is that you should start preparing for interviews as soon as possible, and actually begin interviewing as soon as you feel ready.

I’m not saying you’ll get in on the first try! You might, in which case you’ll save a lot of effort doing anything else. If not, you’ll get some sense of what the interview process is like, and where your strengths and weaknesses are.

If you can’t get interviews at all, you may need to think about improving your resume. That could look like options 1, or 2, or 4 if you can swing it; the details probably depend a lot on your personal circumstances.

If you can get interviews, but not jobs, you should probably work on your interview technique. For early-career hires, we care more about how the interview and practical exercise go than anything else. (Remember to ask the interviewers for feedback at the end, e.g. “is there anything you think I could improve on?”)

If you want to go get some super-impressive experience, that’s not a bad thing, it’s certainly going to make us more interested; that said, it’s a large amount of work to do so convincingly, and it won’t save you if you can’t impress on the more routine parts of the interview.

Also, don’t feel you have to sell your existing experience short: “I did some clever feature engineering that resulted in a better model for our data” is actually a pretty good answer. I can’t speak for AI safety, but there are lots of other opportunities that would be happy to have someone who knows their way around a dataset.

If you’re not sure how to explain it, then practise that! You’re going to be evaluated on your communication as much as anything else, and explaining technical concepts to people who don’t understand them is often part of the job. They won’t need to know it inside-out, just give them a sense of what’s going on, and why your efforts mattered.

• Strongly agree with tangren. Try to start interviewing and see:

1. Can you even get the interviews? If you can’t at all, then your resume is probably not good. Also, maybe you need to work with a recruiter.

2. If you can get the interviews but not the offers, then it’s probably your interviewing skills. You can study up. (For this reason it’s recommended to first interview with companies you don’t particularly want to join.)

I will caution that right now is probably a particularly difficult time to find an engineering job. There were a lot of layoffs in big tech companies and a lot of them have a hiring freeze.

• Observation: If we want to keep everything in product form, then adding a constraint to the argmax can be seen as multiplying by an indicator function. I.e. if $\mathbb{1}_\phi(x)$ is 1 when $\phi(x)$ is true and 0 when $\phi(x)$ is false, then $\operatorname{argmax}_{x : \phi(x)} f(x) = \operatorname{argmax}_x f(x)\,\mathbb{1}_\phi(x)$. Notably, we can’t really do this with arithmetic maximization, because then we would be taking the logarithm of 0, which is undefined.

I’m not sure how useful this is, because it doesn’t really help with empirical approximation, as then you run into problems with multiplying by zero. But this might be nice; at least it seems to provide a quantitative justification for the view of probability distributions as being “soft constraints”.
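A quick numerical check of the constrained-argmax-as-indicator-product observation (the objective and constraint below are arbitrary examples; note f must be nonnegative for the product form to make sense):

```python
import numpy as np

x = np.arange(10)
f = (x - 3.0) ** 2          # some nonnegative objective
constraint = x % 2 == 0     # the constraint phi(x): "x is even"

# argmax of f restricted to the feasible set...
feasible = x[constraint]
constrained = feasible[np.argmax(f[constraint])]

# ...equals argmax of f times the indicator, taken over everything.
indicator = constraint.astype(float)
unconstrained = x[np.argmax(f * indicator)]
```

Both routes pick out the same maximizer, as the observation predicts.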

• Observation to potentially connect this to some math that people might be more familiar with: when $p$ and $q$ are probability distributions, then $\prod_x q(x)^{p(x)} = e^{-H(p,q)}$, where $H(p,q) = -\sum_x p(x) \log q(x)$ is the cross-entropy.
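That identity is easy to check numerically: for distributions p and q over the same outcomes, the product of q(x)^p(x) equals exp(-H(p, q)) with H(p, q) = -sum of p(x) log q(x):

```python
import numpy as np

# Two small distributions over the same three outcomes (values illustrative).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

product_form = np.prod(q ** p)            # prod_x q(x)^p(x)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
```

The two quantities agree to machine precision, since taking logs of the product form gives exactly minus the cross-entropy.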

• Note that the cross-entropy $H(p,q)$ (and thus $e^{-H(p,q)}$) is dependent on meaningless details of what events you consider the same vs different, but $D_{\mathrm{KL}}(p \Vert q)$ is not (as much), and when maximizing with respect to $q$, this is the same maximization.

(I am just pointing out that KL divergence is a more natural concept than cross entropy.)