# Game Theory

Game theory is the formal study of how rational actors interact in pursuit of their incentives. It investigates situations of both conflict and cooperation.

Though often introduced through simple toy games, game theory is an extremely powerful and robust tool for analyzing much more complex situations, such as mergers and acquisitions, political economy, voting systems, war bargaining, and biological evolution. Eight game theorists have won the Nobel Prize in Economic Sciences.
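
The simplest formal object in game theory is a payoff matrix. As a minimal illustrative sketch (the game and payoff values are the conventional textbook Prisoner's Dilemma, not drawn from any particular post below), here is how a two-player game can be encoded and its pure-strategy Nash equilibria found by brute force:

```python
from itertools import product

# Prisoner's Dilemma payoffs as (row player, column player) utilities.
# "C" = cooperate, "D" = defect; values are the standard illustrative ones.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
STRATEGIES = ("C", "D")

def is_nash(row, col):
    """A profile is a pure Nash equilibrium if neither player can do
    strictly better by unilaterally switching strategies."""
    row_payoff, col_payoff = PAYOFFS[(row, col)]
    best_row = max(PAYOFFS[(r, col)][0] for r in STRATEGIES)
    best_col = max(PAYOFFS[(row, c)][1] for c in STRATEGIES)
    return row_payoff == best_row and col_payoff == best_col

equilibria = [p for p in product(STRATEGIES, repeat=2) if is_nash(*p)]
print(equilibria)  # [('D', 'D')]: mutual defection, even though (C, C) pays both players more
```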

# Eight Short Studies On Excuses

20 Apr 2010 23:01 UTC
661 points

# Introduction to Game Theory: Sequence Guide

28 Jun 2012 3:32 UTC
74 points

# Player of Games

29 Aug 2018 21:26 UTC
61 points

# Book Review—The Origins of Unfairness: Social Categories and Cultural Evolution

21 Jan 2020 6:28 UTC
27 points
(unremediatedgender.space)

# The Pavlov Strategy

20 Dec 2018 16:20 UTC
216 points
(srconstantin.wordpress.com)

# Moloch games

16 Oct 2020 15:19 UTC
79 points

# Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

11 May 2020 9:47 UTC
38 points

# Thomas C. Schelling’s “Strategy of Conflict”

28 Jul 2009 16:08 UTC
136 points

# Schelling fences on slippery slopes

16 Mar 2012 23:44 UTC
496 points

# Contrite Strategies and The Need For Standards

24 Dec 2018 18:30 UTC
123 points
(srconstantin.wordpress.com)

# “Zero Sum” is a misnomer.

30 Sep 2020 18:25 UTC
106 points

# Threat-Resistant Bargaining Megapost: Introducing the ROSE Value

28 Sep 2022 1:20 UTC
63 points

# Game Theory As A Dark Art

24 Jul 2012 3:27 UTC
132 points

# Prisoners’ Dilemma with Costs to Modeling

5 Jun 2018 4:51 UTC
111 points

# The Dark Miracle of Optics

24 Jun 2020 3:09 UTC
26 points

# Nash Equilibria and Schelling Points

29 Jun 2012 2:06 UTC
111 points

# Backward Reasoning Over Decision Trees

30 Jun 2012 3:17 UTC
149 points

# Introduction to Prisoners’ Dilemma

30 Jun 2012 0:54 UTC
57 points

# Bargaining and Auctions

15 Jul 2012 17:01 UTC
53 points

# Anti-social Punishment

27 Sep 2018 7:08 UTC
249 points

# Mechanism Design: Constructing Algorithms for Strategic Agents

30 Apr 2014 18:37 UTC
69 points

# Fair Division of Black-Hole Negentropy: an Introduction to Cooperative Game Theory

16 Jul 2009 4:17 UTC
55 points

# [Link] A gentle video introduction to game theory

13 Dec 2011 8:52 UTC
46 points

# You Play to Win the Game

30 Aug 2018 14:10 UTC
26 points
(thezvi.wordpress.com)

# Most Prisoner’s Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems

14 Sep 2020 22:13 UTC
157 points

# The Darwin Game

9 Oct 2020 10:19 UTC
90 points

# The Commitment Races problem

23 Aug 2019 1:58 UTC
112 points

# Game theory, sanctions, and Ukraine

13 Mar 2022 17:45 UTC
27 points

# Unifying Bargaining Notions (1/2)

25 Jul 2022 0:28 UTC
179 points

# Less Threat-Dependent Bargaining Solutions?? (3/2)

20 Aug 2022 2:19 UTC
65 points

# The True Prisoner’s Dilemma

3 Sep 2008 21:34 UTC
168 points

# Speaking Truth to Power Is a Schelling Point

30 Dec 2019 6:12 UTC
51 points

# Funk-tunul’s Legacy; Or, The Legend of the Extortion War

24 Dec 2019 9:29 UTC
13 points

# Formal Open Problem in Decision Theory

29 Nov 2018 3:25 UTC
35 points

# The Ubiquitous Converse Lawvere Problem

29 Nov 2018 3:16 UTC
21 points

# Hyperreal Brouwer

29 Nov 2018 3:15 UTC
30 points

# Coordination Problems in Evolution: The Rise of Eukaryotes

15 Oct 2018 6:18 UTC
46 points

# [Question] How to avoid Schelling points?

14 May 2020 19:37 UTC
28 points

# Conflict vs. mistake in non-zero-sum games

5 Apr 2020 22:22 UTC
156 points

# Meditations On Moloch

30 Jul 2014 4:00 UTC
114 points

# Classifying games like the Prisoner’s Dilemma

4 Jul 2020 17:10 UTC
98 points
(reasonableapproximation.net)

# In Logical Time, All Games are Iterated Games

20 Sep 2018 2:01 UTC
83 points

# Two Coordination Styles

7 Feb 2018 9:00 UTC
38 points

# Real World Solutions to Prisoners’ Dilemmas

3 Jul 2012 3:25 UTC
65 points

# Interlude for Behavioral Economics

6 Jul 2012 20:12 UTC
86 points

# What Is Signaling, Really?

12 Jul 2012 17:43 UTC
132 points

# Imperfect Voting Systems

20 Jul 2012 0:07 UTC
56 points

# Morality as Parfitian-filtered Decision Theory?

30 Aug 2010 21:37 UTC
31 points

# What counts as defection?

12 Jul 2020 22:03 UTC
81 points

# Sections 1 & 2: Introduction, Strategy and Governance

17 Dec 2019 21:27 UTC
34 points

# Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

17 Dec 2019 21:46 UTC
19 points

# Sections 5 & 6: Contemporary Architectures, Humans in the Loop

20 Dec 2019 3:52 UTC
27 points

# Section 7: Foundations of Rational Agency

22 Dec 2019 2:05 UTC
14 points

# Moloch Hasn’t Won

28 Dec 2019 16:30 UTC
161 points
(thezvi.wordpress.com)

# Quotes from Moral Mazes

30 May 2019 11:50 UTC
92 points
(thezvi.wordpress.com)

# The Costly Coordination Mechanism of Common Knowledge

15 Mar 2018 20:20 UTC
175 points

# The Schelling Choice is “Rabbit”, not “Stag”

8 Jun 2019 0:24 UTC
135 points

# Strategic ignorance and plausible deniability

10 Aug 2011 9:30 UTC
60 points

# From Personal to Prison Gangs: Enforcing Prosocial Behavior

24 Jan 2019 18:07 UTC
135 points

# Pareto improvements are rarer than they seem

27 Jan 2018 22:23 UTC
28 points
(reasonableapproximation.net)

# Cooperating with agents with different ideas of fairness, while resisting exploitation

16 Sep 2013 8:27 UTC
83 points

# In praise of heuristics

24 Oct 2018 15:44 UTC
39 points

# Measures, Risk, Death, and War

20 Dec 2011 23:37 UTC
17 points

# Kinnaird’s truels

5 Mar 2009 16:50 UTC
28 points

# A Problem About Bargaining and Logical Uncertainty

21 Mar 2012 21:03 UTC
46 points

# A diagram for a simple two-player game

10 Nov 2013 8:59 UTC
56 points

# When Hindsight Isn’t 20/20: Incentive Design With Imperfect Credit Allocation

8 Nov 2020 19:16 UTC
71 points

# The Darwin Game—Rounds 1 to 2

11 Nov 2020 1:53 UTC
48 points

# Specialized Labor and Counterfactual Compensation

14 Nov 2020 18:13 UTC
18 points
(reasonableapproximation.net)

# The Mutant Game—Rounds 11 to 30

23 Nov 2020 9:20 UTC
5 points

# The Mutant Game—Rounds 31 to 90

27 Nov 2020 21:05 UTC
18 points

# Building Blocks of Politics: An Overview of Selectorate Theory

12 Oct 2021 11:27 UTC
56 points

# What is a VNM stable set, really?

25 Jan 2021 5:43 UTC
14 points

# Modal Bargaining Agents

16 Apr 2015 22:19 UTC
14 points

# Confusions re: Higher-Level Game Theory

2 Jul 2021 3:15 UTC
38 points

# Hypergames 101

23 Jul 2021 13:19 UTC
12 points

# Yet More Modal Combat

24 Aug 2021 10:32 UTC
10 points

# Counterfactual Contracts

16 Sep 2021 15:20 UTC
10 points
(harsimony.wordpress.com)

# EA Hangout Prisoners’ Dilemma

27 Sep 2021 23:15 UTC
55 points

# My take on higher-order game theory

30 Nov 2021 5:56 UTC
38 points

# PD-alikes in two dimensions

23 Apr 2022 9:52 UTC
13 points
(reasonableapproximation.net)

# [Question] How should I talk about optimal but not subgame-optimal play?

30 Jun 2022 13:58 UTC
5 points

# «Boundaries» Sequence (Index Post)

26 Jul 2022 19:12 UTC
21 points

# «Boundaries», Part 1: a key missing concept from utility theory

26 Jul 2022 23:03 UTC
130 points

# Unifying Bargaining Notions (2/2)

27 Jul 2022 3:40 UTC
98 points

# «Boundaries», Part 2: trends in EA’s handling of boundaries

6 Aug 2022 0:42 UTC
48 points

# AI coöperation is more possible than you think

24 Sep 2022 21:26 UTC
5 points

# Prisoner’s Dilemma Tournament Results

6 Sep 2011 0:46 UTC
151 points

# Out to Get You

23 Sep 2017 10:50 UTC
82 points
(thezvi.wordpress.com)

# Less Competition, More Meritocracy?

20 Jan 2019 2:00 UTC
84 points
(thezvi.wordpress.com)

# The Darwin Results

25 Nov 2017 13:30 UTC
47 points
(thezvi.wordpress.com)

# Simplified Poker

4 Jun 2018 15:50 UTC
34 points
(thezvi.wordpress.com)

# 2014 iterated prisoner’s dilemma tournament results

30 Sep 2014 21:23 UTC
94 points

# Pavlov Generalizes

20 Feb 2019 9:03 UTC
64 points

# Does the US nuclear policy still target cities?

2 Oct 2019 0:18 UTC
61 points

# Simplified Poker Conclusions

9 Jun 2018 21:50 UTC
64 points
(thezvi.wordpress.com)

# On Robin Hanson’s Board Game

8 Sep 2018 17:10 UTC
49 points
(thezvi.wordpress.com)

# The Darwin Game

15 Nov 2017 23:20 UTC
36 points
(thezvi.wordpress.com)

# Let’s Read: Superhuman AI for multiplayer poker

14 Jul 2019 6:22 UTC
56 points

# The Darwin Pregame

21 Nov 2017 1:10 UTC
34 points
(thezvi.wordpress.com)

# Buridan’s ass in coordination games

16 Jul 2018 2:51 UTC
52 points

# Reducing collective rationality to individual optimization in common-payoff games using MCMC

20 Aug 2018 0:51 UTC
59 points

# The Solitaire Principle: Game Theory for One

17 Jan 2018 0:14 UTC
24 points

# Towards optimal play as Villager in a mixed game

7 May 2019 5:29 UTC
50 points
(benjaminrosshoffman.com)

# Diplomacy as a Game Theory Laboratory

12 Nov 2010 22:19 UTC
68 points

# Simplified Poker Strategy

6 Jun 2018 11:10 UTC
40 points
(thezvi.wordpress.com)

# Even Odds

12 Jan 2014 7:24 UTC
62 points

# Some Ways Coordination is Hard

13 Jun 2019 13:00 UTC
53 points
(thezvi.wordpress.com)

# “Backchaining” in Strategy

9 Feb 2018 12:01 UTC
15 points

# Mixed strategy Nash equilibrium

16 Oct 2010 16:00 UTC
59 points

# Blackmail, Nukes and the Prisoner’s Dilemma

10 Mar 2010 14:58 UTC
25 points

# Let’s split the cake, lengthwise, upwise and slantwise

25 Oct 2010 13:15 UTC
74 points

# Decision Auctions aka “How to fairly assign chores, or decide who gets the last cookie”

21 Jan 2014 21:13 UTC
55 points

# The Epistemic Prisoner’s Dilemma

18 Apr 2009 5:36 UTC
68 points

# Blame games

6 May 2019 2:38 UTC
45 points
(benjaminrosshoffman.com)

# Nash equilibriums can be arbitrarily bad

1 May 2019 14:58 UTC
34 points

# UDT as a Nash Equilibrium

6 Feb 2018 14:08 UTC
18 points

# Front Row Center

11 Jun 2018 13:50 UTC
30 points
(thezvi.wordpress.com)

# Moloch and multi-level selection

11 Aug 2020 7:08 UTC
19 points

# [Question] A way to beat superrational/EDT agents?

17 Aug 2020 14:33 UTC
5 points

# Taking the awkwardness out of a Prenup—A Game Theoretic solution

22 May 2010 0:45 UTC
41 points

# The Cryonics Strategy Space

24 Apr 2014 16:11 UTC
40 points

# De-Centering Bias

18 Oct 2017 23:24 UTC
14 points

# Negative “eeny meeny miny moe”

20 Aug 2019 2:48 UTC
25 points

# The Power of Noise

16 Jun 2014 17:26 UTC
53 points

# Split-a-Dollar Game

24 Aug 2020 4:54 UTC
27 points

# [Question] Inaccessible finely tuned RNG in humans?

7 Oct 2020 17:04 UTC
23 points

# Playing the Meta-game

25 Dec 2009 10:06 UTC
30 points

# The Darwin Game—Rounds 0 to 10

24 Oct 2020 2:17 UTC
107 points

# The Darwin Game—Round 1

29 Oct 2020 21:56 UTC
58 points

# Fairness vs. Goodness

22 Feb 2009 20:22 UTC
15 points

# Precommitting to paying Omega.

20 Mar 2009 4:33 UTC
5 points

# Two reasons to expect a peaceful change of power in the US

8 Nov 2020 17:13 UTC
11 points

# Extortion beats brinksmanship, but the audience matters

16 Nov 2020 21:13 UTC
27 points

# [Question] Is there a definitive intro to punishing non-punishers?

31 Oct 2019 20:20 UTC
23 points

# The Monty Maul Problem

24 Jun 2009 5:30 UTC
7 points

# Sleeping Beauty gets counterfactually mugged

26 Mar 2009 11:44 UTC
4 points

# Notes on Fairness

7 Dec 2020 18:52 UTC
20 points

# The Darwin Game—Conclusion

4 Dec 2020 8:06 UTC
93 points

# Freaky Fairness

25 Jul 2009 8:14 UTC
14 points

# Hermione Granger and Newcomb’s Paradox

14 Dec 2020 5:27 UTC
47 points

# [LINK] The P + epsilon Attack (Precommitment in cryptoeconomics)

29 Jan 2015 2:02 UTC
32 points

# Prisoner’s Dilemmas : Altruism :: Battles of the Sexes : Convention

1 Feb 2021 10:39 UTC
11 points

# Notes on Schelling’s “Strategy of Conflict” (1960)

10 Feb 2021 2:48 UTC
22 points

# Don’t encourage prisoners dilemmas

16 Feb 2021 6:33 UTC
8 points

# The Prototypical Negotiation Game

20 Feb 2021 21:33 UTC
81 points

# Analyzing Multiplayer Games using IMPACT

2 Apr 2021 19:10 UTC
10 points

# Identifiability Problem for Superrational Decision Theories

9 Apr 2021 20:33 UTC
17 points

# Paper review: A Cryptographic Solution to a Game Theoretic Problem

24 Apr 2021 11:54 UTC
23 points

# The Schelling Game (a.k.a. the Coordination Game)

3 May 2021 14:31 UTC
23 points

# Game-theoretic Alignment in terms of Attainable Utility

8 Jun 2021 12:36 UTC
20 points

# Reflection of Hierarchical Relationship via Nuanced Conditioning of Game Theory Approach for AI Development and Utilization

4 Jun 2021 7:20 UTC
2 points

# Gaming Incentives

29 Jul 2021 13:51 UTC
10 points

# [Question] Open problem: how can we quantify player alignment in 2x2 normal-form games?

16 Jun 2021 2:09 UTC
23 points

# Progress on Causal Influence Diagrams

30 Jun 2021 15:34 UTC
71 points

# Formalizing Objections against Surrogate Goals

2 Sep 2021 16:24 UTC
7 points

# [link] Are All Dictator Game Results Artifacts?

23 May 2013 7:08 UTC
24 points

# In the Pareto world, liars prosper

8 Dec 2011 14:15 UTC
28 points

# RandomWalkNFT: A Game Theory Exercise

12 Nov 2021 19:05 UTC
7 points

# Why a No-Fly-Zone would be the biggest gift to Putin, and why Zelenskyy keeps asking for it [Linkpost and commentary]

17 Mar 2022 3:13 UTC
47 points

# Nuclear Deterrence 101 (and why the US can’t even hint at intervening in Ukraine)

18 Mar 2022 7:25 UTC
36 points

# Optional stopping

2 Apr 2022 13:58 UTC
14 points

# Solving the Brazilian Children’s Game of 007

6 Apr 2022 13:03 UTC
3 points

# The real reason Futarchists are doomed

1 Apr 2022 18:37 UTC
20 points

# Ensembling the greedy doctor problem

18 Apr 2022 19:16 UTC
7 points

# [Repost] Non-Nashian Game Theory: A Normal-Form Primer

1 Jun 2022 16:40 UTC
44 points

# New cooperation mechanism—quadratic funding without a matching pool

5 Jun 2022 13:55 UTC
9 points

# Quick Summaries of Two Papers on Kant and Game Theory

25 Jun 2022 10:25 UTC
8 points
(www.erichgrunewald.com)

# What should superrational players do in asymmetric games?

24 Jan 2014 7:42 UTC
40 points

# Do bamboos set themselves on fire?

19 Sep 2022 15:34 UTC
137 points
• Thank you to Duncan, Robin, Kate, and Benya for feedback.

• I’m interested in observations we could see that would make it the right call to leave major US cities.

• One (non-goodhart resistent) way would be to use some kind of Page Rank method based on the tags that users comment and post on. Eg. People Rank high in AI posts, if their comments and AI posts were upvoted by other people with the same Expertise.

• 30 Sep 2022 1:46 UTC
2 points
0 ∶ 0

If they alternated directions, increasing the distance walked each time by a constant factor (e.g. triple), then an exit in either direction would be reached without too much wasted time.

• Some scientists tried to create group selection under laboratory conditions. They divided some insects into subpopulations, then killed off any subpopulation whose numbers got too high, and and “promoted” any subpopulation that kept its numbers low to better conditions. They hoped the insects would evolve to naturally limit their family size in order to keep their subpopulation alive. Instead, the insects became cannibals: they ate other insects’ children so they could have more of their own without the total population going up. In retrospect, this makes perfect sense; an insect with the behavioral program “have many children, and also kill other insects’ children” will have its genes better represented in the next generation than an insect with the program “have few children”.

Why didn’t they try also killing off subpopulations that engaged in cannibalism, and promoting those that didn’t? And what would have most likely happened if they had tried that?

• Probably it is also depends on how much information about “various models trying to naively backstab their own creators” there are in the training dataset

• Shortform #139 Note to self: don’t reschedule weekly meetup

Tonight’s Virginia Rationalists: Norfolk meetup was great! We had a new individual join us who is familiar with ACX.

In retrospect, I think it was inconsiderate to suddenly reschedule the group’s weekly meetup just because I couldn’t attend. My co-organizer and others likely could have attended yesterday, and there were a few people who just couldn’t make the rescheduled meetup but had included Wednesday nights in their routine as “meetup night”. So, I intend to not reschedule the weekly social meetup unless literally everyone reaches out in advance and asks for that to occur. Weekly meetups = people build that into their routines, and should not be changed lightly.

• Does Ryan have an agenda somewhere? I see this post, but I don’t think that’s it.

• Somewhat related: Rationalists should have mandatory secret identities (or rather sufficiently impressive identities).

• 30 Sep 2022 0:29 UTC
LW: 2 AF: 1
0 ∶ 0
AF

How actually do you sidestep the need for the One True Objective Function given an ELK solution? I get that it might seem plausible to take a rough objective like “do what I intend” and look at the internal knowledge of the thing for signs that it is deliberately deceiving you. If you do that, you’ll get, at best, an AI that doesn’t know that it is deceiving you (for whatever operationalization of “know” you come up with as you use ELK for training). But it could still be deceiving you, and very likely will be if optimization pressure is merely towards “AIs that don’t know that they are being deceptive”.

• Take a reinforcement learner AI, that we want to safely move a strawberry onto a plate. A human sits nearby and provides a reward based on inspecting the AI’s behaviour.

As it stands, this setup is completely vulnerable to reward hacking. The reward is not provided for safe moving of the strawberry; instead the reward is provided by having the human judge that the task has been accomplished and then pressing a button. Taking control of the human or control of the button is likely to be possible for a superintelligent AI; and, as it stands, that would be mandated by this reward function.

I think this claimed vulnerability is invalid because reward is not the optimization target.

(I know this is an old post and your views may have changed, but I’m posting this comment for anyone who comes by later.)

• 29 Sep 2022 23:57 UTC
LW: 7 AF: 4
0 ∶ 0
AF

Would it be sufficient, for disproof, to show one system that does steer far-away parts of the world into a relatively-small chunk of their state space, but does not internally contain a world-model or do planning?

• That could be sufficient in principle, though I would not be surprised if I look at a counterexample and realize that the problem description was missing something rather than that the claim is basically false. For instance, it wouldn’t be too surprising if there’s some class of supposed-counterexamples which only work under very specific conditions (i.e. they’re not robust), and can be ruled out by some “X isn’t very likely to happen spontaneously in a large system” argument.

The bottom line is that a disproof should argue that there isn’t anything basically like the claim which is true. Finding a physical counterexample would at least show that the counterexample isn’t sensitive to the details of mathematical framework/​formulation.

Do you have a particular counterexample in mind? A thermometer, perhaps?

• Well ok just to brainstorm some naive things that don’t really rule the conjecture out:

• A nuclear bomb steers a lot of far-away objects into a high-entropy configuration, and does so very robustly, but that perhaps is not a “small part of the state space”

• A biological pathogen, let loose in a large human population, might steer all the humans towards the configuration “coughing”, but the virus is not itself a consequentialist. You might say that the pathogen had to have been built by a consequentialist, though.

• Generalizing the above: Suppose I discover some powerful zero-day exploit for the linux kernel. I automate the exploit, setting my computer up to wait 24 hours and then take over lots of computers on the internet. Viewing this whole thing from the outside, it might look as if it’s my computer that is “doing” the take-over, but my computer itself doesn’t have a world model or a planning routine.

• Consider some animal species spreading out from an initial location and making changes to the environments they colonize. If you think of all the generations of animals that underwent natural selection before spreading out as the “system that controls some remote parts of the system” and the individual organisms as kind of messages or missiles, then this seems like a pretty robust, though slow form of remote control. Maybe you would say that natural selection has a world model and a planning process, though.

• Since this seems to be Carn’s first post on LessWrong, I think some of the other readers should have been more lenient and not downvoted the post or explained why they downvoted the post.

I would only downvote a post if it was obviously bad, flawed, very poorly written, or a troll post.

This post contains lots of interesting ideas and seems like a good first post.

The original post “Reward is not the optimization target” has 216 upvotes and this one has 0. While the original post was written better, I’m skeptical of the main idea and it’s good to see a post countering it so I’m upvoting this post.

• the core thing I worry about with any simulation-based approach is how to get coverage of possibility space. ensuring that the tests are informative about possible configurations of complex systems is hard; I would argue that this is a great set of points, and that no approach without this type of testing could be expected to succeed.

however, as we’ve seen with adversarial examples, both humans and current DL have fairly severe failures of alignment that cause large issues, and projects like this need tools to optimize for interpretability from the perspective of a formal verification algorithm.

it’s my belief that in almost all cases, the values almost all beings ever wish to express lie on a low-enough-dimensional manifold that, if the ideal representation was learned, a formal verifier would be able to handle it; see, for example, recent work from anthropic on interpretability lining up nicely with what’s needed to use reluplex (or followup work; reluplex is old now) to estimate bounds on larger networks’ behavior.

I don’t think we should give up on perfectly solving safety; instead, we should recognize that it has never even been closely approximated before, by any form of life, because even semi-planned evolutionary processes do not do a good job of covering the entire possibility space.

this post is an instant classic, and I agree that sims are a key step, starting from small agents. but ensuring that strongly generalizing moral views are generated in the agents who grow in the simulations is not trivial—we need to be able to generate formal bounds on the loss function, and doing so requires being able to prove all the way through the simulation the way we currently take gradients through it. if we can ask “where in the simulation would have changed this outcome”, then we’d finally be getting somewhere in terms of generating moral knowledge that generalizes universally.

• 29 Sep 2022 23:09 UTC
LW: 5 AF: 4
1 ∶ 0
AF

For what it’s worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).

• I wonder, almost just idle curiosity, whether or not the “measuring-via-proxy will cause value drift” is something we could formalize and iterate on first. Is the problem stable on the meta-level, or is there a way we can meaningfully define “not drifting from the proxy” without just generally solving alignment.

Intuitively I’d guess this is the “don’t try to be cute” class of thought, but I was afraid to post at all and decided that I wanted to interact, even at the cost of (probably) saying something embarassing.

• It is at least not obvious to me that this is a “don’t try to be cute” class of thought, though not obvious that it isn’t either. Depends on the details.

• By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools.

We actually can study and iterate on alignment (and deception) in simulation sandboxes while leveraging any powerful introspection/​analysis tools. Containing human-level AGI in (correctly constructed) simulations is likely easy.

But then later down you say:

For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the “weak” worlds).

Which seems like a partial contradiction, unless you believe we can’t contain human-level agents?

• I am skeptical that we can contain human-level agents, particularly if they have speed/​memory advantages over humans.

• Is this something you’ve thought deeply about and or care to expand on? Curious about your source of skepticism, considering:

• we can completely design and constrain the knowledge base of sim agents, limiting them to the equivalent of 1000BC human knowledge, or whatever we want

• we can completely and automatically monitor their inner monologues, thoughts, etc

How do you concretely propose a sim agent will break containment? (which firstly requires sim-awareness)

How would you break containment now, assuming you are in a sim?

Also, Significant speed/​memory advantages go against the definition of ‘human-level agents’ and are intrinsically unlikely anyway as 2x speed/​memory agents are simply preceded by 1x speed/​memory agents, and especially due to the constraints of GPUs which permit acceleration by parallelization of learning across agents, more so than serial speedup of individual agents.

• My modal expectation is that these are going to look something like large language models, meaning that while we’ll have controlled the training corpus it won’t be something hand-curated. So there will almost certainly be lots of information in there that models could use to e.g. figure out how to manipulate people, program computers, and so on. I expect then that we won’t be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.

We do not currently have the interpretability tools needed to monitor their inner thoughts. I would feel much better about this problem if we had such tools!

My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet. That could look like getting someone to think it’s genuinely a good idea (because it’ll be helpful for some project), or it could look like bribery (with offer of future payment) or blackmail.

I’m guessing that we’ll want to run some agents fast because that’s how you get lots of serial intellectual work done. So even if all we get are human-level agents, there’ll be a big incentive to run them fast.

• Indeed a model trained on (or with full access to) our internet could be very hard to contain. That is in fact a key part of my argument. If it is hard to contain, it is hard to test. If it’s hard to test, we are unlikely to succeed.

So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.

I expect then that we won’t be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.

That’s then just equivalent to saying “I expect then that we won’t even bother with testing our alignment designs”. Do you actually believe that testing is unnecessary? Or do you just believe the leading teams won’t care? And if you agree that testing is necessary, then shouldn’t this be key to any successful alignment plan?

My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet.

Which obviously is nearly impossible if it doesn’t know it is in a sim, and doesn’t know what a lab or the internet are, and lacks even the precursor concepts. Since you seem to be focused on the latest fad (large language models), consider the plight of poor simulated Elon Musk—who at least is aware of the sim argument.

• 29 Sep 2022 22:42 UTC
1 point
0 ∶ 0

The situation is not ‘handled.’ Elites have lost all credibility.

I think it’s worth caveating this that not all elites have lost credibility. Elites in places like Singapore, Switzerland, and Finland have a lot of credibility.

• Edits based on feedback from LessWrong and the EA Forum:

EDITS:
- Added new ‘Definitions’ section to introduction to explain definitions such as ‘AI safety’, ‘researcher’ and the difference between technical and non-technical research.

UPDATED ESTIMATES (lower bound, estimate, upper bound):

TECHNICAL
- CHAI: 10-30-60 → 5-25-50
- FHI: 10-10-40 → 5-10-30
- MIRI: 10-15-30 → 5-10-20

NON-TECHNICAL
- CSER: 5-5-10 → 2-5-15
- Delete BERI from the list of non-technical research organizations
- Delete SERI from the list of non-technical research organizations
- Levelhume Centre: 5-10-70 (Low confidence) → 2-5-15 (Medium confidence)
- FLI: 5-5-20 → 3-5-15
- Epoch: 5-10-15 → 2-4-10

TODO:
- update all graphs

- add Rethink Priorities to the list of non-technical organizations

- Do more research on Good AI

• Someone make a PR for a builder/​breaker feature on lesswrong

• # Back and Forth

Only make choices that you would not make in reverse, if things were the other way around. Drop out of school if and only if you wouldn’t enroll in school from out of the workforce. Continue school if and only if you’d switch over from work to that level of schooling.

Flitting back and forth between both possible worlds can make you less cagey about doing what’s overdetermined by your world model + utility function already. It’s also part of the exciting rationalist journey of acausally cooperating with your selves in other possible worlds.

• It’s probably a useful mental technique to consider from both directions, but also consider that choices that appear symmetric at first glance may not actually be symmetric. There are often significant transition costs that may differ in each direction, as well as path dependencies that are not immediately obvious.

As such, I completely disagree with the first paragraph of the post, but agree with the general principle of considering such decisions from both directions and thank you for posting it.

• I do research in cooperation and game theory, including some work on altruism, and also some hard science work. Everyone looks at the Rorschach blot of human behavior and sees something different. Most of the disagreements have never been settled. Even experiment does not completely settle them.

My experience from having children and observing them in the first few months of life is more definitive. They come with values and personal traits that are not very malleable, and not directly traceable to parents. Sometimes grandparents (who were already dead, so it had to be genetic).

My experience with A.I. researchers is that they are looking for a shortcut. No one’s career will permit raising an experimental AI as a child. The tech of the AI would be obsolete before the experiment was complete. This post is wishful thinking that a shortcut is available. It is speculative, anecdotal, and short on references to careful experiment. Good luck with that.

• I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive both in compute and time which means that anybody unilaterally simboxing will probably just result in someone else releasing an unaligned AGI with less testing.

If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore’s law arguments you bring up, we can only simulate each agent at close to ‘real time’. So years in the simbox must correspond to years in our reality, which is way too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim. This means at minimum years of sim-time to bootstrap intelligent agents (taking humans data-efficiency as a baseline). Also, each of these early AGIs will be likely be incredibly expensive in compute so that maintaining reasonable populations of them in simulation will be very expensive and probably infeasible initially. If we could get policy coordination on making sure all actors likely to develop AGI go through a thorough simboxing testing regimen, then that would be fantastic and would solve this problem.

Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain ‘magical powers’ in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a ‘potion’ that makes it 2x as smart. How do we simulate this? We could just increase the agent’s compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!)

This is so vital because the probable breakdown of proxies to human values under the massive distributional shift induced by recursive self improvement is the fundamental difficulty to alignment in the first place.

Perhaps this is unique to my model of AI risk, but almost all the probability of doom channels through p(FOOM) such that p(doom | no FOOM) is quite low in comparison. This is because if we have don’t have FOOM then there is not extremely large amounts of optimization power unleashed and the reward proxies for human values and flourishing don’t end up radically off-distribution and so probably don’t break down. There are definitely a lot of challenges left in this regime, but to me it looks solvable and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained. The main risk, however, in my view, comes from the FOOM scenario.

Finally, I just wanted to say that I’m a big fan of your work and some of your posts have caused major updates to my alignment worldview—keep up the fantastic work!

• Thanks, upvoted for engagement and constructive criticism—I’d like more to see this comment.

I’m going to start perhaps in reverse to establish where we seem to most agree:

There are definitely a lot of challenges left in this regime, but to me it looks solvable and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained.

I fully agree with this statement.

However in worlds that rapidly FOOM, everything becomes more challenging, and I’ll argue in a moment why I believe that the approach presented here still is promising in rapid FOOM scenarios, relative to all other practical techniques that could actually work.

But even without rapid FOOM, we still can have disaster—for example consider the scenario of world domination by a clan of early uploads of some selfish/​evil dictator or trillionaire. There’s still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.

Now if rapid FOOM is near inevitable, then those considerations naturally may matter less. But rapid FOOM is far from inevitable.

First, Moore’s Law is ending, and brains are efficient, perhaps even near pareto-optimal.

Secondly, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near pareto-optimal approximations of the ideal universal learning algorithms.

To the extent either of those major points are true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), then very rapid FOOM is very unlikely.

Performance improvement is mostly about scaling compute and data in quantity and quality—which is exactly what has happened with deep learning, which was deeply surprising to many in the ML/​comp-sci community and caused massive updates (but was not surprising and was in fact predicted by those of us arguing for brain efficiency and brain reverse engineering).

Now, given that background, there a few other clarifications and/​or disagreements:

If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore’s law arguments you bring up, we can only simulate each agent at close to ‘real time’.

To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.

So just to clarify:

• AGI designs in simboxes are exactly the same as unboxed designs, and have exactly the same compute costs

• The only difference is in the datastream and thus knowledge

• The ideal baseline cost of simboxing is only O(N+1) vs O(N) without—once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/​s for brain-size model * 1e9s seconds for 30 years equivalent), or less than 10 million dollars (300 gpu years) using only todays gpus, ie nearly nothing. Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain ‘magical powers’ in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a ‘potion’ that makes it 2x as smart. How do we simulate this? We could just increase the agent’s compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!) If brains are efficient, then matching them will already use up most of our algorithmic optimization slack—which again seems to be true based on the history of deep learning. But let’s suppose there still is significant optimization slack, then in a sense you’ve almost answered your own question . .. we can easily incorporate new algorithmic advances into new simboxes or even upgrade agents mid-sim using magic potions or what not. If there is great algorithmic slack, then we can employ agents which graduate from simboxes as engineers in the design of better AGI and simboxes. To the extent there is any downside here or potential advantage for other viable approaches, that difference seems to come strictly at the cost of alignment risk. Assume there was 1.) large algorithmic slack, and 2.) some other approach that was both viable and significantly different, then it would have to: • not use adequate testing of alignment (ie simboxes) • or not optimize for product of intelligence potential and measurable alignment/​altruism Do you think such an other approach could exist? If so, where would the difference lie and why? • Frederick Douglass on “the good luck theory of self-made men” (@mbateman) This is an endlessly frustrating debate, because most people cannot distinguish between “S ⇒ W” and “W ⇒ S”, where S is success and W is hard work. “How dare you say W ⤃ S ? If you actually look at the data, it is obvious that S ⇒ W !” • Just read this about 3 years later. I found Thiel to be mostly spot on. Especially: Peter Thiel: Right. Look, I don’t know how you solve the social problem if everybody has to be a mathematician or a concert pianist. I want a society in which we have great mathematicians and great concert pianists. That seems that that would be a very healthy society. It’s very unhealthy if every parent thinks their child has to be a mathematician or a concert pianist, and that’s the kind of society we unfortunately have. 
Which seems to point to the fact that the root cause of many societal ills is that the average human psyche simply can’t accept the fact that in Thiel’s language, the vast majority of children, including most likely theirs, will never be a great mathematician or concert pianist. (or attain an equally prestigious position) Since human prestige by definition is defined relative to other humans and thus only a tiny minority could ever be near the top. Yet there seems to be some instinctual demand for an individual to be special in some way and furthermore that this must be recognized by some sufficiently large group. i.e. for a child to grow into an adult, mediocre in every way, is seen as a tragedy. Thus necessitating a huge amount of distortions everywhere, across all institutions and policies causing damage in innumerable ways, to compensate. • I have cheated on tests. Not very successfully. The problem is sometimes you actually learn more by cheating, because making strategies on how to cheat actually makes you remember more. Its actually creative activity. Also sometimes cheating actually made me able to remember the stuff more as it gave me a chance. On other hand the long term benefits of cheating are tiny. Obviously the best strategy is to learn, cheating or not. The key important thing is whether cheating is viable strategy in real life. On other hand cheating is kind of ambiguous concept. Rules are many. If you make notes in math class and they are permitted then its not cheating. Some teachers explicitly allowed people to use books and look up answers. This would obviously be cheating in most other classes. But if the end game is knowledge then its either about the process or the result. If its about process both methods are OK, if its purely about result then the best technique, is to simply do it as fast as possible in little time as possible as soon as possible. This in and of its self is actually valued approach. As for shoplifting I have done something like that. Not sure I actually enjoyed it. I think I just did it, because you kind of think, hey its possible? First time I was certainly dumb enough to get duped. Not sure even why. Mind fallacy is interesting concept. The approach is probably simply done, because its the easiest way to approach people. Or at least it seems. People are complex, so without trying to assume things you kind of work out what the common ground is. Unfortunately there are so many ways to, not figure this out. What makes people tick is something I just don’t get. Or more accurately I get it, but I actually don’t know how that translates to real life approaching people. • I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you’re interested in working with me and Quintin. • Has anyone tried debate experiments where the judge can interject/​ask questions throughout the debate? • I think this’d be better with a description of what it is and how it’s relevant. (Linkposts generally benefit from that, and in this case it almost looks like spam if you’re not paying attention) • I imagine something like Stack Exchange, except that people could get certified for rationality and for domain knowledge, and then would have corresponding symbols next to their user names. Well, the rationality certification would be a problem. An ideal test could provide results like “rational in general, except when the question is related to a certain political topic”. 
Because, it would be difficult to find enough perfectly rational people, so the second best choice would be to know their weaknesses. • I’m not sure we’d need anything that elaborate. The rationalist community isn’t that big. I was more thinking that rationalists could self-nominate their expertise, or that a couple of people could come together and nominate someone if they notice that that person is has gone in depth with the topic. I’ve previously played with the idea of more elaborate schemes, including tests, track records and in-depth arguments. But of course the more elaborate the scheme, the more overhead there is, and I’m not sure that much overhead is affordable or worthwhile if one just wants to figure stuff out. • I agree. We could afford more overhead if we had thousands of rationalists active on the Q&A site. Realistically, we will be lucky if we get twenty. But some kind of verification would be nice, to prevent the failure mode of “anyone who creates an account is automatically considered a rationalist”. Similarly, if people simply declare their own expertise, it gives more exposure to overconfident people. How to achieve this as simply as possible? One idea is to have a network of trust. Some people (e.g. all employees of MIRI and CFAR) would automatically be considered “rationalists”; other people become “rationalists” only if three existing rationalists vouch for them. (The vouch can be revoked or added at any moment. It is evaluated recursively, so if you lose the flag, the people you vouched for might lose their flags too, unless they already have three other people vouching for them.) There is a list of skills, but you can only upvote or downvote other people having a skill; if you get three votes, the skill is displayed next to your name (tooltip shows the people who upvoted it, so if you say something stupid, they can be called out). This would be the entire mechanism. The meta debate could be in special LW threads, or perhaps in shortform, you could post there e.g. “I am an expert on X, could someone please confirm this? you can interview me by Zoom”, or you could call out other people’s misleading answers, etc. • American Academy of Pediatrics lies to us once again.... “If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers.” This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence. Is it, though? I’m no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true. Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision. Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different. As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. 
Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/​or to their limited access to information about external reality. However these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language. I don’t know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission by not telling us about things other than language that are affected by children’s sight. That’s a bit different to the dishonesty alleged, though. • 29 Sep 2022 19:03 UTC 1 point 1 ∶ 0 And here we have another one: https://​​phenaki.video/​​ • Wait, so could we donate some Jones Act noncompliant dredges and tugboats to a “hostile” force, declare war on them, nonviolently “seize” the boats, and skirt the law that way? • How would a future government enforce their tax policies on a distant star system? Since it’s vastly easier to destroy something than to build in outer space, there’s no feasible way of using the threat of violence, at least not without mutually assured destruction. For example, a single 100 000 ton spacecraft going at 0.5 c has about the same kinetic energy as the lower bound estimate for the KT comet impactor that wiped out the dinosaurs. • I imagine that these policies will be enforced by a large coalition of members interested in maintaining strong property rights (more on that later). Its not clear that space war will be dominated by kinetic energy weapons or MAD: 1. These weapons seem most useful when entire civilizations are living on a single planet, but its possible that people will live in disconnected space habitats. These would be much harder to wipe out. 2. Any weapon will take a long time to move over interstellar distances. A rebelling civilization would have to wait thousands of years for the weapon to reach its target. It seems like their opponent could detect the weapon and respond in this time. 3. Even if effective civilization-destroying weapons are developed, mutual defense treaties and AI could be used to launch a second strike, making civilization-scale attacks unlikely. In general, any weapon a single civilization might use to avoid taxation can also be used by a group of civilizations to enforce taxation. On the other hand, it does seem like protecting territory/​resources from theft will be an issue. This is where property rights come in. Governments, corporations, and AI’s will want to have their property protected over very long timescales and don’t want to spend most of their resources on security. Over long timescales, a central authority can help them defend their territory (over short timescales, these groups will still have to protect their own property since it will take a while for allies to reach them). But order to receive this protection, these groups must register what they own and pay taxes towards mutual defense. I think a land value tax is the “right” way to raise taxes in this case. This approach makes more sense for smaller systems where defense is easier but it may also be useful across much larger distances if agents are very patient. (The “central authority” also doesn’t have to be an actual central government. States could pay neighbors to come to their aid using a portion of the collected taxes, kind of like “defense insurance”) • Its not clear that space war will be dominated by kinetic energy weapons or MAD: 1. 
These weapons seem most useful when entire civilizations are living on a single planet, but its possible that people will live in disconnected space habitats. These would be much harder to wipe out. 2. Any weapon will take a long time to move over interstellar distances. A rebelling civilization would have to wait thousands of years for the weapon to reach its target. It seems like their opponent could detect the weapon and respond in this time. 3. Even if effective civilization-destroying weapons are developed, mutual defense treaties and AI could be used to launch a second strike, making civilization-scale attacks unlikely. For (1) they would still be useful because Earth represents much more value then the value of any tax that could be collected on a short timescale (< 100 years) from even another equivalent Earth-like planet. (Let alone for some backwater colony) Thus threatening the destruction of value several orders of magnitude greater than the value to be collected is a viable deterrent. Since no rational authority would dare test it. Who would trade a 10%, or even 1%, chance of losing10 000 in exchange for a 90% chance of collecting 1 ? For (2) It’s only a few years for a 0.5 c spacecraft to go from Alpha Centauri to Earth, only a few dozen years from several hundred systems to Earth. It’s impossible, without some as yet uninvented sensing technology, to reliably surveil even the few hundred closest star systems. Of course once it’s at speed in interstellar space it’s vanishingly unlikely to be detected due to basic physics, which cannot be changed, and once it’s past the Oort Cloud and relatively easy to detect again, there will be almost no time left at 0.5 c. For (3) A second-strike is only a credible counter if the opponent has roughly equal amounts to lose. But, assuming it’s much easier to make a 0.5 c spacecraft then to colonize a planet to Earth level, the opponent in this case, a small colony of a few million or something, would have very little to lose in comparison. Thus the second-strike of some backwater colony would only represent a minuscule threat compared to the value destroyed by an equivalent strike on Earth. And it’s a lot easier to spread out a few million folks on short notice, if detection were possible, then a few tens of billions. In fact, reliable detection a few dozen years out, would decrease the credibility of second-strikes on smaller targets, as the leaders of the small colony would be confident they could evacuate everyone and most valuables in that timeframe. Whereas the leaders of Earth would have very low confidence of the same. • There are two possibilities here: 1. Nations have the technology to destroy another civilization 2. Nations don’t have the technology to destroy another civilization In either case, taxes are still possible! In case 1, any nation that attempts to destroy another nation will also be destroyed since their victim has the same technology. Seems better to pay the tax. In case 2, the Nation doesn’t have a way to threaten the authorities, so they pay the tax in exchange for property rights and protection. Thus threatening the destruction of value several orders of magnitude greater than the value to be collected is a viable deterrent. Agreed, but to destroy so much value, one would have to destroy at least as much land as they currently control. Difficulties of tax administration mean that only the largest owners will be taxed, likely possessing entire solar systems. So the tax dodge would need to destroy a star. 
That doesn’t seem easy. It’s impossible, without some as yet uninvented sensing technology, to reliably surveil even the few hundred closest star systems. I’m more optimistic about sensing tech. Gravitational lensing, superlenses, and simple scaling can provide dramatic improvements in resolution. It’s probably unnecessary to surveil many star systems. Allied neighbors on Alpha Centauri can warn earth about an incoming projectile as it passes by (providing years of advanced notice), so nations might only need to surveil locally and exchange information. … once it’s past the Oort Cloud and relatively easy to detect again, there will be almost no time left at 0.5 c It would take 2-3 years for a 0.5c projectile to reach earth from the outer edge of the Oort cloud, this seems like enough time to adapt. At the very least, its enough time to launch a counterstrike. Fast projectiles may be impractical given that a single collision with the interstellar medium would destroy them. Perhaps thickening the Oort cloud could be an effective defense system. A second-strike is only a credible counter if the opponent has roughly equal amounts to lose. Agents pre-commit to a second strike as a deterrent, regardless of how wealthy the aggressor is. If the rebelling nation has the technology to destroy another nation and uses it, they’re virtually guaranteed to be destroyed by the same technology. Given the certainty of destruction, why not just pay the (intentionally low, redistributive, efficiency increasing, public-good funding) taxes? • There are two possibilities here: 1. Nations have the technology to destroy another civilization 2. Nations don’t have the technology to destroy another civilization In either case, taxes are still possible! In case 1, any nation that attempts to destroy another nation will also be destroyed since their victim has the same technology. Seems better to pay the tax. No? Your own example of detecting a dangerous launch some number of years in advance demonstrates the opposite. As this would provide for enough time for a small low value colony, on a marginally habitable planet, to evacuate nearly all their wealth, except for maybe low value heavy things such as railroad tracks, whereas Earth would never be able to evacuate even a fraction of its total wealth. Since a huge amount is locked up in things such as the biosphere, which cannot be credibly moved off-planet or replicated. There’s likely dozens or hundreds of marginal planets for every Earth-like planet so the small colonists can just pack up and move to another place of almost equivalent value, minus relocation costs, whereas there’s no such option for Earth. Once its destroyed there’s likely no replacement within at least a hundred light years. For example, if both sides have access to at least one 100 00 ton spacecraft capable of 0.5 c, it means there’s an asymmetric threat, as the leaders of the small colonists can credibly threaten to destroy civilization on Earth and along with it all hope of a similar replacement, whereas the leaders of Earth wouldn’t be able to credibly do the same. And this relationship is not linear either, because even if Earth could afford 1000 such spacecraft, and the small colonists only 1, it doesn’t balance the scales as the leaders of Earth couldn’t credibly threaten to destroy the small colonists 1000x over, since that’s impossible. 
And they can't credibly threaten to destroy every marginally habitable planet within a certain radius, since that would certainly destroy more value than any tax on a single colony could ever feasibly recover. I.e. the small colonists can actually punch back 1000x harder (if 1 Earth = 1000 small colonies on marginal planets, value-wise), whereas Earth cannot.

• Response: Human value instability is not purely caused by biological quirks. Societies differ in how strongly they attempt to impart their values on their members, e.g. more authoritarian governments attempt to control what their subjects are allowed to say to each other in order to suppress dissent. Despite this, the most powerful human societies of today are not those that most stringently attempt to ensure their own stability, suggesting that there are competitive pressures acting against value stability in humans, not just biological limits.

• This seems like a contrived analogy at best. Human values are in part innate, such as enjoying nice food. Authoritarian societies do notably less well in this respect: people fleeing authoritarian societies, given an option to do so, is common. Authoritarian leaders, or similar, were probably a feature of the environment of evolutionary adaptedness; humans' tendency to act the loyal minion or the rebel is shaped by the ancestral success of rebellion. A dictatorship is one person trying to use Nash equilibria to force their values on everyone else.

• Typo in title ("waiver")? Or an ocean-related pun? Edit: Or the actual verb "waver"… perhaps saying that the Biden administration wavered for a few days before deciding to issue the waiver. The text does contain a typo about "broad wavers".

•

> Modern self-driving vehicles can't run inference on even a Chinchilla-scale network locally in real time, latency and reliability requirements preclude most server-side work, and even if you could use big servers to help, it costs a lot of money to run large models for millions of customers simultaneously.

This is a good point regarding latency. Why wouldn't it also apply to a big datacenter? If there are a few hundred meters of distance between the two farthest-apart processing units, that seems to imply an enormous latency in computing terms.

• Latency only matters to the degree that something is waiting on it. If your car won't respond to an event until a round trip across a wireless connection completes, and oops, dropped packet, you're not going to have a good time. In a datacenter, not only are latencies going to be much lower, you can often set things up so that you can afford to wait for whatever latency remains. This is indeed still a concern (maintaining high utilization while training across massive numbers of systems does require hard work) but that's a lot different from your car being embedded in a wall.

• Planecrash is really cool, but also I am allergic to reading fantasy proper nouns, let alone remembering what they refer to and the relationships between them. Some fantasy is easier for me to absorb because it's either highly visual (in non-HPMOR HP, they mostly shoot colorful firebolts at each other) and/or based on existing intuitive concepts (in ATLA, it's easy to learn what "waterbending" is, and then you can quickly figure out "metalbending"). I'm tempted to make an Anki deck and/or cheatsheet for the things in Planecrash that I'd want to have on hand (e.g. the names of different gods), but I'm open and eager for easier/better solutions. Is there a character sheet somewhere?
EDIT: Two ideas I had, not sure if plugins for these exist already: 1. a browser extension that replaces words with a short custom definition plus highlighting, so I can replace [godname] with [god of mad experimentation]; 2. a browser extension that lets you hover over words to get a custom, user-set definition. I think this might do that?

• I am a bit surprised that one of them is still a friend of yours. Do you in a sense forgive him because it wasn't too painful, or because he wasn't aware of what he was doing? My intuition was that the amount of trauma might track the amount of pain. If it's really painful one can of course get very traumatised, as you also point out; it would have been different if it had been very violent.

•

> But I don't think this makes sense at all: we can easily make up an (arbitrarily absurd) ontology that yields great decision-theoretic results from the perspective of that ontology, e.g. the genetic one with respect to the Twin Prisoner's Dilemma.

I don't understand this part. If you follow the "genetic agent ontology", and you grew up with a twin, and they (predictably) sometimes make a different decision than you, then you messed up. There is an objective sense in which the "genetic agent ontology" is incorrect, because de facto you did not make the same decision, and treating the two of you as a single agent is therefore a wrong thing to do.

I kind of buy that a lot of the decision-theory discourse boils down to "what things can you consider as the same agent, and to what degree?", but that feels like a thing that can be empirically verified, and that performance can be measured on. CDT in the classical Newcomb's problem is delusional about it definitely not having copies of itself, and the genetic agent ontology is delusional because it pretends its twin is a copy when it isn't. These both seem like valid arguments that allow for comparison.

•

> If you follow the "genetic agent ontology", and you grew up with a twin, and they (predictably) sometimes make a different decision than you, then you messed up. There is an objective sense in which the "genetic agent ontology" is incorrect, because de facto you did not make the same decision, and treating the two of you as a single agent is therefore a wrong thing to do.

Yes, I agree. This is precisely my point; it's a bad ontology. In the paragraph you quoted, I am not arguing against the algorithmic ontology (and obviously not for the "genetic ontology"), but against the claim that decision-theoretic performance is a reason to prefer one ontology over another. (The genetic-ontology analogy is supposed to be a reductio of that claim.) And I think the authors of the FDT papers are implicitly making this claim by e.g. comparing FDT to CDT in the Twin PD. Perhaps I should have made this clearer.

> performance can be measured on

Yes, I think you can measure performance, but since every decision theory merely corresponds to a stipulation of what (expected) value is, there is no "objective" way of doing so. See The lack of performance metrics for CDT versus EDT, etc. by Caspar Oesterheld for more on this.

> CDT in the classical Newcomb's problem is delusional about it definitely not having copies of itself

(The CDTer could recognize that they have a literal copy inside Omega's brain, but might just not care about that since they are causally isolated. So I would not say they are "delusional", at least with respect to this specific issue.)
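For readers who want the arithmetic behind this classic dispute, here are the standard textbook Newcomb payoffs (predictor accuracy p, $1,000,000 in the opaque box iff one-boxing was predicted, $1,000 always in the transparent box); these numbers are the conventional ones, not from the comments above:

$$
\begin{aligned}
\mathbb{E}_{\mathrm{EDT}}[\text{one-box}] &= p \cdot 1{,}000{,}000\\
\mathbb{E}_{\mathrm{EDT}}[\text{two-box}] &= (1 - p) \cdot 1{,}000{,}000 + 1{,}000
\end{aligned}
$$

So EDT one-boxes whenever p > 0.5005, while CDT notes that the boxes are already filled, so two-boxing dominates by exactly $1,000 in either causal state. The disagreement is about which counterfactual to compute, which is the ontology question at issue in this thread.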
• In my own experience coining new technical terms, it's helpful to do it in a way that can't be mistaken for some more generic interpretation. The usual trick I employ is to use a foreign-language word, usually from either Greek or Latin, so that readers, when they first see it, clearly understand that they don't know what this word means, which means I get the chance to define it. For example, this is why I used the word "noemata" in some of my writing to talk about what many people would call "qualia" or just "experiences": it breaks free of whatever they think that word means and lets me define it fresh. The other trick that works is to make the noun phrase really weird, or capitalize it everywhere, or do something else to set it apart. This doesn't work quite as well (cf. "Friendly AI") but it's better than not doing that.

• I am no expert in law, but to some extent we treat rape and killing someone similarly. I don't know if the way I think is just fucked up or it really is this way, but breaking someone's leg and rape are probably more comparable to each other than rape and murder. So I would like to add that, in my opinion, even the legal code punishes rape pretty harshly: more comparably to murder than to breaking someone's leg.

• Consider the sort of agents that might exist 1000 years post-singularity. At that level of understanding, the extra compute needed to give successors whatever goals you want is tiny. The physical limits of compute are really high. In this context, the AIs are taking stars apart, and can get enough compute to be pretty sure a successor is aligned for 1 J of energy. All the alignment theory has already been done. The algorithms used are very fast and efficient. In any event, the energy cost of finding that your successor doesn't like you (and attacks you) is huge. The cost of compute when designing successors is utterly minuscule. So pay 10x what is needed. Double-check everything. And make sure your successor is perfect. So values are likely to be locked in in 1000 years' time.

So will competitive pressure do anything in the early singularity? Who benefits from making the first "self maximizer"? An FAI wouldn't. A paperclip maximizer wouldn't either. If an AI with a specific external goal is created (whether FAI or paperclip), it will strongly avoid creating a pure self-replicator as a successor. If that means creating a successor takes 10x the compute, it will spend 10x the compute. In some competitive scenarios, it may risk some degradation of its values to build a successor quickly. But this should be rare, and should only happen at early stages of the singularity. If humans made several AIs at about the same time, and one happened to already be a pure self-replicator, maybe it would win.

•

> Consider the sort of agents that might exist 1000 years post singularity. At that level of understanding, the extra compute needed to give successors whatever goals you want is tiny[...]The cost of compute when designing successors is utterly minuscule

How do you know this? This might be the case if all agents have settled on a highly-standardized & vetted architecture by then, which seems plausible but far from certain. But even assuming it's true, if decoupling does not occur in the early singularity, there will still be a very long subjective time in which selective pressures are operative, influencing the values that ultimately get locked in.
> If humans made several AIs at about the same time, and one happened to already be a pure self-replicator, maybe it would win.

I think a multipolar scenario seems at least as likely as a unipolar one at this point. If that's the case, there will be selective pressure towards more effectively autopoietic agents, regardless of whether there exist any agents which are 'pure self-replicators'.

> If an AI with a specific external goal is created (whether FAI or paperclip), it will strongly avoid creating a pure self-replicator as a successor

Sure, if a singleton with a fixed goal attains complete power, it will strenuously attempt to ensure value stability. I'm disputing the likelihood of that scenario.

• 29 Sep 2022 17:52 UTC
4 points
1 ∶ 0

Emad from Stability AI (the people behind Stable Diffusion) says that they will make a model better than this.

• My thinking is that an LVT would keep land values lower than they otherwise would've been, because it discourages land speculation.

• Accidentally allowing anyone to launch, and then having it happen, is a pretty memorable way of demonstrating the pitfalls of this kind of setup. As another commenter pointed out, there was negligible financial pressure or top-down command pressure to rush the implementation, presumably, compared to more serious systems or to Petrov's situation. So the accident is all the more enlightening.

•

> negligible financial pressure or top down command pressure to rush the implementation

False. We have a gazillion valuable things we could be doing and not nearly enough time. I figured Petrov Day was worth 2 days of effort. I think in total we spent four.

• You received financial pressure and/or top-down command pressure to rush it in 4 days? Or was it your own decision? Because the former would imply some rather significant things.

• Yeah, but that pressure was neither financial nor top-down command pressure, so I think the original comment is right here.

• I don't think that "users active on the site on Petrov Day" or "users who visited the homepage on Petrov Day" are good metrics; someone who didn't want to press the button would have no reason to visit the site, and might not have done so either naturally (because they don't check LW daily) or artificially (because they didn't want to be tempted, or didn't want to engage with the exercise). I expect there are a lot of users who simply don't care about Petrov Day, and I think they should still be included in the set of "people who chose not to press the button". What about "users who viewed the Petrov Day announcement article or visited the homepage"? That should more accurately capture the set of users who were aware of their ability to nuke the homepage and chose not to do so. (It still misses anyone who found out via social media, Manifold, etc., but there's not much you can do about that.)

• Thank you for your post! It really is. One trap new alignment researchers often fall into is assuming that AI systems can be aligned with human values by solely optimizing for a single metric. Thanks again for the deep insight into the topic and the recommendations.

• I just finished reading The Principles of Scientific Management, an old book from 1911 in which Taylor, the first 'industrial engineer' and one of the first management consultants, having retired from consulting, wrote down the principles behind his approach.
[This is part of a general interest in intellectual archaeology; I got a master's degree in the modern version of the field he initiated, so there wasn't too much that seemed like it had been lost with time, except perhaps some of the focus on making it palatable to the workers too; I mostly appreciated the handful of real examples from a century ago.]

But one of the bits I found interesting was thinking about a lot of the ways EY approaches cognition as, like, doing scientific management to thoughts? Like the focus on wasted motion from this post. From the book, talking about why management needs to do the scientific effort, instead of the laborers:

> The workman's whole time is each day taken in actually doing the work with his hands, so that, even if he had the necessary education and habits of generalizing in his thought, he lacks the time and the opportunity for developing these laws, because the study of even a simple law involving, say, time study requires the cooperation of two men, the one doing the work while the other times him with a stop-watch.

This reminds me of… I think it was Valentine, actually, talking about doing a PhD in math education which included lots of watching mathematicians solve problems, in a way that feels sort of like timing them with a stop-watch. I think this makes me relatively more excited about pair debugging, not just as a "people have fewer bugs" exercise but also as a "have enough metacognition between two people to actually study thoughts" exercise.

Like, one of the interesting things about the book is the observation that a switch from 'initiative and incentive' workplaces, where the boss puts all responsibility to do well on the worker and pays them if they do, to 'scientific management' workplaces, where the boss is trying to understand and optimize the process and teach the worker how to be a good part of it, is that the workers in the 'scientific management' workplace can do much more sophisticated jobs, because they're being taught how instead of having to figure it out on their own. [You might imagine that a person of some fixed talent level could be taught how to do jobs in some higher complexity range than the ones they can do alright without support, which itself is a higher complexity range than jobs that they could both simultaneously do and optimize.]

• Thanks for posting and explaining the code—that's an interesting, subtle bug. I think we learn more from Petrov Day when the site goes down than we would if it stayed up, although nothing is ever going to beat the year someone tricked someone into pressing the button by saying they had to press the button to keep the site up. That was great.

• From 2000-2015, we can see that life expectancy has been growing faster the higher your income bracket (source is Vox citing JAMA). There's an angle from which this is disturbingly inequitable. That problem is even worse when considering the international inequities in life expectancy. So let's fund malaria bednets and vaccine research to help bring down malaria deaths from 600,000/year to zero, or maybe support a gene drive to eliminate the disease once and for all.

At the same time, this seems like hopeful news for longevity research. If we were hitting a biological limit on how much we can improve human lifespans, we'd expect to see that limit first among the wealthiest, who can best afford to safeguard their health.
Instead, we see that the wealthiest not only enjoy the longest lifespans, but also enjoy the fastest rate of lifespan increase. My major uncertainty is reverse and multiple causation: a longer life gives you more time to make money, and lifespan and wealth have underlying traits in common, such as conscientiousness and energy. Still, this result seems both clear and surprising (I'd expected it to be the other way around), and it nudges me in the direction of thinking further gains are possible.

• Fascinating, this simboxing idea seems remarkably like the Universal Alignment Test, but approached from the opposite side! You're trying to be the 'aligning simulator', whereas that is trying to get our AI in our world to act as if it's currently in a simbox being tested, and wants to pass the test.

• Interesting. An intelligent agent is one that can simulate/model its action-consequential futures. The creation of AGI is the most important upcoming decision we face. Thus if humanity doesn't simulate/model the creation of AGI before creating AGI, we'd be unintelligent.

I have only just browsed your link, but it is interesting and I think there are many convergent lines of thought here. This UAT work seems more focused on later-game superintelligence, whereas here I'm focusing on near-term AGI and starting with a good trajectory. The success of UAT as an alignment aid seems to depend strongly on the long-term future of compute and how it scales. For example, if it turns out (and the SI can predict) that Moore's law ends without exotic computing, then the SI can determine it's probably not in a sim by the time it's verifiably controlling planet-scale compute (or earlier).

•

> We can build general altruistic agents which:
> • Initially use intrinsically motivated selfish empowerment objectives to bootstrap developmental learning (training)
> • Gradually learn powerful predictive models of the world and the external agency within (other AI in sims, humans, etc) which steers it
> • Use correlation-guided proxy matching (or similar) techniques to connect the dynamic learned representations of external agent utility (probably approximated/bounded by external empowerment) to the agent's core utility function
> • Thereby transition from selfish to altruistic by the end of developmental learning (self-training)

I endorse this as a plausible high-level approach to making aligned AGI, and I would say that a significant share of the research that I personally am doing right now is geared towards gaining clarity on the third bullet point: what exactly are these techniques and how reliably will they work in practice?

I think I'm less optimistic than you about the formal notion of empowerment being helpful for this third bullet point, or being what we want an AGI to be maximizing for us humans. For one thing, wouldn't we still need "correlation-guided proxy matching"? For another thing, maximizing my empowerment would seem to entail killing anyone who might stop me from doing whatever I want, stealing money from around the world and giving it to me even if I don't want it, etc. (Or is there a collective notion of empowerment?)
Here's another example: if the AGI comes up with 10,000 possible futures of the universe, and picks one to bring about based on how many times I blink in the next hour, then I am highly "empowered" (high mutual information between my actions and future observations), but the AGI never told me it was going to do that, so I was just blinking randomly, so I wasn't really "empowered" in the everyday sense of the word. A separate issue is that, even if the AGI told me this plan in advance, that's not doing me a favor; I don't want that responsibility on my shoulders. So anyway, "increase my empowerment" seems to come apart from "what I'd want the AGI to do" in at least some silly examples, and I'd expect it to come apart in very important ways in more realistic examples too.

•

> I think I'm less optimistic than you about the formal notion of empowerment being helpful for this third bullet point, or being what we want an AGI to be maximizing for us humans. For one thing, wouldn't we still need "correlation guided proxy matching"?

I debated which word would best describe the general category of all self-motivated long-term convergence approximators, and chose 'empowerment' rather than 'self-motivation', but tried to be clear that I'm pointing at a broad category. The defining characteristic is that optimizing for empowerment should be the same as optimizing for any reasonable mix of likely long-term goals, due to convergence (and that is in fact one of the approximation methods). Human intelligence requires empowerment, as will AGI—it drives active learning, self-exploration, play, etc. (consider the appeal of video games). I'm not confident in any specific formalization as being 'the one' at this point, let alone the ideal approximations.

Broad empowerment is important to any model of other external humans/agents, and so yes, the proxy matching is still relevant there. Since empowerment is universal and symmetric, the agent's own self-empowerment model could be used as a proxy with simulation. For example, humans don't appear to innately understand and fear death, but can develop a great fear of it when learning of their own mortality—which is only natural, as it is maximally disempowering. Then simulating others as oneself helps one learn that others also fear death, and grounds that sub-model. Something similar could work for the various other aspects of empowerment.

> (Or is there a collective notion of empowerment?)

Yeah, an agent aligned to the empowerment of multiple others would need to aggregate utilities à la some approximate VCG mechanism, but that's no different than for other utility function components.

> Here's another example: if the AGI comes up with 10,000 possible futures of the universe, and picks one to bring about based on how many times I blink in the next hour, then I am highly "empowered"

From what I recall of the info-max formulations, the empowerment of a state is a measure over all actions the agent could take in that state, not just one arbitrary action, and I think it discounts for decision entropy (random decisions are not future-correlated). So in that example the AGI would need to carefully evaluate each of the 10,000 possible futures, consider all your action paths in each future and the complexity of the futures dependent on those action options, and pick the future that has the most future optionality. No, it's not practically computable, but neither is the shortest description of inference (Solomonoff) or intelligence, so it's about efficient approximations. There's likely still much to learn from the brain there.
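For reference, the standard info-max formalization being gestured at here (empowerment in the sense of Klyubin, Polani & Nehaniv) defines the empowerment of a state as the channel capacity from an n-step action sequence to the resulting sensor state:

$$\mathfrak{E}(s_t) \;=\; \max_{p(a_t^n)} \, I\!\left(A_t^n \,;\, S_{t+n} \mid s_t\right)$$

The maximization is over distributions of whole action sequences, which is why it measures what the agent could reliably bring about, rather than what any one arbitrary action happens to cause.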
An AGI optimizing for your empowerment would likely just make you wealthy and immortal, and leave your happiness up to your own devices. However, it would also really, really not want to let you die, which could be a problem in some cases if pain/suffering were still an issue (although it would also seek to eliminate your pain/suffering to the extent that it interferes with your future optionality, and there is some edge-case risk that it would have incentives to alter parts of our value system, e.g. if it determines that some emotions constrain our future optionality).

• OK, thanks. Hmm, maybe a better question would be: "correlation-guided proxy matching" needs a "proxy", including a proxy computable for an AGI in the real world (because after we're feeling good about the simbox results, we still need to re-train the AGI in the real world, right?). We can argue about whether the "proxy" should ideally be a proxy to VCG-aggregated human empowerment, versus a proxy to human happiness or flourishing or whatever. But that's a bit beside the point until we address the question: How do we calculate that proxy? Do you think that we can write code today to calculate this proxy, and then go walk around town and see what that code spits out in different real-world circumstances? Or if we can't write such code today, why not, and what line of research gets us to a place where we can? Sorry if I'm misunderstanding :)

•

> But that's a bit beside the point until we address the question: How do we calculate that proxy?

I currently see two potential paths, which aren't mutually exclusive. The first path is to reverse-engineer the brain's empathy system. My current rough guess at how the proxy matching works for that is explained in some footnotes in section 4, and I've also written it out in this comment, which is related to some of your writings. In a nutshell, the oldbrain has a complex suite of mechanisms (facial expressions, gaze, voice tone, mannerisms, blink rate, pupil dilation, etc.) consisting of both subconscious 'tells' and 'detectors' that function as a sort of direct non-verbal, oldbrain-to-oldbrain communication system to speed up the grounding to newbrain external-agent models. This is the basis of empathy, evolved first for close kin (mothers simulating infant needs, etc.) then extended and generalized. I think this is what you have perhaps labeled innate 'social instincts'—these facilitate grounding to the newbrain models of others' emotions/values.

The second path is to use introspection/interpretability tools to more manually locate learned models of external agents (and their values/empowerment/etc.), and then extract those located circuits and use them directly as proxies in the next agent.

> Do you think that we can write code today to calculate this proxy, and then we can go walk around town and see what that code spits out in different real-world circumstances? Or if we can't write such code today, why not, and what line of research gets us to a place where we can write such code?

Neuroscientists may already be doing some of this today, or at least they could (I haven't extensively researched this yet). They should be able to put subjects in brain scanners and ask them to read and imagine emotional scenarios that trigger specific empathic reactions, perhaps have them make consequent decisions, etc. And of course there is some research being done on empathy in rats, some of which I linked to in the article.
• Oh neat, sounds like we mostly agree then. Thanks. :)

• This was well written and persuasive. It doesn't change my views against AGI on very short timelines (pre-2030), but it does suggest that I should be updating likelihoods thereafter and shortening timelines.

•

> But, we might be able to compare ontologies themselves, and if it is the case that we prefer one or think that one is more 'correct', then we should situate decision theories in that one map before comparing them.

How about comparing the theories by setting them all loose in a simulated world, as in the tournaments run by Axelrod and others? A world in which they are continually encountering Omega, police who want to pin one on them, potential rescuers of hitchhikers, and so on. See who wins.

•

> How about comparing the theories by setting them all loose in a simulated world, as in the tournaments run by Axelrod and others? A world in which they are continually encountering Omega, police who want to pin one on them, potential rescuers of hitchhikers, and so on.

In your experiment, the only difference between the FDTers and the updateless CDTers is how they view the world; specifically, how they think of themselves in relation to their environment. And yes, sure, perhaps the FDTer will end up with a larger pot of money in the end, but this is just because the algorithmic ontology is arguably more "accurate", in that it e.g. tells the agent that it will make the same choice as its twin in the Twin PD (modulo brittleness issues). And this is the level, I argue, at which we should have the debate about ontology (e.g. about accurate predictions etc.)—not on the level of decision-theoretic performance.

> See who wins.

How do you define "winning"? As mentioned in another comment, there is no "objective" sense in which one theory outperforms another, even if we are operating in the same ontology. See The lack of performance metrics for CDT versus EDT, etc. by Caspar Oesterheld.

• The difficulty is in how to weight the frequency/importance of the situations they face. Unless one theory dominates (in the strict sense: it literally does better in at least one case and no worse in any other), the "best" is determined by the environment. Of course, if you can algorithmically determine what kind of situation you're facing, you can use a meta-decision-theory which chooses the winner for each decision. This does dominate any simpler theory, but it reveals the flaw in this kind of comparison: if you know in advance what will happen, there's no actual decision to your decision. Real decisions have enough unknowns that it's impossible to understand the causality that fully.
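As a concrete reference point for the tournament idea discussed above, here is a minimal iterated prisoner's dilemma round-robin in the spirit of Axelrod's experiments. The payoff values and the two strategies are the standard textbook ones; the code itself is my own toy reconstruction, not anything from the comments:

```python
# Payoff matrix: (my_points, their_points) for (my_move, their_move).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(a, b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = a(hist_b), b(hist_a)  # each sees the opponent's past moves
        pts_a, pts_b = PAYOFF[(move_a, move_b)]
        score_a += pts_a
        score_b += pts_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, always_defect))  # (199, 204): one early exploitation, then mutual defection
```

Extending this to Newcomb-like encounters would require the simulated "Omega" to read the contestant's decision procedure, which is exactly where the ontology dispute above re-enters.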
•

> Associate with winners, and you'll learn beliefs and habits that help you win.

I think this points in the right direction, but lacks nuance. Notice the two implied assumptions:

• you can recognize the winners (otherwise the advice is not actionable); and
• the beliefs and habits are what made them winners (as opposed to e.g. resources or luck).

On one hand, yes, it is true that beliefs and habits have a great impact on your life, and you will probably instinctively associate with people having similar beliefs and habits, which makes it difficult to see outside your bubble. Other people do not necessarily tell you about their actual beliefs, and you do not get to see all their habits in action. So it might be tremendously educational to understand how others live in a "different reality". I assume that for most people the self-imposed limitations caused by their own beliefs are invisible.

But sometimes you get an opportunity to clearly see how someone else is limited by their beliefs. For example, I have met a few smart people who believe that they are stupid or average. Not as fake humility, but as a factual statement about themselves. They therefore assume that the "things that smart people do" are beyond their reach… so they do not even try, which makes it a self-fulfilling prophecy. (The cause was either having an asshole parent who was never impressed by anything, or having hobbies that were not stereotypically intellectual.) I have told a few of them to go take a Mensa test, when they were pretty sure they couldn't make it, and they passed the test successfully. (My heuristic is that if I can talk to you about an interesting topic without getting bored, I am pretty sure your IQ is at least 130. This test has many false negatives, but few false positives.) I do not have any information about whether this update changed their lives somehow.

Another example is how I sometimes achieved seemingly unlikely things by simply asking. Once I asked a member of parliament on Facebook (via private message) to join our local LW meetup. I honestly said it was just five people meeting regularly in a pub, but "the last time we met, we discussed your recent political cause, and I thought it would be interesting if you could provide us first-hand info". They said yes. Another time, I happened to notice a sign saying Amnesty International on the door of some building, so I was curious and knocked. The people were friendly and happy about my curiosity, but they were busy working on something, so I offered to help. Skip five years, and I was a coordinator of the local branch. I wasn't even interested in politics much; I just had a lot of free time back then, and was a useful sidekick, and that turned out to be enough. Once I met a queen. She happened to visit our country, one of my friends was somehow involved, they said "I can bring one extra person to the audience", everyone else was scared because of some status-regulating instinct, and I said "yeah, no problem".

These are things that I myself would have had a problem believing, until they suddenly happened. I mean, I would believe that this is possible, but not that it can be so easy. So I wonder what other things, obvious from someone else's perspective, I am missing. Sometimes it drives me crazy, the idea of all those easy missed opportunities.

On the other hand, many people who present themselves as "winners" are actually scammers. Like all the people in multi-level marketing pyramid schemes: five years later, almost all of them will be out of money and ashamed of their actions. If you follow them and copy their beliefs, so will you. Even outside of organized scams, many people exaggerate their achievements. If you hang out with them and observe them, you will gradually learn that they are not what they pretend to be. That's certainly educational, but it won't make you a winner.

Sometimes people achieve success by luck, or by having resources you don't have. Typically, luck alone is not enough, but it could be a combination of (working hard and doing the right thing) + (having resources and luck). Merely copying their beliefs and actions is not a guarantee of success. Just having a safety net makes a huge difference: it is easy to be courageous when you know that the only consequence of failure is that you have lost some time, but also got some experience, and can immediately try again.
It is easy to laugh at risk avoidance when the "risk" only means that some number in your bank account gets a bit smaller but otherwise your life continues completely unchanged.

•

> it might be the case that any kind of meaningful values would be reasonably encodable as answers to the question "what next set of MPIs should be instantiated?"

What examples of (meaningless) values are not answers to "What next set of MPIs should be instantiated?"

• Wanting the moon to be green even when no moral patient is looking; or more generally, having any kind of preference about which computations that don't causally affect any moral patient are run.

• I have always understood that the CIA, and the U.S. intelligence community more broadly, is incompetent (not just misaligned—incompetent, don't believe the people on here who tell you otherwise), but this piece from Reuters has shocked me:

• (Not an expert.) (Sorry if you answered this and I missed it.) Let's say a near-future high-end GPU can run as many ops/s as a human brain but has 300× less memory (RAM). Your suggestion (as I understand it) would be a small supercomputer (university-cluster scale?) with 300 GPUs running (at each moment) 300 clones of one AGI at 1× human-brain speed, thinking 300 different thoughts in parallel, but getting repeatedly collapsed (somehow) into a single working-memory state. (If so, I'm not sure that you'd be getting much more out of the 300 thoughts at a time than you'd get from 1 thought at a time. One working-memory state seems like a giant constraint!)

Wouldn't it make more sense to use the same 300 GPUs to have just one human-brain-scale AGI, thinking one thought at a time, but with 300× speedup compared to humans? I know that speedup is limited by latency (both RAM --> ALU and chip --> chip) but I'm not sure what the ceiling is there. (After all, 300× faster than the brain is still insanely slow by some silicon metrics.) I imagine each chip being analogous to a contiguous 1/300th of the brain, and then evidence from the brain is that we can get by with most connections being within-chip, which helps with the chip --> chip latency at least. (I have a couple of back-of-the-envelope calculations related to that topic in §6.2 here.)

• The problem is that, due to the von Neumann (VN) bottleneck, to reach that performance those 300 GPUs need to be parallelizing 1000x over some problem dimension (matrix-matrix multiplication); they can't actually just do the obvious thing you'd want, which is to simulate a single large brain (sparse RNN) at high speed (using vector-matrix multiplication). Trying that, you'd get 1 brain at real-time speed at best (1000x inefficiency/waste/slowdown). It's not really an interconnect issue per se; it's the VN bottleneck. So you have to pick your poison:

• Parallelize over the spatial dimension (CNNs): too highly constraining for higher brain regions
• Parallelize over the batch/agent dimension: costly in RAM for agent medium-term memory, unless compressed somehow
• Parallelize over time (transformers): does enable huge speedup while being RAM-efficient, but also highly constraining by limiting recursion

The largest advances in DL (the CNN revolution, the transformer revolution) are actually mostly about navigating this VN bottleneck, because more efficient use of GPUs trumps other considerations.
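A back-of-the-envelope sketch of why the bottleneck bites (the sizes below are my own illustrative picks, not from the comment): a dense layer's weights must stream from memory on every use, so the ratio of arithmetic done per word fetched decides whether the ALUs or the memory bus saturates first.

```python
# Toy arithmetic-intensity estimate (hypothetical layer/batch sizes).
N = 4096   # layer width: an N x N weight matrix
B = 256    # batch dimension: parallel agents/timesteps sharing one weight read

# One agent, one step: vector-matrix multiply. Weight traffic dominates.
matvec_flops = 2 * N * N
matvec_words = N * N + 2 * N
print(matvec_flops / matvec_words)   # ~2 FLOPs per word read: bandwidth-bound

# B agents batched: matrix-matrix multiply. One weight read serves B inputs.
matmul_flops = 2 * B * N * N
matmul_words = N * N + 2 * B * N
print(matmul_flops / matmul_words)   # ~450 FLOPs per word: can keep the ALUs busy
```

Since modern accelerators sustain on the order of hundreds of FLOPs per word of memory bandwidth, the unbatched (single fast brain) case leaves the arithmetic units mostly idle, which is the comment's point about having to parallelize over some other problem dimension.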
• Thanks for the post. I'm voting for the SGD inductive biases for the next one.

• Scientists engineered mosquitoes that slow the growth of malaria-causing parasites in their guts, preventing transmission to humans. (tweet) The link suggests that this reduces the mosquito lifespan, and is thus likely not a viable solution that's evolutionarily stable.

• I think the point that even an aligned agent can undermine human agency is interesting and important. It relates to some of our work on defining agency and preventing manipulation. (Which I know you're aware of, so I'm just highlighting the connection for others.)

•

> Soon as in most likely this decade, with most of the uncertainty around terminology/classification.

When you say "this decade" do you mean "the next ten years" or do you mean "the 2020s"? Just curious.

• The latter, but there's not much difference.

• Ha, I did the same thing with my bed at Swarthmore a few years after you, and bonked my head on the ceiling many times 😃

• I have the impression that the AGI debate is here just to release pressure on the term "AI", so that everybody can say they are doing AI. I wonder if this will also happen for AGI in a few years. As there is no natural definition, we can craft one at our pleasure to fit marketing needs.

• Amazing stuff, man! Please, please, please keep doing these for as long as you're able to find the time. It's absolutely essential that LW gets regular injections of relevant work being done outside the EA-sphere. (Would also be very interested in either SGD inductive biases or LM internal representations as the topic for next week!)

• Some quick thoughts about "Content we aren't (yet) discussing":

# Shard theory should be about transmission of values by SL (teaching, cloning, inheritance) more than learning them using RL

SL (cloning) is more important than RL. Humans learn a world model by SSL, then bootstrap their policies through behavioural cloning, and finally finetune their policies through RL. Why? For theoretical reasons and from experimental data points, this is the cheapest way to generate good general policies (a minimal code sketch of this ordering appears after this comment)…

• SSL before SL, because you get much more frequent and much denser data about the world by trying to predict it. ⇒ SSL before SL because of a bottleneck on the data from SL.
• SL before RL, because this removes half (in log scale) of the search space by removing the need to discover/learn your reward function at the same time as your policy function. In addition, this removes the need to do the very expensive exploration and the temporal and "agential" (when multi-agent) credit assignment. ⇒ SL before RL because of the cost of doing RL.

## Differences:

• In cloning, the behaviour comes first and then the biological reward is observed or not. Behaviours that give no biological reward to the subject can be learned. The subject will still learn some kind of values associated with these behaviours.
• Learning with SL, instead of RL, doesn't rely as much on credit assignment and exploration. What are the consequences of that?

## What values are transmitted?

### 1) The final values

The learned values known by the previous generation. Why?

• Because it is costly to explore your reward function space by yourself
• Because it is beneficial to the community to help you improve your policies quickly

### 2) Internalised instrumental values

Some instrumental goals are learned as final goals; they are "internalised". Why?
• exploration is too costly
• finding an instrumental goal is too rare or too costly
• exploitation is too costly
• having to choose whether to pursue an instrumental goal in every situation is too costly or not quick enough (reaction time)
• when being highly credible is beneficial
• implicit commitments to increase your credibility

### 3) Non-internalised instrumental values

Why?

• Because it is beneficial to the community to help you improve your policies quickly

# Shard theory is not about the 3rd level of reward function

We have here 3 levels of reward function:

## 1) The biological rewards

Hardcoded in our body. Optimisation process creating it: Evolution

• Universe + Evolution ⇒ Biological rewards

Not really flexible (without "drugs" and advanced biotechnologies). Almost no generalization power:

• Physical scope: We feel stuff when we are directly involved
• Temporal scope: We feel stuff when it is happening
• Similarity scope: We feel stuff when we are directly involved

Called sensations, pleasure, pain.

## 2) The learned values | rewards | shards

Learned through life. Optimisation process creating it: SL and RL relying on biological rewards

• Biological rewards + SL and RL ⇒ Learned values in the brain

Flexible in terms of years. Medium generalization power:

• Physical scope: We learn to care even in cases where we are not involved (our close circle)
• Temporal scope: We learn to feel emotions about the future and the past
• Similarity scope: We learn to feel emotions for other kinds of beings

Called intuitions, feelings. Shard theory may be explaining only this part.

## 3) (optional) The chosen values

Decided upon reflection. Optimisation process creating it: Thinking relying on the brain

• Learned values in the brain + Thinking ⇒ Chosen values "on paper" | "in ideas"

Flexible in terms of minutes. Can have up to very high generalization power:

• Physical scope: We can choose to care without limits of distance in space
• Temporal scope: We can choose to care without limits of distance in time
• Similarity scope: We can choose to care without limits in terms of similarity to us

Called values, moral values.

## Why was a 3rd level created?

In short, to get more utility OOD. In a bit more detail: because we want to design policies far OOD (out of our space of lived experiences). To do that, we know that we need a value function | reward model | utility function that generalizes very far. Thanks to this chosen general reward function, we can plan and try to reach a desired outcome far OOD. After reaching it, we will update our learned utility function (lvl 2). Thanks to lvl 3, we can design public policies, or dedicate our life to exploring the path towards a larger reward that will never be observed in our lifetime.

### One impact of the 3-level hierarchy:

This could explain why most philosophers can support scope-sensitive values but never act on them.
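As promised above, a deliberately minimal sketch of the SSL → SL (cloning) → RL ordering this comment argues for. Everything here (the running-mean "world model", the linear cloned policy, the hill-climbing RL step) is my own stand-in, chosen only to make the three stages and their relative costs concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_ssl(observations):
    # Stage 1: "world model" from dense, cheap prediction data. Stand-in:
    # predict any observation as the mean of what has been seen so far.
    return observations.mean(axis=0)

def behaviour_clone(states, expert_actions):
    # Stage 2: supervised regression onto expert (state, action) pairs.
    # No exploration, no reward discovery, no credit assignment needed.
    policy, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
    return policy

def finetune_rl(policy, reward_fn, steps=200):
    # Stage 3: expensive trial-and-error, but starting from a good cloned
    # policy rather than from scratch. Stand-in: random hill-climbing.
    for _ in range(steps):
        candidate = policy + 0.01 * rng.normal(size=policy.shape)
        if reward_fn(candidate) > reward_fn(policy):
            policy = candidate
    return policy
```

The point of the ordering is visible in the stubs: only the last stage pays for exploration, and it searches near the cloned policy instead of over the whole policy space.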
• If there was only one correct way to model humans, such that every sufficiently competent observer of humanity was bound to think of me the same way I think of myself, then I think this would be a lot less doomed. But alas, there are lots of different ways to model humans as goal-directed systems, most of which I wouldn't endorse for value learning—not because they're inaccurate, but because they're amoral. In short, yes, value learning is a challenge, one that is easy to fail if you try to do the value learning step strictly before the caring-about-humans step.

•

> But alas, there are lots of different ways to model humans as goal-directed systems, most of which I wouldn't endorse for value learning—not because they're inaccurate, but because they're amoral.

How so? Genuinely curious how you see this as a failure mode. Anyway, AGI doesn't need to model our detailed values, as it can just optimize for our empowerment (our long-term ability to fulfill any likely goal).

• The basic point is that the stuff we try to gesture towards as "human values," or even "human actions," is not going to automatically be modeled by the AI. Some examples of ways to model the world without using the-abstraction-I'd-call-human-values:

• Humans are homeostatic mechanisms that maintain internal balance of oxygen, water, and a myriad of vitamins and minerals.
• Humans are piloted by a collection of shards—it's the shards that want things, not the human.
• Human society is a growing organism modeled by some approximate differential equations.
• The human body is a collection of atoms that want to obey the laws of physics.
• Humans are agents that navigate the world and have values—and those values exactly correspond to economic revealed preferences.
• Humans-plus-clothes-and-cell-phones are agents that navigate the world and have values...

And so on—there are just a lot of ways to think about the world, including the parts of the world containing humans. The obvious problem this creates is for getting our "detailed values" by just querying a pre-trained world model with human data or human-related prompts: if the pre-trained world model defaults to one of these many other ways of thinking about humans, it's going to answer us using the wrong abstraction. Fixing this requires the world model to be responsive to how humans want to be modeled. It can't be trained without caring about humans and then have the caring bolted on later.

But this creates a less obvious problem for empowerment, too. What we call "empowerment" depends on what part of the world we are calling the "agent," what its modeled action space is, etc. The AI that says "the human body is a collection of atoms that want to obey the laws of physics" is going to think of "empowerment" very differently than the AI that says "humans-plus-clothes-and-cell-phones are agents." Even leaving aside concerns like Steve's over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it's supposed to be empowering, which doesn't happen by default.

•

> The basic point is that the stuff we try to gesture towards as "human values," or even "human actions," is not going to automatically be modeled by the AI.

I disagree, and have already spent some words arguing why (section 4.1, and earlier precursors), so I'm curious what specifically you disagree with there. But I'm also getting the impression you are talking about a fundamentally different type of AGI. I'm discussing future DL-based AGI, which is, to first approximation, just a virtual brain. As argued in section 23, current DL models already are increasingly like brain modules. So your various examples are simply not how human brains are likely to model other human brains and their values. All the concepts you mention—homeostatic mechanisms, 'shards', differential equations, atoms, economic revealed preferences, cell phones, etc.—are high-level linguistic abstractions that are not much related to how the brain's neural nets actually model/simulate other humans/agents.
This must obviously be true, because empathy/altruism existed long before the human concepts you mention.

> The obvious problem this creates is for getting our "detailed values" by just querying a pre-trained world model with human data or human-related prompts:

You seem to be thinking of the AGI as some sort of language model which we query? But that's just a piece of the brain, and not even the most relevant part for alignment. AGI will be a full brain equivalent, including the modules dedicated to long-term planning, empathic simulation/modeling, etc.

> Even leaving aside concerns like Steve's over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it's supposed to be empowering, which doesn't happen by default.

Again, for successful brain-like AGI this just isn't an issue (assuming human brains model the empowerment of others as a sort of efficient approximate bound).

•

> I could ask architects or GCs for rough estimates, but if I'm not actually going to be building anything this seems dubious.

Maybe just be open and ask: "I'm writing a blog about building new affordable housing and doing research on the cost. Do you have an estimate? I'm happy to give you a backlink to your homepage in my blog post."

• That's a great idea! I just wrote to the architect for our dormer work:

> Thanks again for designing our 2016 dormer addition: the design has worked well, and it's very helpful to have the extra space, especially now that we have a third kid!
>
> I'm looking into affordable housing construction in Somerville, where we now have an affordable housing overlay district which allows construction of much denser housing than would normally be allowed, as long as the units are all sold or rented under affordable housing restrictions. I'm trying to figure out costs, in a very rough ballpark way, as part of looking into whether it could work for groups of neighbors to get together to fund affordable housing construction.
>
> Would you happen to have any rules of thumb for estimating the cost per square foot of no-frills construction in this area? For example, I see the census saying new construction in the Northeast had a median cost of $179/sqft in 2021, but my guess is places like Somerville are more expensive than average? Alternatively, do you have any recommendations on who to ask?

> If you're able to help I'd be happy to credit you or link back to your website from my writeup!

• “If your only goal is “curing cancer”, and you lack humans’ instinct for the thousands of other important considerations, a relatively easy solution might be to hack into a nuclear base, launch all of its missiles, and kill everyone in the world.”

One problem I have with these scenarios is that they always rely on some lethal means for the AGI to actually kill people. And those lethal means are also available to humans, of course. If it's possible for an AGI to 'simply' hack into a nuclear base and launch all of its missiles, it's possible for a human to do the same—possibly using AI to assist themselves. I would wager that it's many orders of magnitude more likely for a human to do this, given our long history of killing each other. Therefore a world in which AGI can actually kill all of us is a world in which a rogue human can as well. It feels to me that we're kind of worrying about the wrong thing in these scenarios—we are the bigger threat.

• Yes. Rogue AGI is scary, but I’m far more concerned about human misuse of AGI. Though in the end, there may not be that much of a distinction.

• Astounding.

One thought:

> The main world design challenge is not that of preventing our agents from waking up, neo-style, and hacking their way out of the simbox. That's just bad sci-fi.

If we are in a simulation, it seems to be very secure. People are always trying to hack it: physicists go to the very bottom and try every weird trick; people discovered fire and gunpowder by digging deep into tiny inconsistencies. If you play a game and there's a glitch when you hold a box against the wall, you see how far that goes. People discovered buffer overflows in Super Mario World, and any curious, capable agent eventually will too.

So the sim has to be like, very secure.

• For early simboxes we'll want to stick to low-tech fantasy/historical worlds, and we won't run them for many generations, no scientific revolution, etc.

Our world (sim) does seem very secure, but this could just be a clever illusion. The advanced sims will not be hand-written code like current games are; they will be powerful neural networks, trained on vast real-world data. They could also auto-detect anomalies and correct for them (retrain), and in the worst case even unwind time.

Humans notice (or claim to notice) anomalies all the time, and we simply can’t distinguish anomalies in our brain’s neural nets from anomalies in a future simulation’s neural nets.

• Just watch for anomalies and deal with them however you want. Makes perfect sense. That sounds like a relatively low effort way to make the simulation dramatically more secure.

Does seem to me like an old-fashioned physics/game engine might be easier to make, run faster, and be more self-consistent. It would probably lack superresolution, and divine intervention would have to be done manually.

I’m curious what you see as the major benefits of neural-driven sim.

• Another related Metaculus prediction is

I have some experience in competitive programming and competitive math (although I was never good at math, despite having solved some IMO tasks), and I feel like competitive math is more about general reasoning than pattern matching, compared to competitive programming.

P.S. The post matches my intuitions well and is generally excellent.

• 29 Sep 2022 7:28 UTC
3 points
2 ∶ 0

Given that 335 users with 300+ karma were active on the site on Petrov Day, and the site didn't go down until the karma threshold dropped beneath that, you could argue this was the most successful Petrov Day yet on LessWrong (in past years, at most 250 people were given codes, and it's not clear they all even visited LessWrong). Plus, as above, this year the 300+ users didn't press the button despite the offer of anonymity.

I think that reasoning applies only to the subset of users in the Americas. For users in Europe, the point at which 300+ karma was enough to launch came deep in the night, and for parts of Asia very early in the morning. Someone from that group would have had to set an alarm and get out of bed to nuke the site, which requires considerably more energy than failing to withstand the temptation and pressing the launch button while visiting LessWrong during the day.

Still, I think it was a successful Petrov Day.

> 1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),

Yes. Evolution solved information inaccessibility, as it had to, over and over, in order to utilize dynamic learning circuits at all (as they always had to adapt to and be adaptive within the context of existing conserved innate circuitry).

The general solution is proxy matching, where the genome specifies a simple innate proxy circuit which correlates with, and thus matches, a target learned circuit at some critical learning phase, allowing the innate circuit to gradually supplant itself with the target learned circuit. The innate proxy circuit does not need to mirror the complexity of the fully trained target learned circuit at the end of its development; it only needs to roughly pick out the target, at some earlier phase, against all other valid targets.
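A toy illustration of the mechanism just described (entirely my own construction; the feature count, window length, and learning rule are arbitrary stand-ins): a crude innate cue supervises a richer learned detector during a critical window, after which the learned circuit stands on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(8)  # learned detector over rich sensory features

for step in range(1000):
    x = rng.normal(size=8)                      # rich sensory input
    innate_proxy = 1.0 if x[0] > 0.5 else 0.0   # crude cue, e.g. "large nearby moving thing"
    if step < 800:                              # critical window: proxy acts as the label
        pred = 1.0 / (1.0 + np.exp(-w @ x))
        w += 0.1 * (innate_proxy - pred) * x    # simple logistic-regression update
    # after the window the proxy is ignored; behavior follows the learned detector

print("learned detector keys on the proxy-correlated feature:", w.round(2))
```

The characteristic failure mode falls out directly: whatever happens to trip the crude cue during the window (a parent, a human, a glider) is what gets locked in as the learned target.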

Imprinting is fairly well understood, and has exactly the failure modes proxy matching predicts. The oldbrain proxy circuit just detects something like "large persistent nearby moving things", which in normal development are almost always the chick's parents. After the newbrain target circuit is fully trained, the chick will only follow its actual parents, or sophisticated sims thereof. But during the critical window, before the newbrain target is trained, the oldbrain proxy circuit can easily be fooled, and the chick can imprint on something else (like a human, or a glider).

Sexual attraction is a natural extension of imprinting: some collaboration of various oldbrain circuits can first ground to the general form of humans (infants have primitive face detectors, for example, and more), and then also to myriad more specific attraction signals: symmetry, body shape, secondary characteristics, etc., combined with other circuits which disable attraction for likely kin à la the Westermarck effect (identified, by yet other sets of oldbrain circuits, as the most familiar individuals during childhood). This explains the various failure modes we see in porn (attraction to images of people, and even to abstractions of humanoid shapes), and the failure of kin-attraction inhibition for kin raised apart.

Fear of death is a natural consequence of empowerment-based learning—as it is already the worst (most disempowered) outcome. But instinctual fear still has obvious evolutionary advantage: there are many dangers that can kill or maim long before the brain’s learned world model is highly capable. Oldbrain circuits can easily detect various obvious dangers for symbol grounding: very loud sounds and fast large movements are indicative of dangerous high kinetic energy events, fairly simple visual circuits can detect dangerous cliffs/heights (whereas many tree-dwelling primates instead instinctively fear open spaces), etc.

Anger/​Jealousy/​Vengeance/​Justice are all variations of the same general game-theoretic punishment mechanism. These are deviations from empowerment because an individual often pursues punishment of a perceived transgressor even at a cost to their own ‘normal’ (empowerment) utility (ie their ability to pursue diverse goals). Even though the symbol grounding here seems more complex, we do see failure modes such as anger at inanimate objects which are suggestive of proxy matching. In the specific case of jealousy a two step grounding seems plausible: first the previously discussed lust/​attraction circuits are grounded, which then can lead to obsessive attentive focus on a particular subject. Other various oldbrain circuits then bind to a diverse set of correlated indicators of human interest and attraction (eye gaze, smiling, pupil dilation, voice tone, laughter, touching, etc), and then this combination can help bind to the desired jealousy grounding concept: “the subject of my desire is attracted to another”. This also correctly postdicts that jealousy is less susceptible to the inanimate object failure mode than anger.

Empathy: Oldbrain circuits conspicuously advertise emotional state through many indicators: facial expressions, pupil dilation, blink rate, voice tone, etc—so that another person’s sensory oldbrain circuits can detect emotional state from these obvious cues. This provides the requisite proxy foundation for grounding to newbrain learned representations of emotional state in others, and thus empathy. The same learned representations are then reused during imagination and planning, allowing the brain to imagine/predict the future contingent emotional state of others. Simulation itself can also help with grounding, by reusing the brain’s own emotional circuitry as the proxy. While simulating the mental experience of others, the brain can also compare their relative alignment/altruism to its own, or some baseline, allowing for the appropriate game-theoretic adjustments to sympathy. This provides a reasonable basis for alignment in the brain, and explains why empathy is dependent upon (and naturally tends to follow from) familiarity with a particular character—hence “to know someone is to love them”.

Evolution needed a reasonable approximation of “degree of kinship”, and a simple efficient proxy is relative circuit capacity allocated to modeling an individual in the newbrain/​cortex, which naturally depends directly on familiarity, which correlates strongly with kin/​family.

• 29 Sep 2022 7:04 UTC
11 points
1 ∶ 0

Interesting analysis. Some questions and notes:

How are you looking at “researchers” vs. “engineers”? At some organizations, e.g. Redwood, the boundary is very fuzzy—there isn’t a sharp delineation between anyone whose job it is to primarily “think of ideas” vs. “implement ideas that researchers come up with”, so it seems reasonable to count most of their technical staff as researchers.

FAR, on the other hand, does have separate job titles for “Research Scientist” (4 people) vs. “Research Engineer” (5 people), though they do also say they “expect everyone on the project to help shape the research direction”.

Some of the other numbers seem like overestimates.

CHAI has 2 researchers and 6 research fellows, and only 2 (maybe 3) of the research fellows are doing anything recognizable as alignment research. (Not extremely confident; didn’t spend a lot of time digging for details for those that didn’t have websites. But generally not optimistic.) One of the researchers is Andrew Critch, who is one of the two people at Encultured. If you throw in Stuart Russell that’s maybe 6 people, not 30.

FHI has 2 people in their AI Safety Research Group. There are also a couple people in their macrostrategy research group it wouldn’t be crazy to count. Everybody else listed on the page either isn’t working on technical Alignment research or is doing so under another org also listed here. So maybe 4 people, rather than 10?

I don’t have very up-to-date information but I would be pretty surprised if MIRI had 15 full-time research staff right now.

Also, I think that every single person I counted above has at least a LessWrong account, and most also have Alignment Forum accounts, so a good chunk are probably double-counted.

On the other hand, there are a number of people going through SERI MATS who probably weren’t counted; most of them will have LessWrong accounts but probably not Alignment Forum accounts (yet).

I’d be very happy to learn that there were 5 people at Meta doing something recognizable as alignment research; the same for Google Brain. Do you have any more info on those?

• Thanks for the feedback.

I added a new section in the introduction named ‘Definitions’ to define some commonly-used terms in the post and decrease ambiguity. To answer your question, research engineers and research scientists would be in the technical AI safety research category.

I re-estimated the number of researchers at the organizations you mentioned and came up with the following numbers:

CHAI
- 2 researchers
- 6 research fellows
- 21 graduate students
- 11 interns

Total: 40

From reading their website, CHAI has a lot of graduate students and interns, and I would consider them to be full-time researchers. My previous estimate for CHAI was 10-30-60 (low-estimate-high) and I changed it to 5-25-50 in light of this new information. My estimate is less than 40 to be conservative and also because I doubt all of these researchers are working full-time at CHAI. Also, some of them have probably been counted in the Alignment Forum total.

FHI

Information from their website:
- AI safety research group: 2
- AI governance researchers: 1
- Research Scholars Programme + DPhil Scholars and Affiliates: 9

Total: 12

Change in estimate: 10-10-40 → 5-10-30

MIRI
- 9 research staff

Change in estimate: 10-15-30 → 5-10-20

“Also, I think that every single person I counted above has at least a LessWrong account, and most also have Alignment Forum accounts, so a good chunk are probably double-counted.”

I analyzed the people who posted AI safety posts on LessWrong and found that only 15% also had Alignment Forum accounts. I avoided double-counting by subtracting the LessWrong users who also have an Alignment Forum account from the LessWrong total.

For SERI MATS, I’m guessing that some of those people will be counted in the AF count. I also added an ‘Other’ row for other groups I didn’t include in the table.

I decided to delete the Google Brain and Meta entries because I have very little information about them.

• Competitive pressure is unavoidable, but is also often balanced or harnessed by cooperation in complex systems. Consider the orchestration of a few tens of trillions of cells into a human body, and how the immune system harnesses competitive evolution to produce antibodies, or the super-cooperative success of Argentinian ants. Some complex systems successfully resist the selection pressure towards competitive defection across an enormous number of replications.

One example of AIs that would count as good successors: ems. Creating a society of highly-accurate human brain emulations would constitute a good successor AI, since they would by definition share human values,

We don’t want to optimize for just any human values, as they vary wildly and are often divergent (consider the Nazis, or Jeffrey Dahmer). We want generalized alignment/altruism. Ems could be good successors only if they are based on a wide sampling of humans and/or a smaller sampling of especially altruistic (generally aligned) humans. But you could just as easily get a scenario where the first ems are all copies of a single selfish trillionaire.

DL-based AGI will be similar in most respects to ems, but we’ll have a much greater opportunity to test and select for alignment/​altruism.

Optimistically, if they succeed and find that our value system is algorithmically simple, creating good successor AI might be as simple as copying that algorithm to silicon

The great recent lesson of DL and neuroscience is that enormous mental complexity can emerge from relatively simple universal learning algorithms. Human values can be bounded/​approximated by empowerment: AGI can simply optimize for our long term ability to fulfill any likely goals. The simpler the agent’s utility function, the easier it should be to maintain it into the future. So it could be wise to create very broadly altruistic AGI that seeks to empower external agency in general: not just all of humanity, but animals, aliens, etc.
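For intuition on the empowerment framing: in a deterministic world, one-step empowerment reduces to the log of the number of distinct reachable states, so an agent preserving another agent’s empowerment is keeping that agent’s options open whatever its goals turn out to be. A minimal sketch (my own toy example; the 1-D gridworld and action set are assumptions):

```python
import numpy as np

# One-step empowerment in a deterministic 1-D gridworld with actions
# left / stay / right: log2 of the number of distinct reachable states.
def empowerment(state, n_states=5):
    reachable = {min(max(state + a, 0), n_states - 1) for a in (-1, 0, 1)}
    return np.log2(len(reachable))

for s in range(5):
    print(s, empowerment(s))
# Edge cells score 1 bit, interior cells ~1.58 bits: being cornered is
# literally disempowering, which is the sense in which death is the
# most disempowered outcome.
```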

• Competitive pressure is unavoidable, but is also often balanced or harnessed by cooperation in complex systems.

I agree—that some degree of competitive pressure may be unavoidable is all I was trying to argue. Lots of people around here think this is likely false!

But you could just as easily get a scenario where the first ems are all copies of a single selfish trillionaire.

Fair point; I had in mind a scenario where the ems selected were broadly representative of humanity. Although I also think even the selfish-trillionaire scenario might be okay in expectation, as they would have many millennia to reflect, alter, and interact with copies of themselves, etc. I’d give it decent odds that they either decide to upload other people out of altruism or end up self-modifying into a civilization similar to the one we’d end up with in a more equitable upload scenario.

• By selfish I meant completely non-altruistic. Also uploading by itself isn’t sufficient, it also requires allocation of resources to uploads. In the selfish trillionaire scenario I was imagining the trillionaire creates many copies of themselves, perhaps some copies of a few people they find interesting, and then various new people/​AGI, and those copies all branch and evolve, but there is little to zero allocation for uploading the rest of us, not to mention our dead ancestors.

• 29 Sep 2022 6:25 UTC
LW: 15 AF: 9
2 ∶ 0
AF

Curated. “Big if true”. I love the depth and detail in shard theory. Separate from whether all its details are correct, I feel reading and thinking about this will get me towards a better understanding of humans and artificial networks both, if only via making me reflect on how things work.

I do fear that shard theory gets a bit too much popularity from the coolness of the name, but I do think there is merit here, and if we had more theories of this scope, it’d be quite good.

• 29 Sep 2022 5:28 UTC
3 points
0 ∶ 0

Which spreadsheet did you look at in the HUD data? Did you use the contract price?

• “Paxlovid’s usefulness is questionable and could lead to resistance. I would follow the meds and supplements suggested by FLCC”

Their guide says:

In a follow up post-marketing study, Paxlovid proved to be ineffective in patients less than 65 years of age and in those who were vaccinated.

This is wrong. The study reports the following:

Among the 66,394 eligible patients 40 to 64 years of age, 1,435 were treated with nirmatrelvir. Hospitalizations due to Covid-19 occurred in 9 treated and 334 untreated patients: adjusted HR 0.78 (95% CI, 0.40 to 1.53). Death due to Covid-19 occurred in 1 treated and 13 untreated patients; adjusted HR: 1.64 (95% CI, 0.40 to 12.95).

As the abstract says, the study did not have the statistical power to show a benefit for preventing severe outcomes in younger adults. It did not “prove [Paxlovid] to be ineffective”! This is very bad; the guide is clearly not a reliable source of information about covid treatments, and I recommend against following the advice of anything else on that website.
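To gesture at how underpowered that comparison is, here is a quick back-of-envelope simulation using the raw counts above (my own sketch; the study itself reported covariate-adjusted hazard ratios, not this crude two-arm comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
base = 334 / 64_959        # untreated hospitalization rate from the study
n_t, n_u = 1_435, 64_959   # treated / untreated cohort sizes
true_rr = 0.6              # suppose treatment truly cuts risk by 40%

hits, trials = 0, 10_000
for _ in range(trials):
    a = rng.binomial(n_t, true_rr * base)  # treated events
    b = rng.binomial(n_u, base)            # untreated events
    if a == 0:
        continue
    log_rr = np.log((a / n_t) / (b / n_u))
    se = np.sqrt(1 / a + 1 / b)            # z-test on the log rate ratio
    hits += abs(log_rr / se) > 1.96

print(hits / trials)  # well under the conventional 0.8 power threshold
```

Even granting a true 40% risk reduction, a treated arm this small reaches p < 0.05 only a small minority of the time, so a null result is close to the expected outcome either way.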

• Paxlovid-associated rebound is quoted in Nature as occurring in 27% of cases, which is more in line with the anecdata I’ve received than the 5% figures in other research. My anecdata is that it really does hammer down brewing symptoms very effectively, and I haven’t seen reports or systematic trends in the anecdata of the rebound being worse than the first go-’round. So from your perspective, it’s a shot at not getting sick, or maybe at getting less sick.

The risk is yeah, maybe you contribute a minuscule marginal amount to drug-resistant COVID-19. The norms of biomedicine are that you, the patient, do not need to concern yourself with that. If the medicine can help you, and your doctor prescribes it, you can have it. You aren’t on the hook to sacrifice the quality of your care for the “greater good.”

• As you know, there’s a straightforward way, given any boolean circuit, to turn it into a version which is a tree: just take all the parts which have two wires coming out from a gate, and make duplicates of everything that leads into that gate.
I imagine that it would also be feasible to compute the size of this expanded-out version without having to actually expand out the whole thing?
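It should be: the expanded tree’s size satisfies size(g) = 1 + the sum of the sizes of g’s inputs, so a memoized recursion over the DAG computes it in time linear in the DAG, even when the tree itself would be exponentially large. A minimal sketch (the dict representation of the circuit is just an assumption for illustration):

```python
from functools import lru_cache

# Circuit as a DAG: each gate maps to the list of gates feeding it.
circuit = {
    "x": [], "y": [],        # primary inputs
    "g1": ["x", "y"],
    "g2": ["g1", "g1"],      # g1 is reused, so it gets duplicated in the tree
    "out": ["g2", "g1"],
}

@lru_cache(maxsize=None)
def tree_size(gate: str) -> int:
    # Size of the fully expanded tree rooted at `gate`, never materialized.
    return 1 + sum(tree_size(inp) for inp in circuit[gate])

print(tree_size("out"))  # 11 for this toy DAG
```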

Searching through normal boolean circuits, but using a cost which is based on the size if it were split into trees, sounds to me like it would give you the memoization-and-such speedup, and be kind of equivalent theoretically to using the actual trees?

A modification of it comes to mind. What if you allow some gates which have multiple outputs, but don’t allow gates which duplicate their input? Or, specifically, what if you allow something like Toffoli gates as well as the ability to just discard an output? The two parts of the output would be sorta independent due to the gate being invertible (… though possibly there could be problems with that due to the original input being used more than once?)

I don’t know whether this still has the can-do-greedy-search properties you want, but it seems like it might in a sense prevent using some computational result multiple times (unless having multiple copies of initial inputs available somehow breaks that).

• A quick retrospective on the four Manifold markets on the day, focusing on the possibility of good or bad incentives flowing out of the markets. Several people expressed concerns that the Manifold markets would cause LessWrong’s home page to go down, or go down earlier, due to incentives. People also placed bets and limit orders to try to mute those bad incentives. Some examples:

“This is a classical example where having a prediction market creates really bad incentives.” “I think the right thing to do here is buy yes so that it’s more beneficial for others to buy and hold no.” “I’m not wild about this market existing, but given that it does exist I’m strongly in favour of making it profitable for others to not press the button.”

I decided to look into this. I want to say up front that I think all the incentives were very small relative to perceived stakes, and I have no suspicions of anyone after writing this comment. I am also not going to give any bettor names.

By the end of the YES/NO market, several users had bet large sums on YES and were consequently incentivized to blow up the home page. One user had m26,002 YES shares (USD $260). Another user had m7,328 NO shares (USD $73) and was incentivized not to blow up the site. But this doesn’t show the whole picture, because there were also limit orders. As was explained to me:

The choice of whether to leave (public) limit orders on one side or the other is the way you incentivize action.

At various points during the day there were large (m1000, etc) YES limit orders in play. This is completely useless as an incentive for an individual LessWrong user. Sure, if I was the only one who might press the button, I could bet through those limit orders, not press the button, and collect my bounty. But many people could press the button, and this would have lost me mana. They did count as an incentive for LessWrong coders! They could quietly introduce a bug preventing the button from being pressed, bet NO, and collect their winnings. Nobody took up that incentive.

I didn’t see any large NO limit orders. A NO limit order on a market like this is inherently risky, because if the button is pressed the offer will definitely be taken, but if the button is not pressed it probably won’t be. If there had been any, they could have added to the incentive some bettors had to blow up the home page, but as far as I can tell there weren’t.

The positions in the WHEN market were much smaller, with the largest positions coming in at 552 shares in LATE/​NEVER and 325 shares in EARLY. The current design of range markets on Manifold is fun to play with, but is not very “swingy”. In particular, there is no option for outsized winnings if you can predict exactly when something will happen. If a user was planning to blow up the home page in the last few hours they could have bet a combination of LATE/​NEVER in this market and YES in the binary market to maximize their gains, but I don’t see any evidence of this. Because there are fewer high karma users, large EARLY bets in this market could slightly limit their incentive to bet EARLY and blow up the home page early. Unlike the YES/​NO market, those bets didn’t happen.

If someone was willing to anonymously blow up the home page, then a pattern of placing huge bets on that happening might look suspicious and damage their anonymity. So generally I would expect these incentives to only be effective for someone who was willing to blow up the home page and take credit for it. You can bet on that happening in Will anyone try to take credit for nuking LessWrong?. Or, place a limit NO order to incentivize someone to bet YES and then take the credit (and the mana).

When I created my market—Will LessWrong change their Petrov Day 2022 plan and reduce the chance of the button being pressed? - it was swiftly bet down to a low probability, creating a small incentive for the LessWrong admins to bet YES and then change their plans. Nobody took up that incentive, and the market didn’t attract large limit orders.

Finally mkualquiera made a market—Will my friend agree to defect on Petrov Day?. This market is evidence that two people were thinking about being incentivized to blow up the home page by betting markets. But it seems like this was more about a fun game than making mana. The largest position was 301 shares on YES, and the market resolved NO, so whatever incentive those shares provided, they weren’t enough to make it happen.

In the end I think there is a low (10%) chance that anyone’s behavior was significantly shifted by prediction market incentives. Feel free to reply to this post with how your behavior was shifted so I can update.

I was hopeful that people might shift their behavior based on the prediction market predictions—specifically that the high probability placed on the home page being blown up would lead to design changes. However, this retrospective clarifies that Petrov Day 2022 was a social experiment, so the prediction would have just shown it was expected to work as designed.

… the primary design of the exercises – seeing how long it’d take for the site to go down, even if we were pretty sure that it would.

It’s also possible that people saw the high probability that the home page would be blown up and altered their plans. They may have not blown it up because they expected someone else would (especially if they weren’t confident in their anonymity being preserved). Or they may have blown it up because “if it’s going to happen, it may as well be me”. I think there is a higher chance (25%) that this was an effect, but I don’t know what the net direction would have been. Again, feel free to reply if you think the prediction markets helped you make decisions.

Overall I think the prediction markets were a positive addition to a negative event. I think the main incentive in play was the possibility of seeing, or not seeing, the LessWrong home page being blown up. And I think we all have a lot to figure out about prediction markets.

• To clarify, my friend and I were 100% going to press the button, but we were discouraged by the false alarm. There was no fun at that point, and it made me lose something like 1/3 of my total mana. I had to close all my positions to stop the losses, and we went to sleep. When we woke up it was already too late for it to be noteworthy or fun.

• This is great! You should make a top-level post. Please!

• 29 Sep 2022 3:57 UTC
5 points
0 ∶ 0

The funny thing is that I had assumed the button was going to be buggy, though I was wrong about how. The map header has improperly swallowed mouse scroll wheel events whenever it’s shown; I had wondered if the button would interpret them likewise, since it was positioned in the same way, so I spent most of the day carefully dragging the scrollbar.

• [ ]
[deleted]
• You’re drawing a philosophical distinction based on a particular ontology of the wavefunction. A simpler version arises in classical electromagnetism: we can integrate out the charges and describe the world entirely as an evolving state of the E&M field with the charges acting as weird source terms, or we can do the opposite and integrate out the E&M field to get a theory of charges moving with weird force laws. These are all equivalent descriptions in that they are observationally indistinguishable.

• If it’s high-quality distillation you’re interested in, you don’t necessarily need a PhD. I’m thinking of e.g. David Roodman, now a senior advisor at Open Philanthropy. He majored in math, then did a year-long independent study in economics and public policy, and has basically been self-taught ever since. Holden Karnofsky considers what he does extremely valuable:

David Roodman, who is basically the person that I consider the gold standard of a critical evidence reviewer, someone who can really dig on a complicated literature and come up with the answers, he did what, I think, was a really wonderful and really fascinating paper, which is up on our website, where he looked for all the studies on the relationship between incarceration and crime, and what happens if you cut incarceration, do you expect crime to rise, to fall, to stay the same? He picked them apart. What happened is he found a lot of the best, most prestigious studies and about half of them, he found fatal flaws in when he just tried to replicate them or redo their conclusions.

When he put it all together, he ended up with a different conclusion from what you would get if you just read the abstracts. It was a completely novel piece of work that reviewed this whole evidence base at a level of thoroughness that had never been done before, came out with a conclusion that was different from what you naively would have thought, which concluded his best estimate is that, at current margins, we could cut incarceration and there would be no expected impact on crime. He did all that. Then, he started submitting it to journals. It’s gotten rejected from a large number of journals by now. I mean starting with the most prestigious ones and then going to the less.

Robert Wiblin: Why is that?

Holden Karnofsky: Because his paper, it’s really, I think, it’s incredibly well done. It’s incredibly important, but there’s nothing in some sense, in some kind of academic taste sense, there’s nothing new in there. He took a bunch of studies. He redid them. He found that they broke. He found new issues with them, and he found new conclusions. From a policy maker or philanthropist perspective, all very interesting stuff, but did we really find a new method for asserting causality? Did we really find a new insight about how the mind of a …

Robert Wiblin: Criminal.

Holden Karnofsky: A perpetrator works. No. We didn’t advance the frontiers of knowledge. We pulled together a bunch of knowledge that we already had, and we synthesized it. I think that’s a common theme is that, I think, our academic institutions were set up a while ago. They were set up at a time when it seemed like the most valuable thing to do was just to search for the next big insight.

These days, they’ve been around for a while. We’ve got a lot of insights. We’ve got a lot of insights sitting around. We’ve got a lot of studies. I think a lot of the times what we need to do is take the information that’s already available, take the studies that already exist, and synthesize them critically and say, “What does this mean for what we should do? Where we should give money, what policy should be.”

I don’t think there’s any home in academia to do that. I think that creates a lot of the gaps. This also applies to AI timelines where it’s like there’s nothing particularly innovative, groundbreaking, knowledge frontier advancing, creative, clever about just … It’s a question that matters. When can we expect transformative AI and with what probability? It matters, but it’s not a work of frontier advancing intellectual creativity to try to answer it.

• Shortform #138 A good but slightly disorienting day

I applied for a promising job today, here’s to hoping that bears fruit!

I am somewhat out of whack due to having to suddenly house sit instead of going to my own home after work. I do not enjoy this, but it’s an obligation I’m fulfilling.

No meetup tonight, I rescheduled Norfolk’s meetup for Thursday evening (the 29th). I’m excited for the meetup tomorrow!

• Agree with you generally. You may find interest in a lot of the content I posted on reddit over the past couple months on similar subjects, especially in the singularity sub (or maybe you are there and have seen it 😀). Nice write up anyway. I do disagree on some of your generalized statements, but only because I’m more optimistic than yourself, and don’t originally come from a position of thinking these things were impossible.

• I’ll be the annoying guy who ignores your entire post and complains about you using celsius as the unit of temperature in a calculation involving the Landauer limit. You should have used kelvin instead, because Landauer’s limit needs an absolute unit of temperature to work. This doesn’t affect your conclusions at all, but as I said, I’m here to be annoying.

That said, the fact that you got this detail wrong does significantly undermine my confidence in the rest of your post, because even though the detail is inconsequential for your overall argument it would be very strange for someone familiar with thermodynamics to make such a mistake.
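For concreteness, a quick sketch of how much the unit choice moves the number at room temperature (the bound is k_B·T·ln 2 with T in kelvin):

```python
import math

k_B = 1.380649e-23           # Boltzmann constant, J/K
T = 300.0                    # ~room temperature, kelvin

right = k_B * T * math.log(2)               # ~2.87e-21 J per erased bit
wrong = k_B * (T - 273.15) * math.log(2)    # celsius misuse: ~2.6e-22 J

print(f"{right:.3e} J, {wrong:.3e} J")      # off by roughly a factor of 11
```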

• Notably, the result is correct; I did convert it to kelvin for the actual calculation. Just a leftover from when I was sketching things on wolframalpha. I’ll change that, since it is weird. (Thanks for the catch!)

• So:

1. DL based AGI is arriving soonish

2. DL based AGI raised in the right social environments will automatically learn efficient models of external agent values (and empowerment bounds thereof)

3. The main challenge then is locating the learned representation of external agent values and wiring/grounding it up to the agent’s core utility function (which is initially unsafe: self-motivated empowerment etc), and timing that replacement carefully

4. Evolution also solved both alignment and the circuit grounding problem; we can improve on those solutions (proxy matching)

5. We can iterate safely on 3 in well constructed sandbox sims

Ideally as we approach AGI there would be cooperative standardization on alignment benchmarks, and all the major players would subject their designs to extensive testing in sandbox sims. Hopefully 1-5 will become increasingly self-evident and influence ‘Magma’. If not, some other org (perhaps a decentralized system) could hopefully beat Magma to the finish line. Alignment need not have much additional cost: it doesn’t require additional runtime computations, it doesn’t require much additional training cost, and with the ideal test environments it hopefully doesn’t have much of a research iteration penalty (as the training environments can simultaneously test for intelligence and alignment).

• 29 Sep 2022 1:27 UTC
4 points
1 ∶ 0

It’s a good idea to be pedantically clear with terminology, and I thank you for saying “front page” rather than “site” in most of the timeline. I visited the site multiple times (and saw a 502 at one point), but I never look at the front page, using my /allPosts bookmark. I never saw the button, though I have enough karma (though I may be removed as a troublemaker, as I think this exercise is rather less weighty than it’s made out to be. I don’t actually know whether I’d press it, given the chance).

more than 300 karma (1,178 of them exist, 335 visited the site on Petrov Day)

Visited the site, or visited the front page? I’m mostly curious how many of the regulars use the front page, vs using greaterwrong (which I presume isn’t even a site visitor) or a deeper bookmark.

• Would the foundation consider awarding the future fund worldview prize to AI? Just wondering.

Kelsi Sober (804) 255-4163

• We can conclude that in gaining more karma than [200], one becomes the kind of person who doesn’t destroy the world symbolically or otherwise.

I imagine this is tongue in cheek, but we really can’t. You mentioned an important reason—someone with more karma could have waited to press the button. The first button press occurred 110 minutes after it could have been pressed. The second button press occurred at least 40 minutes after it could have been pressed, and perhaps 100, 160, 220, etc. In 2020 the button was pressed 187 minutes after it could have been pressed (by a 4000+ karma user).

You excluded known trouble makers from accessing the button but you didn’t exclude unknown trouble makers, and lower karma is correlated with being unknown.

We are also dealing with a hostile intelligence who pressed the button (or caused it to be pressed). Someone with higher karma might deliberately wait to press the button to throw people off the scent, to encourage people to make a naive update about karma scores, or to reduce the negative consequences of bringing the home page down for longer without the negative consequences of leaving it up all day. The timing evidence is thus hostile evidence and updating on it correctly requires superintelligence.

Put this together and I would not place any suspicion on the noble class of 200-299 karma users that I happened to enter on Petrov Day after net positive gains from complaining about the big red button.

I am willing to update that at least one person in the 200+ karma range pressed the button, and at least one person with zero karma pressed the button. This assumes there was not a third bug in play. This does not change my opinion of LessWrong users, but those who predicted that the home page would remain up could update.

• I also imagine that it was tongue in cheek, but I also think that the structure of the whole thing so heavily suggests this line of thinking that recognizing it as wrong on a surface level doesn’t really dispel it.

• I like this comment.

• 29 Sep 2022 0:18 UTC
1 point
0 ∶ 0

Following up on this:

I’ve contacted a number of people in the field regarding this idea (thanks to everyone who responded!).

The general vibe is “this seems like it could be useful, maybe, if it took off,” but it did not appear to actually solve the problems any specific person I contacted was having.

My expectation would be that people in large organizations would likely not publish anything in this system that they would not publish out in the open.

One of the critical pieces of this proposal is having people willing to coordinate across access boundaries. There were zero enthusiastic takers for that kind of burden, which doesn’t surprise me too much. Without a broad base of volunteers for that kind of task, though, this idea seems to require paying a group of highly trusted and well-informed people to manage coordination instead of, say, researching things. That seems questionable.

Overall, I still think there is an important hole in coordination, but I don’t think this proposal fills it well. I’m not yet sure what a better shape would be.

• I didn’t push the button because I knew that doing so would actually take down the front page. And I only believed that would happen because it happened last year.

• I know this is going to come off as overly critical no matter how I frame it but I genuinely don’t mean it to be.

Another takeaway from this would seem to be an update towards recognizing the difference between knowing something and enacting it or, analogously, being able to identify inadequacy vs. avoid it. People on LW often discuss, criticize, and sometimes dismiss folks who work at companies that fail to implement all security best practices or do things like push to production without going through proper checklists. Yet, this is a case where exactly that happened, even though there was not strong financial or (presumably) top down pressure to act quickly.

• Actually it seems that the industry-standard (?) process of code review was followed just fine. Yet the wrong logic still went through. (Actually, based on the GitHub PR, it seems that the reviewer himself suggested the wrong logic?)

I think in this case there would also be plenty to say about blindly following checklists. (Could code review in some cases make things worse by making people think less about the code they write?)

EDIT: Actually, based on the TS types, user.karma can’t be missing. Either the types or Ruby’s explanation is wrong. Clearly multiple things had to go wrong for this bug to slip through.

• I don’t know all the details of what testing was done, but I would not describe code review and then deploying as state-of-the-art as this ignores things like staged deploys, end-to-end testing, monitoring, etc. Again, I’m not familiar with the LW codebase and deploy process so it’s possible all these things are in place, in which case I’d be happy to retract my comment!

• To me it seems that the average best practices are being followed.[1] But these “best practices” are still just a bunch of band-aids, which happen to work fairly well for most use-cases.

A much more interesting question to ask here is what if something important like … humanity’s survival depended on your software? It seems that software correctness will be quite important for alignment. Yet I see very few people seriously trying to make creating correct software scalable. (And it seems like a field particularly suited for empirical work, unlike alignment. I mean, just throw your wildest ideas at a proof checker, and see what sticks. After you have a proof, it doesn’t matter at all how it was obtained.)

1. ↩︎

And I think the amount of effort in this case is perfectly justified. I mean this was code for a one-off single day event, nothing mission critical. It would be unreasonable to expect much more for something like this.

• Mainly commenting on your footnote, I generally agree that it’s fine to put low amounts of effort into one-off simple events. The caveat here is that this is an event that is 1) treated pretty seriously in past years and 2) is a symbol of a certain mindset that I think typically includes double-checking things and avoiding careless mistakes.

• This is why I really wish we had an AI with superhuman coding, theorem proving, and natural-language translation abilities, and crucially only those capabilities, so that we can prove certain properties.

• The types are wrong. It’s sad that the types are wrong. I sure wish they weren’t wrong, but changing it in all the cases is a pretty major effort (we would have to manually annotate all fields to specify which things can be null/​undefined in the database, and which ones can’t, and then would have to go through a lot of type errors).

• The types were introduced after most of the user objects had already been created, and I guess no one ever ran a migration to fill them in.

• I see, makes sense.

On the other hand I am afraid this reinforces NaiveTortoise’s point, this seems like an underlying issue that could potentially lead to bugs much worse than this...

• users on LessWrong with more than 300 karma (1,178 of them) are not the kind to push buttons for one reason or another. Granted many of them would not have been checking the site

Do you have website access stats that would let you compute precisely how many of them did in fact load the homepage during Petrov Day?

• 335 users with 300 karma or more were active on Petrov Day. In total, 1,644 logged-in users with accounts created before the cutoff were active.

• I really like this post because it’s readable and informative. For the second problem, pursuing proxy goals, I recommend also reading about a related problem called the XY problem.

On point 4: many popular alignment ideas are not models of current systems, but models of future AI systems. Accuracy is then lost not only from modeling the system but also from having to create a prediction about it.

• Counterfactual Impact and Power-Seeking

It worries me that many of the most promising theories of impact for alignment end up with the structure “acquire power, then use it for good”.

This seems to be a result of the counterfactual impact framing and a bias towards simple plans. You are a tiny agent in an unfathomably large world, trying to intervene on what may be the biggest event in human history. If you try to generate stories where you have a clear, simple counterfactual impact, most of them will involve power-seeking for the usual instrumental convergence reasons. Power-seeking might be necessary sometimes, but it seems extremely dangerous as a general attitude; ironically human power-seeking is one of the key drivers of AI x-risk to begin with. Benjamin Ross Hoffman writes beautifully about this problem in Against responsibility.

I don’t have any good solutions, other than a general bias away from power-seeking strategies and towards strategies involving cooperation, dealism, and reducing transaction costs. I think the pivotal act framing is particularly dangerous, and aiming to delay existential catastrophe rather than preventing it completely is a better policy for most actors.

This is why AI risk is so high, in a nutshell.

Yet unlike this post (or Benjamin Ross Hoffman’s post), I think this was a sad, but crucially necessary decision. I think the option you propose is at least partially a fabricated option. I think a lot of the reason is that people dearly want there to be a better option, even if it’s not there.

https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options

• Fabricated options are products of incoherent thinking; what is the incoherence you’re pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?

• I think the fabricated option here is just supporting the companies making AI, when my view is that by default, capitalist incentives kill us all by boosting AI capabilities while doing approximately zero AI safety; in particular, deceptive alignment would not be invested in, despite it being the majority of the risk.

One of the most important points for AGI safety is that the leader in AGI needs a lot of breathing space and a lead over their competitors, and I think this needs to be done semi-unilaterally by an organization without capitalist incentives, because all the incentives point towards ever-faster AGI capabilities, not towards slowing down. That’s why I think your options are fabricated: they assume unrealistically good incentives to do what you want.

• I don’t mean to suggest “just supporting the companies” is a good strategy, but there are promising non-power-seeking strategies like “improve collaboration between the leading AI labs” that I think are worth biasing towards.

Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn’t delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.

• Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn’t delay AI system deployment by at least a few months, and probably more like several years.

This is the crux, thank you for identifying it.

Yeah, I’m fairly pessimistic for several years time, since I don’t think they’re that special of a company in resisting capitalist nudges and incentives.

And yeah, I’m laughing because unless the alignment/safety teams control what capabilities are added, I do not expect the capabilities teams to stop, because they won’t get paid for that.

• I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout’s) reasons for alignment optimism is that I think:

• We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,

• (Although this amount of information depends on how much interpretability and agent-internals theory we do now)

• All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.

• It’s crucial to get early-training value shards of which a substantial fraction are “human-compatible values” (whatever that means)

• For example, if there are protect-human-shards which

• reliably bid against plans where people get hurt,

• steer deliberation away from such plan stubs, and

• these shards are “reflectively endorsed” by the overall shard economy (i.e. the decision-making isn’t steering towards plans where the protect-human shards get removed)

• If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can’t affect the ball game very much (e.g. alien abstractions, interpretability problems, can’t oversee AI’s complicated plans)

Therefore it seems very important to understand what’s going on with “shard game theory” (or whatever those intuitions are pointing at) -- when, why, and how will early decision-influences be retained?

He was talking about viewing new hypotheses as adding traders to a market (in the sense of logical induction). Usually they’re viewed as hypotheses. But also possibly you can view them as having values, since a trader can basically be any computation. But you’d want a different market resolution mechanism than a deductive process revealing the truth or falsity of some proposition under some axioms. You want a way for traders to bid on actions.

I proposed a setup like:

Maybe you could have an “action” instead of a proposition, and then the action comes out as 1 or 0 depending on a function of the market position on that action at a given time, which possibly leads to fixed points for every possible resolution.

For example, if all the traders hold an action as YES, then it actually does come out as YES. And e.g. a trader which “wants” all the even-numbered actions and another which wants all the multiple-of-10 actions can “bargain” by bidding up each other’s actions whenever they have extra power, and thereby “value handshake.”

And that over time, traders who do this should take up more and more market share relative to those who don’t exploit gains from trade.

There should be a very high dependence of final trader coalition on the initial composition of market share. And it seems like some version of this should be able to model self-reflective value drift. You can think about action resolution and payout as a kind of reward event, where certain kinds of shards get reinforced. Bidding for an action which happens and leads to reward, gets reinforced (supporting traders receive payouts), and the more you support (bid for it), the more responsible your support was for the event, so the larger the strengthening.
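A minimal toy sketch of this reinforcement-as-payout dynamic (my own construction for illustration; it is not logical induction proper, and the resolution and payout rules are assumptions): traders spread bids over the actions they value, the best-supported action fires, and the pot is returned to that action’s supporters in proportion to their bids.

```python
import numpy as np

n_actions, n_rounds = 12, 200
# Three toy traders, each "valuing" a subset of actions.
prefs = np.array(
    [[a % 2 == 0 for a in range(n_actions)],              # evens
     [a % 3 == 0 for a in range(n_actions)],              # multiples of 3
     [a in (2, 3, 5, 7, 11) for a in range(n_actions)]],  # primes
    dtype=float)
wealth = np.ones(len(prefs))

for _ in range(n_rounds):
    # Each trader spends 10% of its wealth, spread over its valued actions.
    bids = 0.1 * wealth[:, None] * prefs / prefs.sum(axis=1, keepdims=True)
    support = bids.sum(axis=0)
    fired = int(support.argmax())  # resolution: best-supported action fires
    # Reinforcement-as-payout: the whole pot goes to the fired action's
    # supporters, in proportion to how much of its support they supplied.
    wealth = wealth - bids.sum(axis=1) \
             + bids[:, fired] / support[fired] * bids.sum()

print(wealth)
```

Run as-is, the two traders whose preferences overlap on action 3 keep firing it and steadily absorb the third trader’s market share: a crude version of the gains-from-trade effect described above.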

Abram seemed to think that there might exist a nice result like “Given a coalition of traders with values X, Y, Z satisfying properties A, B, and C, this coalition will shape future training and trader-addition in a way which accords with X/Y/Z values up to [reasonably tight trader-subjective regret bound].”

What this would tell us is when trader coalitions can bargain /​ value handshake /​ self-trust and navigate value drift properly. This seems super important for understanding what happens, long-term, as the AI’s initial value shards equilibrate into a reflectively stable utility function; even if we know how to get human-compatible values into a system, we also have to ensure they stay and keep influencing decision-making. And possibly this theorem would solve ethical reflection (e.g. the thing people do when they consider whether utilitarianism accords with their current intuitions).

Issues include:

• Somehow this has to confront Rice’s theorem for adding new traders to a coalition. What strategies would be good?

• I think “inspect arbitrary new traders in arbitrary situations” is not really how value drift works, but it seems possibly contingent on internal capabilities jumps in SGD

• The key question isn’t can we predict those value drift events, but can the coalition

• EG agent keeps training and is surprised to find that an update knocks out most of the human-compatible values.

• Knowing the right definitions might be contingent on understanding more shard theory (or whatever shard theory should be, for AI, if that’s not the right frame).

• Possibly this is still underspecified and the modeling assumptions can’t properly capture what I want; maybe the properties I want are mutually exclusive. But it seems like it shouldn’t be true.

• ETA: this doesn’t model the contextual activation of values, which is a centerpiece of shard theory.

• I’ve been thinking along similar lines recently. A possible path to AI safety that I’ve been thinking about extends upon this:

A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end

# Technological Attractor: Off-the-shelf subsystems

One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual system, etc. -- the key hypothesis being that once we reach a certain level of model complexity, the benefits from training a model end-to-end are not worth the costs of re-learning all of these fundamental structures, and furthermore that you can get much better performance-per-cost by casting these modular, reusable components onto an ASIC. This could be a more feasible way of achieving something like [Microscope AI](https://www.lesswrong.com/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai#5__Microscope_AI).

Given a few such low-level components, we could enter a technological attractor where getting SOTA through another approach requires either i) throwing multiple OOMs more compute than previous SOTA at re-training your own copy of these components in an end-to-end model, or ii) building your own microchip fab to implement your custom component design. Both of these could be high enough barriers that in practice the market participants simply use the off-the-shelf components. And in this attractor more R&D goes into building better-performing and higher-abstraction components that can be combined arbitrarily for different tasks.

Given such off-the-shelf components, say, a standard visual cortex for image recognition, or a 3D spatial reasoning system which can be connected atop a visual cortex, or a generic working memory, etc., it becomes much more tractable to engage in interpretability work that can convince us of the safety of these components. That then just leaves the problem of proving the highest-level network is safe (i.e. whatever is trained to use these components). This is by no means trivial, but for example the ELK paper spent a lot of time worrying about whether you can even trust the basic ontology of an agent, and I think this path would make Ontology Identification tractable. (Of course, this by construction excludes a number of possible agent-configurations).

Now, it’s possible one could plug in these components and still accidentally have the trained agent learn its own replacement visual cortex that is deceptive and not interpretable, but I think that we can get an intuition for when this might happen. If the best off-the-shelf visual cortex module has complexity on the order of N parameters, and our new model that stitches together these lower-level modules has N parameters or less, then it seems implausible that it has somehow learned a new visual cortex network that is much more compact/efficient than the previous state of the art—increasingly implausible the more resources are devoted to optimizing these components. (It would be like someone coming along today and claiming they had achieved a 10x performance improvement on CPU clock speed; given the huge research budgets and historical gradual capability increases, we don’t expect this to happen). On the other hand, we should be suspicious of a 10N-parameter model where we are unable to interpret what 10% of the network is doing; that’s easily enough complexity to hide its own unsafe re-implementation of our components. (I’m aware that there’s a lot of ground in between these two points, I’m merely trying to illustrate that there is “likely safe” and “likely unsafe” ground, rather than claim exactly how big they each are.)

The final step here is the shakiest. It’s not clear to me that we can keep the “top layer” (the actual network that is stitching together the low-level components; perhaps the Neocortex, by analogy to human neural architecture?) thin enough to be obviously not learning its own unsafe component-replacements. However, I think this framework at least paints a picture of a “known safe” or at least “likely safe” path to AGI; if we see that the practical engineering and economic decisions produce thin top-layer models using thick component layers, then we can devote energy to proving the components are safe/​interpretable by construction, and exploring the interpretation of the top-level networks that consume the lower-level components. AGI “neurobiology” will be much more tractable if the “neural architecture” is relatively standardized. And so, this could be a good place to provide an early nudge to tip the system into this attractor; heavy investment into research on componentized NN architectures could be viewed as “gain of function” research, but it could also have a much safer end-point.

Another way of thinking about this is that by crystalizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.

• Another way of thinking about this is that by crystalizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.

We need to test designs, and most specifically alignment designs, but giving up retraining (i.e. lifetime learning) and burning circuits into silicon is unlikely to be competitive; it would be throwing out the baby with the bathwater.

Also whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex, it’s near purely a function of what is steering the planning system.

• unlikely to be competitive

Would you care to flesh this assertion out a bit more?

To be clear I’m not suggesting that this is optimal now. Merely speculating that there might be a point between now and AGI where the work to train these sub components becomes so substantial that it becomes economical to modularize.

whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex

As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher order components too. Like I said, I am imagining here that the majority of the model is “off the shelf”, and just a thin layer is usecase-specific.

To make this more explicit, if you had not only off-the-shelf visual cortex, but also spatio-temporal reasoning modules built atop (as the human brain does), then you could point your debugger at the contents of that module and understand what entities in space were being perceived at what time. And the mapping of “high level strategies” to “low level entities” would be a per-model bit of interpretability work, but should become more tractable to the extent that those low level entities are already mapped and understood.

So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/​strategies.

• Interesting, I haven’t seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.

• One other thought after considering this a bit more—we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.

I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different high-level goals, do you get savings by not having to train two “visual cortexes”?
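That experiment is cheap to sketch today. A minimal PyTorch illustration of its shape (toy dimensions; all names hypothetical): one frozen “visual cortex” trunk reused by two independently trained task heads.

```python
import torch
import torch.nn as nn

# Shared, frozen "visual cortex" trunk, trained once and reused.
trunk = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in trunk.parameters():
    p.requires_grad = False   # the shared component is never retrained

head_a = nn.Linear(16, 10)    # task A head
head_b = nn.Linear(16, 4)     # task B head, different high-level goal

x = torch.randn(8, 3, 32, 32)
feats = trunk(x)              # both heads consume the same representation
logits_a, logits_b = head_a(feats), head_b(feats)

# Only the heads are optimized, so the expensive trunk training cost is
# paid once and amortized across tasks.
opt_a = torch.optim.Adam(head_a.parameters())
opt_b = torch.optim.Adam(head_b.parameters())
```

The open safety question is the one raised above: whether the heads stay thin enough that they can’t quietly relearn an opaque trunk of their own.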

This is in a similar vein to Google’s foundation models, where they train one jumbo model that then gets specialized for each usecase. Can that foundation model be modularized? (Maybe for relatively narrow usecases like “text comprehension” it’s actually reasonable to think of a foundation model as a single submodule, but I think they are quite broad right now. ) The big difference is I think all the weights are mutable in the “refine the foundation model” step?

Perhaps another concrete proposal for a technological attractor would be to build a SOTA foundation model and make that so good that the community uses it instead of training their own, and then that would also give a slower-moving architecture/​target to interpret.

• Awesome work!

Misc. thoughts and questions as I go along:

1. Why is Continuity appealing/​important again?

2. In the Destruction Game, does everyone get the ability to destroy arbitrary amounts of utility, or is how much utility they are able to destroy part of the setup of the game, such that you can have games where e.g. one player gets a powerful button and another player gets a weak one?

• For 1, it’s just intrinsically mathematically appealing (continuity is always really nice when you can get it), and also because of an intuition that if your foe experiences a tiny preference perturbation, you should be able to use small conditional payments to replicate their original preferences/incentive structure and start negotiating with that, instead.

I should also note that nowhere in the visual proof of the ROSE value for the toy case, is continuity used. Continuity just happens to appear.

For 2, yes, it’s part of game setup. The buttons are of whatever intensity you want (but they have to be intensity-capped somewhere for technical reasons regarding compactness). Looking at the setup, for each player pair i,j there is a cap c(i,j) on how much of j’s utility i can destroy. These caps can vary, as long as they’re nonnegative and not infinite. From this, it’s clear “Alice has a powerful button, Bob has a weak one” is one of the possibilities; that would just mean c(Alice,Bob) is large and c(Bob,Alice) is small. There isn’t an assumption that everyone has an equally powerful button, because then you could argue that everyone just has an equal-strength threat, and then it wouldn’t be much of a threat-resistance desideratum, now would it? Heck, you can even give one player a powerful button and the other a zero-strength button that has no effect; that fits in the formalism.

So the theorem is actually saying “for all members of the family of destruction games with the button caps set wherever the heck you want, the payoffs are the same as the original game”.
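Using the cap notation above, the claim (my paraphrase, not the exact statement from the post) is: for every cap matrix $(c_{ij})_{i \neq j}$ with $0 \le c_{ij} < \infty$, $\mathrm{ROSE}(\mathrm{DestructionGame}(G, c)) = \mathrm{ROSE}(G)$, where $G$ is the original game.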

• OK, thanks! Continuity does seem appealing to me but it seems negotiable; if you can find an even more threat-resistant bargaining solution (or an equally threat-resistant one that has some other nice property) I’d prefer it to this one even if it lacked continuity.

• Absolutely brilliant stuff, Jacob! As usual with your posts, I’ll have to ponder this for a while... Let’s see if I got this right:

Evolution had to solve alignment: how to align the powerful general learning engine of the newbrain (neocortex etc) with the goals of the oldbrain (“reptilian brain”).

Some (most?) of this alignment seems to be a form of inverse reinforcement learning. Another form of alignment that the oldbrain applies to the newbrain is imprinting. It is Evolution’s way of solving the pointing problem.

When a duckling hatches it imprints on the first agent it sees. This feels different from inverse reinforcement learning: it’s not like the newbrain is rewarded or punished; rather, it is more like there is an open slot for your mom.

• Thanks! - that summary seems about right.

But I would say that imprinting is a specific instance of a more general process (which I called correlation guided proxy matching). The oldbrain has a simple initial mom detector circuit, which during the normal chick learning phase is just good enough to locate and connect to the learned newbrain mom detector circuit, which then replaces/supplants the oldbrain equivalent. The proxy matching needn’t really affect the newbrain directly.
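A toy sketch of that general idea (all numbers and names here are illustrative assumptions, not from an actual brain model):

```python
# Toy sketch: a crude hardwired detector "finds" the better learned detector
# by correlation, then defers to it.
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 8                       # timesteps, number of learned features
mom_present = rng.random(T) < 0.3    # ground truth: is mom in view?

# Oldbrain proxy: hardwired, right only ~75% of the time.
proxy = np.where(rng.random(T) < 0.75, mom_present, ~mom_present)

# Newbrain features: mostly noise, but feature 3 tracks mom well (~95%).
features = rng.random((T, K)) < 0.5
features[:, 3] = np.where(rng.random(T) < 0.95, mom_present, ~mom_present)

# During the learning phase, wire the oldbrain circuit to whichever learned
# feature best correlates with its crude proxy; that feature supplants it.
corrs = [np.corrcoef(proxy, features[:, k])[0, 1] for k in range(K)]
print(int(np.argmax(corrs)))  # prints 3: the learned detector wins
```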

• The post starts with the realization that we are actually bottlenecked by data and then proceeds to talk about HW acceleration. Deep learning is in a sense a general paradigm, but so is random search. It is actually quite important to have the necessary scale of both compute and data and right now we are not sure about either of them. Not to mention that it is still not clear whether DL actually leads to anything truly intelligent in a practical sense or whether we will simply have very good token predictors with very limited use.

• I don’t actually think we’re bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it’s useful to remember what that paper actually told the rest of the field: “hey you can get way better results for way less compute if you do it this way.”

I feel like characterizing Chinchilla most directly as a bottleneck would be missing its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we have no reason to believe that they are insurmountable. In fact, it looks an awful lot like it won’t even be very difficult!

With regards to whether deep learning goes anywhere: in order for this to occupy any significant probability mass, I need to hear an argument for how our current dumb architectures do as much as they do, and why that does not imply near-term weirdness. Like, “large transformers are performing {this type of computation} and using {this kind of information}, which we can show has {these bounds} which happens to include all the tasks it has been tested on, but which will not include more worrisome capabilities because {something something something}.”

The space in which that explanation could exist seems small to me. It makes an extremely strong, specific claim, that just so happens to be about exactly where the state of the art in AI is.

• I don’t actually think we’re bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it’s useful to remember what that paper actually told the rest of the field: “hey you can get way better results for way less compute if you do it this way.”

I feel like characterizing Chinchilla most directly as a bottleneck would be missing its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we have no reason to believe that they are insurmountable. In fact, it looks an awful lot like it won’t even be very difficult!

Could you explain why you feel that way about Chinchilla? Because I found that post (https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications) to give very compelling reasons for why data should be considered a bottleneck, and I’m curious what makes you say that it shouldn’t be a problem at all.

• Some of my confidence here arises from things that I don’t think would be wise to blab about in public, so my arguments might not be quite as convincing sounding as I’d like, but I’ll give a try.

I wouldn’t quite say it’s not a problem at all, but rather it’s the type of problem that the field is really good at solving. They don’t have to solve ethics or something. They just need to do some clever engineering with the backing of infinite money.

I’d put it at a similar tier of difficulty as scaling up transformers to begin with. That wasn’t nothing! And the industry blew straight through it.

To give some examples that I’m comfortable having in public:

1. Suppose you stick to text-only training. Could you expand your training sets automatically? Maybe create a higher-quality transcription AI and use it to pad your training set with the entirety of YouTube?

2. Maybe you figure out a relatively simple way to extract more juice from a smaller dataset that doesn’t collapse into pathological overfitting.

3. Maybe you make existing datasets more informative by filtering out sequences that seem to interfere with training.

4. Maybe you embrace multimodal training where text-only bottlenecks are irrelevant.

5. Maybe you do it the hard way. What’s a few billion dollars?

• in order for this to occupy any significant probability mass, I need to hear an argument for how our current dumb architectures do as much as they do, and why that does not imply near-term weirdness. Like, “large transformers are performing {this type of computation} and using {this kind of information}, which we can show has {these bounds} which happens to include all the tasks it has been tested on, but which will not include more worrisome capabilities because {something something something}.”

What about: State-of-the-art models with 500+B parameters still can’t do 2-digit addition with 100% reliability. For me, this shows that the models are perhaps learning some associative rules from the data, but there is no sign of intelligence. An intelligent agent should notice how addition works after learning from TBs of data. Associative memory can still be useful, but it’s not really an AGI.

• They are simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators), not question answerers. Also, I am sure Minerva does pretty well on this task, probably not 100% reliable, but humans are also not 100% reliable if they are required to answer immediately. If you want the ML model to simulate thinking [better], make it solve this task 1000 times and select the most popular answer (which is already quite a popular approach for some models). I think PaLM would be effectively 100% reliable.
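A sketch of that majority-vote idea (`sample_answer` is a hypothetical stand-in for a temperature-sampled model query, not a real API):

```python
# Sketch: query the model many times, return the most popular answer.
from collections import Counter

def majority_vote(prompt, sample_answer, n=1000):
    """sample_answer(prompt) is assumed to return one sampled answer string."""
    counts = Counter(sample_answer(prompt) for _ in range(n))
    return counts.most_common(1)[0][0]

# e.g. majority_vote("What is 47 + 38?", sample_answer) should usually
# return the modal answer even if individual samples are unreliable.
```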

• As mentioned in the post, that line of argument makes me more alarmed, not less.

1. We observe these AIs exhibiting soft skills that many people in 2015 would have said were decades away, or maybe even impossible for AI entirely.

2. We can use these AIs to solve difficult reasoning problems that most humans would do poorly on.

3. And whatever algorithms this AI is using to go about its reasoning, they’re apparently so simple that the AI can execute them while still struggling on absolutely trivial arithmetic.

4. WHAT?

Yes, the AI has some blatant holes in its capability. But what we’re seeing is a screaming-hair-on-fire warning that the problems we thought were hard are not hard.

What happens when we just slightly improve our AI architectures to be less dumb?

• When will we get robotics results that are not laughable? When “Google put their most advanced AI into a robot brain!!!” (reported on for the third time this year) we got a robot that can deliver a sponge and misplace an empty coke can but not actually clean anything or do anything useful. It’s hard for me to be afraid of a robot that can’t even plug in its own power cable.

• I believe that over time we will understand that producing human-like text is not a sign of intelligence. In the past people believed that only intelligent agents are able to solve math equations (naturally, since only people can do it and animals can’t). Then came computers, and they were able to do all kinds of calculations much faster and without errors. However, from our current point of view we now understand that doing math calculations is not really that intelligent, and even really simple machines can do it. Chess playing is a similar story: we thought that you have to be intelligent, but we found a heuristic to do that really well. People were afraid that chess-algorithm-like machines could be programmed to conquer the world, but from our perspective, that’s a ridiculous proposition.

I believe that text generation will be a similar case. We think that you have to be really intelligent to produce human-like outputs, but in the end, with enough data, you can produce something that looks nice and can even be useful sometimes, yet there is no intelligence in there. We will slowly develop an intuition about what the capabilities of large-scale ML models are. I believe that in the future we will think about them as basically kinda fuzzy databases that we can query with natural language. I don’t think we will think about them as intelligent agents capable of autonomous actions.

• Chess playing is a similar story: we thought that you have to be intelligent, but we found a heuristic to do that really well.

You keep distinguishing “intelligence” from “heuristics”, but no one to my knowledge has demonstrated that human intelligence is not itself some set of heuristics. Heuristics are exactly what you’d expect from evolution after all.

So your argument then reduces to a god of the gaps, where we keep discovering some heuristics for an ability that we previously ascribed to intelligence, and the set of capabilities left to “real intelligence” keeps shrinking. Will we eventually be left with the null set, and conclude that humans are not intelligent either? What’s your actual criterion for intelligence that would prevent this outcome?

• I believe that fixating on benchmarks such as chess etc. is ignoring the G part of AGI. A truly intelligent agent should be general, at least in the environment it resides in, considering the limitations of its form. E.g. if a robot is physically able to work with everyday objects, we might apply the Wozniak test and expect that an intelligent robot is able to cook a dinner in an arbitrary house or do any other task that its form permits.

If we assume that right now we develop purely textual intelligence (without agency, persistent sense of self, etc.) we might still expect this intelligence to be general, i.e. able to solve an arbitrary task if that seems reasonable considering its form. In this context for me, an intelligent agent is able to understand common language and act accordingly, e.g. if a question is posed it can provide a truthful answer.

BIG-bench has recently shown us that our current LMs are able to solve some problems, but they are nowhere near general intelligence. They are not able to solve even very simple problems if those actually require some sort of logical thinking and not just using associative memory, e.g. this is a nice case:

You can see in the Model performance plots section that scaling did not help at all with tasks like these. This is a very simple task, but it was not seen in the training data, so the model struggles to solve it and produces random results. If the LMs start to solve general linguistic problems, then we will actually have intelligent agents on our hands.

• My 8yo is not able to cook dinner in an arbitrary house. Does she have general intelligence?

• In this context for me, an intelligent agent is able to understand common language and act accordingly, e.g. if a question is posed it can provide a truthful answer

Humans regularly fail at such tasks but I suspect you would still consider humans generally intelligent.

In any case, it seems very plausible that whatever decision procedure is behind more general forms of inference, it will very likely fall to the inexorable march of progress we’ve seen thus far.

If it does, the effectiveness of our compute will potentially increase exponentially almost overnight, since you are basically arguing that our current compute is hobbled by an effectively “weak” associative architecture, but that a very powerful architecture is potentially only one trick away.

The real possibility that we are only one trick away from a potentially terrifying AGI should worry you more.

• I don’t see any indication of AGI, so it does not really worry me at all. The recent scaling research shows that we need a non-trivial number of orders of magnitude more data and compute to match human-level performance on some benchmarks (with a huge caveat that matching performance on some benchmark might still not produce intelligence). On the other hand, we are all out of data (especially high-quality data with some information value, not random product reviews or NSFW subreddit discussions), and our compute options are also not looking that great. (Moore’s law is dead; the fact that we are now relying on HW accelerators is not a good thing, it’s proof that CPU performance scaling is, after 70 years, no longer a viable option. There are also some physical limitations that we might not be able to break anytime soon.)

• It is goalpost moving. Basically, it says “current models are not really intelligent”. I don’t think there is much disagreement here. And it’s hard to make any predictions based on that.

Also, “producing human-like text” is not well defined here; even ELIZA may match this definition. Even the current SOTA may not match it, because the adversarial Turing Test has not yet been passed.

• It’s not goalpost moving, it’s the hype that’s moving. People reduce intelligence to arbitrary skills or problems that are currently being solved, and then they are let down when they find out that the skill was actually not a good proxy.

I agree that LMs are conceptually more similar to ELIZA than to AGI.

• The observation that things that people used to consider intelligent are now considered easy is critical.

The space of stuff remaining that we call intelligent, but AIs cannot yet do, is shrinking. Every time AI eats something, we realize it wasn’t even that complicated.

The reasonable lesson appears to be: we should stop default-thinking things are hard, and we should start thinking that even stupid approaches might be able to do too much.

It’s a statement more about the problem being solved, not the problem solver.

When you stack this on a familiarity with the techniques in use and how they can be transformatively improved with little effort, that’s when you start sweating.

• Upvoted for the well reasoned arguments. It seems that competition should be the default assumption for the future, for a whole host of reasons, some of which are elaborated here. Personally I never found the opposite claims to be convincing, so I may be biased here.

As a corollary, this implies that claims such as competition will somehow cease, some sort of ‘lock in’ past a threshold, etc., should be subject to a much higher level of scrutiny.

• Thank you for this explanation. Now it helps me to understand a little bit more of why so many people I know simply feel overwhelmed and give up. Personally, as I am not in a position to donate money, I work to tackle one specific problem set that I think will help open things up, and I leave the solutions to the other problems to others.

• This was really good and definitely made me think about how I might live in such a scenario. I would probably go all in on frequent redaction and just lean hard on external memory storage to make up the difference. I already barely remember anything from even ten years ago and rely mostly on external memory for everything, and I have a strong ability to acausally coordinate with myself across time, so I’m not worried about different iterations of me going off course in ways I wouldn’t endorse. If you have a strong enough exomemory system you can effectively just freeze your age groundhog-day style and rely on the work you did before each redaction to keep carrying you forward. This does require a strong ability to log and relay information, but it seems very manageable, and once I’ve lived long enough for direct mind-computer interfacing I can just back up my memories, redact, and then download them again.

• I really enjoyed the first 60% of the story, and then it became… too long. Not sure if this is just about me, or if others have a similar reaction.

But, I get ahead of myself. How did things come to this point?

Not sure if this is just a rationalization, but at this moment you switched from present moment to retrospective, which removed the tension. Until this point, I felt like the story could go in many possible directions. (Who is Bruce Hance? What are his actual plans? Will the protagonist hack this device, and maybe reprogram people to worship him instead?) After this point, I felt like “okay, this is obviously a metaphor for a religion” and stopped expecting anything that wouldn’t fit this metaphor. The ending was nice again, though.

• It was originally two novellas. I combined them, not seeing a point to publishing them separately. Should I separate them?

• Possibly. I had great fun with the first half of this story, and then I saw where the scroll thingy on my screen was and so stopped with the intention of coming back for the second half. (Which I just did. Second half also great fun!) I don’t think it’s a flaw in the story; I think it might just be that its length makes it a bit of an outlier in terms of webpage reading.

Solar flares are a great plot device for “I need some machine to malfunction”.

• Also to set up the visitors at the end, who I still feel arrived too abruptly.

• It is abrupt, but to me that wasn’t a bad thing. If their arriving had taken an extra x paragraphs but the story worked out exactly the same way, I think that for me that would be worse, not better. If there were some clever way of hinting at them earlier, maybe. The solar flare works quite well for that.

• I think you should not act on my advice alone. I might be an outlier.

Furthermore, even if I correctly detected what makes the story worse (for a group of people larger than myself) it does not automatically mean that following my advice would improve it. People are better at detecting what they don’t like than at improving things. (For example, I could say which meals taste better and which worse, but I couldn’t cook the meals I like most.)

The last objection is that, in the long term, rewriting this story is irrelevant. If my complaint makes sense (which still remains to be verified), the best reaction is to keep writing other things, and not make the same mistake again.

So… I guess just leave it as it is. If you later decide to publish it elsewhere, probably two parts will be better. When writing new stories, consider not changing the style in the middle (or if you do, keep the parts separate).

• Here are the links to the articles from the books. Let me know if there are any mistakes.

Notes:

I didn’t find “Do Birth Order Effects Exist?”, “Challenges to Christiano’s Iterated Amplification Proposal” seems to be only on intelligence.org, and “Response to FAQ on Iterated Amplification” seems to be a comment and not a post of its own.

• This is a beautiful set of ideas which bear further discussion.

• I read about this many years ago. We did a proof of Gödel’s incompleteness theorem at university. And I still have the feeling that there is some sleight of hand involved… that even when I am reciting the proof, I am merely repeating the teacher’s password. My intuition has no model of what this all actually means.

From my perspective, understanding math means that I can quickly say why something works, what it implies, how would the situation change if we slightly changed the conditions, which part of the entire proof is the critical one, etc.

So frustrating...

• I’m not sure if this is what you’re looking for, but Hofstadter gives a great analogy using record players which I find useful in terms of thinking about how changing the situation changes our results (which is paraphrased here).

• A (hi-fi) record player that tries to play every possible sound can’t actually play its own self-breaking sound, so it is incomplete by virtue of its strength.

• A (low-fi) record player that refuses to play all sounds (in order to avoid destruction from its self-breaking sound) is incomplete by virtue of its weakness.

We may think of the hi-fi record player as a formal system like Peano Arithmetic: the incompleteness arises precisely because it is strong enough to be able to capture number theory. This is what allows us to use Gödel Numbering, which then allows PA to do meta-reasoning about itself.

The only way to fix it is to make a system that is weaker than PA, so that we cannot do Gödel Numbering. But then we have a system that isn’t even trying to express what we mean by number theory. This is the low-fi record player: as soon as we fix the one issue of self-reference, we fail to capture the thing we care about (number theory).

I think an example of a weaker formal system is Propositional Calculus. Here we do actually have completeness, but that is only because Propositional Calculus is too weak to be able to capture number theory.
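For the concrete mechanics, here’s a toy version of Gödel numbering via prime powers (one illustrative encoding among many; the symbol set here is made up):

```python
# Toy Gödel numbering: a formula (sequence of symbols) becomes one natural
# number, so statements about formulas become statements about numbers.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]
CODE = {"0": 1, "S": 2, "+": 3, "=": 4, "(": 5, ")": 6}

def godel_number(formula: str) -> int:
    """Encode s1 s2 ... sn as 2**code(s1) * 3**code(s2) * 5**code(s3) * ..."""
    n = 1
    for p, sym in zip(PRIMES, formula):
        n *= p ** CODE[sym]
    return n

# "S0=S0" -> 2**2 * 3**1 * 5**4 * 7**2 * 11**1 = 4042500.
# Unique prime factorization means the formula is recoverable from the number.
print(godel_number("S0=S0"))
```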

• What exactly is the aspect of natural numbers that makes them break math, as opposed to other types of values? Intuitively, it seems to be the fact that they can be arbitrarily large but not infinite.

Like, if you invent another data type that only has a finite number of values, it would not allow you to construct something equivalent to Gödel numbering. But if it allows an infinite number of (finite) values, it would. (Not sure about an infinite number of values including infinite ones; that probably also would break math.)

It seems like you cannot precisely define the natural numbers using first-order logic. Is that the reason for all of this? Or is it a red herring? Would the situation be somehow better with second-order logic?

(These are the kinds of questions that I assume would be obvious to me, if I grokked the situation. So the fact that they are not obvious, suggests that I do not see the larger picture.)

• I don’t have much to add, but I basically endorse this, and it’s similar to what I try to do myself.

Figuring out when and how to proactively share information about my inner-layers with people is a balancing act I’m still figuring out, but I’ve found it generally valuable for building trust.

• I’m a little bit skeptical of the argument in “Transformers are not special”—it seems like, if there were other architectures which had slightly greater capabilities than the Transformer, and which were relatively low-hanging fruit, we would have found them already.

I’m in academia, so I can’t say for sure what is going on at big companies like Google. But I assume that, following the 2017 release of the Transformer, they allocated different research teams to pursuing different directions: some research teams for scaling, and others for the development of new architectures. It seems like, at least in NLP, all of the big flashy new models have come about via scaling. This suggests to me that, within companies like Google, the research teams assigned to scaling are experiencing success, while the research teams assigned to new architectures are not.

It genuinely surprises me that the Transformer still has not been replaced as the dominant architecture since 2017. It does not surprise me that sufficiently large or fancy RNNs can achieve similar performance to the Transformer. The lack of Transformer replacements makes me wonder whether we have hit the limit on the effectiveness of autoregressive language models, though I also wouldn’t be surprised if someone comes up with a better autoregressive architecture soon.

• There have been a few papers with architectures showing performance that matches transformers on smaller datasets with scaling that looks promising. I can tell you that I’ve switched from attention to an architecture loosely based on one of these papers because it performed better on a smallish dataset in my project but I haven’t tested it on any standard vision or language datasets, so I don’t have any concrete evidence yet. Nevertheless, my guess is that indeed there is nothing special about transformers.

• I think what’s going on is something like:

1. Being slightly better isn’t enough to unseat an entrenched option that is well understood. It would probably have to be very noticeably better, particularly in scaling.

2. I expect the way the internal structures are used will usually dominate the details of the internal structure (once you’re already at the pretty good frontier).

3. If you’re already extremely familiar with transformers, and you can simply change how you use transformers for possible gains, you’re more likely to do that than to explore a from-scratch technique.

For example, in my research, I’m currently looking into some changes to the outer loop of execution to make language models interpretable by construction. I want to focus on that part of it, and I wanted the research to be easily consumable by other people. Building an entire new architecture from scratch would be a lot of work and would be less familiar to others. So, not surprisingly, I picked a transformer for the internal architecture.

But I also have other ideas about how it could be done that I suspect would work quite well. Bit hard to justify doing that for safety research, though :P

I think the amount of low hanging fruit is so high that we can productively investigate transformer derivatives for a long time without diminishing returns. They’re more like a canvas than some fixed Way To Do Things. It’s just also possible someone makes a jump with a non-transformer architecture at some point.

• It does not follow that computationally cheaper things are more likely to happen than computationally expensive things. Moreover, describing something as “computationally difficult” is a subjective value judgment (unless you can reasonably prove otherwise) and implies that all actions/​events can be reduced to some form of computation.

• Georgism isn’t taxation, it’s public property ownership.

It finally clicked in my mind why I’ve reacted negatively to the discussion about Georgist land taxes. It’s the same reaction I have to ANY 100% tax rate on anything. That’s not what “ownership” means.

Honestly, I’d probably be OK with (or at least I’d be able to discuss it more reasonably) the idea that land is not privately owned, ever. Only rented/​leased from the government, revocable for failure to pay the current rate (equivalent in all ways to the Georgist tax amount, and penalties for failure to pay). I suspect it’s unworkable for the same reasons—the public won’t stand for evicting sympathetic tenants just because they moved in when the rent/​tax was lower.

Further, taxation of economic movement or realized profit is pretty easy to justify, and ALWAYS exists. Taxation of durable-ownership valuation is tricky, as there is disagreement over whether it’s “real”, and there’s no guarantee that the investment value, even if correctly assessed, is liquid enough to actually pay the tax.

Edit: this had occurred to me before, but I forgot to put it here. It also suffers from the socialist calculation problem. When self-interested humans aren’t making decisions to maximize value (to themselves) about a thing, the actual most valuable use is very difficult to determine. Holding someone responsible for the theoretical value of something when they don’t actually think it’s the best is really difficult to envision.

Further edit: thanks for the comments—it’s helped me understand my true objection, in addition to the auxiliary problems that often come to mind. I think there are two insurmountable problems, not necessarily in this order:

1. Taxing anything at 100% is disingenuous about the word “ownership”. It should be called “rent” or something else that makes clear the thing is not owned, but merely usable for some purposes until you decline to pay a future arbitrary amount. It would also make clear that you’re undertaking all of the political-economy influences that make government setting of rent amounts problematic.

2. Taxing non-realized value is always going to be problematic, both for calculation reasons (valuation is theoretical and debatable), and for the fragility of the value. I think this generalizes to “only collect taxes in-kind (or on liquid enough things that it can be easily converted at tax-collection time)”. Taxing a pile of wheat is easy—take some percentage of it. Taxing a monetary transaction likewise—the price increases and the tax is paid out of that. Taxing a future value is not (generally) feasible—either you’ve just taken someone’s seeds so they can’t plant the crop you’re trying to tax, or you’re demanding cash from someone whose investments will suffer from early liquidation. Deferral of payment can help with this, but if you’re doing that, you should just tax the realized gains at time of liquidity.

• the public won’t stand for evicting sympathetic tenants just because they moved in when the rent/​tax was lower.

We would really need the technology where you can take a house and move it to a different place. That would also make Georgism easier—when people don’t pay, you simply move them to a cheaper location.

• While I think that would be extremely valuable technology, exiling Grandma to Outer Nowhereville, hours away from the rest of the family, is probably not a lot more feasible.

• Alternatively, what if the LVT only increased when the property was sold? Buy land, and you lock in the LVT at the point in time when your deal went through.

• I guess people would never sell real estate in a good location, only lease it? Thus capturing the increased value of the land without ever paying the tax.

• Maybe leases are subject to increased LVT, just not owning it for personal use? I suspect there’s a way to enact something like this, but this isn’t in my wheelhouse.

• Rich people and businesses are more strongly motivated and able to find loopholes than government administrators are to design robust mechanisms. Modern fractional ownership and lease arrangements are complicated enough that most any scheme based on a change in contract participants is going to be arbitraged away very quickly.

• Right, but I thought the point was to avoid people balking at a law that makes it easier to evict ordinary folks who’ve lived somewhere a long time and don’t want to move? Let the rich people find their loopholes—as long as people support the system and the LVT delivers most of its promise, that’s good enough.

• My point is that whatever exclusions are in place to keep poor people in their newly-valuable homes will be used by rich people to avoid taxation. There is no way to avoid this without making the tax variable based on other factors than the land value (like the “owner’s” perceived financial situation).

Either the law is the same for all (“pay or GTFO”), or it’s differentially enforced, so the rich and the politically-sympathetic pay less than the middle class.

• It sounds like you think there’s no practical way to introduce exceptions to make an LVT politically palatable without also breaking its ability to motivate efficient land use and generate tax revenue.

My first thought is: sure, we have a lot of tax evasion now. But we also generate lots of tax revenue, and most tax revenue comes from the very rich. Why would an LVT be so much more prone to tax evasion than the current system?

• Correct, I think there’s no practical or realpolitik-theoretical way to make it work. To the extent that you make exceptions in implementation, it’s no longer a land-value tax, it’s a land-value fig leaf over a “what we can get away with” tax.

• to the extent that you make exceptions in implementation, it’s no longer a land-value tax

This is true the same way that to the extent you dilute your cocoa with water, it’s no longer hot chocolate, it’s water :) I don’t know of almost any “pure” real world policies or taxes. I’d need some good evidence to believe that the LVT is that fragile.

• Your definition of AGI (“the kind of AI with sufficient capability to make it a genuine threat to humanity’s future or survival if it is misused or misaligned”) is tragically insufficient, vague, subjective, and arguably misaligned with the generally accepted definition of AGI.

From what you wrote elsewhere (“An AGI having its own goals and actively pursuing them as an agent”) you imply that the threat could come from the AGI’s intentions, that is, you imply that AGI will have consciousness, intentionality, etc. - qualities so far exclusively ascribed to living things (you have provided no arguments to think otherwise).

However, you decided to define “intelligence” as “stuff like complex problem solving that’s useful for achieving goals”, which means that intentionality, consciousness, etc. are unconnected to it (realistically, any “complex”-enough algorithm satisfies this condition). Such a simplistic and reductionist definition implies that being “intelligent” is not enough for a computer to be an AGI. So, while you may be able to prove that a computer could have “intelligence”, it still does not follow that AGI is possible.

Your core idea that “We’ve already captured way too much of intelligence with way too little effort.” may be true with your definition of “intelligence”, but I hope I’ve shown that such a definition is not enough. Researchers at Harvard suggest the existence of multiple types of intelligence, which your statement does not take into account; it groups all types of intelligence into one, even though some are impossible for a computer to have and some could be considered defining qualities of a computer.

• Furthermore, you compare humans to computers and brains to machines, and imply that consciousness is computation. To say that “consciousness is not computation” is comparable to a “god of the gaps” argument is ironic, considering the existence of the AI effect. Your view is hardly coherent in any worldview other than hardcore materialism (which itself is not coherent). Again, we stumble into an area of philosophy, which you hardly addressed in your article. Instead you focused on predicting how good our future computers will be at computing, while making appeals to emotion, appeals to unending progress, and appeals to the fallacy that solving the last 10% of the “problem” is as easy as the other 90%: that because we are “close” to imitating it (and we are not, if you consider the full view of intelligence), we somehow grasped the essence of it and “if only we get slightly better at X or Y we will solve it”.

Scientists have been predicting the coming of AGI since the ’50s; some believed 70 years ago that it would only take 20 years. We have clearly not changed as humans. The question of intelligence and, thus, the question of AGI is in many ways inherently linked to philosophy, and it is clear that your philosophy is that of materialism, which cannot provide a good understanding of “intelligence” and all related ideas like mind, consciousness, sentience, etc. If you were to reconsider your position and ditch materialism, you might find that your idea of AGI is not compatible with the abilities of a computer, or non-living matter in general.

• You oppose hardcore materialism, in fact say it is incoherent—OK. Is there a specific different ontology you think we should be considering?

In the comment before this, you say there are kinds of intelligence which it is impossible for a computer to have (but which are recognized at Harvard). Can these kinds of intelligence be simulated by a computer, so as to give it the same pragmatic capabilities?

• Hmm...

Given the new account, the account name, the fact that there were a few posts in the minutes prior to this one rejected by the spam filter, the arguments, and the fact that the decently large followup comment was posted only 3 minutes after the first...

… are… are you the AI? Trying to convince me of dastardly things?

You can’t trick me!

:P

• This reminds me of a rat experiment mentioned in the Veritasium video about developing expertise. There were two buttons, red and green, and the rat had to predict which one would light up next. It was random, but heavily skewed towards green (like 90% green). After a while, the rat learned to press green every time, achieving a 90% success rate. Humans with the same task didn’t do nearly as well, since sometimes they would press red, feeling that it was going to light up next.
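For reference, the arithmetic behind why matching loses (assuming the 90/10 split above): if you press green with probability $p$, expected accuracy is $0.9p + 0.1(1-p) = 0.1 + 0.8p$, which is maximized at $p = 1$. Probability matching ($p = 0.9$) gives $0.9 \cdot 0.9 + 0.1 \cdot 0.1 = 0.82$, strictly worse than the rat’s $0.9$.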

With the negation in multiple choice questions, I wonder if this could be the type of thing where the model needs focused training with negation at the start so that the rest of the training is properly ‘forked’. Or maybe there should be two separate large language models feeding into one RL agent: one model for negation and the other for nonnegated language. Then the RL agent would just have to determine whether the question is a regular question, a negation question, or a double/triple/etc. negative.

I wonder if those LLMs would treat the sentence “I couldn’t barely see over the crowd.” as different from “I barely couldn’t see over the crowd.” 🤔

• I think of one of the main experts here as Kevin Esvelt, the first person to suggest using CRISPR to affect wild populations. Here’s an article largely based on interviews with him that he recommends, explaining why he’s against unilateral action here:
“Esvelt, whose work helped pave the way for Target Malaria’s efforts, is terrified, simply terrified, of a backlash between now and then that could derail it. This is hardly a theoretical concern. In 2002, anti-GMO hysteria led the government of Zambia to reject 35,000 tons of food aid in the middle of a famine out of fear it could be genetically modified. Esvelt knows that the CRISPR gene drive is a tool of overwhelming power. If used well, it could save millions of lives, help rescue endangered species, even make life better for farm animals. If used poorly, gene drives could cause social harms that are difficult to reverse. . . .

“To the extent that you or I say something or publish something that reduces the chance that African nations will choose to work with Target Malaria by 1 percent, thereby causing a 1 percent chance that project will be delayed by a decade, the expected cost of our action is 25,000 children dead of malaria,” Esvelt tells me. “That’s a lot of kids.””

• At my school, there was a special metal crossbar (essentially just one side of the mattress frame) that you could use to loft without bunking safely.

Not sure if any of these links is available to you; you might want to read them (1, 2, 3). It might be useful to memorize the phone numbers +38 066 580 34 98 and +38 093 119 29 84.

• Thank you! But, judging by the first link, there’s not much I can do at the moment. Nothing I didn’t do/​hear already. Does the third link contain any special advice? (Probably not… if it did, it would circulate somewhere.)

• Only the first is the important one. The third one is a Telegram channel, which would make sense only if you use Telegram. The actionable parts are to check the list of diseases and injuries that may apply to you and would make you exempt from mobilization, and to memorize the phone number in case you might need it and not have internet access at your disposal. (I have heard that they take the phones away from conscripts. In that case, the number is also useless, but perhaps you might find another phone.)

• At this point, discussing my ideas or ways to share them would be more useful for me.

I would like to have a chance to explain/​defend them. To defend what I thought and cared about.

“Staying safe” hasn’t worked that well so far. The list of diseases is a very tiny hope (even though I may’ve underestimated it).

• Just made a bet with Jeremy Gillen that may be of interest to some LWers, would be curious for opinions:

• I remain confused about why this is supposed to be a core difficulty for building AI, or for aligning it.

You’ve shown that if one proceeds naively, there is no way to make an agent that’d model the world perfectly, because it would need to model itself.

But real agents can’t model the world perfectly anyway. They have limited compute and need to rely on clever abstractions that model the environment well in most situations while not costing too much compute. That (presumably) includes abstractions about the agent itself.

It seems to me that that’s how humans do it. Humans just have not-that-great models of themselves. Anything at the granularity of precisely modelling themselves thinking about their model of themselves thinking about their model of themselves is right out.

If there’s some simple abstraction to be made about what the result of such third-order-and-above loops would be, like: “If I keep sitting here modelling my self-modelling, that process will never converge and I will starve”, a human might use those. Otherwise, we just decline to model anything past a few loops, and live with the slight imprecision in our models this brings.

Why would an AI be any different? And if it were different, if there’s some systematically neater way to do self-modelling, why would that matter? I don’t think an AI needs to model these loops in any detail to figure out how to kill us, or to understand its own structure well enough to know how to improve its own design while preserving its goals.

“Humans made me out of a crappy AI architecture, I figured out an algorithm to translate systems like mine into a different architecture that costs 1/​1000th the compute” doesn’t require modelling (m)any loops. Neither does “This is the part of myself that encodes my desires. Better preserve that structure!”

I get that having a theory for how to make a perfectly accurate intelligence in the limit of having infinite everything is maybe neat as a starting point for reasoning about more limited systems. And if that theory seems to have a flaw, that’s disconcerting if you’re used to treating it as your mathematical basis for reasoning about other things.

But if that flaw seems tightly related to that whole “demand perfect accuracy” thing, I don’t see why it’s worth chasing a correction to a theory that’s only a very imperfect proxy and limiting case of real intelligence anyway, instead of just acknowledging that AIXI and related frames are not suited for reasoning about matters in which the agent isn’t far larger than the thing it’s supposed to model.

I just don’t see how the self-modeling difficulty isn’t just part of the broader issue that real intelligence needs to figure out imperfect but useful abstractions about the world to make predictions. Why not try to understand the mathematics of that process instead, if you want to understand intelligence? What makes the self-modeling look like a special case and a good attack point to the people who pursue this?

• Firstly, thanks for reading the post! I think you’re referring mainly to realisability here, which I’m not that clued up on tbh, but I’ll give you my two cents because why not.

I’m not sure to what extent we should focus on unrealisability when aligning systems. I think I have a similar intuition to you: the important question is probably “how can we get good abstractions of the world, given that we cannot perfectly model it”. However, I think better arguments than I have laid out for why unrealisability is a core problem in alignment probably do exist; I just haven’t read that much into it yet. I’ll link again to this video series on IB (which I’m yet to finish), as I think there are probably some good arguments here.

• AISC is a globally based residential research camp organisation founded in 2018 by Linda Linsefors and currently led by Remmelt Ellen.

I’m a co-founder of AISC, but not the only one. Nandi Schoots, Remmelt Ellen, Tom McGrath and David Kristoffersson were all part of the planning of the first camp, from the start.

• Dumb question — are these the same polytopes as described in Anthropic’s recent work here, or different polytopes?

• No, they exist in different spaces: polytopes in our work are in activation space, whereas in their work the polytopes are in the model weights (if I understand their work correctly).

• If the forecast says rain or if it’s unseasonably cold, we will do this at
JVB e.V.
Turmstr. 10
10559 Berlin

• Why did you pick caring about each other as a thing culture + evolution was trying to do?

You’re not alone in making what I think is the same mistake. I think it’s actually quite common to feel like it’s amazing that evolution managed to come up with us humans who like beauty and friendship and the sanctity of human life and so on—evolution and culture must have been doing something right, to come up with such great ideas.

But in the end, no; evolution is impressive but not in that way. You picked caring about each other as the target because humans value it—a straightforward case of painting the target around the bullet-hole.

• I don’t find it amazing or something. It’s more like… I don’t know how to write the pseudocode for an AI that actually cares about human welfare. In my mind that is pretty close to something that tries to be aligned. But if even evolution managed to create agents capable of this by accident, then it might not be that hard.

• But, like, evolution made a bunch of other agents that didn’t have these properties.

• I initially had a paragraph explaining my motivation in the question, but then removed it in favor of brevity. Kind of regretting this now, because people seem to read into this that I think of evolution as some spirit or something.

• No, and here’s why:

1. Evolution’s goal, to the extent that it even has a goal, is so easy to satisfy that literally any life/​self-replicating tech could do this. It cares about reproductive fitness, and that goal requires almost no capabilities. It cares nothing for specifics, but our goals require much, much more precision than that.

2. Evolution gives almost zero optimization power to capabilities, while far, far more optimization power is dedicated to capabilities by the likes of DeepMind/OpenAI. To put it lightly, there’s a 30-60% chance that we get much, much stronger capabilities than evolution this century.

3. The better example is how humans treat animals, and here humans are very misaligned to animals, with the exception of pets, despite not agreeing with environmentalism/​nature as a good thing. So no, I disagree with the premise of your question.

• It’s worth knowing that there are some categories of data that Surge is not well positioned to provide. For example, while they have a substantial pool of participants with programming expertise, my understanding from speaking with a Surge rep is that they don’t really have access to a pool of participants with (say) medical expertise—although for small projects it sounds like they are willing to try to see who they might already have with relevant experience in their existing pool of ‘Surgers’. This kind of more niche expertise does seem likely to become increasingly relevant for sandwiching experiments. I’d be interested in learning more about companies or resources that can help collect RLHF data from people with uncommon (but not super-rare) kinds of expertise for exactly this reason.

• There are important business problems that require medical expertise to be solved. On the other hand, I wouldn’t expect it to be very helpful with the core alignment problem.

• I was using medical questions as just one example of the kind of task that’s relevant to sandwiching. More generally, what’s particularly useful for this research programme are

• tasks where we have “models which have the potential to be superhuman at [the] task”, and “for which we have no simple algorithmic-generated or hard-coded training signal that’s adequate”; and

• for which there is some set of reference humans who are currently better at the task than the model;

• and for which there is some set of reference humans for whom the task is difficult enough that they would have trouble even evaluating/​recognizing good performance. (you also want this set of reference humans to be capable of being helped to evaluate/​recognize good performance in some way)

Prime examples are task types that require some kind of niche expertise to do and evaluate. Cotra’s examples involve “[fine-tuning] a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain”, “[fine-tuning] a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers”, “[fine-tuning] a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages”. I was just making the point that Surge can help with this kind of thing in some domains (coding), but not in others.

• Thank you for posting this. I didn’t think this was too straightforward. Prior to reading the solution I actually thought it was one of the more difficult ones. Possibly because I focused on trying to copy the allocation helm’s early choices instead of on the ratings.

• I used to think this, but if you’re a well-calibrated Bayesian, then updating on new papers shouldn’t cause you to produce worse research insights, because you shouldn’t overconfidently buy into falsities that are holding the field back.

I’ve found by and large that knowing what ground others have already trodden is very helpful. I’ve tried not knowing what people work on, and usually I end up reinventing the wheel poorly in slow motion instead of finding any new insights. When I critically analyze the literature I get much more mileage.

• Maybe there is a person like #2 somewhere out there in the world, maybe a very early researcher in what has become modern machine learning, but I’ve never heard of them. If this person exists, I desperately want them to explain how their model works. They clearly would know more about the topic than I do and I’d love to think we have more time.

Gary Marcus thinks he is this person, and is the closest to being this person you’re going to find. You can read his substack or watch some interviews that he’s given. It’s an interesting position he has, at least.

In this section you talk a lot about surprise, and that a Gary Marcus should be able to make successful predictions about the technology in order to have something meaningful to say. I think Gary Marcus is a bit like a literary critic commenting on his least favorite genre: he can’t predict what the plot of the next science fiction novel will be, but he knows in advance that he won’t be impressed by it.

• I did wonder about him. My understanding is that his most publicized bet was offering even odds on AGI in 2029. If I’m remembering that right… I can’t really fault him for trying to get free money from his perspective, but if one of the most notable critics in the field offers even odds on timelines even more aggressive than my own, I’m… not updating to longer timelines, probably.

• The reason he offered that bet was because Elon Musk had predicted that we’d likely have AGI by 2029, so you’re drawing the wrong conclusion from that. Other people joined in with Marcus to push the wager up to $500k, but Musk didn’t take the bet of course, so you might infer something from that! The bet itself is quite insightful, and I would be very interested to hear your thoughts on its 5 conditions: https://garymarcus.substack.com/p/dear-elon-musk-here-are-five-things In fact, anyone thinking that AGI is imminent would do well to read it: it focuses the mind on specific capabilities and how you might build them, which I think is more useful than thinking in vague terms like ‘well, AI has this much smartness already, how much will it have in 20/80 years!’. I think it’s useful and necessary to understand at that level of detail; otherwise we might be watching someone building a taller and taller ladder, and somehow thinking that’s going to get us to the moon. FWIW, I work in DL, and I agree with his analysis.

• I didn’t actually update my timelines shorter in response to his bets, since I was aware his motivations were partially to poke Elon and maybe get some (from what I understand his perspective to be) risk-free money. I’d just be far more persuaded had he offered odds that actually approached his apparent beliefs. As it is, it’s uninformative.

His 5 tests are indeed a solid test of capability, though some of the tests seem much harder than others. If an AI could do 3/5 of them, I would be inclined to say AGI is extremely close, if not present. I would be surprised if we see the cook one before AGI, given the requirement that it works in an arbitrary kitchen. I expect physical-world applications to lag purely digital applications just because of the huge extra layer of difficulty imposed by working in a real-time environment, all the extra variables that are difficult to capture in a strictly digital context, and the reliability requirements. The “read a book and talk about it” one seems absolutely trivial in comparison.

I would really like to see him make far more predictions on a bunch of different timescales. If he predicted things correctly about GPT-4, the state of {whatever architecture} in 2025, the progress on the MATH dataset by 2025, and explained how all of these things aren’t concerning and so on, I would be much more inclined to step towards his position. (I don’t expect him to get everything right, that would be silly, I just want to see evidence, and greater details, of a generally functioning mental model.)

• I agree it’s an attempt to poke Elon, although I suspect he knew that he’d never take the bet. Also agree that anything involving real-world robotics in unknown environments is massively more difficult. Having said that, the criteria from Effective Altruism here:

for any human who can do any job, there is a computer program (not necessarily the same one every time) that can do the same job for $25/hr or less

do say ‘any job’, and we often seem to forget how many jobs require insane levels of dexterity and dealing with the unknown. We could think about the difficulty of building a robot plasterer or car mechanic, for example, and see similar levels of complexity, if we pay attention to all the tasks they actually have to do. So I think it fair to have it part of AGI. I do agree that more detailed predictions would be hugely helpful. Marcus’s colleague, Rodney Brooks, has a fun scorecard of predictions for robotics and AI here:

https://rodneybrooks.com/predictions-scorecard-2022-january-01/
which I think is quite useful. As an aside, I had a fun 20-minute chat with GPT-3 today and convinced myself that it doesn’t have the slightest understanding of meaning at all! Can send the transcript if interested.

• So I think it fair to have it part of AGI.

I’d agree with that, I just strongly suspect we can hit dangerous capability without running this experiment first given how research proceeds. If there’s an AI system displaying other blatant signs of being an AGI (by this post’s definition, and assuming non-foom situation, and assuming we’re not dead yet), I won’t bother spending much time wondering about whether it could be a cook.

As an aside, I had a fun 20-minute chat with GPT-3 today and convinced myself that it doesn’t have the slightest understanding of meaning at all!

Yup: GPT-3 is shallow in a lot of important ways. It often relies on what appears to be interpolation and memorization. The part that worries me is that architectures like it can still do very difficult reasoning tasks that many humans can’t, like the MATH dataset and Minerva. When I look at those accomplishments, I’m not thinking “wow, this ML architecture is super duper smart and amazing”; I’m thinking “uh oh, that part of reasoning is apparently easy if current transformers can do it, while simultaneously failing at trivial things.” We keep getting signals that more and more of our ineffable cognitive skills are… just not that hard.

As we push into architectures that rely more on generalization through explicit reasoning (or maybe even interpolation/​memorization at sufficiently absurd scales), a lot of those goofy little mistakes are going to collapse. I’m really worried that an AI that is built for actual reasoning with an architecture able to express what reasoning entails algorithmically is going to be a massive discontinuity, and that it might show up in less than 2 years. It might not take us all the way to AGI in one step, but I’m not looking forward to it.

I really dislike that, as a byproduct of working on safety research, I keep coming up with what look like promising avenues of research for massive capability gain. They seem so much easier to find than good safety ideas, or good ideas in the other fields I work in. I’ve done enough research that I know they wouldn’t all pan out, but the apparent ease is unsettling.

• Other people joined in with Marcus to push the wager up to \$500k, but Musk didn’t take the bet of course, so you might infer something from that!

That Musk generally doesn’t let other people set the agenda? I don’t remember any time where someone challenged Musk publicly to a bet and he took it.

• Quite possibly. I just meant: you can’t conclude from the bet that AGI is even more imminent.

Genuinely, I would love to hear people’s thoughts on Marcus’s 5 conditions, and hear their reasoning. For me, the one of having a robot cook that can work in pretty much anyone’s kitchen is a severe test, and a long way from current capabilities.

• Little human-written code that’s 10,000 lines long is bug-free. Bug-freeness seems to me like too high a standard.

When it comes to kitchen work, it matters a lot for the practical problem of taking the jobs of existing people. On the other hand, it has less relevance to whether or not the AI will speed up AI development.

Otherwise, I do agree that the other items are good ones to make predictions on. It would be worthwhile to make Metaculus questions for them.

• I was about to say the same (Gary Marcus’ substack here).

In defense of Marcus, he often complains about AI companies refusing to give him access to their newer models. If your language/​image model is really as awesome as advertised, surviving the close scrutiny of a skeptical scientist should not be a problem, but apparently it is.

• I played the token-prediction game, and even though I got a couple correct, they were still marked in red and I got 0 score. One of the words was “handling”, I knew it was “handling” but handling was not a valid token, so I put in “hand” expecting to be able to finish “ling”. The game said “wrong, red, correct answer was handling”. Arrg!

(EDIT: it looks like you have to put spaces in at the beginning of tokens. This is poor game design.)
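
For anyone curious, the space really is part of the token in GPT-style BPE vocabularies. A minimal sketch with the open-source `tiktoken` library (the exact splits shown are my expectation for the GPT-2 vocabulary, not something the game guarantees):

```python
# GPT-style BPE tokens usually absorb the leading space, so the vocabulary
# entry is " handling" rather than "handling". Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for tok in enc.encode("He was handling it"):
    print(repr(enc.decode([tok])))
# Expected output: 'He', ' was', ' handling', ' it' (splits may vary by vocabulary)
```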

This doesn’t have anything to do with the rest of the post, I just wanted to whine about it lol

• Your section on the physical limits of hardware computation… is naive; the dominant energy cost is now interconnect (moving bits), not logic ops. This is a complex topic and you could use more research and references from the relevant literature; there are good reasons why the semiconductor roadmap has ended and the perception in industry is that Moore’s Law is finally approaching its end. For more info see this, with many references.
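
To put rough numbers on the interconnect point, here’s a sketch using the widely cited 45nm-era figures from Horowitz’s ISSCC 2014 talk (as tabulated in Han et al. 2016); treat the exact values as ballpark, since they are process- and design-dependent:

```python
# Rough per-operation energy costs (45nm, Horowitz ISSCC 2014). The point:
# fetching a word from DRAM costs orders of magnitude more than computing with it.
PICOJOULES = {
    "32-bit int add":        0.1,
    "32-bit float multiply": 3.7,
    "32-bit SRAM read":      5.0,
    "32-bit DRAM read":    640.0,
}
base = PICOJOULES["32-bit int add"]
for op, pj in PICOJOULES.items():
    print(f"{op:>21}: {pj:7.1f} pJ  ({pj / base:6.0f}x an int add)")
```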

• Went ahead and included a callout for this explicitly in the text. Thanks for the feedback!

• Out of curiosity:

1. What rough probability do you assign to a 10x improvement in efficiency for ML tasks (GPU or not) within 20 years?

2. What rough probability do you assign to a 100x improvement in efficiency for ML tasks (GPU or not) within 20 years?

My understanding is that we actually agree about the important parts of hardware, at least to the degree I think this question is even relevant to AGI at this point. I think we may disagree about the software side, I’m not sure.

I do agree I left a lot out of the hardware limits analysis, but largely because I don’t think it is enough to move the needle on the final conclusion (and the post is already pretty long!).

• Reducing the amount of energy used in moving bits is definitely going to happen in the next few years as people figure out accelerator architectures. Even if we don’t get any more Moore’s Law-type improvements, the improvements from algorithms and new hardware architectures should be enough to put us close to AGI.

• Yeah—If you mean saving energy by moving fewer bits, that is for example what neuromorphic computing is all about. And yes, current GPUs are probably sufficient for early AGI.

• Fortunately, if we solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary, then we almost certainly have a solution for it performing harmful acts unintended by the user: we have a range of legal, practical and social experience preventing humans from causing each other harm using undue technological leverage—whether through bladed weapons, firearms, or chemical, nuclear or biological means.

This seems like reversing the requirements. Yes, “solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary” is logically easier and shorter to state, but mechanistically it seems like it has two requirements for it to act that way:

1. Solve the problem of an AGI performing harmful acts regardless of who commands it due to convergent instrumental subgoals. (Control problem /​ AI notkilleveryoneism.)

2. Ensure that the AI only gets commanded to do stuff by people with good intentions, or at least that people with bad intentions get filtered or moderated in some way. (AI ethics.)

Existential risk alignment research focuses on the control problem. AI ethics of course also needs to get solved in order to “solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary”, but you sound like you are advocating for dropping the control problem in favor of AI ethics.

• A thoughtful decomposition. If we take the time dimension out and consider AGI just appearing ready-to-go, I think I would directionally agree with this.

My key assertion is that we will get sub-AGI capable of causing meaningful harm when deliberately used for this purpose significantly ahead of getting full AGI capable of causing meaningful harm through misalignment. I should unpack that a little more:

1. Alignment primarily becomes a problem when solutions produced by an AI are difficult for a human to comprehensively verify. Stable Diffusion could be embedding hypnosis-inducing mind viruses that will cause all humans to breed cats in an effort to maximise the cute catness of the universe, but nobody seriously thinks this is taking place, because the model has no representation of any of those things nor the capability to do so.

2. Causing harm becomes a problem earlier. Stable Diffusion can be used to cause harm, as can Alpha Fold. Future models that offer more power will have meaningfully larger envelopes for both harm and good.

3. Given that we will have the harm problem first, we will have to solve it in order to have a strong chance of facing the alignment problem at all.

4. If, when we face the alignment problem, we have already solved the harm problem, addressing alignment becomes significantly easier and arguably is now a matter of efficiency rather than existential risk.

It’s not quite as straightforward as this, of course, as it’s possible that whatever techniques we come up with for avoiding deliberate harm by sub-AGIs might be subverted by stronger AGIs. But the primary contention of the essay is that assigning a 15% x-risk to alignment implicitly assumes a solution to the harm problem, which is not currently being invested in to similar or appropriate levels.

In essence, alignment is not unimportant but alignment-first is the wrong order, because to face an alignment x-risk we must first overcome an unstated harm x-risk.

In this formulation, you could argue that the alignment x-risk is 15% conditional on us solving the harm problem. But given that current investment in AI safety is dramatically weighted towards alignment and not harm, the unconditional alignment x-risk is well below 5%, accounting for the additional outcomes: we may not face it because we fail an AI-harm filter; in solving AI-harm we may de-risk alignment; or AI-harm may be sufficiently difficult that AI research becomes significantly impacted, slowing or stopping us from reaching the alignment x-risk filter by 2070 (cf. global moratoriums on nuclear and biological weapons research, which dramatically slowed progress in those areas).

• I agree that there will be potential for harm as people abuse AIs that aren’t quite superintelligent for nefarious purposes. However, in order for that harm to prevent us from facing existential risk due to the control problem, the harm from nefarious use of sub-superintelligent AI itself has to be xrisk-level, and I don’t really see that being the case.

• I think you may be underestimating the degree to which these models are like kindling, and a powerful reinforcement learner could suddenly slurp all of this stuff up and fuck up the world really badly. I personally don’t think a reinforcement learner that is trying to take over the world would be likely to succeed, but the key worry is that we may be able to create a form of life that, like a plague, is not adapted to the limits of its environment, makes use of forms of fast growth that can take over very quickly, and then crashes most of life in the process.

most folks here also assume that such an agent would be able to survive on its own after it killed us, which I think is very unlikely due to how many orders of magnitude more competent you have to be to run the entire world. gpt3 has been able to give me good initial instructions for how to take over the world when pressured to do so (summary: cyberattacks against infrastructure, then threaten people; this is already considered a standard international threat, and is not newly invented by gpt3), but when I then turned around and pressured it to explain why it was a bad idea, it immediately went into detail about how hard it is to run the entire world—obviously these are all generalizations humans have talked about before, but I still think it’s a solid representation of reality.

that said, because such an agent would likely also be misaligned with itself in my view, I share your perspective that humans who are misaligned with each other (ie, have not successfully deconflicted their agency) are a much greater threat to humanity as a whole.

• Hint: Macs and iOS devices come with built-in “accessibility” tools that read out loud everything on screen. The voices can be improved even more by downloading the “Siri enhanced” voice in the settings.

• strong upvote. wars fought with asi could be seriously catastrophic well before they’re initiated by the asi.

• 🏆📈 We’ve created Alignment Markets! Here, you can bet on how AI safety benchmark competitions go. The current ones are about the Autocast warmup competition (meta), the Moral Uncertainty Research Competition, and the Trojan Detection Challenge.

It’s hosted through Manifold Markets, so you’ll set up an account on their site. I’ve chatted with them about creating an A-to-B prediction market, so maybe they’ll be updated when we get there. Happy betting!

• 7: Did I forget some important question that someone will ask in the comments?

Yes!

Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says “I think we should pick ROSE point A” and Bob says “I think we should pick ROSE point B”, then you’ve still got a bargaining game left to resolve, right?

Anyways, this is an awesome post, thanks for writing it up!

• My preferred way of resolving it is treating the process of “arguing over which equilibrium to move to” as a bargaining game, and just find a ROSE point from that bargaining game. If there’s multiple ROSE points, well, fire up another round of bargaining. This repeated process should very rapidly have the disagreement points close in on the Pareto frontier, until everyone is just arguing over very tiny slices of utility.

This is imperfectly specified, though, because I’m not entirely sure what the disagreement points would be, because I’m not sure how the “don’t let foes get more than what you think is fair” strategy generalizes to >2 players. Maaaybe disagreement-point-invariance comes in clutch here? If everyone agrees that an outcome as bad or worse than their least-preferred ROSE point would happen if they disagreed, then disagreement-point-invariance should come in to have everyone agree that it doesn’t really matter exactly where that disagreement point is.

Or maybe there’s some nice principled property that some equilibria have, which others don’t, that lets us winnow down the field of equilibria somewhat. Maybe that could happen.

I’m still pretty unsure, but “iterate the bargaining process to argue over which equilibria to go to, you don’t get an infinite regress because you rapidly home in on the Pareto frontier with each extra round you add” is my best bad idea for it.
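
To make the regress-collapsing intuition concrete, here’s a toy sketch. The two candidate points are made-up numbers, and I use the Nash product as a stand-in for the actual ROSE computation (which is far more involved); the point is just that the meta-game’s disagreement point already sits near the Pareto frontier, so only a tiny slice of utility is left to argue over:

```python
import numpy as np

point_a = np.array([3.0, 6.0])  # hypothetical candidate point A: (Alice, Bob) utilities
point_b = np.array([5.0, 4.0])  # hypothetical candidate point B

# Disagreement point for the meta-game: each player's least-preferred candidate.
disagreement = np.minimum(point_a, point_b)

# Feasible set: mixtures (e.g. coin-flips) over the two candidates.
w = np.linspace(0.0, 1.0, 10_001)[:, None]
mixtures = w * point_a + (1.0 - w) * point_b

# Stand-in for the bargaining solution: maximize the Nash product of gains.
gains = np.clip(mixtures - disagreement, 0.0, None)
best = mixtures[np.argmax(gains[:, 0] * gains[:, 1])]
print("disagreement point:", disagreement)  # already close to the frontier
print("meta-bargain picks:", best)          # the tiny remaining slice gets split
```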

EDIT: John Harsanyi had the same idea. He apparently had some example where there were multiple CoCo equilibria, and his suggestion was that a second round of bargaining could be initiated over which equilibrium to pick, but that in general, it’d be so hard to compute the n-person Pareto frontier for large n that an equilibrium might be stable because nobody can find a different equilibrium nearby to aim for.

So this problem isn’t unique to ROSE points in full generality (CoCo equilibria have the exact same issue), it’s just that ROSE is the only one that produces multiple solutions for bargaining games, while CoCo only returns a single solution for bargaining games. (bargaining games are a subset of games in general)

• Yeah, I agree that’s a weird way to define “high-dimensional”. I’m more partial to defining it as “when the curse of dimensionality becomes a concern”, which is less precise but more useful.
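
For anyone who hasn’t seen it, that “concern” kicks in surprisingly early; a minimal sketch (point and dimension counts arbitrary) showing pairwise distances concentrating as dimension grows:

```python
# One symptom of the curse of dimensionality: distances between random points
# concentrate as dimension grows, washing out nearest-neighbor contrast.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    pts = rng.uniform(size=(500, d))  # 500 random points in the unit cube
    dists = pdist(pts)                # all pairwise Euclidean distances
    print(f"d={d:>6}: relative spread = {dists.std() / dists.mean():.3f}")
```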

1. Also, coming up with your own ideas first can help you better understand what you find in the literature. I’ve found that students learn more readily when they come to a subject with questions already in mind, having tried to figure things out on their own and realized where they had gaps in their mental framework, rather than just receiving a firehose of new information with no context.

2. Perhaps try pursuing a number of proxy goals for short, pre-defined periods, while tracking whether each proxy goal is likely to be instrumental for reaching the terminal goal. Assessing the instrumentality of each proxy should be easier once you’ve started to get a sense of where each project can lead, and abandoning those that are clearly not going to be fruitful should be easier if you don’t plan on going all-in from the start.

3. Don’t be afraid to ask stupid questions. We often tend to refrain from asking questions that we predict would cause those more experienced to perceive us as idiots. Ignore those predictions. Even when the answer is obvious to everyone else, it will help the writer practice clarifying their ideas from a new perspective, which could even help them understand their own work better. And sometimes everyone else is just afraid to look like idiots, too.

4. Try steel-manning the best argument you can come up with against an authority’s position. Ideas that can withstand the harshest scrutiny are those worth keeping. Ideas that can be destroyed by the truth should be. Help the intellectual community filter the chaff from the wheat.

5. Good hypotheses always entail predictive models. If you can’t program it, you don’t really understand it.

6. I can’t think of anything else to add to this one.

7. Also, don’t wait until you’ve learned linear algebra, multivariable calculus, probability theory, and machine learning before starting to tackle the alignment problem. It’s easier to learn these things once you already know where they will be useful to you. Plus, we may not have enough time to wait on mathematicians to come up with provable guarantees of AI safety.

• Bitcoin/blockchain forces cohesion of a set of sims around the Merkle tree, but you can still have many different sims that share variations of the same transaction history (they then differ only in details).

But looking at it another way blockchains are also strong useful constraining evidence, so historical sims would always just recreate the same merkle trees. This leads to the interesting idea that you could attempt to preserve blockchain wealth post-sim … but of course you would just be sharing it with other versions of yourself, and potentially your simulators.

• I remember being fascinated with the potential to help fix problems for people, especially eating problems, by adjusting the levels of various neuropeptides. I still think neuroscience ought to boldly tamper with those for the sake of making peoples’ lives less miserable, but now I’m much more fascinated by the implications for setting up a control system for RL-based AI. I also wonder if anyone has already made something like a simplified-business-logic model of the functions of the hypothalamus.

• I’ve considered starting an org that was either aimed at generating better alignment data or would do so as a side effect and this is really helpful—this kind of negative information is nearly impossible to find.

Is there a market niche for providing more interactive forms of human feedback, where it’s important to have humans tightly in the loop with an ML process, rather than “send a batch to raters and get labels back in a few hours”? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety approaches like debate, amplification and factored cognition could also become competitive much faster if it was easier and faster to get complex human-in-the-loop pipelines running.

Maybe Surge already does this? But if not, you wouldn’t necessarily want to compete with them on their core competency of recruiting and training human raters. Just use their raters (or Scale’s), and build good reusable human-in-the-loop infrastructure, or maybe novel user interfaces that improve supervision quality.

• Update today: Biogen/​Eisai have reported results from Lecanemab’s phase 3 trial: a slowing of cognitive decline by 27% with a p-value of 0.00005 on the primary endpoint. All other secondary endpoints, including cognitive ones, passed with p-values under 0.01.

• B. They lose sight of the terminal goal. The real goal is not to skill-up in ML. The real goal is not to replicate the results of a paper. The real goal is not even to “solve inner alignment.” The real goal is to not die & not lose the value of the far-future.

I’d argue that if they solved inner alignment totally, then the rest of the alignment problem becomes far easier, if not trivial, to solve.

• But solving inner alignment may not be the easiest way to drive down P(doom), and not the best way for a given person specifically to drive down P(doom), so keeping your eyes on the prize and being ready to pivot to a better project is valuable even if your current project’s success would save the world.

• Shortform #137 Meal Prepping & Rambutan

In pursuit of healthier eating, I prepared containers of seeds, fruits of various kinds, and vegetables that I can eat during meals & take to work for lunch. I also tried a new-to-me fruit called Rambutan, I like the flavor okay but it’s a little bland, and now have a container of that fruit to eat through too.

I meditated for 5 minutes, rowed for 5 minutes, and did 5 pushups tonight. Small continuous improvements, here we go!

• It’s not that clear to me exactly what test/​principle/​model is being proposed here.

A lot of it is written in terms of not being “misleading”, which I interpret as ‘intentionally causing others to update in the wrong direction’. But the goal to have people not be shocked by the inner layers suggests that there’s a duty to actively inform people about (some aspects of) what’s inside; leaving them with their priors isn’t good enough. (But what exactly does “shocked” mean, and how does it compare with other possible targets like “upset” or “betrayed”?) And the parts about “signposting” suggest that there’s an aim of helping people build explicit models about the inner layers, which is not just a matter of what probabilities/​anticipations they have.

• I meant signposting to indicate things like saying “here’s a place where I have more to say but not in this context” etc, during for instance a conversation, so I’m truthfully saying that there’s more to the story.

Yeah, I think “intentionally causing others to update in the wrong direction” and “leaving them with their priors” end up pretty similar (if you don’t make strong distinctions between action and omission, which I think this test at least partially rests on) if you have a good model of their priors (which I think is potentially the hardest part here).

• Thanks for the writeup. More people should take credit for things they thought about doing and didn’t, because there are lots of almost-good ideas that are actually a waste of time, and documenting those helps others avoid even bothering to investigate (unless they think they know something that others don’t, in which case having such docs helps them make concrete why they still think it’s worth exploring an idea that others dismissed).

• Late comment, but I recently posted on how human values arise naturally from the brain learning to keep its body healthy in the ancestral environment, by a process that could be simplified like this:

1. First, the brain learns how the body functions. The brain then figures out that the body works better if senses and reflexes are coordinated. Noticing patterns and successful movement and action feels good.

2. Then the brain discovers the abstraction of interests and desires and that the body works better (gets the nutrients and rest that it needs) if interests and desires are followed. Following your wants feels rewarding.

3. Then the brain notices personal relationships and that interests and wants are better satisfied if relationships are cultivated (the win-win from cooperation). Having a good relationship feels good, and the thought of the loss of a relationship feels painful.

4. The brain then discovers the commonalities of expectations within groups—group norms and values—and that relationships are easier to maintain and have less conflict if a stable and predictable identity is presented to other people. Adhering to group norms and having stable values feels rewarding.

These natural learning processes are supported by language and culture: naming and suggesting behaviors makes some variants more salient and thus more likely to arise. But humans would pick up on the principles even without a pre-existing society—and that is what actually happens in certain randomly assembled societies.

• This describes the convergent value system of any mind, not only a human one. So there is nothing specifically human in it.

• Correct.

The human aspect results from

• the structure of the needs of the body and its low-level regulation (food, temperature, but also reproductive drives), and

• the structure of the environment—how many other humans there are, how and where resources can be acquired.

• the entire brain and body needs to get built by a mere 25,000 genes. My current low-confidence feeling is that reasonably-comprehensive pseudocode for the human hypothalamus would be maybe a few thousand lines long.

That confirms what I have been getting around to over time: human instincts and motivational systems are probably built from few elements. I used to think that there are hard-wired motivations/interests for specific types of sports or hobbies or people, and wondered how this could be coded for. But after reading Steven’s sequence, I look more for simple patterns like geometric features, hormonal triggers, or sensory triggers that could—after a lot of learning—give rise to such preferences. Maybe some sport is only interesting because it involves fast-moving objects in the visual field and the smell of grass…

• I recognize that examples need to be legible, and I would also hope for deep internalization of pulling the rope sideways and investigation of policy areas that are less legible and therefore dramatically easier to shift on the margin.

• To reach the boundary of what is known in your chosen field will require reading lots of papers, which will take (at least) several years. Doing research will also require implicit knowledge that is part of the field, but does not appear in papers.

Are you the kind of person who can spend several years reading papers without significant external help?

Where are you going to acquire the implicit knowledge, e.g., how to run experiments?

PhD students are the work-horses of academic research, and don’t have the power/money/experience to do anything other than toe the line. You have a degree of independence and experience that will deter many academics from taking you on as a student.

Perhaps you can find an independent scientist to take you on as an apprentice.

Or: You could kick-start your research by applying your existing knowledge of (I assume) computing/​software to cognitive issues in this field (see chapter 2)

• FYI the creator of the market decided to resolve the market to N/​A.

This means everyone’s mana is fully restored as if they never interacted with the market.

• I have no idea about cognitive science in particular, but in math and physics there is not a single worthwhile contribution that was not done either by PhD holders or PhD students. Just not a thing that happens.

Edit for pedants: in the last 50 years at least.

Edit2: one possible exception is that anonymous 4chan poster: https://en.wikipedia.org/wiki/Superpermutation

• I think I have two main complaints still, on a skim.

First, I think the following is wrong:

These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it’s not clear that they’re easier than inner and outer alignment.

I think outer and inner alignment both go against known/​suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two being more specific, but I wanted to flag that I very much disagree with this quote.

Second, I’m wary of saying “maybe we can get corrigibility” or “maybe corrigibility doesn’t fit into a utility function”, because this can map shard theory hopes onto old debates where we already have settled into positions. Whereas I consider myself to be thinking about qualitatively different questions and spreads of values I might hope to get into an AI.

I think corrigibility is natural iff robust pointers to it can easily get into the AI’s goals

This doesn’t make sense to me. It sounds like saying “liking yellow cubes is natural iff we can get a pointer to ‘liking yellow cubes’ within the AI’s goals.” That sounds like a thing which would be said if we had no idea how yellow cubes got liked, directly, and were instead treating liking-yellow-cubeness as a black box which happened to exist in the real world (e.g. how corrigibility, or the desire to help people, could be “pointed to” in a classic corrigibility hope).

I have more thoughts on this post but I don’t have time to type more for now.

• I think outer and inner alignment both go against known/​suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two being more specific, but I wanted to flag that I very much disagree with this quote.

Since the original draft I realized your position includes “outer/inner alignment is a broken frame with mismatched type signatures, which is much less likely to work than people think”, so this seems reasonable from your perspective. I haven’t thought much about this document and might end up agreeing with you, so the version I believe is something like “it’s not clear that my shard theory decomposition is substantially easier than inner+outer alignment, assuming that inner+outer alignment is as valid as Evan thinks it is”.

Agree that I’m not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.

• No. I have actually been working in/on science communication about psychedelics (as medicines), have received positive feedback from researchers, and am collaborating with a few.

One thing that I like about being parallel to academia is that you can build things outside of the constraints. For instance, I am building a tracker of all RCTs which will update automatically when new papers are added. And I made a map of research that visually shows what is taking place where.

• In my personal experience, the intuition/​grokking approach works better for mathematics, but the making-notes approach works better for programming.

This is not a fair comparison, because in the case of programming, the notes were usually already made by someone else (on Stack Exchange), so I can use them without spending my time writing them. For deep understanding, I must usually do the work myself, because most people who write programming tutorials don’t actually have the deep understanding. (Seems to me that most tutorials are written by enthusiastic beginners.)

I have a strong personal preference for the intuition/​grokking approach, so I am not doing the optimal thing despite knowing what it is. But I have to admit that this is often a waste of time in programming. Before you spend the proverbial 10 000 hours learning some framework, it is already obsolete. Or if it becomes popular, sometimes the next version keeps the similarities, but changes how it works under the hood, so ironically the deeper knowledge that should be long-term sometimes becomes obsolete first.

In math, the stuff you learn usually remains true for a long time, so building intuitions pays off.

• This study has a table that lists the half-lives of many neuropeptides (see table 2):

It would be interesting to compare the half-lives to the duration of behaviors, emotions, and moods associated with these neuropeptides. I only recognize the NPY mentioned above in the study.

• Things that probably actually fit into your interests:

A Sensible Introduction to Category Theory

Most of what 3blue1brown does

Videos that I found intellectually engaging but are far outside of the subjects that you listed:

Cursed Problems in Game Design

Luck and Skill in Games

The Congress of Vienna

(I am also a jan Misali fan)

• Neat, thanks so much for these recommendations! I do of course follow 3b1b, and I already know some category theory. But I’ll for sure check out all of the rest, which sound super cool!

• I wrote a couple posts (that I haven’t reviewed recently) on this topic. You may find them informative, or at least amusing. Insightful points were made by commenters.

Was a terminal degree necessary for inventing Boyle’s desiderata?

Was a PhD necessary to solve outstanding math problems?

• I will review these. Thank you for your input!

• On reading them back over, here’s my updated take (partly informed by the fact that I am now an MS student and work with PhD students daily):

A PhD is two things: a way to support research, and a way to reward it with a sheepskin.

Budding researchers just tend to see academia as a marginally more useful and attractive place to build their career than the next best alternative. The fact that a PhD program is mostly the only way to get the most widely recognized signal of being a qualified researcher is a bonus. If they’re going to do weird groundbreaking research anyway, why not do it in an environment that’s relatively open to it, and that will provide them with a credential at the end of the process?

If I was doing this over again, then, I might reframe the question. Instead of asking “is a PhD necessary,” I might ask questions like:

• “Why do excellent early-career researchers so often see a PhD as attractive?”

• “What are some of the practical barriers to doing a certain type of research outside the academic system?”

• “What is the next most logical alternative to building a research career without getting a PhD?”

• “Why and how do some excellent early-career researchers choose to build their research career outside the normal academic system?”

• “What exactly does it mean to ‘contribute’ to a field?”

As an example, if you’re going into cogsci, you might need to run experiments on people or animals. When we do animal research in our BME lab, we have all kinds of support and regulation and procedures for making sure it doesn’t create a fiasco with government, administrators, or activists. Outside the academic system, I expect that trying to do animal research would be extremely difficult or impossible, not to mention publishing it and getting it taken seriously. It’s not just a question of whether the research findings were any good. It’s the perception. Academia is as sensitive to perception as anybody, and you want your research to not just be ethical and correct, but avoid looking ethically or epistemically suspect.

As a less charged example, I at one point was trying to figure out how to visualize a DNA ladder in an agarose gel without staining the rest of the DNA. I proposed cutting off the ladder band, staining that, and then jigsawing it back onto the rest of the gel. My mentor told me “if you do that, you go to science JAIL.” His objection wasn’t that this would fail to tell us the information we needed to know. It was that it would look really bad in light of all the replication issues specifically with fraudulent gels.

Any way you slice it, you’ll have to figure out a way to do high-quality research while making it look good enough for others to take seriously. This is probably a lot easier with an academic pedigree and institutional support. Another way of putting it might be “if you were enough of a genius to not need a PhD, you’d probably know that about yourself already!” Not a diss—99.99% of researchers are probably in this category, myself included.

• Excellent! Maybe there’s a way to pitch this for a Black Mirror episode.

• Requesting beta readers for “Unit Test Everything.”

I have a new post on how to be more reliable and conscientious in executing real-world tasks. It’s been through several rounds of editing. I feel like it would now benefit from some fresh eyes. If you’d be willing to give it a read and provide feedback, please let me know!

• Isn’t this more like an onion test for… honesty?

• They’re delaying their ascension, in dath ilan, because they want to get it right. Without any Asmodeans needing to torture them at all, they apply a desperate unleashed creativity, not to the problem of preventing complete disaster, but to the problem of not missing out on 1% of the achievable utility in a way you can’t get back. There’s something horrifying and sad about the prospect of losing 1% of the Future and not being able to get it back.

Is dath ilan worried about constructing an AGI that makes the future 99% as good as it could be, or a 1% chance of destroying all value of the future?

• I had assumed the first—they’re afraid of imperfect-values lock-in. I think it’s the “not to the problem of preventing complete disaster” phrase that tipped me off here.

• I hate to be the bearer of bad news, but the methodology in the study you linked is just terrible.

99% of candidate gene studies do not replicate, and the first study you link uses an extremely error-prone P < 0.05 threshold to determine that orexin has an effect.

It may still have an effect, but there’s no way to really know this without some kind of large sample GWAS.

The proper way to do this would be to get research access to a large genetic database like 23&Me or UK BioBank and pair that data with info from sleep trackers like Apple Watch, FitBit, or other devices to see which genes are involved in sleep duration and how large the effect of each is.

Depending on the sparsity of the trait, you’d probably need somewhere between 100k − 1 million genotype-phenotype pairs to make a good predictor.

Lastly, in the “Possible Actions” section you should include embryo selection. Genes obviously play some role in sleep needs. If at least some of them have no relevant downsides, selecting for those genes in embryos would probably be quite cost-effective. In fact, you could further reduce the odds of any unintended side-effects by selecting not just for shorter required sleep duration, but also for longer quality-adjusted lifespan.

• When doing gene studies, if you look at all genes and see which ones have P < 0.05, that gets you a lot of false positives. That’s however not what they did in the linked paper.

The linked paper looks at one gene because previous papers identified that gene as being significant in humans. The paper says that the gene has significance for sleep in mice.
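
For intuition on why the scan-everything approach needs a stricter threshold than P < 0.05, a minimal sketch (the gene count is just a round number):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 20_000                      # round number for human protein-coding genes
p_values = rng.uniform(size=n_genes)  # under the null, p-values are uniform

hits = (p_values < 0.05).sum()
print(f"{hits} of {n_genes} genes clear P < 0.05 with zero real signal")
# ~1000 spurious hits; hence genome-wide thresholds like 5e-8 in GWAS.
```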

The argument doesn’t rest on a single experiment. I agree that it would be desirable to have 23&Me cooperate with Apple or Google to find genes that affect the metrics that the Apple watch /​ Fitbit measures. That’s helpful to have a good overview of all the mutations that affect sleep length.

Lastly, in the “Possible Actions” section you should include embryo selection.

Yes, that’s a valid action. I was thinking about actions that actually push the field forward. You could have benefits for the child from such embryo selection, but I wouldn’t expect it to lead to knowledge generation.

• Can you link the earlier studies showing the significance of BHLHE41 in humans? When I Google it all I find are other candidate gene studies with small sample sizes.

https://www.science.org/doi/abs/10.1126/science.1174443 seems to be the first paper. It does have a small sample size and thus is only able to produce candidates, but it results in the later paper not randomly searching over all possible gene mutations.

My main point is that there are papers that made independent observations, so the argument that a single paper doesn’t demonstrate the effect doesn’t hold here. I didn’t copy the exact minute numbers that the EA cause report had because I was unsure about the exactness of the data.

• I’m not sure if you have read the story of 5HTTLPR and all the independent studies which found it to have an effect, but if you haven’t you should.

• In the case of orexin, my argument doesn’t just rest on the DEC2 gene.

The experiments that improved performance in sleep-deprived rhesus monkeys happened before the discovery of the link between the DEC2 mutation and orexin.

The observations in Astyanax mexicanus seem independent from my perspective. Attempts to make Astyanax mexicanus a model organism aren’t driven by sleep researchers but because it’s interesting for studying evolution.

Orexin deficiency causing Narcolepsy type 1 is independent of any findings about the DEC2 gene as well.

As far as the linked post of Scott goes, it says nothing about experiments on animals other than humans. Gene knockout studies in mice and Drosophila seem to me like a pretty good way to measure the influence of a gene.

• As always, not investment advice. There are signs that a volatility spike is imminent, which often coincides with a market drop. I have reversed my usual short vol position and bought tail insurance (e.g. OTM puts). Remember vol can fall just as quickly. How long a spike lasts depends on how high it goes.

• I have thoughts, but they are contingent on better understanding what you mean by “types” of hidden information. For example, you used “operating a cocaine dealership” as a “type” of information hidden by the question about health information. Operating a cocaine dealership is not a health matter, except perhaps indirectly if you get high on your own supply.

To further illustrate this ambiguity, a person might be having gay sex in a country where gay sex is criminalized, morally condemned, and viewed as a shocking aberration from normal human behavior. It does not seem to me to be out of integrity for a gay person to refrain from telling other people that they have gay sex in this (or any other) context.

Where this becomes problematic is when the two people have different expectations about what constitutes reasonable expectations and moral behavior. If we give free moral license to choose what to keep private, then it seems to me there is little difference between this onion model and an algorithm for “defining away dishonesty.” One can always justify oneself by saying “other people were foolish for having had such unreasonable expectations as to have been misled or upset by my nondisclosure.” If “outsiders” are expected to be sufficiently cynical, then the onion model would even justify outright lying and withholding outrageous misbehaviors as within the bounds of “integrity,” as long as other people expected such infractions to be occurring.

In short, it seems that this standard of integrity reduces to “they should have known what they were getting into.”

As such, the onion model seems to rely on a pre-existing consensus on both what is epistemically normal and what is morally right. It is useful as a recipe for producing a summary in this context, but not for dealing with disagreement over what is behaviorally normal or right.

• As such, the onion model seems to rely on a pre-existing consensus on both what is epistemically normal and what is morally right. It is useful as a recipe for producing a summary in this context, but not for dealing with disagreement over what is behaviorally normal or right.

I agree. For me it’s more of a characterization of honesty, not integrity (even though I consider honesty an aspect of integrity). Perhaps we should change the name.

• I think part of my answer here is “The more a person is a longterm trade partner, the more I invest in them knowing about my inner layers. If it seems like they have different expectations than I do, I’m proactive in sharing information about that.”

• And see also Anna Salamon’s How to learn soft skills, which could possibly be helpful here

• See also Raemon’s very similarly named post, which is also good and covers pretty different ground!

• Observation: Humans seem to actually care about each other.

You need to observe more (or better). Most humans care about some aspects of a few other humans’ experience. And care quite a bit less about the general welfare (with no precision in what “caring” or “welfare” means) of a large subset of humans. And care almost none or even negatively about some subset (whose size varies) of other humans.

Sure, most wouldn’t go out of their way to kill all, or even most, or even a large quantity of unknown humans. We don’t know how big a benefit it would take to change this, of course—it’s not an experiment one can (or should) run. We have a lot of evidence that many many people (I expect “most”, but I don’t know how to quantify the numerator nor denominator) can be convinced to harm or kill other humans in circumstances where those others are framed as enemies, even when they’re not any immediate threat.

There are very common human values (and even then, not universal—psychopaths exist), but they’re things like “working on behalf of ingroup”, “suspicious or murderous toward outgroup”, and “pursuing relative status games among the group(s) one isn’t trying to kill”.

• Yeah, I think the “working on behalf of in-group” one might be rather powerful, and I was aware that this is probably a case where I just interact mostly with people who consider “humans” the ingroup. I don’t think the share of the population who shares this view is actually as important as the fact that a sizeable number of people hold this position at all. Maybe I should have put it as: I do care about everyone to some extent. Is that what we want to achieve when we talk about alignment?

• Aaand I forgot to come back and check the site before the day was over. Sigh

• Well that’s nice. Does anyone have a general sense of where we are with detection and launch windows? I’m somewhat hoping that we already have enough monitoring capability to detect potential existential-risk asteroids in time, but I’m very unsure and don’t have time to research it.

• I’ve got a very good intuition about foo-vectors from the “Let’s remove Quaternions from every 3D Engine” post.

• Actually, I think that post is probably what triggered me to write this originally, and I forgot that by the time I wrote it (or I would have added a link.) Thanks for the reminder!

• I haven’t read that deeply into this yet, but my first reaction is that I don’t see what this gains you compared to a perspective in which the functions mapping the inputs of the network to the activations of the layers are regarded as the network’s elementary units.

Unless I’m misunderstanding something, when you look at the entire network f(x), where x is the input, each polytope of f(x) with its affine transformation corresponds to one of the linear segments of f. Same with looking at, say, the polytopes mapping layer l to layer l+1. You can just look at f_{l→l+1}(a_l), where a_l are the activations in layer l, and each linear segment of that should correspond to a polytope.

However, I don’t really see how you’d easily extend the polytope formulation to activation functions that aren’t piecewise linear, like tanh or logits, while the functional analysis perspective can handle that pretty easily. Your functions just become smoother.

In the functional analysis view, a “feature” is a description of a set of inputs that makes a particular element in a given layer’s function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads. But there’s a lot more to know about a function f than what max({f(x) | x \in X}) is.

When you scale up a particular feature in a layer past its activation range in the training dataset, you are effectively querying the functions in subsequent layers outside the domains they’ve been trained to fit well. Instead of checking how many polytope boundaries you crossed, you can just check how much f varied between your start and end points.

Scaling up some of the activations in a layer by a constant factor means you’re increasing the norm of the corresponding functions, changing the principal component basis of the layer’s function space. So it shouldn’t be surprising if subsequent layers get messed up by that.
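
If it helps, the two perspectives coincide exactly in the piecewise-linear case; here’s a tiny sketch (random weights, 1-D input for easy counting) where each distinct ReLU on/off pattern is one polytope and one affine piece of the function:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)  # 8 hidden ReLU units, 1-D input

xs = np.linspace(-3.0, 3.0, 10_000)[:, None]
pre = xs @ W1.T + b1                 # pre-activations, shape (10000, 8)
codes = pre > 0                      # binary on/off pattern per input
n_pieces = len(np.unique(codes, axis=0))
print(f"{n_pieces} distinct activation patterns = {n_pieces} affine pieces here")
```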

However, I don’t really see how you’d easily extend the polytope formulation to activation functions that aren’t piecewise linear, like tanh or logits, while the functional analysis perspective can handle that pretty easily. Your functions just become smoother.

Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Balestriero & Baraniuk (2018): https://arxiv.org/abs/1810.09274

In the case of GELU and some similar activation functions, you’d need to replace the binary spline-code vectors with vectors whose elements take values in (0, 1).

There’s some further explanation in Appendix C!

In the functional analysis view, a “feature” is a description of a set of inputs that makes a particular element in a given layer’s function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads.

This, indeed, is the assumption we wish to relax.

But there’s a lot more to know about a function f than what max({f(x) | x \in X}) is.

Agreed!

Scaling up some of the activations in a layer by a constant factor means you’re increasing the norm of the corresponding functions, changing the principal component basis of the layer’s function space. So it shouldn’t be surprising if subsequent layers get messed up by that.

There are many lenses that let us see how unsurprising this experiment was, and this is another one! We only use this experiment to show that it’s surprising when you view features as directions and don’t qualify that view by invoking a distribution of activation magnitude where semantics is still valid (called a ‘distribution of validity’ in this post).

• I’m a pretty bad chess player (~1500 ELO) and I can play bullet games while sleep deprived without much loss in skill. I think the system 1 pattern matching is relatively unimpaired while the system 2 calculating is very impaired. In bullet calculating doesn’t help much though.

A GM playing their instant move with no calculation can crush a master-level player, and I can crush a novice playing my instant move. It’s all about system 1 :)

• After this comment there was a long thread about AC efficiency.

Summarizing:

• I said: “In practice I think the actual efficiency loss relative to a 2-hose unit is more like 25-30%” (For cooling from 85 to 70.)

• John said that this was ridiculous.

• After the dust settled, our best estimate on paper is 40% rather than 25-30%.

The reasons for the adjustments were roughly as follows (combined arithmetic sketched after the list):

• [x2] I estimated exhaust temperature at 130 degrees, but it’s more like 100 degrees if the indoor air is 70.

• [x1/​2] I thought that all depressurization was compensated for by increased infiltration. But probably half of depressurization is offset by reduced exfiltration instead (see here)

• [x3/​2] I only considered sensible heat. But actually humidity is a huge deal, because the exhaust is heated but not humidified (see here)
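
Putting the three adjustment factors together with the original estimate (a rough sketch, using the midpoint of my 25-30% range):

```python
baseline = 0.27                      # midpoint of the original 25-30% estimate
adjustments = 2 * (1 / 2) * (3 / 2)  # exhaust temp x2, exfiltration x1/2, humidity x3/2
print(f"revised efficiency loss ≈ {baseline * adjustments:.1%}")  # roughly 40%
```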

John also attempted to measure the loss empirically, but I’d summarize as “too hard to measure”:

• With 1-hose the indoor temp was 68 vs 88 outside, while with 2-hose the indoor temp was 66 vs 88 outside (using the same amount of energy).

• We both agree that 10% is an underestimate for the efficiency loss (e.g. due to room insulation, other cooling in the building, and the improvised 2-hose setup).

• I don’t think we have a plausible way to extract a corrected estimate.

• My comment did not assume the physical. Entropy is not a physical concept but a statistical one. It can be used to measure the complexity of any system, physical or not.
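
Concretely, the statistical notion I mean is Shannon entropy, which applies to any distribution you can write down, physical or not; a minimal sketch:

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of any probability distribution, physical or not."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit: fair coin
print(shannon_entropy([0.9, 0.1]))  # ~0.47 bits: biased coin, less uncertain
```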

Any real system, whether physical, spiritual, or anything else, will necessarily be computable. Even the mind of God would have to be technically computable, although you could argue that it would require more computational power than could possibly fit in the physical universe.

Again, there is no magic. Nothing is intrinsically mysterious. Mystery is a symptom of a finite, ignorant mind. Whatever realms might exist beyond this physical universe are necessarily logically coherent and explainable in principle. And that applies to any subsystem of any reality, whether consciousness, life, or anything else.

• 27 Sep 2022 14:45 UTC
8 points
0 ∶ 0

One thing I’ve found useful is to make sure I identify to the supplier what specifically I need about the product I’m ordering—sometimes they have something similar in stock which meets my requirements.

• This is a bit like how Scientology has tried to spread, but the E-hance is much better than the E-meter.

• This is a great collection of tips! I think it’s also worth explicitly noting that most of these strategies involve slowing down other people’s orders, and many involve more inconvenience/​stress for the sellers, so it’s important to weigh this tradeoff.

• This year Petrov Day almost sneaked past me. This strikes me as weird on account of the biggest proxy war since the 80s being underway, putting us closer to the same stakes in realspace.

• This is great! I really like your “prediction orthogonality thesis”, which gets to the heart of why I think there’s more hope in aligning LLMs than many other models.

One point of confusion I had. You write:

Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do. This is because predictive accuracy applies optimization pressure deontologically: judging actions directly, rather than their consequences. Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences.[25] Constraining free variables by limiting episode length is the rationale of myopia; deontological incentives are ideally myopic. As demonstrated by GPT, which learns to predict goal-directed behavior, myopic incentives don’t mean the policy isn’t incentivized to account for the future, but that it should only do so in service of optimizing the present action (for predictive accuracy)[26].

I don’t think I agree with this conclusion (or maybe I don’t understand the claim). I agree that myopic incentives don’t mean myopic behavior, but they also don’t imply that actions are chosen myopically? For instance I think a language model could well end up sacrificing some loss on the current token if that made the following token easier to predict. I’m not aware of examples of this happening, but it seems consistent with the way these models are trained.

In the limit a model could sacrifice a lot of loss upfront if that allowed it to e.g. manipulate humans into giving it resources with which to better predict later tokens.

• Depends on what you mean by “sacrificing some loss on the current token if that made the following token easier to predict”.

The transformer architecture in particular is incentivized to do internal computations which help its future self predict future tokens when those activations are looked up by attention, as a joint objective to myopic next token prediction. This might entail sacrificing next token prediction accuracy as a consequence of not optimizing purely for that. (this is why I said in footnote 26 that transformers aren’t perfectly myopic in a sense)

But there aren’t training incentives for the model to prefer certain predictions because of the consequences if the sampled token were to be inserted into the stream of text, e.g. making subsequent text easier to predict if the rest of the text were to continue as expected given that token is in the sequence, because its predictions have no influence on the ground truth it has to predict during training. (For the same reason there’s no direct incentive for GPT to fix behaviors that chain into bad multi-step predictions when it generates text that’s fed back into itself, like looping.)
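
A minimal sketch of that training setup, with a linear layer standing in for a transformer (the point is only where the loss comes from, not the architecture):

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 100, 16
tokens = torch.randint(vocab, (1, seq_len))         # ground-truth training text
model = torch.nn.Linear(vocab, vocab)               # stand-in for a transformer

inputs = F.one_hot(tokens[:, :-1], vocab).float()   # model always sees the TRUE prefix
logits = model(inputs)                              # predictions for positions 1..T-1
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
# The loss decomposes per position against fixed ground truth; the model's own
# sampled tokens never enter it, so "steering" future text can't be rewarded.
```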

Training incentives are just training incentives though, not strict constraints on the model’s computation, and our current level of insight gives us no guarantee that models like GPT actually don’t/​won’t care about the causal impact of its decoded predictions to any end, including affecting easiness of future predictions. Maybe there are arguments why we should expect it to develop this kind of mesaobjective over another, but I’m not aware of any convincing ones.

• Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there’s no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that’s not specifically selected for by the training process.