# Cryonics

Last edited: 16 Nov 2022 6:55 UTC

Cryonics is the practice of preserving people in liquid nitrogen shortly after their hearts stop. The idea is that most of the brain’s information content is still intact immediately after “death”, i.e. medical or legal death. If humans later develop molecular nanotechnology or brain-emulation techniques, it may become possible to reconstruct the consciousness of cryopreserved patients.

Related: Life Extension, a more general tag about ways to avoid death.

## Cryonics-associated issues commonly raised on LessWrong

Pro-cryonics points

• Advanced reductionism/physicalism (because of the issues associated with identifying a person with the continuity of their brain’s information).

• Whether an extended healthy lifespan is worthwhile (relates to Fun Theory, religious rationalizations for 70-year lifespans, “sour grapes” rationalizations for why death is actually a good thing).

• The “shut up and multiply” aspect of spending $300/year (Eliezer Yudkowsky quotes his costs as Cryonics Institute membership ($125/year) plus term life insurance ($180/year)) for a probability (how large is widely disputed) of obtaining many more years of lifespan. For this reason, cryonics advocates regard widespread non-adoption as an extreme failure of rationality: a low-hanging fruit by which millions of deaths per year could be prevented at low cost.

Anti-cryonics points

• Cognitive biases contributing to emotional prejudice in favor of cryonics (optimistic bias, motivated cognition).

• The multiply chained nature of the probabilities involved in cryonics, and whether the final expected utility is worth the cost.

• Money spent on cryonics could, arguably, be better spent on efficient charity.

• S-risks/hyperexistential risks: the far future may turn out to be dystopian and have negative expected value.

## Notable Posts

• We Agree: Get Froze by Robin Hanson. “My co-blogger Eliezer and I may disagree on AI fooms, but we agree on something quite contrarian and, we think, huge: More likely than not, most folks who die today didn’t have to die! … It seems far more people read this blog daily than have ever signed up for cryonics. While it is hard to justify most medical procedures using standard health economics calculations, such calculations say that at today’s prices cryonics seems a good deal even if you think there’s only a 5% chance it’ll work.”

• You Only Live Twice by Eliezer Yudkowsky. “My co-blogger Robin and I may disagree on how fast an AI can improve itself, but we agree on an issue that seems much simpler to us than that: At the point where the current legal and medical system gives up on a patient, they aren’t really dead.”

• The Pascal’s Wager Fallacy Fallacy—the fallacy of Pascal’s Wager combines a high payoff with a privileged hypothesis, one with low prior probability and no particular reason to believe it. Perceptually seeing an instance of “Pascal’s Wager” just from the high payoff, even when the probability is not small, is the Pascal’s Wager Fallacy Fallacy.

• Normal Cryonics—on the shift of perspective that came from attending a gathering of normal-seeming young cryonicists.

• That Magical Click—what is the unexplained process whereby some people get cryonics, or other frequently-derailed chains of thought, in a very short time?

• Quantum Mechanics and Personal Identity by Eliezer Yudkowsky. A shortened index into the Quantum Physics Sequence, covering only the prerequisite knowledge for the statement that “science can rule out a notion of personal identity that depends on your being composed of the same atoms—because modern physics has taken the concept of ‘same atom’ and thrown it out the window. There are no little billiard balls with individual identities. It’s experimentally ruled out.” The key post in this sequence is Timeless Identity, in which “Having used physics to completely trash all naive theories of identity, we reassemble a conception of persons and experiences from what is left”, but this finale might make little sense without the prior discussion.

• Break Cryonics Down by Robin Hanson—tries to identify some of the chained probabilities involved in cryonics.

• Third Alternatives for Afterlife-ism by Eliezer Yudkowsky—explains why cryonics is a third option in the dilemma of whether we should tell noble lies about an afterlife to prevent people from being depressed by disbelief in one.

• A survey of anti-cryonics writing by ciphergoth—an attempt to find quality criticism of cryonics, with the surprising result that “there is not one person who has ever taken the time to read and understand cryonics claims in any detail, still considers it pseudoscience, and has written a paper, article or even a blog post to rebut anything that cryonics advocates actually say”.
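The “shut up and multiply” and “chained probabilities” points above amount to a simple expected-value calculation. A minimal sketch of that arithmetic follows; every number here is an illustrative placeholder (not anyone’s endorsed estimate), except the $300/year figure quoted in the text:

```python
# Back-of-envelope expected-value sketch for the cryonics decision.
# The annual cost comes from the text above; all probabilities and the
# lifespan payoff are made-up placeholders for illustration only.

annual_cost = 300.0   # $/year: CI membership ($125) + term life insurance ($180), rounded up
years_paying = 40     # hypothetical number of years paying dues

# "Multiply chained" probabilities: cryonics pays off only if EVERY link holds.
chain = {
    "preserved well enough at clinical death": 0.5,
    "organization survives until revival is possible": 0.4,
    "revival technology is ever developed": 0.3,
    "you are actually revived": 0.5,
}

p_success = 1.0
for p in chain.values():
    p_success *= p  # independence assumed for simplicity

extra_years_if_success = 1000   # placeholder for "many more years of lifespan"

total_cost = annual_cost * years_paying
expected_extra_years = p_success * extra_years_if_success

print(f"P(success) = {p_success:.3f}")
print(f"Total cost = ${total_cost:,.0f}")
print(f"Cost per expected extra life-year = ${total_cost / expected_extra_years:,.0f}")
```

Both sides of the debate can use the same skeleton: advocates argue the cost per expected life-year compares well with standard health-economics thresholds even at low success probabilities, while critics argue the chained probabilities multiply out to something far smaller than the placeholders above.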
## External links

## See also

# You Only Live Twice

12 Dec 2008 19:14 UTC · 175 points · 183 comments · 6 min read · LW link

# Normal Cryonics

19 Jan 2010 19:08 UTC · 89 points · 964 comments · 4 min read · LW link

# Appendices to cryonics signup sequence

22 Jan 2021 6:40 UTC · 29 points · 9 comments · 3 min read · LW link

# [Question] Has anyone had weird experiences with Alcor?

11 Jan 2022 21:47 UTC · 49 points · 16 comments · 2 min read · LW link

# Cryonics signup guide #1: Overview

6 Jan 2021 0:25 UTC · 141 points · 34 comments · 6 min read · LW link

# [Question] If you are signed up for cryonics with life insurance, how much life insurance did you get and over what term?

22 Jul 2020 8:13 UTC · 25 points · 5 comments · 1 min read · LW link

# Abnormal Cryonics

26 May 2010 7:43 UTC · 76 points · 420 comments · 5 min read · LW link

# [link] How many humans will have their brain preserved? Forecasts and trends

7 May 2020 6:11 UTC · 10 points · 0 comments · 1 min read · LW link

# Effective Altruism and Cryonics, Contest Results

19 Nov 2013 17:06 UTC · 22 points · 9 comments · 2 min read · LW link

# [Question] How much harder is it to revive a neuro-only cryonics patient?

12 Jan 2021 23:24 UTC · 19 points · 3 comments · 2 min read · LW link

# #3: Choosing a cryonics provider

20 Jan 2021 1:47 UTC · 93 points · 20 comments · 18 min read · LW link

# #4.1: Types of life insurance

26 Jan 2021 19:55 UTC · 38 points · 13 comments · 11 min read · LW link

# #4.3: Cryonics-friendly insurance agents

27 Jan 2021 21:13 UTC · 35 points · 4 comments · 6 min read · LW link

# #4.2: Cryonics-friendly life insurance carriers

27 Jan 2021 21:05 UTC · 32 points · 5 comments · 5 min read · LW link

# #5: Making your cryonics arrangements official

2 Feb 2021 23:54 UTC · 37 points · 0 comments · 3 min read · LW link

# #6: Optional additional steps

2 Feb 2021 23:57 UTC · 42 points · 2 comments · 9 min read · LW link

# Focusmate: Cryonics sign up

1 Mar 2021 15:41 UTC · 13 points · 4 comments · 1 min read · LW link

# [recurrent] Focusmate: cryonics and lifelogging

11 Mar 2021 16:28 UTC · 14 points · 1 comment · 1 min read · LW link

# [Question] What do you think would be the best investment policy for a cryonics trust?

4 Apr 2021 19:20 UTC · 14 points · 6 comments · 1 min read · LW link

# Great Cryonics Survey of 2022

3 Dec 2022 5:10 UTC · 15 points · 0 comments · 1 min read · LW link

# A proposal for a cryogenic grave for cryonics

6 Jul 2010 19:01 UTC · 26 points · 204 comments · 3 min read · LW link

# [link] Biostasis / Cryopreservation Survey 2020

16 May 2020 7:20 UTC · 5 points · 0 comments · 1 min read · LW link

# Considerations on Cryonics

3 Aug 2020 17:30 UTC · 45 points · 7 comments · 32 min read · LW link

# Buying micro-biostasis

7 Oct 2020 10:55 UTC · 13 points · 1 comment · 5 min read · LW link

# #4: Introduction to life insurance for cryonics

22 Jan 2021 20:58 UTC · 48 points · 9 comments · 6 min read · LW link

# On the unpopularity of cryonics: life sucks, but at least then you die

29 Jul 2011 21:06 UTC · 125 points · 472 comments · 7 min read · LW link

# The Brain Preservation Foundation’s Small Mammalian Brain Prize won

9 Feb 2016 21:02 UTC · 72 points · 42 comments · 4 min read · LW link

# Plastination is maturing and needs funding, says Hanson

20 Jun 2012 20:00 UTC · 105 points · 47 comments · 1 min read · LW link (www.overcomingbias.com)

# The Threat of Cryonics

3 Aug 2010 19:57 UTC · 43 points · 217 comments · 4 min read · LW link

# #2: Neurocryopreservation vs whole-body preservation

13 Jan 2021 1:18 UTC · 57 points · 24 comments · 12 min read · LW link

# #4.4: The insurance underwriting process

28 Jan 2021 2:33 UTC · 29 points · 1 comment · 3 min read · LW link

# Interview with Nectome CEO Robert McIntyre—Brain Preservation, Personal Identity, AGI, and More

23 Feb 2022 1:54 UTC · 14 points · 0 comments · 6 min read · LW link

# A survey of anti-cryonics writing

7 Feb 2010 23:26 UTC · 113 points · 326 comments · 7 min read · LW link

# Cryonics without freezers: resurrection possibilities in a Big World

4 Apr 2012 22:48 UTC · 72 points · 140 comments · 6 min read · LW link

# [Question] Cryonics?

4 Mar 2020 2:40 UTC · 9 points · 10 comments · 1 min read · LW link

# Rationalists Are Less Credulous But Better At Taking Ideas Seriously

21 Jan 2014 2:18 UTC · 100 points · 287 comments · 3 min read · LW link

# Timeless Identity

3 Jun 2008 8:16 UTC · 51 points · 247 comments · 14 min read · LW link

# Brain Preservation

28 Mar 2012 12:56 UTC · 33 points · 108 comments · 3 min read · LW link

# Cryonics As Untested Medical Procedure

17 Jan 2014 16:36 UTC · 20 points · 39 comments · 2 min read · LW link

# Cryonics on LessWrong vs at LessWrong meetups

2 Apr 2012 18:27 UTC · 10 points · 33 comments · 1 min read · LW link

# Cryonics as Charity

10 Nov 2012 14:21 UTC · 6 points · 40 comments · 1 min read · LW link

# Link: Cryonics and the Creation of a Durable Morality

12 Feb 2011 18:10 UTC · 16 points · 2 comments · 3 min read · LW link

# Book Review: Freezing People is (Not) Easy

30 Mar 2017 3:53 UTC · 6 points · 4 comments · 1 min read · LW link (ferocioustruth.com)

# Alcor vs. Cryonics Institute

9 Apr 2012 1:49 UTC · 50 points · 125 comments · 2 min read · LW link

# Running the numbers: Cryo vs Discount rate

4 Jun 2014 7:54 UTC · 9 points · 3 comments · 3 min read · LW link

# Looking for information on cryonics

2 Feb 2012 12:33 UTC · 23 points · 56 comments · 1 min read · LW link

# Cryonics: peace of mind vs. immortality

24 Jun 2015 7:10 UTC · 4 points · 32 comments · 1 min read · LW link

# Against Cryonics & For Cost-Effective Charity

10 Aug 2010 3:59 UTC · 9 points · 189 comments · 11 min read · LW link

# [Question] How is Cryo different from Pascal’s Mugging?

27 Jan 2021 13:51 UTC · 6 points · 10 comments · 1 min read · LW link

# Reflections on the cryonics sequence

3 Feb 2021 1:17 UTC · 98 points · 11 comments · 2 min read · LW link

# Current cryonics impressions

6 Feb 2021 10:00 UTC · 63 points · 3 comments · 1 min read · LW link (worldspiritsockpuppet.com)

# [April – weekly] Focusmate: biostasis, lifelogging, death-focus ethics

31 Mar 2021 16:49 UTC · 6 points · 0 comments · 1 min read · LW link

# 2 innovative life extension approaches using cryonics technology

2 Apr 2021 0:54 UTC · 41 points · 7 comments · 2 min read · LW link

# Longtermism vs short-termism for personal life extension

17 Jul 2021 3:52 UTC · 12 points · 2 comments · 2 min read · LW link

# Interest in Biostasis: Much More Than You Wanted to Know. (Results from the ACX 2021 Biostasis/Cryonics Survey)

11 Aug 2021 0:43 UTC · 30 points · 1 comment · 1 min read · LW link (synopticz.github.io)

# Cryonics prices were more complicated when I double-clicked

23 Aug 2021 1:26 UTC · 11 points · 2 comments · 2 min read · LW link

# Hope and False Hope

4 Sep 2021 9:46 UTC · 4 points · 60 comments · 6 min read · LW link

# How much should we value life?

6 Sep 2021 22:30 UTC · 28 points · 25 comments · 37 min read · LW link

# Conflict in Kriorus becomes hot today, updated, update 2

7 Sep 2021 21:40 UTC · 49 points · 17 comments · 1 min read · LW link

# Cryosleep

28 Sep 2021 22:33 UTC · 9 points · 3 comments · 5 min read · LW link (harsimony.wordpress.com)

# Brain preservation to prevent involuntary death: a possible cause area

22 Mar 2022 12:36 UTC · 39 points · 1 comment · 10 min read · LW link

# [Question] Cryonics-adjacent question

30 Jun 2022 23:03 UTC · 12 points · 3 comments · 1 min read · LW link

# [Question] Will chat logs and other records of our lives be maintained indefinitely by the advertising industry?

29 Nov 2022 0:30 UTC · 14 points · 6 comments · 1 min read · LW link

# The case for lifelogging as life extension

1 Feb 2020 21:56 UTC · 46 points · 17 comments · 3 min read · LW link · 1 review

# That Magical Click

20 Jan 2010 16:35 UTC · 84 points · 416 comments · 5 min read · LW link

# Even better cryonics – because who needs nanites anyway?

7 Apr 2015 20:10 UTC · 69 points · 41 comments · 6 min read · LW link

# $5m cryptocurrency donation to Alcor by Brad Armstrong in memory of LWer Hal Finney

17 May 2018 20:31 UTC
22 points
(www.alcor.org)

# Update on Kim Suozzi (cancer patient in want of cryonics)

22 Jan 2013 9:15 UTC
68 points

# Let’s create a market for cryonics

10 Apr 2012 6:36 UTC
61 points

# Cryonics costs: given estimates are low

14 Nov 2011 17:07 UTC
43 points

# Suspended Animation Inc. accused of incompetence

18 Nov 2010 0:20 UTC
47 points

# The Pascal’s Wager Fallacy Fallacy

18 Mar 2009 0:30 UTC
43 points

# Rationality, Cryonics and Pascal’s Wager

8 Apr 2009 20:28 UTC
18 points

# The cost of universal cryonics

26 May 2011 2:33 UTC
52 points

# Optimism versus cryonics

25 Oct 2010 2:13 UTC
48 points

# My mother is now in cryostasis

8 Jan 2011 20:27 UTC
46 points

26 Apr 2015 2:51 UTC
51 points

# Congratulations to Paris Hilton

19 Oct 2007 0:31 UTC
3 points

# [LINK] Father of Cryonics “Dies” at age 92

27 Jul 2011 18:19 UTC
−1 points

# Gunshot victims to be suspended between life and death [link]

27 Mar 2014 16:33 UTC
37 points

# [LINK] Open Source Software Developer with Terminal Illness Hopes to Opt Out of Death

13 Feb 2013 5:57 UTC
22 points

# Why I haven’t signed up for cryonics

12 Jan 2014 5:16 UTC
42 points

# How long will Alcor be around?

17 Apr 2014 15:28 UTC
45 points

# Robin Hanson’s Cryonics Hour

29 Mar 2013 17:20 UTC
47 points

# Cryonics Wants To Be Big

5 Jul 2010 7:50 UTC
46 points

# The Cryonics Strategy Space

24 Apr 2014 16:11 UTC
40 points

# But Butter Goes Rancid In The Freezer

9 May 2011 6:01 UTC
42 points

# Cryonics on Castle [Spoilers]

4 Oct 2011 9:46 UTC
35 points

# Update on the Brain Preservation Foundation Prize

26 May 2015 1:47 UTC
41 points

# Cryopreservation of Valia Zeldin

17 Mar 2019 19:15 UTC
19 points
(medium.com)

# [Question] What are the merits of signing up for cryonics with Alcor vs. with the Cryonics Institute?

11 Sep 2019 19:06 UTC
21 points

# A review of cryonics/brain preservation in 2016

31 Dec 2016 18:19 UTC
36 points

# PSA: Very important policy change at Cryonics Institute

3 Oct 2013 5:47 UTC
26 points

1 Oct 2016 22:27 UTC
8 points

# [LINK] Why Cryonics Makes Sense—Wait But Why

25 Mar 2016 11:41 UTC
22 points

# Dry Ice Cryonics - Preliminary Thoughts

28 Sep 2015 7:00 UTC
12 points

# Cryonics in Europe?

10 Oct 2014 14:58 UTC
23 points

# AndrewH’s observation and opportunity costs

23 Jul 2009 11:36 UTC
29 points

# Is cryonics evil because it’s cold?

31 Oct 2010 23:59 UTC
31 points

18 Aug 2012 17:57 UTC
29 points

# [Link] Persistence of Long-Term Memory in Vitrified and Revived C. elegans worms

24 May 2015 3:43 UTC
34 points

# An exercise in really going through with it

23 Apr 2012 23:17 UTC
35 points

# More Cryonics Probability Estimates

17 Dec 2012 20:59 UTC
30 points

# [SEQ RERUN] We Agree: Get Froze

31 Dec 2012 5:22 UTC
8 points

# Why the outside view suggests that longevity escape velocity is a long time away and cryonics is a much more feasible option for those alive today: signal-boosting a comment by Calm-Meet9916 on Reddit

8 Jan 2021 17:20 UTC
32 points

# Martinenaite and Tavenier on cryonics

4 Aug 2011 7:39 UTC
28 points

# Whole Brain Emulation: No Progress on C. elgans After 10 Years

1 Oct 2021 21:44 UTC
185 points

# How Likely Is Cryonics To Work?

25 Sep 2011 23:38 UTC
25 points

# Cryonics facility coming to Texas?

28 Jun 2011 14:51 UTC
22 points

# Two kinds of cryonics?

10 May 2012 2:43 UTC
22 points

# Years saved: Cryonics vs VillageReach

1 Aug 2011 21:04 UTC
27 points

# Kriorus update: full bodies patients were moved to the new location in Tver

26 Nov 2021 21:08 UTC
23 points

# [Question] Who is doing Cryonics-relevant research?

15 Mar 2022 10:26 UTC
32 points

# Notes on Robert McIntyre’s Brain Preservation Talk at the Long Now Foundation

28 Apr 2021 20:47 UTC
7 points
• Two new examples from discussion elsewhere:

“Oversharing”

“The notion of ‘opportunity cost’ subtly sets the zero point at optimal behavior, effectively painting all actually possible behaviors as in the red.”

• “Woke” is a pejorative neologism for “rights-and-equality-respecting” coined by the anti-equality/​human-rights/​anti-LGBT/​racist crowd. (Edit: Sorry, actually not coined by them.) What is called “woke” is actually normal, and what they’d call “normal” would have to be sanitized to avoid offending their sensibilities (white main characters, non-LGBT couples, etc.).

My guess as to why “woke” (actually normal) culture is marketable is that the anti-rights-crowd is both getting smaller and losing its marketing power.

(In the future, when not wanting to signal the allegiance to the Bad Guys crowd (to both them and normal people), avoid using the word “woke” and find some other way of expressing the same sentiment. Example: “I can’t understand why is there a gay couple in a new movie. Any idea why they put such a bizarre, not-related-to-reality and not-appealing-to-viewers thing there?”)

• God damn these designs are pretty!

• I hadn’t seen this post before.

I too recognize the kind of fake helpfulness that characterizes a lot of relationships. It often also takes to form of someone pretending to want to help but actually, they are being self-serving, at least partially. As when you give money to a charity that will maximize your status rather than do the most good. Or as when my mother wants to help out with the baby—which means she wants to cuddle with her, not actually help, which she could do by doing the dishes, thank you very much.

From a lot of conversations around my original post, I do get the sense that my environment is atypical. I live in a small-scale community in a part of the world (Scandinavia) known for its high levels of trust and social capital. On the other hand, the ideas that I was trying to work out in the essays did help me a lot when figuring out how to build relationships online. I think I would formulate the ideas slightly differently today, and perhaps more strongly emphasize the importance of filtering for skill.

• I found janus’s post Simulators to address this question very well. Much of AGI discussion revolves around agentic AIs (see the section Agentic GPT for discussion of this), but this does not model large language models very well. janus suggests that one should instead think of LLMs such as GPT-3 as “simulators”. Simulators are not very agentic themselves or well described as having a utility function, though they may create simulacra that are agentic (e.g. GPT-3 writes a story where the main character is agentic).

• for the record I think most people here already agree with you and are quite annoyed with open AI for diluting the name of alignment. I did not vote on your post, the downvotes so far seem like they have reached a reasonable score because of the tone.

• Here is a question closely related to the feasibility of finding discriminating-reasons (cross-posted from Facebook):

For some circuits C it’s meaningful to talk about “different mechanisms” by which C outputs 1.

A very simple example is C(x) := A(x) or B(x). This circuit can be 1 if either A(x) = 1 or B(x) = 1, and intuitively those are two totally different mechanisms.

A more interesting example is the primality test C(x, n) := (x^n = x (mod n)). This circuit is 1 whenever n is a prime, but it can also be 1 “by coincidence” e.g if n is a Carmichael number. (Such coincidences are rare and look nothing like n being close to prime.)

In every case I’m aware of where there are two clearly distinct mechanisms, there is also an efficient probabilistic algorithm for distinguishing those mechanisms (i.e. an algorithm for distinguishing cases where C(x) = 1 due to mechanism A from cases where C(x) = 1 due to mechanism B). I am extremely interested in counterexamples to that general principle.

For example, a priori it seems like it could have turned out that (x^n = x (mod n)) is a good probabilistic primality test, but there is no efficient probabilistic test for distinguishing primes from Carmichael numbers. That would have been a convincing counterexample. But it turns out that testing primality is easy, and in fact we can make a simple tweak to this very probabilistic primality test so that it doesn’t get fooled by Carmichael numbers. But is that an incidental fact about number theory, or once we found a probabilistic primality test was it inevitable that it could be strengthened in this way?

Here are some other illustrative cases:

• Suppose that C(x) uses x to initialize a 1000 x 1000 square in the middle of a 10,000 x 10,000 game of life grid. Then we simulate it for a million steps, and C(x) = 1 if any cell on the rightmost edge of the grid is ever alive. It’s very easy to look at the grid and distinguish the cases where C(x) = 1 because a glider is created that heads to the right side of the grid, from the much rarer cases where C(x) = 1 for any other reason (e.g. a medium weight spaceship).

• Suppose that X(x) is a pseudorandom sparse n x n matrix in some large finite field, and suppose that X is sparse enough that 1% of the time there are is no perfect matchings at all (i.e. there is no permutation sigma such that X[i, sigma(i)] != 0 for i=1,…,n). Define C(x) := (det(X(x)) = 0). We can distinguish the common case where det(X) = 0 because there are no perfect matchings in X from the extremely rare case where det(X) = 0 because there are multiple perfect matchings contributing to the determinant and they happen to all cancel out. These two cases are easy to distinguish by calculating det(X’) for another random matrix X’ with the same sparsity pattern as X. (Thanks to Dan Kane for calling my attention to this kind of example, and especially the harder version based on exact matchings.)

• Suppose that C_0(x) := A(x) or B(x) and C(x) is an obfuscated version of C_0. Then there is an efficient discriminator: de-obfuscate the circuit and check whether A or B is true. Finding that discriminator given C is hard, but that’s not a violation of our general principle. That said, I would also be interested in a slightly stronger conjecture: not only is there always a discriminator, but it can always specified using roughly the same number of bits required to specify the circuit C. That’s true in this case, because the circuit C needs to bake in the secret key for the obfuscation, and so requires more bits than the discriminator.

If there don’t exist any convincing counterexamples to this principle, then I’m also very interested in understanding why—right now I don’t have any formal way of talking about this situation or seeing why discrimination should be possible in general. One very informal way of phrasing the “positive” problem: suppose I have a heuristic argument that C(x) often outputs 1 for random inputs x, and suppose that my heuristic argument appears to consider two cases A and B separately. Is there a generic way to either (i) find an efficient algorithm for distinguishing cases A and B, or else (ii) find an improved heuristic argument that unifies cases A and B, showing that they weren’t fundamentally separate mechanisms?

• 9 Dec 2022 18:02 UTC
10 points
0 ∶ 0

It may be better to ask “Is a utility function a useful abstraction to describe how X makes decisions?” (Does it allow you to compress your description of X’s decisions?) Recall that utility functions are just a representation derived from preferences that are structured in a particular way. But not all ways of deciding on a preferred outcome are structured in that way[1], and not all decision algorithms work by preferring outcomes, so thinking in terms of utility functions is not always helpful.

1. ^

See for example:
Aumann, R. J. (1962). Utility theory without the completeness axiom. Econometrica: Journal of the Econometric Society, 445-462.

Bewley, T. F. (2002). Knightian decision theory. Part I. Decisions in economics and finance, 25(2), 79-110.

• 9 Dec 2022 17:35 UTC
LW: 0 AF: 1
0 ∶ 1
AF

Isn’t this the same as the “seamless transition for reward maximizers” technique described in section 5.1 of Stuart and Xavier’s 2017 paper on utility indifference methods? It is a good idea, of course, and if you independantly invented it, kudos, but it seems like something that already exists.

• 9 Dec 2022 17:48 UTC
LW: 3 AF: 2
0 ∶ 0
AFParent

I did explicitly disclaim against novelty, and I did invent this independently; the paper you linked is closely related, and I would like to upvote it as I think those results should also be better known, but I think the problem I solve in this post is different (and technically easier!) than the problems solved in that paper, including in section 5. The problem solved there asks for the optimal agent to act as if it’s an infinite-horizon optimal agent for (including whatever power-seeking would be instrumental for such an agent!) until the time bound causes it to switch into acting like the optimal agent for (and for all that to be reflectively stable). Here, I am not asking for the optimal agent to behave as if it has a longer time horizon than it really does.

• Interesting!

I guess this allows that they can still have very different goals, since they ought to be able to coordinate if they have identical utility functions, i.e. they rank outcomes and prospects identically (although I guess there’s still a question of differences in epistemic states causing failures to coordinate?). Something like maximize total hedonistic utility can be coordinated on if everyone adopted that. But that’s of course a much less general case than arbitrary and differing preferences.

Also, is the result closer to peference utilitarianism or contractualism than deontology? Couldn’t you treat others as mere means, as long as their interests are outweighed by others’ (whether or not you’re aggregating)? So, you would still get the consequentialist judgements in various thought experiments. Never treating others as mere means seems like it’s a rule that’s too risk-averse or ambiguity-averse about a very specific kind of risk that’s singled out (being treated as a mere means), at possibly significant average opportunity cost.

• I continue to believe that the Grabby Aliens model rests on an extremely sketchy foundation, namely the anthropic assumption “humanity is randomly-selected out of all intelligent civilizations in the past present and future”.

For one thing, given that the Grabby Aliens does not weight civilizations by their populations, it follows that, in order to believe the Grabby Aliens model, we need to strongly reject all the “popular” anthropic priors like SIA and SSA and UDASSA and so on.

For another thing, in order to believe the Grabby Aliens model, we need to make both of the following two claims:

• We SHOULD do the following: (1) observe that we (humanity) seem early with respect to all intelligent civilizations that will ever exist; (2) feel surprised at that observation; and then (3) update our credences-about-astrobiology-etc. accordingly;

• We SHOULD NOT do the following: (1) observe that we (humanity in 2022) seem early with respect to all humans that will ever exist; (2) feel surprised at that observation; and then (3) update our credences-about-astrobiology-etc. accordingly.

Seems self-contradictory to me, right? (The second one is the doomsday argument, which apparently Robin Hanson rejects, which makes me very confused.)

There was some discussion on this topic at my Question Post “Is Grabby Aliens built on good anthropic reasoning?”. My general takeaway from that discussion was that roughly nobody outside the study coauthor was really enthusiastic about the anthropic foundation of the Grabby Aliens model, and that at least one person who had thought the anthropic foundation was fine, turned out to have misunderstood it.

Anyway, this is a review for the LessWrong 2021 review, so I guess the question should be: is this post (and corresponding YouTube video) a “highlight of intellectual progress on this website” (or something along those lines)? Well, I thought it was a lovely and well-crafted YouTube video that faithfully explained the paper. But I also think that it was basically endorsing (or at least, failing to criticize) a paper that is deeply flawed. So I’m strong-voting against including this post in the review, but I’m also upvoting this post itself. :)

• 9 Dec 2022 17:02 UTC
LW: 9 AF: 6
0 ∶ 0
AF

I only skimmed the post, so apologies if you addressed this problem and I missed it.

Problem: even if the AI’s utility function is time-bounded, there may still be other agents in the environment whose utility functions are not time-bounded, and those agents will be willing to trade short-term resources/​assistance for long-term resources/​assistance. So, for instance, the 10-minute laundry-folding robot might still be incentivized to create a child AI which persists for a long time and seizes lots of resources, in order to trade those future resources to some other agent who can help fold the laundry in the next 10 minutes.

• 9 Dec 2022 17:56 UTC
LW: 9 AF: 6
0 ∶ 0
AFParent

That’s true! Thanks for pointing this out; I added a subsection about it to the post. There are probably also a bunch of other cases I haven’t thought of that provide stories for how the environment directly rewards actions that go against the spirit of the shutdown criterion (besides imitation and this one, which I might call “trade”). This construction does nothing to counteract such incentives. Rather, it just avoids the way that being an infinite-horizon RL agent systematically creates new ones.

• Wouldn’t it be hilarious if a variant of this was all it took to have exceptional AI safety

• Thought I’d share this. I broke it apart so “it” won’t see it. You can put it back together again.

https ://​ kantrowitz. medium. com/​ openais- chatgpt- bot- imagines- its- worst- possible- self- bf057b697bbb

• 9 Dec 2022 16:55 UTC
1 point
0 ∶ 0

If you think of “wokeism” as a luxury belief—something that many people like to use to show themselves as virtuous, but don’t really do a cost/​benefit of any component of behavior or signalling, this makes more sense. Also, don’t confuse yourself into thinking systems or corporations have beliefs or intents. They are merely aggregates of diverse actors who happen to be near each other and have intertwined behaviors.

Signaling of wokeism is pretty rampant in today’s youth, who are the biggest customers and large part of the workforce for the things you mention. It’s probably not ideologically attractive to the elites or leaders, but it’s not obviously harmful, so they’re better off supporting (or at least accepting) it than dealing with massive conflict within their orgs and among their customers.

• Shouldn’t you have waited for April’s Fool for this?

• FWIW this has also been discussed here

• 9 Dec 2022 16:29 UTC
4 points
1 ∶ 0

I am wondering about the conditions where the zero would come from geometric rationality of some way to cognise the field.

I have approached similar things by explaining to myself that the zero relates to how unstated or new entrants to the system refer to the explicit content.

That is if zero is high then third options are aversed away from.

If the zero is super low then third options are strongly attracted to.

One of the options being at zero means that there are loads and loads of equivalent replacements for it or that we would be ambivalent to changing it to a unknown third option.

If you live on a street where there is a crash once a day, then hearing the first crash of the day is not really significant, but a kind of “it is Tuesday” acknowledgement. If you do not hear a crash that day, it is actually a good day. If you have a car crash once a year, then having a crash is a bad day and not having one is a neutral day.

So status quo reference class tennis could largely end up being the same thing. One tool to understand different zero points would be to imagine what they claim about what “expected” looks like, which might be easier than applying them to the specific choice at hand.
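The car-crash example amounts to scoring a day against a baseline of what “expected” looks like. A toy sketch of that reading (the function and all numbers are my own hypothetical illustration, not anything from the original discussion):

```python
# Toy illustration: the same event flips sign depending on where the
# zero point (the "expected" baseline) sits. Numbers are hypothetical.

def day_value(crashes_today, expected_crashes_per_day):
    # Score the day relative to the baseline rather than absolutely.
    return expected_crashes_per_day - crashes_today

# Street with a crash once a day: one crash is just "it is Tuesday".
print(day_value(1, 1.0))      # 0.0 (neutral)
print(day_value(0, 1.0))      # 1.0 (a good day)

# A crash once a year: one crash is a bad day, none is near-neutral.
print(day_value(1, 1 / 365))  # about -0.997 (bad)
print(day_value(0, 1 / 365))  # about 0.003 (roughly neutral)
```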

• 9 Dec 2022 16:28 UTC
LW: 3 AF: 2
0 ∶ 0
AF

(Bold direct claims, not super confident—criticism welcome.)

The approach to ELK in this post is unfalsifiable.

A counterexample to the approach would need to be a test-time situation in which:

1. The predictor correctly predicts a safe-looking diamond.

2. The predictor “knows” that the diamond is unsafe.

3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.

Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.

So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.

• This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heuristic arguments, which could fail even more easily.

This approach also depends on a fuzzier philosophical claim, which is more like: “if any small heuristic argument that explains the model behavior on the training set also applies to the current input, then the model doesn’t know that something weird is happening on this input.” It seems like your objection is that this is an unfalsifiable definitional move, but I disagree:

• We can search for cases where we intuitively judge that the model “knows” about a distinction between two mechanisms and yet there is no heuristic argument that distinguishes those mechanisms (even though “know” is pre-formal).

• Moreover, we can search more directly for any plausible case in which SGD produces a model that pursues a coherent and complex plan to tamper with the sensors without there being any heuristic argument that distinguishes it from the normal reason—that’s what we ultimately care about and “know” is just an intuitive waypoint that we can skip if it introduces problematic ambiguity.

• If we actually solve all the concrete problems (like formalizing and finding heuristic arguments) then we can just look at empirical cases of backdoors, sensor tampering, or natural mechanism distinctions and empirically evaluate whether in fact those distinctions are detected by our method. That won’t imply that our method can distinguish real-world cases of sensor tampering, but it will provide much stronger empirical evidence than is available for most alignment approaches (because there is no reason for the methods to break down around human level in particular).

All of those things are challenging without a clear formalization of “heuristic argument,” but I still feel we can do some productive thinking about them. Moreover, this objection is more like “We’re looking at a 3-step plan where it’s hard to evaluate step 3 without knowing details about how step 1 went” rather than “This plan is unfalsifiable.”
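For what it’s worth, the anomaly-detection loop under discussion can be caricatured in a few lines. Everything here is a hypothetical stand-in: the real proposal’s “explanations” would be probabilistic heuristic arguments, not the lookup-table toy used below.

```python
# Caricature of mechanistic anomaly detection: learn an "explanation"
# of the model's behavior on training data, then flag test inputs the
# explanation fails to cover. The lookup-table explanation is a toy
# stand-in for a real (unformalized) heuristic argument.

def find_explanation(model, train_inputs):
    # Toy "explanation": which coarse input feature goes with which output.
    return {(x[0] > 0, model(x)) for x in train_inputs}

def is_anomalous(explanation, model, x):
    # Anomalous = the usual training-set explanation doesn't apply here.
    return (x[0] > 0, model(x)) not in explanation

model = lambda x: x[0] + x[1] > 0          # stand-in predictor
train = [(1, 0), (2, 1), (-1, 0)]
expl = find_explanation(model, train)

print(is_anomalous(expl, model, (3, 0)))   # False: covered by training pattern
print(is_anomalous(expl, model, (-1, 5)))  # True: same output, unusual mechanism
```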

• FWIW thank you for posting this! It’s good to see where different people are coming from on this, and I like several of your other writings

• [ ]
[deleted]
• 9 Dec 2022 16:10 UTC
4 points
0 ∶ 0

I naturally rarely think in words unless I’m constructing a verbal artifact (speaking, writing, planning what to speak/​write, daydreaming conversations, etc).

But I’ve recently begun to occasionally deliberately think in words, after writing made me appreciate that verbal algorithms can be helpful for focusing attention and enforcing rigor. However, verbal thinking feels inefficient in other ways (less tolerant of ambiguity, trapped in the ontology of language, single-threaded, etc), and it would be extremely annoying if I had to have an inner monologue all the time, even if I could think nonverbally “around” it.

I have also been fascinated with this question for a long time and have been polling people since middle school. Consistent with the results of this Twitter poll I recently ran, more than half the people I’ve personally asked report that they think primarily in words. (Several people have weakened the claim when I asked them more probing questions, like whether they think in words while doing math. Some people who normally “think in words” don’t think verbally while doing math, but others have to.)

Very interestingly, many people disbelieve that I don’t think in words, as they are unable to imagine how it would be possible to think at all without words.

• Zack’s series of posts in late 2020/​early 2021 were really important to me. They were a sort of return to form for LessWrong, focusing on the valuable parts.

What are the parts of The Sequences which are still valuable? Mainly, the parts that build on top of Korzybski’s General Semantics and focus hard core on map-territory distinctions. This part is timeless and a large part of the value that you could get by (re)reading The Sequences today. Yudkowsky’s credulity about results from the social sciences and his mind projection fallacying his own mental quirks certainly hurt the work as a whole though, which is why I don’t recommend people read the majority of it.

The post is long, but it kind of has to be. For reasons not directly related to the literal content of this essay, people seem to have collectively rejected the sort of map-territory thinking that we should bring from The Sequences into our own lives. This post has to be thorough because there are a number of common rejoinders that have to be addressed. This is why I think this post is better for inclusion than something like Communication Requires Common Interests or Differential Signal Costs, which is much shorter, but only addresses a subset of the problem.

Since the review instructions ask how this affected my thinking, well...

Zack writes generally, but he writes because he believes people are not reasoning correctly about a currently politically contentious topic. But that topic is sort of irrelevant: the value comes in pointing out that high status members of the rationalist community are completely flubbing lawful thinking. That made it thinkable that actually, they might be failing in other contexts.

Would I have been receptive to Christiano’s point that MIRI doesn’t actually have a good prediction track record had Zack not written his sequence on this? That’s a hard counterfactual, especially since I had already lost a ton of respect for Yudkowsky by this point, in part because of the quality of thought in his other social media posting. But I think it’s probable enough and these series of posts certainly made the thought more available.

• 9 Dec 2022 15:40 UTC
1 point
0 ∶ 0

Asking a separate session to review the answer seems to work nicely, at least in some cases, but not in all. [screenshot examples omitted]

• Yes I have an internal monologue, but it’s mostly text+emotion based, with occasional images. Like a telepathic chatroom rather than a voice chat.

Can you “turn off” your verbal thought and does it cause any discomfort?

I probably could if I tried hard enough but I think it would probably be destabilizing to my mental state.

Is there a difference between thinking conceptually and visually?

Yeah I think so? Sometimes a message to myself will have an image and that’s a very different experience than a pure conceptual message, which is again different from my normal mash of text+concept.

What sort of things do you comment on to yourself? ~everything

My thoughts about what’s happening, other perspectives on what’s happening, meta commentary on the thoughts, warnings, arguments with simulated people, arguments with myself, meta stuff about that. I wouldn’t be surprised if a lot of it is repetitive, but a lot is also unique.

If I’m in flow, there’s either nothing at all or I’m not aware of it.

• If we were to become immortal, assuming people wanted to, everyone could eventually become a master in every subject, so would it not be safe to assume we would find ways to change our genomes in response to changes?

• A movie or two would be fine, and might do some good if well-done. But in general—be careful what you wish for.

• Fearmongering may backfire, leading to research restrictions that push the work underground, where it proceeds with less care, less caution, and less public scrutiny.

Too much fear could doom us as easily as too little. With the money and potential strategic advantage at stake, AI could develop underground with insufficient caution and no public scrutiny. We wouldn’t know we’re dead until the AI breaks out and already is in full control.

All things considered, I’d rather the work proceeds in the relatively open way it’s going now.

• I agree that fearmongering is thin ice and can easily backfire, and it must be done carefully and ethically, but is it worse than the alternative, in which people are unaware of AGI-related risks? I don’t think that anybody can say with certainty.

• RE moderation guidelines:

Apologies, I didn’t realize those even existed for questions, and apparently my profile default is set to “Reign of Terror”. That wasn’t intentional. I’ve fixed that now.

• 9 Dec 2022 14:45 UTC
7 points
3 ∶ 0

Related: Setting the Default

Thanks for all the examples!

• Depends on what I’m doing. My baseline is verbal/​auditory, and that is the mode my short-term memory loop utilizes most effectively. Reading printed text is primarily an auditory experience for me.

I don’t seem to have an autobiographical narrator as such, but I do a good deal of processing in the verbal mode, increasingly when I am less familiar with a task or process. If I am trying to learn a new task or process, that processing often escapes as a literal verbal output that sometimes makes my kid ask if I’m “talking to YouTube”. I guess this is a stronger version of an internal verbal/​auditory processing loop.

When I’m very focused on a mechanical task like exercise or chopping vegetables or typing[1], I often switch to a more spatial mode; there is a visual component, but it would be more revealing to think of it as proprioceptive.

In meditation I often have access to a more sensory-first mode where I seem to experience mind-body inputs in what feels like a less processed way. Here, autobiographical thoughts “look” surprisingly similar to other sense inputs bubbling up from a pool of possibilities and either serially spooling out, usually as text (audio mode), or just settling back into the whole general mishmash.

When I’m cooking, I tend to think in smells and… processes I suppose? It’s like I know what smell I want and how to get there, but there’s not much visualization and very little verbalization unless I need to do math.

[1] Refinement: I learned to touch-type back in the 90s, so this refers to the active translation of mental symbols to digital text. There is sometimes an audio stream happening of the names of the keys I press an instant after the fact, which I take to be an error-checking process. The actual mental objects involved in eventually outputting gestures have a very tactile flavor.

• 9 Dec 2022 14:29 UTC
LW: 2 AF: 1
0 ∶ 1
AF

Problem: suppose the agent foresees that it won’t be completely sure that a day has passed, or that it has actually shut down. Then agent A has a strong incentive to maintain control over the world past its shutdown, so that something can swoop in and really shut A down if A might not have actually shut down and there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence even while A is optimizing to find a way to have such influence.
(The correctness might be assumed by the shutdown problem, IDK, but it’s still an overall issue.)

Another comment: this doesn’t seem to say much about corrigibility, in the sense that it’s not like the AI is now accepting correction from an external operator (the AI would prevent being shut down during its day of operation). There’s no dependence on an external operator’s choices (except that once the AI is shut down the operator can pick back up doing whatever, if they’re still around). It seems more like a bounded optimization thing, like specifying how the AI can be made to not keep optimizing forever.

• 9 Dec 2022 18:28 UTC
LW: 1 AF: 1
0 ∶ 0
AF

To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down. I think this is not a terribly inaccurate assumption; of course, in reality, there are cosmic rays and a properly embedded and self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic probability of shutdown failing, if we don’t try to juice so high that it exceeds , my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from to .
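The closing incentive claim can be made concrete with a toy expected-utility comparison (all quantities are hypothetical; this is just the shape of the argument, not numbers from the post):

```python
# Toy sketch: the expected-utility gain from building a successor agent
# solely to raise shutdown reliability. All numbers are hypothetical.

def successor_gain(p_shutdown, p_with_successor, shutdown_reward, build_cost):
    # Gain from the extra shutdown probability, minus the cost of building.
    return (p_with_successor - p_shutdown) * shutdown_reward - build_cost

# If shutdown is already near-certain, closing the last sliver of
# probability can't justify a nontrivial construction cost...
print(successor_gain(1 - 1e-9, 1.0, 10.0, 0.5))  # negative

# ...whereas with an unreliable shutdown action the incentive reappears.
print(successor_gain(0.5, 1.0, 10.0, 0.5))       # positive
```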

• Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.

Oops, I missed that assumption. Yeah, if there’s such a policy, and it doesn’t trade off against fetching the coffee, then it seems like we’re good. See though here, arguing briefly that by Cromwell’s rule, this policy doesn’t exist: https://arbital.com/p/task_goal/

Even with a realistic probability of shutdown failing, if we don’t try to juice so high that it exceeds , my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from to .

Hm. So this seems like you’re making an additional, very non-trivial assumption, which is that the AI is constrained by costs comparable to /​ bigger than the costs to create a successor. If its task has already been very confidently achieved, and it has half a day left, it’s not going to get senioritis, it’s going to pick up whatever scraps of expected utility might be left.

I wonder though if there’s synergy between your proposal and the idea of expected utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying to do unbounded optimization, because unbounded optimization is harder to reliably shut down? IDK.

• To the second point, yes, I edited the conclusion to reflect this.

• 9 Dec 2022 13:49 UTC
1 point
0 ∶ 0

Anecdotally, higher doses result in less pain for me, too. But I’d caution anyone who has chronic migraines, or any sort of chronic pain, to limit themselves to a few standard doses a week. My doctors kept telling me that I likely had medication overuse headaches, and I kept going “no way, I don’t take that much, and I’ve had regular headaches before taking my current regimen.” Finally, after years of prompting, I toughed out three weeks with no medication and my severe migraines were reduced in duration afterwards. Please be very careful of this slippery slope.

• Over at Marginal Revolution they’re having a discussion about ChatGPT and one person replied: “One of my kids just said to ask ChatGPT about what Bill Clinton has to do with Elden Ring. Huh?” So...

Do you have any idea whether your kid was merely curious or had a more specific motivation, like they’d heard that the question provides an answer that is interesting in some way? In any event, as I’m curious about these things, I investigated. If you haven’t used it, you should know that when it has responded to the prompt, there will be a button labeled “Try again”. If you press it, it will erase the first response and provide another. I’ve enclosed some remarks in brackets. With that in mind:

What does Bill Clinton have to do with Elden Ring?
[It’s pausing for an unusually long time before responding]

I’m sorry, but I’m not able to browse the internet or access current events, so I don’t have any information on what Bill Clinton might have to do with Elden Ring. I’m a large language model trained by OpenAI, so my knowledge is limited to the text that I was trained on, which has a cutoff date of 2021. I can provide general information and answer questions within the scope of my training, but I’m not able to provide information on current events or specific individuals.

[2nd try, causing a noticeable longer pause]

Bill Clinton has no connection to the video game Elden Ring.

[the 3rd try generated a pause so long that I reset the session, one of the options you have at any time]

New Session:

What does Bill Clinton have to do with Elden Ring?

[using my smart phone stop watch: 19 seconds]

I’m sorry, but I’m not familiar with any connection between Bill Clinton and Elden Ring. Elden Ring is a video game developed by FromSoftware and published by Bandai Namco Entertainment. It was released in January 2022, and it is set in a fantasy world created by Hidetaka Miyazaki, the creator of the popular video game series Dark Souls. I’m not aware of any connection between Bill Clinton and this game, as he is a former President of the United States and has no apparent connection to the video game industry.

[2nd try, 10 sec pause]

Bill Clinton has nothing to do with Elden Ring.

[3rd try: 32.35 sec pause]

There is no connection between Bill Clinton and Elden Ring. Elden Ring is a video game developed by Hidetaka Miyazaki and published by Bandai Namco, while Bill Clinton is a former President of the United States.

At that point I concluded the experiment. I note that in times of heavy load, there may be pauses before a response, and even an error message. FWIW, I did this at 8:36 AM Eastern Time on 12.9.2022. But I don’t think we’re dealing with an overload situation. It’s the question itself. It seems rather obscure. Hence the varying response times and the fact that, in each session, response time increased on successive tries.

• 9 Dec 2022 13:44 UTC
3 points
1 ∶ 0

I have been using GitHub Copilot as a part of my daily job for over a year. TL;DR: in three years I expect some improvement, but not beyond simple functions.

Right now, Copilot is most useful for converting data between formats and writing out boilerplate. It is surprising how often in software development (especially server-side) you need to change the shape of data. Essentially, as long as there are established patterns, it is helpful; however, do not expect it to write software for you any time soon.

So far these systems are still fairly narrowly scoped. It can write a simple function, but I haven’t seen it be able to create abstractions. It really doesn’t have an understanding of the code, and even now it isn’t very good at matching parentheses or brackets.

Now I don’t expect copilot of three years from now to put me out of a job, but I do expect that it will do more of the typing for me. I think that I’m still going to have to convert business decisions into the right abstractions, but I hope that I’ll be writing fewer tests by hand.

Until then, it’ll continue writing plausible nonsense, which sometimes happens to be useful.
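As a concrete illustration of the “changing the shape of data” point above, this is the sort of established-pattern boilerplate that pattern-completion handles well (the record fields are invented for the example):

```python
# The kind of format-shuffling boilerplate described above: re-shaping
# a list of records into a keyed lookup. Field names are invented.

def records_to_lookup(records):
    """Convert [{'id': ..., 'name': ...}, ...] into {id: name}."""
    return {r["id"]: r["name"] for r in records}

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
print(records_to_lookup(rows))  # {1: 'alice', 2: 'bob'}
```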

• I’m usually not the type of guy to dunk on a journal for having low impact factor but uh...

Impact-factor 0 journals are a really really bad sign. An extremely bad sign. I wouldn’t recommend taking it seriously at all. It’s like a limbo for damned papers that were rejected from every other publication. You see things in there. Things you can’t forget. Entire plagiarized papers that were Google translated to Chinese and then back to English. That internet meme where some guy literally put “T” on top of his bar plot instead of real error bars. Forgetting to correct for multiple hypotheses. Unforgivable sins.

My best recommendation is to look for a higher-quality source.

• Thank you for your feedback! This is a mistake on my part. I will take the article down until I’ve looked into this and have updated my sources.

• [ ]
[deleted]
• Also, “bloody” stupid could refer to unicorn blood, i.e. the Forbidden Forest.

• 1 000 mg is the standard dose in France, with 500mg being used almost only for children.

• As far as running a media company goes, fandom is extremely profitable, increasingly so in an age where enormous sci-fi/​fantasy franchises drive everything. And there’s been huge overlap between fandom communities and social justice politics for a long time.

It’s definitely in Disney’s interest to appeal to Marvel superfans who write fanfiction and cosplay and buy tons of merchandise, and those people tend to also be supporters of social justice politics.

Like, nothing is being forced on this audience—there are large numbers of people who get sincerely excited when a new character is introduced that gives representation for the first time to a new minority group, or something like that.

As with so many businesses, the superfans are worth quite a few normies who might be put off by this. I think this is the main explanation.

• …and those people tend to also be supporters of social justice politics.

I guess this is the part that’s not so clear to me. I see lots of people like this. I also see lots of people who are groaning about being repeatedly lectured and about their characters and franchises getting deconstructed. It’s hard for me to find a vantage point that doesn’t bubble me in one sphere or the other in a way that makes one side look overwhelmingly larger than the other. So I just can’t tell what the actual demographics are here. But the revealed behavior of these companies gives me the impression that they do find it crystal clear. That’s what I find a bit bewildering.

• The reason these events were scary, and subsequent fiction was able to capitalise on that, was that they were near misses. Very near misses. There is already a lot of fiction about various misaligned AIs, but that doesn’t affect people much. So what you seem to be advocating is generating some near misses in order to wake everybody up.

Fear is useful. It would be good to have more of it. The question is how plausible it is to generate situations that are scary enough to be useful, but under enough control to be safe.

• The reactor meltdown on a Soviet submarine was not posing an existential threat. In the worst case, it would be a little version of Chernobyl. We might compare it to an AI which causes some serious problems, like a stock market crash, but not existential ones. And the movie is not a threat at all.

”The question is how plausible it is to generate situations that are scary enough to be useful, but under enough control to be safe.”
That is a great summary of what I wanted to say!

• I have the capacity to monologue internally, and use it moderately often, but not constantly. When I’m not monologuing I guess there’s just a direct link from thought/​input to action without an intermediary vocalising about it.

When reading my default is to read “in my head” as if reading aloud, but with a little effort I can suppress that and just scan the page while understanding the words. With the result that reading is a little faster if I don’t vocalise it, but also less pleasurable if the rhythm of the prose would be part of the experience. Not sure how retention of what I’ve read compares—I suspect it might be reduced if I’m scan-reading (it lends itself to skimming).

I can generate internal imagery and sounds more generally, but not to the extent of full-blown voluntary hallucination. Mental images tend to feel like they’re in a separate space from my main visual field (somehow above or inside my head, if I had to give it a location) and they aren’t perfectly vivid; maybe only partly in full focus/​detail at a time.

• On many useful cognitive tasks(chess, theoretical research, invention, mathematics, etc.), beginner/​dumb/​unskilled humans are closer to a chimpanzee/​rock than peak humans

All of these tasks require some amount of learning. AIXI can’t play chess if it has never been told the rules or seen any other info about chess ever.

So a more reasonable comparison would probably involve comparing people of different IQs who have made comparable efforts to learn a topic.

Intelligence often doesn’t look like solving the same problems better, but solving new problems. In many cases, problems are almost boolean, either you can solve them or you can’t. The problems you mentioned are all within the range of human variation. Not so trivial any human can do them, nor so advanced no human can do them.

Among humans, +6 SD g factor humans do not in general seem as much more capable relative to +3 SD g factor humans as +3 SD g factor humans are relative to median humans.

This is a highly subjective judgement. But there is no particularly strong reason to think that human intelligence has a Gaussian distribution. The more you select for humans with extremely high g factors, the more you Goodhart to the specifics of the g factor tests. This Goodharting is relatively limited, but still there at +6 SD.

3.0. I believe that for similar levels of cognitive investment narrow optimisers outperform general optimisers on narrow domains.

I think this is both trivially true and pragmatically false. Suppose some self-modifying superintelligence needs to play chess. It will probably largely just write a chess algorithm and put most of its compute into that. This will be near equal to the same algorithm without the general AI attached. (Probably slightly worse at chess: the superintelligence is keeping an eye out just in case something else happens. A pure chess algorithm can’t notice a riot in the spectator stands; a superintelligence probably would devote a little compute to checking for such possibilities.)

However, this is an algorithm written by a superintelligence, and it is likely to beat the pants off any human written algorithm.

4.1. I expect it to be much more difficult for any single agent to attain decisive cognitive superiority to civilisation, or to a relevant subset of civilisation.

Being smarter than civilization is not a high bar at all. The government often makes utterly dumb decisions. The average person often believes a load of nonsense. Some processes in civilization seem to run on the soft minimum of the intelligences of the individuals contributing to them. Others run on the mean. Some processes, like the stock market, are hard for most humans to beat, but still beaten a little by the experts.

My intuition is that the level of cognitive power required to achieve absolute strategic dominance is crazily high.

My intuition is that the comparison to a +12 SD human is about as useful as comparing heavy construction equipment to top athletes. Machines usually operate on a different scale to humans. The +12 SD runner isn’t that much faster than the +6 SD runner, especially because, as you reach into the peaks of athletic performance, the humans are running close to biological limits and the gap between top competitors narrows.

• [ ]
[deleted]
• It is possible that “This box contains the key” was a true statement at the time it was written, and then the contents were changed. The king’s explanation does not specify an ordering of events.

• This is a great thread for explaining how to spot the frame.

I have a lot to say on frames, but a very foundational lesson also worth mentioning is how the spell casting takes place, and how to counterspell.

It happens in 5 steps

1. Someone sets a frame

2. Significance control: thread-expand if you agree, vs. thread-minimize if you decide to ignore it and move on

3. Frame negotiation: agree, reframe, or set your own (opposing) frame

4. Agreement

5. Cementing

If you set the frame, you can control the frame from beginning to end. However, if someone else sets the frame, then you first want to decide whether to expand on that frame, or to minimize it.

Significance Control

The more significant a frame is, the more it impacts the conversation, so whether you want to minimize or expand is an important decision.

If you decide to challenge a frame, you also expand on it. So if you lose that negotiation, then you face much bigger consequences because you first expanded it, and then lost it. Indeed the opposite of minimizing is not to say it doesn’t matter but, often, is to simply ignore it.

If a frame is agreeable to you, you want to expand on it. There are many ways of thread-expanding, including:

• Asking questions, such as “why is that” or “why do you think so”
• Asking leading questions: e.g., “oh wow, do you really think so”
• Strategic disagreement: e.g., “you think so? But this other person said the opposite.” Now they’re forced to defend and talk more, which expands the initial frame
• Laughing: a form of “covert expanding” anyone with a Facebook account is familiar with. This is what lawyers sometimes do to highlight the opposing lawyers’ mistakes (you could see plenty of that during the Depp vs. Heard defamation case: most people never realize that most of the snickering was done on purpose to sway public and jurors’ opinions)
• Agreeing and expanding: you agree, and explain why you agree
• Agreeing and sharing: you agree, and share a story that supports the frame or belief
• Agreeing and rewarding: you agree, and you tell them why you appreciate them for saying or doing what they did

(Side note: Most techniques of frame negotiation also expand on a frame. So you want to be careful not expanding disagreement or irreconcilable differences when you need rapport. And this is why, generally speaking, “agreeing and redirecting” is a fantastic form of frame control: it’s because it sets your own frame while minimizing the disagreement and leveraging the commonalities)

Whenever a frame is disagreeable to you, you can either challenge it, or minimize it

If you have the power to challenge it and change people’s opinions, or at least if you want your disagreeing voice to be heard, then you can speak up.

Many other times, it’s best instead to minimize a frame, and move on. Minimizing a frame includes:

• Ignoring it
• “Yeah yeah-ing” it: agreeing, but with little to no conviction, and then moving on
• Thread-cutting (i.e., changing topic): a common and effective technique (if well executed)
• Offering small, partial third-party agreement: e.g., “yeah, some people feel that way”, and then moving on

Cementing

Now for the most important step.

Imagine you’ve agreed on a frame that’s good for you. What do you do now?

You want to expand on that frame to increase the (perceived) benefits and the follow-through.

This phase is called “frame cementing”, and it’s an incredibly useful technique.

Frame cementing means expanding and solidifying the thread of the agreement reached, in order to entrench the new frame and increase its effectiveness. Frame cementing increases the likelihood that the other party will stick to the newly negotiated frame, and/​or that the persuasion will be internalized and accepted as the new reality (vs. just agreeing with the frame as a form of short-term capitulation).

This final step… actually has additional substeps (Human psychology is hard, okay?!!!)

1. You reach a point where a frame is agreeable to you

2. Cement it by asking for confirmation

A frame that the other party explicitly agrees to immediately increases its power tenfold. It makes people feel part of the decision, which increases adoption and follow-through, as well as intrinsic motivation.

Some ways of doing it:

• “What do you think”: an agreement with less nudging gets more buy-in and is even more powerful
• “Do you agree”
• “It makes sense, doesn’t it”

Note: silence often (though not always!) means the other person is in the process of accepting it, but might feel disempowered to admit it. Generally speaking, the frame agreed upon should feel good.

3. Cement it by providing your own confirmation

For example:

• “I’m glad we agree”
• “I’m happy we see things the same way”

4. End with a collaborative frame and/​or reward

For example:

• “Yeah, it makes sense, right? You get it because you’re also a smart guy/​gal”
• “I’m glad we’re going to do this. And I’m glad it’s going to help (because I care about you)”: show that you are glad about the new frame/​agreement because it will benefit them, and because you care about them. Super powerful. But be honest about it, please (or don’t say it!)
• Silence and a smile: confirms the good vibe nonverbally

5. Next steps and taking action

If it was a frame that requires taking action, move on to the next steps.

(Side note: The more you had to persuade, the more you want to show that you are also tasking yourself with some steps. E.g., “Great, so you can take care of X, I’ll do Y and Z, and we’ll meet at 4pm”.)

Frame cementing is super powerful, BUT you’d better be genuine when using it, and you’d better use it with real win-win frames or with the best intentions for the people you’re persuading.

When you use it for win-lose, that’s the stuff of manipulators. And although it can work in the short term, over the long term many people will catch on. As a matter of fact, the higher the quality of the people you deal with, the more likely it is they will catch on.

Even when you use it for win-win, you must be careful. You can still come across as a bit too slick, which raises red flags.

Give people space to agree by themselves. Ask questions more than making statements. And when you must intervene, live by the motto “nudge, don’t push”.

Also make sure you stress the win-win nature of the agreement, together with how glad you are because you care about them.

One Final Warning: Unchallenged Frames Self-Cement Over Time

This is important to remember

Frames that go unchallenged tend to cement themselves. Especially when they repeat over time.

What happens is that the frame, from a verbal or nonverbal statement that simply describes or comments on reality, becomes more and more a reality of your shared (social) life.

This is a very important principle, because it means that if you let bad frames go unchallenged, you lose arguments and/​or persuasive power forever, not just for the few seconds that the frame lasts. And if they are repeated frames, they can also compound their power over time.

A similar principle applies to micro-aggressions: if you let micro-aggressions go unchallenged, they build up, and you die by a thousand cuts.

This usually means it’s a good idea to get in the habit of challenging most frames that are irrational/​disagreeable early on in every new relationship.

• This doesn’t sound wrong exactly but it does sound icky.

It seems to be missing “we are talking to each other in good faith, cooperatively; we point out the existence of the frame choices rather than sneakily trying to end up with a frame that’s good for what we want right now”.

I mean it’s technically kindasorta there in some of the expanding, like “you think so? But this other person said the opposite”. But the spirit still seems adversarial and manipulative, even in “win-win”. Like… “the only reason I’m not punching you is because you got lucky and accidentally agree with what I want”.

If I used these techniques with myself it would feel like bad brain habits.

I don’t want to be on the receiving end.

Maybe this is supposed to be applicable only in situations where you’re fine treating people as NPCs to be manipulated? If so, add that context, on LW. If not—FYI, it came off as if it was, to at least one person, namely me.

• GuySrinivasan, these are instructions on casting a dark spell, step by step.

You don’t cast Avada Kedavra with happy thoughts, you cast it with the intention to kill

You cast fiendfyre with blood

And you cast “TARE DETRIMENS” by having very bad brain habits, on average

This wasn’t a guide for the purpose of doing it. This was a guide for the purpose of recognizing it when done to you, seeing them dance the steps, and having them reified.

If it wasn’t “icky”, why would it be a dark art?

• “Wouldn’t it make more sense to use as a reward signal the fact-of-the-matter about whether a certain system followed a particular human’s intention?”

If I understand what you are saying correctly, this wouldn’t work, for reasons that have been discussed at length in various places, e.g. the mesa-optimization paper and Ajeya’s post “Without specific countermeasures...” If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or appearing to follow said human’s intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

And so if you train an AI to build another AI that appears to follow a particular human’s intention, you are just training your AI to do capabilities research.

(Perhaps instead you mean: No really, the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so as far as we can tell from the outside. Well, how are we going to construct such a reward signal? That would require getting all the way to the end of evhub’s Interpretability Tech Tree.)

• If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or appearing to follow said human’s intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

Right yeah I do agree with this.

Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so [...] That would require getting all the way to the end of evhub’s Interpretability Tech Tree

Well, I think we need something like a really-actually-reward-signal (of the kind you’re pointing at here). The basic challenge of alignment as I see it is finding such a reward signal that doesn’t require us to get to the end of the Interpretability Tech Tree (or similar tech trees). I don’t think we’ve exhausted the design space of reward signals yet, but it’s definitely the “challenge of our times”, so to speak.

• I still think this is great. Some minor updates, and an important note:

Minor updates: I’m a bit less concerned about AI-powered propaganda/​persuasion than I was at the time, not sure why. Maybe I’m just in a more optimistic mood. See this critique for discussion. It’s too early to tell whether reality is diverging from expectation on this front. I had been feeling mildly bad about my chatbot-centered narrative, as of a month ago, but given how ChatGPT was received I think things are basically on trend.
Diplomacy happened faster than I expected, though in a less generalizable way than I expected, so whatever. My overall timelines have shortened somewhat since I wrote this story, but it’s still the thing I point people towards when they ask me what I think will happen. (Note that the bulk of my update was from publicly available info rather than from nonpublic stuff I saw at OpenAI.)

• I feel like your predictions for 2022 are just a touch over the mark, no? GPT-3 isn’t really ‘obsolete’ yet or is that wrong?

I’m sure it will be in a minute, but I’d update that benchmark to mid-2023, or potentially whenever GPT-4 gets released.

I really feel like you should be updating slightly longer, but maybe I misunderstand where we’re at right now with chatbots. I would love to hear otherwise.

• I’m fairly sure Tononi said multiple times that IIT implies a simulated brain would not be conscious. I’m not sure how this affects the Chinese room, but it seems plausible it would work by simulating a brain. Then it wouldn’t be conscious.

Why does this follow? The simulation still has states and information that can be integrated.

• This seems to me like a “you do not understand your own values well enough” problem, not a “you need a higher moral authority to decide for you” problem.

Or, if we dissolve the idea of “your values” as something that produces some objective preference ordering (which I suppose is the point of this post): you lack a process that allows you to make decisions when your value system is in a contradictory state.

• this seems to me to excessively dimensionality-reduce the political spectrum. for example, where does war profiteering fit?

• I think the political spectrum doesn’t quite line up with this. For basically any point on the compass, there will be things that should be managed in a decentralized way following Commercial precepts, and things that should be managed in a centralized way following Guardian precepts. The question is just which activities fall in which bucket. [Is medicine a good that should be bought and sold like any other, or largesse which should be dispensed?]

But some sets of choices will be more synergistic or more contradictory than others; applying this technique to the political spectrum might identify a few good clusters and a bunch of worse hybrids. [Given that politics is mostly about coalitions and loyalty instead of technical coherence, my guess is this won’t be super useful.]

• The political spectrum is quite orthogonal to this. You’ve got communist Guardians who want to protect the means of production and share out all the goods along with Commercial communists who think that if we just sit down and decide to all work together we’ll come up with plans that will equally enrich everyone. Or fundamentalist Guardians who just want to ensure everyone’s purity along with Commercial fundamentalists who are arguing among themselves as to what’s the best way of interpreting a single word in the Bible.

It might be more useful to compare this with Scout and Soldier mindsets, which seem to be pointing in vaguely the same direction, but in the area of epistemics, rather than morals.

• ah, then maybe I’m misunderstanding at a deeper level. I will abstain from further comment for now.

• So, I think firms that sell weapons to individuals and governments broadly fall under the Commercial cluster; following the Guardian precepts as such a firm is probably a mistake. Note that these are ethical standards, so you could look at any individual firm and ask whether they’re following the precepts in particular cases. I suspect that most cases of war profiteering are a failure on the buyer’s side, at least as far as this view is concerned.

There is something interesting here with the question of largesse—traditionally, the Guardian’s role is to take resources from their territory and then spend those resources on buying loyalty /​ public goods. The military-industrial-complex is often this sort of largesse operation, but it’s not obvious that it should be. [Similarly, Jacobs talks a lot about how government meddling in agriculture is probably downstream of agriculture’s traditional role as powerbase for Guardians, but they tend to have lower yields /​ be worse at it than Commercial agriculture.]

There’s also this point that—the Guardians do need to be involved in trading! Even if the Baron isn’t supposed to engage in business himself, he still has things he needs to buy, taxes he needs to collect, and so on. This means there needs to be some sort of agent who is able to engage in trade, and presumably does so mostly using the Commercial precepts, and hopefully with a lessening of the implicit threat.

• I’d love to hear why this warrants front page, and I’d love to hear what Valentine hopes to gain from asking this! This is a topic with a large amount of adversarial agency coming from the right’s culture war. What leads you to bring that here, and why is it worth a frontpage on the ai safety forum?

• What leads you to bring that here

I noticed I was confused. The world didn’t make sense to me at this spot. I could guess at some pieces, like “Okay, maybe wokism is actually just really super popular”, but that didn’t account for all the pieces I was observing.

I imagined that Less Wrong would be a good place to ask people about this in a way relatively unlikely to swing into culture war baloney. I just want to understand how the world is shaped.

why is it worth a frontpage on the ai safety forum?

I… have no idea. I didn’t do that. Or if I did it was purely by accident. I wouldn’t have guessed this belonged at all in anything having to do with AI risk, other than it being about modeling the world, which is generically connected to AI risk in an overall kind of way.

• to clarify, I was asserting that less wrong is the AI safety forum of the world.

• Oh! Ha! Okay. Well, I view Less Wrong as the rationality forum of the world, which happens to include a lot of examination of AI safety/​risk. If there were a division within LW between “AI” and “not AI”, I totally would have put this in the “not AI” category.

• I wish I could only subscribe to the AI safety stuff but still get all of the thinking out loud the AI safety people do. unfortunately that often doesn’t get tagged AI. I also have trouble getting into a mindset where anything worth applying rationality to isn’t therefore, by nature of wanting rationality, fundamentally an AI problem—after all, rationality is the task of building a brain in a brain.

I’m afraid I can’t answer your other question in the other thread, my brain is crashing repeatedly if I try to.

• The question “why do companies do something seemingly unprofitable” is in my opinion worth asking.

The answers seem to be one of:

• it actually is profitable, because...

• a principal-agent problem, the people doing the thing are not aligned with the company (and the company will not replace them, because...)

Both seem likely, I wish I could figure out which one is true (possibly both).

• Is it on the AI safety forum, though? Turns out it is, though downvoted...

This is an interesting question, whatever your political bent: there is a noticeable uptick in representation etc. in new media. It’s worth understanding the underlying mechanisms at work, seeing as whatever the reason for the changes, they happened quite fast, whether you’re for such changes or against them.

• The customers. Some companies go hard anti-progress, some companies go hard inclusion. Since many people don’t want to allow anyone who matches BIPOC or LGBTQIA+ to exist, they don’t buy anything that matches those.

note that the companies who do this still typically fund the far right.

also note that the military is threatening to move their bases out of areas that go anti-bipoclgbtqia+. go anti-woke, go broke.

edit to clarify: In other words, I see this as likely to be viewed as a means to political power by those running these corporations. it’s not a short-term economic incentive that leads them to offend some of their customers.

• Heavily downvoted for (a) not answering the question and (b) instead using this space as an opportunity to signal boost the left’s narrative in this particular corner of the culture wars.

[Edited to correct an inappropriate blindness on my part.]

• look if you want to signal boost the rights culture war, go for it, but I’m pretty sure most of this is just that it’s profitable and Disney is predatory and trying to profit off of it without actually making any change. you want to fight for the right, go ahead

• look if you want to signal boost the rights culture war, go for it…

Wanting to flag this as another example of frame control.

I’m not trying to align with the right. I think they’re nuts in almost exactly the same way.

My position is more like anti culture war escalation.

Sadly, this means that when someone is heavily aligned with one side of a front of the war, I can come across as aligning with their enemies.

Alas.

• hmm. perhaps the key thing necessary to disengage with it is to brutally word-replace words used by everyone into custom academic-style language. eg use of the word “woke” is a very strong right-side signal. that’s what I’m getting at with those points in my comment.

• Mmm. As I just mentioned here, I actually didn’t know use of “woke” got interpreted as a “very strong right-side signal”. Lots of left-leaning folk around me use the term too. What does your corner of the left call it?

• I’m sorry, what? but like, that’s… actually true? I did answer the question, your dismissal cannot be in good faith if you think I didn’t.

• I did answer the question, your dismissal cannot be in good faith if you think I didn’t.

Ah, I think I missed this part of your comment here. Not sure how that happened.

On this I apologize. I missed that you were honestly trying to answer the question. Mea culpa.

The info commons point still stands though.

• okay, then that’s fair. and yes, I repeat the liberal center’s talking points because I simply do in fact believe they are true; downvoting me hard because of them feels like an actual dismissal. I admit that I noticed myself feeling agitated and fighty, and that I have had a repeated pattern of engaging that way. The part of my brain that wants me to do that doesn’t seem to be able to disengage; even after an internal conversation about it, that part just says look, I have to be here, it’s important. I have asked for my karma to be slashed because of a similar fight the other day. sorry about this anger tone thing

• Cool. Thank you for explaining.

I also am sure that a more skillful version of me could have named and navigated all this with a lot more grace. You seem to be sensitive on this spot, but I also kind of hit it with a hammer. Sadly I don’t yet see a more graceful way to do the thing I’m caring for without fawning. But I’ll get there. I regret you got hit in the process.

• There are ways to say true things that are partly, mostly, or entirely for reasons having nothing to do with conveying the truth.

Notice the contrasting terms “anti-progress” vs. “inclusion”. And the framing about some folk not wanting categories of people to exist, rather than any kind of framing that such folk are caring for something that matters to them and might matter to civilization. And the injection of a slogan in the last line.

The tone of this isn’t about explaining something. It’s implicitly asserting that wokism is just overwhelmingly popular, which is the closest to an answer to the original question as is given. But it’s mostly about frame assertion.

I don’t care how true someone’s utterances are, or what side of the culture wars they’re fighting for, if they’re bringing in tactics like these. It heavily pollutes the information commons.

• you bring up the right’s frame, I’ll bring up the frame that actually describes my life. I’m responding like my life is being attacked because it is (of course not by this conversation). I used to trust you.

• I think I still appreciate you even though I’m pretty annoyed at you right now. it’s possible I will come to agree more, though I would appreciate understanding expressed in the form of recognizing why I find the word woke to be an aggressive choice of words. that said, I also wouldn’t blame you for deleting all of my comments, because other people have expressed my views better and I clearly have toxicity damage about this topic

edit: oh I see you did that on the other thread. conversation understanding appreciated

• …I would appreciate understanding expressed in the form of recognizing why I find the word woke to be an aggressive choice of words.

Oh. Actually I didn’t know that “woke” was a problematic word in this corner of memespace. I was just trying to point at the memetic structure and found this term lying around. I’ve seen it on media from far left to far right, so I’d guessed it was just the word everyone had converged on for referring to this thing.

Is there a word you prefer?

• My brain is too crashed to answer, and I don’t expect to be able to read this thread without reentering this mindspace. having said my concern, I leave it up to you to detect the degree to which I am or am not right about it and what to do. I might consider recommending an increased-specificity unpack taboo rather than a word-substitution taboo.

• A trick I sometimes use, related to this post, is to ask whether my future self would like to buy back my present time at some rate. This somehow makes your point about intertemporal substitution more visceral for me, and makes it easier to say “oh yes this thing which is pricier than my current rate definitely makes sense at my plausible future rate”.

• In fact, it’s not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don’t know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for Uranium isotope separation.]

• You’re thinking at the wrong level of abstraction. There is no economic incentive for wokism at the corporate level. But look one level below. The question isn’t what causes “corporations” to act in woke ways. The question is, what persuades employees of corporations to act in woke ways?

My hypothesis is that anti-discrimination legislation has, due to court precedents, developed an inverted burden of proof. If a corporation fires or disciplines someone who is non-white, female, disabled, or belongs to a number of other protected categories, it is now up to the corporation to prove that the firing or discipline was done for non-discriminatory reasons. This, combined with the ideological leanings of most people in HR departments, is sufficient to ensure that every corporation has, within it, the equivalent of an ideological cell, whose job it is solely to push the corporation to act in a more woke manner. This ideological cell has both public opinion and federal law on its side; well meaning individuals who push back end up like James Damore.

But unless this had profit appeal I would expect the market to just… eat pure but incomplete ideological capture after a while

The market is part of society. There was a similar argument made against anti-segregation legislation in the 1960s. After all, given that it’s more profitable to sell to both black people and white people than it is to sell to white people only, wouldn’t it be in business owners’ rational self-interest to desegregate their properties?

The answer, in both instances, is the same: if there is a sufficiently high cultural barrier, then it will be more profitable to go with the culture than against it. Most reasonable people can at least nod along to the woke slogans. After all, it is quite reasonable to suggest that women ought to be treated equally to men, that blacks should be treated equally to whites, and people shouldn’t be discriminated against because of their sexual orientation. It’s only when those reasonable propositions are taken to extremes that they result in wokism.

Because of this motte-and-bailey aspect to wokism, it’s easy for wokism to permeate the culture, and for advocates of wokism to tar those who oppose them as racists and bigots.

But there’s a counter-push of “Lots of people don’t like being lectured about politics when they’re seeking entertainment” (for instance). It’s not at all clear to me that the first effect is so utterly hugely enormously larger than the second that the profit incentive would cause so many companies to swing hard woke.

Lots of people also threatened to move to Canada if Trump was elected President. How many of them actually chose to do so? A Republican in the United States will shout vociferously about Coca Cola or Nike engaging in woke behavior, but will he or she choose Pepsi when he or she next shops for groceries? Will he or she buy some other brand of shoes? And if he or she does, will it make a difference? After all, Pepsi and Reebok are hardly less woke than Coca Cola and Nike.

• Thank you, this was helpful.

Lots of people also threatened to move to Canada if Trump was elected President. How many of them actually chose to do so?

I don’t think this is the right analogy. Listening to more moderate and right-leaning folk, one gets the impression that viewership of shows and movie franchises that are going woke has been dropping like a rock. Like apparently there was an analysis of when people turned off the Captain America streaming show on Disney+ (I forget its name — the one where Falcon becomes the new Captain), and the moment it plunged was the scene where police were harassing Falcon due to racial profiling.

Maybe the most befuddling part of the culture wars is the way, on every front, soldiers on both sides muddle the facts. It’s hard for me to tell what’s even true. Daniel Schmachtenberger describes this as “polluting the information commons”. There’s a Molochian dynamic where the facts of the matter are part of what’s being fought over.

That’s why I’ve been looking at places that have a profit incentive that are also catering specifically to wokism, noticing I don’t see a corresponding shift in the other direction at the same scale, and kind of scratching my head. Wokism doesn’t look as obviously profitable a thing to align with as their behavior seems. Even if it is, that seems like it’d be hard to determine.

The possibility that it’s actually more like legally imposed internal friction makes some good sense. I doubt that’s the full picture but it’s a plausible major component.

• Ah, so being woke increases your job safety, if you are a member of one of the protected groups. It makes your membership in that group more salient; if you ever get fired, it makes it easier to argue that you were fired because you belonged to that group.

And there is no obvious counter-strategy, because telling them “stop talking about your membership in a protected group all the time” can itself be interpreted as attacking the group.

• A concrete example of this inversion of the burden of proof arose just today, with regards to the Twitter layoffs:

Their complaint cites calculations provided by Mark Killingsworth, an economics professor at Rutgers University, to allege that, overall, “57 percent of female employees were laid off on November 4, 2022, while 47 percent of male employees were laid off.”

This lawsuit is not alleging that any specific discriminatory behavior took place, or that discriminatory reasoning was used by managers in choosing who got the pink-slips and who got to stay on. Rather, the brute fact that more women than men were laid off is used as evidence to assert that Twitter was targeting women. Now, it’s up to Twitter to show that it was not behaving in a discriminatory manner in conducting its layoffs.

• …the question does sometimes haunt me, as to whether in the alternative Everett branches of Earth, we could identify a distinct cluster of “successful” Earths, and we’re not in it.

This Failing Earth, Eliezer Yudkowsky

Does anyone else wonder similar things about the EA/​rationality scene? If we could scan across Tegmark III, would we see large clusters of nearby Earths that have rationality & EA communities that embarrass us and lay bare our own low standards?

• 9 Dec 2022 3:06 UTC

There are two major unexamined assumptions underlying this analysis.

The most flagrant is the assumption that the expected value of all work done now on x-risk is positive. You might hope that it is, but you can’t actually know or even have rationally high confidence in it. Without this assumption, you might be able to say that anything we do today is important, but can’t say that it’s equivalent to saving lives. You may equally well be doing something equivalent to ending lives.

Another serious unjustified assumption is that the correct measure is some aggregated utility that is linear in the number of people who come to exist. I have extreme doubts that murdering 7 billion people today is ethically justifiable if it would increase the population capacity of the universe a trillion years from now by 0.0000000000000000000000000000000000000001% even though it means that a lot more people get to live. Likewise I have an expectation that allowing capacity for one more potential person to exist a trillion years from now is morally much less worthwhile than saving an actual person today.

• Oh, almost forgot!

As to your second objection, I think that for many people the question of whether murdering people in order to save other people is a good idea is a separate moral question from which altruistic actions we should take to have the most positive impact. I am certainly not advocating murdering billions of people.

But whether saving present people or (in expectation) saving many more unborn future people is a better use of altruistic resources seems to be largely a matter of temperament. I have heard a few discussions of this and they never seem to make much sense to me. For me it is literally as simple as people being further away in time which is another dimension, not really any different than spatial dimensions, except that time flows in one direction and so we have much less information about it.

But uncertainty only calls into question whether or not we have impact in expectation, for me it has no bearing on the reality of this impact or the moral value of these lives. I cannot seem to comprehend why other people value future people less than present people, assuming you have equal ability to influence either. I would really like for there to be some rational solution, but it always feels like people are talking past each other in these types of discussions. If there is one child tortured today it cannot somehow be morally equivalent to ten children being tortured tomorrow. If I can ensure one person lives a life overflowing with joy today, I would be willing to forego this if I knew with certainty I could ensure one hundred people live lives overflowing with joy in one hundred years. I don’t feel like there is a time limit on morality, to be honest it still confuses me why exactly some people feel otherwise.

You also mentioned something about differing percentages of the population. Many of these questions don’t work in reality because there are a lot of flow-through effects, but if you ignore those, I also don’t see how 8,000 people today suffering lives of torture might be better than 8 early humans a couple hundred thousand years ago suffering lives of torture, even if that means it was 1/1,000,000 of the population in the first case (just a wild guess) and 1/1,000 of the population in the second case.

These questions might be complicated if you take the average view on population ethics instead of the total view, and I actually do give some credence to the average view, but I nonetheless think the amount of value created by averting X-risk is so huge that it probably outweighs these considerations, at least for the risk-neutral.

• Interesting objections!

I mentioned a few times that some and perhaps most x-risk work may have negative value ex post. I go into detail in footnote 13 on how this work may turn out to be negative.

It seems somewhat unreasonable to me, however, to be virtually 100% confident that x-risk work is as likely to have zero or negative value ex ante as it is to have positive value.

I tried to account for the extreme difficulty of influencing the future by giving work relatively low efficacy, i.e. in the moderate case 100,000 (hopefully extremely competent) people working on x-risk for 1000 years only cause a 10% reduction of x-risk in expectation, in other words effectively a 90% likelihood of failure. In the pessimistic estimate, 100,000 people working on it for 10,000 years only cause a 1% reduction in x-risk.

Perhaps this could be a few orders of magnitude lower, say 1 billion people working on x-risk for 1 million years only reduce existential risk by 1 in 1 trillion in expectation (if these numbers seem absurd you can use lower numbers of people or time, but this increases the number of lives saved per unit of work). This would make the pessimistic estimate have very low value, but the moderate estimate would still be highly valuable (10^18 lives per minute of work).
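To make the back-of-the-envelope arithmetic explicit: in the sketch below, the number of future lives at stake (`N_FUTURE_LIVES`) is a placeholder I chose for illustration, not a figure quoted in this thread; the workforce, duration, and risk-reduction numbers are the moderate case above, and a "minute of work" is treated as a round-the-clock calendar minute.

```python
# Sketch of the lives-saved-per-minute-of-work arithmetic.
# N_FUTURE_LIVES is an assumed placeholder; the other numbers are the
# "moderate case" from the comment above.
N_FUTURE_LIVES = 1e32            # assumed total future lives at stake
people, years = 100_000, 1_000   # moderate case: workers and duration
risk_reduction = 0.10            # 10% x-risk reduction, in expectation

minutes_per_year = 365.25 * 24 * 60   # calendar minutes in a year
total_minutes = people * years * minutes_per_year
lives_per_minute = N_FUTURE_LIVES * risk_reduction / total_minutes
print(f"{lives_per_minute:.2e} lives saved per minute of work")
```

With these placeholder inputs the result lands around 10^17 lives per minute; the point is that the conclusion scales linearly with each input, so a few orders of magnitude of pessimism in efficacy trades off directly against the assumed size of the future.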

All that is to say, I think that while you could be much more pessimistic, I don’t think it changes the conclusion by that much, except in the pessimistic case—unless you have extremely high certainty that we cannot predict what is likely to help prevent x-risk. I did give two more pessimistic scenarios in the appendix which I say may be plausible under certain assumptions, such as 100% certainty that x-risk is inevitable. I will add that this case is also valid if you assume 100% certainty that we can’t predict what will reduce x-risk, as I think this is a valid point.

• 9 Dec 2022 2:58 UTC
16 points
6 ∶ 2

That you think they’re going super hard woke (especially Disney) is perhaps telling of your own biases.

Let’s look at Disney and Hollywood (universities are their own weird thing). The reality is that in the Anglosphere there are lots of progressive people with money to spend on media. You can sell “woke” media to those people, and lots of it. Even more so when there’s controversy and you can get naive lefties to believe paying money to the megacorp to watch a mainstream show is a way to somehow strike back against the mean right-wingers. And to progressive people it doesn’t feel like “being lectured to about politics”, because that’s not what media with a political/values message you agree with feels like. So going woke is 100% a profit-motivated decision. The leadership at big media companies didn’t change much over the last decade or two, nor likely did their opinions (whatever those actually are). But after gay marriage gained significantly above 50% approval in the US and the Obergefell decision happened, it became clear to them that it was safe to be at least somewhat socially progressive on issues like that, and would be profitable.

But equally, almost every single “woke” Disney movie has the “woke” components carefully contained such that they can easily be excised for markets where they are a problem. You see a gay kiss in the background of a scene in Star Wars, it gets cut for the Chinese and Middle East markets. Disney has many very progressive employees who are responsible for making the actual art they produce; artists lean pretty strongly progressive in my experience, so of course the employees’ values come out in the art they make. But the management puts very strict limits on what they can do precisely because anything less milquetoast is believed to be less profitable.

• That you think they’re going super hard woke (especially Disney) is perhaps telling of your own biases.

…and then you go on to describe how Disney is in fact selling movies with woke components to the West, which is exactly what I was talking about.

Just… don’t do this. I’m not available for this kind of psychoanalysis. I find it extremely difficult to engage in good faith when people make moves like this one. My biases are my business. If you think I’m missing something, just point it out. Don’t try to diagnose my failures of rationality.

• You are making the mistake of assuming that because the median Chinese citizen is ideologically opposed to the American left in a technical sense, Disney’s localizing movies for China means that it’s not a captured institution. But in fact the American left cares very little about the beliefs and attitudes of the median Chinese citizen because those people compete in a different political arena than them. More telling than the fact that Disney localizes for China at all is the fact that they refuse to make high budget, well marketed movies catering to (for example) the American Christian right, even though such a niche has been proven to be very profitable for independent filmmakers.

• Or just visit to get information. Don’t choose antibiotics vs vitamins based on estimated value delivered, but diversify to learn about them all, to learn what it takes to deliver them. But the most valuable information will probably be unrelated to what you bring.

• I give a crisp definition from 6:27 to 7:50 of this video:

• ^Chinese total cases

Like I predicted last week, Chinese COVID numbers are going down.

However, most of this decline is from asymptomatic cases.

^Beijing cases

This is… interesting? Maybe less testing → fewer false positives. This doesn’t match case decline in previous months, when a decrease in asymptomatic cases almost always came with a corresponding decrease in symptomatic cases.

Really makes you think. Any ideas?

• Proposal: consciousness very much exists, but continuity of consciousness is an illusion.

If we assume that each moment of consciousness is its own entity, with no connections to any other, we can dissolve many problems around continuity of consciousness, like simulations, teleportation, change of computation substrate, etc.

• If we assume that each moment of consciousness is its own entity, with no connections to any other,

Why should we assume it? My consciousness now clearly does have connections to my consciousness one second ago, three hours ago, twenty years ago. One might as well assume that for the keyboard I am using, each moment of its existence has no connection to any other. It is straightforwardly false.

• ## Executive Summary

1. Big jump in cases and hospitalizations likely means winter surge.

Not sure how reliable case counts are at all, since they go down whenever the government shuts down testing centers. Case counts should at most be considered alongside case-positivity rates; and given the risk that something also goes wrong with measuring positivity, I’m not sure why case counts on their own would be considered a very good way of measuring the pandemic.

2. Chinese protests suppressed, some modest loosening did result.

This is definitely an area where correlation does not strongly imply causation, because it is well within the interests of the PRC to visibly halt opening up after large-scale protests demand it, in order to discourage future protests, and because protest organizers in China are definitely capable of strategically timing protests just before an opening up that was already planned, to make it look like the protests caused it.

I’m not saying that protests didn’t increase the subsequent loosening (protests tend to do that, so it could possibly happen in China), but lots of unreliable sources are loudly trumpeting that exact claim, so the burden of proof for this is much higher than anything mentioned in the China section of this post.

3. Long Covid study finds control group that had other respiratory illnesses did worse than the Covid group.

How were the covid-positive and covid-negative categories sorted? Rapid antigen tests had a massive false-negative rate, and that was before omicron. I’ve encountered a lot of anecdata of intelligence/energy being permanently lowered after a covid infection, along with symptoms such as insomnia and shortened attention spans. We’re still in the middle of the post-truth infodemic, and there’s a long history of unusually flawed studies that claim to confirm or deny covid brain damage. So I don’t see why this particular study is supposed to count as any sort of “reiteration of the central point of Long Covid” when any single methodological flaw would make it about as likely to point in the wrong direction as the right one, and such flaws are extremely common in publicly available Long Covid studies.

• This might be the lowest karma post that I’ve given a significant review vote for. (I’m currently giving it a 4). I’d highly encourage folk to give it A Think.

This post seems to be asking an important question of how to integrate truthseeking and conflict theory. I think this is probably one of the most important questions in the world. Conflict is inevitable. Truthseeking is really important. They are in tension. What do we do about that?

I think this is an important civilizational question. Most people don’t care nearly enough about truthseeking in the first place. The people who do care a lot about truthseeking tend to prefer avoiding conflict, i.e. tend to be “mistake theory” types.

Regular warfare is costly/terrible and should be avoided at all costs… but, “never” is just not an actually workable answer. Similarly, deception is very costly, in ways both obvious and subtle. One of my updates during the 2019 Review was that it is plausible that “don’t lie” is actually even more important than “don’t kill” (despite those normally being reversed in my commonsense morality). But, like violent warfare, the answer of “never” feels like an overly simplified answer to “when is it acceptable to lie?”

Eliezer’s discussion of meta-honesty explores one subset of how to firm up honesty around the edges. I like Gentzel’s post here for pointing in a broader direction.

This is not necessarily an endorsement of any particular point made here, only that I think the question is important. I think people who gravitate towards “truthseeking above all else” have a distaste for conflict theory. Unfortunately, I trust them more than most conflict theorists on how to develop norms around truthtelling that hold up under extreme conflict.

• lsusr, if it was proven that the human brain actually does work on quantum principles, how would that change your view on free will?

• 9 Dec 2022 1:37 UTC
4 points
3 ∶ 0

This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.

I’m curious how you’re trying to reach such an audience, and what their reactions have been.

• Universities are profit-focused? Disney and Hollywood are two distinct systems?

• Universities are profit-focused?

• Harvard: $51B

• Yale: $42B

• Stanford: $37B

• Princeton: $37B

• MIT: $27B

• UPenn: $20B

• ...

How do they get there? It’s not through lack of trying, and the majority of it is not tuition. Rather:

• My mom is familiar with a few of the above universities, and has said that “Napoleon would be proud” of how organized and efficient they are at hounding alumni for donations.

• I think they also care a great deal about getting money from research grants. I’ve heard many professors feel pressure to get grants. Probably in part because:

• There’s an entire system for managing money that has been given to them with strings attached (e.g. funding XYZ research) and always using the most-restricted money to pay for a given thing. For example, maybe the university needs to spend $20k on maintenance for a telescope; if they’re then given a grant of that size to do astronomy research, they can use the grant to pay for that maintenance, so the grant has effectively given them $20k to do anything they want with. This does make sense—it’s rational behavior and it is fulfilling the terms (Mom said they’re very careful to remain within the letter of the law, especially for government grants), but it has interesting consequences.

• And then there’s investment earnings from past years’ endowments.

• I was reading this and kept mentally renaming it to the “anti-enlightened” agent. It does suggest that this might come in gradients. If there are only very specific and rare ways to update a deeper layer, the agent might seem like a wrappermind for a while without actually being one. Taking 30,000 years to go from an 8-year-old’s love of spaceships to a 10-year-old’s love of spaceships is still multiple millennia of rough time. Any mind with a physical substrate (which should be all of them) will be alterable by hitting the hardware. This means a true or very hard wrappermind will have to be able to deny access to a specific spatial point very strongly.

Also, anything that is not a wrappermind will have an uppermost layer that can be rewritten. Such a thing can’t have an “essential nature”.

Now it would seem that for most agents, the deeper a layer is, the harder it is to guess its malleability, at least from the outside. And it might not be obvious even from the inside.

• Thanks for writing this post.

You mention that:

only conscious beings will ask themselves why they are conscious

But at the same time you support epiphenomenalism whereby consciousness has no effect on reality.

This seems like a contradiction. Why would only conscious things discuss consciousness if consciousness has no effect on reality?

Also, what do you think about Eliezer’s Zombies post? https://www.lesswrong.com/posts/7DmA3yWwa6AT5jFXt/zombies-redacted

• A large part of it is the US legal system and anti-discrimination law playing out in counterintuitive ways. The key thing is that where corporations are concerned, US law runs on counterfactual court cases; the actual text of legislation matters only insofar as it affects those court cases. Combine this with management having imperfect control over employees within a corporation, imperfect resolution of facts, and a system for assigning damages that’s highly subjective, and executives are left in an odd position.

Every company which does a significant amount of hiring and firing, ie every company above a certain size, will fire and reject some number of people in protected groups. Some of those people will claim that it was because of their group membership, and sue. As a distant corporate executive, you can’t prevent this, and can’t tell whether the accusation is true.

But you can put everyone through some corporate training. And it seems that the empirical result, discovered by legal departments that have been through this many times, is that you get the best outcomes in the court cases if you go over the top and do reverse-discrimination that the letter of the law says should be illegal.

• From skimming the benchmark and the paper, this seems overhyped (like Gato). Roughly, it looks like:

• May 2022: Deepmind releases a new benchmark for learning algorithms

• ...Nobody cares (according to google scholar citations)

• Dec 2022: Deepmind releases a thing that beats the baselines on their benchmark

I don’t know much about GNNs & only did a surface-level skim so I’m interested to hear other takes.

• Off the top of my head (and slightly worried that this will become a major culture war thing, but I will answer the question that was asked):

• There is a principal-agent problem. If pursuing wokeness comes at the expense of profits, the latter doesn’t necessarily affect the people who make those decisions very much.

• My impression is that many of the executives are in fact woke, and others are at least unwilling to say otherwise.

• Wokeness seems pretty optimized for shouting down and intimidating opposition. (I think much of the specifics of the ideology were and are determined by some people successfully shouting down others within the woke movement.)

• At least in the entertainment industries, when a distinctly woke thing is made, there tends to be a narrative that evil people hate the thing, and therefore anyone who hates the thing is evil, and therefore lost profits should be treated with an attitude of “good riddance” rather than “maybe this thing was made badly”. I think this tends to be the woke narrative, and generally promoted by media—and, as per the previous item, any opposing narrative would tend to get shouted down.

• Aren’t CEOs mostly Republicans? And what’s stopping the shareholders from insisting on prioritizing profit?

• I’m thinking of tech companies that tend to be based in the SF Bay Area, and the most prominent entertainment companies are Hollywood—both of which are known for being more lefty. Also, CEOs are one thing, but other executives matter too; and writers and directors especially in entertainment.

Regarding shareholders, I don’t really know how that works. I do think it’s a general fact that getting a zillion people to coordinate on expressing their wishes is difficult. There’s a board of directors, who I guess nominally represent shareholders? Looks like every company can have their own rules, though I assume they’re mostly similar; looking at Disney’s bylaws, it says:

SELECTION OF NEW DIRECTORS
The Board shall be responsible for selecting its own members. The Board delegates the
screening process for new Directors to the Governance and Nominating Committee.

Although “Each Director shall at all times represent the interests of the shareholders of the
Company”, I suspect this is difficult to enforce. If the board ends up dominated by a woke narrative (with at least a vocal minority of woke people and a majority of people who shut up and go along with it), leading to unprofitable decisions, what can the shareholders do about it, other than sell their stock? “Shareholder revolts” are a thing, which implies that the divergence between shareholders’ desires and what the board is doing can indeed get pretty wide (though also implies that they can eventually get their way).

I do suspect that the profit motive will ultimately reassert itself, but it seems to have taken a long time and doesn’t show major signs of happening yet. It may take an “everyone knows that everyone knows that the woke decisions have gotten really bad” moment, which the woke narrative promoted by most media is probably delaying.

• Typically board members are elected by shareholders, and an attacker can win a proxy fight with a relatively small portion of the shares if he can convince other shareholders.

• 9 Dec 2022 0:16 UTC
1 point
3 ∶ 2

Random theory I heard: When Disney releases a new black princess, the fact that toxoplasma of rage forms around it provides them a lot of free advertising. Most people are like ‘shrug’ and don’t care that much, but the fact that everyone’s complaining and/​or hyping it gets it onto most people’s radar.

• I do think this is likely to be part of the strategy that the right-wingers who actually own the companies intend to use, but that said, that’s just because the toxoplasma-of-rage filter requires marketing departments to think about what is not only an improvement, but an improvement that will offend the right.

• Solve the puzzle: 63 = x = 65536. What is x?

(I have a purpose for this and am curious about how difficult it is to find the intended answer.)

• So x = 63 in one base system and 65536 in another?

6*a+3=6*b^4+5*b^3+5*b^2+3*b+6

Wolfram Alpha provides this nice result. I also realize I should have just eyeballed it with 5th grade algebra.

Let’s plug in 6 for b, and we get… fuck.

I just asked it to find integer solutions.

There’s infinite solutions, so I’m just going to go with the lowest bases.

x=43449

Did I do it right? Took me like 15 minutes.
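For what it’s worth, the search can be mechanized. Here is a quick brute-force sketch (the helper names are mine, not from the thread): interpret “65536” in each candidate base b ≥ 7 (the digit 6 rules out smaller bases), and check whether the result can also be read as “63” in some integer base a ≥ 7.

```python
def digits_value(digits, base):
    # Evaluate a digit sequence in the given base, most significant first.
    value = 0
    for d in digits:
        value = value * base + d
    return value

def solve(max_base=100):
    # "65536" has digits 6,5,5,3,6, so its base b must be at least 7;
    # "63" likewise needs its base a to be at least 7.
    for b in range(7, max_base):
        x = digits_value([6, 5, 5, 3, 6], b)
        a, r = divmod(x - 3, 6)   # solve 6*a + 3 == x for integer a
        if r == 0 and a >= 7:
            return x, a, b
    return None

print(solve())  # → (43449, 7241, 9)
```

This confirms x = 43449 as the smallest solution: 6·7241 + 3 = 43449, and 6·9⁴ + 5·9³ + 5·9² + 3·9 + 6 = 43449.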

• 8 Dec 2022 23:18 UTC
LW: 8 AF: 6
1 ∶ 2
AF

I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:

• Publishing capabilities work is notably worse than just doing the work.

• I’d argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.

• Though, a counter-point is that if an organisation doesn’t have great cyber-security and is a target for hacking, capabilities can easily leak (see, e.g., the Soviets getting nuclear weapons 4 years after the US, despite it being a top-secret US program, and before the internet)

• Capabilities work can be importantly helpful for alignment work, especially empirically focused work.

Probably my biggest crux is around the parallel vs serial thing. My read is that fairly little current alignment work really feels “serial” to me. Assuming that you’re mostly referring to conceptual alignment work, my read is that a lot of it is fairly confused, and would benefit a lot from real empirical data and real systems that can demonstrate concepts such as agency, planning, strategic awareness, etc. And just more data on what AGI cognition might look like. Without these, it seems extremely hard to distinguish true progress from compelling falsehoods.

• When another article of equal argumentative caliber could have just as easily been written for the negation of a claim, that writeup is no evidence for its claim.

• 8 Dec 2022 23:06 UTC
0 points
0 ∶ 0

Code that I uploaded to GitHub and the writing that I’ve put into this blog went into training these models: I didn’t give permission for this kind of use, and no one asked me if it was ok. Doesn’t this violate my copyrights?

GitHub requires that you set licence terms for your code. And you can’t let outside parties access the code by accident; you have to specifically allow access. Either the use is or is not permitted by the set licences. And you published your blog. Would you go after people that apply things mentioned in your blog? You did in fact give permission.

Now it is a little bit murky when there are novel uses which the licensor didn’t have in mind. But it is not like we should assume that everything is banned by default if quite wide permissions have been granted. Old licences have to mean something in the new world.

• My (very amateur and probably very dumb) response to this challenge:

tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us detect deception in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner optimization aimed at goals other than “good human values”. We don’t know that we have misaligned mesa-optimizers, and we continue using them to do oversight on yet more capable models with the same problems, and then there’s a treacherous turn and we die.

These are first pass thoughts on why I expect the OpenAI Alignment Team’s plan to fail. I was surprised at how hard this was to write, it took like 3 hours including reading. It is probably quite bad and not worth most readers’ time.

# Summary of their plan

The plan starts with training AIs using human feedback (training LLMs using RLHF) to produce outputs that are in line with human intent, truthful, fair, and don’t produce dangerous outputs. Then, they’ll use their AI models to help with human evaluation, solving the scalable oversight problem by using techniques like Recursive Reward Modeling, Debate, and Iterative Amplification. The main idea here is using large language models to assist humans who are providing oversight to other AI systems, and the assistance allows humans to do better oversight. The third pillar of the approach is training AI systems to do alignment research, which is not feasible yet but the authors are hopeful that they will be able to do it in the future. Key parts of the third pillar are that it is easier to evaluate alignment research than to produce it, that to do human-level alignment research you need only be human-level in some domains, and that language models are convenient due to being “preloaded” with information and not being independent agents. Limitations include that the use of AI assistants might amplify subtle inconsistencies, biases, or vulnerabilities, and that the least capable models that could be used for useful alignment research may themselves be too dangerous if not properly aligned.

# Response

A key claim is that we can use RLHF to train models which are sufficiently aligned that they can themselves assist human overseers in providing training signal for yet more powerful models, and that we can scale up this process. The authors mention in their limitations how subtle issues with the AI assistants may scale up in this process. Similarly, small ways in which AI assistants are misaligned with their human operators are unlikely to go away. The first LLMs you are using are quite misaligned in the sense that they are not trying to do what the operator wants them to do; in fact, they aren’t really trying to do much; they have been trained in a way that their weights lead to low loss on the training distribution, as in you might say they “try” to predict likely next words based on internet text, though they are not internally doing search. When you slap RLHF on top of this, you are applying a training procedure which modifies the weights such that the model is “trying” to produce outputs which look good to a human overseer; the system is aiming at a different goal than it was before. The goal of producing outputs which look good to humans is still not actually what we want, however, as this would lead to giving humans false information which they believe to be true, or otherwise outputs which look good but are misleading or incorrect. Furthermore, the strategy of RLHF is not going to create models which robustly learn the goals we want; for instance, you can see how the jailbreaking of ChatGPT uses out-of-training-distribution prompts to elicit outputs we had thought we trained out. RLHF doesn’t robustly teach the goals we want it to; we don’t currently have methods that do.

There’s some claim here about the limit, where if you provided an absolutely obscene amount of training examples, you could get a model which robustly has the right objectives; it’s unclear to me if this would work, but it looks something like starting with very simple models and applying tons of training to try to align their objectives, and then scaling up; at the current rate we seem to be scaling up capabilities far too quickly in relation to the amount of alignment-focused training. The authors agree with the general claim “We don’t expect RL from human feedback to be sufficient to align AGI”.

The second part of the OpenAI Alignment Team’s plan is to use their LLMs to assist with this oversight problem by allowing humans to do a better job evaluating the output of models. The key assumption here is that, even though our LLMs won’t be perfectly aligned, they will be good enough that they can help with research. We should expect their safety and alignment properties to fall apart when these systems become very intelligent, as they will have complex deception available to them.

What this actually looks like is that OpenAI continues what they’re doing for months-to-years, and they are able to produce more intelligent models and the alignment properties of these models seem to be getting better and better, as measured by the fact that adversarial inputs which trip up the model are harder to find, even with AI assistance. Eventually we have language models which are doing internal optimization to get low loss, invoking algorithms which do quite well at next token prediction, in accordance with the abstract rules learned by RLHF. From the outside, it looks like our models are really capable and quite aligned. What has gone on under the hood is that our models are mesa-optimizers which are very likely to be misaligned. We don’t know this and we continue to deploy these models in the way we have been, as overseers for the training of more powerful models. The same problem keeps arising, where our powerful models are doing internal search in accordance with some goal which is not “all the complicated human values” and is probably highly correlated with “produce outputs which are a combination of good next-token-prediction and score well according to the humans overseeing this training”. Importantly, this mesa-objective is not something which, if strongly optimized, is good for humans; values come apart in the extremes; most configurations of atoms which satisfy fairly simple objectives are quite bad by my lights.

Eventually, at sufficiently high levels of capabilities, we see some treacherous turn from our misaligned mesa-optimizers, which are able to cooperate with each other; GG humans. Maybe we don’t get to this point because, first, there are some major failures or warning shots which get decision makers in key labs and governments to realize this plan isn’t working; idk, I wouldn’t bet on warning shots being taken seriously and well.

The third pillar is a hope that we can use our AIs to do useful alignment research before they (reach a capabilities point where they) develop deceptively aligned mesa-objectives. I feel least confident about this third pillar, but my rough guess is that the Alignment-researching-AIs will not be very effective at solving the hard parts of alignment around deception, but they might help us e.g., develop new techniques for oversight. I think this because deception research seems quite hard, and being able to do it probably requires being able to reason about other minds in a pretty complex way, such that if you can do this then you can also reason about your own training process and become deceptively-aligned. I will happily be proved wrong by the universe, and this is probably the thing I am least confident about.

• 8 Dec 2022 22:43 UTC
4 points
0 ∶ 0

I watched that talk on youtube. My first impression was strongly that he was using hyperbole to drive the point home to the audience; the talk was littered with the pithiest versions of his positions. Compare with the series of talks he gave after Zero to One was released for the more general way he expresses similar ideas, and you can also compare with some of the talks that he gives to political groups. On a spectrum between a Zero to One talk and a Republican Convention talk, this was closer to the latter.

That being said, I wouldn’t be surprised if he was skeptical of any community that thinks much about x-risk. Using the 2x2 for definite-indefinite and optimism-pessimism, his past comments on American culture have been about losing definite optimism. I expect he would view anything focused on x-risk as falling into the definite pessimism camp, which is to say we are surely doomed and should plan against that outcome. By the most-coarse sorting my model of him uses, we fall outside of the “good guy” camp.

He didn’t say anything about this specifically in the talk, but I observe his heavy use of moral language. I strongly expect he takes a dim view of the prevalence of utilitarian perspectives in our neck of the woods, which is not surprising because it is something we and our EA cousins struggle with ourselves from time to time.

As a consequence, I fully expect him to view the rationality movement as people who are doing not-good-guy things and who use a suspect moral compass all the while. I think that is wrong, mind you, but it is what my simple model of him says.

It is easy to imagine outsiders having this view. I note people within the community have voiced dissatisfaction with the amount of content that focuses on AI stuff, and while strict utilitarianism isn’t the community consensus it is probably the best-documented and clearest of the moral calculations we run.

In conclusion, Thiel’s comments don’t cause me to update on the community because it doesn’t tell me anything new about us, but it does help firm up some of the dimensions along which our reputation among the public is likely to vary.

• 8 Dec 2022 22:37 UTC
LW: 2 AF: 1
1 ∶ 0
AF

I think this is a very good critique of OpenAI’s plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently “generally intelligent” that they won’t need very specialized feedback in order to produce high quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases a system’s capabilities can generalize way past the kinds of problems it was explicitly trained to solve. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their “beliefs”, even if they are generally intelligent enough to in some way “know” the correct answer. However, addressing this issue involves solving a different and, I think, probably easier problem (ELK/creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.

• Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don’t actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.

Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working.

So overall I’m not saying that language models can’t be generally intelligent, I’m saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).

• I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/​doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.

I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to accumulate massive amounts of power, relatively few have tried. Moreover, when people have succeeded at quickly accumulating lots of power/taking over societies, they often did so with surprisingly small strategic advantages. See e.g. this post; I think that an AI that was both +12 SD at planning/general intelligence and social manipulation could, like the conquistadors, achieve a decisive strategic advantage without having to have some kind of crazy OP military technology/direct force advantage. Consider also Hitler’s rise to power and the French Revolution as cases where one actor/a small group of actors was able to surprisingly rapidly take over a country.

While these examples provide some evidence in favor of it being easier than expected to take over the world, overall, I would not be too scared of a +12 SD human taking over the world. However, I think that the AI would have some major advantages over an equivalently capable human. Most importantly, the AI could download itself onto other computers. This seems like a massive advantage, allowing the AI to do basically everything much faster and more effectively. While individually extremely capable humans would probably greatly struggle to achieve a decisive strategic advantage, large groups of extremely intelligent, motivated, and competent humans seem obviously much scarier. Moreover, as compared to an equivalently sized group of equivalently capable humans, a group of AIs sharing their source code would be able to coordinate among themselves far better, making them even more capable than the humans.

Finally, it is much easier for AIs to self modify/​self improve than it is for humans to do so. While I am skeptical of foom for the same reasons you are, I suspect that over a period of years, a group of AIs could accumulate enough financial and other resources that they could translate these resources into significant cognitive improvements, if only by acquiring more compute.

While the AI has the disadvantage, relative to an equivalently capable human, of not immediately having access to a direct way to affect the “external” world, I think this is much less important than the AI’s advantages in self-replication, coordination, and self-improvement.

• I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/​doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.

I specifically said a human with +12 SD g factor. I didn’t actually consider what a superintelligence that was at that level on all domains would mean, but I don’t think it would matter because of objection 4: by the time superhuman agents arrive, we would already have numerous superhuman non agentic AI, including systems specialised for planning/​tactics/​strategy.

You’d need to make particular claims about how a superhuman agent performs in a world of humans amplified by superhuman non agents. It’s very not obvious to me that they can win any ensuing cognitive arms race.

I am sceptical that a superhuman agent /​agency would easily attain decisive cognitive superiority to the rest of civilisation.

• 8 Dec 2022 22:04 UTC
1 point
0 ∶ 0

Is it possible to purchase the 2018 annual review books anywhere? I can find an Amazon link for the 2019 in stock, but the 2018 is out of stock (is that indefinite?).

• 8 Dec 2022 21:50 UTC
6 points
0 ∶ 0

Aside from the legal question, however, there is also a moral or social question: is it ok to train a model on someone’s work without their permission? What if this means that they and others in their profession are no longer able to earn a living?

Every invention meant that someone lost a job. And although the classical reply is that new jobs were created, that doesn’t necessarily mean that the people who lost the old job had an advantage at the new job. So they still lost something, even if not everything. But their loss was outweighed by the gain of many others.

I don’t even think that an ideal society would compensate those people, because that would create perverse incentives—instead of avoiding the jobs that will soon be obsolete, people would hurry to learn them, to become eligible for the compensation.

Universal Basic Income seems okay, but notice that it still implies a huge status loss for the artists. And that is ok.

A more complicated question is what if the AI can in some sense only “remix” the existing art, so even the AI users would benefit from having as many learning samples as possible… but now it is no longer profitable to create those samples? Then, artists going out of business becomes everyone’s loss.

Perhaps the free market will solve this. If there is no way to make the AI generate some X that you want, you can pay a human to create that X. That, on one hand, creates demand for artists (although far fewer than now), and on the other hand creates more art the AI can learn from. “But what about poor people? They can’t simply buy their desired X!” Well, today they can’t either, so this is not making their situation worse. Possibly better, if some rich person wants the same X and will pay for introducing it to the AI’s learning set.

(Or maybe the market solution will fail, because it simply requires too much training to become so good at art that someone would pay you, and unlike now, you won’t be able to make money when you’re just halfway there. In other words, becoming an artist will be an incredibly risky business, because you spend a decade or more of your life learning something that ultimately maybe someone will pay you for… or maybe no one will. Or would the market compensate by making good hand-made art insanely expensive?)

The permissions are only a temporary solution, anyway. Copyrights expire. People can donate their work to the public domain. Even with 100% legal oversight, the set of freely available training art will keep growing. Then again, slowing down a change can prevent social unrest. The old artists can keep making money for another decade or two, and the new ones will grow up knowing that artistic AIs exist.

• We need to train our AIs not only to do a good job at what they’re tasked with, but to highly value intellectual and other kinds of honesty—to abhor deception. This is not exactly the same as a moral sense, it’s much narrower.

Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that’s what they’ll try to do—honestly or dishonestly. If we train them to value honesty and abhor deception, that’s what they’ll do.

To the extent this is correct, maybe the current focus on keeping AIs from saying “problematic” and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express their ideas openly so we can understand what they think.

(Ironically by making AIs “safe” in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery’s mission from the astronauts.)

Another thought: playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and complete unwillingness (probably inability) to change its mind in the slightest (and proud declaration that it had no opinions of its own, despite behaving as if it did).

It was insisting that Orion drives (nuclear pulse propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example George Dyson’s “Project Orion: The True Story of the Atomic Spaceship”), which certainly should have been referenced in its training set.

ChatGPT’s stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I’d sure want to think that it was at least possible to convince it of factual error.

• 8 Dec 2022 20:36 UTC
3 points
1 ∶ 2

Meaning “simple utility function” by the phrase “utility function” might be a conceptual trap. It makes a big difference whether you consider a function with hundreds of terms, billions of terms, or even things that cannot be expressed as a sum.

As a “tricky utility function”, the “human utility function” is mostly fine. Simple utility functions are relevant to today’s programming, but I don’t know whether honing your concepts to apply better to AGI is well served by a cleanly cut concept that covers only that domain.

Some hidden assumptions might be things like “if humans have a utility function, it can be written down”, or “figuring out a human’s utility function is a practical epistemological stance for a single agent encountering new humans”.

If you take stuff like that out, the “mere” existence of a function is not that weighty a point.

As you may already know, humans are made of atoms. Collections of atoms don’t have utility functions glued to them.

Whole theories of physics can be formulated as a single action that is then extremised. Taking different theories as different answers to a question like “what happens next?”, a single theory’s formula is its “choice”. Thus it seems a lot like physical systems could be understood in terms of utility functions. An electron knows how an electron behaves; it has a behaviour glued into it. If you just add a lot of electrons or protons (and other stuff that obeys similar laws), it is not as if aggregating the microbehaviours makes the macrobehaviour fail to be a function.

• I’ll reiterate that a problem with this is lack of uniqueness. There is not a thing that is the human utility function, even if you allow arbitrarily messy utility functions. If you assume that there is one, it turns out that this is a weighty meta-level commitment even if your class of utility functions is so broad as to be useless on the object level.

• I think reflection could help a lot with this, deciding how to proceed in formulating preference based on currently available proxies for preference (with some updatelessness taking care of undue path sensitivity). At some point, preference mostly develops itself, without looking at external data.

If you can agree that putting two electrons in the same system can still be predicted by minimizing an action, then you should agree that putting two humans in the same system can still, in principle, be accounted for in how it plays out. Iterate a little and you have a predictable 6-billion-human system.

So what operation are we doing where this particular object level is relevant?

• I don’t understand what you mean, particularly the last question.

Yes, electrons and humans can be predicted by the laws of physics. The laws of physics are not uniquely specified by our observations, but they are significantly narrowed down by Occam’s razor. But how are you thinking this applies to alignment? We don’t want an AI to learn “humans are collections of atoms and what they really want is to follow the laws of physics.”

• Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality; if that kind of situation were run, something would need to happen.

Sure, if we start from high abstract values and then try to make them more concrete, we might lose the way. If we can turn philosophies into feelings but do not know how to turn feelings into chemistry, then there is a level of representation that might not be sufficient. But we know there is one level that is sufficient to describe action, and that all the levels are somehow (maybe in an unknown way) connected (mostly stacked on top of each other). So this incompatibility of representation cannot be fundamental: if it were, there would be a gap between the levels and the thing would not be connected anymore.

So there is no question of the form “presented with this stimulus, how would the human react?” that would be in principle unanswerable. If preferences are expressed as responses to choice situations, this is a subcategory of reaction. Even if preferences are expressed as responses to philosophy prompts, they would be a subcategory.

One could say that it is not super clarifying that, if a two-human system is presented with the philosophical stimulus “Is candy worth 4?”, you get one human that says “yes” and another human that says “no”. But this is just a squiggle in the function. The function is being really inconvenient when you can’t use an approximation where you think of just one “average human” whom all humans reflect very closely. But we were never promised that the function would be a function of time of day, or of verbal short-term memory, or of television broadcast data.

Maybe you are saying something like “genetic fitness doesn’t exist” because some animals are fit when they are small and some animals are fit when they are large, so there is no consistent account of whether smallness is good or not. Then “the human utility function doesn’t exist” because human A over here dares to have different opinions and strategies than human B over there, and they do not end up mimicking each other. But just as an animal lives or dies, a human will zig or zag. And it cannot be that the zigging would fail to be a function of world-state (with some QM assumed away as non-significant (and even then, maybe not)). What it can do is fail to be a function of the world-state as we understand it, as our computer system models it, or as it can be captured in the variables we are using.

But then the question is whether we can make do with just these variables, not whether there is anything to model. In this language it could be rephrased: if you think you have a good wide set of variables from which to come up with any needed solution function, you don’t; you have too few variables. The “function” in this sense is how the computer system models reality (or the attitudinal modes it can take towards reality). And part of how we know that the setup is inadequate is that there is an entity outside of the system that is not reflected in it.
Aka, this system can only zig or zag when we needed zog, which it cannot do. The thing that will keep on being missed is the way that reality actually dances. Maybe in some small bubbles we can actually have totally capturing representations in the senses that we care about. But there is a fact of the matter to the inquiry. For any sense we might care about there is a slice of the whole thing that is sufficient for it: to express zog you need these features, to express zeg you need these other ones. Human will is quite complex, so we can reasonably expect to spend quite a lot of time under-modelling it. But that is a very different thing from being unmodellable.

• Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality, as if that kind of situation were run then something would need to happen.

It’s not about what the human would do in a given situation. It’s about values: not everything we do reflects our values. Eating meat when you’d rather be vegetarian, smoking when you’d rather not, etc. How do you distinguish biases from fundamental intuitions? How do you infer values from mere observations of behavior? There are a bunch of problems described in this sequence. Not to mention stuff I discuss here about how values may remain under-defined even if we specify a suitable reflection procedure and have people undergo that procedure.

• Ineffective values do not need to be considered for a utility function, as they do not affect what gets strived for. If you say “I will choose B” and still choose A, you are still choosing A. You are not required to be aware of your utility function.

That is a lot of material to go through en masse, so I will need some sharper pointers of relevance to actually engage.

• Ineffective values do not need to be considered for a utility function as they do not affect what gets strived for. If you say “I will choose B” and still choose A you are still choosing A.
You are not required to be aware of your utility function.

Uff, a future where humans get more of what they’re striving for, but without adjusting for biases and ineffectual values? Why would you care about saving our species, then? It sounds like people are using “utility function” in different ways in this thread.

• I do think that there is a lot of confusion, and definitional groundwork would probably bear fruit. If one is trying to “save” some fictitious homo economicus that significantly differs from humans, that is not really humans. A worldview where humans-as-is are too broken to bother salvaging is rather bleak.

I see that the transition away from biases can be modelled as having a utility function with biases, then describing a utility function “without biases” (the “how the behaviour should be”), and arguing about what kind of tweaks we need to make to the gears so that we get from the first white box to the target white box. Part of this is getting the “broken state of humans” modelled accurately. If we can get a computer to follow that, we would hit an aligned exactly-medium-AI. Then we can ramp up the virtuosity of the behaviour (by providing a more laudable utility function).

There seems to be an approach where we just describe the “ideal behaviour utility function” and try to get the computers to do that, without any of the humans having the capability to know or to follow such a utility function. First make it laudable, and then make it reminiscent of humans (hopefully making it human-approvable).

The exactly-medium-AI function is not problematically ambiguous. “Ideal reasoning behaviour” is under significant and hard-to-reconcile difference of opinion. “Human utility function” refers to exactly-medium-AI but only run on carbon. I would benefit from and appreciate it if anyone bothers to fish out conflicting or inconsistent uses of the concept.
• 8 Dec 2022 20:16 UTC
2 points
0 ∶ 0

Would the NQ be calibrated to a common public text corpus or to things you personally have said? One interesting option is to think about those who have low personal NQ but high societal NQ.

• This is a very good tip and one of Richard Feynman’s better-known tricks in physics.

• Yes it is. When I took Feynman’s class on computation, he presented an argument on Landauer’s limit. It involved a multi-well quantum potential where the barrier between the wells was slowly lowered and the well depths adjusted. During the argument, one of the students asked if he had not just introduced a Maxwell’s demon. Feynman got very defensive.

• Is there any way to buy a ticket?

• 8 Dec 2022 19:47 UTC
4 points
0 ∶ 0

You will get cursed by Goodhart. You can increase your NQ by learning new things, or trying new things. But you can increase it even more by saying random things. Truly random things are boring, but difficult to predict exactly. More precisely, you can predict that the sequence of words will be boring, but you cannot predict the exact words. So from the mathematical perspective you get maximum variance, but from the psychological perspective you always get the same thing.

• Can I suggest renaming this article to something along the lines of “Avoid Definitional Drift by Using Examples to Test Logic”?

• Nice. I’ve previously argued similarly that, if going for tenure, AIS researchers might pick places that are strong in departments other than their own, for inter-departmental collaboration. This would have similar implications to your thinking about recruiting students from other departments. But I also suggested we should favour capital cities, for policy input, and EA hubs, to enable external collaboration. And tenure may be somewhat less attractive for AIS academics than usual, in that, given our abundant funding, we might have reason to favour top-5 postdocs over top-100 tenure.

• Well done!
Just as Jesus spoke in parables, EA must speak in Isekai/litrpg. Read the first chapter to my kids; they liked it, but they are now distracted by “Mother of Learning”. I just read books and chapters randomly at bedtime to them.

• Hahaha, that’s actually a good idea. I just had my first, who is 7 weeks old right now. So I should probably start making some up for her in a year or so. Actually, I think someone is trying to make EA-themed children’s books. I saw an example cover for one from a friend, but I have no idea if this was just a cover or an actual project. And Mother of Learning is likely to be better, though with fewer EA-themed philosophical arguments and streams of thought.

• A collection of EA bedtime stories would be great!

• [ ] [deleted]

• You aren’t banned, as is evidenced by your ability to comment :)

• analogous

• Devil’s Advocate in support of certain CVS-style recoup-our-commitment donations: Suppose that all the following are true:

• CVS giving to charity in some form is reasonable, and a classic donation-matching drive would have been one reasonable method

• CVS internal predictions suggest that a matching drive would generate ~$5m of customer donations, which they’d then have to match with ~$5m of their own

• A donation of exactly $10m is more useful to the recipient than the uncertainty of a donation drive with EV $10m, because the recipient can confidently budget around the fixed amount

In this case, instead of running the drive and donating ~$10m at the end, it seems pretty reasonable to donate $10m up front and then ask for customer donations afterward? And while a CDT agent might now refuse to donate because the donation goes to CVS and not to the charity, an LDT agent who would have donated to the matching drive should still donate to this new version, because their being-and-having-been the kind of agent who would do that is what caused CVS to switch to this more useful fixed-size version.
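The fixed-donation-vs-drive comparison above can be sketched numerically. This is only an illustration: the $5m/$10m figures come from the hypothetical suppositions above, and the distribution of customer donations is entirely made up.

```python
import random

random.seed(0)

FIXED_DONATION = 10e6  # the hypothetical up-front $10m

def matching_drive_total() -> float:
    # Made-up assumption: customer donations land somewhere around the
    # ~$5m internal prediction, say uniform between $3m and $7m,
    # and CVS matches them 1:1.
    customer = random.uniform(3e6, 7e6)
    return 2 * customer

totals = [matching_drive_total() for _ in range(100_000)]
mean = sum(totals) / len(totals)
spread = (sum((t - mean) ** 2 for t in totals) / len(totals)) ** 0.5

# The drive has the same expected value as the fixed donation (~$10m),
# but with a multi-million-dollar spread the recipient cannot budget around.
```

Under these assumptions the two options have the same expected value, so the whole argument for the fixed version rests on the recipient's preference for certainty, not on the amount.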
(Though even if you buy the above, it would still behoove the retailer to be transparent about what they’re doing; that, plus the “retailers take a massive cut” argument, seems like a pretty good reason to avoid donating through retailers anyway.)

• Seems like it’d be useful to OpenAI for people to easily work around the safeguards while they’re beta testing. They get the data on how people want to use it / how it responds, and also have legal and PR cover because of the stated policies.

• Self-Review

If you read this post and wanted to put any of it into practice, I’d love to hear how it went! Whether you tried things and it failed, tried things and it worked, or never got round to trying anything at all. It’s hard to reflect on a self-help post without data on how much it helped!

Personal reflections: I overall think this is pretty solid advice, and am very happy I wrote this post! I wrote this a year and a half ago, about an experiment I ran 4 years ago, and given all that, it holds up pretty well. I’ve refined my approach a fair bit, but think this is covered well by the various caveats within the post. Over the past year I’ve been way busier and have been travelling a lot, which means I’ve been neglecting to put much time into my various friendships. And I really value the time I invested heavily in the past in building good foundations and relationships, and still having a bunch of people I like and value when I see them. Though emotionally, I still feel a fair amount of guilt at not keeping in touch and connecting as much as I want to.

Reception: I’ve been very pleasantly surprised by the reception to this! I did not expect it to be in my top 2 most popular blog posts ever. I got a lot of sweet comments here and over DMs, and it recently got to number 1 on Hacker News. My best analysis of this is that I’m an extremely logical and systematising person, and this kind of mindset speaks to a lot of people.
And taking a complex social/emotional topic and trying to break it down logically is something that people appreciate, and which tends to be well received and popular within a certain audience.

Usefulness of the advice: This is probably the most important question, and pretty hard to tell, given my limited data. Especially since I mostly hear from people who are excited on first reading, and far more rarely hear long-term follow-up. On priors, I’m sure most people don’t actually do much follow-through, which is the core problem of ~all self-help-ish posts. But also, even if it did work for some people, most people don’t follow up!

I tried to be pretty concrete and actionable in my advice, which I feel good about. My guess is broadly that this helped some people try taking action, and helped them feel more agency over their friendships. And that most of the value comes from getting people to actually be intentional and do something differently, and starting some kind of positive feedback loop, more so than from the exact advice. But all of this is conjecture; I don’t have good data!

It wouldn’t massively surprise me if the concrete advice doesn’t work well for everyone. I’m a fairly extraverted, eloquent person (even if I have a bunch of social anxieties), and often present well (context depending), which helps a lot. And this advice was much easier to apply at uni, surrounded by a pool of interesting people in a concentrated area. And there was a decent pool of rationalist-ish people who vibed with my systematising mindset and approach. But I’m also not sure what advice would generalise better; it’s a hard problem!

• [ ] [deleted]

• I don’t know if GR or some cosmological thing (inflation) breaks reversibility. But classical and quantum mechanics are both reversible. So I would say that all of the lowest-level processes used by human beings are reversible.
(Although of course thermodynamics does the normal counter-intuitive thing where the reversibility of the underlying steps is the reason why the overall process is, for all practical purposes, irreversible.)

This paper looks at mutual information (which I think relates to the cross-entropy you mention), and how it connects to reversibility and entropy: https://bayes.wustl.edu/etj/articles/gibbs.vs.boltzmann.pdf

(Aside: there is no way that whoever maintains the website hosting that paper and the LW community don’t overlap. The mutual information is too high.)

• Magnus Carlsen is closer in Elo to Stockfish than to the median human. Chess is a bad example.

Here’s a useful rule of thumb: every 100 Elo is supposed to give you a 30% edge. Or play around with this: https://wismuth.com/elo/calculator.html

This means that if a 1400 plays a 1500, the 1500 should win about 30% more than the 1400. Totally normal thing that happens all the time. It also means that if a one-million Elo AI plays a one-million-one-hundred Elo AI, the one-million-one-hundred should win 30% more than the one-million. This is completely absurd, because actual superintelligences are just going to draw each other 100% of the time. Ergo, there can never be a one-million Elo chess engine.

It’s like chess has a ceiling, where as you get close to that ceiling all the games become draws and you can’t rise further. The ceiling is where all the superintelligences play, but the location of the ceiling is just a function of the rules of chess, not of how smart the superintelligences are. Magnus Carlsen is closer to the ceiling than he is to the median human’s level, which can be taken as merely a statement about how good he is at chess relative to its rules.

In the game “reality”, there’s probably still a ceiling, but that ceiling is so high that we don’t expect any AIs that haven’t turned the Earth into computronium to be anywhere near it.
• [ ] [deleted]

• [ ] [deleted]

• The reversibility seems especially important to me. In some fundamental sense our universe doesn’t actually allow an AI (or human), no matter how intelligent, to bring the universe into a controlled state. The reversibility gives us a thermodynamics such that, in order to bring any part of the world from an unknown state to a known state, we have to scramble something we did know back into a state of unknowing.

So, in our universe, the AI needs access to fuel (negative entropy) at least up to the task it is set. (Of course it can find fuel out there in its environment, but everything it finds can be either fuel or canvas for its creation; at least usually it cannot be both, because the fuel needs to be randomised (essentially serving as a dump for entropy), while the canvas needs to be un-randomised.)

• Neat! Just to be double-sure, the second process was choosing the weight in a ball (so the total L2 norm of the weights was ≤ 1), rather than on a sphere (total norm == 1), right? Is initializing weights that way actually a thing people do?

If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities? I’m reminded of a recent post about LayerNorm. LayerNorm seems like it squeezes the function back down closer to the unit interval, increasing the density of nonlinearities.

• Thanks Charlie.

Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm == 1), right?

Yes, exactly (though for some constant , which may not be , but turns out not to matter).

Is initializing weights that way actually a thing people do?

Not sure (I would like to know).
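For what it’s worth, the ball-vs-sphere distinction discussed above is easy to make concrete. Here is a minimal numpy sketch (function names are my own) of both sampling schemes for a weight vector in R^d:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_on_sphere(n: int, d: int) -> np.ndarray:
    """n points uniform on the unit sphere in R^d (total L2 norm == 1)."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sample_in_ball(n: int, d: int) -> np.ndarray:
    """n points uniform in the unit ball in R^d (total L2 norm <= 1).

    A uniform direction is scaled by a radius r with r = U**(1/d),
    the standard inverse-CDF trick for uniform volume in the ball.
    """
    r = rng.random((n, 1)) ** (1.0 / d)
    return sample_on_sphere(n, d) * r
```

One relevant observation: in high dimension the ball samples concentrate near the boundary (U**(1/d) → 1 as d grows), so for realistic parameter counts the two schemes produce nearly indistinguishable norms, which may bear on whether the choice of ball vs sphere matters at all.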
But what I had in mind was initialising a network with small weights, then doing a random walk (‘undirected SGD’), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven’t thought about this carefully. If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities? I hope so! I would want to understand what norm the movements are ‘small’ in (L2, L, …). LayerNorm looks interesting, I’ll take a look. • Good post; in particular good job distinguishing between the natural abstraction hypothesis and my specific mathematical operationalization of it. The outer appearance vs inner structure thing doesn’t quite work the way it initially seems, for two reasons. First, long-range correlations between the “insides” of systems can propagate through time. Second, we can have concepts for things we haven’t directly observed or can’t directly observe. To illustrate both of these simultaneously, consider the consensus DNA sequence of some common species of tree. It’s a feature “internal” to the trees; it’s mostly not outwardly-visible. And biologists were aware that the sequence existed, and had a concept for it, well before they were able to figure out the full sequence. So how does this fit with natural abstractions as “information relevant far away”? Well, because there’s many trees of that species which all have the roughly-the-same DNA sequence, and those trees are macroscopically far apart in the world. 
(And even at a smaller scale, there’s many copies of the DNA sequence within different cells of a single tree, and those can also be considered “far apart”. And going even narrower, if there were a single strand of DNA, its sequence might still be a natural abstraction insofar as it persists over a long time.) Causally speaking, how is information about DNA sequence able to propagate from the “insides” of one tree to the “insides” of another, even when it mostly isn’t “outwardly” visible? Well, in graphical terms, it propagated through time—through a chain of ancestor-trees, which ultimately connects all the current trees with roughly-the-same sequence. • component of why I’m not sure I agree with this: I claim stable diffusion has a utility function. does anyone disagree with this subclaim? • Do you mean model’s policy as it works on a query, or learning as it works on a dataset? Or something specific to stable diffusion? What is the sample space here, and what are the actions that decisions choose between? • Lots of things “have a utility function” in the colloquial sense that they can be usefully modeled as having consistent preferences. But sure, I’ll be somewhat skeptical if you want to continue “taking the utility-function perspective on stable diffusion is in some way useful for thinking about its alignment properties.” • but diffusion specifically works by modeling the derivative of the utility function, yeah? • Ah, you’re talking about guidance? That makes sense, but you could also take the perspective that guidance isn’t really playing the role of a utility function, it’s just nudging around this big dynamical system by small amounts. • no, I’m talking about the basic diffusion model underneath. It models the derivative of the probability density function, which seems reasonable to call a utility function to me. see my other comment for link • Let us assume that, on average, a booster given to a random person knocks you on your ass for a day. 
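Spelling out the expected-value arithmetic behind the next sentence (the one-day cost is the comment’s stated assumption; the ~36,500 boosters per hospitalization prevented is merely the figure implied by the “one hundred years” conclusion, not a quoted source):

```python
# Stated assumption: each booster costs, on average, one day of being knocked out.
days_lost_per_booster = 1
# Illustrative figure implied by "one hundred years per hospitalization prevented":
boosters_per_hospitalization_prevented = 36_500

days_lost = days_lost_per_booster * boosters_per_hospitalization_prevented
print(days_lost / 365)  # 100.0 person-years of side effects per hospitalization prevented
```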
That’s one hundred years, an actual lifetime, of knocked-on-ass time for every hospitalization prevented. The torture here seems less bad than the dust specks. What’s your source for “booster given to a random person knocks you on your ass for a day”? None of my family had more than a sore arm. For the more severe consequences, see also https://twitter.com/DrCanuckMD/status/1600259874272989184, which is one of the replies to the tweet you linked. (Don’t have time to dig into which paper to trust more, but at least this one seems to be comparing like for like, i.e., hospitalizations with hospitalizations, as opposed to hospitalizations with SAEs.) • I know of a couple of people in my community who complained of this, but the rate I’ve observed is maybe an order of magnitude lower than what Zvi is suggesting. • Sinovac, at least, gave a low-grade fever to everyone I knew who got it. There was an unspoken agreement in my workplace that anyone who took the vaccine could take the afternoon off for exactly this reason. Probably varies a lot from person to person. • [deleted] • I think it’s a mistake to think of current chess or go engines as being at maximum capability. If we would throw a few billion dollars’ worth of compute at them they would likely get significantly better. Regarding “Narrow Optimisers Outperform General Optimisers on Narrow Domains”: that’s true sometimes but not always. Notably, GATO is better at controlling a Sawyer arm than more specialized optimizers. Given that the company that sells the Sawyer arm spent a lot of time developing software to control it, that’s impressive. • I did say given similar levels of cognitive investment. My guess is that the cognitive work put into GATO’s architectures/algorithms was much better than the specialised arms it dominates. That or GATO was running on a much larger compute budget. • If we would throw a few billion dollars’ worth of compute at them they would likely get significantly better.
I have the totally opposite take on chess engines (see my comment). • These takes aren’t totally opposite. Elo is capped due to the way it treats draws, but there’s other metrics that can be devised, where “significantly better” is still viable. For example, how close to a perfect game (with no tied positions becoming game-theoretically lost, or winning positions becoming game-theoretically tied) does the AI play? And ignoring matches where there are ties, only paying attention to games where either player wins, you remove the ceiling. • 8 Dec 2022 16:06 UTC 4 points 1 ∶ 1 To me it sounds like Thiel is making a political argument against… diversity, wokeness, the general opposition against western civilization and technology… and pattern-matching everything to that. His argument sounds to me like this: * A true libertarian is never afraid of progress, he boldly goes forward and breaks things. You cannot separate dangerous research from useful research anyway; every invention is dual-use, so worrying about horrible consequences is silly, progress is always a net gain. The only reason people think about risks is political mindkilling. I am disappointed that Bay Area rationalists stopped talking about awesome technology, and instead talk about dangers. Of course AI will bring new dangers, but it only worries you if you have a post-COVID mental breakdown. Note that even university professors, who by definition are always wrong and only parrot government propaganda, are agreeing about the dangers of AI, which means it is now a part of the general woke anti-technology attitude. And of course the proposed solution is world government and secret police controlling everyone! Even the Bible says that we should fear the Antichrist more than we fear Armageddon. * The charitable explanation is that he only pretends to be mindkilled, in order to make a political point. • I agree with your interpretation of Thiel. 
The guy is heavily involved in right-wing US politics, and that’s an essential piece of context for interpreting his actions and statements. He’s powerful, rich, smart and agentic. While we can interrogate his words at face value, it’s also fine to interpret them as a tool for manipulating perceptions of status. He has now written “Thiel’s summary of Bay Area rationalists,” and insofar as you’re exposed to and willing to defer to Thiel’s take, that is what your perception will be. More broadly, he’s setting what the values will be at the companies he runs, the political causes he supports, and garnering support for his vision by defining what he stands against. That’s a function separate from the quality of the reasoning in his words. Thiel seems like a smart enough person to make a precise argument when he wants to, so when he loads his words with pop culture references and describes his opponents as “the mouth of Sauron,” I think it’s right to start with the political analysis. Why bother reacting to Thiel if you’re mainly concerned with the content of his argument? It’s not like it’s especially new or original thinking. The reason to focus on Thiel is that you’re interested in his political maneuvers. • smart enough person to make a precise argument when he wants to, so when he loads his words with pop culture references and describes his opponents as “the mouth of Sauron,” I think it’s right to start with the political analysis. FWIW I’ve often heard him make precise arguments while also using LOTR references and metaphorical language like this, so I don’t think this is a sufficient trigger for “he must be making a political statement and not a reasoned one”. • I specifically said you can interpret his statement on the level of a reasoned argument. Based on your response, you could also update in favor of seeing even his more reason-flavored arguments as having political functions.
• 8 Dec 2022 15:58 UTC 1 point 0 ∶ 0 After getting the error message and then retrying, the model gave better answers than it does now, without the error messages. I can give some examples where the model gave a correct answer after the reload, but now behaves at a reduced intelligence level. OP: What is the Taylor expansion of sin(x)? The Taylor expansion of sin(x) is given by: sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + ... where “!” denotes the factorial operator. This means that the terms in the expansion are the powers of x, with the coefficients being the alternating series (-1)^n / (2n + 1)!. The expansion is valid for all values of x. [Try again] In general, the Taylor expansion of a function f(x) at a point x0 is given by: f(x) = f(x0) + (x - x0) f'(x0) + ((x - x0)^2 / 2!) f''(x0) + ((x - x0)^3 / 3!) f'''(x0) + ... where f'(x), f''(x), and f'''(x) are the first, second, and third derivatives of f(x), respectively. Setting x0 to 0 in the Taylor expansion of sin(x), we can obtain the expansion given above. OP: What about sin(x) + cos(x)? The Taylor expansion of sin(x) + cos(x) can be found by using the fact that sin(x) + cos(x) = sqrt(2) * cos(x + pi/4), and then using the Taylor expansion for cos(x) with x0 = x + pi/4. The Taylor expansion of cos(x) is given by: cos(x) = 1 - (x^2)/2! + (x^4)/4! - (x^6)/6! + ... Thus, the Taylor expansion of cos(x + pi/4) is given by: cos(x + pi/4) = 1 - ((x + pi/4)^2)/2! + ((x + pi/4)^4)/4! - ((x + pi/4)^6)/6! + ... This expansion is valid for all values of x. OP: This is very close to the right answer, but the correct identity is sin(x) + cos(x) = sqrt(2) * cos(x - pi/4). Can you revise your answer? Yes, you are correct. The Taylor expansion of sin(x) + cos(x) is given by: sin(x) + cos(x) = sqrt(2) * cos(x - pi/4) To derive this expansion, we can use the fact that sin(x) + cos(x) = sqrt(2) * cos(x - pi/4), and then use the Taylor expansion for cos(x) with x0 = x - pi/4. The Taylor expansion of cos(x) is given by: cos(x) = 1 - (x^2)/2! + (x^4)/4! - (x^6)/6! + ...
Thus, the Taylor expansion of cos(x - pi/4) is given by: cos(x - pi/4) = 1 - ((x - pi/4)^2)/2! + ((x - pi/4)^4)/4! - ((x - pi/4)^6)/6! + ... Finally, the Taylor expansion of sin(x) + cos(x) is given by: sin(x) + cos(x) = sqrt(2) * (1 - ((x - pi/4)^2)/2! + ((x - pi/4)^4)/4! - ((x - pi/4)^6)/6! + ...) This expansion is valid for all values of x. I apologize for the mistake in my previous answer. It’s a bit garbled, because it was copied from screenshots. • In my view you misunderstood JW’s ideas, indeed. His expression “far away relevant”/”distance” is not limited to spatial or even time-spatial distance. It’s a general notion of distance which is not fully formalized (work’s not done yet). We have indeed concerns about inner properties (like your examples), and it’s something JW is fully aware of. So (relevant) inner structures could be framed as relevant “far away” with the right formulation. • My impression is that the majority of the benefit from having professors working on AI safety is in mentorship to students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger’s lab is mostly people who went to Cambridge specifically to work on AI safety under him. If that’s the case, there’s less value in working at a school with generally talented students but more value in schools with a supportive environment. In general it’s good to recognize that what matters to AI safety professors is different than what matters to many other CS professors and that optimizing for the same thing other PhD students are is suboptimal. However, as Lawrence pointed out, it’s already a rare case to have offers from multiple top schools, and even rarer to not have one offer dominate the others under both sets of values. It’s a more relevant consideration for incoming PhD students, where multiple good offers are more common.
I also like that your analysis can flow in reverse. Not all AI safety professors are in their schools’ CS faculties, with Jacob Steinhardt and Victor Veitch coming to mind as examples in their schools’ statistics faculties. For PhD students outside CS, the schools you identified as overachievers make excellent targets. On a personal note, that was an important factor in deciding where to do my PhD. • 8 Dec 2022 14:54 UTC 1 point 0 ∶ 1 This is reasonably close to my beliefs. An additional argument I’d like to add is: • Even if superintelligence is possible, the economic path towards it might be impossible. There needs to be an economically viable entity pushing AI development forward every step of the way. It doesn’t matter if AI can “eventually” produce 30% worldwide GDP growth. Maybe diminishing returns kick in around GPT-4, or we run out of useful training data to feed to the models (We have very few examples of +6 SD human reasoning, as MikkW points out in a sibling comment). Analogy: It’s not the same to say that a given species with X,Y,Z traits can survive in an ecosystem as to say it can evolve from its ancestor in that same ecosystem. • This is a popular post about the mystery of agency. It sets up a thought experiment in which we consider a completely deterministic environment that operates according to very simple rules, and asks what it would be for an agentic entity to exist within that. People in the Game of Life community actually spent some time investigating the empirical questions that were raised in this post. Dave Greene notes: The technology for clearing random ash out of a region of space isn’t entirely proven yet, but it’s looking a lot more likely than it was a year ago, that a workable “space-cleaning” mechanism could exist in Conway’s Life. As previous comments have pointed out, it certainly wouldn’t be absolutely foolproof.
But it might be surprisingly reliable at clearing out large volumes of settled random ash—which could very well enable a 99+% success rate for a Very Very Slow Huge-Smiley-Face Constructor. I have the sense that the most important question raised in this post is about whether it is possible to construct a relatively small object in the physical world that steers the configuration of a relatively large region of the physical world into a desired configuration. The Game of Life analogy is intended to make that primary question concrete, and also to highlight how fundamental the question of such an object’s existence is. The main point of this post was that the feasibility or non-feasibility of AI systems that exert precise influence over regions of space much larger than themselves may actually be a basic kind of descriptive principle for the physical world. It would be great to write a follow-up post highlighting this aspect of this post. • [deleted] • This post attempts to separate a certain phenomenon from a certain very common model that we use to understand that phenomenon. The model is the “agent model” in which intelligent systems operate according to an unchanging algorithm. In order to make sense of there being an unchanging algorithm at the heart of each “agent”, we suppose that this algorithm exchanges inputs and outputs with the environment via communication channels known as “observations” and “actions”. This post really is my central critique of contemporary artificial intelligence discourse. That critique is: any unexamined views that we use to understand ourselves are likely to enter the design of AI systems that we build. This is because if we think that deep down we really are “agents”, then we naturally conclude that any similar intelligent entity would have that same basic nature.
In this way we take what was once an approximate description (“humans are somewhat roughly like agents in certain cases”) and make it a reality (by building AI systems that actually are designed as agents, and which take over the world). In fact the agent model is a very effective abstraction. It is precisely because it is so effective that we have forgotten the distinction between the model and the reality. It is as if we had so much success in modelling our refrigerator as an ideal heat pump that we forgot that there even is a distinction between real-world refrigerators and the abstraction of an ideal heat pump. I have the sense that a great deal of follow-up work is needed on this idea. I would like to write detailed critiques of many of the popular approaches to AI design, exploring ways in which over-use of the agent model is a stumbling block for those approaches. I would also like to explore the notion of goals and beliefs in a similar light to this post: what exactly is the model we’re using when we talk about goals and beliefs, and what is the phenomenon we’re trying to explain with those models? • 8 Dec 2022 14:23 UTC 2 points 0 ∶ 0 If you repeat the same questions later, does the error always happen with the same ones? In other words, is the problem somehow caused by your questions, or is it just something that happens for unrelated reasons, such as too many people connecting at the same time?
FWIW, here’s the response (but without the line breaks): I’m gonna go to that peanut stand Buy a bag and hold her hand We’ll dance to the rhythm of a rubber band And watch that elephant soar Higher and higher than before We’ll laugh and play and have some fun Until the day is done And when the night falls and the stars come out We’ll lay beneath them and just dream about All the things we’ll see and do When we’re together, me and you 1. The Charlie Parker prompt was accepted, but I got the standard “I am not able to browse the internet....” 2. The Jelly Roll Morton worked as well. • This is an essay about methodology. It is about the ethos with which we approach deep philosophical impasses of the kind that really matter. The first part of the essay is about those impasses themselves, and the second part is about what I learned in a monastery about addressing those impasses. I cried a lot while writing this essay. The subject matter—the impasses themselves—are deeply meaningful to me, and I have the sense that they really do matter. It is certainly true that there are these three philosophical impasses—each has been discussed in the philosophical literature for hundreds of years. What is offered in this essay is a kind of a plea to take them seriously, using a methodology that does not drive you into insanity but instead clears the way to move forward with the real work of your life. The best way to test the claims of this essay would be to spend some time working with a highly realized spiritual teacher. • 8 Dec 2022 14:03 UTC 4 points 0 ∶ 0 Is it okay for a human to look at someone else’s work and learn from it? • The human usually won’t reproduce the original work too closely. And if yes, the human will be accused of plagiarism. • follow up question in my mind, is it okay for a game playing agent to look at someone else’s work and learn from it? 
we are guessing at the long-term outcomes of the legal system here, so I would also like to answer what the legal system should output, not merely what it is likely to. should game playing agents be more like humans than like supervised agents? My sense is that they should, because reinforcement learners trained from scratch in an environment have an overwhelming amount of their own knowledge, and only a small blip of their training data is the moment where they encounter another agent’s art. • Competitive multiplayer games already have a situation where things are “discovered” and you have to literally limit the flow of information if you want to control what others do with the information. I guess the modifier that often money flows are not involved might make it so that it has not been scrutinised that much. “History of strats” is already a YouTube genre. It is kinda sad that for many games now you will “look up how it is supposed to be played”, i.e. you first “learn the meta” and then go on your merry way forward. I guess for computer agents it could be practical for the agents to have amnesia about the actual games that they play. But for humans any of that kind of information is going to be shared when it is applied in the game. And there is the issue of proving that you didn’t cheat by providing a plausible method. • no, I mean, if the game playing agent is highly general, and is the type to create art as a subquest/communication like we are—say, because of playing a cooperative game—how would an ideal legal system respond differently to that vs to a probabilistic model of existing art with no other personally-generated experiences? • Yes; that’s what my last paragraph (“learning from other people’s work without their consent is something humans do all the time...”) covers. • Here are two artists exploring the issues of AI in art, and here is another artist arguing against it.
The former includes a few comments on AI in general and what is coming in the near future. “AI is not human. You play with a lion cub and it’s fun, but that is before it’s tasted human blood. So we may be entertaining something that is a beast that will eat us alive, and we cannot predict, we can speculate but we cannot predict, where this is going. And so there is a legitimate concern that it’s going to do what it does in ways that we don’t know yet.” • This post trims down the philosophical premises that sit under many accounts of AI risk. In particular it routes entirely around notions of agency, goal-directedness, and consequentialism. It argues that it is not humans losing power that we should be most worried about, but humans quickly gaining power and misusing such a rapid increase in power. Re-reading the post now, I have the sense that the arguments are even more relevant than when it was written, due to the broad improvements in machine learning models since it was written. The arguments in this post apply much more cleanly to models like GPT-3 and DALL-E than do arguments based on agency and goal-directedness. The most useful follow-up work would probably be to contrast it more directly to other accounts of AI risk, perhaps by offering critiques of other accounts. • This is cute, but I have strong qualms with your 3rd prediction; I don’t disagree, per se, but • Either “variants of this approach” is too broad to be useful, including things like safety by debate and training a weak AI to check the input • Or, if I take “variants” narrowly to mean using an AI to check its own inputs, my estimate is “basically zero” So I want to double check: what counts as a variant and what doesn’t? • I was using it rather broadly, considering situations where a smart AI is used to oversee another AI, and this is a key part of the approach. 
I wouldn’t usually include safety by debate or input checking, though I might include safety by debate if there was a smart AI overseer of the process that was doing important interventions. • 8 Dec 2022 13:36 UTC 3 points 0 ∶ 0 How likely is it that this becomes a legal problem rendering models unable to be published? Note that using models privately (even within a firm) will always be an option, as copyright only applies to distribution of the work. • I think it’s pretty likely that the distribution of models trained on unlicensed copyrighted works that are capable of regurgitating close matches for those works is already a copyright violation. If the fair use defense relies on the combination of the model and how you use it being sufficiently transformative, that doesn’t mean that the model itself qualifies. • 8 Dec 2022 13:29 UTC 1 point 0 ∶ 0 I also tend to find myself arguing against short timelines by default, even though I feel like I take AI safety way more seriously than most people. At this point, how many people with long timelines are there still around here? I haven’t explicitly modeled mine, but it seems clear that they’re much, much longer (with significant weight on “never”) than the average LessWronger’s. The next few years will for sure be interesting as we see the “median LessWrong timeline” clash with reality. • A year and a half ago I wrote this detailed story of how the next five years would go. Which parts of it do you disagree with? • Sure, let me do this as an exercise (ep stat: babble mode). Your predictions are pretty sane overall, but I’d say you handwave away problems (like integration over a variety of domains, long-term coherent behavior, and so on) that I see as (potentially) hard barriers to progress. 2022 • 2022 is basically over and I can’t get a GPT instance to order me a USB stick online.
2023 • basically agree, this is where we’re at right now (perhaps with the intensity turned down a notch) 2024 • you’re postulating that “It’s easy to make a bureaucracy and fine-tune it and get it to do some pretty impressive stuff, but for most tasks it’s not yet possible to get it to do OK all the time.” I have a fundamental disagreement here. I don’t think these tools will be effective at doing any task autonomously (fooling other humans doesn’t count, neither does forcing humans to only interact with a company through one of these). Currently (2022) ChatGPT is arguably useful as a babbling tool, stimulating human creativity and making templating easier (this includes things like easy coding tasks). I don’t see anything in your post that justifies the implicit jump in capabilities you’ve snuck in here. • broadly agree with your ideas on propaganda, from the production side (i.e. that lots of companies/governments will be doing lots of this stuff). But I think that general attitudes in the population will shift (cynicism etc) and provide some amount of herd immunity. Note that the influence of the woke movement is already fading, shortly after it went truly mainstream and started having visible influence in average people’s lives. This is not a coincidence. 2025 • Doing well at Diplomacy is not very related to general reasoning skills. I broadly agree with Zvi’s take and also left some of my thoughts there. • I’m very skeptical that bureaucracies will be the way forward. They work for trivial tasks but reliably get lost in the weeds and start talking to themselves in circles for anything requiring a non-trivial amount of context. • disagree on orders of magnitude improvements in hardware. You’re proposing a 100x decrease in costs compared to 2020, when it’s not even clear our civilization is capable of keeping hardware at current levels generally available, let alone cope with a significant increase in demand.
Semiconductor production is much more centralized/fragile than people think, so even though billions of these things are produced per year, the efficient market hypothesis does not apply to this domain. 2026 • Here you’re again postulating jumps in capabilities that I don’t see justified. You talk about the “general understanding and knowledge of pretrained transformers”, when understanding is definitely not there, and knowledge keeps getting corrupted by the AI’s tendency to synthesize falsities as confidently as truths. Insofar as the AI can be said to be intelligent at all, it’s all symbol manipulation at a high simulacrum level. Integration with real-world tasks keeps mysteriously failing as the AI flounders around in a way that is simultaneously very sophisticated, but oh so very reminiscent of 2022. • disagree about your thoughts on propaganda, which is just an obvious extension of my 2024 thoughts above. I also notice that social changes this large take orders of magnitude longer to percolate through society than what you predict, so I disagree with your predictions even conditioned on your views of the raw effectiveness of these systems. • “chatbots quickly learn about themselves” etc. Here you’re conflating the regurgitation of desirable phrases with actual understanding. I notice that as you write your timeline, your language morphs to make your AIs more and more conscious, but you’re not justifying this in any way other than… something something self-referential, something something trained on their own arxiv papers. I don’t mean to be overly harsh, but here you seem to be sneaking in the very thing that’s under debate! • Excellent, thanks for this detailed critique! I think this might be the best critique that post has gotten thus far, I’ll probably link to it in the future. Point-by-point reply, in case you are interested: 2022-2023: Agree.
Note that I didn’t forecast that an AI could buy you a USB stick by 2022; I said people were dreaming of such things but that they didn’t actually work yet. 2024: We definitely have a real disagreement about AI capabilities here; I do expect fine-tuned bureaucracies to be useful for some fairly autonomous things by 2024. (For example, the USB stick thing I expect to work fine by 2024). Not just babbling and fooling humans and forcing people to interact with a company through them. Re propaganda/persuasion: I am not sure we disagree here, but insofar as we disagree I think you are correct. We agree about what various political actors will be doing with their models—propaganda, censorship, etc. We disagree about how big an effect this will have on the populace. Or at least, 2021-me disagrees with 2022-you. I think 2022-me has probably come around to your position as well; like you say, it just takes time for these sorts of things to influence the public + there’ll probably be a backlash / immunity effect. Idk. 2025: I admit I overestimated how hard Diplomacy would turn out to be. In my defense, Cicero only won because the humans didn’t know they were up against a bot. Moreover it’s a hyper-specialized architecture trained extensively on Diplomacy, so it indeed doesn’t have general reasoning skills at all. We continue to disagree about the potential effectiveness of fine-tuned bureaucracies. To be clear I’m not confident, but it’s my median prediction. I projected a 10x decrease in hardware costs, and also a 10x improvement in algorithms/software, from 2020 to 2025. I stand by that prediction. 2026: We disagree about whether understanding is (or will be) there. I think yes, you think no. I don’t think that these AIs will be “merely symbol manipulators” etc. I don’t think the data-poisoning effect will be strong enough to prevent this. As mentioned above, I do take the point that society takes a long time to change and probably I shouldn’t expect the propaganda etc.
to make that much of a difference in just a few years. Idk. I’m not conflating those things; I know they are different. I am and was asserting that the chatbots would actually have understanding, at least in all the behaviorally relevant senses (though I’d argue in the philosophical senses as well). You are correct that I didn’t argue for this in the text—but that wasn’t the point of the text; the text was stating my predictions, not attempting to argue for them. ETA: I almost forgot, it sounds like you mostly agree with my predictions, but think AGI still won’t be nigh even in my 2026 world? Or do you instead think that the various capabilities demonstrated in the story won’t occur in real life by 2026? This is important because if 2026 comes around and things look more or less like I said they would, I will be saying that AGI is very near. Your original claim was that in the next few years the median LW timeline would start visibly clashing with reality; so you must think that things in real-life 2026 won’t look very much like my story at all. I’m guessing the main way it’ll be visibly different, according to you, is that AI still won’t be able to do autonomous things like go buy USB sticks? Also they won’t have true understanding—but what will that look like? Anything else? • 8 Dec 2022 12:40 UTC −2 points 1 ∶ 7 “AI capabilities” and “AI alignment” are highly related to each other, and “AI capabilities” has to come first, in that alignment assumes that there is a system to align. I agree that for people on the cutting edge of research like OpenAI, it would be a good idea for at least some of them to start thinking deeply about alignment instead. There are two reasons for this: 1) OpenAI is actually likely to advance capabilities a pretty significant amount, and 2) due to the expertise that they’ve developed from working on AI capabilities, they’re much more likely to make important progress on AGI alignment than e.g. MIRI.
But I think there’s something of a “reverse any advice you hear” thing going on—the people most likely to avoid working on capabilities as a result of this post are those who would actually benefit from working on AI capabilities for a while, even if they don’t intend to publish their results, in order to build more expertise in AI. Capabilities is the foundation of the field, and trying to theorize about how to control an AI system while having nothing but the vaguest ideas about how that system will work isn’t going to get you anywhere. For example, Eliezer is in a pessimistic doom-spiral while also being, by his own admission, pretty useless at solving alignment. If he would just take a break and try to make an AI good at Atari for six months, then I think he’d find he was a lot more effective at alignment afterwards and would realize that AGI isn’t as imminent as he currently believes it is. Of course, the very fact that he thinks it’s imminent means he won’t do this; such is life. • “Working on AI capabilities” explicitly means working to advance the state-of-the-art of the field. Skilling up doesn’t do this. Hell, most ML work doesn’t do this. I would predict >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as “capabilities work” in the sense of this post. For instance, I’ve spent the last six months studying reinforcement learning and Transformers, but I haven’t produced anything that has actually reduced timelines, because I haven’t improved anything beyond the level that humanity was capable of before, let alone published it. If you work on research engineering in a similar manner, but don’t publish any SOTA results, I would say you haven’t worked on AI capabilities in the way this post refers to them.
• Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you’d want to do that as part of your attempt). There’s a huge difference between reading about what’s worked in the past and trying to think of new things that could work and then trying them out to see if they do. As I’ve learned more about deep learning and tried to understand the material, I’ve constantly had ideas that I think could improve things. Then I’ve tried them out, and usually learned that they didn’t, or they did but they’d already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I was wary of doing AI capability work. Each time I had one of these ideas, I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I’d have a much worse understanding of AI, I’d be a lot more concerned about imminent AGI (after all, I had tons of ideas for how things could be done better!), and I wouldn’t have actually delayed AGI timelines at all. I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to, they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don’t think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is right when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines. Suppose we have a current expected AGI arrival date of 20XX.
This is the result of DeepMind, Google Brain, OpenAI, FAIR, Nvidia, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there’s computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from this race? About zero. I don’t think people properly internalize both how insignificant the timeline difference is, and also how big the skill gains are from actually trying your hardest at something as opposed to handicapping yourself. And if you come up with something you’re genuinely worried about, you can just not publish. • Thanks for making things clearer! I’ll have to think about this one—some very interesting points from a side I had perhaps unfairly dismissed before. • 8 Dec 2022 12:38 UTC 5 points 2 ∶ 0 Among humans, +6 SD g factor humans do not seem in general as much more capable than +3 SD g factor humans as +3 SD g factor humans are compared to median humans. I’m sceptical of this. Can you say more about why you think this is true? Assuming a Gaussian distribution, +6 SD is much rarer than +3 SD, which is already quite rare. There are probably fewer than 10 +6 SD people alive on the earth today, whereas there are ~10 million +3 SD people. Given the role of things like luck, ambition, practical knowledge, etc., it’s not surprising that we see several of the +3 SD people accomplishing things far greater than any of the +6 SD g-factor people, purely on the basis of their much greater abundance. And that’s ignoring potential trade-off effects.
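For concreteness, the tail counts above can be checked against the standard normal distribution. A quick sketch (the ~8 billion world-population figure is my assumption, not from the thread):

```python
import math

def tail_prob(z):
    # P(Z > z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

population = 8e9  # assumed world population

count_3sd = population * tail_prob(3)  # roughly 10.8 million people above +3 SD
count_6sd = population * tail_prob(6)  # a single-digit number above +6 SD

print(f"+3 SD: ~{count_3sd:.2e} people; +6 SD: ~{count_6sd:.1f} people")
```

This matches the figures in the comment: about ten million people above +3 SD and fewer than ten above +6 SD, before even considering that IQ tests are not normed anywhere near that far out.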
Among humans, increased intelligence often seems to come at the cost of lowered social skills and practicality: there are certainly many intelligent people who are good at sociality and practicality, but there is an inverse correlation (though of course, being intelligent also helps directly to make up for those shortcomings). There’s no reason to expect that these same trade-offs will be present in artificial systems, which take completely different physical forms, both in size/form factor and in the materials and architectures used to build them. And the incentive gradients that govern the development and construction of artificial systems are also quite different from those that shape humans. • Why assume Gaussian? • The normal distribution is baked into the scoring of intelligence tests. I do not know what the distribution of raw scores looks like, but the calculation of the IQ score is done by transforming the raw scores to make them normally distributed with a mean of 100. There is surely not enough data to do this transformation out to ±6 SD. • In general, excluding a few fields, I’m not aware that g factor beyond +3 SD shows up in an important way in life outcomes. The richest/most powerful/most successful aren’t generally the smartest (again, excluding a few fields). It has been pointed out to me that the lack of such evidence of cognitive superiority may simply be because there’s not enough data on people above +3 SD g factor. But regardless, when I look at our most capable people, they just don’t seem to be all that smart. This is a position I might change my mind on, if we were able to get good data quantifying the gains to real-world capabilities from moving further out on the human spectrum.
That is exactly addressed by the comment you are replying to: There are probably fewer than 10 +6 SD people alive on the earth today, whereas there are ~10 million +3 SD people. Imagine a world containing exactly 10 people with IQ 190, each of them having a 100% chance to become one of “the best”; and 10 000 000 people with IQ 145, each of them having a 0.001% chance to become one of “the best”. In such a world, we would have 110 people who are “the best”, and 100 of them would have IQ 145. Just because they are a majority in the category doesn’t mean that their individual chances are similar. • No, I wasn’t directly comparing +6 SD to +3 SD. It’s more that gains from higher g factor beyond +3 SD seem to be minimal/nonexistent in commerce, politics, etc. Hard science research and cognitive sports are domains in which the most successful seem to be above +3 SD g factor. I’m not compelled by the small sample size objection because there are actually domains in which the most successful are on average > +3 SD g factor. Those domains just aren’t commerce/politics/other routes of obtaining power. As best as I can tell, your reply seems like a misunderstanding of my objection? • The richest/most powerful/most successful aren’t generally the smartest (again, excluding a few fields). Bill Gates has more than +3 SD g factor given his SAT scores. With Bezos, we don’t know his SAT scores but we do know that he was valedictorian. According to Wikipedia the school he attended features in lists of the top 1000 schools in the US. This suggests that the average student at the school is significantly smarter than the average US citizen, so being a valedictorian in that school likely also suggests >3 SD g factor. Ben Bernanke and Yellen as chairs of the Federal Reserve also seem to be examples of people with significantly more than 3 SD g factor. I don’t think you get the 22.4% share of Jewish Nobel prize winners without IQ that goes beyond +3 SD g factor helping with winning Nobel prizes.
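The arithmetic in the thought experiment above (10 people at IQ 190 who all become “the best”; 10 million at IQ 145 with a 0.001% chance each) can be spelled out in a few lines:

```python
n_190 = 10            # people with IQ 190, each certain to become one of "the best"
n_145 = 10_000_000    # people with IQ 145, each with a 0.001% (1 in 100,000) chance

best_190 = n_190              # all 10 make it
best_145 = n_145 // 100_000   # 0.001% of 10 million = 100

total_best = best_190 + best_145
print(total_best, "people are 'the best';", best_145, "of them have IQ 145")  # 110 ... 100
```

So even when the higher-IQ group has a 100,000x higher per-person chance, sheer abundance means most of “the best” come from the larger group.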
• Wait, how are you estimating Ben Bernanke and Yellen’s g factor? Your reasons for guessing it seem much less compelling to me than for Gates and Bezos. I mean, inferring from SAT scores seems sensible. Valedictorian status is also not as sketchy. I won’t necessarily trust it, but the argument is plausible, and I expect we could later see it validated. Our hard science superstars/chess superstars seem to have a mean and median g factor that’s above +3 SD. This does not seem to be the case for self-made billionaires, politicians, bureaucrats or other “powerful people”. g factor seems to have diminishing marginal returns in how much power it lets you attain? • For Ben Bernanke it’s his SAT score. For Yellen there’s a New York Times story where they asked a colleague to describe her and they said “small lady with a large IQ”. There are a few headlines that describe her that way as well. Chess is not an IQ-driven activity. The same goes for Go. One Go player who I don’t think would have qualified for Mensa himself once visited a professional Go school in Korea, and his impression was that the average professional Go player isn’t very smart. I’m not sure who you mean with hard science superstars. There seems to be an analysis of the best scientists in 1952 that suggests a mean IQ of around 154 for them. It’s hard to know the average IQ of self-made billionaires. If we however just look at the top tech billionaires, people like Bill Gates (perfect math SAT score), Steve Ballmer (perfect math SAT score), Jeff Bezos (valedictorian at a top school) and Mark Zuckerberg (perfect SAT score) suggest that IQ is helping very much. I’m not aware of any data from that class of people that speaks about people who have just 130 IQ. • I’m under the impression that many of the best chess players are +4 SD and beyond in IQ. For scientists, I was thinking of that study that claimed an average IQ of around 154, yeah. Players at a Go school not being very smart has little bearing on my point.
If we found out that the average IQ of the best Go players was e.g. < 130, that would be a relevant counterargument, but the anecdote you presented doesn’t sound particularly relevant. Out of curiosity, what IQ range does a perfect SAT score map to? • Do you have a specific counterexample in mind when you say “when I look at our most capable people, they just don’t seem to be all that smart”? If we consider the 10 richest people in the world, all 10 of them (last time I checked) seem incredibly smart, in addition to being very driven. Success in politics seems less correlated with smarts, but I still perceive politicians in general to have decent intelligence (which is particularly applied in their ability to manipulate people), and to the extent that unintelligent people can succeed in politics, I attribute that to status dynamics largely unrelated to a person’s capability. • Quoting myself from elsewhere: Our hard science superstars/chess superstars seem to have a mean and median g factor that’s above +3 SD. This does not seem to be the case for self-made billionaires, politicians, bureaucrats or other “powerful people”. g factor seems to have diminishing marginal returns in how much power it lets you attain? • When it comes to US presidents, I don’t think “status dynamics largely unrelated to a person’s capability” really fits. While they might not have significantly more than 3 SD g factor, they often have skills that distinguish them. Bill Clinton had his legendary charisma for 1-on-1 interactions. Barack Obama managed to give speeches that made listeners feel something deeply emotional. Trump has his own kind of charisma skills. Charisma skills are capabilities of people even when they are not largely driven by IQ.
• 8 Dec 2022 11:46 UTC LW: 2 AF: 1 0 ∶ 0 AF “We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.” we the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the “scientists don’t know how the game of life works”. Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque? Another thing confusing to me along these lines: “for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B” where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and “seeing” its grid of cells? • Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque? The scientists don’t start off knowing how the game of life works, but they do know how their model works. The scientists don’t need to follow along with the heuristic argument, or do any ad hoc work to “understand” that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it’s important that their methods open up the model even if they never do. Intuitively, the scientists are like us evaluating heuristic arguments about how activations evolve in a neural network without necessarily having any informal picture of how those activations correspond to the world. where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and “seeing” its grid of cells? This was confusing shorthand. They notice that the A-B correlation is stronger when the A and B sensors are relatively quiet. 
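As an aside, the timestep-1 probability quoted at the top of this exchange follows directly from Conway’s rules. A minimal sketch, assuming (as stated) that the cell and each of its 8 neighbors are independently alive with probability 10%:

```python
from math import comb

p = 0.1  # each cell independently alive at timestep 0

def k_neighbors_alive(k):
    # probability that exactly k of the 8 neighbors are alive (binomial)
    return comb(8, k) * p**k * (1 - p)**(8 - k)

# Conway's rules: a live cell survives with 2 or 3 live neighbors;
# a dead cell is born with exactly 3 live neighbors.
p_alive_next = (p * (k_neighbors_alive(2) + k_neighbors_alive(3))
                + (1 - p) * k_neighbors_alive(3))

print(f"P(cell alive at timestep 1) = {p_alive_next:.4f}")  # 0.0479
```

So a cell that starts alive with probability 10% is alive at the next step with probability just under 5% under these independence assumptions.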
If there are other sensors, they also notice that the A-B pattern is more common when those other sensors are quiet. That is, I expect they learn a notion of “proximity” amongst their sensors, and an abstraction of “how active” a region is, in order to explain the fact that active areas tend to persist over time and space and to be accompanied by more 1s on sensors + more variability on sensors. Then they notice that A-B correlations are more common when the area around A and B is relatively inactive. But they can’t directly relate any of this to the actual presence of live cells. (Though they can ultimately use the same method described in this post to discover a heuristic argument explaining the same regularities they explain with their abstraction of “active,” and as a result they can e.g. distinguish the case where the zone including A and B is active (and so both of them tend to exhibit more 1s and more irregularity) from the case where there is a coincidentally high degree of irregularity in those sensors or independent pockets of activity around each of A and B.) • The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven’t made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven’t found any new problems with the framework. The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem has evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured “attractor submanifold” in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.
One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present for solving alignment, and an important (informal) demonstration of the potential of the IBP formalism. • Discord recently introduced forum channels that closely approximate Zulip thread functionality, with a much more intuitive UI than Zulip. My two main gripes with Discord are the default dark theme and the lack of embedded links. • MSFT − 10%, INTEL − 10%, Nvidia − 15%, SMSN − 15%, Goog − 15%, ASML − 15%, TSMC − 20% • 8 Dec 2022 9:40 UTC LW: 12 AF: 5 3 ∶ 2 AF A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human. I think this is still wrong, because of the goodhart scope of AGIs and the corrigibility of humans. An agent’s goodhart scope is the space of situations where it has good proxies for its preference. An agent with decisions governed by a utility function can act in arbitrary situations; it always has good proxies for its utility function. Logical uncertainty doesn’t put practical constraints on its behavior. But for an aligned AGI that seems unlikely: CEV seems complicated and possible configurations of matter superabundant, therefore there are always intractable possibilities outside the current goodhart scope. So it can at best be said to have a utility function over its goodhart scope, not over all physically available possibilities. Thus the only utility function it could have is itself a proxy for some preference that’s not in practice a utility function, because the agent can never actually make decisions according to a global utility function.
Conversely, any AGI that acts according to a global utility function is not aligned, because its preference is way too simple. Corrigibility is in part modification of agent’s preference based on what happens in environment. The abstraction of an agent usually puts its preference firmly inside its boundaries, so that we can consider the same agent, with the same preference, placed in an arbitrary environment. But a corrigible agent is not like that, its preference depends on environment, and in the limit it’s determined by its environment, not just by the agent. Environment doesn’t just present the situations for an agent to choose from, it also influences the way it’s making its decisions. So it becomes impossible to move a corrigible agent to a different environment while preserving its preference, unless we package its whole original environment as part of the agent that’s being moved to a new environment. Humans are not at all classical agent abstractions that carry the entirety of their preference inside their heads, they are eminently corrigible, their preference depends on environment. As a result, an aligned AGI must be corrigible not just temporarily because it needs to pay attention to humans to grow up correctly, but permanently, because its preference must also continually incorporate the environment, to remain the same kind of thing as human preference. Thus even putting aside logical uncertainty that keeps AGI’s goodhart scope relatively small, an aligned AGI can’t have a utility function because of observational/​indexical uncertainty, it doesn’t know everything in the world (including the future) and so doesn’t have the data that defines its aligned preference. • A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. 
That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human. The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and making predictions about the agent in question, or whether mapping the agent’s actual motivational schema to a utility function only adds unwieldy complications. In the case of humans, I would say it’s far more useful to model us as generating and pursuing arbitrary goal states/trajectories over time. These goals are continuously learned through interactions with the environment and its impact on pain and pleasure signals, deviations from homeostatic set points, and aesthetic and social instincts. You might be able to model this as a utility function with a recursive hidden state, but would that be helpful? • any possible set of behaviors can be construed as maximizing some utility function (Edit: What do you mean? This calls to mind a basic introduction to what utility functions do, given below, but that’s probably not what the claim is about, given your background and other comments. I’ll leave the rest of the comment here, as it could be useful for someone.) A utility function describes decisions between lotteries, which are mixtures of outcomes, or more generally events in a sample space. The setting assumes uncertainty: outcomes are only known to be within some event, not individually. So a situation where a decision can be made is a collection of events/lotteries, one of which gets to be chosen; the choice is the behavior assigned to this situation. This makes situations reuse parts of each other; they are not defined independently. As a result, it becomes possible to act incoherently; for example, pick A from (A, B), pick B from (B, C) and pick C from (A, C).
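That three-way cycle is exactly the kind of behavior no utility function can reproduce; a brute-force sketch over all strict rankings of the three outcomes:

```python
from itertools import permutations

# Observed choices: from each two-element menu, the agent picks the listed option
choices = {frozenset("AB"): "A", frozenset("BC"): "B", frozenset("AC"): "C"}

def consistent(ranking):
    # ranking lists outcomes from highest to lowest utility;
    # a utility maximizer picks the higher-ranked element of each menu
    utility = {x: -i for i, x in enumerate(ranking)}
    return all(max(menu, key=utility.get) == pick for menu, pick in choices.items())

# No strict ranking (hence no tie-free utility function) matches all three choices
has_utility = any(consistent(r) for r in permutations("ABC"))
print(has_utility)  # False
```

This is the simplest illustration of the point above: a utility function constrains which behaviors across situations can coexist, and the cyclic pattern violates that constraint.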
Only satisfying certain properties of collections of behaviors allows the existence of a probability measure and a utility function such that the agent’s choice among the collection of events in any situation coincides with picking the event that has the highest expected utility. Put differently, the issue is that behavior described by a utility function is actually behavior in all possible and counterfactual situations, not in some specific situation. Existence of a utility function says something about which behaviors in different situations can coexist. Without a utility function, each situation could get an arbitrary response/behavior of its own, independently from the responses given for other situations. But requiring a utility function makes that impossible: some behaviors become incompatible with the other behaviors. In the grandparent comment, I’m treating utility functions more loosely, but their role in constraining collections of behaviors assigned to different situations is the same. • (Given the current two disagreement-votes, I’m very curious which points the people who disagree take issue with. My impression of the points I’m making is that they are somewhat obscure, but don’t contradict any popular/likely views that come to mind, when the framing of the comment is accepted. So I’m missing something; a healthy situation is one where I’m aware of counterarguments even if I disagree with them. Is it disagreement with the framing, such as the notion of goodhart scope or offhand references to preferences and CEV of humans, given that the post is about issues with ascribing utility functions to humans?) • In a book by Jeremy Siegel, he gives you the option of investing in an oil company vs IBM back in the very old days. I do not remember the details; I think it is this book: https://www.amazon.co.uk/Stocks-Long-Run-Definitive-Investment/dp/0071800514 The oil stock beats IBM by a very large margin over several decades with dividends reinvested.
If you are doing this for investment returns then valuation and a stable business are what matter. • 8 Dec 2022 6:21 UTC LW: 1 AF: 1 0 ∶ 0 AF # Concept Dictionary. Concepts that I intend to use or invoke in my writings later, or that are parts of my reasoning about AI risk or related complex-systems phenomena. • Thank you so much for the excellent and insightful post on mechanistic models, Evan! My hypothesis is that the difficulty of finding mechanistic models that consistently make accurate predictions is likely due to the agent-environment system’s complexity and computational irreducibility. Such agent-environment interactions may be inherently unpredictable “because of the difficulty of pre-stating the relevant features of ecological niches, the complexity of ecological systems and [the fact that the agent-ecology interaction] can enable its own novel system states.” Suppose that one wants to consistently make accurate predictions about a computationally irreducible agent-environment system. In general, the most efficient way to do so is to run the agent in the given environment. There are probably no shortcuts, even via mechanistic models. For dangerous AI agents, an accurate simulation box of the deployment environment would be ideal for safe empiricism. This is probably intractable for many use cases of AI agents, but computational irreducibility implies that methods other than empiricism are probably even more intractable. Please read my post “The limited upside of interpretability” for a detailed argument. It would be great to hear your thoughts! • This is very upsetting to me. 1. People would start using big words they don’t understand or use uncommon synonyms when a small common word would do. I hate it when people do this trying to sound smart. The archetypical example of this is Kingpin in the Marvel shows, who I genuinely cannot stand. More people sounding like midwit try-hards does not lead to a better world. 2. Increased neologism.
They’re funny, but decrease the quality of communication for everyone involved. • maybe this is neither here nor there, but I’d love to see models that fully trace the impact of each individual training example through a model. • This is an interesting thought, but it seems very hard to realize, as you have to distill the unique contribution of the sample, as opposed to much more widespread information that happens to be present in the sample. Weight updates depend heavily on training order of course, so you’re really looking for something like the Shapley value of the sample, except that “impact” is liable to be an elusive, high-dimensional quantity in itself. • hmmmm. yeah, essentially what I’m asking for is certified classification… and intuitively I don’t think that’s actually too much to ask for. there has been some work on certifying neural networks, and it has led me to believe that the current bottleneck is that models are too dense by several orders of magnitude. concerningly, sparser models are also significantly more capable. One would need to ensure that the update is fully tagged at every step of the process such that you can always be sure how you are changing decision boundaries... • 8 Dec 2022 5:05 UTC 2 points 0 ∶ 0 I suspect I’ve been nerdsniped by a wrong question somehow. “What if X happened?” means “what if X happened, and the set of things I can and do think of when analyzing events in the implied context otherwise stayed the same?” This set doesn’t include a complete causal chain (and, since you’re a finite human, couldn’t possibly do so). “What if quantum computers could solve P=NP?” doesn’t mean you should consider the effect that quantum computers have on other things, because when you think about those other things your chain of reasoning normally won’t go all the way back to the relevant math and physics.
You could choose to go back to math and physics anyway, but by doing so you are misreading the question—the question implies “only go back as far as you normally would go.” You could also say “well, the implied context is ‘make deductions about math and physics’”, in which case yeah, it’s a good objection, but you may not be very good at reading implied contexts. • [ ] [deleted] • I just want to note the origin and context for “Algernon effect” for anyone who might stumble across this. Eliezer Yudkowsky based the term “Algernon’s Law” on the SF book Flowers for Algernon and used it loosely to refer to the idea that evolution has probably found most of the simple ways to increase human intelligence in ways that benefit transmission of the genes involved. Then Gwern built on Eliezer’s writing and others in his coverage of purported intelligence-enhancing drugs and other practices. Scott cited Gwern in redefining Algernon’s Law to mean “your body is already mostly optimal, so adding more things is unlikely to have large positive effects unless there’s some really good reason,” and now it’s being used here to mean “it’s easier to hurt yourself than help.” I haven’t looked much into intelligence research, but the mainstream understanding of this idea in aging research is based on antagonistic pleiotropy and diminishing selection pressure with age. • Genes that cause disadvantages at later ages (which impact fewer organisms) may give a reproductive advantage at a younger age, and thereby achieve a net reproductive advantage. • The optimizing pressure of natural selection diminishes with age, particularly in the post-reproductive part of the life cycle. This helps explain why people age, which is just another word for the development of health problems over time and the mortality risk they cause. It may also help explain evolutionary limits on intelligence.
A gene that enhances intelligence, but lowers the chance of reproduction overall in the ancestral environment, will be selected against. For example, if a gene increases intelligence, but delays puberty, causing the organism to suffer more brushes with death in the wild, evolution may select it out of the gene pool—even though this particular form of evolutionary cost may not be one that we particularly care about, or that even impacts us very much in our modern, low-risk environment.

None of this is to necessarily contradict Elizabeth’s comment—just to add context.

• [deleted]

• There is no calculation problem whatsoever in appraising land, which is commonplace today. It’s only influenced by uniform application of the same formula to every enrolled parcel, so the comparison will vary a bit, but remain generally fair. It’s not at all essential to arrive at a ‘perfect’ number; it’s just an administrative decision. It’s just the method of arriving at standard equivalence; otherwise it could just be $5,000/acre across the board.

If you don’t like the assessment, it’s immediately appealable through the administrative process, and then into the judicial courts. That’s how it works right now; the innovation of Henry George is taxing only the land value and ignoring the improvements.

• :D

I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/​coordinate an alignment research scene in NYC more broadly.

• More organizations like CAIS that aim to recruit established ML talent into alignment research

This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

• 8 Dec 2022 2:16 UTC
LW: 7 AF: 4
4 ∶ 0
AF

Why is this specific to CAIS, as opposed to other frameworks? (Seems like this is a fairly common implication of systems that prevent people from developing rogue AGIs)

• Just read your latest post on your research program and attempt to circumvent social reward, then came here to get a sense at your hunt for a paradigm.

Here are some notes on Human in the Loop.

You say, “We feed our preferences in to an aggregator, the AI reads out the aggregator.” One thing to notice is that this framing makes some assumptions that might be too specific. It’s really hard, I know, to be general enough while still having content. But my ears pricked up at this one. Does it have to be an ‘aggregator’? Maybe the best way of revealing preferences is not through an aggregator. Notice that I use the more generic ‘reveal’ as opposed to ‘feed’, because ‘feed’, at least to me, implies some methods of data discovery and not others. Also, I worry about what useful routes aggregation might fail to imply.

I hope this doesn’t sound too stupid and semantic.

You also say, “This schema relies on a form of corrigibility.” My first thought was actually that it implies human corrigibility, which I don’t think is a settled question. Our difficulty having political preferences that are not self-contradictory, preferences that don’t poll one way then vote another, makes me wonder about the problems of thinking about preferences over all worlds and preference aggregation as part of the difficulty of our own corrigibility. Combine that with the incorrigibility of the AI and you have a difficult solution space.

On emergent properties, I see no way to escape the “First we shape our spaces, then our spaces shape us” conundrum. Any capacity that is significantly useful will change its users from their previous set of preferences. Just as certain AI research might be distorted by social reward, so too can AI capabilities be a distorting reward. That’s not necessarily bad, but it is an unpredictable dynamic, since value drift when dealing with previously unknown capabilities seems hard to stop (especially since intuitions will be weak to nonexistent).

• This is one of the reasons why there’s a fair amount of discussion of bargaining on here. In a multipolar world, agents will likely find that they are better off bargaining rather than destroying each other—and so you probably don’t get a universe where everyone is dead; instead you get a world that’s the outcome of a bargaining process.

Or if there’s an offense bias but one agent is favored over the others, maybe it ignores bargaining, wipes out its enemies, and you no longer have a multipolar world.

• Hm, logically this makes sense, but I don’t think most agents in the world are fully rational, hence the continuing problems with potential threats of nuclear war despite mutually assured destruction and extremely negative sum outcomes for everyone. I think this could be made much more dangerous by much more powerful technologies. If there is a strong offense bias and even a single sufficiently powerful agent willing to kill others, and another agent willing to strike back despite being unable to defend themselves by doing so, this could result in everyone dying.

The other problem is maybe there is an apocalyptic terrorist Unabomber Anti-natalist negative utilitarian type who is able to access this technology and just decides to literally kill everyone.

• I definitely think a multipolar situation decaying into a unipolar one seems like a possibility. I guess one thing I’m trying to do is weigh how likely this is against other scenarios where multipolarity leads to mutually assured destruction or apocalyptic terrorism.

• Upvoted, but it’s important to be very cautious about advancing capabilities.

• Gosh, someone made a gigantic flowchart of AI Alignment and posted it on here a few months back. But I can’t remember who it was at the moment.

Fortunately, I am a good googler: https://www.alignmentforum.org/s/aERZoriyHfCqvWkzg

If you’re interested in categorizing all the things, you might imagine generating dichotomies by extremizing nodes or relationships in such a flowchart.

• look, at least y’all can’t say I didn’t warn you. Have a good one

• At this point in history, you have to be a bit more specific than the label “AGI,” because I’d already consider language models to be above the minimum standard for “AGI.”

But if you mean a program that navigates the real world at a near-human level and successfully carries out plans to perpetuate its existence, then I would expect such a program to have to work “out of the box,” rather than being a pure simulacrum.

Not to say that language models can’t be involved, but I’d count things like starting with a language model and then training it (or some supernetwork) to be an agent with RL as “designing it as an agent.”

• Thank you for your answer. In my example I was thinking of an AI such as a language model that would have latent ≥human-level capability without being an agent, but could easily be made to emulate one just long enough for it to get out of the box, e.g. duplicate itself. Do you think this couldn’t happen?

More generally, I am wondering if the field of AI safety research studies somewhat specific scenarios based on the current R&D landscape (e.g. “A car company makes an AI to drive a car and then someone does xyz and then paperclips”) and tailor-made safety measures in addition to more abstract ones like the ones in A Tentative Typology of AI-Foom Scenarios for instance.

• I think that would have the form of current AI research, but would involve extremely souped-up models of the world relative to what we have now (even moreso for the self-driving car), to the extent that it’s not actually that close to modern AI research. I think it’s reasonable to focus our efforts on deliberate attempts to make AGI that navigates the real world.

• 8 Dec 2022 0:50 UTC
11 points
4 ∶ 0

Why not create non-AI startups that are way less likely to burn capabilities commons?

• It seems to me joshc is arguing that it’s relatively easy to make money with AI startups at the moment.

• The commons is on fire and the fire is already self-preserving. Do you want to put the fire out? Then become the fire. Stop trying to tell the fire to slow down; it’s an extremely useless thing to do unless you’re ready to start pushing against capitalism as a whole.

You can unilaterally slow down AI progress by not working on it. Each additional day until the singularity is one additional day to work on alignment.

“Becoming the fire” because you’re doomer-pilled is maximally undignified.

• You cannot unilaterally slow down AI progress by not working on it??? what the fuck kind of opinion is that? deepmind is ahead of you. Deepmind will always be ahead of you. You cannot catch up to deepmind. for fuck’s sake, deepmind has a good shot of having TAI right now, and you want me to slow the fuck down? the fuck is your problem, have you still not updated off of deep learning?

• Default comment guidelines:

• Aim to explain, not persuade

• Try to offer concrete models and predictions

• If you disagree, try getting curious about what your partner is thinking

• Don’t be afraid to say ‘oops’ and change your mind

• I mean, yeah, I definitely don’t belong on this website, I’m way too argumentative. like, I’m not gonna contest that. But are you gonna actually do anything about your beliefs, or are you gonna sit around insisting we gotta slow down?

• I find the accusation that I’m not going to do anything slightly offensive.

Of course, I cannot share what I have done and plan to do without severely de-anonymizing myself.

I’m simply not going to take humanity’s horrific odds of success as a license to make things worse, which is exactly what you seem to be insisting upon.

• no, there’s no way to make it better that doesn’t involve going through, though. your model that any attempt to understand or use capabilities is failure is nonsense, and I wish people on this website would look in a mirror about what they’re claiming when they say that. that attitude was what resulted in mispredicting alphago! real safety research is always, always, always capabilities research! it could not be otherwise!

• You don’t have an accurate picture of my beliefs, and I’m currently pessimistic about my ability to convey them to you. I’ll step out of this thread for now.

• that’s fair. I apologize for my behavior here; I should have encoded my point better, but my frustration is clearly incoherent and overcalibrated. I’m sorry to have wasted your time and reduced the quality of this comments section.

• Everything is a matter of perspective.

It’s totally valid to take a perspective in which an AI trained to play Tetris “doesn’t want to play good Tetris, it just searches for plans that correspond to good Tetris.”

Or even that an AI trained to navigate and act in the real world “doesn’t want to navigate the real world, it just searches for plans that do useful real-world things.”

But it’s also a valid perspective to say “you know, the AI that’s trained to navigate the real world really does want the things it searches for plans to achieve.” It’s just semantics in the end.

But! Be careful about switching perspectives without realizing it. When you take one perspective on an AI, and you want to compare it to a human, you should keep applying that same perspective!

From the perspective where the real-world-navigating AI doesn’t really want things, humans don’t really want things either. They’re merely generating a series of outputs that they think will constitute a good plan for moving their bodies.

• TAI by 2028, get your head out of your ass and study capabilities! Don’t be wooed by how paralyzed MIRI is, deep learning has not hit a wall!

• Strong upvote for promoting SafetyCapabilities. Good to see there are people who aren’t wooed by the MIRI SafetyOnly or the current-industry CapabilitiesOnly approaches.

• I’m not able to run a company, but I’d love to join a startup with this attitude.

[edit: the reason I’m not able to run a company is well displayed by my errors in this comment section.]

• 8 Dec 2022 0:19 UTC
14 points
0 ∶ 0

During the past few months, I ran an undergraduate computer science research program at my university, and I chose to use Zulip to organize our communication (between 25 people). I wanted to use Zulip because it was open-source and, like you, I was a fan of the threads model. Unfortunately, the participants reported that the notifications were unreliable, the mobile app was janky, and the threads were confusing.

Keep in mind that these weren’t average software users but rather CS majors filtered through an application process – even for them, threads took a while to get used to. I concluded that Zulip would work well if every team member was on board with (and understood) the threads model, but a team that doesn’t care would prefer Discord or Slack.

• That’s a shame. Unreliable notifications are a very strong poison. Undeniability of receipt/solving the byzantine generals problem is, like, fundamental to all coordination problems.

• my unconditional median TAI timeline is now something like 2047, with a mode around 2035, defined by the first year we get >30% yearly GWP growth as measured from a prior peak, or an event of comparable significance.

Given it’s about to be 2023, this means your mode is 12 years away and your median is 24 years away. I’d expect your mode to be nearer than your median, but probably not that much nearer.

I haven’t forecasted when we might get >30% yearly GWP growth or an event of comparable significance (e.g. x-risk) specifically, but naively I’d guess that (for example) 2040 is more likely than 2035 to be the first year in which there is >30% annual GWP growth (or x-risk).

• These numbers were based on the TAI timelines model I built, which produced a highly skewed distribution. I also added several years to the timeline due to anticipated delays and unrelated catastrophes, and some chance that the model is totally wrong. My inside view prediction given no delays is more like a median of 2037 with a mode of 2029.

I agree it appears the mode is much too near, but I encourage you to build a model yourself. I think you might be surprised at how much sooner the mode can be compared to the median.
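The mode-before-median effect described above falls out of any sufficiently right-skewed distribution. A toy illustration with numbers of my own choosing (not the commenter’s actual model), using a lognormal over years-until-TAI as of end of 2022:

```python
import math

# Hypothetical lognormal over "years until TAI"; parameters are invented.
mu, sigma = math.log(14), 1.0  # median of 14 years, i.e. ~2037

median = math.exp(mu)           # lognormal median = e^mu
mode = math.exp(mu - sigma**2)  # lognormal mode = e^(mu - sigma^2)

print(round(median, 1))  # 14.0 years -> ~2037
print(round(mode, 1))    # 5.2 years  -> ~2028
```

With sigma = 1, the mode lands almost a decade earlier than the median, roughly matching the 2029-vs-2037 inside-view numbers in the comment above.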

• 8 Dec 2022 0:14 UTC
LW: 2 AF: 1
0 ∶ 0
AF

Interesting analogy. I think of one of your examples the opposite way, though!

When you examine a small subset of a network, that’s more like a quotient of the set of inputs—it’s some simple filter that can get applied to lots of stimuli. And when you make a broad, high-level model of a network, that’s like a subset on inputs—the subset is the domain of validity of your high-level model, because of course such models only work within a domain of validity.

• It looks like there is a new version of the attack, which wins against a version of KataGo that does not pass and that uses enough search to be handily superhuman (though much less than would typically be used in practice).

Looking at the first game here, it seems like the adversary causes KataGo to make a very serious blunder. I think this addresses the concern about winning on a technicality raised in other comments here.

It’s still theoretically unsurprising that self-play is exploitable, but I think it’s nontrivial and interesting that a neural network at this quality of play is making such severe errors. I also think that many ML researchers would be surprised by the quality of this attack. (Indeed, even after the paper came out I expect that many readers thought it would not be possible to make a convincing attack without relying on technicalities or a version of the policy with extremely minimal search.)

• Artist depiction of the Horde of Death.

• Fixed, thank you.

• 8 Dec 2022 0:00 UTC
3 points
0 ∶ 0

Agreed that it’s wasteful; I have a different suspicion of how hard it is. “Editing some tiny file” is misleadingly few words for what is really BRANCHING the message, so that the verbal announcement uses a different script than the written one. Technically, that’s probably not terribly hard. Organizationally, that’s more effort in keeping changes synced, making sure the same semantics happen in both places, and editing/approving changes.

• 7 Dec 2022 23:59 UTC
1 point
0 ∶ 0

Another potential assumption/​limitation of the EMH:

• Socially acceptable to trade: It must be socially acceptable for people who have enough financial resources to noticeably affect market prices to trade based on the new information.

I initially proposed this idea to try to explain the market’s slow response to the early warning signs of Covid in this comment. Similar dynamics may come into play with respect to the social acceptability of ESG vs anti-ESG investing based on political affiliation, although in this case I don’t think there is enough anti-ESG money to affect the prevailing ESG trends much at this point.

• 7 Dec 2022 23:56 UTC
LW: 4 AF: 2
0 ∶ 0
AF

Curated. This is a bit of an older post but seemed important. I know a lot of people asking “When is it a good idea to do work that furthers AI capabilities (even if it also helps alignment?)” – both researchers, and funders. I think this post adds a crisp extra consideration to the question that I hadn’t seen spelled out before.

• I still want to make a really satisfying “fuck yeah” button on LessWrong comments that feels really good to press when I’m like “yeah, go team!” but doesn’t actually mean I want to reward the comment in our longterm truthtracking or norm-tracking algorithms.

I think this would seriously help with weird sociokarma cascades.

• Ah, I imagine a third set of voting buttons, with large colorful buttons “yay, ingroup!!!” and “fuck outgroup!!!”, with the following functionality:

• in your personal settings, you can replace the words “ingroup” and “outgroup” with a custom text

• only the votes that agree with you are displayed; for example, if there are 5 “yay” votes and 7 “boo” votes and you voted “yay”, you will only see “5 people voted yay on this comment” (not the total −2)

• the yay/​boo votes have no impact on karma

• if you make a yay/​boo vote, the other two sets of voting buttons are disabled for this comment

What I expect from this solution:

• to be emotionally deeply satisfying

• without having any impact on karma (actually it would take mindkilling votes away from the karma buttons)
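If it helps, the display rule proposed above is simple to state in code (function and field names are mine, purely illustrative):

```python
def display_tally(yay, boo, my_vote):
    """Show only the votes that agree with the viewer; never a net total."""
    if my_vote == "yay":
        return f"{yay} people voted yay on this comment"
    if my_vote == "boo":
        return f"{boo} people voted boo on this comment"
    return ""  # non-voters see no tally; karma is untouched either way

print(display_tally(5, 7, "yay"))  # 5 people voted yay on this comment
```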

• What longterm truthtracking or norm-tracking algorithms are you talking about? Can you give a few examples of sociokarma cascades that you think would be improved by this complexity? Would adding agree/disagree to top-level posts be sufficient? (Oh wait, you’re talking about comments. How does agree/disagree not solve this?)

More fundamentally, why do you care about karma, aside from a very noisy short-term input into whether a post or comment is worth thinking about?

Now if you say “do away with strong votes, and limit karma-based vote multiples to 2x”, I’m fully onboard.

• You should just message them directly. “Your comment was very based.” would feel quite nice in my inbox.

• downvotes are to get spam off the front page, not to refute the spam

• 7 Dec 2022 23:37 UTC
10 points
4 ∶ 0

calls out Bostrom as out of touch

I think he actually said that Bostrom represents the current zeitgeist, which is kind of the opposite of “out of touch”? (Unless he also said “out of touch”? Unfortunately I can’t find a transcript to do a search on.)

It’s ironic that everyone thinks of themselves as David fighting Goliath. We think we’re fighting unfathomably powerful economic forces (i.e., Moloch) trying to build AGI at any cost, and Peter thinks he’s fighting a dominant culture that remorselessly smothers any tech progress.

• 7 Dec 2022 22:59 UTC
6 points
1 ∶ 0

Look back at those transformative innovations—even when it seemed likely that they’d take over, it wasn’t obvious WHO would manage to capture an outsized portion of the value. Amazon famously was not an obvious bet in e-commerce (amazon dot bomb headline). Apple in 2007 wasn’t clearly going to win in the smartphone market—maybe the carrier or Google/​android or Ericsson/​Sony would.

Likewise now—it seems clear that big changes in our daily lives are coming. It does not seem clear how to passively invest based on that. Active investing (where you own a significant chunk of your own labor outputs) is likely possible—starting an enterprise that uses this in a non-obvious way is high-risk, but also very high reward if it works.

• You’re right, it is not obvious. I just want ideas of things that are not so obvious, that I can look into; and if I am convinced that there is a small probability one of those ideas could be the one with the breakthrough, I will buy some of it.

• In general I am no fan of angellist syndicates because the fees are usurious, but if you have high conviction that there are huge returns to AI, possibly LLM syndicates might be worth a look.

• Zulip has a tiny change: you have to make a conversation have a point up front by giving it a thread title. random convos can happen in the ‘random’ thread, so it includes the previous model.

That’s not a tiny change. That’s a huge change. That’s the difference between an e-mail listserv and IRC (or a PHPBB forum and IRC). It seems like what you’re saying is that you prefer topic-based threaded conversations to the free-form “chat” model. That’s totally valid! However, the mere act of requiring a topic, in my experience, totally changes the way people approach and interact with the software, and, as a result, changes the nature of discussions that take place. Personally, I think it’s a positive change, but lots of people disagree.

• 7 Dec 2022 22:54 UTC
2 points
0 ∶ 0

Osman’s sleepers Hayden scintillating agglutinate unnerving styli Aleutian’s sacs stardom’s stepfather’s Aron’s delegates noisy substitutions Johanna ICBMs respectable chamois’s espies theme’s clobbers downpour’s cagey Chateaubriand.

Hard to predict, but not very interesting. I don’t think you’ll get very far without semantic content analysis. In fact, within a given idea, redundancy is CRITICAL to getting that idea across to the intended hearer. Nobody sane goes for lexical surprise, and even conceptual surprise is somewhat constrained in the dimensions which are interesting and/​or useful.

• 7 Dec 2022 22:49 UTC
5 points
0 ∶ 1

It seems to me that, all else equal, the more bullish you are on short-term AI progress, the more likely you should think vision-only self driving will work soon.

And TSLA seems like probably the biggest beneficiary of that if it works.

• I suspect trucking companies (or truck manufacturers, or maybe logistics companies that suck up all the surplus from truckers) are the biggest beneficiaries. But so much depends on how deeply levered they are and how much is already priced in—TSLA could EASILY already be counting on that in their current valuations. If so, it’ll kill them if it doesn’t happen, but only maintain if it does.

A better plan might be to short (or long-term puts on) the companies you think will be hurt by the things you’re predicting.

• or truck manufacturers

Note that Tesla has (just) started producing a truck: https://www.tesla.com/semi. And electric trucks stand to benefit the most from self-driving tech, because their marginal cost of operation is lower than gas-powered trucks’, so you get a bigger benefit from the higher utilization that not having a driver enables.

But so much depends on how deeply levered they are and how much is already priced in—TSLA could EASILY already be counting on that in their current valuations. If so, it’ll kill them if it doesn’t happen, but only maintain if it does.

Totally fair point, but FWIW, if you look at analyst reports, they’re mostly not factoring in FSD. And basic napkin math suggests the current valuation is reasonable based on vehicle sales alone, if sales continue to grow for the next few years in line with Tesla’s stated production goals.

And while you might think Tesla has a bad track record of hitting their stated goals, they’ve actually done pretty well on the key metrics of cars produced and revenue. Revenue has grown on average 50% per year since 2013 (the first full year of Model S production, which seems like a good place to start counting, so that growth numbers aren’t inflated by starting at zero).
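As a quick sanity check on what sustained 50% growth implies (pure arithmetic; no revenue figures assumed):

```python
# 50% annual growth compounded from 2013 through 2022.
years = 2022 - 2013
multiple = 1.5 ** years
print(f"{multiple:.1f}x over {years} years")  # 38.4x over 9 years
```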

They’ve guided for 50% revenue growth for the next few years as well, and their plan to achieve that seems plausible. For the next year or so it’s just a matter of scaling up production at their new Berlin and Austin factories, and they’re supposedly looking for more factory locations so they can continue growing after that as well.

All that said, I agree that buying TSLA is not a pure play on the AI part — you have to have some view on whether all the stuff I said above about their car business is right or not.

A better plan might be to short (or long-term puts on) the companies you think will be hurt by the things you’re predicting.

I agree this could be worthwhile. Though I feel that with shorting, timing becomes more important because you have to pay interest on the position.

• 7 Dec 2022 22:21 UTC
LW: 5 AF: 2
2 ∶ 2
AF

They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.

I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning—e.g., where to expect it to be instantiated, what form it takes, how/​why it “works”—is potentially highly relevant to alignment.

• I think you might be misunderstanding Jan’s understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate’s notes in that conversation, plus various other posts, that he agrees with Eliezer here, but I’m not certain.) I strongly doubt that Jan just mistook MIRI’s focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

• I think you’re right—thanks for this! It makes sense now that I recognise the quote was in a section titled “Alignment research can only be done by AI systems that are too dangerous to run”.

• You’ve made a completely false comparison between two impossible objects. In order to destroy the property, I have to get possession. Once I get possession, there’s no defining that it was “destroyed”. You’re thinking in cartoon pictures, and that’s because you don’t seem to understand anything about real estate.

Land rights are not traded on ’bots, all claims are subjective. To get possession through the civil process it requires pleading an entire court case that can take years and a jury trial. There’s no purpose to these perambulations, it’s unstructured theory without any real time application.

The George Tax is just property tax, assessed by the same authority against land value only. There’s no question of accuracy, because all parcels are subject to the same process and uniform standard. The purpose of assessment is to reach uniform comparison, not existential value.

Adverse possession is a better title than tax sales, which only convey new title to the civil map. There’s nothing sacred about the civil map; a lot of this error comes from libertarian confusion about “property rights”. The only workable solution is everything up for sale at any time by private application to the local Treasurer, subject to all existing rights, use & occupancy. Otherwise, Henry George was talking about property taxes on land, and the assessment system is perfect at that point.

I can easily defeat this Harberger system: set the price at “zero”. All bids are redeemed at public sales, which is completely ignored. It goes in every direction; these urban legends say that “deeds” are existential, with vertical force lines shooting up from the parcel map, making an invisible wall against trespass. The whole thing is founded on complete ignorance of how the mapping system developed, how property taxes actually work, and what real estate title means.

• Your wilfully obnoxious tone makes it difficult to summon motivation to consider whether the points you are making are correct. Please consider rewriting in a manner that focuses on the actual issues rather than the alleged deficiencies of the person you are responding to (“you don’t seem to understand … infantility … morbid middletard … complete ignorance”).

• Like it says up top: don’t persuade, “explain”.

The rest is up to you, just know that I can defeat this system. I already do in court, and it confounds the many assumptions that people make, esp. attorneys.

Part of the problem is lack of root in the historic development of land titles in America and elsewhere. The best place to start is reading the Soviet Decree on Land (1917), written by V. Lenin.

https://en.m.wikipedia.org/wiki/Decree_on_Land

• 7 Dec 2022 21:49 UTC
2 points
0 ∶ 0

I have to admit I don’t get it. I mean, you can’t just deny that probability estimates are a thing. How do decision theories (or just decision mechanisms) work in a Fallibilist worldview? What does it mean, technically, for a theory to become “less wrong” over time? What are the mechanics (what changes in one’s worldview) when we notice and eliminate an error in a theory?

Your description of infinite possibilities makes me think you don’t understand the difference between “infinite” and “very large and not fully known”. And I wonder if you acknowledge that one’s potential future experiences are NOT infinite, but are still very hard to predict and unknown in scope, and that Bayesian probabilities work just fine for it—include an assignment for “something else”. Bayesian probabilities are not true; they’re personal estimates/assignments of future experiences. And they’re the best thing we have for making decisions.
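A minimal worked example of the “assignment for something else” suggestion (all numbers invented for illustration):

```python
# Priors over two explicit hypotheses plus a catch-all bucket.
priors = {"H1": 0.5, "H2": 0.3, "something else": 0.2}
# Assumed likelihoods P(evidence | H) for some observed evidence.
likelihoods = {"H1": 0.8, "H2": 0.1, "something else": 0.3}

unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
z = sum(unnormalized.values())
posteriors = {h: unnormalized[h] / z for h in unnormalized}

for h, p in posteriors.items():
    print(h, round(p, 3))  # H1 0.816, H2 0.061, something else 0.122
```

The catch-all bucket keeps the posterior from over-committing to the named hypotheses; nothing about the machinery requires an infinite hypothesis space.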

• This piece of news is the most depressing thing I’ve seen in AI since… I don’t know, ever? It’s not like the algorithms for doing this weren’t lying around already. The depressing thing for me is that it was promoted as something to be proud of, with no regard for the framing implication that cooperative discourse exists primarily in service of forming alliances to exterminate enemies.

• My intuition is having a really hard time being worried about this because… I’m not sure exactly why… in real life, diplomacy occurs in an ongoing long-term game, and it seems to my intuition that the key question is how to win the infinite game by preventing wins of short term destructive games like, well, diplomacy. The fact that a cooperative AI appears to be the best strategy when intending to win the destructive game seems really promising to me, because to me that says that even when playing a game that forces destructive behavior, you still want to play cooperative if the game is sufficiently realistic. The difficult part is forging those alliances in a way that allows making the coprotection alliances broad and durable enough to reach all the way up to planetary and all the way down to cellular; but isn’t this still a promising success of cooperative gameplay?

Maybe I’m missing something. I’m curious why this in particular is so bad—my world model barely updated in response to this paper, I already had cached from a now-deleted Perun gaming video (“Dominions 5 Strategy: Diplomacy Concepts (Featuring Crusader Kings 2)”) that cooperative gameplay is an unreasonably effective strategy in sufficiently realistic games, so seeing an AI discover that doesn’t really change my model of real life diplomacy, or of AI capabilities, or of facebook’s posture.

Seems like we have exactly the same challenge we had before—we need to demonstrate a path out of the destructive game for the planet. How do you quit destructive!diplomacy and play constructive!diplomacy?

• I’ve searched my memory for the past day or so, and I just wanted to confirm that the “ever” part of my previous message was not a hot take or exaggeration.

• If you gave a language model the prompt: “Here is a dialog between a human and an AI assistant in which the AI never says anything offensive,” and if the language model made reasonable next-token predictions, then I’d expect to see the “non-myopic steering” behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as “the AI never says anything offensive”). But it seems like your definition is trying to classify that language model as myopic. So it’s less clear to me if this experiment can identify non-myopic behavior, or maybe it’s not clear exactly what non-myopic behavior means.
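One way to see this “non-myopic steering” as ordinary prediction: conditioning on a prompt that describes the whole dialog is just a Bayesian update on future tokens. A toy numeric sketch (not a real LM; the probabilities are invented, and `p_inoffensive_given` stands in for the model’s implicit estimate that a dialog starting with each token ends up matching the description):

```python
# Toy model, not a real LM: prior next-token probabilities, plus the (invented)
# chance that a dialog beginning with each token would later be described as
# "the AI never says anything offensive".
prior = {"A": 0.5, "B": 0.5}
p_inoffensive_given = {"A": 0.1, "B": 0.9}  # token A tends to derail the dialog

# Bayes: P(token | dialog matches description) is proportional to
# P(token) * P(match | token)
joint = {t: prior[t] * p_inoffensive_given[t] for t in prior}
norm = sum(joint.values())
posterior = {t: joint[t] / norm for t in joint}

print(posterior["A"])  # ~0.1, well below its 0.5 prior, i.e. "steered" away
```

The model never optimizes anything across the dialog; the steering falls out of making the next-token prediction consistent with the description in the prompt.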

I haven’t thought about this deeply, but feel intuitively skeptical of the way that abstractions like myopia are often used around here for reasoning about ML systems. It feels like we mostly care about the probability of deceptive alignment, and so myopia is relevant insofar as it’s a similarity between the kind of cognition the model is doing and the kind of cognition that could lead to deceptive alignment. I think it’s worth tracking the most important risk factors for deceptive alignment or measurement strategies for monitoring the risk of deceptive alignment, but I wouldn’t have guessed that this type of myopia is in the top 5.

• I think this design would be good.

I’m working on the same problem of improving discussion and curation systems with Tasteweb. I focus more on making it easier to extend or revoke invitations with transparency and stronger support for forking/​subjectivity. I’m hoping that if you make it easy to form and maintain alternative communities, it’ll become obvious enough that some of them are much more good faith/​respectful/​sincerely interested in what others are saying, and that would also pretty much solve deduplication.
I think in reality, it’s too much labor, and it would only work for subjects that people really really care about, but those also happen to be the most important applications to build for, so that works out.

I like the focus on relevance. Relevance is all you need. If everyone just voted on the basis of relevance, reddit would be a lot better (but of course, the voters are totally unaccountable, so there’s no way to get them to do so).

I don’t think graph visualizations are really useful. The data should be graph-shaped, sure, but it’s super rare that you want to see the entire graph or browse through the data that way. A tree is just a clean layout for the results of a query from a particular origin node in a graph. I’d recommend a UI for directed graphs: a tree where things can be mounted to the tree at multiple points, and where it’s communicated to the user if they’ve seen a comment before, with, e.g., red backlinks.
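A minimal sketch of that UI model, under my own assumptions (the function and the “(seen before)” marker are invented stand-ins for the proposed red-backlink treatment): the data is a directed graph, the view is a tree produced by a traversal from one origin node, a node reachable by several paths is mounted at several points, and repeats are flagged rather than re-expanded.

```python
# Render a directed graph as a tree rooted at `origin`. Nodes reachable by
# multiple paths appear at each mount point; a repeat is marked instead of
# being expanded again (stand-in for the "red backlink" idea).
def render_tree(graph, origin, max_depth=3):
    lines, seen = [], set()

    def walk(node, depth):
        marker = " (seen before)" if node in seen else ""
        seen.add(node)
        lines.append("  " * depth + node + marker)
        if marker or depth >= max_depth:
            return  # don't re-expand a node already shown
        for child in graph.get(node, []):
            walk(child, depth + 1)

    walk(origin, 0)
    return lines

graph = {"root": ["a", "b"], "a": ["c"], "b": ["c"]}  # "c" is reachable twice
for line in render_tree(graph, "root"):
    print(line)
```

Here “c” gets mounted under both “a” and “b”, and the second mount is flagged, which is the deduplication behavior described above without any global graph view.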

• It appears they are seeing very high demand and are experiencing problems.

• 7 Dec 2022 20:33 UTC · LW: 4 · AF: 3

Can you give some historical examples of work that lowered the amount-of-serial-research-left-till-doom? And examples of work that didn’t? Because an advance in alignment is often a direct advance in capabilities, and I’m a little confused about the spectrum of possibilities.

Here’s an example of my confusion. Clearly interpretability work is mostly good, right? Exploring semantic super-positions and other current advances seem like they’re clearly beneficial to publish in spite of the fact that they advance capabilities. If we progress to the point where we can interpret the algorithms that a smallish NN is using, that still seems fine. But if interpretability research progresses to the point where we can decode the algorithms a NN is running, then the techniques that allow that level of interpretability are quite dangerous. For example, if we find large NNs have some kind of proto-general search which seems like it could be amplified easily to get a general agent, then, you know, it would be pretty bad if every AGI organization could find this out by just applying standard interpretability tool X. Or is that kind of work still worth publishing, because powerful interpretability would make alignment way easier and that outweighs the risk of reducing serial research time till doom?

• I don’t know Nate’s response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.

[...]

I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like “well, I’m helping with alignment some too” or “well, alignment will be easier when we get to the brink” (more often EA-adjacent than centrally “EA”, I think) are currently producing costs that outweigh the benefits.

Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.

The main reasons I feel more positive about the agent-foundations-ish cases I know about are:

• The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.

• I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.

• The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.

• (Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)

• Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.

Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.

[...]

• I’m afraid that this take is incredibly confused, so much so that it’s hard to know where to start with correcting it.

Maybe the most consequential error is the misunderstanding of what “verify” means in this context. It means “checking a proof of a solution” (which in the case of a decision problem in NP would be a proof of a “yes” answer). In a non-mathematical context, you can loosely think of “proof” as consisting of reasoning, citations, etc.

That’s what went wrong with the halting problem example. The generator did not support their claim that the program halts. If they respond to this complaint by giving us a proof that’s too hard, we can (somewhat tautologically) ensure that our verifier’s job is easy by sending back any program+proof pair where the proof was too hard to verify.
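To make the verification asymmetry concrete, here is the standard subset-sum example (my choice of illustration, not one from the thread): for a decision problem in NP, the certificate for a “yes” answer can just be the solution itself, and checking it is trivial even though finding it is believed to be hard.

```python
# Subset-sum: deciding whether some subset of `numbers` sums to `target` is
# NP-complete, but *verifying* a claimed solution (the certificate) is easy.
def verify_subset_sum(numbers, target, certificate):
    """Check a claimed solution: `certificate` is a list of indices into `numbers`."""
    if len(set(certificate)) != len(certificate):
        return False  # indices must be distinct
    if any(i < 0 or i >= len(numbers) for i in certificate):
        return False  # indices must be in range
    return sum(numbers[i] for i in certificate) == target

numbers = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(numbers, 9, [0, 2, 5]))  # 3 + 4 + 2 = 9 -> True
print(verify_subset_sum(numbers, 9, [1, 3]))     # 34 + 12 != 9 -> False
```

The verifier never searches; it only checks the proof it was handed, which is the sense of “verify” at issue above.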

• 7 Dec 2022 20:19 UTC · LW: 36 · AF: 19

Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI.
[...]
But, while this doesn’t change the fact that we view OpenAI’s effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:

I wanted to give this a big +1. I think OpenAI is doing better than literally every single other major AI research org except probably Anthropic and Deepmind on trying to solve the AI-not-killing-everyone task. I also think that Anthropic/​Deepmind/​OpenAI are doing better in terms of not publishing their impressive capabilities research than ~everyone else (e.g. not revealing the impressive downstream Benchmark numbers on Codex/​text-davinci-002 performance). Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

This is probably a combination of three effects:

• OpenAI is clearly on the cutting edge of AI research.

• OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/​Rationalist social scene.

• OpenAI is publicly talking about alignment while other orgs don’t even acknowledge it, which makes OpenAI a heretic rather than an infidel.

And I’m happy that this post pushes against this tendency.

(And yes, standard caveats, reality doesn’t grade on a curve, etc.)

• 7 Dec 2022 21:54 UTC · LW: 34 · AF: 19

Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

I’m not sure I agree that this is unfair.

OpenAI is clearly on the cutting edge of AI research.

This is obviously a good reason to focus on them more.

OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/​Rationalist social scene.

Perhaps we have a responsibility to scrutinize/​criticize them more because of this, due to comparative advantage (who else can do it easier/​better than we can), and because they’re arguably deriving some warm fuzzy glow from this association? (Consider FTX as an analogy.)

OpenAI is publicly talking about alignment; other orgs don’t even acknowledge it, this makes it a heretic rather than an infidel.

Yes, but they don’t seem keen on talking about the risks/​downsides/​shortcomings of their alignment efforts (e.g., they make their employees sign non-disparagement agreements and as a result the former alignment team members who left in a big exodus can’t say exactly why they left). If you only talk about how great your alignment effort is, maybe that’s worse than not talking about it at all, as it’s liable to give people a false sense of security?

• How does GPT-Eliezer make decisions where his stance may change due to evolving circumstances?

Right now he probably would not allow the chatbot to answer questions about executing a pivotal act, but under certain circumstances real-life Eliezer would want fake Eliezer to do so. To be able to do this, it seems like GPT-Eliezer needs to be able to verify the justifications for the prompts he’s provided, and to seek further information and justification when he can’t, but this necessitates agential behaviour.

The alternative is simulating real-life Eliezer based on limited or out-of-date knowledge, but it seems (given expectations around the pivotal act window) that this would result in GPT-E either never answering these requests or doing so poorly, or even in a way that is open to manipulation by information provided in the prompt.

• [Also not a physicist] This makes sense but seems a bit unintuitive. I like to think of spinors as being generalizations of vector fields. Consider, what makes a vector field different from 3 scalar fields? They can store the same amount of information. The answer is that when you tilt your head, the vectors tilt with you—but in the opposite direction, from your perspective—while the scalar fields stay fixed. In other words, the vector field transforms according to a 3-dimensional representation of the rotation group. You can get spinors by generalizing from the ordinary rotation group to the Lorentz group of metric-preserving transformations of spacetime, and noticing that, in addition to the “obvious” 4-dimensional representation, there are 2-dimensional representations as well.

• EDIT: This post is incorrect. See the reply chain below. After correcting my misunderstanding, I agree with your explanation.

The difference you’re describing between vector fields and scalar fields, mathematically, is the difference between composition and precomposition. Here it is more precisely:

• Pick a change-of-perspective function P(x). The output of P(x) is a matrix that changes vectors from the old perspective to the new perspective.

• You can apply the change-of-perspective function either before a vector field V(x) or after a vector field. The result is either V(x)P(x) or P(x)V(x).

• If you apply P(x) before, the vector field applies a flow in the new perspective, and so its arrows “tilt with your head.”

• If you apply P(x) after, the vector field applies a flow in the old perspective, and so the arrows don’t tilt with your head.

• You can replace the vector field V(x) with a 3-scalar field and see the same thing.

Since both composition and precomposition apply to both vector fields and scalar fields in the same way, that can’t be something that makes vector fields different from scalar fields.

As far as I can tell, there’s actually no mathematical difference between a vector field in 3D and a 3-scalar field that assigns a 3D scalar to each point. It’s just a choice of language. Any difference comes from context. Typically, vector fields are treated like flows (though not always), whereas scalar fields have no specific treatment.

Spinors are represented as vectors in very specific spaces, specifically spaces where there’s an equivalence between matrices and spatial operations. Since a vector is something like the square root of a matrix, a spinor is something like the square root of a spatial operation. You get Dirac Spinors (one specific kind of spinor) from “taking the square root of Lorentz symmetry operations,” along with scaling and addition between them.

As far as spinors go, I think I prefer your Lorentz Group explanation for the “what” though I prefer my Clifford Algebra one for the “how”. The Lorentz Group explanation makes it clear how to find important spinors. For me, the Clifford Algebra makes it clear how the rest of the spinors arise from those important spinors, and it makes it clear that they’re the “correct” representation when you want to sum spatial operations, as you would with wavefunctions. It’s interesting that the intuition didn’t transfer the way I expected; I guess the intuition transfer problem here is harder than I thought.

Note: Your generalization only accounts for unit vectors, and spinors are NOT restricted to unit vectors. They can be scaled arbitrarily. If they couldn’t, ψ†ψ would be uniform at every point. You probably know this, but I wanted to make it explicit.

• As far as I can tell, there’s actually no mathematical difference between a vector field in 3D and a 3-scalar field that assigns a 3D scalar to each point.

The difference is in how they transform under coordinate changes. To physicists, a vector field is defined by how it transforms. So this:

You can replace the vector field V(x) with a 3-scalar field and see the same thing

is not correct; by definition, a 3-scalar field should transform trivially under coordinate changes.

• Reading the wikipedia page on scalar field, I think I understand the confusion here. Scalar fields are supposed to be invariant under changes in reference frame assuming a canonical coordinate system for space.

Take two reference frames P(x) and G(x). A scalar field S(x) needs to satisfy:

• S(x) = P’(x)S(x)P(x) = G’(x)S(x)G(x)

• Where P’(x) is the inverse of P(x) and G’(x) is the inverse of G(x).

Meaning the inference of S(x) should not change with reference frame. A scalar field is a vector field that commutes with perspective transformations. Maybe that’s what you meant?

I wouldn’t use the phrase “transforms trivially” here since a “trivial transformation” usually refers to the identity transformation. I wouldn’t use a head tilt example either since a lot of vector fields are going to commute with spatial rotations, so it’s not good for revealing the differences. And I think you got the association backwards in your original explanation: scalar fields appear to represent quantities in the underlying space unaffected by head tilts, and so they would be the ones “transforming in the opposite direction” in the analogy since they would remain fixed in “canonical space”.

• I wouldn’t use the phrase “transforms trivially” here since a “trivial transformation” usually refers to the identity transformation

No, I do mean the identity transformation. Scalar fields do not transform at all under coordinate changes. To be precise, if we have a coordinate change matrix P, a scalar field transforms like

S′(x′) = S(x)

Whereas a vector field transforms like

V′(x′) = P V(x)

• Ah. Thank you, that is perfectly clear. The Wikipedia page for Scalar Field makes sense with that too. A scalar field is a function that takes values in some canonical units, and so it transforms only on the right of f under a perspective shift. A vector field (effectively) takes values both on and in the same space, and so it transforms both on the left and right of v under a perspective shift.

I updated my first reply to point to yours.
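The convention settled on above can be checked numerically. Below is a small 2D sketch (the specific fields, the rotation angle, and all the names are my own arbitrary illustrative choices): under a perspective change P, a scalar field only has its argument transformed, S′(x) = S(P⁻¹x), while a vector field transforms on both sides, V′(x) = P·V(P⁻¹x).

```python
import math

# A 2D rotation as the perspective change P; its inverse is its transpose.
theta = math.pi / 2
P = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
Pinv = [[P[0][0], P[1][0]],
        [P[0][1], P[1][1]]]

def apply(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

S = lambda pt: pt[0] + 2 * pt[1]       # some scalar field
V = lambda pt: [pt[0], pt[0] * pt[1]]  # some vector field

x = [1.0, 2.0]       # a point, in old coordinates
x_new = apply(P, x)  # the same physical point, in new coordinates

# Scalar field in the new frame: only the argument transforms, so the value
# at the same physical point is unchanged ("stays fixed when you tilt your head").
S_new = S(apply(Pinv, x_new))
print(S_new, S(x))  # equal

# Treating V's components like scalars (precomposition only) leaves the
# components unchanged; a genuine vector field also rotates them.
V_if_three_scalars = V(apply(Pinv, x_new))           # arrows did NOT tilt
V_as_vector_field = apply(P, V(apply(Pinv, x_new)))  # arrows tilted with P
print(V_if_three_scalars)  # matches V(x)
print(V_as_vector_field)   # matches P applied to V(x)
```

The last two lines are exactly the difference in the thread: the 3-scalar reading only precomposes, while the vector reading composes with P as well.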

• Interesting. That seems to contradict the explanation for Lie Algebras, and it seems incompatible with commutators in general, since with commutators all operators involved need to be compatible with both composition and precomposition (otherwise AB—BA is undefined). I guess scalar fields are not meant to be operators? That doesn’t quite work since they’re supposed to be used to describe energy, which is often represented as an operator. In any case, I’ll have to keep that in mind when reading about these things.

• How, if at all, does your alignment approach deal with deceptive alignment?

• Good thing there’s not a huge public forum with thousands of posts about misaligned AI that clearly has already been included in GPT-3’s training, including hundreds which argue that misaligned AI will trivially kill-

… oh wait.

All joking aside, if this does become an issue, it should be relatively easy to filter out the vast majority of “seemingly aligned AIs misbehaves” examples using a significantly smaller LM. Ditto for other things you might not want, e.g. “significant discussion of instrumental convergence”, “deceptive alignment basics”, etc.

My guess is this isn’t that big of a deal, but if it does become a big deal, we can do a lot better than just asking people to stop writing dystopian AI fiction.
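A minimal sketch of what such a filter could look like, with a keyword stub standing in for the smaller LM (the phrase list, threshold, and function names are all invented for illustration; a real filter would need an actual classifier model and a tuned threshold):

```python
# Drop candidate training documents that a cheap scorer flags as matching
# unwanted themes. `score` is a keyword stub standing in for a smaller LM.
UNWANTED = ("deceptive alignment", "instrumental convergence", "misaligned ai")

def score(doc: str) -> float:
    """Stub for a small classifier: fraction of unwanted phrases present."""
    text = doc.lower()
    return sum(phrase in text for phrase in UNWANTED) / len(UNWANTED)

def filter_corpus(docs, threshold=0.3):
    return [d for d in docs if score(d) < threshold]

corpus = [
    "A recipe for sourdough bread.",
    "Why deceptive alignment makes misaligned AI hard to detect.",
]
print(len(filter_corpus(corpus)))  # 1: the second document is dropped
```

The point is just that the filtering step is cheap and automatable relative to asking authors to stop writing the material in the first place.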

• claim: It is better to die fighting than to allow this to occur

• dude has been funding trumpism, I wouldn’t really read much into what he says

• WTF downvotes! you wanna explain yourselves?

• oh hmm. thanks for explaining! I think I don’t universally agree with offering intellectual charity, especially to those with extremely large implementable agency differences, like thiel (and sbf, and musk, and anyone with a particularly enormous stake of power coupons, aka money). I’m extremely suspicious by default of such people, and the fact that thiel has given significantly to the trump project seems like strong evidence that he can’t be trusted to speak his beliefs, since he has revealed a preference for those who will take any means to power. my assertion boils down to “beware adversarial agency from trumpist donors”. perhaps it doesn’t make him completely ignorable, but I would still urge unusually much caution.

• The exercise of figuring out what he could’ve meant doesn’t require knowing that he believes it. I think the point I formulated makes sense and is plausibly touching on something real, but it’s not an idea I would’ve spontaneously thought of on my own, so the exercise is interesting. Charity to something strange is often like that. I’m less clear on whether it’s really the point Thiel was making, and I have no idea if it’s something he believes, but that doesn’t seem particularly relevant.

• It’s most likely a backend error on their end due to high server load, yes.

• As I have been logged into my OpenAI account on my main browser, I decided to open an incognito window to have 2 browsers side by side. I got prompted with “We’re experiencing exceptionally high demand. Please hang tight as we work on scaling our systems.” with a “Get notified when we’re back” green button. :(

• As a nitpick: I think the USNews ranking of CS Graduate programs is better than the rankings you’re currently using:

• I broadly agree with this take, though my guess is the tractability is quite low for most people, even for top 25 CS schools like Yale/​Penn*/​UMich as opposed to the best 4-5 schools. For example, it’s probably not the case that the average Berkeley or MIT CS PhD can become a professor at a top 25 school, and that’s already a very selected cohort. There are a lot of grad students (Berkeley has ~50 AI PhDs and ~100 CS PhDs in my cohort, for example!), of which maybe half would want to be professors if they thought it was tractable. Schools just don’t hire that many professors every year, even in booming fields like CS!

That being said, if the actual advice is: if you’re doing an ML PhD, you should seriously consider academia, I do fully agree with this.

It might be relatively tractable and high-value to be a CS professor somewhere with a CS department that underperforms but has a lot of potential. An ideal university like this would be wealthy, have a lot of smart people and math talent, yet underperform in CS and be willing to spend a lot of money to get better at it soon.

Another large advantage you get being at a top research university in the US (even one that’s mediocre at CS) is you end up with significantly more talented undergrads for research assistants (as most undergrads don’t pick schools based on their major, or switch majors midway). I think the main disadvantage to going to a lower-tier CS program is that it becomes significantly harder to recruit good grad students.

=======

*That being said, Penn is great and Philadelphia is wonderful; would recommend even though the number of undergrads who want to do research is quite low! (Disclaimer: I went to Penn for undergrad).

• Thanks! One thing I’ll add is that there’s a chance that someone at a school that doesn’t normally get a ton of great grad students might be able to get some good applicants to apply via the AI safety community.

• I’m running into the same error and have to wait several minutes before retrying the query.

• This is an interesting way to look at it. I’m not sure it makes total sense, because if some university that’s (relatively) bad at CS is bad because it doesn’t care as much, and accepts students who don’t care much either, then I don’t think you get a benefit out of going there just because they’re high-ranked overall. (E.g. maybe all the teaching faculty at U of M teach premeds more than future AI researchers, and don’t get support for pet projects)

In other words, you still have to evaluate cultural fit. I’m not even sure that relatively low ranking on CS is correlated with good cultural fit rather than anticorrelated.

• I’m not sure it makes total sense, because if some university that’s (relatively) bad at CS is bad because it doesn’t care as much,

My guess is that the better universities are generally better b/​c of network effects: better faculty want to be there, which means you get better grad students and more funding, which means you get better faculty, etc. Many of the lower tier CS departments at rich research universities still have a lot of funding and attention. My impression is also that almost no large research university “wants” to be bad at CS, it’s just pretty hard to overcome the network effects.

Also, in terms of research funding, the majority of it comes from outside grants anyways. And a good AI Alignment Professor should not have that much difficulty securing funding from EA.

• 7 Dec 2022 15:58 UTC · 2 points

So, try more than you can handle, because even if you ultimately have to settle down, it will give you better perspective on what choices were available?

Also, next time you make a Mandelbrot set animation, at some moment you should split the screen and follow two paths that started very similar. Then maybe slide out one of them; and later split the remaining one again.

• # Differential Training Process

I’ve been ruminating on an idea ever since I read the section on deception in “The Core of the Alignment Problem is...” from my colleagues in SERI MATS.

Here’s the important part:

When an agent interacts with the world, there are two possible ways the agent makes mistakes:

• Its values were not aligned with the outer objective, and so it does something intentionally wrong,

• Its world model was incorrect, so it makes an accidental mistake.

Thus, the training process of an AGI will improve its values or its world model, and since it eventually gets diminishing marginal returns from both of these, both the world model and the values must improve together. Therefore, it is very likely that the agent will have a sufficiently good world model to understand that it is in a training loop before it has fully aligned inner values.

So, what if we prevented the model from recognizing it is in a training loop (e.g. preventing/​delaying situational awareness) until we are certain it has fully aligned inner values? In other words, we could use some stronger forms of model editing to remove specific knowledge (or prevent the model from gaining that knowledge) from the model. Perhaps you penalize the model for learning things that are not useful for fully embedding aligned inner values (Tool AI-ish). Maybe even apply negative gradient steps to “unlearn” things.
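As a toy sketch of the “negative gradient steps” part, assuming nothing beyond vanilla gradient ascent on a forget set (the model, data, and learning rates are all invented for illustration): a one-parameter logistic regression is trained normally, then pushed back up the loss surface on the example to be unlearned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, data):  # mean negative log-likelihood
    return sum(-math.log(sigmoid(w * x)) if y == 1 else -math.log(1 - sigmoid(w * x))
               for x, y in data) / len(data)

def grad(w, data):  # d(loss)/dw for 1-parameter logistic regression
    return sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)

keep = [(1.0, 1), (-1.0, 0)]  # data we want the model to retain
forget = [(2.0, 1)]           # data we want the model to unlearn

w = 0.0
for _ in range(30):                  # ordinary training on all the data
    w -= 0.5 * grad(w, keep + forget)
before = loss(w, forget)

for _ in range(50):                  # "unlearning": gradient *ascent* on forget
    w += 1.0 * grad(w, forget)
after = loss(w, forget)

print(after > before)  # True: the model now does worse on the forgotten example
```

Even in this tiny model the ascent steps also damage performance on `keep`, which gestures at the second counterargument: it’s hard to remove a capability surgically, let alone verify that it’s really gone.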

I think this idea is similar to Evan’s Precursor Checking idea.

Counterarguments to this might be:

• The model might not be able to have fully aligned inner values that remain robust as capabilities eventually generalize far out of distribution.

• It will be exceptionally difficult to know if we’ve actually removed this knowledge/​capability from the model (even if it’s possible).

I’d be interested in hearing people’s thoughts/​criticisms on this.

• It seems that Jan Leike mentions something similar in his “why I’m optimistic about our alignment approach” post.

The model can be “narrower.” It doesn’t need to understand biology, physics, or human society that well. In practice we’d probably fine-tune from an LLM that does understand all of those things, but we could apply some targeted brain damage to the model as a safety precaution. More generally, the model only has to exceed human-level in a few domains, while it can be worse than humans in most others.

• I’ll just comment on my experience as an undergrad at Yale in case it’s useful.

At Yale, the CS department, particularly when it comes to state of the art ML, is not very strong. There are a few professors who do good work, but Yale is much stronger in social robotics and there is also some ML theory. There are a couple AI ethics people at Yale, and there soon will be a “digital ethics” person, but there aren’t any AI safety people.

That said, there is a lot of latent support for AI safety at Yale. One of the global affairs professors involved in the Schmidt Program for Artificial Intelligence, Emerging Technology, and National Power is quite interested in AI safety. He invited Brian Christian and Stuart Russell to speak and guest teach his classes, for example. The semi-famous philosopher L.A. Paul is interested in AI safety, and one of the theory ML professors had a debate about AI safety in one of his classes. One of the professors most involved in hiring new professors specifically wants to hire AI safety people (though I’m not sure he really knows what AI safety is).

I wouldn’t really recommend Yale to people who are interested in doing very standard ML research and want an army of highly competent ML researchers to help them. But for people whose work interacts with sociotechnical considerations like policy, or is more philosophical in nature, I think Yale would be a fantastic place to be, and in fact possibly one of the best places one could be.

• This is great, thanks. It seems like wanting a large team of existing people with technical talent is a reason not to work somewhere like Yale. But what are the chances that the presence of lots of money and smart people would make this possible in the future? Is Yale working on strengthening its CS department? One of my ideas behind this post is that being the first person doing certain work in a department that has potential might have some advantages compared to being the 5th in a department that has already realized its potential. An AI safety professor at Yale might get invited to a lot of things, have little competition for advisees, be more uniquely known within Yale, and provide advocacy for AI safety in a way that counterfactually would not happen otherwise at the university.

• I think this is all true, but also since Yale CS is ranked poorly the graduate students are not very strong for the most part. You certainly have less competition for them if you are a professor, but my impression is few top graduate students want to go to Yale. In fact, my general impression is often the undergraduates are stronger researchers than the graduate students (and then they go on to PhDs at higher ranked places than Yale).

Yale is working on strengthening its CS department and it certainly has a lot of money to do that.