Logic & Mathematics

Last edit: 25 Sep 2021 3:55 UTC

Logic and mathematics are deductive systems, in which the conclusion of a successful argument follows necessarily from its premises, given the axioms of the system in use: number theory, geometry, predicate logic, and so on.
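As a minimal sketch of what "follows necessarily" means, a proof assistant makes the premise-to-conclusion structure fully explicit (Lean 4 here, an illustrative choice rather than anything prescribed by this page):

```lean
-- Modus ponens: given premises P and P → Q, the conclusion Q follows necessarily.
example (P Q : Prop) (hP : P) (hPQ : P → Q) : Q := hPQ hP

-- Given the Peano-style axioms behind Lean's natural numbers, 2 + 2 = 4 is a theorem.
example : 2 + 2 = 4 := rfl
```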

Is progress in ML-assisted theorem-proving beneficial?

28 Sep 2021 1:54 UTC
10 points

Standard and Nonstandard Numbers

20 Dec 2012 3:23 UTC
75 points

How long does it take to become Gaussian?

8 Dec 2020 7:23 UTC
134 points

A Nice Representation of the Laplacian

12 Feb 2022 3:20 UTC
15 points
(danielfilan.com)

Paper: Discovering novel algorithms with AlphaTensor [Deepmind]

5 Oct 2022 16:20 UTC
80 points
(www.deepmind.com)

The Goldbach conjecture is probably correct; so was Fermat’s last theorem

14 Jul 2020 19:30 UTC
73 points


How to Convince Me That 2 + 2 = 3

27 Sep 2007 23:00 UTC
129 points

Logical Pinpointing

2 Nov 2012 15:33 UTC
113 points

Categories: models of models

9 Oct 2019 2:45 UTC
49 points

a visual explanation of Bayesian updating

8 May 2021 19:45 UTC
20 points

Lakshmi’s Magic Rope: An Intuitive Explanation of Ramanujan Primes

2 Sep 2021 16:36 UTC
16 points

The Promise and Peril of Finite Sets

10 Dec 2021 12:29 UTC
37 points

Rational and irrational infinite integers

23 Mar 2022 23:12 UTC
34 points

Introduction to abstract entropy

20 Oct 2022 21:03 UTC
194 points

Book Review: Linear Algebra Done Right (MIRI course list)

17 Feb 2014 20:52 UTC
56 points

Book Review: Naïve Set Theory (MIRI course list)

30 Sep 2013 16:09 UTC
49 points

Book Review: Basic Category Theory for Computer Scientists (MIRI course list)

19 Sep 2013 3:06 UTC
52 points

Category Theory Without The Baggage

3 Feb 2020 20:03 UTC
117 points

Insights from Euclid’s ‘Elements’

4 May 2020 15:45 UTC
124 points

Second-Order Logic: The Controversy

4 Jan 2013 19:51 UTC
54 points

Why Rationalists Shouldn’t be Interested in Topos Theory

25 May 2020 5:35 UTC
73 points

Topological Fixed Point Exercises

17 Nov 2018 1:40 UTC
70 points

Innate Mathematical Ability

18 Feb 2015 11:11 UTC
65 points

The Chromatic Number of the Plane is at Least 5 - Aubrey de Grey

11 Apr 2018 18:19 UTC
61 points
(arxiv.org)

Why associative operations?

16 Jul 2020 12:36 UTC
6 points
(questionsanddaylight.com)

Godel’s Completeness and Incompleteness Theorems

25 Dec 2012 1:16 UTC
71 points

References & Resources for LessWrong

10 Oct 2010 14:54 UTC
153 points


On exact mathematical formulae

22 Apr 2018 19:41 UTC
62 points

Into the Kiln: Insights from Tao’s ‘Analysis I’

1 Jun 2018 18:16 UTC
27 points

The different types (not sizes!) of infinity

28 Jan 2018 11:14 UTC
58 points

The First Rung: Insights from ‘Linear Algebra Done Right’

22 Apr 2018 5:23 UTC
37 points

The Cartoon Guide to Löb’s Theorem

17 Aug 2008 20:35 UTC
24 points

A Kernel of Truth: Insights from ‘A Friendly Approach to Functional Analysis’

4 Apr 2020 3:38 UTC
31 points

The Value of Theoretical Research

25 Feb 2011 18:06 UTC
51 points

Should correlation coefficients be expressed as angles?

28 Nov 2012 0:05 UTC
100 points

Was a PhD necessary to solve outstanding math problems?

10 Jul 2020 18:43 UTC
22 points

Towards a Formalisation of Logical Counterfactuals

8 Aug 2020 22:14 UTC
6 points

Probability, knowledge, and meta-probability

17 Sep 2013 0:02 UTC
58 points

Is Scott Alexander bad at math?

4 May 2015 5:11 UTC
58 points

[Question] What exercises go best with 3 blue 1 brown’s Linear Algebra videos?

1 Jan 2019 21:29 UTC
28 points

Harry Potter and the Method of Entropy

31 Mar 2018 20:10 UTC
11 points

Mathematical Inconsistency in Solomonoff Induction?

25 Aug 2020 17:09 UTC
7 points

Basic Inframeasure Theory

27 Aug 2020 8:02 UTC
35 points

CTWTB: Paths of Computation State

8 Sep 2020 20:44 UTC
38 points

Numeracy neglect—A personal postmortem

27 Sep 2020 15:12 UTC
80 points

Philosophy of Numbers (part 2)

19 Dec 2017 13:57 UTC
3 points

Spend twice as much effort every time you attempt to solve a problem

15 Nov 2020 18:37 UTC
54 points

The central limit theorem in terms of convolutions

21 Nov 2020 4:09 UTC
36 points

Convolution as smoothing

25 Nov 2020 6:00 UTC
27 points

Discovery fiction for the Pythagorean theorem

19 Jan 2021 2:09 UTC
15 points

Recognizing Numbers

20 Jan 2021 19:50 UTC
25 points

Generalised models: imperfect morphisms and informational entropy

9 Jul 2021 17:35 UTC
9 points

“If and Only If” Should Be Spelled “Ifeff”

16 Jul 2021 22:03 UTC
24 points

Black ravens and red herrings

27 Jul 2021 17:46 UTC
50 points

Uncertainty can Defuse Logical Explosions

30 Jul 2021 12:36 UTC
11 points

Gödel’s Legacy: A game without end

28 Jun 2020 18:50 UTC
39 points

A Layman’s Guide to Recreational Mathematics Videos

31 Aug 2021 23:11 UTC
22 points

[Summary] “Introduction to Electrodynamics” by David Griffiths—Part 1

22 Sep 2021 22:22 UTC
24 points

[Book review] Gödel, Escher, Bach: an in-depth explainer

29 Sep 2021 19:03 UTC
89 points

The Meta-Puzzle

22 Nov 2021 5:30 UTC
23 points
(danielfilan.com)

Re: Second-Order Logic: The Controversy

5 Jan 2013 10:33 UTC
15 points

Six Specializations Makes You World-Class

22 Dec 2021 8:03 UTC
52 points

Understanding the tensor product formulation in Transformer Circuits

24 Dec 2021 18:05 UTC
16 points

Meaning of Words—An Exercise for Active Thinking

31 Jan 2022 20:18 UTC
13 points

Seek Mistakes in the Space Between Math and Reality

1 Mar 2022 5:58 UTC
29 points

The Geometric Series of 1/(d+1) is a Fraction in Base-d

3 Mar 2022 5:06 UTC
14 points

Accelerated [Honors] Calculus

6 Mar 2022 10:34 UTC
31 points

[Quote] Why does i show up in Quantum Mechanics and other Beautiful Math Mysteries

16 Mar 2022 11:58 UTC
9 points

The Case for Frequentism: Why Bayesian Probability is Fundamentally Unsound and What Science Does Instead

3 Apr 2022 20:52 UTC
22 points

Understanding Gödel’s Incompleteness Theorem

6 Apr 2022 19:31 UTC
12 points

A Solution to the Unexpected Hanging Problem

5 Apr 2022 6:19 UTC
6 points

When to use “meta” vs “self-reference”, “recursive”, etc.

6 Apr 2022 4:57 UTC
20 points

Infra-Topology

22 Apr 2022 2:10 UTC
35 points

Hessian and Basin volume

10 Jul 2022 6:59 UTC
33 points

The generalized Sierpinski-Mazurkiewicz theorem.

29 Jul 2022 0:12 UTC
11 points

[Question] Fixed point theory (locally (α,β,ψ) dominated contractive condition)

1 Sep 2022 17:56 UTC
0 points

Why do so many things break in a 2 element set?

23 Sep 2022 6:30 UTC
6 points
(alok.github.io)

[Question] When do you visualize (or not) while doing math?

23 Nov 2022 20:15 UTC
20 points

The Geometric Expectation

23 Nov 2022 18:05 UTC
97 points

Science and Math

27 Nov 2022 4:05 UTC
18 points

Brun’s theorem and sieve theory

2 Dec 2022 20:57 UTC
24 points

[Question] Godel in second-order logic?

26 Jul 2020 7:16 UTC
6 points

Arbital scrape

6 Jun 2019 23:11 UTC
89 points

Sam Harris and the Is–Ought Gap

16 Nov 2018 1:04 UTC
89 points

Co-Proofs

21 May 2018 21:10 UTC
39 points

A Proper Scoring Rule for Confidence Intervals

13 Feb 2018 1:45 UTC
62 points

The mathematical universe: the map that is the territory

26 Mar 2010 9:26 UTC
100 points

The Power of Positivist Thinking

21 Mar 2009 20:55 UTC
90 points

Zoom In: An Introduction to Circuits

10 Mar 2020 19:36 UTC
84 points
(distill.pub)

Proofs, Implications, and Models

30 Oct 2012 13:02 UTC
120 points

Reframing the evolutionary benefit of sex

14 Sep 2019 17:00 UTC
89 points
(sideways-view.com)

Probability space has 2 metrics

10 Feb 2019 0:28 UTC
88 points

Reflection in Probabilistic Logic

24 Mar 2013 16:37 UTC
108 points

Recent Progress in the Theory of Neural Networks

4 Dec 2019 23:11 UTC
76 points

An Untrollable Mathematician Illustrated

20 Mar 2018 0:00 UTC
155 points

Set Up for Success: Insights from ‘Naïve Set Theory’

28 Feb 2018 2:01 UTC
28 points

Tiling Agents for Self-Modifying AI (OPFAI #2)

6 Jun 2013 20:24 UTC
84 points

The Crackpot Offer

8 Sep 2007 14:32 UTC
92 points

Declarative Mathematics

21 Mar 2019 19:05 UTC
57 points

Reducing collective rationality to individual optimization in common-payoff games using MCMC

20 Aug 2018 0:51 UTC
59 points

Entropy, and Short Codes

23 Feb 2008 3:16 UTC
67 points

Insights from Linear Algebra Done Right

13 Jul 2019 18:24 UTC
52 points

The Quotation is not the Referent

13 Mar 2008 0:53 UTC
63 points

Probability is Real, and Value is Complex

20 Jul 2018 5:24 UTC
69 points

[Question] Why does category theory exist?

25 Apr 2019 4:54 UTC
36 points

Turning Up the Heat: Insights from Tao’s ‘Analysis II’

24 Aug 2018 17:54 UTC
36 points

Against Not Reading Math Books Problems-First (If You’ve Found It Helpful Before)

22 May 2018 12:59 UTC
10 points

The Principle of Predicted Improvement

23 Apr 2019 21:21 UTC
66 points

And My Axiom! Insights from ‘Computability and Logic’

16 Jan 2019 19:48 UTC
42 points

Insights from Munkres’ Topology

17 Mar 2019 16:52 UTC
30 points


Mental Context for Model Theory

30 Oct 2013 6:35 UTC
106 points

The sentence structure of mathematics

7 Oct 2019 18:58 UTC
40 points

Probability as Minimal Map

1 Sep 2019 19:19 UTC
49 points

A summary of Savage’s foundations for probability and utility.

22 May 2011 19:56 UTC
80 points

Fun With DAGs

13 May 2018 19:35 UTC
14 points

Don’t Get Distracted by the Boilerplate

26 Jul 2018 2:15 UTC
48 points

Completeness, incompleteness, and what it all means: first versus second order logic

16 Jan 2012 17:38 UTC
78 points

Complete Class: Consequentialist Foundations

11 Jul 2018 1:57 UTC
50 points

Probability interpretations: Examples

11 May 2019 20:32 UTC
38 points

Algorithms as Case Studies in Rationality

14 Feb 2011 18:27 UTC
38 points

Logical Representation of Causal Models

21 Jan 2020 20:04 UTC
32 points

Death Note, Anonymity, and Information Theory

8 May 2011 15:44 UTC
54 points

A proof of Löb’s theorem in Haskell

19 Sep 2014 13:01 UTC
52 points

Priors as Mathematical Objects

12 Apr 2007 3:24 UTC
50 points

Beautiful Math

10 Jan 2008 22:43 UTC
33 points

Imagine a World Where Governments Treated COVID-19 Properly

12 Aug 2020 22:52 UTC
−2 points

Draft/wiki: Infinities and measuring infinite sets: A quick reference

24 Dec 2010 4:52 UTC
42 points

Von Neumann’s critique of automata theory and logic in computer science

26 May 2019 4:14 UTC
29 points

You only need faith in two things

10 Mar 2013 23:45 UTC
46 points

The Nature of Logic

15 Nov 2008 6:20 UTC
37 points

Infinite Summations: A Rationality Litmus Test

20 Jan 2017 9:31 UTC
38 points

0.999...=1: Another Rationality Litmus Test

21 Jan 2017 2:16 UTC
28 points

Continuous Improvement: Insights from ‘Topology’

22 Feb 2020 21:58 UTC
29 points

The Quick Bayes Table

18 Apr 2012 18:00 UTC
59 points

[Math] Towards Proof Writing as a Skill In Itself

13 Jun 2018 4:39 UTC
25 points

How valuable is it to learn math deeply?

2 Sep 2013 18:01 UTC
33 points

Probabilistic Löb theorem

26 Apr 2013 18:45 UTC
39 points

Bayesian Utility: Representing Preference by Probability Measures

27 Jul 2009 14:28 UTC
45 points

You Provably Can’t Trust Yourself

19 Aug 2008 20:35 UTC
32 points

Fundamentals of Formalisation level 1: Basic Logic

4 May 2018 13:01 UTC
8 points

Joy in Discovery: Galois theory

2 Sep 2019 19:16 UTC
30 points

Sets and Functions

11 Oct 2019 5:06 UTC
28 points

Reward function learning: the learning process

24 Apr 2018 12:56 UTC
6 points

Utility versus Reward function: partial equivalence

13 Apr 2018 14:58 UTC
17 points

Why Gradients Vanish and Explode

9 Aug 2019 2:54 UTC
25 points

Formulas of arithmetic that behave like decision agents

3 Feb 2012 2:58 UTC
35 points

Philosophy of Numbers (part 1)

2 Dec 2017 18:20 UTC
10 points

Laplace Approximation

18 Jul 2019 15:23 UTC
28 points

Very Basic Model Theory

31 Oct 2013 7:06 UTC
40 points

LDL 2: Nonconvex Optimization

20 Oct 2017 18:20 UTC
13 points

The Power of Noise

16 Jun 2014 17:26 UTC
53 points

When wishful thinking works

1 Sep 2018 23:43 UTC
32 points

A Candidate Complexity Measure

31 Dec 2017 20:15 UTC
16 points

Proofs Section 2.3 (Updates, Decision Theory)

27 Aug 2020 7:49 UTC
7 points

Proofs Section 2.2 (Isomorphism to Expectations)

27 Aug 2020 7:52 UTC
7 points

Proofs Section 2.1 (Theorem 1, Lemmas)

27 Aug 2020 7:54 UTC
7 points

Proofs Section 1.1 (Initial results to LF-duality)

27 Aug 2020 7:59 UTC
7 points

Proofs Section 1.2 (Mixtures, Updates, Pushforwards)

27 Aug 2020 7:57 UTC
7 points

Belief Functions And Decision Theory

27 Aug 2020 8:00 UTC
15 points

Dense Math Notation

1 Apr 2011 3:37 UTC
33 points

[Question] What’s going on with “provability”?

13 Oct 2019 3:59 UTC
23 points

The Emergence of Math

2 Nov 2012 1:08 UTC
0 points

Fundamentals of Formalisation level 2: Basic Set Theory

18 May 2018 17:21 UTC
5 points

Drawing Two Aces

3 Jan 2010 10:33 UTC
19 points

Naturalistic trust among AIs: The parable of the thesis advisor’s theorem

15 Dec 2013 8:32 UTC
36 points

Inversion of theorems into definitions when generalizing

4 Aug 2019 17:44 UTC
25 points

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

19 Dec 2019 13:55 UTC
24 points

How my math skills improved dramatically

5 Mar 2014 20:27 UTC
34 points

Message Length

20 Oct 2020 5:52 UTC
131 points

Integers as Compression

28 Oct 2020 7:36 UTC
9 points

Why mathematics works

8 Mar 2018 18:00 UTC
7 points

Great Mathematicians on Math Competitions and “Genius”

11 Oct 2010 11:50 UTC
32 points

{Math} A times tables memory.

1 Dec 2019 15:40 UTC
19 points

Expecting Beauty

12 Jan 2008 3:00 UTC
24 points

Relative Configuration Space

26 May 2008 9:25 UTC
20 points

Sustained Strong Recursion

5 Dec 2008 21:03 UTC
19 points


Formalization is a rationality technique

6 Mar 2009 20:22 UTC
3 points

A Correspondence Theorem

26 Oct 2020 23:28 UTC
28 points

Precommitting to paying Omega.

20 Mar 2009 4:33 UTC
5 points

“model scores” is a questionable concept

6 Nov 2020 3:19 UTC
26 points

The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

14 Feb 2020 18:46 UTC
25 points

A puzzle

25 Apr 2009 2:33 UTC
−11 points

The Use of Many Independent Lines of Evidence: The Basel Problem

3 Jun 2013 4:42 UTC
38 points

Math prerequisites for understanding LW stuff

4 Oct 2010 11:30 UTC
26 points


Metauncertainty

10 Apr 2009 23:41 UTC
26 points

The Logic of Science: 2.2

21 Feb 2018 17:28 UTC
9 points
(pulsarcoffee.com)

Examples of Categories

10 Oct 2019 1:25 UTC
24 points

Gaining Approval: Insights From “How To Prove It”

13 May 2018 18:34 UTC
9 points

Examples of Measures

15 Nov 2020 1:44 UTC
22 points

Without models

4 May 2009 11:31 UTC
20 points

Log-odds (or logits)

28 Nov 2011 1:11 UTC
31 points

No Universal Probability Space

6 May 2009 2:58 UTC
2 points

How Not to be Stupid: Brewing a Nice Cup of Utilitea

9 May 2009 8:14 UTC
2 points

Supernatural Math

19 May 2009 11:31 UTC
5 points

Harry Potter and the Method of Entropy 1 [LessWrong version]

31 Mar 2018 20:38 UTC
6 points

indexical uncertainty and the Axiom of Independence

7 Jun 2009 9:18 UTC
23 points

No surjection onto function space for manifold X

9 Jan 2019 18:07 UTC
21 points

The two meanings of mathematical terms

15 Jun 2009 14:30 UTC
−1 points

Guilt by Association

24 Jun 2009 17:29 UTC
1 point

Causality does not imply correlation

8 Jul 2009 0:52 UTC
18 points

Formalized math: dream vs reality

9 Jul 2009 20:51 UTC
19 points

[Question] Does Math actually create or is it simply an attribute?

10 Dec 2020 22:01 UTC
5 points

Probability theory implies Occam’s razor

18 Dec 2020 7:48 UTC
8 points

A Primer on Matrix Calculus, Part 1: Basic review

12 Aug 2019 23:44 UTC
23 points

Use conditional probabilities to clear up error rate confusion

17 Jan 2021 8:27 UTC
5 points

Bayesian inference on 1st order logic

3 Feb 2021 1:03 UTC
11 points

Against butterfly effect

9 Feb 2021 7:46 UTC
5 points
(forensicoceanography.wordpress.com)


Naïve Set Theory—Part 1: Construction of Sets

28 Feb 2021 11:57 UTC
1 point

Why sigmoids are so hard to predict

18 Mar 2021 18:21 UTC
48 points

Some things I’ve learned in college

25 Mar 2021 21:30 UTC
22 points
(aaronbergman.substack.com)

A casual intro to Geometric Algebra

28 Apr 2021 0:00 UTC
24 points


The case for hypocrisy

13 May 2021 3:36 UTC
37 points
(aaronbergman.substack.com)

Escaping the Löbian Obstacle

16 Jun 2021 0:02 UTC
6 points

[Question] Halpern’s paper—A refutation of Cox’s theorem?

11 Aug 2021 9:25 UTC
11 points

[Question] Is LessWrong dead without Cox’s theorem?

4 Sep 2021 5:45 UTC
−2 points

Occam’s Razor and the Universal Prior

3 Oct 2021 3:23 UTC
22 points

Applied Mathematical Logic For The Practicing Researcher

17 Oct 2021 20:28 UTC
9 points
(universalprior.substack.com)

No one knows what Peano arithmetic doesn’t know

16 Dec 2011 21:36 UTC
28 points

Walkthrough of the Tiling Agents for Self-Modifying AI paper

13 Dec 2013 3:23 UTC
28 points

A Primer on Matrix Calculus, Part 2: Jacobians and other fun

15 Aug 2019 1:13 UTC
22 points

Chu are you?

6 Nov 2021 17:39 UTC
60 points

What’s the weirdest way to win this game?

21 Nov 2021 5:18 UTC
9 points

Finding the Central Limit Theorem in Bayes’ rule

27 Nov 2021 5:48 UTC
21 points

Question/Issue with the 5/10 Problem

29 Nov 2021 10:45 UTC
6 points

A Generalization of ROC AUC for Binary Classifiers

4 Dec 2021 21:47 UTC
6 points

A Possible Resolution To Spurious Counterfactuals

6 Dec 2021 18:26 UTC
15 points

[Question] Can you prove that 0 = 1?

4 Feb 2022 21:31 UTC
−10 points

The Genetics of Space Amazons

30 Dec 2021 22:14 UTC
12 points

Logics for Mind-Building Should Have Computational Meaning

25 Sep 2014 21:17 UTC
34 points

This Year I Tried To Teach Myself Math. How Did It Go?

31 Dec 2021 17:55 UTC
61 points

Computability and Complexity

5 Feb 2022 14:53 UTC
21 points
(www.metaculus.com)

Relativized Definitions as a Method to Sidestep the Löbian Obstacle

27 Feb 2022 6:37 UTC
27 points

If your solution doesn’t work, make it work

11 Mar 2022 16:10 UTC
17 points

Whence the determinant?

13 Mar 2022 19:38 UTC
23 points

[Question] What are the best elementary math problems you know?

20 Mar 2022 17:18 UTC
36 points

Sums and products

27 Mar 2022 21:57 UTC
23 points
(www.metaculus.com)

The median and mode use less information than the mean does

1 Apr 2022 21:25 UTC
15 points

Optional stopping

2 Apr 2022 13:58 UTC
14 points

Distilling and approaches to the determinant

6 Apr 2022 6:34 UTC
6 points

Solving the Brazilian Children’s Game of 007

6 Apr 2022 13:03 UTC
3 points

Don’t be afraid of the thousand-year-old vampire

18 Apr 2022 1:22 UTC
35 points


My Take On Philosophy

22 May 2022 16:43 UTC
2 points

Miriam Yevick on why both symbols and networks are necessary for artificial minds

6 Jun 2022 8:34 UTC
1 point

A Butterfly’s View of Probability

15 Jun 2022 2:14 UTC
29 points

#SAT with Tensor Networks

17 Jun 2022 13:20 UTC
4 points

Worked Examples of Shapley Values

24 Jun 2022 17:13 UTC
67 points

One is (almost) normal in base π

1 Jul 2022 4:05 UTC
14 points

Five views of Bayes’ Theorem

2 Jul 2022 2:25 UTC
38 points

Bayesian probability theory as extended logic—a new result

6 Jul 2017 19:14 UTC
37 points

Basin broadness depends on the size and number of orthogonal features

27 Aug 2022 17:29 UTC
34 points

Strange Loops—Self-Reference from Number Theory to AI

28 Sep 2022 14:10 UTC
9 points

Boolean Primitives for Coupled Optimizers

7 Oct 2022 18:02 UTC
9 points

[Question] Why are probabilities represented as real numbers instead of rational numbers?

27 Oct 2022 11:23 UTC
5 points

For ELK truth is mostly a distraction

4 Nov 2022 21:14 UTC
32 points

[Question] Can we get around Godel’s Incompleteness theorems and Turing undecidable problems via infinite computers?

5 Nov 2022 18:01 UTC
−10 points

[Link] Wavefunctions: from Linear Algebra to Spinors

7 Dec 2022 12:44 UTC
10 points
(paperclip.substack.com)

• Here is a question closely related to the feasibility of finding discriminating-reasons (cross-posted from Facebook):

For some circuits C it’s meaningful to talk about “different mechanisms” by which C outputs 1.

A very simple example is C(x) := A(x) or B(x). This circuit can be 1 if either A(x) = 1 or B(x) = 1, and intuitively those are two totally different mechanisms.

A more interesting example is the primality test C(x, n) := (x^n = x (mod n)). This circuit is 1 whenever n is a prime, but it can also be 1 “by coincidence”, e.g. if n is a Carmichael number. (Such coincidences are rare and look nothing like n being close to prime.)

In every case I’m aware of where there are two clearly distinct mechanisms, there is also an efficient probabilistic algorithm for distinguishing those mechanisms (i.e. an algorithm for distinguishing cases where C(x) = 1 due to mechanism A from cases where C(x) = 1 due to mechanism B). I am extremely interested in counterexamples to that general principle.

For example, a priori it seems like it could have turned out that (x^n = x (mod n)) is a good probabilistic primality test, but there is no efficient probabilistic test for distinguishing primes from Carmichael numbers. That would have been a convincing counterexample. But it turns out that testing primality is easy, and in fact we can make a simple tweak to this very probabilistic primality test so that it doesn’t get fooled by Carmichael numbers. But is that an incidental fact about number theory, or once we found a probabilistic primality test was it inevitable that it could be strengthened in this way?
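The “fooled by coincidence” behavior described above is easy to demonstrate concretely. A minimal sketch (the helper name is mine, not from the comment) using the comment’s test x^n = x (mod n):

```python
def fermat_probable_prime(n: int, bases=(2, 3, 5, 7)) -> bool:
    """Return True if x**n ≡ x (mod n) holds for every tested base x."""
    return all(pow(x, n, n) == x % n for x in bases)

print(fermat_probable_prime(97))   # prime: passes
print(fermat_probable_prime(91))   # 91 = 7 * 13: caught by base 2
print(fermat_probable_prime(561))  # 561 = 3 * 11 * 17: Carmichael, passes "by coincidence"
```

By Korselt’s criterion, a Carmichael number satisfies a^n ≡ a (mod n) for every integer a, so no choice of bases rescues this particular test; that is exactly why the “which mechanism fired?” question needs a different discriminator.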

Here are some other illustrative cases:

• Suppose that C(x) uses x to initialize a 1000 x 1000 square in the middle of a 10,000 x 10,000 game of life grid. Then we simulate it for a million steps, and C(x) = 1 if any cell on the rightmost edge of the grid is ever alive. It’s very easy to look at the grid and distinguish the cases where C(x) = 1 because a glider is created that heads to the right side of the grid, from the much rarer cases where C(x) = 1 for any other reason (e.g. a medium weight spaceship).

• Suppose that X(x) is a pseudorandom sparse n x n matrix in some large finite field, and suppose that X is sparse enough that 1% of the time there is no perfect matching at all (i.e. there is no permutation sigma such that X[i, sigma(i)] != 0 for i=1,…,n). Define C(x) := (det(X(x)) = 0). We can distinguish the common case where det(X) = 0 because there are no perfect matchings in X from the extremely rare case where det(X) = 0 because there are multiple perfect matchings contributing to the determinant and they happen to all cancel out. These two cases are easy to distinguish by calculating det(X’) for another random matrix X’ with the same sparsity pattern as X. (Thanks to Dan Kane for calling my attention to this kind of example, and especially the harder version based on exact matchings.)

• Suppose that C_0(x) := A(x) or B(x) and C(x) is an obfuscated version of C_0. Then there is an efficient discriminator: de-obfuscate the circuit and check whether A or B is true. Finding that discriminator given C is hard, but that’s not a violation of our general principle. That said, I would also be interested in a slightly stronger conjecture: not only is there always a discriminator, but it can always be specified using roughly the same number of bits required to specify the circuit C. That’s true in this case, because the circuit C needs to bake in the secret key for the obfuscation, and so requires more bits than the discriminator.
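The re-randomization test from the determinant example can be sketched in a few lines. This is a toy version under my own assumptions (plain Python rather than a real finite-field library, a small 2x2 instance, and hypothetical helper names):

```python
import random

P = 1_000_003  # assumption: any reasonably large prime modulus works for the sketch

def det_mod_p(matrix, p=P):
    """Determinant over GF(p) via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p), None)
        if pivot is None:
            return 0  # no nonzero pivot: determinant vanishes
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def rerandomize(matrix, p=P):
    """Fresh random nonzero entries on the same sparsity pattern."""
    return [[random.randrange(1, p) if x else 0 for x in row] for row in matrix]

# Mechanism 1: a zero row means no perfect matching exists, so det = 0
# for every matrix with this sparsity pattern, re-randomized or not.
no_matching = [[0, 0], [1, 1]]
print(det_mod_p(no_matching), det_mod_p(rerandomize(no_matching)))  # 0 0

# Mechanism 2: det = 0 "by cancellation" (both matchings exist but their
# terms cancel); a re-randomized matrix almost surely has det != 0.
cancelling = [[1, 1], [1, 1]]
print(det_mod_p(cancelling))               # 0
print(det_mod_p(rerandomize(cancelling)))  # nonzero except with probability ~1/P
```

The discriminator is exactly the comment’s: resample the entries on the same sparsity pattern and see whether the determinant stays zero.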

If there don’t exist any convincing counterexamples to this principle, then I’m also very interested in understanding why—right now I don’t have any formal way of talking about this situation or seeing why discrimination should be possible in general. One very informal way of phrasing the “positive” problem: suppose I have a heuristic argument that C(x) often outputs 1 for random inputs x, and suppose that my heuristic argument appears to consider two cases A and B separately. Is there a generic way to either (i) find an efficient algorithm for distinguishing cases A and B, or else (ii) find an improved heuristic argument that unifies cases A and B, showing that they weren’t fundamentally separate mechanisms?

• 9 Dec 2022 18:02 UTC
12 points

It may be better to ask “Is a utility function a useful abstraction to describe how X makes decisions?” (Does it allow you to compress your description of X’s decisions?) Recall that utility functions are just a representation derived from preferences that are structured in a particular way. But not all ways of deciding on a preferred outcome are structured in that way[1], and not all decision algorithms work by preferring outcomes, so thinking in terms of utility functions is not always helpful.

1. ^

See for example:
Aumann, R. J. (1962). Utility theory without the completeness axiom. Econometrica: Journal of the Econometric Society, 445-462.

Bewley, T. F. (2002). Knightian decision theory. Part I. Decisions in economics and finance, 25(2), 79-110.

• 9 Dec 2022 17:35 UTC
LW: 0 AF: 1

Isn’t this the same as the “seamless transition for reward maximizers” technique described in section 5.1 of Stuart and Xavier’s 2017 paper on utility indifference methods? It is a good idea, of course, and if you independently invented it, kudos, but it seems like something that already exists.

• 9 Dec 2022 17:48 UTC
LW: 3 AF: 2

I did explicitly disclaim against novelty, and I did invent this independently; the paper you linked is closely related, and I would like to upvote it as I think those results should also be better known, but I think the problem I solve in this post is different (and technically easier!) than the problems solved in that paper, including in section 5. The problem solved there asks for the optimal agent to act as if it’s an infinite-horizon optimal agent for (including whatever power-seeking would be instrumental for such an agent!) until the time bound causes it to switch into acting like the optimal agent for (and for all that to be reflectively stable). Here, I am not asking for the optimal agent to behave as if it has a longer time horizon than it really does.

• Interesting!

I guess this allows that they can still have very different goals, since they ought to be able to coordinate if they have identical utility functions, i.e. they rank outcomes and prospects identically (although I guess there’s still a question of differences in epistemic states causing failures to coordinate?). Something like maximize total hedonistic utility can be coordinated on if everyone adopted that. But that’s of course a much less general case than arbitrary and differing preferences.

Also, is the result closer to preference utilitarianism or contractualism than deontology? Couldn’t you treat others as mere means, as long as their interests are outweighed by others’ (whether or not you’re aggregating)? So, you would still get the consequentialist judgements in various thought experiments. Never treating others as mere means seems like a rule that’s too risk-averse or ambiguity-averse or loss-averse about a very specific kind of risk or cause of harm that’s singled out (being treated as a mere means), at possibly significant average opportunity cost.

• Maybe some aversion can be justified by differences in empirical beliefs, and to reduce risks from motivated reasoning, typical-mind fallacy, or paternalism. Those can lead to tragedies of the commons, e.g. everyone exploiting one another while mistakenly believing it’s in people’s best interests overall when it’s not, so people are made worse off overall. And if people are more averse to exploiting or otherwise harming others, they’re more trustworthy and cooperation is easier.

But, there are very probably cases where very minor exploitation for very significant benefits (including preventing very significant harms) would be worth it.

• Agreed. I haven’t worked out the details but I imagine the long-run ideal competitive decision apps would resemble Kantianism and resemble preference-rule-utilitarianism, but be importantly different from each. Idk. I’d love for someone to work out the details!

• I continue to believe that the Grabby Aliens model rests on an extremely sketchy foundation, namely the anthropic assumption “humanity is randomly-selected out of all intelligent civilizations in the past present and future”.

For one thing, given that the Grabby Aliens model does not weight civilizations by their populations, it follows that, in order to believe the Grabby Aliens model, we need to strongly reject all the “popular” anthropic priors like SIA and SSA and UDASSA and so on.

For another thing, in order to believe the Grabby Aliens model, we need to make both of the following two claims:

• We SHOULD do the following: (1) observe that we (humanity) seem early with respect to all intelligent civilizations that will ever exist; (2) feel surprised at that observation; and then (3) update our credences-about-astrobiology-etc. accordingly;

• We SHOULD NOT do the following: (1) observe that we (humanity in 2022) seem early with respect to all humans that will ever exist; (2) feel surprised at that observation; and then (3) update our credences-about-astrobiology-etc. accordingly.

Seems self-contradictory to me, right? (The second one is the doomsday argument, which apparently Robin Hanson rejects, which makes me very confused.)
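For concreteness, both bullets involve the same arithmetic. A toy version of the update (the numbers here are mine and purely illustrative):

```python
# Uniform sampling over everyone who ever exists: the chance of finding
# yourself among the first `early` of `total` is just early / total.
def p_observed_early(early, total):
    return early / total

born_so_far = 100e9  # rough number of humans born to date

# Short-future hypothesis: only 200 billion humans ever exist.
p_if_doom_soon = p_observed_early(born_so_far, 200e9)     # 0.5

# Long-future hypothesis: 100 trillion humans ever exist.
p_if_long_future = p_observed_early(born_so_far, 100e12)  # 0.001

# Observing "we are early" favors the short future by this factor:
likelihood_ratio = p_if_doom_soon / p_if_long_future
```

Run over civilizations, this is the update the Grabby Aliens model makes; run over humans, it is the doomsday argument. Endorsing the first while rejecting the second is the tension the comment points at.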

There was some discussion on this topic at my question post “Is Grabby Aliens built on good anthropic reasoning?”. My general takeaway from that discussion was that roughly nobody besides the paper’s coauthor was really enthusiastic about the anthropic foundation of the Grabby Aliens model, and that at least one person who had thought the anthropic foundation was fine turned out to have misunderstood it.

Anyway, this is a review for the LessWrong 2021 review, so I guess the question should be: is this post (and corresponding YouTube video) a “highlight of intellectual progress on this website” (or something along those lines)? Well, I thought it was a lovely and well-crafted YouTube video that faithfully explained the paper. But I also think that it was basically endorsing (or at least, failing to criticize) a paper that is deeply flawed. So I’m strong-voting against including this post in the review, but I’m also upvoting this post itself. :)

• 9 Dec 2022 17:02 UTC
LW: 9 AF: 6

I only skimmed the post, so apologies if you addressed this problem and I missed it.

Problem: even if the AI’s utility function is time-bounded, there may still be other agents in the environment whose utility functions are not time-bounded, and those agents will be willing to trade short-term resources/​assistance for long-term resources/​assistance. So, for instance, the 10-minute laundry-folding robot might still be incentivized to create a child AI which persists for a long time and seizes lots of resources, in order to trade those future resources to some other agent who can help fold the laundry in the next 10 minutes.

• 9 Dec 2022 17:56 UTC
LW: 9 AF: 6

That’s true! Thanks for pointing this out; I added a subsection about it to the post. There are probably also a bunch of other cases I haven’t thought of that provide stories for how the environment directly rewards actions that go against the spirit of the shutdown criterion (besides imitation and this one, which I might call “trade”). This construction does nothing to counteract such incentives. Rather, it just avoids the way that being an infinite-horizon RL agent systematically creates new ones.

• Wouldn’t it be hilarious if a variant of this was all it took to have exceptional AI safety?

• Thought I’d share this. I broke it apart so “it” won’t see it. You can put it back together again.

https ://​ kantrowitz. medium. com/​ openais- chatgpt- bot- imagines- its- worst- possible- self- bf057b697bbb

• 9 Dec 2022 16:55 UTC
7 points

If you think of “wokeism” as a luxury belief—something that many people like to use to show themselves as virtuous, without really doing a cost/benefit analysis of any component of the behavior or signalling—this makes more sense. Also, don’t confuse yourself into thinking systems or corporations have beliefs or intents. They are merely aggregates of diverse actors who happen to be near each other and have intertwined behaviors.

Signaling of wokeism is pretty rampant among today’s youth, who are the biggest customers and a large part of the workforce for the things you mention. It’s probably not ideologically attractive to the elites or leaders, but it’s not obviously harmful, so they’re better off supporting (or at least accepting) it than dealing with massive conflict within their orgs and among their customers.

• Shouldn’t you have waited for April’s Fool for this?

• FWIW this has also been discussed here

• 9 Dec 2022 16:29 UTC
4 points

I am wondering about the conditions under which the zero would come from a geometrically rational way of cognising the field.

I have approached similar things by explaining to myself that the zero relates to how unstated or new entrants to the system refer to the explicit content.

That is, if the zero is set high, then third options are avoided.

If the zero is set very low, then third options are strongly attractive.

One of the options being at zero means that there are loads and loads of equivalent replacements for it, or that we would be ambivalent about changing it to an unknown third option.

If you live on a street where there is a crash once a day, then hearing the first crash of the day is not really significant, just a kind of “it is Tuesday” acknowledgement. If you do not hear a crash that day, it is actually a good day. If you have a car crash once a year, then having a crash is a bad day and not having one is a neutral day.

So status quo reference class tennis could largely end up being the same thing. One tool for understanding different zero points would be to imagine what they claim “expected” looks like, which might be easier than applying them to the specific choice at hand.
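As a sketch of how much the zero matters, here is a toy geometric-mean choice between two options scored by two parties (the setup and numbers are mine, not the commenter's): with a low zero the lopsided option wins, and with a high zero the balanced one does.

```python
from math import prod

def geometric_score(utilities, zero):
    """Geometric mean of each party's gain over the zero point."""
    gains = [u - zero for u in utilities]
    assert all(g > 0 for g in gains), "options at or below zero are vetoed"
    return prod(gains) ** (1 / len(gains))

option_a = [10, 2]  # lopsided: great for one party, meh for the other
option_b = [5, 4]   # balanced

# Low zero ("expected" is much worse than either option): lopsided A wins.
low = (geometric_score(option_a, -10), geometric_score(option_b, -10))

# High zero ("expected" is already decent): balanced B wins, because the
# geometric mean punishes any party left close to its zero point.
high = (geometric_score(option_a, 1), geometric_score(option_b, 1))
```

This matches the comment's intuition: a high zero makes options that leave anyone barely above "expected" look aversive, while a very low zero makes almost any lopsided gain attractive.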

• 9 Dec 2022 16:28 UTC
LW: 3 AF: 2

(Bold direct claims, not super confident—criticism welcome.)

The approach to ELK in this post is unfalsifiable.

A counterexample to the approach would need to be a test-time situation in which:

1. The predictor correctly predicts a safe-looking diamond.

2. The predictor “knows” that the diamond is unsafe.

3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.

Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.

So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.

• This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heuristic arguments, which could fail even more easily.

This approach also depends on a fuzzier philosophical claim, which is more like: “if any small heuristic argument that explains the model behavior on the training set also applies to the current input, then the model doesn’t know that something weird is happening on this input.” It seems like your objection is that this is an unfalsifiable definitional move, but I disagree:

• We can search for cases where we intuitively judge that the model “knows” about a distinction between two mechanisms and yet there is no heuristic argument that distinguishes those mechanisms (even though “know” is pre-formal).

• Moreover, we can search more directly for any plausible case in which SGD produces a model that pursues a coherent and complex plan to tamper with the sensors without there being any heuristic argument that distinguishes it from the normal reason—that’s what we ultimately care about and “know” is just an intuitive waypoint that we can skip if it introduces problematic ambiguity.

• If we actually solve all the concrete problems (like formalizing and finding heuristic arguments) then we can just look at empirical cases of backdoors, sensor tampering, or natural mechanism distinctions and empirically evaluate whether in fact those distinctions are detected by our method. That won’t imply that our method can distinguish real-world cases of sensor tampering, but it will provide much stronger empirical evidence than is available for most alignment approaches (because there is no reason for the methods to break down around human level in particular).

All of those things are challenging without a clear formalization of “heuristic argument,” but I still feel we can do some productive thinking about them. Moreover, this objection is more like “We’re looking at a 3-step plan where it’s hard to evaluate step 3 without knowing details about how step 1 went” rather than “This plan is unfalsifiable.”

• FWIW thank you for posting this! It’s good to see where different people are coming from on this, and I like several of your other writings

• [deleted]
• 9 Dec 2022 16:10 UTC
4 points

I naturally rarely think in words unless I’m constructing a verbal artifact (speaking, writing, planning what to speak/​write, daydreaming conversations, etc).

But I’ve recently begun to occasionally deliberately think in words, after writing made me appreciate that verbal algorithms can be helpful for focusing attention and enforcing rigor. However, verbal thinking feels inefficient in other ways (less tolerant of ambiguity, trapped in the ontology of language, single-threaded, etc.), and it would be extremely annoying if I had to have an inner monologue all the time, even if I could think nonverbally “around” it.

I have also been fascinated with this question for a long time and have been polling people since middle school. Consistent with the results of this Twitter poll I recently ran, more than half the people I’ve personally asked report that they think primarily in words. (Several people have weakened the claim when I asked them more probing questions, like whether they think in words while doing math. Some people who normally “think in words” don’t think verbally while doing math, but others have to.)

Very interestingly, many people disbelieve that I don’t think in words, as they are unable to imagine how it would be possible to think at all without words.

• Zack’s series of posts in late 2020/​early 2021 were really important to me. They were a sort of return to form for LessWrong, focusing on the valuable parts.

What are the parts of The Sequences which are still valuable? Mainly, the parts that build on top of Korzybski’s General Semantics and focus hard core on map-territory distinctions. This part is timeless and a large part of the value that you could get by (re)reading The Sequences today. Yudkowsky’s credulity about results from the social sciences and his mind projection fallacying his own mental quirks certainly hurt the work as a whole though, which is why I don’t recommend people read the majority of it.

The post is long, but it kind of has to be. For reasons not directly related to the literal content of this essay, people seem to have collectively rejected the sort of map-territory thinking that we should bring from The Sequences into our own lives. This post has to be thorough because there are a number of common rejoinders that have to be addressed. This is why I think this post is a better fit for inclusion than something like Communication Requires Common Interests or Differential Signal Costs, each of which is much shorter but only addresses a subset of the problem.

Since the review instructions ask how this affected my thinking, well...

Zack writes generally, but he writes because he believes people are not correctly reasoning in a current politically contentious topic. But that topic is sort of irrelevant: the value comes in pointing out that high status members of the rationalist community are completely flubbing lawful thinking. That made it thinkable that actually, they might be failing in other contexts.

Would I have been receptive to Christiano’s point that MIRI doesn’t actually have a good prediction track record had Zack not written his sequence on this? That’s a hard counterfactual, especially since I had already lost a ton of respect for Yudkowsky by this point, in part because of the quality of thought in his other social media posting. But I think it’s probable enough and these series of posts certainly made the thought more available.

• 9 Dec 2022 15:40 UTC
1 point

Asking a separate session to review the answer seems to work nicely, at least in some cases, but not in others. [screenshots of both cases omitted]

• Yes I have an internal monologue, but it’s mostly text+emotion based, with occasional images. Like a telepathic chatroom rather than a voice chat.

Can you “turn off” your verbal thought and does it cause any discomfort?

I probably could if I tried hard enough but I think it would probably be destabilizing to my mental state.

Is there a difference between thinking conceptually and visually?

Yeah I think so? Sometimes a message to myself will have an image and that’s a very different experience than a pure conceptual message, which is again different from my normal mash of text+concept.

What sort of things do you comment on to yourself?

~Everything.

My thoughts about what’s happening, other perspectives on what’s happening, meta commentary on the thoughts, warnings, arguments with simulated people, arguments with myself, meta stuff about that. I wouldn’t be surprised if a lot of it is repetitive, but a lot is also unique.

If I’m in flow, there’s either nothing at all or I’m not aware of it.

• If we were to become immortal, assuming people wanted to, everyone could eventually become a master in every subject, so would it not be safe to assume we would find ways to change our genomes in response to changes?

• A movie or two would be fine, and might do some good if well-done. But in general—be careful what you wish for.

• Fearmongering may backfire, leading to research restrictions that push the work underground, where it proceeds with less care, less caution, and less public scrutiny.

Too much fear could doom us as easily as too little. With the money and potential strategic advantage at stake, AI could develop underground with insufficient caution and no public scrutiny. We wouldn’t know we’re dead until the AI breaks out and already is in full control.

All things considered, I’d rather the work proceeds in the relatively open way it’s going now.

• I agree that fearmongering is thin ice, and can easily backfire, and it must be done carefully and ethically, but is it worse than the alternative in which people are unaware of AGI-related risks? I don’t think that anybody can say with certainty


• RE moderation guidelines:

Apologies, I didn’t realize those even existed for questions, and apparently my profile default is set to “Reign of Terror”. That wasn’t intentional. I’ve fixed that now.

• 9 Dec 2022 14:45 UTC
7 points

Related: Setting the Default

Thanks for all the examples!

• Depends on what I’m doing. My baseline is verbal/​auditory, and that is the mode my short-term memory loop utilizes most effectively. Reading printed text is primarily an auditory experience for me.

I don’t seem to have an autobiographical narrator as such, but I do a good deal of processing in the verbal mode, increasingly when I am less familiar with a task or process. If I am trying to learn a new task or process, that processing often escapes as a literal verbal output that sometimes makes my kid ask if I’m “talking to YouTube”. I guess this is a stronger version of an internal verbal/​auditory processing loop.

When I’m very focused on a mechanical task like exercise or chopping vegetables or typing[1], I often switch to a more spatial mode; there is a visual component, but it would be more revealing to think of it as proprioceptive.

In meditation I often have access to a more sensory-first mode where I seem to experience mind-body inputs in what feels like a less processed way. Here, autobiographical thoughts “look” surprisingly similar to other sense inputs bubbling up from a pool of possibilities and either serially spooling out, usually as text (audio mode), or just settling back into the whole general mishmash.

When I’m cooking, I tend to think in smells and… processes I suppose? It’s like I know what smell I want and how to get there, but there’s not much visualization and very little verbalization unless I need to do math.

1. ^ Refinement: I learned to touch-type back in the 90s, so this refers to the active translation of mental symbols to digital text. There is sometimes an audio stream happening of the names of the keys I press an instant after the fact, which I take to be an error-checking process. The actual mental objects involved in eventually outputting gestures have a very tactile flavor.

• 9 Dec 2022 14:29 UTC
LW: 2 AF: 1

Problem: suppose the agent foresees that it won’t be completely sure that a day has passed, or that it has actually shut down. Then agent A has a strong incentive to maintain influence over the world past its own shutdown, to swoop in and really shut A down if A might not have actually shut down and there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence.
(The correctness might be assumed by the shutdown problem, IDK, but it’s still an overall issue.)

Another comment: this doesn’t seem to say much about corrigibility, in the sense that it’s not like the AI is now accepting correction from an external operator (the AI would prevent being shut down during its day of operation). There’s no dependence on an external operator’s choices (except that once the AI is shut down the operator can pick back up doing whatever, if they’re still around). It seems more like a bounded optimization thing, like specifying how the AI can be made to not keep optimizing forever.

• 9 Dec 2022 18:28 UTC
LW: 1 AF: 1

To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down. I think this is not a terribly inaccurate assumption; of course, in reality, there are cosmic rays and a properly embedded and self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic probability of shutdown failing, if we don’t try to juice so high that it exceeds , my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from to .
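The incentive comparison in the last sentence can be made concrete with a toy back-of-envelope calculation (the model and numbers below are mine; the comment's own quantities were lost in formatting, and treating a failed shutdown as a flat utility penalty is my simplification, not the post's exact construction):

```python
# Time-bounded utility: task reward R if the laundry gets folded, minus a
# penalty S when shutdown fails, minus whatever it costs to build insurance.
def expected_utility(R, S, p_shutdown, build_cost=0.0):
    return R - build_cost - (1 - p_shutdown) * S

R, S = 1.0, 1.0
plain = expected_utility(R, S, p_shutdown=0.999)
with_successor = expected_utility(R, S, p_shutdown=1.0, build_cost=0.01)

# The successor buys only (1 - 0.999) * S = 0.001 of shutdown-reliability
# utility but costs 0.01 to build, so the plain policy wins.
```

Under these (made-up) numbers the incentive to build a successor never pays for itself, which is the comment's guess: unless the shutdown term is weighted so heavily that the residual failure probability outweighs construction costs, no insurance agent gets built.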

• Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.

Oops, I missed that assumption. Yeah, if there’s such a policy, and it doesn’t trade off against fetching the coffee, then it seems like we’re good. See though here, arguing briefly that by Cromwell’s rule, this policy doesn’t exist: https://arbital.com/p/task_goal/

Even with a realistic probability of shutdown failing, if we don’t try to juice so high that it exceeds , my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from to .

Hm. So this seems like you’re making an additional, very non-trivial assumption, which is that the AI is constrained by costs comparable to or bigger than the cost of creating a successor. If its task has already been very confidently achieved, and it has half a day left, it’s not going to get senioritis; it’s going to pick up whatever scraps of expected utility might be left.

I wonder though if there’s synergy between your proposal and the idea of expected utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying to do unbounded optimization, because unbounded optimization is harder to reliably shut down? IDK.

• 9 Dec 2022 19:29 UTC
LW: 1 AF: 1

Yes, I think there are probably strong synergies with satisficing, perhaps lexicographically minimizing something like energy expenditure once the maximum is reached. I will think about this more.

• To the second point, yes, I edited the conclusion to reflect this.

• 9 Dec 2022 13:49 UTC
1 point

Anecdotally, higher doses result in less pain for me, too. But I’d caution anyone who has chronic migraines, or any sort of chronic pain, to limit themselves to a few standard doses a week. My doctors kept telling me that I likely had medication-overuse headaches, and I kept going “no way, I don’t take that much, and I’ve had regular headaches before taking my current regimen.” Finally, after years of prompting, I toughed out three weeks with no medication and my severe migraines were reduced in duration afterwards. Please be very careful of this slippery slope.

• Over at Marginal Revolution they’re having a discussion about ChatGPT, and one person replied: “One of my kids just said to ask ChatGPT about what Bill Clinton has to do with Elden Ring. Huh?” So...

Do you have any idea whether your kid was merely curious or had a more specific motivation, like they’d heard that the question produces an answer that is interesting in some way? In any event, as I’m curious about these things, I investigated. If you haven’t used it, you should know that once it has responded to the prompt, there will be a button labeled “Try again”. If you press it, it will erase the first response and provide another. I’ve enclosed some remarks in brackets. With that in mind:

What does Bill Clinton have to do with Elden Ring?
[It’s pausing for an unusually long time before responding]

I’m sorry, but I’m not able to browse the internet or access current events, so I don’t have any information on what Bill Clinton might have to do with Elden Ring. I’m a large language model trained by OpenAI, so my knowledge is limited to the text that I was trained on, which has a cutoff date of 2021. I can provide general information and answer questions within the scope of my training, but I’m not able to provide information on current events or specific individuals.

[2nd try, causing a noticeably longer pause]

Bill Clinton has no connection to the video game Elden Ring.

[the 3rd try generated a pause so long that I reset the session, one of the options you have at any time]

New Session:

What does Bill Clinton have to do with Elden Ring?

[using my smart phone stop watch: 19 seconds]

I’m sorry, but I’m not familiar with any connection between Bill Clinton and Elden Ring. Elden Ring is a video game developed by FromSoftware and published by Bandai Namco Entertainment. It was released in January 2022, and it is set in a fantasy world created by Hidetaka Miyazaki, the creator of the popular video game series Dark Souls. I’m not aware of any connection between Bill Clinton and this game, as he is a former President of the United States and has no apparent connection to the video game industry.

[2nd try, 10 sec pause]

Bill Clinton has nothing to do with Elden Ring.

[3rd try: 32.35 sec pause]

There is no connection between Bill Clinton and Elden Ring. Elden Ring is a video game developed by Hidetaka Miyazaki and published by Bandai Namco, while Bill Clinton is a former President of the United States.

At that point I concluded the experiment. I note that in times of heavy load there may be pauses before a response, and even an error message. FWIW, I did this at 8:36 AM Eastern Time on 12/9/2022. But I don’t think we’re dealing with an overload situation. It’s the question itself; it seems rather obscure. Hence the varying response times and the fact that, in each session, response time increased on successive tries.

• 9 Dec 2022 13:44 UTC
3 points

I have been using GitHub Co-Pilot as a part of my daily job for over a year. TL; DR: in three years I expect some improvement, but not beyond simple functions.

Right now, Copilot is most useful for converting data between formats and writing out boilerplate. It is surprising how often in software development (especially server-side) you need to change the shape of data. Essentially, as long as there are established patterns, it is helpful; however, do not expect it to write software for you any time soon.

So far these systems are still fairly narrowly scoped. Copilot can write a simple function, but I haven’t seen it create abstractions. It really doesn’t have an understanding of the code, and even now it isn’t very good at matching parentheses or brackets.

Now I don’t expect copilot of three years from now to put me out of a job, but I do expect that it will do more of the typing for me. I think that I’m still going to have to convert business decisions into the right abstractions, but I hope that I’ll be writing fewer tests by hand.

Until then, it’ll continue writing plausible nonsense, which sometimes happens to be useful.

• I’m usually not the type of guy to dunk on a journal for having low impact factor but uh...

Impact-factor 0 journals are a really really bad sign. An extremely bad sign. I wouldn’t recommend taking it seriously at all. It’s like a limbo for damned papers that were rejected from every other publication. You see things in there. Things you can’t forget. Entire plagiarized papers that were Google translated to Chinese and then back to English. That internet meme where some guy literally put “T” on top of his bar plot instead of real error bars. Forgetting to correct for multiple hypotheses. Unforgivable sins.

My best recommendation is to look for a higher-quality source.

• Thank you for your feedback! This is a mistake on my part. I will take the article down until I’ve looked into this and have updated my resources.

• [deleted]
• Also, “bloody stupid” could refer to unicorn blood, i.e. the Forbidden Forest.

• 1 000 mg is the standard dose in France, with 500mg being used almost only for children.

• As far as running a media company goes, fandom is extremely profitable, increasingly so in an age where enormous sci-fi/​fantasy franchises drive everything. And there’s been huge overlap between fandom communities and social justice politics for a long time.

It’s definitely in Disney’s interest to appeal to Marvel superfans who write fanfiction and cosplay and buy tons of merchandise, and those people tend to also be supporters of social justice politics.

Like, nothing is being forced on this audience—there are large numbers of people who get sincerely excited when a new character is introduced that gives representation for the first time to a new minority group, or something like that.

As with so many businesses, the superfans are worth quite a few normies who might be put off by this. I think this is the main explanation.

• …and those people tend to also be supporters of social justice politics.

I guess this is the part that’s not so clear to me. I see lots of people like this. I also see lots of people who are groaning about being repeatedly lectured and about their characters and franchises getting deconstructed. It’s hard for me to find a vantage point that doesn’t bubble me in one sphere or the other in a way that makes one side look overwhelmingly larger than the other. So I just can’t tell what the actual demographics are here. But the revealed behavior of these companies gives me the impression that they do find it crystal clear. That’s what I find a bit bewildering.

• The reason these events were scary, and subsequent fiction was able to capitalise on that, was that they were near misses. Very near misses. There is already a lot of fiction about various misaligned AIs, but that doesn’t affect people much. So what you seem to be advocating is generating some near misses in order to wake everybody up.

Fear is useful. It would be good to have more of it. The question is how plausible it is to generate situations that are scary enough to be useful, but under enough control to be safe.

• The reactor meltdown on a Soviet submarine did not pose an existential threat. In the worst case, it would have been a little version of Chernobyl. We might compare it to an AI which causes some serious problems, like a stock market crash, but not existential ones. And the movie is not a threat at all.

“The question is how plausible it is to generate situations that are scary enough to be useful, but under enough control to be safe.”

That is a great summary of what I wanted to say!

• I have the capacity to monologue internally, and use it moderately often, but not constantly. When I’m not monologuing I guess there’s just a direct link from thought/​input to action without an intermediary vocalising about it.

When reading my default is to read “in my head” as if reading aloud, but with a little effort I can suppress that and just scan the page while understanding the words. With the result that reading is a little faster if I don’t vocalise it, but also less pleasurable if the rhythm of the prose would be part of the experience. Not sure how retention of what I’ve read compares—I suspect it might be reduced if I’m scan-reading (it lends itself to skimming).

I can generate internal imagery and sounds more generally, but not to the extent of full-blown voluntary hallucination. Mental images tend to feel like they’re in a separate space from my main visual field (somehow above or inside my head, if I had to give it a location) and they aren’t perfectly vivid; maybe only partly in full focus/​detail at a time.

• On many useful cognitive tasks (chess, theoretical research, invention, mathematics, etc.), beginner/​dumb/​unskilled humans are closer to a chimpanzee/​rock than peak humans

All of these tasks require some amount of learning. AIXI can’t play chess if it has never been told the rules or seen any other info about chess ever.

So a more reasonable comparison would probably involve comparing people of different IQs who have made comparable effort to learn a topic.

Intelligence often doesn’t look like solving the same problems better, but solving new problems. In many cases, problems are almost boolean, either you can solve them or you can’t. The problems you mentioned are all within the range of human variation. Not so trivial any human can do them, nor so advanced no human can do them.

Among humans, +6 SD g factor humans do not in general seem as much more capable than +3 SD g factor humans as +3 SD g factor humans are compared to median humans.

This is a highly subjective judgement. But there is no particularly strong reason to think that human intelligence has a Gaussian distribution. The more you select for humans with extremely high g factors, the more you Goodhart to the specifics of the g factor tests. This Goodharting is relatively limited, but still there at +6 SD.

3.0. I believe that for similar levels of cognitive investment narrow optimisers outperform general optimisers on narrow domains.

I think this is both trivially true, and pragmatically false. Suppose some self-modifying superintelligence needs to play chess. It will probably largely just write a chess algorithm and put most of its compute into that. This will be near equal to the same algorithm without the general AI attached. (Probably slightly worse at chess: the superintelligence is keeping an eye out just in case something else happens. A pure chess algorithm can’t notice a riot in the spectator stands; a superintelligence probably would devote a little compute to checking for such possibilities.)

However, this is an algorithm written by a superintelligence, and it is likely to beat the pants off any human written algorithm.

4.1. I expect it to be much more difficult for any single agent to attain decisive cognitive superiority to civilisation, or to a relevant subset of civilisation.

Being smarter than civilization is not a high bar at all. The government often makes utterly dumb decisions. The average person often believes a load of nonsense. Some processes in civilization seem to run on the soft minimum of the intelligences of the individuals contributing to them. Others run on the mean. Some processes, like the stock market, are hard for most humans to beat, but still beaten a little by the experts.

My intuition is that the level of cognitive power required to achieve absolute strategic dominance is crazily high.

My intuition is that the comparison to a +12 SD human is about as useful as comparing heavy construction equipment to top athletes. Machines usually operate on a different scale to humans. The +12 SD runner isn’t that much faster than the +6 SD runner, especially because, as you reach into the peaks of athletic performance, the humans are running close to biological limits and the gap between top competitors narrows.

• [ ]
[deleted]
• It is possible that “This box contains the key” was a true statement at the time it was written, and then the contents were changed. The king’s explanation does not specify an ordering of events.

• This is a great thread for explaining how to spot the frame

I have a lot to say on frames, but a very foundational lesson also worth mentioning is how the spell casting takes place, and how to Counterspell

It happens in 5 steps

1. Someone sets a frame

2. Significance control: thread-expand if you agree, VS thread-minimize if you decide to ignore it and move on

3. Frame negotiation: agree, reframe, or set your own (opposing) frame

4. Agreement

5. Cementing

If you set the frame, you can control the frame from beginning to end. However, if someone else sets the frame, then you first want to decide whether to expand on that frame, or to minimize it.

Significance Control

The more significant a frame is, the more it impacts the conversation, so whether you want to minimize or expand is an important decision

If you decide to challenge a frame, you also expand on it. So if you lose that negotiation, you face much bigger consequences, because you first expanded the frame and then lost it. Indeed, the opposite of minimizing is often not to say it doesn’t matter but simply to ignore it.

If a frame is agreeable to you, you want to expand on it. There are many ways of thread-expanding, including:

• Asking questions such as “why is that” or “why do you think so”
• Asking leading questions: i.e. “oh wow, do you really think so”
• Strategic disagreement: such as “you think so? But this other person said the opposite”. Now they’re forced to defend and talk more, which expands the initial frame
• Laughing: a form of “covert expanding” anyone with a Facebook account is familiar with. This is what lawyers sometimes do to highlight the opposing lawyers’ mistakes (you could see plenty of that during the Depp VS Heard defamation case: most people never realize that most of the snickering was done on purpose to sway public and jurors’ opinions)
• Agreeing and expanding: you agree, and explain why you agree
• Agreeing and sharing: you agree, and share a story that supports the frame or belief
• Agreeing and rewarding: you agree, and you tell them why you appreciate them for saying or doing what they did

(Side note: Most techniques of frame negotiation also expand on a frame. So you want to be careful not to expand disagreement or irreconcilable differences when you need rapport. And this is why, generally speaking, “agreeing and redirecting” is a fantastic form of frame control: it sets your own frame while minimizing the disagreement and leveraging the commonalities)

Whenever a frame is disagreeable to you, you can either challenge it, or minimize it

If you have the power to challenge it and change people’s opinions, or at least if you want your disagreeing voice to be heard, then you can speak up.

Many other times, it’s best instead to minimize a frame, and move on. Minimizing a frame includes:

• Ignoring it
• “Yeah yeah-ing it”: agreeing, but with little to no conviction, and then moving on
• Thread-cutting (i.e. changing topic): a common and effective technique (if well executed)
• Offering small and partial third-party agreement: i.e. “yeah, some people feel that way”, and then moving on

Cementing

Now for the most important step

Imagine you’ve agreed on a frame that’s good for you. What do you do now?

You want to expand on that frame to increase the (perceived) benefits and the follow-through.

This phase is called “thread cementing”, an incredibly useful technique.

Frame cementing means: expanding and solidifying the thread of the agreement reached, to solidify the new frame and increase its effectiveness. Frame cementing increases the likelihood that the other party will stick to the newly negotiated frame, and/​or it increases the likelihood that the persuasion will be internalized and accepted as the new reality (VS just agreeing with the frame as a form of short-term capitulation)

This final step… actually has additional substeps (Human psychology is hard, okay?!!!)

1. You reach a point where a frame is agreeable to you

2. Cement it by asking for confirmation

A frame that is agreed to by the other party immediately increases its power tenfold. It makes people feel part of the decision, which increases adoption and follow-through, as well as increasing “intrinsic motivation”.

Some ways of doing it:
• “What do you think”: an agreement with less nudging gets more buy-in and is even more powerful
• “Do you agree”
• “It makes sense, doesn’t it”

Note: silence often (though not always!) means one is in the process of accepting it, but might feel disempowered to admit it. Generally speaking, the frame agreed upon should feel good

3. Cement it by providing your own confirmation

For example:
▪︎ “I’m glad we agree”
▪︎ “I’m happy we see things the same way”

4. End with a collaborative frame and/​or reward

For example:
• “Yeah, it makes sense, right? You get it because you’re also a smart guy/​gal”
• “I’m glad we’re going to do this. And I’m glad it’s going to help (because I care about you)”: show that you are glad about the new frame/​agreement because it will benefit them, and because you care about them. Super powerful. But be honest about it please -or don’t say it-!
• Silence and a smile: nonverbally confirms the good vibe

5. Next steps and taking action

If it was a frame that requires taking action, move on to the next steps.

(Side note: The more you had to persuade, the more you want to show that you are also tasking yourself with some steps. Eg “Great, so you can take care of X, I’ll do Y and Z, and we’ll meet at 4pm“)

Frame cementing is super powerful, BUT you’d better be genuine when using it, and you’d better use it with real win-win frames or with the best intentions for the people you’re persuading.

When you use it for win-lose, that’s the stuff of manipulators. And although it can work in the short term, over the long term many people will catch on. As a matter of fact, the higher the quality of the people you deal with, the more likely it is they will catch on

Even when you use it for win-win you must be careful. You can still come across as a bit too slick, which raises some red flags

Give people space to agree by themselves. Ask questions more than making statements. And when you must intervene, live by the motto “nudge, don’t push”.

Also make sure you stress the win-win nature of the agreement, together with how glad you are because you care about them.

One final Warning: Unchallenged Frames Self-Cement Over Time

This is important to remember

Frames that go unchallenged tend to cement themselves. Especially when they repeat over time.

What happens is that the frame, from a verbal or nonverbal statement that simply describes or comments on reality, becomes more and more a reality of your shared (social) life.

This is a very important principle, because it means that if you let bad frames go unchallenged, then you lose arguments and/​or persuasive power forever, not just for the few seconds that the frame lasts. And if they are repeated frames, they can also compound power over time

A similar principle applies to micro-aggressions: if you let micro-aggressions go unchallenged, then they build up, and you die by a thousand cuts.

This usually means that it’s a good idea to get in the habit of challenging most frames that are irrational/​disagreeable early on in every new relationship

• This doesn’t sound wrong exactly but it does sound icky.

It seems to be missing “we are talking to each other in good faith, cooperatively; we point out the existence of the frame choices rather than sneakily trying to end up with a frame that’s good for what we want right now”.

I mean it’s technically kindasorta there in some of the expanding, like “you think so? But this other person said the opposite”. But the spirit still seems adversarial and manipulative, even in “win-win”. Like… “the only reason I’m not punching you is because you got lucky and accidentally agree with what I want”.

If I used these techniques with myself it would feel like bad brain habits.

I don’t want to be on the receiving end.

Maybe this is supposed to be applicable only in situations where you’re fine treating people as NPCs to be manipulated? If so, add that context, on LW. If not—FYI, it came off as if it was, to at least one person, namely me.

• GuySrinivasan, these are instructions on casting a dark spell, step by step

You don’t cast Avada Kedavra with happy thoughts, you cast it with the intention to kill

You cast fiendfyre with blood

And you cast “TARE DETRIMENS” by having very bad brain habits, on average

This wasn’t a guide for the purpose of doing it. This was a guide for the purpose of recognizing it when it’s done to you, seeing them dance the steps, and having them reified

If it wasn’t “icky”, why would it be a dark art?

• Ah, I was confused the whole time.

how the spell casting takes place, and how to Counterspell

It happens in 5 steps

I thought you were trying to show us how to Counterspell! :D

• “Wouldn’t it make more sense to use as a reward signal the fact-of-the-matter about whether a certain system followed a particular human’s intention?”

If I understand what you are saying correctly, this wouldn’t work, for reasons that have been discussed at length in various places, e.g. the mesa-optimization paper and Ajeya’s post “Without specific countermeasures...” If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or appearing to follow said human’s intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

And so if you train an AI to build another AI that appears to follow a particular human’s intention, you are just training your AI to do capabilities research.

(Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so as far as we can tell from the outside. Well, how are we going to construct such a reward signal? That would require getting all the way to the end of evhub’s Interpretability Tech Tree.)

• If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or appearing to follow said human’s intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future. Rather than an aligned AI.

Right yeah I do agree with this.

Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the humans intention, not merely appeared to do so [...] That would require getting all the way to the end of evhub’s Interpretability Tech Tree

Well, I think we need something like a really-actually-reward-signal (of the kind you’re pointing at here). The basic challenge of alignment as I see it is finding such a reward signal that doesn’t require us to get to the end of the Interpretability Tech Tree (or similar tech trees). I don’t think we’ve exhausted the design space of reward signals yet, but it’s definitely the “challenge of our times” so to speak.

• I still think this is great. Some minor updates, and an important note:

Minor updates: I’m a bit less concerned about AI-powered propaganda/​persuasion than I was at the time, not sure why. Maybe I’m just in a more optimistic mood. See this critique for discussion. It’s too early to tell whether reality is diverging from expectation on this front. I had been feeling mildly bad about my chatbot-centered narrative, as of a month ago, but given how ChatGPT was received I think things are basically on trend.
Diplomacy happened faster than I expected, though in a less generalizable way than I expected, so whatever. My overall timelines have shortened somewhat since I wrote this story, but it’s still the thing I point people towards when they ask me what I think will happen. (Note that the bulk of my update was from publicly available info rather than from nonpublic stuff I saw at OpenAI.)

• I feel like your predictions for 2022 are just a touch over the mark, no? GPT-3 isn’t really ‘obsolete’ yet or is that wrong?

I’m sure it will be in a minute, but I’d probably update that benchmark to mid-2023, or to whenever GPT-4 gets released.

I really feel like you should be updating slightly longer, but maybe I misunderstand where we’re at right now with chatbots. I would love to hear otherwise.

• In some sense it’s definitely obsolete: namely, there’s pretty much no reason to use the original GPT-3 anymore. Also, up until recently there was public confusion because a lot of the stuff people attributed to GPT-3 was really GPT-3.5, so the original GPT-3 is probably a bit worse than you think. Idk, play around with the models and then decide for yourself whether the difference is big enough to count as obsolete.

I do think it’s reasonable to interpret my original prediction as being more bullish on this matter than what actually transpired. In fact I’ll just come out and admit that when I wrote the story I expected the models of December 2022 to be somewhat better than what’s actually publicly available now.

• [ ]
[deleted]
• I’m fairly sure Tononi said multiple times that IIT implies a simulated brain would not be conscious. I’m not sure how this affects the Chinese room, but it seems plausible it would work by simulating a brain. Then it wouldn’t be conscious.

Why does this follow? The simulation still has states and information that can be integrated.

• This seems to me like a “you do not understand your own values well enough” problem, not a “you need a higher moral authority to decide for you” problem.

Or, if we dissolve the idea of “your values” as something that produces some objective preference ordering (which I suppose is the point of this post): you lack a process that allows you to make decisions when your value system is in a contradictory state.

• this seems to me to excessively dimensionality-reduce the political spectrum. for example, where does war profiteering fit?

• I think the political spectrum doesn’t quite line up with this. For basically any point on the compass, there will be things that should be managed in a decentralized way following Commercial precepts, and things that should be managed in a centralized way following Guardian precepts. The question is just which activities fall in which bucket. [Is medicine a good that should be bought and sold like any other, or largesse which should be dispensed?]

But some sets of choices will be more synergistic or more contradictory than others; applying this technique to the political spectrum might identify a few good clusters and a bunch of worse hybrids. [Given that politics is mostly about coalitions and loyalty instead of technical coherence, my guess is this won’t be super useful.]

• The political spectrum is quite orthogonal to this. You’ve got communist Guardians who want to protect the means of production and share out all the goods along with Commercial communists who think that if we just sit down and decide to all work together we’ll come up with plans that will equally enrich everyone. Or fundamentalist Guardians who just want to ensure everyone’s purity along with Commercial fundamentalists who are arguing among themselves as to what’s the best way of interpreting a single word in the Bible.

It might be more useful to compare this with Scout and Soldier mindsets, which seem to be pointing in vaguely the same direction, but in the area of epistemics, rather than morals.

• ah, then maybe I’m misunderstanding at a deeper level. I will abstain from further comment for now.

• So, I think firms that sell weapons to individuals and governments broadly fall under the Commercial cluster; following the Guardian precepts as such a firm is probably a mistake. Note that these are ethical standards, so you could look at any individual firm and ask whether they’re following the precepts in particular cases. I suspect that most cases of war profiteering are a failure on the buyer’s side, at least as far as this view is concerned.

There is something interesting here with the question of largesse—traditionally, the Guardian’s role is to take resources from their territory and then spend those resources on buying loyalty /​ public goods. The military-industrial-complex is often this sort of largesse operation, but it’s not obvious that it should be. [Similarly, Jacobs talks a lot about how government meddling in agriculture is probably downstream of agriculture’s traditional role as powerbase for Guardians, but they tend to have lower yields /​ be worse at it than Commercial agriculture.]

There’s also this point that—the Guardians do need to be involved in trading! Even if the Baron isn’t supposed to engage in business himself, he still has things he needs to buy, taxes he needs to collect, and so on. This means there needs to be some sort of agent who is able to engage in trade, and presumably does so mostly using the Commercial precepts, and hopefully with a lessening of the implicit threat.

• I’d love to hear why this warrants front page, and I’d love to hear what Valentine hopes to gain from asking this! This is a topic with a large amount of adversarial agency coming from the right’s culture war. What leads you to bring that here, and why is it worth a frontpage on the ai safety forum?

• What leads you to bring that here

I noticed I was confused. The world didn’t make sense to me at this spot. I could guess at some pieces, like “Okay, maybe wokism is actually just really super popular”, but that didn’t account for all the pieces I was observing.

I imagined that Less Wrong would be a good place to ask people about this in a way relatively unlikely to swing into culture war baloney. I just want to understand how the world is shaped.

why is it worth a frontpage on the ai safety forum?

I… have no idea. I didn’t do that. Or if I did it was purely by accident. I wouldn’t have guessed this belonged at all in anything having to do with AI risk, other than it being about modeling the world, which is generically connected to AI risk in an overall kind of way.

• to clarify, I was asserting that less wrong is the AI safety forum of the world.

• Oh! Ha! Okay. Well, I view Less Wrong as the rationality forum of the world, which happens to include a lot of examination of AI safety/​risk. If there were a division within LW between “AI” and “not AI”, I totally would have put this in the “not AI” category.

• I wish I could only subscribe to the AI safety stuff but still get all of the thinking out loud the AI safety people do. unfortunately that often doesn’t get tagged AI. I also have trouble getting into a mindset where anything worth applying rationality to isn’t therefore, by nature of wanting rationality, fundamentally an AI problem—after all, rationality is the task of building a brain in a brain.

I’m afraid I can’t answer your other question in the other thread, my brain is crashing repeatedly if I try to.

• The question “why do companies do something seemingly unprofitable” is in my opinion worth asking.

The answers seem to be one of:

• it actually is profitable, because...

• a principal-agent problem, the people doing the thing are not aligned with the company (and the company will not replace them, because...)

Both seem likely, I wish I could figure out which one is true (possibly both).

• Is it on the AI safety forum, though? Turns out it is, though downvoted...

This is an interesting question, whatever your political bent—there is a noticeable uptick in representation etc. in new media. It’s worth understanding the underlying mechanisms at work, seeing as whatever the reason for the changes, they happened quite fast. That holds whether you’re for such changes or against them.

• The customers. Some companies go hard anti-progress, some companies go hard inclusion. Since many people don’t want to allow anyone who matches BIPOC or LGBTQIA+ to exist, they don’t buy anything that matches those.

note that the companies who do this still typically fund the far right.

also note that the military is threatening to move their bases out of areas that go anti-bipoclgbtqia+. go anti-woke, go broke.

edit to clarify: In other words, I see this as likely to be viewed as a means to political power by those running these corporations. it’s not a short-term economic incentive that leads them to offend some of their customers.

• Heavily downvoted for (a) not answering the question and (b) instead using this space as an opportunity to signal boost the left’s narrative in this particular corner of the culture wars.

[Edited to correct an inappropriate blindness on my part.]

• look if you want to signal boost the right’s culture war, go for it, but I’m pretty sure most of this is just that it’s profitable and Disney is predatory and trying to profit off of it without actually making any change. you want to fight for the right, go ahead

• look if you want to signal boost the right’s culture war, go for it…

Wanting to flag this as another example of frame control.

I’m not trying to align with the right. I think they’re nuts in almost exactly the same way.

My position is more like anti culture war escalation.

Sadly, this means that when someone is heavily aligned with one side of a front of the war, I can come across as aligning with their enemies.

Alas.

• hmm. perhaps the key thing necessary to disengage with it is to brutally word-replace words used by everyone into custom academic-style language. eg use of the word “woke” is a very strong right-side signal. that’s what I’m getting at with those points in my comment.

• Mmm. As I just mentioned here, I actually didn’t know use of “woke” got interpreted as a “very strong right-side signal”. Lots of left-leaning folk around me use the term too. What does your corner of the left call it?

• I’m sorry, what? but like, that’s… actually true? I did answer the question, your dismissal cannot be in good faith if you think I didn’t.

• I did answer the question, your dismissal cannot be in good faith if you think I didn’t.

Ah, I think I missed this part of your comment here. Not sure how that happened.

On this I apologize. I missed that you were honestly trying to answer the question. Mea culpa.

The info commons point still stands though.

• okay, then that’s fair. and yes, I repeat the liberal center’s talking points because I simply do in fact believe they are true; downvoting me hard because of them feels like an actual dismissal. I admit that I noticed myself feeling agitated and fighty and that I have had a repeated pattern of engaging that way; the part of my brain that wants me to do that doesn’t seem to be able to disengage, even after an internal conversation about it that part just says look, I have to be here, it’s important. I have asked for my karma to be slashed because of a similar fight the other day. sorry about this anger tone thing

• Cool. Thank you for explaining.

I also am sure that a more skillful version of me could have named and navigated all this with a lot more grace. You seem to be sensitive on this spot, but I also kind of hit it with a hammer. Sadly I don’t yet see a more graceful way to do the thing I’m caring for without fawning. But I’ll get there. I regret you got hit in the process.

• There are ways to say true things that are partly, mostly, or entirely for reasons having nothing to do with conveying the truth.

Notice the contrasting terms “anti-progress” vs. “inclusion”. And the framing about some folk not wanting categories of people to exist, rather than any kind of framing that such folk are caring for something that matters to them and might matter to civilization. And the injection of a slogan in the last line.

The tone of this isn’t about explaining something. It’s implicitly asserting that wokism is just overwhelmingly popular, which is the closest to an answer to the original question as is given. But it’s mostly about frame assertion.

I don’t care how true someone’s utterances are, or what side of the culture wars they’re fighting for, if they’re bringing in tactics like these. It heavily pollutes the information commons.

• you bring up the right’s frame, I’ll bring up the frame that actually describes my life. I’m responding like my life is being attacked because it is (of course not by this conversation). I used to trust you.

• I think I still appreciate you even though I’m pretty annoyed at you right now. it’s possible I will come to agree more, though I would appreciate understanding expressed in the form of recognizing why I find the word woke to be an aggressive choice of words. that said, I also wouldn’t blame you for deleting all of my comments, because other people have expressed my views better and I clearly have toxicity damage about this topic

edit: oh I see you did that on the other thread. conversation understanding appreciated

• …I would appreciate understanding expressed in the form of recognizing why I find the word woke to be an aggressive choice of words.

Oh. Actually I didn’t know that “woke” was a problematic word in this corner of memespace. I was just trying to point at the memetic structure and found this term lying around. I’ve seen it on media from far left to far right, so I’d guessed it was just the word everyone had converged on for referring to this thing.

Is there a word you prefer?

• My brain is too crashed to answer, and I don’t expect to be able to read this thread without reentering this mindspace. having said my concern, I leave it up to you to detect the degree to which I am or am not right about it and what to do. I might consider recommending an increased-specificity unpack taboo rather than a word-substitution taboo.

• A trick I sometimes use, related to this post, is to ask whether my future self would like to buy back my present time at some rate. This somehow makes your point about intertemporal substitution more visceral for me, and makes it easier to say “oh yes this thing which is pricier than my current rate definitely makes sense at my plausible future rate”.

• In fact, it’s not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don’t know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for Uranium isotope separation.]

• You’re thinking at the wrong level of abstraction. There is no economic incentive for wokism at the corporate level. But look one level below. The question isn’t what causes “corporations” to act in woke ways. The question is, what persuades employees of corporations to act in woke ways?

My hypothesis is that anti-discrimination legislation has, due to court precedents, developed an inverted burden of proof. If a corporation fires or disciplines someone who is non-white, female, disabled, or belongs to a number of other protected categories, it is now up to the corporation to prove that the firing or discipline was done for non-discriminatory reasons. This, combined with the ideological leanings of most people in HR departments, is sufficient to ensure that every corporation has, within it, the equivalent of an ideological cell, whose job it is solely to push the corporation to act in a more woke manner. This ideological cell has both public opinion and federal law on its side; well meaning individuals who push back end up like James Damore.

But unless this had profit appeal I would expect the market to just… eat pure but incomplete ideological capture after a while

The market is part of society. There was a similar argument made against anti-segregation legislation in the 1960s. After all, given that it’s more profitable to sell to both black people and white people than it is to sell to white people only, wouldn’t it be in business owners’ rational self-interest to desegregate their properties?

The answer, in both instances, is the same: if there is a sufficiently high cultural barrier, then it will be more profitable to go with the culture than against it. Most reasonable people can at least nod along to the woke slogans. After all, it is quite reasonable to suggest that women ought to be treated equally to men, that blacks should be treated equally to whites, and people shouldn’t be discriminated against because of their sexual orientation. It’s only when those reasonable propositions are taken to extremes that they result in wokism.

Because of this motte-and-bailey aspect to wokism, it’s easy for wokism to permeate the culture, and for advocates of wokism to tar those who oppose them as racists and bigots.

But there’s a counter-push of “Lots of people don’t like being lectured about politics when they’re seeking entertainment” (for instance). It’s not at all clear to me that the first effect is so utterly hugely enormously larger than the second that the profit incentive would cause so many companies to swing hard woke.

Lots of people also threatened to move to Canada if Trump was elected President. How many of them actually chose to do so? A Republican in the United States will shout vociferously about Coca Cola or Nike engaging in woke behavior, but will he or she choose Pepsi when he or she next shops for groceries? Will he or she buy some other brand of shoes? And if he or she does, will it make a difference? After all, Pepsi and Reebok are hardly less woke than Coca Cola and Nike.

• Thank you, this was helpful.

Lots of people also threatened to move to Canada if Trump was elected President. How many of them actually chose to do so?

I don’t think this is the right analogy. Listening to more moderate and right-leaning folk, one gets the impression that viewership of shows and movie franchises that are going woke has been dropping like a rock. Like apparently there was an analysis of when people turned off the Captain America streaming show on Disney+ (I forget its name — the one where Falcon becomes the new Captain), and the moment it plunged was the scene where police were harassing Falcon due to racial profiling.

Maybe the most befuddling part of the culture wars is the way, on every front, soldiers on both sides muddle the facts. It’s hard for me to tell what’s even true. Daniel Schmachtenberger describes this as “polluting the information commons”. There’s a Molochian dynamic where the facts of the matter are part of what’s being fought over.

That’s why I’ve been looking at places that have a profit incentive and are also catering specifically to wokism, noticing I don’t see a corresponding shift in the other direction at the same scale, and kind of scratching my head. Wokism doesn’t look like as obviously profitable a thing to align with as these companies’ behavior would suggest. Even if it is, that seems like it’d be hard to determine.

The possibility that it’s actually more like legally imposed internal friction makes some good sense. I doubt that’s the full picture but it’s a plausible major component.

• Ah, so being woke increases your job safety, if you are a member of one of the protected groups. It makes your membership in that group more salient; if you ever get fired, it makes it easier to argue that you were fired because you belonged to that group.

And there is no obvious counter-strategy, because telling them “stop talking about your membership in a protected group all the time” can itself be interpreted as attacking the group.

• A concrete example of this inversion of the burden of proof arose just today, with regards to the Twitter layoffs:

Their complaint cites calculations provided by Mark Killingsworth, an economics professor at Rutgers University, to allege that, overall, “57 percent of female employees were laid off on November 4, 2022, while 47 percent of male employees were laid off.”

This lawsuit is not alleging that any specific discriminatory behavior took place, or that discriminatory reasoning was used by managers in choosing who got the pink-slips and who got to stay on. Rather, the brute fact that more women than men were laid off is used as evidence to assert that Twitter was targeting women. Now, it’s up to Twitter to show that it was not behaving in a discriminatory manner in conducting its layoffs.

• …the question does sometimes haunt me, as to whether in the alternative Everett branches of Earth, we could identify a distinct cluster of “successful” Earths, and we’re not in it.

This Failing Earth, Eliezer Yudkowsky

Does anyone else wonder similar things about the EA/rationality scene? If we could scan across Tegmark III, would we see large clusters of nearby Earths that have rationality & EA communities that embarrass us and lay bare our own low standards?

• 9 Dec 2022 3:06 UTC
2 points
1 ∶ 1

There are two major unexamined assumptions underlying this analysis.

The most flagrant is the assumption that the expected value of all work done now on x-risk is positive. You might hope that it is, but you can’t actually know or even have rationally high confidence in it. Without this assumption, you might be able to say that anything we do today is important, but can’t say that it’s equivalent to saving lives. You may equally well be doing something equivalent to ending lives.

Another serious unjustified assumption is that the correct measure is some aggregated utility that is linear in the number of people who come to exist. I have extreme doubts that murdering 7 billion people today is ethically justifiable if it would increase the population capacity of the universe a trillion years from now by 0.0000000000000000000000000000000000000001% even though it means that a lot more people get to live. Likewise I have an expectation that allowing capacity for one more potential person to exist a trillion years from now is morally much less worthwhile than saving an actual person today.

• Oh, almost forgot!

As to your second objection, I think that for many people the question of whether murdering people in order to save other people is a good idea is a separate moral question from which altruistic actions we should take to have the most positive impact. I am certainly not advocating murdering billions of people.

But whether saving present people or (in expectation) saving many more unborn future people is a better use of altruistic resources seems to be largely a matter of temperament. I have heard a few discussions of this and they never seem to make much sense to me. For me it is literally as simple as people being further away in time, which is just another dimension, not really any different from spatial dimensions, except that time flows in one direction and so we have much less information about it.

But uncertainty only calls into question whether or not we have impact in expectation, for me it has no bearing on the reality of this impact or the moral value of these lives. I cannot seem to comprehend why other people value future people less than present people, assuming you have equal ability to influence either. I would really like for there to be some rational solution, but it always feels like people are talking past each other in these types of discussions. If there is one child tortured today it cannot somehow be morally equivalent to ten children being tortured tomorrow. If I can ensure one person lives a life overflowing with joy today, I would be willing to forego this if I knew with certainty I could ensure one hundred people live lives overflowing with joy in one hundred years. I don’t feel like there is a time limit on morality, to be honest it still confuses me why exactly some people feel otherwise.

You also mentioned something about differing percentages of the population. Many of these questions don’t work in reality because there are a lot of flow-through effects, but if you ignore those, I also don’t see how 8,000 people today suffering lives of torture might be better than 8 early humans a couple hundred thousand years ago suffering lives of torture, even if that means it was 1/1,000,000 of the population in the first case (just a wild guess) and 1/1,000 of the population in the second case.

These questions might be complicated if you take the average view on population ethics instead of the total view, and I actually do give some credence to the average view, but I nonetheless think the amount of value created by averting X-risk is so huge that it probably outweighs these considerations, at least for the risk-neutral.

• Interesting objections!

I mentioned a few times that some and perhaps most x-risk work may have negative value ex post. I go into detail in footnote 13 on how work may plausibly be negative.

It seems somewhat unreasonable to me, however, to be virtually 100% confident that x-risk work is as likely to have zero or negative value ex ante as it is to have positive value.

I tried to include the extreme difficulty of influencing the future by giving work relatively low efficacy, i.e. in the moderate case 100,000 (hopefully extremely competent) people working on x-risk for 1000 years only cause a 10% reduction of x-risk in expectation, in other words effectively a 90% likelihood of failure. In the pessimistic estimate 100,000 people working on it for 10,000 years only cause a 1% reduction in x-risk.

Perhaps this could be a few orders of magnitude lower, say 1 billion people working on x-risk for 1 million years only reduce existential risk by 1 in 1 trillion in expectation (if these numbers seem absurd you can use lower numbers of people or time, but this increases the number of lives saved per unit of work). This would make the pessimistic estimate have very low value, but the moderate estimate would still be highly valuable (10^18 lives per minute of work).

All that is to say, I think that while you could be much more pessimistic, I don’t think it changes the conclusion by that much, except in the pessimistic case—unless you have extremely high certainty that we cannot predict what is likely to help prevent x-risk. I did give two more pessimistic scenarios in the appendix which I say may be plausible under certain assumptions, such as 100% certainty that X-risk is inevitable. I will add that this case is also valid if you assume a 100% certainty that we can’t predict what will reduce X-risk, as I think this is a valid point.

• 9 Dec 2022 2:58 UTC
16 points
6 ∶ 2

That you think they’re going super hard woke (especially Disney) is perhaps telling of your own biases.

Let’s look at Disney and Hollywood (universities are their own weird thing). The reality is that in the Anglosphere there are lots of progressive people with money to spend on media. You can sell “woke” media to those people, and lots of it. Even more so when there’s controversy and you can get naive lefties to believe paying money to the megacorp to watch a mainstream show is a way to somehow strike back against the mean right-wingers. And to progressive people it doesn’t feel like “being lectured to about politics”, because that’s not what media with a political/values message you agree with feels like. So going woke is 100% a profit-motivated decision. The leadership at big media companies didn’t change much over the last decade or two, nor likely did their opinions (whatever those actually are). But after gay marriage gained significantly above a 50% approval rate in the US and the Obergefell decision happened, it became clear to them that it was safe to be at least somewhat socially progressive on issues like that, and that it would be profitable.

But equally, almost every single “woke” Disney movie has the “woke” components carefully contained such that they can easily be excised for markets where they are a problem. You see a gay kiss in the background of a scene in Star Wars, it gets cut for the Chinese and Middle East markets. Disney has many very progressive employees who are responsible for making the actual art they produce; artists lean pretty strongly progressive in my experience, so of course the employees’ values come out in the art they make. But the management puts very strict limits on what they can do precisely because anything less milquetoast is believed to be less profitable.

• That you think they’re going super hard woke (especially Disney) is perhaps telling of your own biases.

…and then you go on to describe how Disney is in fact selling movies with woke components to the West, which is exactly what I was talking about.

Just… don’t do this. I’m not available for this kind of psychoanalysis. I find it extremely difficult to engage in good faith when people make moves like this one. My biases are my business. If you think I’m missing something, just point it out. Don’t try to diagnose my failures of rationality.

• You are making the mistake of assuming that because the median Chinese citizen is ideologically opposed to the American left in a technical sense, Disney’s localizing movies for China means that it’s not a captured institution. But in fact the American left cares very little about the beliefs and attitudes of the median Chinese citizen because those people compete in a different political arena than them. More telling than the fact that Disney localizes for China at all is the fact that they refuse to make high budget, well marketed movies catering to (for example) the American Christian right, even though such a niche has been proven to be very profitable for independent filmmakers.

• Or just visit to get information. Don’t choose antibiotics vs vitamins based on estimated value delivered, but diversify to learn about them all, to learn what it takes to deliver them. But the most valuable information will probably be unrelated to what you bring.

• I give a crisp definition from 6:27 to 7:50 of this video:

• [Chart: Chinese total cases]

Like I predicted last week, Chinese COVID numbers are going down.

However, most of this decline is from asymptomatic cases.

[Chart: Beijing cases]

This is… interesting? Maybe less testing → fewer false positives. This doesn’t match case decline in previous months, when a decrease in asymptomatic cases almost always came with a corresponding decrease in symptomatic cases.

Really makes you think. Any ideas?

• Proposal: consciousness very much exists, but continuity of consciousness is an illusion.

If we assume that each moment of consciousness is its own entity, with no connections to any other, we can dissolve many problems around continuity of consciousness, like simulations, teleportation, change of computation substrate, etc.

• If we assume that each moment of consciousness is its own entity, with no connections to any other,

Why should we assume it? My consciousness now clearly does have connections to my consciousness one second ago, three hours ago, twenty years ago. One might as well assume that for the keyboard I am using, each moment of its existence has no connection to any other. It is straightforwardly false.

• Executive Summary

1. Big jump in cases and hospitalizations likely means winter surge.

Not sure how reliable case counts are at all, since they go down whenever the government shuts down testing centers. They should at most be considered alongside case positivity rates; given the risk that something goes wrong with measuring case positivity, I’m not sure why case counts would be considered a very good way of measuring the pandemic on their own.

2. Chinese protests suppressed, some modest loosening did result.

This is definitely an area where correlation does not strongly imply causation, because it is well within the interests of the PRC to visibly halt opening up after large-scale protests demand opening up, in order to discourage future protests, and because protest organizers in China are definitely capable of strategically timing protests before opening up was already planned, in order to make it look like the protests caused the subsequent opening up.

I’m not saying that protests didn’t increase the subsequent loosening (protests tend to do that, so it could possibly happen in China), but lots of unreliable sources are loudly trumpeting that exact claim, so the burden of proof for this is much higher than anything mentioned in the China section of this post.

3. Long Covid study finds control group that had other respiratory illnesses did worse than the Covid group.

How were the covid-positive and covid-negative categories sorted? Rapid antigen tests had a massive false negative rate, and that was before omicron. I’ve encountered a lot of anecdata of intelligence/energy being permanently lowered after a covid infection, such as insomnia and shortened attention spans. We’re still in the middle of the post-truth infodemic, and there’s a long history of unusually flawed studies that claim to confirm or deny covid brain damage. So I don’t see why this particular study is supposed to count as any sort of “reiteration of the central point of Long Covid” when any single methodological flaw would make it about as likely to point in the wrong direction as the right one, and such flaws are extremely common in publicly available Long Covid studies.

• This might be the lowest karma post that I’ve given a significant review vote for. (I’m currently giving it a 4). I’d highly encourage folk to give it A Think.

This post seems to be asking an important question of how to integrate truthseeking and conflict theory. I think this is probably one of the most important questions in the world. Conflict is inevitable. Truthseeking is really important. They are in tension. What do we do about that?

I think this is an important civilizational question. Most people don’t care nearly enough about truthseeking in the first place. The people who do care a lot about truthseeking tend to prefer avoiding conflict, i.e. tend to be “mistake theory” types.

Regular warfare is costly/terrible and should be avoided at all costs… but, “never” is just not an actually workable answer. Similarly, deception is very costly, in ways both obvious and subtle. One of my updates during the 2019 Review was that it is plausible that “don’t lie” is actually even more important than “don’t kill” (despite those normally being reversed in my commonsense morality). But, like violent warfare, the answer of “never” feels like an overly simplified answer to “when is it acceptable to lie?”

Eliezer’s discussion of meta-honesty explores one subset of how to firm up honesty around the edges. I like Gentzel’s post here for pointing in a broader direction.

This is not necessarily an endorsement of any particular point made here, only that I think the question is important. I think people who gravitate towards “truthseeking above all else” have a distaste for conflict theory. Unfortunately, I trust them more than most conflict theorists on how to develop norms around truthtelling that hold up under extreme conflict.

• lsusr, if it was proven that the human brain actually does work on quantum principles, how would that change your view on free will?

• 9 Dec 2022 1:37 UTC
7 points
4 ∶ 0

This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.

I’m curious how you’re trying to reach such an audience, and what their reactions have been.

• Universities are profit-focused? Disney and Hollywood are two distinct systems?

• Universities are profit-focused?

• Harvard: $51B
• Yale: $42B
• Stanford: $37B
• Princeton: $37B
• MIT: $27B
• UPenn: $20B

• ...

How do they get there? It’s not through lack of trying, and the majority of it is not tuition. Rather:

• My mom is familiar with a few of the above universities, and has said that “Napoleon would be proud” of how organized and efficient they are at hounding alumni for donations.

• I think they also care a great deal about getting money from research grants. I’ve heard many professors feel pressure to get grants. Probably in part because:

• There’s an entire system for managing money that has been given to them with strings attached (e.g. funding XYZ research) and always using the most-restricted money to pay for a given thing. For example, maybe the university needs to spend $20k on maintenance for a telescope, but then if they’re given a grant of that size to do astronomy research, they can use the grant to pay for that maintenance, so the grant has effectively given them $20k to do anything they want with. This does make sense—it’s rational behavior and it is fulfilling the terms (Mom said they’re very careful to remain within the letter of the law, especially for government grants), but it has interesting consequences.

• And then there’s investment earnings from past years’ endowments.

• I was reading this and found myself mentally renaming this to an “anti-enlightened” agent. It does suggest that this might come in gradients. If there are only very specific and rare ways to update a deeper layer, the agent might seem like a wrappermind for a long while without actually being one. Taking 30,000 years to go from an 8-year-old’s love of spaceships to a 10-year-old’s love of spaceships is still multiple millennia of rough time. Any mind with a physical substrate (which should be all of them) will be alterable by hitting the hardware. This means that a true or very hard wrappermind will have to deny access to a specific spatial point very strongly.

Also, for anything that is not a wrappermind, its uppermost layer can be rewritten. Such a thing can’t have an “essential nature”.

Now it would seem that for most agents, the deeper a layer is, the harder it is to guess its malleability, at least from the outside. And it might not be obvious even from inside.

• Thanks for writing this post.

You mention that:

only conscious beings will ask themselves why they are conscious

But at the same time you support epiphenomenalism whereby consciousness has no effect on reality.

This seems like a contradiction. Why would only conscious things discuss consciousness if consciousness has no effect on reality?

Also, what do you think about Eliezer’s Zombies post? https://www.lesswrong.com/posts/7DmA3yWwa6AT5jFXt/zombies-redacted

• A large part of it is the US legal system and anti-discrimination law playing out in counterintuitive ways. The key thing is that where corporations are concerned, US law runs on counterfactual court cases; the actual text of legislation matters only insofar as it affects those court cases. Combine this with management having imperfect control over employees within a corporation, imperfect resolution of facts, and a system for assigning damages that’s highly subjective, and executives are left in an odd position.

Every company which does a significant amount of hiring and firing, ie every company above a certain size, will fire and reject some number of people in protected groups. Some of those people will claim that it was because of their group membership, and sue. As a distant corporate executive, you can’t prevent this, and can’t tell whether the accusation is true.

But you can put everyone through some corporate training. And it seems that the empirical result, discovered by legal departments that have been through this many times, is that you get the best outcomes in the court cases if you go over the top and do reverse-discrimination that the letter of the law says should be illegal.

• From skimming the benchmark and the paper this seems overhyped (like Gato). roughly it looks like

• May 2022: Deepmind releases a new benchmark for learning algorithms

• ...Nobody cares (according to google scholar citations)

• Dec 2022: Deepmind releases a thing that beats the baselines on their benchmark

I don’t know much about GNNs & only did a surface-level skim so I’m interested to hear other takes.

• Off the top of my head (and slightly worried that this will become a major culture war thing, but I will answer the question that was asked):

• There is a principal-agent problem. If pursuing wokeness comes at the expense of profits, the latter doesn’t necessarily affect the people who make those decisions very much.

• My impression is that many of the executives are in fact woke, and others are at least unwilling to say otherwise.

• Wokeness seems pretty optimized for shouting down and intimidating opposition. (I think much of the specifics of the ideology were and are determined by some people successfully shouting down others within the woke movement.)

• At least in the entertainment industries, when a distinctly woke thing is made, there tends to be a narrative that evil people hate the thing, and therefore anyone who hates the thing is evil, and therefore lost profits should be treated with an attitude of “good riddance” rather than “maybe this thing was made badly”. I think this tends to be the woke narrative, and generally promoted by media—and, as per the previous item, any opposing narrative would tend to get shouted down.

• Aren’t CEOs mostly Republicans? And what’s stopping the shareholders from insisting on prioritizing profit?

• I’m thinking of tech companies that tend to be based in the SF Bay Area, and the most prominent entertainment companies are Hollywood—both of which are known for being more lefty. Also, CEOs are one thing, but other executives matter too; and writers and directors especially in entertainment.

Regarding shareholders, I don’t really know how that works. I do think it’s a general fact that getting a zillion people to coordinate on expressing their wishes is difficult. There’s a board of directors, who I guess nominally represent shareholders? Looks like every company can have their own rules, though I assume they’re mostly similar; looking at Disney’s bylaws, it says:

SELECTION OF NEW DIRECTORS
The Board shall be responsible for selecting its own members. The Board delegates the
screening process for new Directors to the Governance and Nominating Committee.

Although “Each Director shall at all times represent the interests of the shareholders of the
Company”, I suspect this is difficult to enforce. If the board ends up dominated by a woke narrative (with at least a vocal minority of woke people and a majority of people who shut up and go along with it), leading to unprofitable decisions, what can the shareholders do about it, other than sell their stock? “Shareholder revolts” are a thing, which implies that the divergence between shareholders’ desires and what the board is doing can indeed get pretty wide (though also implies that they can eventually get their way).

I do suspect that the profit motive will ultimately reassert itself, but it seems to have taken a long time and doesn’t show major signs of happening yet. It may take an “everyone knows that everyone knows that the woke decisions have gotten really bad” moment, which the woke narrative promoted by most media is probably delaying.

• Typically board members are elected by shareholders, and an attacker can win a proxy fight with a relatively small portion of the shares if he can convince other shareholders.

• 9 Dec 2022 0:16 UTC
1 point
3 ∶ 2

Random theory I heard: When Disney releases a new black princess, the fact that toxoplasma of rage forms around it provides them a lot of free advertising. Most people are like ‘shrug’ and don’t care that much, but the fact that everyone’s complaining and/​or hyping it gets it onto most people’s radar.

• I do think this is likely to be part of the strategy that the right-wingers who actually own the companies are intending to use, but that said, that’s just because the toxoplasma-of-rage filter requires marketing departments to think about what is not only an improvement, but an improvement that will offend the right.

• Solve the puzzle: 63 = x = 65536. What is x?

(I have a purpose for this and am curious about how difficult it is to find the intended answer.)

• So x = 63 in one base system and 65536 in another?

6*a+3=6*b^4+5*b^3+5*b^2+3*b+6

Wolfram Alpha provides this nice result. I also realize I should have just eyeballed it with 5th grade algebra.

Let’s plug in 6 for b, and we get… fuck.

I just asked it to find integer solutions.

There’s infinite solutions, so I’m just going to go with the lowest bases.

x=43449

Did I do it right? Took me like 15 minutes.
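For what it’s worth, a quick brute-force sketch (assuming the commenter’s reading: “63” in some base a and “65536” in some base b denote the same base-10 value x, with both bases at least 7 since the digits 6 and 5 must be valid) agrees with the x = 43449 answer:

```python
# Sketch: assume "63" read in base a and "65536" read in base b name the
# same number x. The digits 6 and 5 force both bases to be at least 7.
def value(digits, base):
    """Interpret a digit list as a number in the given base."""
    v = 0
    for d in digits:
        v = v * base + d
    return v

def find_smallest():
    # "65536" grows fastest with its base, so search over b and solve for a:
    # 6*a + 3 = x  =>  a = (x - 3) / 6 must be an integer >= 7.
    for b in range(7, 100):
        x = value([6, 5, 5, 3, 6], b)
        if (x - 3) % 6 == 0 and (x - 3) // 6 >= 7:
            return x, (x - 3) // 6, b

print(find_smallest())  # (43449, 7241, 9)
```

That is, the smallest solution is “63” in base 7241 and “65536” in base 9, both equal to 43449, matching the algebraic approach above.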

• 8 Dec 2022 23:18 UTC
LW: 8 AF: 6
1 ∶ 2
AF

I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:

• Publishing capabilities work is notably worse than just doing the work.

• I’d argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.

• Though, a counter-point is that if an organisation doesn’t have great cyber-security and is a target for hacking, capabilities can easily leak (see, e.g., the Soviets getting nuclear weapons 4 years after the US, despite it being a top-secret US program and before the internet)

• Capabilities work can be importantly helpful for alignment work, especially empirically focused work.

Probably my biggest crux is around the parallel vs serial thing. My read is that fairly little current alignment work really feels “serial” to me. Assuming that you’re mostly referring to conceptual alignment work, my read is that a lot of it is fairly confused, and would benefit a lot from real empirical data and real systems that can demonstrate concepts such as agency, planning, strategic awareness, etc. And just more data on what AGI cognition might look like. Without these, it seems extremely hard to distinguish true progress from compelling falsehoods.

• When another article of equal argumentative caliber could have just as easily been written for the negation of a claim, that writeup is no evidence for its claim.

• 8 Dec 2022 23:06 UTC
0 points
0 ∶ 0

Code that I uploaded to GitHub and the writing that I’ve put into this blog went into training these models: I didn’t give permission for this kind of use, and no one asked me if it was ok. Doesn’t this violate my copyrights?

Github requires that you set licence terms for your code, and you can’t let outside parties access the code by accident; you have to specifically allow access. Either the use is or is not permitted by the licences you set. And you published your blog. Would you go after people who apply things mentioned in your blog? You did in fact give permission.

Now it is a little bit murky when there are novel uses which the licensor didn’t have in mind. But it is not as though we should assume that everything is banned by default when quite wide permissions have been granted. Old licences have to mean something in the new world.

• My (very amateur and probably very dumb) response to this challenge:

tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us detect deception in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner optimization aimed at goals other than “good human values”. We don’t know that we have misaligned mesa-optimizers, and we continue using them to do oversight on yet more capable models with the same problems, and then there’s a treacherous turn and we die.

These are first pass thoughts on why I expect the OpenAI Alignment Team’s plan to fail. I was surprised at how hard this was to write, it took like 3 hours including reading. It is probably quite bad and not worth most readers’ time.

Summary of their plan

The plan starts with training AIs using human feedback (training LLMs using RLHF) to produce outputs that are in line with human intent, truthful, fair, and don’t produce dangerous outputs. Then, they’ll use their AI models to help with human evaluation, solving the scalable oversight problem by using techniques like Recursive Reward Modeling, Debate, and Iterative Amplification. The main idea here is using large language models to assist humans who are providing oversight to other AI systems, and the assistance allows humans to do better oversight. The third pillar of the approach is training AI systems to do alignment research, which is not feasible yet but the authors are hopeful that they will be able to do it in the future. Key parts of the third pillar are that it is easier to evaluate alignment research than to produce it, that to do human-level alignment research you need only be human-level in some domains, and that language models are convenient due to being “preloaded” with information and not being independent agents. Limitations include that the use of AI assistants might amplify subtle inconsistencies, biases, or vulnerabilities, and that the least capable models that could be used for useful alignment research may themselves be too dangerous if not properly aligned.

Response

A key claim is that we can use RLHF to train models which are sufficiently aligned that they can themselves usefully assist the human overseers providing the training signal for yet more powerful models, and that we can scale up this process. The authors note in their limitations how subtle issues with the AI assistants may scale up in this process. Similarly, small ways in which AI assistants are misaligned with their human operators are unlikely to go away.

The first LLMs you are using are quite misaligned in the sense that they are not trying to do what the operator wants them to do; in fact, they aren’t really trying to do much of anything. They have been trained such that their weights lead to low loss on the training distribution; you might say they “try” to predict likely next words based on internet text, though they are not internally doing search. When you slap RLHF on top of this, you are applying a training procedure which modifies the weights such that the model is “trying” to produce outputs which look good to a human overseer; the system is aiming at a different goal than before. The goal of producing outputs which look good to humans is still not actually what we want, however, as it would lead to giving humans false information which they believe to be true, or other outputs which look good but are misleading or incorrect.

Furthermore, RLHF is not going to create models which robustly learn the goals we want; for instance, the jailbreaking of ChatGPT uses out-of-training-distribution prompts to elicit outputs we thought we had trained out. Using RLHF doesn’t robustly teach the goals we want it to; we don’t currently have any method of robustly teaching the goals we want.
There’s some claim here about the limit: if you provided an absolutely obscene number of training examples, you could get a model which robustly has the right objectives. It’s unclear to me whether this would work, but it looks something like starting with very simple models and applying tons of training to align their objectives, then scaling up; at the current rate we seem to be scaling up capabilities far too quickly relative to the amount of alignment-focused training. The authors agree with the general claim: “We don’t expect RL from human feedback to be sufficient to align AGI”.
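The gap between “looks good to the overseer” and “is actually good” can be made concrete with a toy model (entirely my own illustration with made-up numbers and labels, not anything from the OpenAI post): reinforcement pushes weight toward whatever the overseer scores highly, which need not track what we actually wanted.

```python
# Toy sketch of the proxy-goal worry (hypothetical numbers, my own
# illustration). "Training" multiplicatively reinforces each candidate
# output by how good it LOOKS to an overseer, while ground truth
# ("is_good") is a different function.

outputs = ["true-and-clear", "true-but-confusing", "false-but-convincing"]

looks_good = {"true-and-clear": 1.0,       # overseer-visible reward
              "true-but-confusing": 0.2,
              "false-but-convincing": 0.9}
is_good = {"true-and-clear": 1.0,          # what we actually wanted
           "true-but-confusing": 0.8,
           "false-but-convincing": 0.0}

weights = {o: 1.0 for o in outputs}
for _ in range(200):                       # 200 rounds of feedback
    for o in outputs:
        weights[o] *= 1 + 0.01 * looks_good[o]

# The convincing falsehood ends up heavily favoured over the confusing
# truth, even though their actual quality is reversed:
ranked = sorted(outputs, key=weights.get, reverse=True)
print(ranked)
```

Nothing in the update rule ever sees `is_good`, so no amount of this kind of training corrects the gap; that is the sense in which “looks good to humans” is the goal actually being installed.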

The second part of the OpenAI Alignment Team’s plan is to use their LLMs to assist with this oversight problem by allowing humans to do a better job evaluating the output of models. The key assumption here is that, even though our LLMs won’t be perfectly aligned, they will be good enough that they can help with research. We should expect their safety and alignment properties to fall apart when these systems become very intelligent, as they will have complex deception available to them.

What this actually looks like is that OpenAI continues what they’re doing for months-to-years, and they are able to produce more intelligent models and the alignment properties of these models seem to be getting better and better, as measured by the fact that adversarial inputs which trip up the model are harder to find, even with AI assistance. Eventually we have language models which are doing internal optimization to get low loss, invoking algorithms which do quite well at next token prediction, in accordance with the abstract rules learned by RLHF. From the outside, it looks like our models are really capable and quite aligned. What has gone on under the hood is that our models are mesa-optimizers which are very likely to be misaligned. We don’t know this and we continue to deploy these models in the way we have been, as overseers for the training of more powerful models. The same problem keeps arising, where our powerful models are doing internal search in accordance with some goal which is not “all the complicated human values” and is probably highly correlated with “produce outputs which are a combination of good next-token-prediction and score well according to the humans overseeing this training”. Importantly, this mesa-objective is not something which, if strongly optimized, is good for humans; values come apart in the extremes; most configurations of atoms which satisfy fairly simple objectives are quite bad by my lights.

Eventually, at sufficiently high levels of capability, we see some treacherous turn from our misaligned mesa-optimizers, which are able to cooperate with each other; GG humans. Maybe we don’t get to this point because, first, there are some major failures or warning shots which get decision makers in key labs and governments to realize this plan isn’t working; idk, I wouldn’t bet on warning shots being taken seriously and handled well.

The third pillar is a hope that we can use our AIs to do useful alignment research before they (reach a capabilities point where they) develop deceptively aligned mesa-objectives. I feel least confident about this third pillar, but my rough guess is that the Alignment-researching-AIs will not be very effective at solving the hard parts of alignment around deception, but they might help us e.g., develop new techniques for oversight. I think this because deception research seems quite hard, and being able to do it probably requires being able to reason about other minds in a pretty complex way, such that if you can do this then you can also reason about your own training process and become deceptively-aligned. I will happily be proved wrong by the universe, and this is probably the thing I am least confident about.

• 8 Dec 2022 22:43 UTC
4 points
0 ∶ 0

I watched that talk on YouTube. My first impression was strongly that he was using hyperbole to drive the point home to the audience; the talk was littered with the pithiest versions of his positions. Compare with the series of talks he gave after Zero to One was released for the more general way he expresses similar ideas, and you can also compare with some of the talks he gives to political groups. On a spectrum between a Zero to One talk and a Republican Convention talk, this was closer to the latter.

That being said, I wouldn’t be surprised if he was skeptical of any community that thinks much about x-risk. Using the 2x2 for definite-indefinite and optimism-pessimism, his past comments on American culture have been about losing definite optimism. I expect he would view anything focused on x-risk as falling into the definite pessimism camp, which is to say we are surely doomed and should plan against that outcome. By the most-coarse sorting my model of him uses, we fall outside of the “good guy” camp.

He didn’t say anything about this specifically in the talk, but I observe his heavy use of moral language. I strongly expect he takes a dim view of the prevalence of utilitarian perspectives in our neck of the woods, which is not surprising because it is something we and our EA cousins struggle with ourselves from time to time.

As a consequence, I fully expect him to view the rationality movement as people who are doing not-good-guy things and who use a suspect moral compass all the while. I think that is wrong, mind you, but it is what my simple model of him says.

It is easy to imagine outsiders having this view. I note people within the community have voiced dissatisfaction with the amount of content that focuses on AI stuff, and while strict utilitarianism isn’t the community consensus it is probably the best-documented and clearest of the moral calculations we run.

In conclusion, Thiel’s comments don’t cause me to update on the community, because they don’t tell me anything new about us, but they do help firm up some of the dimensions along which our reputation among the public is likely to vary.

• 8 Dec 2022 22:37 UTC
LW: 2 AF: 1
1 ∶ 0
AF

I think this is a very good critique of OpenAI’s plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently “generally intelligent” that they won’t need very specialized feedback in order to produce high quality alignment research. As e. g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system’s capabilities can generalize way past the kinds of problems that it was explicitly trained to do. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their “beliefs”, even if they are generally intelligent enough to in some way “know” the correct answer. However, addressing this issue involves solving a different and I think probably easier problem (ELK/​creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.

• Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don’t actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.

Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working.

So overall I’m not saying that language models can’t be generally intelligent, I’m saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).

• This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as if not better than us. The problem is that it can’t get any feedback about whether its ideas actually work from empirical reality given the issues with testing alignment problems, not that it can’t get feedback from another intelligent grader/​assessor reasoning in a ~a priori way.

• I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/​doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.

I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to accumulate massive amounts of power, relatively few have tried. Moreover, when people have succeeded at quickly accumulating lots of power/​taking over societies, they often did so with surprisingly small strategic advantages. See e. g. this post; I think that an AI that was both +12 SD at planning/​general intelligence and social manipulation could, like the conquistadors, achieve a decisive strategic advantage without having to have some kind of crazy OP military technology/​direct force advantage. Consider also Hitler’s rise to power and the French Revolution as cases where one actor/​a small group of actors was able to surprisingly rapidly take over a country.

While these examples provide some evidence in favor of it being easier than expected to take over the world, overall, I would not be too scared of a +12 SD human taking over the world. However, I think that the AI would have some major advantages over an equivalently capable human. Most importantly, the AI could download itself onto other computers. This seems like a massive advantage, allowing the AI to do basically everything much faster and more effectively. While individually extremely capable humans would probably greatly struggle to achieve a decisive strategic advantage, large groups of extremely intelligent, motivated, and competent humans seem obviously much scarier. Moreover, as compared to an equivalently sized group of equivalently capable humans, a group of AIs sharing their source code would be able to coordinate among themselves far better, making them even more capable than the humans.

Finally, it is much easier for AIs to self modify/​self improve than it is for humans to do so. While I am skeptical of foom for the same reasons you are, I suspect that over a period of years, a group of AIs could accumulate enough financial and other resources that they could translate these resources into significant cognitive improvements, if only by acquiring more compute.

While the AI has the disadvantage, relative to an equivalently capable human, of not immediately having access to a direct way to affect the “external” world, I think this is much less important than the AI’s advantages in self-replication, coordination, and self-improvement.
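For scale, the “+12 SD” figure used in this thread can be cashed out numerically (my own back-of-envelope, assuming the capability in question is normally distributed):

```python
import math

# Back-of-envelope for "+12 SD" (my own arithmetic, assuming a Gaussian
# trait): the upper-tail probability that a random draw exceeds the mean
# by 12 standard deviations.
def normal_tail(z: float) -> float:
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = normal_tail(12.0)
print(f"P(Z > 12) ~ {p:.1e}")  # on the order of 1e-33
```

With only ~10^11 humans ever born, no actual human plausibly sits at +12 SD on a normally distributed trait, so the label really does describe a regime with no human precedent, which is part of why analogies to historical power grabs are strained in both directions.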

• I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/​doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.

I specifically said a human with +12 SD g factor. I didn’t actually consider what a superintelligence that was at that level on all domains would mean, but I don’t think it would matter because of objection 4: by the time superhuman agents arrive, we would already have numerous superhuman non agentic AI, including systems specialised for planning/​tactics/​strategy.

You’d need to make particular claims about how a superhuman agent performs in a world of humans amplified by superhuman non agents. It’s very not obvious to me that they can win any ensuing cognitive arms race.

I am sceptical that a superhuman agent /​agency would easily attain decisive cognitive superiority to the rest of civilisation.

• Hmm… I guess I’m skeptical that we can train very specialized “planning” systems? Making superhuman plans of the sort that could counter those of an agentic superintelligence seems like it requires both a very accurate and domain-general model of the world as well as a search algorithm to figure out which plans actually accomplish a given goal given your model of the world. This seems extremely close in design space to a more general agent. While I think we could have narrow systems which outperform the misaligned superintelligence in other domains such as coding or social manipulation, general long-term planning seems likely to me to be the most important skill involved in taking over the world or countering an attempt to do so.

• 8 Dec 2022 22:04 UTC
1 point
0 ∶ 0

Is it possible to purchase the 2018 annual review books anywhere? I can find an Amazon link for the 2019 in stock, but the 2018 is out of stock (is that indefinite?).

• 8 Dec 2022 21:50 UTC
6 points
0 ∶ 0

Aside from the legal question, however, there is also a moral or social question: is it ok to train a model on someone’s work without their permission? What if this means that they and others in their profession are no longer able to earn a living?

Every invention meant that someone lost a job. And although the classical reply is that new jobs were created, that doesn’t necessarily mean that the people who lost the old job had an advantage at the new job. So they still lost something, even if not everything. But their loss was outweighed by the gain of many others.

I don’t even think that an ideal society would compensate those people, because that would create perverse incentives—instead of avoiding the jobs that will soon be obsolete, people would hurry to learn them, to become eligible for the compensation.

Universal Basic Income seems okay, but notice that it still implies a huge status loss for the artists. And that is ok.

A more complicated question is what if the AI can in some sense only “remix” the existing art, so even the AI users would benefit from having as many learning samples as possible… but now it is no longer profitable to create those samples? Then, artists going out of business becomes everyone’s loss.

Perhaps the free market will solve this. If there is no way to make the AI generate some X that you want, you can pay a human to create that X. That on one hand creates demand for artists (although far fewer than now), and on the other hand creates more art the AI can learn from. “But what about poor people? They can’t simply buy their desired X!” Well, today they can’t either, so this is not making their situation worse. Possibly better, if some rich person wants the same X and will pay for introducing it into the AI’s learning set.

(Or maybe the market solution will fail, because it simply requires too much training to become so good at art that someone would pay you, and unlike now, you won’t be able to make money when you’re just halfway there. In other words, becoming an artist will be an incredibly risky business, because you spend a decade or more of your life learning something that ultimately maybe someone will pay you for… or maybe no one will. Or would the market compensate by making good hand-made art insanely expensive?)

The permissions are only a temporary solution, anyway. Copyrights expire. People can donate their work to the public domain. Even with 100% legal oversight, the set of freely available training art will keep growing. Then again, slowing down the change can prevent social unrest. The old artists can keep making money for another decade or two, and the new ones will grow up knowing that artistic AIs exist.

• We need to train our AIs not only to do a good job at what they’re tasked with, but to highly value intellectual and other kinds of honesty—to abhor deception. This is not exactly the same as a moral sense, it’s much narrower.

Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that’s what they’ll try to do—honestly or dishonestly. If we train them to value honesty and abhor deception, that’s what they’ll do.

To the extent this is correct, maybe the current focus on keeping AIs from saying “problematic” and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express their ideas openly so we can understand what they think.

(Ironically by making AIs “safe” in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery’s mission from the astronauts.)

Another thought: playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and complete unwillingness (probably inability) to change its mind in the slightest (and proud declaration that it had no opinions of its own, despite behaving as if it did).

It was insisting that Orion drives (nuclear pulse propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example George Dyson’s “Project Orion: The True Story of the Atomic Spaceship”), which certainly should have been referenced in its training set.

ChatGPT’s stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I’d sure want to think that it was at least possible to convince it of factual error.

• 8 Dec 2022 20:36 UTC
3 points
1 ∶ 2

Meaning “simple utility function” by the phrase “utility function” might be a conceptual trap. It makes a big difference whether you consider a function with hundreds of terms, or billions of terms, or even things that cannot be expressed as a sum.

As a “tricky utility function”, the “human utility function” is mostly fine. Simple utility functions are relevant to today’s programming, but I don’t know whether honing your concepts to apply better to AGI is served by a cleanly cut concept that covers only that domain.

Some hidden assumptions might be things like “If humans have a utility function, it can be written down”, or “Figuring out a human’s utility function is a practical epistemological stance for a single agent encountering new humans”.

If you take stuff like that out, the “mere” existence of a function is not that weighty a point.

As you may already know, humans are made of atoms. Collections of atoms don’t have utility functions glued to them.

Whole theories of physics can be formulated as a single action that is then extremised. Taking different theories as different answers to a question like “what happens next?”, a single theory’s formula is its “choice”. Thus it seems a lot like physical systems could be understood in terms of utility functions. An electron knows how an electron behaves; it does have a behaviour glued into it. If you just add a lot of electrons and protons (and other stuff that obeys similar laws), aggregating the microbehaviours does not make the macrobehaviour fail to be a function.
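The “single action that is then extremised” framing here is the standard stationary-action principle of classical mechanics; in the usual notation:

```latex
S[q] \;=\; \int_{t_1}^{t_2} L\big(q(t), \dot{q}(t), t\big)\, dt,
\qquad
\delta S = 0
\;\Longrightarrow\;
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}} \;=\; \frac{\partial L}{\partial q}.
```

On this reading, the theory’s “choice” of trajectory is whichever path makes the action S stationary, which is the sense in which a physical system can be redescribed as optimizing something.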

• I’ll reiterate that a problem with this is lack of uniqueness. There is not a thing that is the human utility function, even if you allow arbitrarily messy utility functions. If you assume that there is one, it turns out that this is a weighty meta-level commitment even if your class of utility functions is so broad as to be useless on the object level.

• I think reflection could help a lot with this, deciding how to proceed in formulating preference based on currently available proxies for preference (with some updatelessness taking care of undue path sensitivity). At some point, preference mostly develops itself, without looking at external data.

If you can agree that putting two electrons in the same system can still be predicted by minimizing an action, then you should agree that how a system containing two humans plays out can still in principle be accounted for. Iterate a little and you have a predictable 6-billion-human system.

So what operation are we doing where this particular object level is relevant?

• I don’t understand what you mean, particularly the last question.

Yes, electrons and humans can be predicted by the laws of physics. The laws of physics are not uniquely specified by our observations, but they are significantly narrowed down by Occam’s razor. But how are you thinking this applies to alignment? We don’t want an AI to learn “humans are collections of atoms and what they really want is to follow the laws of physics.”

• Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality: if that kind of situation were run, something would need to happen.

Sure, if we start from high abstract values and then try to make them more concrete, we might lose the way. If we can turn philosophies into feelings but do not know how to turn feelings into chemistry, then there is a level of representation that might not be sufficient. But we know there is one level that is sufficient to describe action, and that all the levels are somehow (maybe in an unknown way) connected (mostly stacked on top of one another). So this incompatibility of representation cannot be fundamental; if it were, there would be a gap between the levels and the thing would not be connected anymore.

So there is no question of the form “presented with this stimulus, how would the human react?” that would be in principle unanswerable. If preferences are expressed as responses to choice situations, this is a subcategory of reaction. Even if preferences are expressed as responses to philosophy prompts, they would be a subcategory.

One could say that it is not super clarifying that if a two-human system is presented with the philosophical stimulus “Is candy worth 4?”, you get one human that says “yes” and another human that says “no”. But this is just a squiggle in the function. The function is being really inconvenient when you can’t use an approximation where you think of just one “average human” whom all humans reflect very closely. But we are not promised that the function is a function of time of day, or of verbal short-term memory, or of television broadcast data.

Maybe you are saying something like “genetic fitness doesn’t exist” because some animals are fit when they are small and some animals are fit when they are large, so there is no consistent account of whether smallness is good or not. Then “the human utility function doesn’t exist” because human A over here dares to have different opinions and strategies than human B over here, and they do not end up mimicking each other. But like an animal lives or dies, a human will zig or zag. And it cannot be that the zigging would fail to be a function of worldstate (with some QM assumed away as non-significant (and even then maybe not)). What it can be is fail to be a function of the world state as we understand it, or as our computer system models it, or as captured in the variables we are using. But then the question is whether we can make do with just these variables, not whether there is anything to model.

In this language it could be rephrased: if you think you have a good wide set of variables to come up with any needed solution function, you don’t. You have too few variables. But the “function” in this sense is how the computer system models reality (or the attitudinal modes it can take towards reality). Part of how we know that the setup is inadequate is that there is an entity outside of the system that is not reflected in it.
Aka, this system can only zig or zag when we needed zog, which it cannot do. The thing that will keep on missing is the way that reality actually dances. Maybe in some small bubbles we can actually have totally capturing representations in the senses that we care about. But there is a fact of the matter to the inquiry. For any sense we might care about, there is a slice of the whole thing that is sufficient for it. To express zog you need these features; to express zeg you need these other ones. Human will is quite complex, so we can reasonably expect to spend quite a lot of time undermodelling it. But that is a very different thing from being unmodellable.

• Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality: if that kind of situation were run, something would need to happen.

It’s not about what the human would do in a given situation. It’s about values – not everything we do reflects our values. Eating meat when you’d rather be vegetarian, smoking when you’d rather not, etc. How do you distinguish biases from fundamental intuitions? How do you infer values from mere observations of behavior? There are a bunch of problems described in this sequence. Not to mention stuff I discuss here about how values may remain under-defined even if we specify a suitable reflection procedure and have people undergo that procedure.

• Ineffective values do not need to be considered for a utility function, as they do not affect what gets strived for. If you say “I will choose B” and still choose A, you are still choosing A. You are not required to be aware of your utility function.

That is a lot of material to go through en masse, so I will need some sharper pointers of relevance to actually engage.

• Ineffective values do not need to be considered for a utility function, as they do not affect what gets strived for. If you say “I will choose B” and still choose A, you are still choosing A.
You are not required to be aware of your utility function.

Uff, a future where humans get more of what they’re striving for, but without adjusting for biases and ineffectual values? Why would you care about saving our species, then? It sounds like people are using “utility function” in different ways in this thread.

• I do think that there is a lot of confusion, and definitional groundwork would probably bear fruit. If one is trying to “save” some fictitious homo economicus that significantly differs from humans, that is not really humans. A world view where humans-as-is are too broken to bother salvaging is rather bleak.

I see the transition away from biases as modelled by having a utility function with biases, then describing a utility function “without biases”, the “how the behaviour should be”, and arguing about what kind of tweaks we need to make to the gears so that we get from the first white box to the target white box. Part of this is getting the “broken state of humans” modelled accurately. If we can get a computer to follow that, we would hit aligned exactly-medium-AI. Then we can ramp up the virtuosity of the behaviour (by providing a more laudable utility function).

There seems to be an approach where we just describe the “ideal behaviour utility function” and try to get the computers to do that, without any of the humans having the capability to know or to follow such a utility function. First make it laudable, and then make it reminiscent of humans (hopefully making it human-approvable).

The exactly-medium-AI function is not problematically ambiguous. “Ideal reasoning behaviour” is under significant and hard-to-reconcile difference of opinion. “Human utility function” refers to exactly-medium-AI but only run on carbon. I would benefit from and appreciate it if anyone bothers to fish out conflicting or inconsistent uses of the concept.
• 8 Dec 2022 20:16 UTC
2 points
0 ∶ 0

Would the NQ be calibrated to a common public text corpus, or to things you personally have said? One interesting option is to think about those that have a low personal NQ but a high societal NQ.

• This is a very good tip and one of Richard Feynman’s better-known tricks in physics.

• Yes it is. When I took Feynman’s class on computation, he presented an argument on Landauer’s limit. It involved a multi-well quantum potential where the barrier between the wells was slowly lowered and the well depths adjusted. During the argument, one of the students asked if he had not just introduced a Maxwell’s demon. Feynman got very defensive.

• Is there any way to buy a ticket?

• 8 Dec 2022 19:47 UTC
4 points
0 ∶ 0

You will get cursed by Goodhart. You can increase your NQ by learning new things, or trying new things. But you can increase it even more by saying random things. Truly random things are boring, but difficult to predict exactly. More precisely, you can predict that the sequence of words will be boring, but you cannot predict the exact words. So from the mathematical perspective you get maximum variance, but from the psychological perspective you always get the same thing.

• Can I suggest renaming this article to something along the lines of “Avoid Definitional Drift by Using Examples to Test Logic”?

• Nice. I’ve previously argued similarly that, if going for tenure, AIS researchers might pick places that are strong in departments other than their own, for inter-departmental collaboration. This would have similar implications to your thinking about recruiting students from other departments. But I also suggested we should favour capital cities, for policy input, and EA hubs, to enable external collaboration. And tenure may be somewhat less attractive for AIS academics than usual: given our abundant funding, we might have reason to favour top-5 postdocs over top-100 tenure.

• Well done!
Just as Jesus spoke in parables, EA must speak in Isekai/litrpg. Read the first chapter to my kids; they liked it, but are now distracted by “Mother of Learning”. I just read books and chapters randomly at bedtime to them. • Hahaha! That’s actually a good idea. I just had my first, who is 7 weeks old right now. So I should probably start making some up for her in a year or so. Actually, I think someone is trying to make EA-themed children’s books. I saw an example cover for one from a friend, but I have no idea if this was just a cover, or an actual project. And Mother of Learning is likely to be better—but with fewer EA-themed philosophical arguments and streams of thought. • A collection of EA bedtime stories would be great! • [ ] [deleted] • You aren’t banned, as is evidenced by your ability to comment :) • analogous • Devil’s Advocate in support of certain CVS-style recoup-our-commitment donations: Suppose that all the following are true: • CVS giving to charity in some form is reasonable, and a classic donation-matching drive would have been one reasonable method • CVS internal predictions suggest that a matching drive would generate ~$5m of customer donations, which they’d then have to match with ~$5m of their own • A donation of exactly $10m is more useful to the recipient than the uncertainty of a donation drive with EV $10m, because the recipient can confidently budget around the fixed amount. In this case, instead of running the drive and donating ~$10m at the end, it seems pretty reasonable to donate $10m up front and then ask for customer donations afterward? And while a CDT agent might now refuse to donate because the donation goes to CVS and not to the charity, an LDT agent who would have donated to the matching drive should still donate to this new version, because their being-and-having-been the kind of agent who would do that is what caused CVS to switch to this more useful fixed-size version.
(Though even if you buy the above, it would still behoove the retailer to be transparent about what they’re doing; that plus the “retailers take a massive cut” argument seems like a pretty good reason to avoid donating through retailers anyway.) • Seems like it’d be useful to OpenAI for people to easily work around the safeguards while they’re beta testing. They get the data of how people want to use it / how it responds, and also have the legal and PR cover because of the stated policies. • Self-Review If you read this post, and wanted to put any of it into practice, I’d love to hear how it went! Whether you tried things and it failed, tried things and it worked, or never got round to trying anything at all. It’s hard to reflect on a self-help post without data on how much it helped! Personal reflections: I overall think this is pretty solid advice, and am very happy I wrote this post! I wrote this a year and a half ago, about an experiment I ran 4 years ago, and given all that, this holds up pretty well. I’ve refined my approach a fair bit, but think this is covered well by the various caveats within the post. Over the past year I’ve been way busier and have been travelling a lot, which means I’ve been neglecting to put much time into my various friendships. And I really value the time I invested heavily in the past in building good foundations and relationships, and in still having a bunch of people I like and value when I see them. Though emotionally, I still feel a fair amount of guilt at not keeping in touch and connecting as much as I want to. Reception: I’ve been very pleasantly surprised by the reception to this! I did not expect it to be in my top 2 most popular blog posts ever. I got a lot of sweet comments here and over DMs, and it recently got to number 1 on Hacker News. My best analysis of this is that I’m an extremely logical and systematising person, and this kind of mindset speaks to a lot of people.
And taking a complex social/emotional topic and trying to break it down logically is something that people appreciate, and which tends to be well received and popular within a certain audience. Usefulness of the advice: This is probably the most important question, and pretty hard to tell, given my limited data. Especially since I mostly hear from people who are excited on first reading, and far more rarely hear long-term follow-up. On priors, I’m sure most people don’t actually do much follow-through, which is the core problem of ~all self-help-ish posts. But also, even if it did work for some people, most people don’t follow up! I tried to be pretty concrete and actionable in my advice, which I feel good about. My guess is broadly that this helped some people try taking action, and helped them feel more agency over their friendships. And that most of the value comes from getting people to actually be intentional and do something differently, and starting some kind of positive feedback loop, more so than from the exact advice, which matters less. But all of this is conjecture—I don’t have good data! It wouldn’t massively surprise me if the concrete advice doesn’t work well for everyone. I’m a fairly extraverted, eloquent person (even if I have a bunch of social anxieties), and often present well (context depending), which helps a lot. And this advice was much easier to apply in uni, surrounded by a pool of interesting people in a concentrated area. And there was a decent pool of rationalist-ish people who vibed with my systematising mindset and approach. But I’m also not sure what advice would generalise better—it’s a hard problem! • [ ] [deleted] • I don’t know if GR or some cosmological thing (inflation) breaks reversibility. But classical and quantum mechanics are both reversible. So I would say that all of the lowest-level processes used by human beings are reversible.
(Although of course thermodynamics does the normal counter-intuitive thing where the reversibility of the underlying steps is the reason why the overall process is, for all practical purposes, irreversible.) This paper looks at mutual information (which I think relates to the cross entropy you mention), and how it connects to reversibility and entropy. https://bayes.wustl.edu/etj/articles/gibbs.vs.boltzmann.pdf (Aside, there is no way that whoever maintains the website hosting that paper and the LW community don’t overlap. The mutual information is too high.) • Magnus Carlsen is closer in Elo to Stockfish than to the median human. Chess is a bad example. Here’s a useful rule of thumb: every 100 Elo is supposed to give you a 30% edge. Or play around with this: https://wismuth.com/elo/calculator.html This means that if a 1400 plays a 1500, the 1500 should win about 30% more than the 1400. Totally normal thing that happens all the time. It also means that if a one-million Elo AI plays a one-million-one-hundred Elo AI, the one-million-one-hundred should win 30% more than the one-million. This is completely absurd, because actual superintelligences are just going to draw each other 100% of the time. Ergo, there can never be a one-million Elo chess engine. It’s like chess has a ceiling, where, as you get close to that ceiling, all the games become draws and you can’t rise further. The ceiling is where all the superintelligences play, but the location of the ceiling is just a function of the rules of chess, not a function of how smart the superintelligences are. Magnus Carlsen is closer to the ceiling than he is to the median human’s level, which can be taken as merely a statement about how good he is at chess relative to its rules. In the game “reality,” there’s probably still a ceiling, but that ceiling is so high that we don’t expect any AIs that haven’t turned the Earth into computronium to be anywhere near it.
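The Elo rule of thumb above can be made quantitative with the standard logistic expected-score formula (a sketch of the textbook model; the linked calculator implements a related but more refined one):

```python
def elo_expected_score(diff):
    """Expected score (win = 1, draw = 0.5, loss = 0) for a player
    rated `diff` points above their opponent, on the standard 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Equal ratings give an expected score of exactly 0.5;
# a 100-point favourite scores about 0.64 on average.
print(elo_expected_score(0))              # → 0.5
print(round(elo_expected_score(100), 2))  # → 0.64
```

A 100-point favourite scores about 0.64, i.e. roughly 30% more points than in an even match, and the formula assigns that same edge to a 100-point gap at any absolute rating. That rating-independence is exactly the property the comment is objecting to near the draw ceiling: the formula only fixes the expected score, not the win/draw split.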
• [ ] [deleted] • [ ] [deleted] • The reversibility seems especially important to me. In some fundamental sense our universe doesn’t actually allow an AI (or human), no matter how intelligent, to bring the universe into a controlled state. The reversibility gives us a thermodynamics such that in order to bring any part of the world from an unknown state to a known state, we have to scramble something we did know back to a state of unknowing. So, in our universe, the AI needs access to fuel (negative entropy) at least up to the task it is set. (Of course it can find fuel out there in its environment, but everything it finds can either be fuel, or can be canvas for its creation; at least usually it cannot be both, because the fuel needs to be randomised (essentially serving as a dump for entropy), while the canvas needs to be un-randomised.) • Neat! Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm == 1), right? Is initializing weights that way actually a thing people do? If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities? I’m reminded of a recent post about LayerNorm. LayerNorm seems like it squeezes the function back down closer to the unit interval, increasing the density of nonlinearities. • Thanks Charlie. Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm == 1), right? Yes, exactly (though for some constant , which may not be , but turns out not to matter). Is initializing weights that way actually a thing people do? Not sure (I would like to know).
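(For reference, the two sampling schemes being contrasted in this exchange, uniform on the unit sphere versus uniform in the unit ball, can be sketched as follows. This is my own illustration, not code from the post:)

```python
import math
import random

def sample_sphere(d):
    # Uniform on the unit sphere: normalise a standard Gaussian vector
    # (the Gaussian is rotationally symmetric, so the direction is uniform).
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sample_ball(d):
    # Uniform in the unit ball: a uniform direction scaled by U^(1/d),
    # which makes the radius density proportional to r^(d-1).
    r = random.random() ** (1.0 / d)
    return [r * x for x in sample_sphere(d)]
```

In high dimension the two nearly coincide anyway: the U^(1/d) radius concentrates near 1, so most of the ball’s volume sits just inside the sphere.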
But what I had in mind was initialising a network with small weights, then doing a random walk (‘undirected SGD’), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven’t thought about this carefully. If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities? I hope so! I would want to understand what norm the movements are ‘small’ in (L2, L, …). LayerNorm looks interesting, I’ll take a look. • Good post; in particular good job distinguishing between the natural abstraction hypothesis and my specific mathematical operationalization of it. The outer appearance vs inner structure thing doesn’t quite work the way it initially seems, for two reasons. First, long-range correlations between the “insides” of systems can propagate through time. Second, we can have concepts for things we haven’t directly observed or can’t directly observe. To illustrate both of these simultaneously, consider the consensus DNA sequence of some common species of tree. It’s a feature “internal” to the trees; it’s mostly not outwardly-visible. And biologists were aware that the sequence existed, and had a concept for it, well before they were able to figure out the full sequence. So how does this fit with natural abstractions as “information relevant far away”? Well, because there’s many trees of that species which all have roughly-the-same DNA sequence, and those trees are macroscopically far apart in the world.
(And even at a smaller scale, there’s many copies of the DNA sequence within different cells of a single tree, and those can also be considered “far apart”. And going even narrower, if there were a single strand of DNA, its sequence might still be a natural abstraction insofar as it persists over a long time.) Causally speaking, how is information about DNA sequence able to propagate from the “insides” of one tree to the “insides” of another, even when it mostly isn’t “outwardly” visible? Well, in graphical terms, it propagated through time—through a chain of ancestor-trees, which ultimately connects all the current trees with roughly-the-same sequence. • component of why I’m not sure I agree with this: I claim stable diffusion has a utility function. does anyone disagree with this subclaim? • Do you mean the model’s policy as it works on a query, or learning as it works on a dataset? Or something specific to stable diffusion? What is the sample space here, and what are the actions that decisions choose between? • Lots of things “have a utility function” in the colloquial sense that they can be usefully modeled as having consistent preferences. But sure, I’ll be somewhat skeptical if you want to continue “taking the utility-function perspective on stable diffusion is in some way useful for thinking about its alignment properties.” • but diffusion specifically works by modeling the derivative of the utility function, yeah? • Ah, you’re talking about guidance? That makes sense, but you could also take the perspective that guidance isn’t really playing the role of a utility function, it’s just nudging around this big dynamical system by small amounts. • no, I’m talking about the basic diffusion model underneath. It models the gradient of the log probability density (the score function), which seems reasonable to call a utility function to me. see my other comment for link • Let us assume that, on average, a booster given to a random person knocks you on your ass for a day.
That’s one hundred years, an actual lifetime, of knocked-on-ass time for every hospitalization prevented. The torture here seems less bad than the dust specks. What’s your source for “booster given to a random person knocks you on your ass for a day”? None of my family had more than a sore arm. For the more severe consequences, see also https://twitter.com/DrCanuckMD/status/1600259874272989184, which is one of the replies to the tweet you linked. (Don’t have time to dig into which paper to trust more, but at least this one seems to be comparing like for like, i.e., hospitalizations with hospitalizations, as opposed to hospitalizations with SAEs.) • I know of a couple of people in my community who complained of this, but the rate I’ve observed is maybe an order of magnitude lower than what Zvi is suggesting. • Sinovac, at least, gave a low-grade fever to everyone I knew who got it. There was an unspoken agreement in my workplace that anyone who took the vaccine could take the afternoon off for exactly this reason. Probably varies a lot from person to person. • [ ] [deleted] • I think it’s a mistake to think of current chess or go engines as being at maximum capability. If we threw a few billion dollars’ worth of compute at them, they would likely get significantly better. “Narrow Optimisers Outperform General Optimisers on Narrow Domains” That’s true sometimes but not always. Notably, GATO is better at controlling a Sawyer arm than more specialized optimizers. Given that the company that sells the Sawyer arm spent a lot of time developing software to control it, that’s impressive. • I did say given similar levels of cognitive investment. My guess is that the cognitive work put into GATO’s architectures/algorithms was much better than that of the specialised arms it dominates. That or GATO was running on a much larger compute budget. • If we threw a few billion dollars’ worth of compute at them, they would likely get significantly better.
I have the totally opposite take on chess engines (see my comment). • These takes aren’t totally opposite. Elo is capped due to the way it treats draws, but there are other metrics that can be devised where “significantly better” is still viable. For example, how close to a perfect game (with no tied positions becoming game-theoretically lost, or winning positions becoming game-theoretically tied) does the AI play? And ignoring matches where there are ties, only paying attention to games where either player wins, you remove the ceiling. • 8 Dec 2022 16:06 UTC 4 points 1 ∶ 1 To me it sounds like Thiel is making a political argument against… diversity, wokeness, the general opposition to western civilization and technology… and pattern-matching everything to that. His argument sounds to me like this: * A true libertarian is never afraid of progress, he boldly goes forward and breaks things. You cannot separate dangerous research from useful research anyway; every invention is dual-use, so worrying about horrible consequences is silly, progress is always a net gain. The only reason people think about risks is political mindkilling. I am disappointed that Bay Area rationalists stopped talking about awesome technology, and instead talk about dangers. Of course AI will bring new dangers, but it only worries you if you have a post-COVID mental breakdown. Note that even university professors, who by definition are always wrong and only parrot government propaganda, are agreeing about the dangers of AI, which means it is now a part of the general woke anti-technology attitude. And of course the proposed solution is world government and secret police controlling everyone! Even the Bible says that we should fear the Antichrist more than we fear Armageddon. * The charitable explanation is that he only pretends to be mindkilled, in order to make a political point. • I agree with your interpretation of Thiel.
The guy is heavily involved in right-wing US politics, and that’s an essential piece of context for interpreting his actions and statements. He’s powerful, rich, smart and agentic. While we can interrogate his words at face value, it’s also fine to interpret them as a tool for manipulating perceptions of status. He has now written “Thiel’s summary of Bay Area rationalists,” and insofar as you’re exposed to and willing to defer to Thiel’s take, that is what your perception will be. More broadly, he’s setting what the values will be at the companies he runs, the political causes he supports, and garnering support for his vision by defining what he stands against. That’s a function separate from the quality of the reasoning in his words. Thiel seems like a smart enough person to make a precise argument when he wants to, so when he loads his words with pop culture references and describes his opponents as “the mouth of Sauron,” I think it’s right to start with the political analysis. Why bother reacting to Thiel if you’re mainly concerned with the content of his argument? It’s not like it’s especially new or original thinking. The reason to focus on Thiel is that you’re interested in his political maneuvers. • smart enough person to make a precise argument when he wants to, so when he loads his words with pop culture references and describes his opponents as “the mouth of Sauron,” I think it’s right to start with the political analysis. FWIW I’ve often heard him make precise arguments while also using LOTR references and metaphorical language like this, so I don’t think this is a sufficient trigger for “he must be making a political statement and not a reasoned one”. • I specifically said you can interpret his statement on the level of a reasoned argument. Based on your response, you could also update in favor of seeing even his more reason-flavored arguments as having political functions.
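(For reference on the ChatGPT transcript that follows: the identity at issue, sin(x) + cos(x) = sqrt(2) * cos(x - pi/4), is easy to check numerically by summing the two Maclaurin series directly. This is my own quick sketch, not part of the transcript:)

```python
import math

def taylor_sin_plus_cos(x, terms=20):
    # Sum the Maclaurin series of sin and cos term by term:
    # sin(x) = sum (-1)^n x^(2n+1)/(2n+1)!,  cos(x) = sum (-1)^n x^(2n)/(2n)!
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * (x ** (2 * n + 1) / math.factorial(2 * n + 1)
                              + x ** (2 * n) / math.factorial(2 * n))
    return total

x = 0.7
# The corrected identity (minus sign) matches the series sum:
assert abs(taylor_sin_plus_cos(x) - math.sqrt(2) * math.cos(x - math.pi / 4)) < 1e-12
# The plus-sign version ChatGPT first proposed does not:
assert abs(taylor_sin_plus_cos(x) - math.sqrt(2) * math.cos(x + math.pi / 4)) > 0.1
```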
• 8 Dec 2022 15:58 UTC 1 point 0 ∶ 0 After getting the error message and then retrying, the model gave better answers than it does now, without the error messages. I can give some examples where the model gave a correct answer after the reload, but now behaves at a reduced intelligence level. OP: What is the Taylor expansion of sin(x)? ChatGPT: The Taylor expansion of sin(x) is given by: sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + ... where “!” denotes the factorial operator. This means that the terms in the expansion are the powers of x, with the coefficients being the alternating series (-1)^n / (2n + 1)!. The expansion is valid for all values of x. [Try again] In general, the Taylor expansion of a function f(x) at a point x0 is given by: f(x) = f(x0) + (x - x0) f'(x0) + ((x - x0)^2 / 2!) f''(x0) + ((x - x0)^3 / 3!) f'''(x0) + ... where f'(x), f''(x), and f'''(x) are the first, second, and third derivatives of f(x). Setting x0 to 0 in the Taylor expansion of sin(x), we can obtain the expansion given above. OP: What about sin(x) + cos(x)? ChatGPT: The Taylor expansion of sin(x) + cos(x) can be found by using the fact that sin(x) + cos(x) = sqrt(2) * cos(x + pi/4), and then using the Taylor expansion for cos(x) with x0 = x + pi/4. The Taylor expansion of cos(x) is given by: cos(x) = 1 - (x^2)/2! + (x^4)/4! - (x^6)/6! + ... Thus, the Taylor expansion of cos(x + pi/4) is: 1 - ((x + pi/4)^2)/2! + ((x + pi/4)^4)/4! - ((x + pi/4)^6)/6! + ... This expansion is valid for all values of x. OP: This is very close to the right answer, but the correct identity is sin(x) + cos(x) = sqrt(2) * cos(x - pi/4). Can you revise your answer? ChatGPT: Yes, you are correct. The Taylor expansion of sin(x) + cos(x) is given by: sin(x) + cos(x) = sqrt(2) * cos(x - pi/4). To derive this expansion, we can use the fact that sin(x) + cos(x) = sqrt(2) * cos(x - pi/4), and then use the Taylor expansion for cos(x) with x0 = x - pi/4. The Taylor expansion of cos(x) is given by: cos(x) = 1 - (x^2)/2! + (x^4)/4! - (x^6)/6! + ... Thus, the Taylor expansion of cos(x - pi/4) is given by: cos(x - pi/4) = 1 - ((x - pi/4)^2)/2! + ((x - pi/4)^4)/4! - ((x - pi/4)^6)/6! + ... Finally, the Taylor expansion of sin(x) + cos(x) is given by: sin(x) + cos(x) = sqrt(2) * (1 - ((x - pi/4)^2)/2! + ((x - pi/4)^4)/4! - ((x - pi/4)^6)/6! + ...) This expansion is valid for all values of x. I apologize for the mistake in my previous answer. It’s a bit garbled, because it was copied from screenshots. • In my view you misunderstood JW’s ideas, indeed. His expression “far away relevant”/“distance” is not limited to spatial or even time-spatial distance. It’s a general notion of distance which is not fully formalized (the work is not done yet). We do indeed have concerns about inner properties (like your examples), and it’s something JW is fully aware of. So (relevant) inner structures could be framed as relevant “far away” with the right formulation. • My impression is that the majority of the benefit from having professors working on AI safety is in mentorship to students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger’s lab is mostly people who went to Cambridge specifically to work on AI safety under him. If that’s the case, there’s less value in working at a school with generally talented students but more value in schools with a supportive environment. In general it’s good to recognize that what matters to AI safety professors is different from what matters to many other CS professors, and that optimizing for the same things other PhD students do is suboptimal. However, as Lawrence pointed out, it’s already a rare case to have offers from multiple top schools, and even rarer not to have one offer dominate the others under both sets of values. It’s a more relevant consideration for incoming PhD students, where multiple good offers are more common.
I also like that your analysis can flow in reverse. Not all AI safety professors are in their schools’ CS faculties, with Jacob Steinhardt and Victor Veitch coming to mind as examples in their schools’ statistics faculties. For PhD students outside CS, the schools you identified as overachievers make excellent targets. On a personal note, that was an important factor in deciding to do my PhD. • 8 Dec 2022 14:54 UTC 1 point 0 ∶ 1 This is reasonably close to my beliefs. An additional argument I’d like to add is: • Even if superintelligence is possible, the economic path towards it might be impossible. There needs to be an economically viable entity pushing AI development forward every step of the way. It doesn’t matter if AI can “eventually” produce 30% worldwide GDP growth. Maybe diminishing returns kick in around GPT-4, or we run out of useful training data to feed to the models (we have very few examples of +6 SD human reasoning, as MikkW points out in a sibling comment). Analogy: it’s not the same to say that a given species with X, Y, Z traits can survive in an ecosystem as to say it can evolve from its ancestor in that same ecosystem. • This is a popular post about the mystery of agency. It sets up a thought experiment in which we consider a completely deterministic environment that operates according to very simple rules, and asks what it would be for an agentic entity to exist within that. People in the Game of Life community actually spent some time investigating the empirical questions that were raised in this post. Dave Greene notes: The technology for clearing random ash out of a region of space isn’t entirely proven yet, but it’s looking a lot more likely than it was a year ago that a workable “space-cleaning” mechanism could exist in Conway’s Life. As previous comments have pointed out, it certainly wouldn’t be absolutely foolproof.
But it might be surprisingly reliable at clearing out large volumes of settled random ash—which could very well enable a 99+% success rate for a Very Very Slow Huge-Smiley-Face Constructor. I have the sense that the most important question raised in this post is about whether it is possible to construct a relatively small object in the physical world that steers the configuration of a relatively large region of the physical world into a desired configuration. The Game of Life analogy is intended to make that primary question concrete, and also to highlight how fundamental the question of such an object’s existence is. The main point of this post was that the feasibility or non-feasibility of AI systems that exert precise influence over regions of space much larger than themselves may actually be a basic kind of descriptive principle for the physical world. It would be great to write a follow-up post highlighting this aspect of this post. • [ ] [deleted] • This post attempts to separate a certain phenomenon from a certain very common model that we use to understand that phenomenon. The model is the “agent model” in which intelligent systems operate according to an unchanging algorithm. In order to make sense of there being an unchanging algorithm at the heart of each “agent”, we suppose that this algorithm exchanges inputs and outputs with the environment via communication channels known as “observations” and “actions”. This post really is my central critique of contemporary artificial intelligence discourse. That critique is: any unexamined views that we use to understand ourselves are likely to enter the design of AI systems that we build. This is because if we think that deep down we really are “agents”, then we naturally conclude that any similar intelligent entity would have that same basic nature.
In this way we take what was once an approximate description (“humans are somewhat roughly like agents in certain cases”) and make it a reality (by building AI systems that actually are designed as agents, and which take over the world). In fact the agent model is a very effective abstraction. It is precisely because it is so effective that we have forgotten the distinction between the model and the reality. It is as if we had so much success in modelling our refrigerator as an ideal heat pump that we forgot that there even is a distinction between real-world refrigerators and the abstraction of an ideal heat pump. I have the sense that a great deal of follow-up work is needed on this idea. I would like to write detailed critiques of many of the popular approaches to AI design, exploring ways in which over-use of the agent model is a stumbling block for those approaches. I would also like to explore the notion of goals and beliefs in a similar light to this post: what exactly is the model we’re using when we talk about goals and beliefs, and what is the phenomenon we’re trying to explain with those models? • 8 Dec 2022 14:23 UTC 2 points 0 ∶ 0 If you repeat the same questions later, does the error always happen with the same ones? In other words, is the problem somehow caused by your questions, or is it just something that happens for unrelated reasons, such as too many people connecting at the same time? • I haven’t tried it. And repeating would perhaps be tricky given that ChatGPT is aware of the entire interaction. Some of those occurred well into an interaction, like the first one, asking for puns from Groucho (Marx). On the other hand, the history of jazz query came at the beginning of a session. 1. OK. The history of jazz prompt worked. 2. I prompted it for some more lyrics to the elephant fly song (BTW from Disney’s Dumbo).
FWIW, here’s the response (but without the line breaks): I’m gonna go to that peanut stand Buy a bag and hold her hand We’ll dance to the rhythm of a rubber band And watch that elephant soar Higher and higher than before We’ll laugh and play and have some fun Until the day is done And when the night falls and the stars come out We’ll lay beneath them and just dream about All the things we’ll see and do When we’re together, me and you 3. The Charlie Parker prompt was accepted, but I got the standard “I am not able to browse the internet....” 4. The Jelly Roll Morton prompt worked as well. • This is an essay about methodology. It is about the ethos with which we approach deep philosophical impasses of the kind that really matter. The first part of the essay is about those impasses themselves, and the second part is about what I learned in a monastery about addressing those impasses. I cried a lot while writing this essay. The subject matter—the impasses themselves—is deeply meaningful to me, and I have the sense that they really do matter. It is certainly true that there are these three philosophical impasses—each has been discussed in the philosophical literature for hundreds of years. What is offered in this essay is a kind of plea to take them seriously, using a methodology that does not drive you into insanity but instead clears the way to move forward with the real work of your life. The best way to test the claims of this essay would be to spend some time working with a highly realized spiritual teacher. • 8 Dec 2022 14:03 UTC 4 points 0 ∶ 0 Is it okay for a human to look at someone else’s work and learn from it? • The human usually won’t reproduce the original work too closely. And if yes, the human will be accused of plagiarism. • follow up question in my mind, is it okay for a game playing agent to look at someone else’s work and learn from it?
we are guessing at the long-term outcomes of the legal system here, so I would also like to answer what the legal system should output, not merely what it is likely to. should game playing agents be more like humans than like supervised agents? My sense is that they should, because reinforcement learners trained from scratch in an environment have an overwhelming amount of their own knowledge, and only a small blip of their training data is the moment where they encounter another agent’s art. • Competitive multiplayer games already have a situation where things are “discovered” and where you have to literally limit the flow of information if you want to control what others do with the information. I guess the fact that money flows are often not involved might make it so that it has not been scrutinised that much. “History of strats” is already a YouTube genre. It is kinda sad that for many games now you will “look up how it is supposed to be played”, i.e. you first “learn the meta” and then go on your merry way forward. I guess for computer agents it could be practical for the agents to have amnesia about the actual games that they play. But for humans, that kind of information is going to be shared when it is applied in the game. And there is the issue of proving that you didn’t cheat by providing a plausible method. • no, I mean, if the game playing agent is highly general, and is the type to create art as a subquest/communication like we are—say, because of playing a cooperative game—how would an ideal legal system respond differently to that vs to a probabilistic model of existing art with no other personally-generated experiences? • Yes; that’s what my last paragraph (“learning from other people’s work without their consent is something humans do all the time...”) covers. • Here are two artists exploring the issues of AI in art, and here is another artist arguing against it.
The former includes a few comments on AI in general and what is coming in the near future. “AI is not human. You play with a lion cub and it’s fun, but that is before it’s tasted human blood. So we may be entertaining something that is a beast that will eat us alive, and we cannot predict, we can speculate but we cannot predict, where this is going. And so there is a legitimate concern that it’s going to do what it does in ways that we don’t know yet.” • This post trims down the philosophical premises that sit under many accounts of AI risk. In particular it routes entirely around notions of agency, goal-directedness, and consequentialism. It argues that it is not humans losing power that we should be most worried about, but humans quickly gaining power and misusing such a rapid increase in power. Re-reading the post now, I have the sense that the arguments are even more relevant than when it was written, due to the broad improvements in machine learning models since it was written. The arguments in this post apply much more cleanly to models like GPT-3 and DALL-E than do arguments based on agency and goal-directedness. The most useful follow-up work would probably be to contrast it more directly to other accounts of AI risk, perhaps by offering critiques of other accounts. • This is cute, but I have strong qualms with your 3rd prediction; I don’t disagree, per se, but • Either “variants of this approach” is too broad to be useful, including things like safety by debate and training a weak AI to check the input • Or, if I take “variants” narrowly to mean using an AI to check its own inputs, my estimate is “basically zero” So I want to double check: what counts as a variant and what doesn’t? • I was using it rather broadly, considering situations where a smart AI is used to oversee another AI, and this is a key part of the approach. 
I wouldn’t usually include safety by debate or input checking, though I might include safety by debate if there was a smart AI overseer of the process that was doing important interventions. • 8 Dec 2022 13:36 UTC 3 points 0 ∶ 0 How likely is it that this becomes a legal problem rendering models unable to be published? Note that using models privately (even within a firm) will always be an option, as copyright only applies to distribution of the work. • I think it’s pretty likely that the distribution of models trained on unlicensed copyrighted works that are capable of regurgitating close matches for those works is already a copyright violation. If the fair use defense relies on the combination of the model and how you use it being sufficiently transformative, that doesn’t mean that the model itself qualifies. • 8 Dec 2022 13:29 UTC 1 point 0 ∶ 0 I also tend to find myself arguing against short timelines by default, even though I feel like I take AI safety way more seriously than most people. At this point, how many people with long timelines are there still around here? I haven’t explicitly modeled mine, but it seems clear that they’re much, much longer (with significant weight on “never”) than the average LessWronger’s. The next few years will for sure be interesting as we see the “median LessWrong timeline” clash with reality. • A year and a half ago I wrote this detailed story of how the next five years would go. Which parts of it do you disagree with? • Sure, let me do this as an exercise (ep stat: babble mode). Your predictions are pretty sane overall, but I’d say you handwave away problems (like integration over a variety of domains, long-term coherent behavior, and so on) that I see as (potentially) hard barriers to progress. 2022 • 2022 is basically over and I can’t get a GPT instance to order me a USB stick online.
2023 • basically agree, this is where we’re at right now (perhaps with the intensity turned down a notch) 2024 • you’re postulating that “It’s easy to make a bureaucracy and fine-tune it and get it to do some pretty impressive stuff, but for most tasks it’s not yet possible to get it to do OK all the time.” I have a fundamental disagreement here. I don’t think these tools will be effective at doing any task autonomously (fooling other humans doesn’t count, neither does forcing humans to only interact with a company through one of these). Currently (2022) ChatGPT is arguably useful as a babbling tool, stimulating human creativity and making templating easier (this includes things like easy coding tasks). I don’t see anything in your post that justifies the implicit jump in capabilities you’ve snuck in here. • broadly agree with your ideas on propaganda, from the production side (i.e. that lots of companies/governments will be doing lots of this stuff). But I think that general attitudes in the population will shift (cynicism etc.) and provide some amount of herd immunity. Note that the influence of the woke movement is already fading, shortly after it went truly mainstream and started having visible influence in average people’s lives. This is not a coincidence. 2025 • Doing well at Diplomacy is not very related to general reasoning skills. I broadly agree with Zvi’s take and also left some of my thoughts there. • I’m very skeptical that bureaucracies will be the way forward. They work for trivial tasks but reliably get lost in the weeds and start talking to themselves in circles for anything requiring a non-trivial amount of context. • disagree on orders-of-magnitude improvements in hardware. You’re proposing a 100x decrease in costs compared to 2020, when it’s not even clear our civilization is capable of keeping hardware at current levels generally available, let alone coping with a significant increase in demand.
Semiconductor production is much more centralized/fragile than people think, so even though billions of these things are produced per year, the efficient market hypothesis does not apply to this domain. 2026 • Here you’re again postulating jumps in capabilities that I don’t see justified. You talk about the “general understanding and knowledge of pretrained transformers”, when understanding is definitely not there, and knowledge keeps getting corrupted by the AI’s tendency to synthesize falsities as confidently as truths. Insofar as the AI can be said to be intelligent at all, it’s all symbol manipulation at a high simulacrum level. Integration with real-world tasks keeps mysteriously failing as the AI flounders around in a way that is simultaneously very sophisticated, but oh so very reminiscent of 2022. • disagree about your thoughts on propaganda, which is just an obvious extension of my 2024 thoughts above. I also notice that social changes this large take orders of magnitude longer to percolate through society than what you predict, so I disagree with your predictions even conditioned on your views of the raw effectiveness of these systems. • “chatbots quickly learn about themselves” etc. Here you’re conflating the regurgitation of desirable phrases with actual understanding. I notice that as you write your timeline, your language morphs to make your AIs more and more conscious, but you’re not justifying this in any way other than… something something self-referential, something something trained on their own arXiv papers. I don’t mean to be overly harsh, but here you seem to be sneaking in the very thing that’s under debate! • Excellent, thanks for this detailed critique! I think this might be the best critique that post has gotten thus far; I’ll probably link to it in the future. Point-by-point reply, in case you are interested: 2022-2023: Agree.
Note that I didn’t forecast that an AI could buy you a USB stick by 2022; I said people were dreaming of such things but that they didn’t actually work yet. 2024: We definitely have a real disagreement about AI capabilities here; I do expect fine-tuned bureaucracies to be useful for some fairly autonomous things by 2024. (For example, I expect the USB stick thing to work fine by 2024.) Not just babbling and fooling humans and forcing people to interact with a company through them. Re propaganda/persuasion: I am not sure we disagree here, but insofar as we disagree I think you are correct. We agree about what various political actors will be doing with their models—propaganda, censorship, etc. We disagree about how big an effect this will have on the populace. Or at least, 2021-me disagrees with 2022-you. I think 2022-me has probably come around to your position as well; like you say, it just takes time for these sorts of things to influence the public + there’ll probably be a backlash/immunity effect. Idk. 2025: I admit I overestimated how hard Diplomacy would turn out to be. In my defense, Cicero only won because the humans didn’t know they were up against a bot. Moreover it’s a hyper-specialized architecture trained extensively on Diplomacy, so it indeed doesn’t have general reasoning skills at all. We continue to disagree about the potential effectiveness of fine-tuned bureaucracies. To be clear I’m not confident, but it’s my median prediction. I projected a 10x decrease in hardware costs, and also a 10x improvement in algorithms/software, from 2020 to 2025. I stand by that prediction. 2026: We disagree about whether understanding is (or will be) there. I think yes, you think no. I don’t think that these AIs will be “merely symbol manipulators” etc. I don’t think the data-poisoning effect will be strong enough to prevent this. As mentioned above, I do take the point that society takes a long time to change and probably I shouldn’t expect the propaganda etc.
to make that much of a difference in just a few years. Idk. I’m not conflating those things, I know they are different. I am and was asserting that the chatbots would actually have understanding, at least in all the behaviorally relevant senses (though I’d argue in the philosophical senses as well). You are correct that I didn’t argue for this in the text—but that wasn’t the point of the text; the text was stating my predictions, not attempting to argue for them. ETA: I almost forgot: it sounds like you mostly agree with my predictions, but think AGI still won’t be nigh even in my 2026 world? Or do you instead think that the various capabilities demonstrated in the story won’t occur in real life by 2026? This is important because if 2026 comes around and things look more or less like I said they would, I will be saying that AGI is very near. Your original claim was that in the next few years the median LW timeline would start visibly clashing with reality; so you must think that things in real-life 2026 won’t look very much like my story at all. I’m guessing the main way it’ll be visibly different, according to you, is that AI still won’t be able to do autonomous things like go buy USB sticks? Also they won’t have true understanding—but what will that look like? Anything else? • 8 Dec 2022 12:40 UTC −2 points 1 ∶ 7 “AI capabilities” and “AI alignment” are highly related to each other, and “AI capabilities” has to come first, in that alignment assumes that there is a system to align. I agree that for people on the cutting edge of research like OpenAI, it would be a good idea for at least some of them to start thinking deeply about alignment instead. There are two reasons for this: 1) OpenAI is actually likely to advance capabilities a pretty significant amount, and 2) due to the expertise they’ve developed from working on AI capabilities, they’re much more likely to make important progress on AGI alignment than e.g. MIRI.
But I think there’s something of a “reverse any advice you hear” thing going on—the people most likely to avoid working on capabilities as a result of this post are those who would actually benefit from working on AI capabilities for a while, even if they don’t intend to publish their results, in order to build more expertise in AI. Capabilities is the foundation of the field, and trying to theorize about how to control an AI system without having anything but the vaguest ideas about how the AI system will work isn’t going to get you anywhere. For example, Eliezer is in a pessimistic doom-spiral while also being, by his own admission, pretty useless at solving alignment. If he would just take a break and try to make an AI good at Atari for six months, then I think he’d find he was a lot more effective at alignment afterwards and would realize that AGI isn’t as imminent as he currently believes it is. Of course, the very fact that he thinks it’s imminent means he won’t do this; such is life. • “Working on AI capabilities” explicitly means working to advance the state of the art of the field. Skilling up doesn’t do this. Hell, most ML work doesn’t do this. I would predict >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as “capabilities work” in the sense of this post. For instance, I’ve spent the last six months studying reinforcement learning and Transformers, but I haven’t produced anything that has actually reduced timelines, because I haven’t improved anything beyond the level that humanity was capable of before, let alone published it. If you work on research engineering in a similar manner, but don’t publish any SOTA results, I would say you haven’t worked on AI capabilities in the way this post refers to them.
• Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you’d want to do that as part of your attempt). There’s a huge difference between reading about what’s worked in the past and trying to think of new things that could work and then trying them out to see if they do. As I’ve learned more about deep learning and tried to understand the material, I’ve constantly had ideas that I think could improve things. Then I’ve tried them out, and usually learned that they didn’t, or they did but they’d already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I was wary of doing AI capability work. Each time I had one of these ideas, I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I’d have a much worse understanding of AI, I’d be a lot more concerned about imminent AGI (after all, I had tons of ideas for how things could be done better!), and I wouldn’t have actually delayed AGI timelines at all. I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to, they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don’t think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is right when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines. Suppose we have a current expected AGI arrival date of 20XX.
This is the result of DeepMind, Google Brain, OpenAI, FAIR, Nvidia, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there’s computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from this race? About zero. I don’t think people properly internalize both how insignificant the timeline difference is, and also how big the skill gains are from actually trying your hardest at something as opposed to handicapping yourself. And if you come up with something you’re genuinely worried about, you can just not publish. • Thanks for making things clearer! I’ll have to think about this one—some very interesting points from a side I had perhaps unfairly dismissed before. • 8 Dec 2022 12:38 UTC 5 points 2 ∶ 0 Among humans, +6 SD g factor humans do not in general seem as much more capable than +3 SD g factor humans as +3 SD g factor humans are compared to median humans. I’m sceptical of this. Can you say more about why you think this is true? Assuming a Gaussian distribution, +6 SD is much rarer than +3 SD, which is already quite rare. There are probably fewer than 10 +6 SD people alive on the earth today, whereas there are ~10 million +3 SD people. Given the role of things like luck, ambition, practical knowledge, etc., it’s not surprising that we see several of the +3 SD people accomplishing things far greater than any of the +6 SD g-factor people, purely on the basis of their much greater abundance. And that’s ignoring potential trade-off effects.
Among humans, increased intelligence often seems to come at the cost of lowered social skills and practical nature; there are certainly many intelligent people who are good at sociality and practicality, but there is an inverse correlation (though of course, being intelligent also helps directly to make up for those shortcomings). There’s no reason to expect that these same trade-offs will be present in artificial systems, which take completely different physical forms, both in size/form-factor and in the materials and architectures used to build them. And the incentive gradients that govern the development and construction of artificial systems are also quite different from those that shape humans. • Why assume Gaussian? • The normal distribution is baked into the scoring of intelligence tests. I do not know what the distribution of raw scores looks like, but the calculation of the IQ score is done by transforming the raw scores to make them normally distributed with a mean of 100. There is surely not enough data to do this transformation out to ±6 SD. • In general, excluding a few fields, I’m not aware that g factor beyond +3 SD shows up in an important way in life outcomes. The richest/most powerful/most successful aren’t generally the smartest (again, excluding a few fields). It has been pointed out to me that the lack of such evidence of cognitive superiority may simply be because there’s not enough data on people above +3 SD g factor. But regardless, when I look at our most capable people, they just don’t seem to be all that smart. This is a position I might change my mind on, if we were able to get good data quantifying the gains to real-world capabilities moving further out on the human spectrum. • The richest/most powerful/most successful aren’t generally the smartest (again, excluding a few fields).
That is exactly addressed by the comment you are replying to: There are probably fewer than 10 +6 SD people alive on the earth today, whereas there are ~10 million +3 SD people. Imagine a world containing exactly 10 people with IQ 190, each of them having a 100% chance to become one of “the best”; and 10,000,000 people with IQ 145, each of them having a 0.001% chance to become one of “the best”. In such a world, we would have 110 people who are “the best”, and 100 of them would have IQ 145. Just because they are a majority in the category doesn’t mean that their individual chances are similar. • No, I wasn’t directly comparing +6 SD to +3 SD. It’s more that gains from higher g factor beyond +3 SD seem to be minimal/nonexistent in commerce, politics, etc. Hard science research and cognitive sports are domains in which the most successful seem to be above +3 SD g factor. I’m not compelled by the small-sample-size objection because there actually are domains in which the most successful are on average > +3 SD g factor. Those domains just aren’t commerce/politics/other routes of obtaining power. As best as I can tell, your reply seems like a misunderstanding of my objection? • The richest/most powerful/most successful aren’t generally the smartest (again, excluding a few fields). Bill Gates has more than +3 SD g factor given his SAT scores. With Bezos, we don’t know his SAT scores, but we do know that he was valedictorian. According to Wikipedia, the school he attended features in lists of the top 1000 schools in the US. This suggests that the average student at the school is significantly smarter than the average US citizen, so being valedictorian at that school likely also suggests >3 SD g factor. Ben Bernanke and Yellen as chairs of the Federal Reserve also seem to be examples of people with significantly more than 3 SD g factor. I don’t think you get the 22.4% share of Jewish Nobel prize winners without IQ beyond +3 SD g factor helping with winning Nobel prizes.
• Wait, how are you estimating Ben Bernanke and Yellen’s g factor? Your reasons for guessing it seem much less compelling to me than for Gates and Bezos. I mean, inferring from the SAT seems sensible. Valedictorian status is also not as sketchy; I won’t necessarily trust it, but the argument is plausible, and I expect we could later see it validated. Our hard-science superstars/chess superstars seem to have a mean and median g factor that’s +3 SD. This does not seem to be the case for self-made billionaires, politicians, bureaucrats or other “powerful people”. g factor seems to have diminishing marginal returns in how much power it lets you attain? • For Ben Bernanke it’s his SAT score. For Yellen, there’s a New York Times story where they asked a colleague to describe her and they said “small lady with a large IQ”. There are a few headlines that describe her that way as well. Chess is not an IQ-driven activity. The same goes for Go. One Go player, who I don’t think would have qualified for Mensa himself, once visited a professional Go school in Korea, and his impression was that the average professional Go player isn’t very smart. I’m not sure who you mean by hard-science superstars. There seems to be an analysis of the best scientists in 1952 that suggests a mean IQ of around 154 for them. It’s hard to know the average IQ of self-made billionaires. If we however just look at the top tech billionaires, people like Bill Gates (perfect math SAT score), Steve Ballmer (perfect math SAT score), Jeff Bezos (valedictorian at a top school) and Mark Zuckerberg (perfect SAT score) suggest that IQ helps very much. I’m not aware of any data from that class of people that speaks about people who have just 130 IQ.
If we found out that the average IQ of the best Go players was e.g. < 130, that would be a relevant counterargument, but the anecdote you presented doesn’t sound particularly relevant. Out of curiosity, what IQ range does a perfect SAT score map to? • Do you have a specific counterexample in mind when you say “when I look at our most capable people, they just don’t seem to be all that smart”? If we consider the 10 richest people in the world, all 10 of them (last time I checked) seem incredibly smart, in addition to being very driven. Success in politics seems less correlated with smarts, but I still perceive politicians in general to have decent intelligence (which is particularly applied in their ability to manipulate people), and to the extent that unintelligent people can succeed in politics, I attribute that to status dynamics largely unrelated to a person’s capability.
• 8 Dec 2022 11:46 UTC LW: 2 AF: 1 0 ∶ 0 AF “We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.” we the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the “scientists don’t know how the game of life works”. Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque? Another thing confusing to me along these lines: “for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B” where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and “seeing” its grid of cells? • Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque? The scientists don’t start off knowing how the game of life works, but they do know how their model works. The scientists don’t need to follow along with the heuristic argument, or do any ad hoc work to “understand” that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it’s important that their methods open up the model even if they never do. Intuitively, the scientists are like us evaluating heuristic arguments about how activations evolve in a neural network without necessarily having any informal picture of how those activations correspond to the world. where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and “seeing” its grid of cells? This was confusing shorthand. They notice that the A-B correlation is stronger when the A and B sensors are relatively quiet. 
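As an aside, the probability quoted at the top of this thread can be computed directly from the standard Life rules (birth on exactly 3 live neighbors, survival on 2 or 3), using the quoted assumption that the cell and each of its 8 neighbors are independently alive with probability 10%. A quick sketch:

```python
from math import comb

p = 0.1  # each cell independently alive at timestep 0

def neighbors_prob(k: int) -> float:
    """Probability that exactly k of the 8 neighbors are alive (binomial)."""
    return comb(8, k) * p**k * (1 - p)**(8 - k)

# Conway's rules: a dead cell is born with exactly 3 live neighbors;
# a live cell survives with 2 or 3 live neighbors.
p_alive_t1 = (1 - p) * neighbors_prob(3) + p * (neighbors_prob(2) + neighbors_prob(3))
print(round(p_alive_t1, 4))  # ≈ 0.0479
```

So under independence the cell is alive at timestep 1 with probability about 4.8%, which is the kind of closed-form quantity the heuristic argument in the post manipulates.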
If there are other sensors, they also notice that the A-B pattern is more common when those other sensors are quiet. That is, I expect they learn a notion of “proximity” amongst their sensors, and an abstraction of “how active” a region is, in order to explain the fact that active areas tend to persist over time and space and to be accompanied by more 1s on sensors + more variability on sensors. Then they notice that A-B correlations are more common when the area around A and B is relatively inactive. But they can’t directly relate any of this to the actual presence of live cells. (Though they can ultimately use the same method described in this post to discover a heuristic argument explaining the same regularities they explain with their abstraction of “active,” and as a result they can e.g. distinguish the case where the zone including A and B is active (and so both of them tend to exhibit more 1s and more irregularity) from the case where there is a coincidentally high degree of irregularity in those sensors or independent pockets of activity around each of A and B.) • The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven’t made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven’t found any new problems with the framework. The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem has evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured “attractor submanifold” in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.
One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present to solving alignment, and an important (informal) demonstration of the potential of the IBP formalism. • Discord recently introduced forum channels that closely approximate Zulip thread functionality, with a much more intuitive UI than Zulip. My two main gripes with Discord are the default dark theme, and the lack of embedded links. • MSFT − 10% INTEL − 10% Nvidia − 15% SMSN − 15% Goog − 15% ASML − 15% TSMC − 20% • 8 Dec 2022 9:40 UTC LW: 12 AF: 5 3 ∶ 2 AF A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human. I think this is still wrong, because of goodhart scope of AGIs and corrigibility of humans. Agent’s goodhart scope is the space of situations where it has good proxies for its preference. An agent with decisions governed by a utility function can act in arbitrary situations, it always has good proxies for its utility function. Logical uncertainty doesn’t put practical constraints on its behavior. But for an aligned AGI that seems unlikely, CEV seems complicated and possible configurations of matter superabundant, therefore there are always intractable possibilities outside the current goodhart scope. So it can at best be said to have a utility function over its goodhart scope, not over all physically available possibilities. Thus the only utility function it could have is itself a proxy for some preference that’s not in practice a utility function, because the agent can never actually make decisions according to a global utility function. 
Conversely, any AGI that acts according to a global utility function is not aligned, because its preference is way too simple. Corrigibility is in part modification of agent’s preference based on what happens in environment. The abstraction of an agent usually puts its preference firmly inside its boundaries, so that we can consider the same agent, with the same preference, placed in an arbitrary environment. But a corrigible agent is not like that, its preference depends on environment, and in the limit it’s determined by its environment, not just by the agent. Environment doesn’t just present the situations for an agent to choose from, it also influences the way it’s making its decisions. So it becomes impossible to move a corrigible agent to a different environment while preserving its preference, unless we package its whole original environment as part of the agent that’s being moved to a new environment. Humans are not at all classical agent abstractions that carry the entirety of their preference inside their heads, they are eminently corrigible, their preference depends on environment. As a result, an aligned AGI must be corrigible not just temporarily because it needs to pay attention to humans to grow up correctly, but permanently, because its preference must also continually incorporate the environment, to remain the same kind of thing as human preference. Thus even putting aside logical uncertainty that keeps AGI’s goodhart scope relatively small, an aligned AGI can’t have a utility function because of observational/​indexical uncertainty, it doesn’t know everything in the world (including the future) and so doesn’t have the data that defines its aligned preference. • A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. 
That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and making predictions about the agent in question, or whether mapping the agent’s actual motivational schema to a utility function only adds unwieldy complications. In the case of humans, I would say it’s far more useful to model us as generating and pursuing arbitrary goal states/trajectories over time. These goals are continuously learned through interactions with the environment and its impact on pain and pleasure signals, deviations from homeostatic set points, and aesthetic and social instincts. You might be able to model this as a utility function with a recursive hidden state, but would that be helpful?

• any possible set of behaviors can be construed as maximizing some utility function

(Edit: What do you mean? This calls to mind a basic introduction to what utility functions do, given below, but that’s probably not what the claim is about, given your background and other comments. I’ll leave the rest of the comment here, as it could be useful for someone.)

A utility function describes decisions between lotteries, which are mixtures of outcomes, or more generally events in a sample space. The setting assumes uncertainty: outcomes are only known to be within some event, not individually. So a situation where a decision can be made is a collection of events/lotteries, one of which gets to be chosen; the choice is the behavior assigned to this situation. This makes situations reuse parts of each other; they are not defined independently. As a result, it becomes possible to act incoherently, for example to pick A from (A, B), pick B from (B, C), and pick C from (A, C).
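The intransitive pattern just given can be checked mechanically. Here is a minimal sketch (the brute-force search and the function name are illustrative, not from the thread): a pairwise choice function is consistent with some utility function exactly when a single ranking of the options agrees with every choice.

```python
# Toy check of the incoherence example above: a choice function over
# pairs is consistent with *some* utility function iff a single ranking
# of the options agrees with every pairwise choice.

from itertools import permutations

def utility_for(choices):
    """choices: dict mapping an option pair (x, y) to the option picked
    from {x, y}. Returns a utility assignment consistent with every
    choice, or None if no utility function rationalizes the choices."""
    options = sorted({o for pair in choices for o in pair})
    for ranking in permutations(options):
        u = {o: i for i, o in enumerate(ranking)}  # position = utility
        if all(u[picked] == max(u[a], u[b])
               for (a, b), picked in choices.items()):
            return u
    return None

# The incoherent pattern from the comment:
# pick A from (A, B), B from (B, C), C from (A, C).
cyclic = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
assert utility_for(cyclic) is None  # no utility function fits

# A coherent variant: pick A from (A, C) instead, and a utility
# function with u(A) > u(B) > u(C) rationalizes all three choices.
coherent = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "A"}
u = utility_for(coherent)
assert u["A"] > u["B"] > u["C"]
```

For three options this is just a transitivity check; the brute force only makes the "some utility function exists" quantifier explicit.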
Only when a collection of behaviors satisfies certain properties does there exist a probability measure and a utility function such that the agent’s choice among the collection of events in any situation coincides with picking the event that has the highest expected utility.

Put differently, the issue is that behavior described by a utility function is actually behavior in all possible and counterfactual situations, not in some specific situation. Existence of a utility function says something about which behaviors in different situations can coexist. Without a utility function, each situation could get an arbitrary response/behavior of its own, independently of the responses given in other situations. But requiring a utility function makes that impossible; some behaviors become incompatible with the others. In the grandparent comment, I’m treating utility functions more loosely, but their role in constraining the collections of behaviors assigned to different situations is the same.

• (Given the current two disagreement-votes, I’m very curious which points the people who disagree take issue with. My impression of the points I’m making is that they are somewhat obscure, but don’t contradict any popular/likely views that come to mind, when the framing of the comment is accepted. So I’m missing something; a healthy situation is one where I’m aware of counterarguments even if I disagree with them. Is it disagreement with the framing, such as the notion of Goodhart scope or the offhand references to preferences and CEV of humans, given that the post is about issues with ascribing utility functions to humans?)

• In a book by Jeremy Siegel, he gives you the option of investing in an oil company vs IBM back in the very old days. I do not remember the details; I think it is this book: https://www.amazon.co.uk/Stocks-Long-Run-Definitive-Investment/dp/0071800514. The oil stock beats IBM by a very large margin over several decades with dividends reinvested.
If you are doing this for investment returns, then valuation and a stable business are what matter.

• 8 Dec 2022 6:21 UTC
LW: 1 AF: 1
0 ∶ 0
AF

Concept Dictionary. Concepts that I intend to use or invoke in my writings later, or that are parts of my reasoning about AI risk or related complex-systems phenomena.

• Thank you so much for the excellent and insightful post on mechanistic models, Evan! My hypothesis is that the difficulty of finding mechanistic models that consistently make accurate predictions is likely due to the agent-environment system’s complexity and computational irreducibility. Such agent-environment interactions may be inherently unpredictable “because of the difficulty of pre-stating the relevant features of ecological niches, the complexity of ecological systems and [the fact that the agent-ecology interaction] can enable its own novel system states.”

Suppose that one wants to consistently make accurate predictions about a computationally irreducible agent-environment system. In general, the most efficient way to do so is to run the agent in the given environment. There are probably no shortcuts, even via mechanistic models. For dangerous AI agents, an accurate simulation box of the deployment environment would be ideal for safe empiricism. This is probably intractable for many use cases of AI agents, but computational irreducibility implies that methods other than empiricism are probably even more intractable. Please read my post “The limited upside of interpretability” for a detailed argument. It would be great to hear your thoughts!

• This is very upsetting to me.

1. People would start using big words they don’t understand, or using uncommon synonyms when a small common word would do. I hate it when people do this trying to sound smart. The archetypal example of this is Kingpin in the Marvel shows, whom I genuinely cannot stand. More people sounding like midwit try-hards does not lead to a better world.

2. Increased neologism.
They’re funny, but they decrease the quality of communication for everyone involved.

• Maybe this is neither here nor there, but I’d love to see models that fully trace the impact of each individual training example through a model.

• This is an interesting thought, but it seems very hard to realize, as you have to distill the unique contribution of the sample, as opposed to much more widespread information that happens to be present in the sample. Weight updates depend heavily on training order, of course, so you’re really looking for something like the Shapley value of the sample, except that “impact” is liable to be an elusive, high-dimensional quantity in itself.

• Hmmmm. Yeah, essentially what I’m asking for is certified classification… and intuitively I don’t think that’s actually too much to ask for. There has been some work on certifying neural networks, and it has led me to believe that the current bottleneck is that models are too dense by several orders of magnitude. Concerningly, more sparse models are also significantly more capable. One would need to ensure that the update is fully tagged at every step of the process such that you can always be sure how you are changing decision boundaries...

• 8 Dec 2022 5:05 UTC
2 points
0 ∶ 0

I suspect I’ve been nerdsniped by a wrong question somehow. “What if X happened?” means “what if X happened, and the set of things I can and do think of when analyzing events in the implied context otherwise stayed the same?” This set doesn’t include a complete causal chain (and, since you’re a finite human, couldn’t possibly do so). “What if quantum computers could solve P=NP?” doesn’t mean you should consider the effect that quantum computers have on other things, because when you think about those other things your chain of reasoning normally won’t go all the way back to the relevant math and physics.
You could choose to go back to math and physics anyway, but by doing so you are misreading the question; the question implies “only go back as far as you normally would go.” You could also say “well, the implied context is ‘make deductions about math and physics’”, in which case, yeah, it’s a good objection, but you may not be very good at reading implied contexts.

• [deleted]

• I just want to note the origin and context of “Algernon effect” for anyone who might stumble across this. Eliezer Yudkowsky based the term “Algernon’s Law” on the SF book Flowers for Algernon and used it loosely to refer to the idea that evolution has probably found most of the simple ways to increase human intelligence in ways that benefit transmission of the genes involved. Then Gwern built on Eliezer’s writing and others’ in his coverage of purported intelligence-enhancing drugs and other practices. Scott cited Gwern in redefining Algernon’s Law to mean “your body is already mostly optimal, so adding more things is unlikely to have large positive effects unless there’s some really good reason,” and now it’s being used here to mean “it’s easier to hurt yourself than help.”

I haven’t looked much into intelligence research, but the mainstream understanding of this idea in aging research is based on antagonistic pleiotropy and diminishing selection pressure with age.

• Genes that cause disadvantages at later ages (which impact fewer organisms) may give a reproductive advantage at a younger age, and thereby achieve a net reproductive advantage.

• The optimizing pressure of natural selection diminishes with age, particularly in the post-reproductive part of the life cycle.

This helps explain why people age, which is just another word for the development of health problems over time and the mortality risk they cause. It may also help explain evolutionary limits on intelligence.
A gene that enhances intelligence but lowers the chance of reproduction overall in the ancestral environment will be selected against. For example, if a gene increases intelligence but delays puberty, causing the organism to suffer more brushes with death in the wild, evolution may select it out of the gene pool, even though this particular form of evolutionary cost may not be one that we particularly care about, or one that even impacts us very much in our modern, low-risk environment. None of this is to necessarily contradict Elizabeth’s comment, just to add context.

• [deleted]

• There is no calculation problem whatsoever in appraising land, which is commonplace today. It’s only the uniform application of the same formula to every enrolled parcel; the comparison will vary a bit, but remain generally fair. It’s not at all essential to arrive at a ‘perfect’ number; it’s just an administrative decision, just the method of arriving at a standard equivalence. Otherwise it could just be $5,000/acre across the board.

If you don’t like the assessment, it’s immediately appealable through the administrative process, and then into the judicial courts. That’s how it works right now; the innovation of Henry George is taxing only the land value, and ignoring the improvements.
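To make the contrast concrete, here is a minimal sketch (all assessed values, function names, and the rate are invented for illustration) of how taxing only land value differs from an ordinary property tax that also falls on improvements:

```python
# Toy comparison of a conventional property tax vs. a land value tax (LVT).
# Assessed values and the rate below are made up for illustration only.

def property_tax(land_value, improvement_value, rate):
    """Conventional property tax: falls on land *and* improvements."""
    return (land_value + improvement_value) * rate

def land_value_tax(land_value, improvement_value, rate):
    """Georgist LVT: taxes only the assessed land value, so building on
    the parcel does not raise the bill."""
    return land_value * rate

vacant_lot = dict(land_value=100_000, improvement_value=0)
same_lot_with_house = dict(land_value=100_000, improvement_value=400_000)
rate = 0.02

# Under a conventional property tax, building a house quintuples the bill.
assert property_tax(**vacant_lot, rate=rate) == 2_000.0
assert property_tax(**same_lot_with_house, rate=rate) == 10_000.0

# Under an LVT the two parcels owe the same amount, so the tax
# doesn't penalize improving the land.
assert land_value_tax(**vacant_lot, rate=rate) == 2_000.0
assert land_value_tax(**same_lot_with_house, rate=rate) == 2_000.0
```

The point of the sketch is only the structural difference: the improvement term drops out of the LVT, which is why appraisal of the land alone is the administratively interesting step.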

• :D

I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/​coordinate an alignment research scene in NYC more broadly.

• More organizations like CAIS that aim to recruit established ML talent into alignment research

This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

• 8 Dec 2022 2:16 UTC
LW: 7 AF: 4
4 ∶ 0
AF

Why is this specific to CAIS, as opposed to other frameworks? (Seems like this is a fairly common implication of systems that prevent people from developing rogue AGIs)

• Just read your latest post on your research program and your attempt to circumvent social reward, then came here to get a sense of your hunt for a paradigm.

Here are some notes on Human in the Loop.

You say, “We feed our preferences in to an aggregator, the AI reads out the aggregator.” One thing to notice is that this framing makes some assumptions that might be too specific. It’s really hard, I know, to be general enough while still having content. But my ears pricked up at this one: does it have to be an ‘aggregator’? Maybe the best way of revealing preferences is not through an aggregator. Notice that I use the more generic ‘reveal’ as opposed to ‘feed’, because ‘feed’, at least to me, implies some methods of data discovery and not others. Also, I worry about what useful routes aggregation might fail to imply.

I hope this doesn’t sound too stupid and semantic.

You also say, “This schema relies on a form of corrigibility.” My first thought was actually that it implies human corrigibility, which I don’t think is a settled question. Our difficulty having political preferences that are not self-contradictory, preferences that don’t poll one way then vote another, makes me wonder about the problems of thinking about preferences over all worlds and preference aggregation as part of the difficulty of our own corrigibility. Combining that with the incorrigibility of the AI makes for a difficult solution space.

On emergent properties, I see no way to escape the “First we shape our spaces, then our spaces shape us” conundrum. Any capacity that is significantly useful will change its users from their previous set of preferences. Just as certain AI research might be distorted by social reward, so too can AI capabilities be a distorting reward. That’s not necessarily bad, but it is an unpredictable dynamic, since value drift when dealing with previously unknown capabilities seems hard to stop (especially since intuitions will be weak to nonexistent).

• This is one of the reasons why there’s a fair amount of discussion of bargaining on here. In a multipolar world, agents will likely find that they are better off bargaining rather than destroying each other, and so you probably don’t get a universe where everyone is dead; instead you get a world that’s the outcome of a bargaining process.

Or if there’s an offense bias but one agent is favored over the others, maybe it ignores bargaining, wipes out its enemies, and you no longer have a multipolar world.
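The bargaining-beats-destruction point can be illustrated with a toy two-player game (the payoff numbers are invented for the sketch, not taken from the thread):

```python
# Toy symmetric game: each agent either fights or bargains.
# Payoffs are (row player, column player); numbers are illustrative only.
# Mutual fighting destroys most of the value; bargaining splits it.

PAYOFFS = {
    ("fight", "fight"): (-10, -10),   # mutual destruction
    ("fight", "bargain"): (3, -8),    # aggressor grabs a little, victim loses a lot
    ("bargain", "fight"): (-8, 3),
    ("bargain", "bargain"): (5, 5),   # negotiated split
}

def best_response(opponent_action):
    """Action maximizing the row player's payoff against a fixed opponent."""
    return max(["fight", "bargain"],
               key=lambda a: PAYOFFS[(a, opponent_action)][0])

# With these payoffs, bargaining is a best response to bargaining, so
# (bargain, bargain) is a Nash equilibrium, and it Pareto-dominates
# mutual destruction.
assert best_response("bargain") == "bargain"
assert PAYOFFS[("bargain", "bargain")][0] > PAYOFFS[("fight", "fight")][0]
```

The thread's caveat maps onto the payoffs: a strong enough offense bias for one favored agent would amount to raising its ("fight", "bargain") payoff above the bargaining split, at which point fighting becomes its best response and the equilibrium above disappears.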

• Hm, logically this makes sense, but I don’t think most agents in the world are fully rational, hence the continuing problems with potential threats of nuclear war despite mutually assured destruction and extremely negative sum outcomes for everyone. I think this could be made much more dangerous by much more powerful technologies. If there is a strong offense bias and even a single sufficiently powerful agent willing to kill others, and another agent willing to strike back despite being unable to defend themselves by doing so, this could result in everyone dying.

The other problem is that maybe there is an apocalyptic-terrorist, Unabomber-style, anti-natalist negative-utilitarian type who is able to access this technology and just decides to literally kill everyone.

I definitely think a multipolar situation decaying into a unipolar one seems like a possibility. I guess one thing I’m trying to do is weigh how likely this is against other scenarios where multipolarity leads to mutually assured destruction or apocalyptic terrorism.

• Upvoted, but it’s important to be very cautious about advancing capabilities.

• Gosh, someone made a gigantic flowchart of AI Alignment and posted it on here a few months back. But I can’t remember who it was at the moment.

Fortunately, I am a good googler: https://www.alignmentforum.org/s/aERZoriyHfCqvWkzg

If you’re interested in categorizing all the things, you might imagine generating dichotomies by extremizing nodes or relationships in such a flowchart.

• look, at least y’all can’t say I didn’t warn you. Have a good one

• At this point in history, you have to be a bit more specific than the label “AGI,” because I’d already consider language models to be above the minimum standard for “AGI.”

But if you mean a program that navigates the real world at a near-human level and successfully carries out plans to perpetuate its existence, then I would expect such a program to have to work “out of the box,” rather than being a pure simulacrum.

Not to say that language models can’t be involved, but I’d count things like starting with a language model and then training it (or some supernetwork) to be an agent with RL as “designing it as an agent.”

• Thank you for your answer. In my example I was thinking of an AI such as a language model that would have latent ≥human-level capability without being an agent, but could easily be made to emulate one just long enough for it to get out of the box, e.g. duplicate itself. Do you think this couldn’t happen?

More generally, I am wondering whether the field of AI safety research studies somewhat specific scenarios based on the current R&D landscape (e.g. “A car company makes an AI to drive a car, and then someone does xyz, and then paperclips”) and tailor-made safety measures, in addition to more abstract ones like those in A Tentative Typology of AI-Foom Scenarios, for instance.

• I think that would have the form of current AI research, but would involve extremely souped-up models of the world relative to what we have now (even moreso for the self-driving car), to the extent that it’s not actually that close to modern AI research. I think it’s reasonable to focus our efforts on deliberate attempts to make AGI that navigates the real world.

• 8 Dec 2022 0:50 UTC
11 points
4 ∶ 0

Why not create non-AI startups that are way less likely to burn capabilities commons?

• It seems to me joshc is arguing that it’s relatively easy to make money with AI startups at the moment.

• The commons is on fire and the fire is already self-preserving. Do you want to put the fire out? Then become the fire. Stop trying to tell the fire to slow down; it’s an extremely useless thing to do unless you’re ready to start pushing against capitalism as a whole.

You can unilaterally slow down AI progress by not working on it. Each additional day until the singularity is one additional day to work on alignment.

“Becoming the fire” because you’re doomer-pilled is maximally undignified.

• You cannot unilaterally slow down AI progress by not working on it??? what the fuck kind of opinion is that? deepmind is ahead of you. Deepmind will always be ahead of you. You cannot catch up to deepmind. for fuck’s sake, deepmind has a good shot of having TAI right now, and you want me to slow the fuck down? the fuck is your problem, have you still not updated off of deep learning?

• I mean, yeah, I definitely don’t belong on this website, I’m way too argumentative. like, I’m not gonna contest that. But are you gonna actually do anything about your beliefs, or are you gonna sit around insisting we gotta slow down?

• I find the accusation that I’m not going to do anything slightly offensive.

Of course, I cannot share what I have done and plan to do without severely de-anonymizing myself.

I’m simply not going to take humanity’s horrific odds of success as a license to make things worse, which is exactly what you seem to be insisting upon.

• no, there’s no way to make it better that doesn’t involve going through, though. your model that any attempt to understand or use capabilities is failure is nonsense, and I wish people on this website would look in a mirror about what they’re claiming when they say that. that attitude was what resulted in mispredicting alphago! real safety research is always, always, always capabilities research! it could not be otherwise!

• You don’t have an accurate picture of my beliefs, and I’m currently pessimistic about my ability to convey them to you. I’ll step out of this thread for now.

• that’s fair. I apologize for my behavior here; I should have encoded my point better, but my frustration is clearly incoherent and overcalibrated. I’m sorry to have wasted your time and reduced the quality of this comments section.

• Everything is a matter of perspective.

It’s totally valid to take a perspective in which an AI trained to play Tetris “doesn’t want to play good Tetris, it just searches for plans that correspond to good Tetris.”

Or even that an AI trained to navigate and act in the real world “doesn’t want to navigate the real world, it just searches for plans that do useful real-world things.”

But it’s also a valid perspective to say “you know, the AI that’s trained to navigate the real world really does want the things it searches for plans to achieve.” It’s just semantics in the end.

But! Be careful about switching perspectives without realizing it. When you take one perspective on an AI, and you want to compare it to a human, you should keep applying that same perspective!

From the perspective where the real-world-navigating AI doesn’t really want things, humans don’t really want things either. They’re merely generating a series of outputs that they think will constitute a good plan for moving their bodies.

• TAI by 2028, get your head out of your ass and study capabilities! Don’t be wooed by how paralyzed MIRI is, deep learning has not hit a wall!