Emotional issues often have an immediate payoff

Chipmonk · Jun 10, 2024, 11:39 PM
26 points
2 comments · 4 min read · LW link
(chrislakin.blog)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled · Jun 10, 2024, 9:20 PM
29 points
13 comments · 2 min read · LW link

Plop! Goes the Concept

Jonathan Moregård · Jun 10, 2024, 7:23 PM
6 points
0 comments · 8 min read · LW link
(honestliving.substack.com)

What can we learn from orcas?

Jonasb · Jun 10, 2024, 6:01 PM
1 point
0 comments · 8 min read · LW link
(www.denominations.io)

How to build a data center, by Construction Physics

TheManxLoiner · Jun 10, 2024, 5:38 PM
2 points
0 comments · 1 min read · LW link
(www.construction-physics.com)

Observations for doing debate with models behind APIs

PoD123 · Jun 10, 2024, 4:22 PM
3 points
0 comments · 3 min read · LW link

My AI Model Delta Compared To Yudkowsky

johnswentworth · Jun 10, 2024, 4:12 PM
289 points
103 comments · 4 min read · LW link

[Question] Good ways to monetarily profit from the increasing demand for power?

Matt Goldenberg · Jun 10, 2024, 3:29 PM
12 points
5 comments · 1 min read · LW link

The Evolution towards the Blank Slate

Arturo Macias · Jun 10, 2024, 3:20 PM
−6 points
0 comments · 3 min read · LW link

10 Public “I was wrong” Admissions by Scientists and Intellectuals

Hashem ElAssad · Jun 10, 2024, 2:19 PM
0 points
3 comments · 1 min read · LW link

[Valence series] 4. Valence & Liking / Admiring

Steven Byrnes · Jun 10, 2024, 2:19 PM
48 points
12 comments · 15 min read · LW link

5. Open Corrigibility Questions

Max Harms · Jun 10, 2024, 2:09 PM
30 points
0 comments · 7 min read · LW link

4. Existing Writing on Corrigibility

Max Harms · Jun 10, 2024, 2:08 PM
55 points
15 comments · 106 min read · LW link

On Dwarkesh’s Podcast with Leopold Aschenbrenner

Zvi · Jun 10, 2024, 12:40 PM
102 points
7 comments · 59 min read · LW link
(thezvi.wordpress.com)

Summary of Situational Awareness—The Decade Ahead

Oscar · Jun 10, 2024, 8:44 AM
6 points
2 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Why I don’t believe in the placebo effect

transhumanist_atom_understander · Jun 10, 2024, 2:37 AM
135 points
22 comments · 9 min read · LW link

Soviet comedy film recommendations

Nina Panickssery · Jun 9, 2024, 11:40 PM
42 points
11 comments · 2 min read · LW link
(open.substack.com)

The Data Wall is Important

JustisMills · Jun 9, 2024, 10:54 PM
40 points
20 comments · 2 min read · LW link
(justismills.substack.com)

Two Family Dance Flyers

jefftk · Jun 9, 2024, 8:50 PM
13 points
0 comments · 1 min read · LW link
(www.jefftk.com)

[Question] What happens to existing life sentences under LEV?

O O · Jun 9, 2024, 5:49 PM
5 points
7 comments · 1 min read · LW link

3b. Formal (Faux) Corrigibility

Max Harms · Jun 9, 2024, 5:18 PM
26 points
13 comments · 17 min read · LW link

3a. Towards Formal Corrigibility

Max Harms · Jun 9, 2024, 4:53 PM
24 points
2 comments · 19 min read · LW link

Introducing SARA: a new activation steering technique

Alejandro Tlaie · Jun 9, 2024, 3:33 PM
17 points
7 comments · 6 min read · LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams · Jun 9, 2024, 2:19 PM
9 points
1 comment · 4 min read · LW link

Exploring Llama-3-8B MLP Neurons

ntt123 · Jun 9, 2024, 2:19 PM
10 points
0 comments · 4 min read · LW link
(neuralblog.github.io)

Demystifying “Alignment” through a Comic

milanrosko · Jun 9, 2024, 8:24 AM
106 points
19 comments · 1 min read · LW link

Dumbing down

Martin Sustrik · Jun 9, 2024, 6:50 AM
72 points
1 comment · 4 min read · LW link

What if a tech company forced you to move to NYC?

KatjaGrace · Jun 9, 2024, 6:30 AM
56 points
22 comments · 1 min read · LW link
(worldspiritsockpuppet.com)

[Question] What should I do? (long term plan about starting an AI lab)

not_a_cat · Jun 9, 2024, 12:45 AM
2 points
1 comment · 2 min read · LW link

Searching for the Root of the Tree of Evil

Ivan Vendrov · Jun 8, 2024, 5:05 PM
36 points
14 comments · 5 min read · LW link
(nothinghuman.substack.com)

2. Corrigibility Intuition

Max Harms · Jun 8, 2024, 3:52 PM
67 points
10 comments · 33 min read · LW link

Two easy things that maybe Just Work to improve AI discourse

Bird Concept · Jun 8, 2024, 3:51 PM
191 points
35 comments · 2 min read · LW link

I made an AI safety fellowship. What I wish I knew.

Ruben Castaing · Jun 8, 2024, 3:23 PM
12 points
0 comments · 2 min read · LW link

Alignment Gaps

kcyras · Jun 8, 2024, 3:23 PM
11 points
4 comments · 8 min read · LW link

The Slack Double Crux, or how to negotiate with yourself

Thac0 · Jun 8, 2024, 3:22 PM
6 points
2 comments · 4 min read · LW link

The Perils of Popularity: A Critical Examination of LessWrong’s Rational Discourse

BubbaJoeLouis · Jun 8, 2024, 3:22 PM
−24 points
3 comments · 2 min read · LW link

Status quo bias is usually justified

Amadeus Pagel · Jun 8, 2024, 2:54 PM
10 points
3 comments · 1 min read · LW link
(amadeuspagel.substack.com)

Closed-Source Evaluations

Jono · Jun 8, 2024, 2:18 PM
15 points
4 comments · 1 min read · LW link

Access to powerful AI might make computer security radically easier

Buck · Jun 8, 2024, 6:00 AM
105 points
14 comments · 6 min read · LW link

[Question] Why don’t we just get rid of all the bioethicists?

Sable · Jun 8, 2024, 3:48 AM
13 points
0 comments · 1 min read · LW link

Sev, Sevteen, Sevty, Sevth

jefftk · Jun 8, 2024, 2:30 AM
17 points
9 comments · 1 min read · LW link
(www.jefftk.com)

1. The CAST Strategy

Max Harms · Jun 7, 2024, 10:29 PM
48 points
22 comments · 38 min read · LW link

0. CAST: Corrigibility as Singular Target

Max Harms · Jun 7, 2024, 10:29 PM
147 points
17 comments · 8 min read · LW link

What is space? What is time?

Tahp · Jun 7, 2024, 10:15 PM
8 points
3 comments · 7 min read · LW link

[Question] Question about Lewis’ counterfactual theory of causation

jbkjr · Jun 7, 2024, 8:15 PM
12 points
7 comments · 1 min read · LW link

Relationships among words, metalingual definition, and interpretability

Bill Benzon · Jun 7, 2024, 7:18 PM
2 points
0 comments · 5 min read · LW link

Let’s Talk About Emergence

jacobhaimes · Jun 7, 2024, 7:18 PM
4 points
0 comments · 7 min read · LW link
(www.odysseaninstitute.org)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues

aphyer · Jun 7, 2024, 7:02 PM
42 points
16 comments · 3 min read · LW link

Natural Latents Are Not Robust To Tiny Mixtures

Jun 7, 2024, 6:53 PM
61 points
8 comments · 5 min read · LW link

Situational Awareness Summarized—Part 2

Joe Rogero · Jun 7, 2024, 5:20 PM
12 points
2 comments · 4 min read · LW link