Corrigibility = Tool-ness?

Jun 28, 2024, 1:19 AM
78 points
8 comments · 9 min read

[New Feature] Your Subscribed Feed

Jun 11, 2024, 10:45 PM
77 points
13 comments · 4 min read

Claude 3.5 Sonnet

Zach Stein-Perlman · Jun 20, 2024, 6:00 PM
75 points
41 comments · 1 min read
(www.anthropic.com)

(Not) Derailing the LessOnline Puzzle Hunt

Error · Jun 4, 2024, 1:28 AM
74 points
2 comments · 4 min read

MIRI’s June 2024 Newsletter

Harlan · Jun 14, 2024, 11:02 PM
74 points
20 comments · 2 min read
(intelligence.org)

Mistakes people make when thinking about units

Isaac King · Jun 25, 2024, 3:39 AM
74 points
14 comments · 7 min read

Companies’ safety plans neglect risks from scheming AI

Zach Stein-Perlman · Jun 3, 2024, 3:00 PM
73 points
4 comments · 6 min read

Dumbing down

Martin Sustrik · Jun 9, 2024, 6:50 AM
72 points
1 comment · 4 min read

Shard Theory—is it true for humans?

Rishika · Jun 14, 2024, 7:21 PM
71 points
7 comments · 15 min read

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet) · Jun 6, 2024, 6:55 PM
70 points
2 comments · 6 min read
(llm-safety-challenges.github.io)

Former OpenAI Superalignment Researcher: Superintelligence by 2030

Julian Bradshaw · Jun 5, 2024, 3:35 AM
70 points
30 comments · 1 min read
(situational-awareness.ai)

Different senses in which two AIs can be “the same”

Jun 24, 2024, 3:16 AM
69 points
2 comments · 4 min read

2. Corrigibility Intuition

Max Harms · Jun 8, 2024, 3:52 PM
67 points
10 comments · 33 min read

SB 1047 Is Weakened

Zvi · Jun 6, 2024, 1:40 PM
67 points
4 comments · 9 min read
(thezvi.wordpress.com)

Interpreting and Steering Features in Images

Gytis Daujotas · Jun 20, 2024, 6:33 PM
66 points
6 comments · 5 min read

AI #69: Nice

Zvi · Jun 20, 2024, 12:40 PM
65 points
9 comments · 51 min read
(thezvi.wordpress.com)

How a chip is designed

YM · Jun 28, 2024, 8:04 AM
65 points
4 comments · 5 min read

AiPhone

Zvi · Jun 12, 2024, 10:20 PM
63 points
4 comments · 14 min read
(thezvi.wordpress.com)

“Metastrategic Brainstorming”, a core building-block skill

Raemon · Jun 11, 2024, 4:27 AM
63 points
5 comments · 6 min read

What is a Tool?

Jun 25, 2024, 11:40 PM
62 points
4 comments · 6 min read

Natural Latents Are Not Robust To Tiny Mixtures

Jun 7, 2024, 6:53 PM
61 points
8 comments · 5 min read

Is Claude a mystic?

jessicata · Jun 7, 2024, 4:27 AM
60 points
23 comments · 13 min read
(unstablerontology.substack.com)

microwave drilling is impractical

bhauth · Jun 12, 2024, 10:16 PM
59 points
19 comments · 4 min read
(www.bhauth.com)

Memorizing weak examples can elicit strong behavior out of password-locked models

Jun 6, 2024, 11:54 PM
58 points
5 comments · 7 min read

Datasets that change the odds you exist

dynomight · Jun 29, 2024, 6:45 PM
56 points
4 comments · 6 min read
(dynomight.net)

Degeneracies are sticky for SGD

Jun 16, 2024, 9:19 PM
56 points
1 comment · 16 min read

What if a tech company forced you to move to NYC?

KatjaGrace · Jun 9, 2024, 6:30 AM
56 points
22 comments · 1 min read
(worldspiritsockpuppet.com)

Calculating Natural Latents via Resampling

Jun 6, 2024, 12:37 AM
55 points
4 comments · 10 min read

4. Existing Writing on Corrigibility

Max Harms · Jun 10, 2024, 2:08 PM
55 points
15 comments · 106 min read

On “first critical tries” in AI alignment

Joe Carlsmith · Jun 5, 2024, 12:19 AM
54 points
8 comments · 14 min read

Fat Tails Discourage Compromise

niplav · Jun 17, 2024, 9:39 AM
53 points
5 comments · 1 min read

Book Review: Righteous Victims—A History of the Zionist-Arab Conflict

Yair Halberstadt · Jun 24, 2024, 11:02 AM
53 points
8 comments · 34 min read

Schelling points in the AGI policy space

mesaoptimizer · Jun 26, 2024, 1:19 PM
52 points
2 comments · 6 min read

Two LessWrong speed friending experiments

Jun 15, 2024, 10:52 AM
52 points
3 comments · 4 min read

So you want to work on technical AI safety

gw · Jun 24, 2024, 2:29 PM
51 points
3 comments · 14 min read

Bed Time Quests & Dinner Games for 3-5 year olds

Jun 22, 2024, 7:53 AM
51 points
0 comments · 1 min read
(kidquest.substack.com)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset

aphyer · Jun 17, 2024, 9:29 PM
51 points
11 comments · 6 min read

how birds sense magnetic fields

bhauth · Jun 27, 2024, 6:59 PM
51 points
4 comments · 5 min read
(www.bhauth.com)

Philosophers wrestling with evil, as a social media feed

David Gross · Jun 3, 2024, 10:25 PM
51 points
2 comments · 16 min read

An issue with training schemers with supervised fine-tuning

Fabien Roger · Jun 27, 2024, 3:37 PM
49 points
12 comments · 6 min read

AI #67: Brief Strange Trip

Zvi · Jun 6, 2024, 6:50 PM
49 points
6 comments · 40 min read
(thezvi.wordpress.com)

in defense of Linus Pauling

bhauth · Jun 3, 2024, 9:27 PM
49 points
8 comments · 2 min read
(www.bhauth.com)

Contra Acemoglu on AI

Maxwell Tabarrok · Jun 28, 2024, 1:13 PM
48 points
0 comments · 5 min read
(www.maximum-progress.com)

[Valence series] 4. Valence & Liking / Admiring

Steven Byrnes · Jun 10, 2024, 2:19 PM
48 points
12 comments · 15 min read

What distinguishes “early”, “mid” and “end” games?

Raemon · Jun 21, 2024, 5:41 PM
48 points
22 comments · 1 min read

1. The CAST Strategy

Max Harms · Jun 7, 2024, 10:29 PM
48 points
22 comments · 38 min read

On OpenAI’s Model Spec

Zvi · 21 Jun 2024 13:00 UTC
47 points
4 comments · 30 min read
(thezvi.wordpress.com)

Enriched tab is now the default LW Frontpage experience for logged-in users

21 Jun 2024 0:09 UTC
46 points
27 comments · 3 min read

AI #68: Remarkably Reasonable Reactions

Zvi · 13 Jun 2024 16:30 UTC
46 points
11 comments · 50 min read
(thezvi.wordpress.com)

Higher-effort summer solstice: What if we used AI (i.e., Angel Island)?

Rachel Shu · 25 Jun 2024 1:35 UTC
46 points
9 comments · 3 min read