Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
286 points
21 comments · 2 min read · LW link
(transformer-circuits.pub)

Book Review: Going Infinite

Zvi · 24 Oct 2023 15:00 UTC
240 points
109 comments · 97 min read · LW link
(thezvi.wordpress.com)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis · 21 Oct 2023 15:22 UTC
238 points
50 comments · 13 min read · LW link

Announcing MIRI’s new CEO and leadership team

Gretta Duleba · 10 Oct 2023 19:22 UTC
220 points
52 comments · 3 min read · LW link

Thoughts on responsible scaling policies and regulation

paulfchristiano · 24 Oct 2023 22:21 UTC
214 points
33 comments · 6 min read · LW link

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky · 27 Oct 2023 15:19 UTC
199 points
33 comments · 8 min read · LW link

Labs should be explicit about why they are building AGI

peterbarnett · 17 Oct 2023 21:09 UTC
187 points
16 comments · 1 min read · LW link

Announcing Timaeus

22 Oct 2023 11:59 UTC
186 points
15 comments · 4 min read · LW link

AI as a science, and three obstacles to alignment strategies

So8res · 25 Oct 2023 21:00 UTC
175 points
79 comments · 11 min read · LW link

Architects of Our Own Demise: We Should Stop Developing AI

Roko · 26 Oct 2023 0:36 UTC
174 points
74 comments · 3 min read · LW link

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

Tristan Williams · 30 Oct 2023 11:15 UTC
170 points
39 comments · 1 min read · LW link
(www.whitehouse.gov)

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
169 points
52 comments · 1 min read · LW link

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
70 comments · 7 min read · LW link

Evaluating the historical value misspecification argument

Matthew Barnett · 5 Oct 2023 18:34 UTC
162 points
140 comments · 7 min read · LW link

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy

20 Oct 2023 21:04 UTC
157 points
30 comments · 27 min read · LW link

Announcing Dialogues

Ben Pace · 7 Oct 2023 2:57 UTC
154 points
51 comments · 4 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
148 points
29 comments · 14 min read · LW link

Will no one rid me of this turbulent pest?

Metacelsus · 14 Oct 2023 15:27 UTC
148 points
23 comments · 10 min read · LW link
(denovo.substack.com)

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

sudo · 29 Oct 2023 23:09 UTC
141 points
22 comments · 10 min read · LW link
(nitter.net)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
136 points
8 comments · 4 min read · LW link

At 87, Pearl is still able to change his mind

rotatingpaguro · 18 Oct 2023 4:46 UTC
136 points
15 comments · 5 min read · LW link

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn

Zvi · 5 Oct 2023 11:39 UTC
129 points
29 comments · 9 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
129 points
11 comments · 19 min read · LW link

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
127 points
8 comments · 4 min read · LW link

The 99% principle for personal problems

Kaj_Sotala · 2 Oct 2023 8:20 UTC
125 points
20 comments · 2 min read · LW link
(kajsotala.fi)

Goodhart’s Law in Reinforcement Learning

16 Oct 2023 0:54 UTC
125 points
22 comments · 7 min read · LW link

Stampy’s AI Safety Info soft launch

5 Oct 2023 22:13 UTC
120 points
9 comments · 2 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
117 points
14 comments · 22 min read · LW link

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

307th · 20 Oct 2023 16:37 UTC
115 points
32 comments · 9 min read · LW link

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c · 25 Oct 2023 23:46 UTC
114 points
33 comments · 22 min read · LW link
(www.navigatingrisks.ai)

A new intro to Quantum Physics, with the math fixed

titotal · 29 Oct 2023 15:11 UTC
112 points
22 comments · 17 min read · LW link
(titotal.substack.com)

The Witching Hour

Richard_Ngo · 10 Oct 2023 0:19 UTC
110 points
0 comments · 10 min read · LW link
(www.narrativeark.xyz)

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
106 points
6 comments · 5 min read · LW link

Charbel-Raphaël and Lucius discuss Interpretability

30 Oct 2023 5:50 UTC
104 points
7 comments · 21 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments · 8 min read · LW link

TOMORROW: the largest AI Safety protest ever!

Holly_Elmore · 20 Oct 2023 18:15 UTC
101 points
25 comments · 2 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
31 comments · 4 min read · LW link

Truthseeking when your disagreements lie in moral philosophy

10 Oct 2023 0:00 UTC
98 points
4 comments · 4 min read · LW link
(acesounderglass.com)

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · 4 Oct 2023 17:10 UTC
97 points
8 comments · 3 min read · LW link

[Question] Lying to chess players for alignment

Zane · 25 Oct 2023 17:47 UTC
96 points
54 comments · 1 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
95 points
47 comments · 13 min read · LW link

Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth · 26 Oct 2023 19:49 UTC
93 points
44 comments · 6 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
11 comments · 12 min read · LW link

Linkpost: They Studied Dishonesty. Was Their Work a Lie?

Linch · 2 Oct 2023 8:10 UTC
91 points
12 comments · 2 min read · LW link
(www.newyorker.com)

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
91 points
12 comments · 9 min read · LW link

Linkpost: A Post Mortem on the Gino Case

Linch · 24 Oct 2023 6:50 UTC
89 points
7 comments · 2 min read · LW link
(www.theorgplumber.com)

Techno-humanism is techno-optimism for the 21st century

Richard_Ngo · 27 Oct 2023 18:37 UTC
88 points
5 comments · 14 min read · LW link
(www.mindthefuture.info)

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
87 points
5 comments · 20 min read · LW link