Reliable Sources: The Story of David Gerard

TracingWoodgrains · 10 Jul 2024 19:50 UTC
408 points
56 comments · 43 min read · LW link · 2 reviews

Universal Basic Income and Poverty

Eliezer Yudkowsky · 26 Jul 2024 7:23 UTC
354 points
150 comments · 9 min read · LW link · 1 review

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

Raemon · 3 Jul 2024 20:34 UTC
274 points
71 comments · 3 min read · LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

30 Jul 2024 16:22 UTC
242 points
53 comments · 12 min read · LW link · 2 reviews

Towards more cooperative AI safety strategies

Richard_Ngo · 16 Jul 2024 4:36 UTC
229 points
134 comments · 4 min read · LW link · 1 review

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon · 17 Jul 2024 22:14 UTC
228 points
47 comments · 7 min read · LW link · 1 review

Superbabies: Putting The Pieces Together

sarahconstantin · 11 Jul 2024 20:40 UTC
221 points
42 comments · 10 min read · LW link · 3 reviews
(sarahconstantin.substack.com)

This is already your second chance

Malmesbury · 28 Jul 2024 17:13 UTC
194 points
13 comments · 8 min read · LW link

Safety consultations for AI lab employees

Zach Stein-Perlman · 27 Jul 2024 15:00 UTC
183 points
6 comments · 1 min read · LW link

Decomposing Agency — capabilities without desires

11 Jul 2024 9:38 UTC
156 points
33 comments · 12 min read · LW link · 1 review
(strangecities.substack.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · 7 Jul 2024 17:39 UTC
144 points
17 comments · 25 min read · LW link · 1 review

On saying “Thank you” instead of “I’m Sorry”

Michael Cohn · 8 Jul 2024 3:13 UTC
138 points
16 comments · 3 min read · LW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm · 25 Jul 2024 15:58 UTC
133 points
38 comments · 2 min read · LW link
(deepmind.google)

Pantheon Interface

8 Jul 2024 19:03 UTC
129 points
22 comments · 6 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
125 points
18 comments · 18 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · 22 Jul 2024 18:45 UTC
118 points
20 comments · 12 min read · LW link

What are you getting paid in?

Austin Chen · 17 Jul 2024 19:23 UTC
116 points
16 comments · 4 min read · LW link · 1 review
(www.approachwithalacrity.com)

You should go to ML conferences

Jan_Kulveit · 24 Jul 2024 11:47 UTC
112 points
13 comments · 4 min read · LW link

I found >800 orthogonal “write code” steering vectors

15 Jul 2024 19:06 UTC
112 points
20 comments · 7 min read · LW link
(jacobgw.com)

Most smart and skilled people are outside of the EA/rationalist community: an analysis

titotal · 12 Jul 2024 12:13 UTC
111 points
39 comments · 14 min read · LW link
(open.substack.com)

Introduction to French AI Policy

Lucie Philippon · 4 Jul 2024 3:39 UTC
111 points
12 comments · 6 min read · LW link

OthelloGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
111 points
10 comments · 9 min read · LW link

Dialogue introduction to Singular Learning Theory

Olli Järviniemi · 8 Jul 2024 16:58 UTC
110 points
16 comments · 8 min read · LW link · 1 review

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
109 points
40 comments · 5 min read · LW link · 1 review

Poker is a bad game for teaching epistemics. Figgie is a better one.

rossry · 8 Jul 2024 6:05 UTC
106 points
47 comments · 11 min read · LW link
(blog.rossry.net)

Transformer Circuit Faithfulness Metrics Are Not Robust

12 Jul 2024 3:47 UTC
104 points
5 comments · 7 min read · LW link
(arxiv.org)

A simple model of math skill

Alex_Altair · 21 Jul 2024 18:57 UTC
104 points
16 comments · 8 min read · LW link

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication

26 Jul 2024 0:33 UTC
103 points
7 comments · 13 min read · LW link · 1 review

Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity

GideonF · 15 Jul 2024 18:37 UTC
100 points
17 comments · 21 min read · LW link
(forum.effectivealtruism.org)

Covert Malicious Finetuning

2 Jul 2024 2:41 UTC
98 points
4 comments · 3 min read · LW link

New page: Integrity

Zach Stein-Perlman · 10 Jul 2024 15:00 UTC
91 points
3 comments · 1 min read · LW link

Reflections on Less Online

Error · 7 Jul 2024 3:49 UTC
90 points
15 comments · 18 min read · LW link

AI #73: Openly Evil AI

Zvi · 18 Jul 2024 14:40 UTC
89 points
20 comments · 52 min read · LW link
(thezvi.wordpress.com)

Re: Anthropic’s suggested SB-1047 amendments

RobertM · 27 Jul 2024 22:32 UTC
87 points
13 comments · 9 min read · LW link
(www.documentcloud.org)

Fluent, Cruxy Predictions

Raemon · 10 Jul 2024 18:00 UTC
86 points
18 comments · 14 min read · LW link · 1 review

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
86 points
41 comments · 7 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · 6 Jul 2024 17:42 UTC
86 points
11 comments · 3 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

2 Jul 2024 13:17 UTC
86 points
7 comments · 12 min read · LW link

3C’s: A Recipe For Mathing Concepts

3 Jul 2024 1:06 UTC
81 points
5 comments · 7 min read · LW link

Consider the humble rock (or: why the dumb thing kills you)

pleiotroth · 4 Jul 2024 13:54 UTC
77 points
12 comments · 4 min read · LW link · 1 review

On the CrowdStrike Incident

Zvi · 22 Jul 2024 12:40 UTC
75 points
14 comments · 17 min read · LW link
(thezvi.wordpress.com)

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
75 points
12 comments · 9 min read · LW link

LK-99 in retrospect

bhauth · 7 Jul 2024 2:06 UTC
74 points
21 comments · 3 min read · LW link
(www.bhauth.com)

Multiplex Gene Editing: Where Are We Now?

sarahconstantin · 16 Jul 2024 20:50 UTC
73 points
6 comments · 7 min read · LW link
(sarahconstantin.substack.com)

Yoshua Bengio: Reasoning through arguments against taking AI safety seriously

Kvee · 11 Jul 2024 23:53 UTC
72 points
3 comments · 1 min read · LW link
(yoshuabengio.org)

D&D.Sci Scenario Index

23 Jul 2024 2:00 UTC
72 points
1 comment · 2 min read · LW link · 1 review

Friendship is transactional, unconditional friendship is insurance

Ruby · 17 Jul 2024 22:52 UTC
69 points
25 comments · 2 min read · LW link · 1 review

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments · 16 min read · LW link

Indecision and internalized authority figures

Kaj_Sotala · 6 Jul 2024 10:10 UTC
69 points
1 comment · 2 min read · LW link
(kajsotala.fi)

An AI Race With China Can Be Better Than Not Racing

niplav · 2 Jul 2024 17:57 UTC
68 points
36 comments · 11 min read · LW link