Tell me about yourself: LLMs are aware of their learned behaviors

Jan 22, 2025, 12:47 AM
130 points
5 comments, 6 min read, LW link

Building AI Research Fleets

Jan 12, 2025, 6:23 PM
130 points
11 comments, 5 min read, LW link

Some articles in “International Security” that I enjoyed

Buck, Jan 31, 2025, 4:23 PM
130 points
10 comments, 4 min read, LW link

The Paris AI Anti-Safety Summit

Zvi, Feb 12, 2025, 2:00 PM
129 points
21 comments, 21 min read, LW link
(thezvi.wordpress.com)

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit, Feb 2, 2025, 2:47 PM
129 points
36 comments, 6 min read, LW link

The Pando Problem: Rethinking AI Individuality

Jan_Kulveit, Mar 28, 2025, 9:03 PM
128 points
14 comments, 13 min read, LW link

AI-enabled coups: a small group could use AI to seize power

Apr 16, 2025, 4:51 PM
128 points
18 comments, 7 min read, LW link

Parkinson’s Law and the Ideology of Statistics

Benquo, Jan 4, 2025, 3:49 PM
127 points
7 comments, 8 min read, LW link
(benjaminrosshoffman.com)

The Intelligence Curse

lukedrago, Jan 3, 2025, 7:07 PM
126 points
27 comments, 18 min read, LW link
(lukedrago.substack.com)

Do models say what they learn?

Mar 22, 2025, 3:19 PM
126 points
12 comments, 13 min read, LW link

Meditations on Doge

Martin Sustrik, May 25, 2025, 12:00 PM
125 points
42 comments, 9 min read, LW link
(250bpm.substack.com)

Anthropic, and taking “technical philosophy” more seriously

Raemon, Mar 13, 2025, 1:48 AM
125 points
29 comments, 11 min read, LW link

Social Anxiety Isn’t About Being Liked

Chipmonk, May 16, 2025, 10:26 PM
124 points
21 comments, 2 min read, LW link
(chrislakin.blog)

[Question] when will LLMs become human-level bloggers?

nostalgebraist, Mar 9, 2025, 9:10 PM
124 points
34 comments, 6 min read, LW link

AI 2027 is a Bet Against Amdahl’s Law

snewman, Apr 21, 2025, 3:09 AM
124 points
56 comments, 9 min read, LW link

Five Hinge-Questions That Decide Whether AGI Is Five Years Away or Twenty

charlieoneill, May 6, 2025, 2:48 AM
124 points
17 comments, 5 min read, LW link

How I’ve run major projects

benkuhn, Mar 16, 2025, 6:40 PM
123 points
10 comments, 8 min read, LW link
(www.benkuhn.net)

Obstacles in ARC’s agenda: Finding explanations

David Matolcsi, Apr 30, 2025, 11:03 PM
122 points
10 comments, 17 min read, LW link

Ctrl-Z: Controlling AI Agents via Resampling

Apr 16, 2025, 4:21 PM
122 points
0 comments, 20 min read, LW link

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger, Mar 11, 2025, 11:52 AM
121 points
23 comments, 11 min read, LW link
(alignment.anthropic.com)

Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red

Julian Bradshaw, Apr 21, 2025, 3:52 AM
121 points
20 comments, 14 min read, LW link

It’s hard to make scheming evals look realistic for LLMs

May 24, 2025, 7:17 PM
120 points
20 comments, 5 min read, LW link

2024 in AI predictions

jessicata, Jan 1, 2025, 8:29 PM
117 points
3 comments, 8 min read, LW link

Research directions Open Phil wants to fund in technical AI safety

Feb 8, 2025, 1:40 AM
117 points
21 comments, 58 min read, LW link
(www.openphilanthropy.org)

Three Months In, Evaluating Three Rationalist Cases for Trump

Arjun Panickssery, Apr 18, 2025, 8:27 AM
115 points
32 comments, 4 min read, LW link

“The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes, Apr 24, 2025, 1:57 PM
114 points
48 comments, 23 min read, LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Mar 26, 2025, 7:07 PM
113 points
15 comments, 29 min read, LW link
(deepmindsafetyresearch.medium.com)

The Game Board has been Flipped: Now is a good time to rethink what you’re doing

LintzA, Jan 28, 2025, 11:36 PM
112 points
30 comments, 13 min read, LW link

Downstream applications as validation of interpretability progress

Sam Marks, Mar 31, 2025, 1:35 AM
112 points
3 comments, 7 min read, LW link

The News is Never Neglected

lsusr, Feb 11, 2025, 2:59 PM
112 points
18 comments, 1 min read, LW link

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

Feb 6, 2025, 6:58 PM
111 points
0 comments, 1 min read, LW link
(www.openphilanthropy.org)

We should try to automate AI safety work asap

Marius Hobbhahn, Apr 26, 2025, 4:35 PM
111 points
10 comments, 15 min read, LW link

Please Donate to CAIP (Post 1 of 6 on AI Governance)

Mass_Driver, May 7, 2025, 5:13 PM
111 points
20 comments, 33 min read, LW link

You can just wear a suit

lsusr, Feb 26, 2025, 2:57 PM
111 points
48 comments, 2 min read, LW link

One Year in DC

tlevin, May 19, 2025, 7:46 PM
110 points
5 comments, LW link
(www.greentape.pub)

Among Us: A Sandbox for Agentic Deception

Apr 5, 2025, 6:24 AM
110 points
7 comments, 7 min read, LW link

New Cause Area Proposal

CallumMcDougall, Apr 1, 2025, 7:12 AM
109 points
4 comments, 1 min read, LW link

UK AISI’s Alignment Team: Research Agenda

May 7, 2025, 4:33 PM
109 points
2 comments, 11 min read, LW link

Thread for Sense-Making on Recent Murders and How to Sanely Respond

Ben Pace, Jan 31, 2025, 3:45 AM
109 points
146 comments, 2 min read, LW link

2024 Unofficial LessWrong Survey Results

Screwtape, Mar 14, 2025, 10:29 PM
109 points
28 comments, 48 min read, LW link

Aristocracy and Hostage Capital

Arjun Panickssery, Jan 8, 2025, 7:38 PM
108 points
7 comments, 3 min read, LW link
(arjunpanickssery.substack.com)

What OpenAI Told California’s Attorney General

garrison, May 17, 2025, 11:14 PM
108 points
3 comments, LW link
(www.obsolete.pub)

Fake thinking and real thinking

Joe Carlsmith, Jan 28, 2025, 8:05 PM
108 points
13 comments, 38 min read, LW link

Two hemispheres—I do not think it means what you think it means

Viliam, Feb 9, 2025, 3:33 PM
108 points
21 comments, 14 min read, LW link

Notes on the Long Tasks METR paper, from a HCAST task contributor

abstractapplic, May 4, 2025, 11:17 PM
108 points
7 comments, 2 min read, LW link

The Lizardman and the Black Hat Bobcat

Screwtape, Apr 6, 2025, 7:02 PM
107 points
15 comments, 9 min read, LW link

How training-gamers might function (and win)

Vivek Hebbar, Apr 11, 2025, 9:26 PM
107 points
5 comments, 13 min read, LW link

Attribution-based parameter decomposition

Jan 25, 2025, 1:12 PM
107 points
21 comments, 4 min read, LW link
(publications.apolloresearch.ai)

My supervillain origin story

Dmitry Vaintrob, Jan 27, 2025, 12:20 PM
106 points
1 comment, 5 min read, LW link

How do you deal w/ Super Stimuli?

Logan Riggs, Jan 14, 2025, 3:14 PM
106 points
25 comments, 3 min read, LW link