Sequent: scale and automation for higher confidence in alignment

10 Jun 2026 15:37 UTC
276 points
2 comments
11 min read
LW link
(sequent.org)

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

10 Jun 2026 17:58 UTC
237 points
20 comments
4 min read
LW link

PSA: Almost nobody is directly working on superintelligent alignment

12 Jun 2026 5:17 UTC
230 points
41 comments
1 min read
LW link

Sympathy for both sides of the egregious misalignment debate

Steven Byrnes
12 Jun 2026 16:26 UTC
197 points
26 comments
4 min read
LW link

My favorite depiction of utopia

Caleb Biddulph
2 Jun 2026 23:15 UTC
189 points
20 comments
33 min read
LW link
(docs.google.com)

The Machines Lack Honour

Raymond Douglas
9 Jun 2026 15:30 UTC
169 points
21 comments
12 min read
LW link

Announcing the ARC White-Box Estimation Challenge

2 Jun 2026 16:20 UTC
165 points
15 comments
3 min read
LW link
(www.alignment.org)

Even “illegible” Mythos reasoning traces seem pretty legible

faul_sname
10 Jun 2026 8:49 UTC
157 points
23 comments
2 min read
LW link

A frontier AI company should shut down

MichaelDickens
15 Jun 2026 16:56 UTC
135 points
37 comments
2 min read
LW link

Dissolving the Deep Learning Sample Efficiency Gap

Samuel Knoche
1 Jun 2026 18:44 UTC
124 points
24 comments
17 min read
LW link
(theraptureofthenerds.substack.com)

Machinic Psychopharmacology: Do LLMs Self-Medicate?

10 Jun 2026 14:15 UTC
124 points
11 comments
23 min read
LW link

Can activation verbalizers surface an internal chain of thought?

7 Jun 2026 4:24 UTC
122 points
0 comments
16 min read
LW link

Parkinson’s Heuristic: The Only Time To Do Anything

Ben Pace
12 Jun 2026 6:55 UTC
117 points
8 comments
5 min read
LW link

Why Software Automation Is Hard

silentbob
6 Jun 2026 8:56 UTC
114 points
20 comments
12 min read
LW link

American Government Takes Down Claude Fable

Zvi
13 Jun 2026 19:40 UTC
111 points
13 comments
20 min read
LW link
(thezvi.wordpress.com)

Agent Foundations Reminds Me of Continental Philosophy

IanWS
2 Jun 2026 14:34 UTC
106 points
15 comments
5 min read
LW link
(write.ianwsperber.com)

Learnings from starting an AI safety research team

5 Jun 2026 16:27 UTC
97 points
7 comments
6 min read
LW link

Bun’s Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment

Sayhan Yalvaçer
8 Jun 2026 7:06 UTC
96 points
8 comments
3 min read
LW link

One Year of PauseAI UK

5 Jun 2026 16:41 UTC
94 points
7 comments
11 min read
LW link
(pauseai.uk)

Gears for political races

Tom Smith
17 Jun 2026 20:19 UTC
93 points
5 comments
14 min read
LW link

“Contagious Humming” to Silence a Room

JohnofCharleston
1 Jun 2026 19:08 UTC
90 points
20 comments
2 min read
LW link

The Hidden Structures of Problems

spencerg
14 Jun 2026 13:51 UTC
90 points
9 comments
3 min read
LW link
(www.spencergreenberg.com)

Guardian Angels: LLM Personalization for Productivity and Security

gwern
17 Jun 2026 3:21 UTC
83 points
7 comments
2 min read
LW link
(gwern.net)

Does preservation make sense before we know how to revive?

Aurelia
15 Jun 2026 23:40 UTC
82 points
2 comments
25 min read
LW link

Anthropic did not call for a pause on AI

10 Jun 2026 20:02 UTC
80 points
5 comments
5 min read
LW link
(controlai.news)

Models May Behave Worse When Eval Aware

11 Jun 2026 9:28 UTC
80 points
7 comments
13 min read
LW link

Why Even Experts Don’t Know What to Do About AI Risk

2 Jun 2026 17:31 UTC
78 points
22 comments
2 min read
LW link

Towards a Formal Scientific Epistemology

Richard_Ngo
9 Jun 2026 20:31 UTC
75 points
9 comments
7 min read
LW link
(www.mindthefuture.info)

The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

Alex Amadori
10 Jun 2026 9:44 UTC
73 points
26 comments
16 min read
LW link
(alexamadori.substack.com)

China won’t win the AI race but would it be much worse if it did?

Chastity Ruth
3 Jun 2026 5:46 UTC
71 points
18 comments
13 min read
LW link

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

gwern
17 Jun 2026 2:53 UTC
71 points
12 comments
1 min read
LW link
(gwern.net)

The Once And Future Fable #2

Zvi
15 Jun 2026 16:00 UTC
71 points
8 comments
23 min read
LW link
(thezvi.wordpress.com)

SFT Drives Gemini’s Safety Properties

13 Jun 2026 15:31 UTC
69 points
3 comments
1 min read
LW link

The Financial Ledger Theory of Apologies

Ben Pace
17 Jun 2026 6:57 UTC
68 points
6 comments
4 min read
LW link

Some humans are both male and female, and can (but shouldn’t) have children with themselves

HedonicEscalator
1 Jun 2026 1:51 UTC
68 points
14 comments
6 min read
LW link
(hedonicescalator.substack.com)

US government directive to suspend access to Fable 5 and Mythos 5

Capybasilisk
13 Jun 2026 1:16 UTC
67 points
15 comments
1 min read
LW link
(www.anthropic.com)

Against Corrigibility

peralice
6 Jun 2026 20:28 UTC
66 points
17 comments
12 min read
LW link

You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

RobinHa
10 Jun 2026 15:21 UTC
65 points
5 comments
9 min read
LW link
(robinhaselhorst.com)

A Mike’s-Eye View of ARC’s Research

Mikewins
9 Jun 2026 18:30 UTC
64 points
1 comment
11 min read
LW link
(www.alignment.org)

Building Better Activation Oracles

4 Jun 2026 18:34 UTC
62 points
1 comment
7 min read
LW link

What if Anthropic unilaterally paused capabilities development right now?

Karl von Wendt
6 Jun 2026 7:39 UTC
61 points
15 comments
3 min read
LW link

Building and evaluating model diffing agents

12 Jun 2026 17:14 UTC
61 points
2 comments
12 min read
LW link

Beyond the lexical personality traits: What is the structure of personality?

tailcalled
5 Jun 2026 19:05 UTC
60 points
1 comment
5 min read
LW link

How might continual learning affect safety and alignment?

13 Jun 2026 17:34 UTC
59 points
2 comments
16 min read
LW link

Coming Around To Political Donations

jefftk
6 Jun 2026 21:30 UTC
59 points
8 comments
2 min read
LW link
(www.jefftk.com)

How to build a cancer vaccine, and whether they will work this time

Abhishaike Mahajan
8 Jun 2026 20:45 UTC
58 points
9 comments
25 min read
LW link
(www.owlposting.com)

Synthetic document finetuning for instilling positive traits

16 Jun 2026 0:04 UTC
57 points
1 comment
10 min read
LW link

(Mis)generalization of Helpful-Only Fine-tuning

4 Jun 2026 18:40 UTC
55 points
7 comments
11 min read
LW link

Opus 4.8 Part 2: Model Welfare

Zvi
1 Jun 2026 15:11 UTC
55 points
1 comment
25 min read
LW link
(thezvi.wordpress.com)

Several frontier models are substantially prefill aware

17 Jun 2026 17:41 UTC
54 points
1 comment
5 min read
LW link