25 Jul 2025 18:39 UTC

432 points

166 comments 38 min read LW link

Generalized Hangriness: A Standard Rationalist Stance Toward Emotions

johnswentworth 10 Jul 2025 18:22 UTC

374 points

71 comments 7 min read LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud, mle and Owain_Evans

22 Jul 2025 16:37 UTC

348 points

40 comments 4 min read LW link

So You Think You’ve Awoken ChatGPT

JustisMills 11 Jul 2025 1:01 UTC

329 points

88 comments 9 min read LW link

Make More Grayspaces

Duncan Sabien (Inactive) 19 Jul 2025 22:22 UTC

315 points

65 comments 13 min read LW link

the jackpot age

thiccythot 11 Jul 2025 21:05 UTC

305 points

19 comments 4 min read LW link

Love stays loved (formerly “Skin”)

Swimmer963 (Miranda Dixon-Luinenburg) 18 Jul 2025 19:17 UTC

282 points

13 comments 29 min read LW link

Shallow Water is Dangerous Too

jefftk 20 Jul 2025 2:30 UTC

236 points

24 comments 2 min read LW link

(www.jefftk.com)

Surprises and learnings from almost two months of Leo Panickssery

Nina Panickssery 12 Jul 2025 23:33 UTC

220 points

13 comments 6 min read LW link

(ninapanickssery.substack.com)

About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong

bohaska 29 Jul 2025 11:59 UTC

212 points

11 comments 4 min read LW link

(www.futurehouse.org)

Lessons from the Iraq War for AI policy

Buck 10 Jul 2025 18:52 UTC

202 points

25 comments 4 min read LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

202 points

23 comments 6 min read LW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Adam Karvonen and Sam Marks

2 Jul 2025 16:35 UTC

191 points

26 comments 4 min read LW link

Maya’s Escape

Bridgett Kay 27 Jul 2025 16:47 UTC

182 points

9 comments 11 min read LW link

(dxmrevealed.wordpress.com)

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Vlad Mikulik and Rohin Shah

15 Jul 2025 16:23 UTC

169 points

32 comments 1 min read LW link

(bit.ly)

An Opinionated Guide to Using Anki Correctly

Luise Woehlke 8 Jul 2025 20:01 UTC

166 points

60 comments 27 min read LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

160 points

14 comments 5 min read LW link

(arxiv.org)

“Buckle up bucko, and get ready for multiple hard cognitive steps.”

Raemon 5 Jul 2025 1:47 UTC

154 points

26 comments 4 min read LW link

On “ChatGPT Psychosis” and LLM Sycophancy

jdp 23 Jul 2025 1:11 UTC

144 points

28 comments 18 min read LW link

(minihf.com)

Shutdown Resistance in Reasoning Models

benwr, JeremySchlatter and Jeffrey Ladish

6 Jul 2025 0:01 UTC

140 points

15 comments 9 min read LW link

(palisaderesearch.org)

Narrow Misalignment is Hard, Emergent Misalignment is Easy

Edward Turner, Anna Soligo, Senthooran Rajamanoharan and Neel Nanda

14 Jul 2025 21:05 UTC

140 points

24 comments 5 min read LW link

Do confident short timelines make sense?

TsviBT and abramdemski

15 Jul 2025 3:37 UTC

140 points

78 comments 69 min read LW link

“What’s my goal?”

Raemon 2 Jul 2025 2:58 UTC

132 points

9 comments 2 min read LW link

Authors Have a Responsibility to Communicate Clearly

TurnTrout 1 Jul 2025 15:41 UTC

127 points

29 comments 6 min read LW link

(turntrout.com)

what makes Claude 3 Opus misaligned

janus 10 Jul 2025 20:06 UTC

126 points

14 comments 5 min read LW link

Vitalik’s Response to AI 2027

Daniel Kokotajlo 11 Jul 2025 21:43 UTC

124 points

53 comments 12 min read LW link

(vitalik.eth.limo)

On the functional self of LLMs

eggsyntax 7 Jul 2025 15:39 UTC

124 points

38 comments 8 min read LW link

The Purpose of a System is what it Rewards

Rob Ennals 26 Jul 2025 22:08 UTC

121 points

16 comments 2 min read LW link

(messyprogress.substack.com)

Simplex Progress Report—July 2025

Adam Shai, Paul Riechers, hrbigelow, Eric Alt and mntss

28 Jul 2025 21:58 UTC

116 points

3 comments 15 min read LW link

If Anyone Builds It, Everyone Dies: Call for Translators (for Supplementary Materials)

yams 21 Jul 2025 22:37 UTC

112 points

12 comments 1 min read LW link

Curing PMDD with Hair Loss Pills

David Lorell 2 Jul 2025 21:35 UTC

107 points

4 comments 8 min read LW link

LLMs Can’t See Pixels or Characters

Brendan Long 20 Jul 2025 20:00 UTC

100 points

44 comments 4 min read LW link

(www.brendanlong.com)

Recent Redwood Research project proposals

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt and Joey Yudelson

14 Jul 2025 22:27 UTC

99 points

0 comments 3 min read LW link

Video and transcript of talk on “Can goodness compete?”

Joe Carlsmith 17 Jul 2025 17:54 UTC

98 points

19 comments 34 min read LW link

(joecarlsmith.substack.com)

‘AI for societal uplift’ as a path to victory

Raymond Douglas 4 Jul 2025 15:32 UTC

97 points

22 comments 2 min read LW link

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

habryka 11 Jul 2025 0:23 UTC

97 points

43 comments 6 min read LW link

(metr.org)

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

Senthooran Rajamanoharan and Neel Nanda

14 Jul 2025 14:52 UTC

92 points

19 comments 11 min read LW link

No, Grok, No

Zvi 9 Jul 2025 15:10 UTC

92 points

3 comments 17 min read LW link

(thezvi.wordpress.com)

METR: How Does Time Horizon Vary Across Domains?

Thomas Kwa and Vincent Cheng

14 Jul 2025 16:13 UTC

88 points

8 comments 14 min read LW link

(metr.org)

Directly Try Solving Alignment for 5 weeks

Kabir Kumar 21 Jul 2025 21:51 UTC

86 points

4 comments 6 min read LW link

(beta.ai-plans.com)

If Anyone Builds It, Everyone Dies: Advertisement design competition

yams 2 Jul 2025 23:14 UTC

86 points

37 comments 1 min read LW link

(intelligence.org)

China proposes new global AI cooperation organisation

Matrice Jacobine 30 Jul 2025 2:50 UTC

85 points

8 comments 1 min read LW link

(www.reuters.com)

You can get LLMs to say almost anything you want

Kaj_Sotala 13 Jul 2025 16:30 UTC

84 points

10 comments 14 min read LW link

xAI’s Grok 4 has no meaningful safety guardrails

eleventhsavi0r 13 Jul 2025 18:22 UTC

84 points

15 comments 6 min read LW link

Selective Generalization: Improving Capabilities While Maintaining Alignment

ariana_azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor and cloud

16 Jul 2025 21:25 UTC

82 points

6 comments 7 min read LW link

Subway Particle Levels Aren’t That High

jefftk 9 Jul 2025 2:30 UTC

82 points

5 comments 1 min read LW link

(www.jefftk.com)

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

81 points

10 comments 18 min read LW link

against that one rationalist mashal about japanese fifth-columnists

Fraser 13 Jul 2025 1:42 UTC

81 points

6 comments 3 min read LW link

(frvser.com)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan and Neel Nanda

23 Jul 2025 14:57 UTC

79 points

8 comments 5 min read LW link

SLT for AI Safety

Jesse Hoogland 1 Jul 2025 4:52 UTC

78 points

0 comments 3 min read LW link