How to Make Superbabies

19 Feb 2025 20:39 UTC
637 points
360 comments · 31 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · 7 Feb 2025 17:10 UTC
426 points
142 comments · 29 min read · LW link
(x.com)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

25 Feb 2025 17:39 UTC
334 points
92 comments · 4 min read · LW link

Murder plots are infohazards

Chris Monteiro · 13 Feb 2025 19:15 UTC
304 points
46 comments · 2 min read · LW link

So You Want To Make Marginal Progress...

johnswentworth · 7 Feb 2025 23:22 UTC
304 points
42 comments · 4 min read · LW link

Arbital has been imported to LessWrong

20 Feb 2025 0:47 UTC
284 points
30 comments · 5 min read · LW link

A History of the Future, 2025-2040

L Rudolf L · 17 Feb 2025 12:03 UTC
249 points
42 comments · 75 min read · LW link
(nosetgauge.substack.com)

Power Lies Trembling: a three-book review

Richard_Ngo · 22 Feb 2025 22:57 UTC
214 points
29 comments · 15 min read · LW link
(www.mindthefuture.info)

Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion?

garrison · 11 Feb 2025 0:20 UTC
208 points
8 comments · 6 min read · LW link
(garrisonlovely.substack.com)

Eliezer’s Lost Alignment Articles / The Arbital Sequence

20 Feb 2025 0:48 UTC
208 points
10 comments · 5 min read · LW link

[Question] Have LLMs Generated Novel Insights?

23 Feb 2025 18:22 UTC
169 points
41 comments · 2 min read · LW link

The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

Thane Ruthenis · 21 Feb 2025 20:15 UTC
157 points
53 comments · 6 min read · LW link

Levels of Friction

Zvi · 10 Feb 2025 13:10 UTC
155 points
8 comments · 12 min read · LW link
(thezvi.wordpress.com)

It’s been ten years. I propose HPMOR Anniversary Parties.

Screwtape · 16 Feb 2025 1:43 UTC
154 points
3 comments · 1 min read · LW link

A computational no-coincidence principle

Eric Neyman · 14 Feb 2025 21:39 UTC
149 points
40 comments · 6 min read · LW link
(www.alignment.org)

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit · 2 Feb 2025 14:47 UTC
146 points
36 comments · 6 min read · LW link

You can just wear a suit

lsusr · 26 Feb 2025 14:57 UTC
139 points
59 comments · 2 min read · LW link

The Paris AI Anti-Safety Summit

Zvi · 12 Feb 2025 14:00 UTC
129 points
21 comments · 21 min read · LW link
(thezvi.wordpress.com)

Research directions Open Phil wants to fund in technical AI safety

8 Feb 2025 1:40 UTC
117 points
21 comments · 58 min read · LW link
(www.openphilanthropy.org)

The News is Never Neglected

lsusr · 11 Feb 2025 14:59 UTC
113 points
18 comments · 1 min read · LW link

Two hemispheres—I do not think it means what you think it means

Viliam · 9 Feb 2025 15:33 UTC
112 points
21 comments · 14 min read · LW link

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

6 Feb 2025 18:58 UTC
111 points
0 comments · 1 min read · LW link
(www.openphilanthropy.org)

My model of what is going on with LLMs

Cole Wyeth · 13 Feb 2025 3:43 UTC
110 points
49 comments · 7 min read · LW link

Judgements: Merging Prediction & Evidence

abramdemski · 23 Feb 2025 19:35 UTC
107 points
7 comments · 6 min read · LW link

A short course on AGI safety from the GDM Alignment team

14 Feb 2025 15:43 UTC
105 points
2 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments · 2 min read · LW link
(arxiv.org)

AGI Safety & Alignment @ Google DeepMind is hiring

Rohin Shah · 17 Feb 2025 21:11 UTC
103 points
19 comments · 10 min read · LW link

C’mon guys, Deliberate Practice is Real

Raemon · 5 Feb 2025 22:33 UTC
100 points
25 comments · 9 min read · LW link

Timaeus in 2024

20 Feb 2025 23:54 UTC
99 points
1 comment · 8 min read · LW link

Reviewing LessWrong: Screwtape’s Basic Answer

Screwtape · 5 Feb 2025 4:30 UTC
97 points
18 comments · 6 min read · LW link

Microplastics: Much Less Than You Wanted To Know

15 Feb 2025 19:08 UTC
94 points
10 comments · 13 min read · LW link

Dear AGI,

Nathan Young · 18 Feb 2025 10:48 UTC
89 points
11 comments · 3 min read · LW link

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

LawrenceC · 24 Feb 2025 19:32 UTC
88 points
8 comments · 4 min read · LW link
(www.anthropic.com)

Wired on: “DOGE personnel with admin access to Federal Payment System”

Raemon · 5 Feb 2025 21:32 UTC
88 points
45 comments · 2 min read · LW link
(web.archive.org)

Voting Results for the 2023 Review

Raemon · 6 Feb 2025 8:00 UTC
87 points
3 comments · 69 min read · LW link

The Risk of Gradual Disempowerment from AI

Zvi · 5 Feb 2025 22:10 UTC
87 points
20 comments · 20 min read · LW link
(thezvi.wordpress.com)

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik · 26 Feb 2025 12:50 UTC
85 points
8 comments · 7 min read · LW link

How might we safely pass the buck to AI?

joshc · 19 Feb 2025 17:48 UTC
84 points
58 comments · 31 min read · LW link

Ambiguous out-of-distribution generalization on an algorithmic task

13 Feb 2025 18:24 UTC
84 points
6 comments · 11 min read · LW link

Pick two: concise, comprehensive, or clear rules

Screwtape · 3 Feb 2025 6:39 UTC
84 points
27 comments · 8 min read · LW link

The Mask Comes Off: A Trio of Tales

Zvi · 14 Feb 2025 15:30 UTC
81 points
1 comment · 13 min read · LW link
(thezvi.wordpress.com)

Language Models Use Trigonometry to Do Addition

Subhash Kantamneni · 5 Feb 2025 13:50 UTC
80 points
1 comment · 10 min read · LW link

A Problem to Solve Before Building a Deception Detector

7 Feb 2025 19:35 UTC
78 points
12 comments · 14 min read · LW link

Evaluating “What 2026 Looks Like” So Far

Jonny Spicer · 24 Feb 2025 18:55 UTC
78 points
7 comments · 7 min read · LW link

OpenAI releases deep research agent

Seth Herd · 3 Feb 2025 12:48 UTC
78 points
21 comments · 3 min read · LW link
(openai.com)

Anti-Slop Interventions?

abramdemski · 4 Feb 2025 19:50 UTC
78 points
33 comments · 6 min read · LW link

Osaka

lsusr · 26 Feb 2025 13:50 UTC
78 points
13 comments · 1 min read · LW link

Thermodynamic entropy = Kolmogorov complexity

Aram Ebtekar · 17 Feb 2025 5:56 UTC
77 points
14 comments · 1 min read · LW link
(doi.org)

The Simplest Good

Jesse Hoogland · 2 Feb 2025 19:51 UTC
76 points
6 comments · 5 min read · LW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
73 points
7 comments · 8 min read · LW link