Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

jacquesallen · 31 Jan 2025 23:00 UTC
4 points
0 comments · 8 min read · LW link

The Failed Strategy of Artificial Intelligence Doomers

Ben Pace · 31 Jan 2025 18:56 UTC
143 points
77 comments · 5 min read · LW link
(www.palladiummag.com)

Safe Search is off: root causes of AI catastrophic risks

Jemal Young · 31 Jan 2025 18:22 UTC
4 points
0 comments · 3 min read · LW link

5,000 calories of peanut butter every week for 3 years straight

Mr. Keating · 31 Jan 2025 17:29 UTC
18 points
8 comments · 1 min read · LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

31 Jan 2025 16:49 UTC
208 points
28 comments · 12 min read · LW link

Some articles in “International Security” that I enjoyed

Buck · 31 Jan 2025 16:23 UTC
134 points
10 comments · 4 min read · LW link

[Question] How do biological or spiking neural networks learn?

Dom Polsinelli · 31 Jan 2025 16:03 UTC
2 points
1 comment · 2 min read · LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

31 Jan 2025 15:36 UTC
16 points
2 comments · 2 min read · LW link

[Question] Strong, Stable, Open: Choose Two—in search of an article

Eli_ · 31 Jan 2025 14:48 UTC
2 points
0 comments · 1 min read · LW link

DeepSeek: Don’t Panic

Zvi · 31 Jan 2025 14:20 UTC
45 points
6 comments · 27 min read · LW link
(thezvi.wordpress.com)

Catastrophe through Chaos

Marius Hobbhahn · 31 Jan 2025 14:19 UTC
190 points
17 comments · 12 min read · LW link

Interviews with Moonshot AI’s CEO, Yang Zhilin

Cosmia_Nebula · 31 Jan 2025 9:19 UTC
4 points
0 comments · 68 min read · LW link
(rentry.co)

Review: The Lathe of Heaven

dr_s · 31 Jan 2025 8:10 UTC
25 points
1 comment · 8 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud · 31 Jan 2025 7:13 UTC
22 points
1 comment · 2 min read · LW link

Takeaways from sketching a control safety case

joshc · 31 Jan 2025 4:43 UTC
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

Thread for Sense-Making on Recent Murders and How to Sanely Respond

Ben Pace · 31 Jan 2025 3:45 UTC
109 points
146 comments · 2 min read · LW link

Steering Gemini with BiDPO

TurnTrout · 31 Jan 2025 2:37 UTC
104 points
5 comments · 1 min read · LW link
(turntrout.com)

In response to critiques of Guaranteed Safe AI

Nora_Ammann · 31 Jan 2025 1:43 UTC
44 points
14 comments · 26 min read · LW link

Proposal for a Form of Conditional Supplemental Income (CSI) in a Post-Work World

sweenesm · 31 Jan 2025 1:00 UTC
7 points
2 comments · 3 min read · LW link

Outlaw Code

Commander Zander · 30 Jan 2025 23:41 UTC
10 points
1 comment · 2 min read · LW link

Can someone, anyone, make superintelligence a more concrete concept?

Ori Nagel · 30 Jan 2025 23:25 UTC
3 points
6 comments · 4 min read · LW link

Upcoming Neuroscience Workshop—Functionalizing Brain Data, Ground-Truthing, and the Role of Artificial Data in Advancing Neuroscience

Devin Ward · 30 Jan 2025 23:02 UTC
1 point
0 comments · 1 min read · LW link

What’s Behind the SynBio Bust?

sarahconstantin · 30 Jan 2025 22:30 UTC
55 points
8 comments · 6 min read · LW link
(sarahconstantin.substack.com)

The future of humanity is in management

jasoncrawford · 30 Jan 2025 22:14 UTC
3 points
5 comments · 13 min read · LW link
(newsletter.rootsofprogress.org)

[Translation] AI Generated Fake News is Taking Over my Family Group Chat

mushroomsoup · 30 Jan 2025 20:24 UTC
3 points
0 comments · 6 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
61 points
0 comments · 5 min read · LW link

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

30 Jan 2025 17:03 UTC
167 points
65 comments · 2 min read · LW link
(gradual-disempowerment.ai)

[Question] Implication of Uncomputable Problems

Nathan1123 · 30 Jan 2025 16:48 UTC
−3 points
3 comments · 1 min read · LW link

Hello World

Charlie Sanders · 30 Jan 2025 15:33 UTC
7 points
0 comments · 2 min read · LW link
(www.dailymicrofiction.com)

Introducing the Coalition for a Baruch Plan for AI: A Call for a Radical Treaty-Making Process for the Global Governance of AI

rguerreschi · 30 Jan 2025 15:26 UTC
11 points
0 comments · 2 min read · LW link

AI #101: The Shallow End

Zvi · 30 Jan 2025 14:50 UTC
39 points
1 comment · 59 min read · LW link
(thezvi.wordpress.com)

Memorization-generalization in practice

Dmitry Vaintrob · 30 Jan 2025 14:10 UTC
7 points
1 comment · 4 min read · LW link

ARENA 5.0 - Call for Applicants

30 Jan 2025 13:18 UTC
35 points
2 comments · 6 min read · LW link

You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com

Arjun Panickssery · 30 Jan 2025 12:35 UTC
52 points
3 comments · 3 min read · LW link
(arjunpanickssery.substack.com)

[Question] Should you publish solutions to corrigibility?

rvnnt · 30 Jan 2025 11:52 UTC
13 points
13 comments · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology

Cosmia_Nebula · 30 Jan 2025 9:53 UTC
30 points
1 comment · 8 min read · LW link
(rentry.co)

Are we the Wolves now? Human Eugenics under AI Control

Brit · 30 Jan 2025 8:31 UTC
−1 points
2 comments · 2 min read · LW link

[Question] Why not train reasoning models with RLHF?

Caleb Biddulph · 30 Jan 2025 7:58 UTC
4 points
4 comments · 1 min read · LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments

Shivam · 30 Jan 2025 2:44 UTC
1 point
0 comments · 11 min read · LW link

How *exactly* can AI take your job in the next few years?

Ansh Juneja · 30 Jan 2025 2:33 UTC
9 points
0 comments · 21 min read · LW link

Absorbing Your Friends’ Powers

Alice Blair · 30 Jan 2025 2:32 UTC
8 points
1 comment · 2 min read · LW link

Detailed Ideal World Benchmark

Knight Lee · 30 Jan 2025 2:31 UTC
5 points
2 comments · 2 min read · LW link

Fertility Will Never Recover

Eneasz · 30 Jan 2025 1:16 UTC
17 points
31 comments · 2 min read · LW link
(deathisbad.substack.com)

Predation as Payment for Criticism

Benquo · 30 Jan 2025 1:06 UTC
10 points
6 comments · 1 min read · LW link
(benjaminrosshoffman.com)

Learn to Develop Your Advantage

ReverendBayes · 29 Jan 2025 22:06 UTC
16 points
1 comment · 5 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · 29 Jan 2025 21:01 UTC
9 points
5 comments · 4 min read · LW link

Allegory of the Tsunami

Evan Hu · 29 Jan 2025 19:09 UTC
4 points
1 comment · 3 min read · LW link

My Mental Model of AI Optimist Opinions

tailcalled · 29 Jan 2025 18:44 UTC
14 points
7 comments · 1 min read · LW link

Planning for Extreme AI Risks

joshc · 29 Jan 2025 18:33 UTC
143 points
5 comments · 16 min read · LW link