Mid-Generation Self-Correction: A Simple Tool for Safer AI

MrThink 19 Dec 2024 23:41 UTC

13 points

0 comments 1 min read LW link

Apply now to SPAR!

agucova 19 Dec 2024 22:29 UTC

11 points

0 comments 1 min read LW link

How to replicate and extend our alignment faking demo

Fabien Roger 19 Dec 2024 21:44 UTC

114 points

5 comments 2 min read LW link

(alignment.anthropic.com)

The Genesis Project

mannatvjain 19 Dec 2024 21:26 UTC

15 points

0 comments 1 min read LW link

(genesis-embodied-ai.github.io)

Measuring whether AIs can statelessly strategize to subvert security measures

Alex Mallen and Buck

19 Dec 2024 21:25 UTC

65 points

0 comments 11 min read LW link

Claude’s Constitutional Consequentialism?

1a3orn 19 Dec 2024 19:53 UTC

44 points

6 comments 6 min read LW link

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose 19 Dec 2024 19:19 UTC

6 points

0 comments 4 min read LW link

When Is Insurance Worth It?

kqr 19 Dec 2024 19:07 UTC

182 points

73 comments 4 min read LW link 1 review

(entropicthoughts.com)

Launching Third Opinion: Anonymous Expert Consultation for AI Professionals

karl 19 Dec 2024 19:06 UTC

3 points

0 comments 5 min read LW link

Using LLM Search to Augment (Mathematics) Research

kaleb 19 Dec 2024 18:59 UTC

5 points

0 comments 6 min read LW link

A progress policy agenda

jasoncrawford 19 Dec 2024 18:42 UTC

31 points

1 comment 5 min read LW link

(newsletter.rootsofprogress.org)

building character isn’t about willpower or sacrifice

dhruvmethi 19 Dec 2024 18:17 UTC

1 point

0 comments 4 min read LW link

AISN #45: Center for AI Safety 2024 Year in Review

Corin Katzke and Dan H

19 Dec 2024 18:15 UTC

13 points

0 comments 4 min read LW link

(newsletter.safe.ai)

Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

19 Dec 2024 15:59 UTC

46 points

6 comments 11 min read LW link

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej 19 Dec 2024 15:47 UTC

13 points

2 comments 3 min read LW link

AI #95: o1 Joins the API

Zvi 19 Dec 2024 15:10 UTC

58 points

1 comment 41 min read LW link

(thezvi.wordpress.com)

Executive Director for AIS Brussels—Expression of interest

gergogaspar and ENAIS

19 Dec 2024 9:19 UTC

1 point

0 comments 4 min read LW link

Executive Director for AIS France—Expression of interest

gergogaspar and ENAIS

19 Dec 2024 8:14 UTC

9 points

0 comments 3 min read LW link

Inescapably Value-Laden Experience—a Catchy Term I Made Up to Make Morality Rationalisable

James Stephen Brown 19 Dec 2024 4:45 UTC

5 points

0 comments 2 min read LW link

(nonzerosum.games)

I’m Writing a Book About Liberalism

Yoav Ravid 19 Dec 2024 0:13 UTC

6 points

6 comments 2 min read LW link

A Solution for AGI/ASI Safety

Weibing Wang 18 Dec 2024 19:44 UTC

50 points

29 comments 1 min read LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith 18 Dec 2024 18:22 UTC

105 points

7 comments 62 min read LW link

A Matter of Taste

Zvi 18 Dec 2024 17:50 UTC

36 points

5 comments 11 min read LW link

(thezvi.wordpress.com)

Are we a different person each time? A simple argument for the impermanence of our identity

l4mp 18 Dec 2024 17:21 UTC

−4 points

5 comments 1 min read LW link

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

18 Dec 2024 17:19 UTC

492 points

87 comments 10 min read LW link 3 reviews

Can o1-preview find major mistakes amongst 59 NeurIPS ’24 MLSB papers?

Abhishaike Mahajan 18 Dec 2024 14:21 UTC

19 points

0 comments 6 min read LW link

(www.owlposting.com)

Walking Sue

Matthew McRedmond 18 Dec 2024 13:19 UTC

2 points

5 comments 8 min read LW link

What conclusions can be drawn from a single observation about wealth in tennis?

Trevor Cappallo 18 Dec 2024 9:55 UTC

8 points

3 comments 2 min read LW link

Don’t Associate AI Safety With Activism

Eneasz 18 Dec 2024 8:01 UTC

17 points

15 comments 1 min read LW link

(deathisbad.substack.com)

[Question] How should I optimize my decision making model for ‘ideas’?

CstineSublime 18 Dec 2024 4:09 UTC

3 points

0 comments 4 min read LW link

Preppers Are Too Negative on Objects

jefftk 18 Dec 2024 2:30 UTC

45 points

2 comments 1 min read LW link

(www.jefftk.com)

Review: Breaking Free with Dr. Stone

TurnTrout 18 Dec 2024 1:26 UTC

47 points

5 comments 1 min read LW link

(turntrout.com)

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

17 Dec 2024 23:58 UTC

116 points

1 comment 2 min read LW link

Careless thinking: A theory of bad thinking

Nathan Young 17 Dec 2024 18:23 UTC

49 points

17 comments 9 min read LW link

(nathanpmyoung.substack.com)

The Second Gemini

Zvi 17 Dec 2024 15:50 UTC

23 points

0 comments 11 min read LW link

(thezvi.wordpress.com)

AIS Hungary is hiring a part-time Technical Lead! (Deadline: Dec 31st)

gergogaspar 17 Dec 2024 14:12 UTC

1 point

0 comments 2 min read LW link

Everything you care about is in the map

Tahp 17 Dec 2024 14:05 UTC

17 points

27 comments 3 min read LW link

Reality is Fractal-Shaped

silentbob 17 Dec 2024 13:52 UTC

18 points

1 comment 8 min read LW link

Trying to translate when people talk past each other

Kaj_Sotala 17 Dec 2024 9:40 UTC

41 points

12 comments 6 min read LW link

(kajsotala.fi)

What is “wireheading”?

Vishakha and Algon

17 Dec 2024 7:49 UTC

10 points

0 comments 1 min read LW link

(aisafety.info)

Where do you put your ideas?

CstineSublime 17 Dec 2024 7:26 UTC

9 points

20 comments 1 min read LW link

Elevating Air Purifiers

jefftk 17 Dec 2024 1:40 UTC

25 points

0 comments 1 min read LW link

(www.jefftk.com)

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Caspar Oesterheld, Ethan Perez and Chi Nguyen

16 Dec 2024 22:42 UTC

53 points

1 comment 2 min read LW link

(arxiv.org)

A practical guide to tiling the universe with hedonium

Vittu Perkele 16 Dec 2024 21:25 UTC

−8 points

1 comment 1 min read LW link

(perkeleperusing.substack.com)

AI Safety Seed Funding Network—Join as a Donor or Investor

Alexandra Bos 16 Dec 2024 19:30 UTC

30 points

0 comments 2 min read LW link

I read every major AI lab’s safety plan so you don’t have to

sarahhw 16 Dec 2024 18:51 UTC

20 points

0 comments 12 min read LW link

(longerramblings.substack.com)

Grokking revisited: reverse engineering grokking modulo addition in LSTM

Nikita Khomich and Danik

16 Dec 2024 18:48 UTC

4 points

0 comments 6 min read LW link

Progress links and short notes, 2024-12-16

jasoncrawford 16 Dec 2024 17:24 UTC

7 points

0 comments 2 min read LW link

(newsletter.rootsofprogress.org)

Effective Altruism FAQ

Bentham's Bulldog 16 Dec 2024 16:27 UTC

0 points

7 comments 12 min read LW link

Variably compressibly studies are fun

dkl9 16 Dec 2024 16:00 UTC

0 points

0 comments 2 min read LW link

(dkl9.net)