The Compliment Sandwich 🥪 aka: How to criticize a normie without making them upset. · keltan · Mar 3, 2025, 11:15 PM · 13 points · 10 comments · 1 min read · LW link
AI Safety at the Frontier: Paper Highlights, February ’25 · gasteigerjo · Mar 3, 2025, 10:09 PM · 7 points · 0 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
What goals will AIs have? A list of hypotheses · Daniel Kokotajlo · Mar 3, 2025, 8:08 PM · 87 points · 19 comments · 18 min read · LW link
Takeaways From Our Recent Work on SAE Probing · Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda · Mar 3, 2025, 7:50 PM · 30 points · 0 comments · 5 min read · LW link
Why People Commit White Collar Fraud (Ozy linkpost) · sapphire · Mar 3, 2025, 7:33 PM · 22 points · 1 comment · 1 min read · LW link (thingofthings.substack.com)
[Question] Ask Me Anything—Samuel · samuelshadrach · Mar 3, 2025, 7:24 PM · 0 points · 0 comments · 1 min read · LW link
Expanding HarmBench: Investigating Gaps & Extending Adversarial LLM Testing · racinkc1 · Mar 3, 2025, 7:23 PM · 1 point · 0 comments · 1 min read · LW link
Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers · jleibowich, Nikola Jurkovic and Tom Davidson · Mar 3, 2025, 7:05 PM · 43 points · 1 comment · 1 min read · LW link (papers.ssrn.com)
Middle School Choice · jefftk · Mar 3, 2025, 4:10 PM · 27 points · 10 comments · 4 min read · LW link (www.jefftk.com)
On GPT-4.5 · Zvi · Mar 3, 2025, 1:40 PM · 44 points · 12 comments · 22 min read · LW link (thezvi.wordpress.com)
Coalescence—Determinism In Ways We Care About · vitaliya · Mar 3, 2025, 1:20 PM · 12 points · 0 comments · 11 min read · LW link
Methods for strong human germline engineering · TsviBT · Mar 3, 2025, 8:13 AM · 149 points · 28 comments · 108 min read · LW link
[Question] Examples of self-fulfilling prophecies in AI alignment? · Chris Lakin · Mar 3, 2025, 2:45 AM · 22 points · 6 comments · 1 min read · LW link
[Question] Request for Comments on AI-related Prediction Market Ideas · PeterMcCluskey · Mar 2, 2025, 8:52 PM · 17 points · 1 comment · 3 min read · LW link
Statistical Challenges with Making Super IQ babies · Jan Christian Refsgaard · Mar 2, 2025, 8:26 PM · 154 points · 26 comments · 9 min read · LW link
Cautions about LLMs in Human Cognitive Loops · Alice Blair · Mar 2, 2025, 7:53 PM · 39 points · 11 comments · 7 min read · LW link
Self-fulfilling misalignment data might be poisoning our AI models · TurnTrout · Mar 2, 2025, 7:51 PM · 153 points · 28 comments · 1 min read · LW link (turntrout.com)
Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week · spencerg · Mar 2, 2025, 6:01 PM · 13 points · 0 comments · LW link
[Question] Will LLM agents become the first takeover-capable AGIs? · Seth Herd · Mar 2, 2025, 5:15 PM · 36 points · 10 comments · 1 min read · LW link
Not-yet-falsifiable beliefs? · Benjamin Hendricks · Mar 2, 2025, 2:11 PM · 6 points · 4 comments · 1 min read · LW link
Saving Zest · jefftk · Mar 2, 2025, 12:00 PM · 24 points · 1 comment · 1 min read · LW link (www.jefftk.com)
Open Thread Spring 2025 · Ben Pace · Mar 2, 2025, 2:33 AM · 19 points · 51 comments · 1 min read · LW link
[Question] help, my self image as rational is affecting my ability to empathize with others · KvmanThinking · Mar 2, 2025, 2:06 AM · 1 point · 13 comments · 1 min read · LW link
Maintaining Alignment during RSI as a Feedback Control Problem · beren · Mar 2, 2025, 12:21 AM · 66 points · 6 comments · 11 min read · LW link
AI Safety Policy Won’t Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares. · henophilia · Mar 1, 2025, 8:15 PM · 1 point · 1 comment · 1 min read · LW link (blog.hermesloom.org)
Meaning Machines · appromoximate · Mar 1, 2025, 7:16 PM · 0 points · 0 comments · 13 min read · LW link
[Question] Share AI Safety Ideas: Both Crazy and Not · ank · Mar 1, 2025, 7:08 PM · 16 points · 28 comments · 1 min read · LW link
Historiographical Compressions: Renaissance as An Example · adamShimi · Mar 1, 2025, 6:21 PM · 17 points · 4 comments · 7 min read · LW link (formethods.substack.com)
Real-Time Gigstats · jefftk · Mar 1, 2025, 2:10 PM · 9 points · 0 comments · 1 min read · LW link (www.jefftk.com)
Open problems in emergent misalignment · Jan Betley and Daniel Tan · Mar 1, 2025, 9:47 AM · 82 points · 13 comments · 7 min read · LW link
Estimating the Probability of Sampling a Trained Neural Network at Random · Adam Scherlis and Nora Belrose · Mar 1, 2025, 2:11 AM · 32 points · 10 comments · 1 min read · LW link (arxiv.org)
[Question] What nation did Trump prevent from going to war (Feb. 2025)? · James Camacho · Mar 1, 2025, 1:46 AM · 3 points · 3 comments · 1 min read · LW link
AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future · DanielFilan · Mar 1, 2025, 1:20 AM · 13 points · 0 comments · 13 min read · LW link
TamperSec is hiring for 3 Key Roles! · Jonathan_H · Feb 28, 2025, 11:10 PM · 15 points · 0 comments · 4 min read · LW link
Do we want alignment faking? · Florian_Dietz · Feb 28, 2025, 9:50 PM · 7 points · 4 comments · 1 min read · LW link
Few concepts mixing dark fantasy and science fiction · Marek Zegarek · Feb 28, 2025, 9:03 PM · 0 points · 0 comments · 3 min read · LW link
Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs · tenseisoham · Feb 28, 2025, 8:22 PM · 3 points · 0 comments · 9 min read · LW link
How to Contribute to Theoretical Reward Learning Research · Joar Skalse · Feb 28, 2025, 7:27 PM · 16 points · 0 comments · 21 min read · LW link
Other Papers About the Theory of Reward Learning · Joar Skalse · Feb 28, 2025, 7:26 PM · 16 points · 0 comments · 5 min read · LW link
Defining and Characterising Reward Hacking · Joar Skalse · Feb 28, 2025, 7:25 PM · 15 points · 0 comments · 4 min read · LW link
Misspecification in Inverse Reinforcement Learning—Part II · Joar Skalse · Feb 28, 2025, 7:24 PM · 9 points · 0 comments · 7 min read · LW link
STARC: A General Framework For Quantifying Differences Between Reward Functions · Joar Skalse · Feb 28, 2025, 7:24 PM · 11 points · 0 comments · 8 min read · LW link
Misspecification in Inverse Reinforcement Learning · Joar Skalse · Feb 28, 2025, 7:24 PM · 19 points · 0 comments · 11 min read · LW link
Partial Identifiability in Reward Learning · Joar Skalse · 28 Feb 2025 19:23 UTC · 16 points · 0 comments · 12 min read · LW link
The Theoretical Reward Learning Research Agenda: Introduction and Motivation · Joar Skalse · 28 Feb 2025 19:20 UTC · 26 points · 4 comments · 14 min read · LW link
An Open Letter To EA and AI Safety On Decelerating AI Development · kenneth_diao · 28 Feb 2025 17:21 UTC · 8 points · 0 comments · 14 min read · LW link (graspingatwaves.substack.com)
Dance Weekend Pay II · jefftk · 28 Feb 2025 15:10 UTC · 11 points · 0 comments · 1 min read · LW link (www.jefftk.com)
Existentialists and Trolleys · David Gross · 28 Feb 2025 14:01 UTC · 5 points · 3 comments · 7 min read · LW link
On Emergent Misalignment · Zvi · 28 Feb 2025 13:10 UTC · 88 points · 5 comments · 22 min read · LW link (thezvi.wordpress.com)
Do safety-relevant LLM steering vectors optimized on a single example generalize? · Jacob Dunefsky · 28 Feb 2025 12:01 UTC · 20 points · 1 comment · 14 min read · LW link (arxiv.org)