- [Question] Request for Comments on AI-related Prediction Market Ideas (PeterMcCluskey) · Mar 2, 2025, 8:52 PM · 17 points · 1 comment · 3 min read · LW link
- Statistical Challenges with Making Super IQ babies (Jan Christian Refsgaard) · Mar 2, 2025, 8:26 PM · 154 points · 26 comments · 9 min read · LW link
- Cautions about LLMs in Human Cognitive Loops (Alice Blair) · Mar 2, 2025, 7:53 PM · 39 points · 11 comments · 7 min read · LW link
- Self-fulfilling misalignment data might be poisoning our AI models (TurnTrout) · Mar 2, 2025, 7:51 PM · 153 points · 28 comments · 1 min read · LW link (turntrout.com)
- Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week (spencerg) · Mar 2, 2025, 6:01 PM · 13 points · 0 comments · LW link
- [Question] Will LLM agents become the first takeover-capable AGIs? (Seth Herd) · Mar 2, 2025, 5:15 PM · 36 points · 10 comments · 1 min read · LW link
- Not-yet-falsifiable beliefs? (Benjamin Hendricks) · Mar 2, 2025, 2:11 PM · 6 points · 4 comments · 1 min read · LW link
- Saving Zest (jefftk) · Mar 2, 2025, 12:00 PM · 24 points · 1 comment · 1 min read · LW link (www.jefftk.com)
- Open Thread Spring 2025 (Ben Pace) · Mar 2, 2025, 2:33 AM · 19 points · 50 comments · 1 min read · LW link
- [Question] help, my self image as rational is affecting my ability to empathize with others (KvmanThinking) · Mar 2, 2025, 2:06 AM · 1 point · 13 comments · 1 min read · LW link
- Maintaining Alignment during RSI as a Feedback Control Problem (beren) · Mar 2, 2025, 12:21 AM · 66 points · 6 comments · 11 min read · LW link
- AI Safety Policy Won’t Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares. (henophilia) · Mar 1, 2025, 8:15 PM · 1 point · 1 comment · 1 min read · LW link (blog.hermesloom.org)
- Meaning Machines (appromoximate) · Mar 1, 2025, 7:16 PM · 0 points · 0 comments · 13 min read · LW link
- [Question] Share AI Safety Ideas: Both Crazy and Not (ank) · Mar 1, 2025, 7:08 PM · 16 points · 28 comments · 1 min read · LW link
- Historiographical Compressions: Renaissance as An Example (adamShimi) · Mar 1, 2025, 6:21 PM · 17 points · 4 comments · 7 min read · LW link (formethods.substack.com)
- Real-Time Gigstats (jefftk) · Mar 1, 2025, 2:10 PM · 9 points · 0 comments · 1 min read · LW link (www.jefftk.com)
- Open problems in emergent misalignment (Jan Betley and Daniel Tan) · Mar 1, 2025, 9:47 AM · 82 points · 13 comments · 7 min read · LW link
- Estimating the Probability of Sampling a Trained Neural Network at Random (Adam Scherlis and Nora Belrose) · Mar 1, 2025, 2:11 AM · 32 points · 10 comments · 1 min read · LW link (arxiv.org)
- [Question] What nation did Trump prevent from going to war (Feb. 2025)? (James Camacho) · Mar 1, 2025, 1:46 AM · 3 points · 3 comments · 1 min read · LW link
- AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future (DanielFilan) · Mar 1, 2025, 1:20 AM · 13 points · 0 comments · 13 min read · LW link
- TamperSec is hiring for 3 Key Roles! (Jonathan_H) · Feb 28, 2025, 11:10 PM · 15 points · 0 comments · 4 min read · LW link
- Do we want alignment faking? (Florian_Dietz) · Feb 28, 2025, 9:50 PM · 7 points · 4 comments · 1 min read · LW link
- Few concepts mixing dark fantasy and science fiction (Marek Zegarek) · Feb 28, 2025, 9:03 PM · 0 points · 0 comments · 3 min read · LW link
- Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs (tenseisoham) · Feb 28, 2025, 8:22 PM · 3 points · 0 comments · 9 min read · LW link
- How to Contribute to Theoretical Reward Learning Research (Joar Skalse) · Feb 28, 2025, 7:27 PM · 16 points · 0 comments · 21 min read · LW link
- Other Papers About the Theory of Reward Learning (Joar Skalse) · Feb 28, 2025, 7:26 PM · 16 points · 0 comments · 5 min read · LW link
- Defining and Characterising Reward Hacking (Joar Skalse) · Feb 28, 2025, 7:25 PM · 15 points · 0 comments · 4 min read · LW link
- Misspecification in Inverse Reinforcement Learning—Part II (Joar Skalse) · Feb 28, 2025, 7:24 PM · 9 points · 0 comments · 7 min read · LW link
- STARC: A General Framework For Quantifying Differences Between Reward Functions (Joar Skalse) · Feb 28, 2025, 7:24 PM · 11 points · 0 comments · 8 min read · LW link
- Misspecification in Inverse Reinforcement Learning (Joar Skalse) · Feb 28, 2025, 7:24 PM · 19 points · 0 comments · 11 min read · LW link
- Partial Identifiability in Reward Learning (Joar Skalse) · Feb 28, 2025, 7:23 PM · 16 points · 0 comments · 12 min read · LW link
- The Theoretical Reward Learning Research Agenda: Introduction and Motivation (Joar Skalse) · Feb 28, 2025, 7:20 PM · 26 points · 4 comments · 14 min read · LW link
- An Open Letter To EA and AI Safety On Decelerating AI Development (kenneth_diao) · Feb 28, 2025, 5:21 PM · 8 points · 0 comments · 14 min read · LW link (graspingatwaves.substack.com)
- Dance Weekend Pay II (jefftk) · Feb 28, 2025, 3:10 PM · 11 points · 0 comments · 1 min read · LW link (www.jefftk.com)
- Existentialists and Trolleys (David Gross) · Feb 28, 2025, 2:01 PM · 5 points · 3 comments · 7 min read · LW link
- On Emergent Misalignment (Zvi) · Feb 28, 2025, 1:10 PM · 88 points · 5 comments · 22 min read · LW link (thezvi.wordpress.com)
- Do safety-relevant LLM steering vectors optimized on a single example generalize? (Jacob Dunefsky) · Feb 28, 2025, 12:01 PM · 20 points · 1 comment · 14 min read · LW link (arxiv.org)
- Tetherware #2: What every human should know about our most likely AI future (Jáchym Fibír) · Feb 28, 2025, 11:12 AM · 3 points · 0 comments · 11 min read · LW link (tetherware.substack.com)
- Notes on Superwisdom & Moral RSI (welfvh) · Feb 28, 2025, 10:34 AM · 1 point · 4 comments · 1 min read · LW link
- Cycles (a short story by Claude 3.7 and me) (Knight Lee) · Feb 28, 2025, 7:04 AM · 9 points · 0 comments · 5 min read · LW link
- January-February 2025 Progress in Guaranteed Safe AI (Quinn) · Feb 28, 2025, 3:10 AM · 15 points · 1 comment · 8 min read · LW link (gsai.substack.com)
- Exploring unfaithful/deceptive CoT in reasoning models (Lucy Wingard) · Feb 28, 2025, 2:54 AM · 4 points · 0 comments · 6 min read · LW link
- Weirdness Points (lsusr) · Feb 28, 2025, 2:23 AM · 62 points · 19 comments · 3 min read · LW link
- Do you need years more therapy, or could one conversation resolve the issue? (Chipmonk) · Feb 28, 2025, 12:06 AM · 9 points · 10 comments · 1 min read · LW link
- [New Jersey] HPMOR 10 Year Anniversary Party 🎉 (🟠UnlimitedOranges🟠) · Feb 27, 2025, 10:30 PM · 4 points · 0 comments · 1 min read · LW link
- OpenAI releases GPT-4.5 (Seth Herd) · Feb 27, 2025, 9:40 PM · 34 points · 12 comments · 3 min read · LW link (openai.com)
- The Elicitation Game: Evaluating capability elicitation techniques (Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB and Francis Rhys Ward) · Feb 27, 2025, 8:33 PM · 10 points · 0 comments · 2 min read · LW link
- For the Sake of Pleasure Alone (Greenless Mirror) · Feb 27, 2025, 8:07 PM · 4 points · 14 comments · 12 min read · LW link
- Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations (syh) · Feb 27, 2025, 8:00 PM · −1 points · 0 comments · 1 min read · LW link (medium.com)
- How to Corner Liars: A Miasma-Clearing Protocol (ymeskhout) · Feb 27, 2025, 5:18 PM · 62 points · 23 comments · 7 min read · LW link (www.ymeskhout.com)