Formalizing Embeddedness Failures in Universal Artificial Intelligence

Cole Wyeth

May 26, 2025, 12:36 PM

39 points

0 comments · 1 min read · LW link

(arxiv.org)

Reward button alignment

Steven Byrnes

May 22, 2025, 5:36 PM

50 points

15 comments · 12 min read · LW link

Unexploitable search: blocking malicious use of free parameters

Jacob Pfau and Geoffrey Irving

May 21, 2025, 5:23 PM

34 points

16 comments · 6 min read · LW link

Modeling versus Implementation

Cole Wyeth

May 18, 2025, 1:38 PM

27 points

10 comments · 3 min read · LW link

Proof Section to an Introduction to Reinforcement Learning for Understanding Infra-Bayesianism

Brittany Gelb

May 17, 2025, 2:36 AM

3 points

0 comments · 9 min read · LW link

An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism

Brittany Gelb

May 17, 2025, 2:34 AM

11 points

0 comments · 20 min read · LW link

It Is Untenable That Near-Future AI Scenario Models Like “AI 2027” Don’t Include Open Source AI

Andrew Dickson

May 16, 2025, 2:20 AM

36 points

17 comments · 5 min read · LW link

Problems with instruction-following as an alignment target

Seth Herd

May 15, 2025, 3:41 PM

48 points

14 comments · 10 min read · LW link

What if Agent-4 breaks out?

Alvin Ånestrand

May 15, 2025, 9:15 AM

10 points

0 comments · 6 min read · LW link

Dodging systematic human errors in scalable oversight

Geoffrey Irving

May 14, 2025, 3:19 PM

33 points

3 comments · 4 min read · LW link

Working through a small tiling result

James Payor

May 13, 2025, 8:28 PM

66 points

9 comments · 5 min read · LW link

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval

Graeme Ford

May 12, 2025, 7:06 PM

5 points

0 comments · 8 min read · LW link

Political sycophancy as a model organism of scheming

Alex Mallen and Vivek Hebbar

May 12, 2025, 5:49 PM

39 points

0 comments · 14 min read · LW link

AIs at the current capability level may be important for future safety work

ryan_greenblatt

May 12, 2025, 2:06 PM

81 points

2 comments · 4 min read · LW link

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda

May 12, 2025, 1:59 AM

60 points

4 comments · 32 min read · LW link

Absolute Zero: Alpha Zero for LLM

alapmi

May 11, 2025, 8:42 PM

23 points

16 comments · 1 min read · LW link

Glass box learners want to be black box

Cole Wyeth

May 10, 2025, 11:05 AM

46 points

10 comments · 4 min read · LW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer

May 9, 2025, 9:29 PM

4 points

1 comment · 6 min read · LW link

Slow corporations as an intuition pump for AI R&D automation

ryan_greenblatt and elifland

9 May 2025 14:49 UTC

91 points

23 comments · 9 min read · LW link

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes

8 May 2025 21:11 UTC

24 points

0 comments · 18 min read · LW link