Anomalous Tokens in DeepSeek-V3 and r1 · henry · Jan 25, 2025, 10:55 PM · 137 points · 3 comments · 7 min read · LW link
The Rising Sea · Jesse Hoogland · Jan 25, 2025, 8:48 PM · 92 points · 2 comments · 2 min read · LW link
Monet: Mixture of Monosemantic Experts for Transformers Explained · CalebMaresca · Jan 25, 2025, 7:37 PM · 20 points · 2 comments · 11 min read · LW link
AI and Non-Existence. · Eleven · Jan 25, 2025, 7:36 PM · −3 points · 9 comments · 2 min read · LW link
Agents don’t have to be aligned to help us achieve an indefinite pause. · Hastings · Jan 25, 2025, 6:51 PM · 29 points · 0 comments · 3 min read · LW link
[Question] AI Safety in secret · Michael Flood · Jan 25, 2025, 6:16 PM · 7 points · 0 comments · 1 min read · LW link
On polytopes · Dmitry Vaintrob · Jan 25, 2025, 1:56 PM · 56 points · 5 comments · 12 min read · LW link
Attribution-based parameter decomposition · Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey · Jan 25, 2025, 1:12 PM · 108 points · 22 comments · 4 min read · LW link · (publications.apolloresearch.ai)
A concise definition of what it means to win · testingthewaters · Jan 25, 2025, 6:37 AM · 4 points · 1 comment · 5 min read · LW link · (aclevername.substack.com)
[Question] A Floating Cube—Rejected HLE submission · Shankar Sivarajan · Jan 25, 2025, 4:52 AM · 7 points · 1 comment · 1 min read · LW link
Why I’m Pouring Cold Water in My Left Ear, and You Should Too · Maloew · Jan 24, 2025, 11:13 PM · 12 points · 0 comments · 2 min read · LW link
Counterintuitive effects of minimum prices · dynomight · Jan 24, 2025, 11:05 PM · 25 points · 0 comments · 8 min read · LW link · (dynomight.net)
AXRP Episode 38.6 - Joel Lehman on Positive Visions of AI · DanielFilan · Jan 24, 2025, 11:00 PM · 10 points · 0 comments · 9 min read · LW link
Locating and Editing Knowledge in LMs · Dhananjay Ashok · Jan 24, 2025, 10:53 PM · 1 point · 0 comments · 4 min read · LW link
How are Those AI Participants Doing Anyway? · mushroomsoup · Jan 24, 2025, 10:37 PM · 4 points · 0 comments · 10 min read · LW link
Six Thoughts on AI Safety · boazbarak · Jan 24, 2025, 10:20 PM · 91 points · 55 comments · 15 min read · LW link
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · johnswentworth and David Lorell · Jan 24, 2025, 8:20 PM · 181 points · 61 comments · 5 min read · LW link
Yudkowsky on The Trajectory podcast · Seth Herd · Jan 24, 2025, 7:52 PM · 71 points · 39 comments · 2 min read · LW link · (www.youtube.com)
Empirical Insights into Feature Geometry in Sparse Autoencoders · Jason Boxi Zhang · Jan 24, 2025, 7:02 PM · 7 points · 0 comments · 11 min read · LW link
Ideas for CoT Models: A Geometric Perspective on Latent Space Reasoning · Rohan Ganapavarapu · Jan 24, 2025, 7:01 PM · 2 points · 0 comments · 2 min read · LW link · (rohan.ga)
Liron Shapira vs Ken Stanley on Doom Debates. A review · TheManxLoiner · Jan 24, 2025, 6:01 PM · 9 points · 0 comments · 14 min read · LW link
Is there such a thing as an impossible protein? · Abhishaike Mahajan · Jan 24, 2025, 5:12 PM · 15 points · 3 comments · 4 min read · LW link · (www.owlposting.com)
Stargate AI-1 · Zvi · Jan 24, 2025, 3:20 PM · 85 points · 1 comment · 18 min read · LW link · (thezvi.wordpress.com)
QFT and neural nets: the basic idea · Dmitry Vaintrob · Jan 24, 2025, 1:54 PM · 26 points · 0 comments · 8 min read · LW link
Eliciting bad contexts · Geoffrey Irving, Joseph Bloom and Tomek Korbak · Jan 24, 2025, 10:39 AM · 34 points · 9 comments · 3 min read · LW link
Insights from “The Manga Guide to Physiology” · TurnTrout · Jan 24, 2025, 5:18 AM · 26 points · 3 comments · 1 min read · LW link · (turntrout.com)
[Question] Do you consider perfect surveillance inevitable? · samuelshadrach · Jan 24, 2025, 4:57 AM · 16 points · 34 comments · 1 min read · LW link
Uncontrollable: A Surprisingly Good Introduction to AI Risk · PeterMcCluskey · Jan 24, 2025, 4:30 AM · 11 points · 0 comments · 1 min read · LW link · (bayesianinvestor.com)
Contra Dances Getting Shorter and Earlier · jefftk · Jan 23, 2025, 11:30 PM · 11 points · 0 comments · 2 min read · LW link · (www.jefftk.com)
Starting Thoughts on RLHF · Michael Flood · Jan 23, 2025, 10:16 PM · 2 points · 0 comments · 5 min read · LW link
Updating and Editing Factual Knowledge in Language Models · Dhananjay Ashok · Jan 23, 2025, 7:34 PM · 2 points · 2 comments · 10 min read · LW link
AI companies are unlikely to make high-assurance safety cases if timelines are short · ryan_greenblatt · Jan 23, 2025, 6:41 PM · 145 points · 5 comments · 13 min read · LW link
AISN #46: The Transition · Corin Katzke and Dan H · Jan 23, 2025, 6:09 PM · 8 points · 0 comments · 5 min read · LW link · (newsletter.safe.ai)
What does success look like? · Raymond Douglas · Jan 23, 2025, 5:48 PM · 11 points · 0 comments · 3 min read · LW link
AI #100: Meet the New Boss · Zvi · Jan 23, 2025, 3:40 PM · 50 points · 4 comments · 69 min read · LW link · (thezvi.wordpress.com)
[Cross-post] Every Bay Area “Walled Compound” · davekasten · Jan 23, 2025, 3:05 PM · 37 points · 3 comments · 3 min read · LW link
Writing experiments and the banana escape valve · Dmitry Vaintrob · Jan 23, 2025, 1:11 PM · 34 points · 1 comment · 2 min read · LW link
MONA: Managed Myopia with Approval Feedback · Seb Farquhar, David Lindner and Rohin Shah · Jan 23, 2025, 12:24 PM · 80 points · 30 comments · 9 min read · LW link
[Question] How useful would alien alignment research be? · Donald Hobson · Jan 23, 2025, 10:59 AM · 17 points · 5 comments · 1 min read · LW link
What are the differences between AGI, transformative AI, and superintelligence? · Vishakha and Algon · Jan 23, 2025, 10:03 AM · 10 points · 3 comments · 3 min read · LW link · (aisafety.info)
Why Aligning an LLM is Hard, and How to Make it Easier · RogerDearnaley · Jan 23, 2025, 6:44 AM · 33 points · 3 comments · 4 min read · LW link
Tail SP 500 Call Options · sapphire · Jan 23, 2025, 5:21 AM · 67 points · 28 comments · 2 min read · LW link
A hierarchy of disagreement · Adam Zerner · Jan 23, 2025, 3:17 AM · 21 points · 4 comments · 8 min read · LW link
Early Experiments in Human Auditing for AI Control · Joey Yudelson and Buck · Jan 23, 2025, 1:34 AM · 27 points · 0 comments · 7 min read · LW link
You Have Two Brains · Eneasz · Jan 23, 2025, 12:52 AM · 24 points · 5 comments · 5 min read · LW link · (deathisbad.substack.com)
[Question] are there 2 types of alignment? · KvmanThinking · Jan 23, 2025, 12:08 AM · 4 points · 9 comments · 1 min read · LW link
Theory of Change for AI Safety Camp · Linda Linsefors · Jan 22, 2025, 10:07 PM · 36 points · 3 comments · 7 min read · LW link
On DeepSeek’s r1 · Zvi · Jan 22, 2025, 7:50 PM · 55 points · 2 comments · 35 min read · LW link · (thezvi.wordpress.com)
Detect Goodhart and shut down · Jeremy Gillen · Jan 22, 2025, 6:45 PM · 70 points · 21 comments · 7 min read · LW link
Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models · rife · Jan 22, 2025, 6:36 PM · 8 points · 6 comments · 2 min read · LW link