18 Dec 2025 23:41 UTC

105 points

66 comments · 6 min read · LW link

OpenAI: Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations

Marcus Williams and micahcarroll

18 Dec 2025 22:55 UTC

25 points

1 comment · 1 min read · LW link

(alignment.openai.com)

Scalable End-to-End Interpretability

jsteinhardt 18 Dec 2025 22:37 UTC

120 points

3 comments · 3 min read · LW link

My Trip to NeurIPS 2025

Adam Newgas 18 Dec 2025 22:31 UTC

15 points

0 comments · 4 min read · LW link

(www.boristhebrave.com)

Leading by example

martinkunev 18 Dec 2025 20:30 UTC

3 points

2 comments · 3 min read · LW link

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas and Owain_Evans

18 Dec 2025 20:21 UTC

154 points

11 comments · 8 min read · LW link

(arxiv.org)

A Study Of Instinct

LoganStrohl 18 Dec 2025 20:19 UTC

30 points

0 comments · 4 min read · LW link

Estimating The Portion of Income Consumed By Essentials Between 1985 and 2025

Mars_Will_Be_Ours 18 Dec 2025 19:19 UTC

2 points

2 comments · 3 min read · LW link

(shoutinginthedarkforest.substack.com)

Chemical (hunger) argument paraphrased

lemonhope 18 Dec 2025 18:58 UTC

10 points

7 comments · 1 min read · LW link

BashArena: A Control Setting for Highly Privileged AI Agents

james.lucassen and Adam Kaufman

18 Dec 2025 18:19 UTC

58 points

0 comments · 15 min read · LW link

(blog.redwoodresearch.org)

AI Safety Orgs Should Apply for Government Grants

DusanDNesic 18 Dec 2025 18:01 UTC

25 points

0 comments · 5 min read · LW link

Good if make prior after data instead of before

dynomight 18 Dec 2025 17:53 UTC

117 points

18 comments · 9 min read · LW link

(dynomight.net)

AI #147: Flash Forward

Zvi 18 Dec 2025 16:50 UTC

31 points

2 comments · 58 min read · LW link

(thezvi.wordpress.com)

50 Things I Know

Rebecca Dai 18 Dec 2025 16:32 UTC

6 points

8 comments · 7 min read · LW link

(rebeccadai.substack.com)

Announcing Spring 2026 AI Forecasting Benchmark

Ben Wilson 18 Dec 2025 15:43 UTC

2 points

0 comments · 4 min read · LW link

(www.metaculus.com)

Deep Learning and Precipitation Reactions: A Tale of Universality

Max Hennick 18 Dec 2025 14:34 UTC

57 points

4 comments · 18 min read · LW link

A Functional Typology of Cognitive Capabilities (Interactive Visualization)

Anurag 18 Dec 2025 14:06 UTC

2 points

0 comments · 4 min read · LW link

The Undervalued Kleene Hierarchy

milanrosko 18 Dec 2025 11:57 UTC

10 points

2 comments · 6 min read · LW link

[Paper] Self-Transparency Failures in Expert-Persona LLMs

Alex Diep 18 Dec 2025 9:09 UTC

8 points

0 comments · 6 min read · LW link

Solstice Sundowners

teegs 18 Dec 2025 8:12 UTC

1 point

0 comments · 1 min read · LW link

A basic case for donating to the Berkeley Genomics Project

TsviBT 18 Dec 2025 1:55 UTC

85 points

5 comments · 4 min read · LW link

Apply to MATS Summer 2026!

Raj Thimmiah, Ryan Kidd and Elise Racine

18 Dec 2025 1:51 UTC

31 points

0 comments · 1 min read · LW link

Making Linear Probes Interpretable

ZuiderveldTimJ 18 Dec 2025 1:48 UTC

17 points

0 comments · 10 min read · LW link

A browser game about AI safety

NickSharp 17 Dec 2025 22:36 UTC

18 points

4 comments · 1 min read · LW link

What if we could grow human tissue by recapitulating embryogenesis?

Abhishaike Mahajan 17 Dec 2025 21:53 UTC

23 points

0 comments · 1 min read · LW link

(www.owlposting.com)

Transmitting Misalignment with Subliminal Learning via Paraphrasing

Matthew Bozoukov, Taywon Min, CallumMcDougall and J Rosser

17 Dec 2025 19:34 UTC

39 points

0 comments · 10 min read · LW link

Shallow review of technical AI safety, 2025

technicalities, Tomáš Gavenčiak, Stephen McAleese, peligrietzer, Stag, jordinne, ozziegooen, Violet Hour and lenz

17 Dec 2025 18:18 UTC

191 points

9 comments · 47 min read · LW link

Announcing RoastMyPost: LLMs Eval Blog Posts and More

ozziegooen 17 Dec 2025 18:10 UTC

110 points

17 comments · 5 min read · LW link

Alignment Fine-Tuning: Lessons from Operant Conditioning

foodforthought 17 Dec 2025 16:57 UTC

5 points

4 comments · 10 min read · LW link

Bryan Caplan on Ethical Intuitionism

vatsal_newsletter 17 Dec 2025 16:48 UTC

−5 points

0 comments · 1 min read · LW link

(www.readvatsal.com)

The Bleeding Mind

Adele Lopez 17 Dec 2025 16:27 UTC

68 points

9 comments · 6 min read · LW link

Could space debris block access to outer space?

fin 17 Dec 2025 15:59 UTC

12 points

5 comments · 3 min read · LW link

(www.forethought.org)

An intuitive explanation of backdoor paths using DAGs

enterthewoods 17 Dec 2025 15:42 UTC

10 points

0 comments · 6 min read · LW link

Still Too Soon

Gordon Seidoh Worley 17 Dec 2025 15:40 UTC

75 points

3 comments · 2 min read · LW link

(www.uncertainupdates.com)

The $140K Question: Cost Changes Over Time

Zvi 17 Dec 2025 14:10 UTC

29 points

2 comments · 18 min read · LW link

(thezvi.wordpress.com)

[Question] Can you recommend some reading about effective environmentalism?

SpectrumDT 17 Dec 2025 11:15 UTC

3 points

0 comments · 1 min read · LW link

Memory Consolidation

Elliot Callender 17 Dec 2025 11:03 UTC

2 points

0 comments · 2 min read · LW link

(substack.com)

On publishing every day for 30 days

Alexandre Variengien 17 Dec 2025 8:30 UTC

11 points

0 comments · 5 min read · LW link

(alexandrevariengien.com)

Dancing in a World of Horseradish

lsusr 17 Dec 2025 5:50 UTC

136 points

31 comments · 4 min read · LW link

Video and transcript of talk on human-like-ness in AI safety

Joe Carlsmith 17 Dec 2025 4:09 UTC

10 points

0 comments · 36 min read · LW link

Lessons from a failed ambitious alignment program

Kabir Kumar 17 Dec 2025 1:50 UTC

57 points

5 comments · 3 min read · LW link

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism

James Hoffend 17 Dec 2025 1:40 UTC

12 points

0 comments · 4 min read · LW link

Announcing TARA: Receive (and Give) Technical AI Safety Training Without Leaving Your Home City

Zac Broeren 17 Dec 2025 1:33 UTC

5 points

0 comments · 4 min read · LW link

Announcing: MIRI Technical Governance Team Research Fellowship

yams, peterbarnett, Aaron_Scher and Robi Rahman

17 Dec 2025 0:02 UTC

61 points

5 comments · 2 min read · LW link

(techgov.intelligence.org)

Non-Scheming Saints (Whether Human Or Digital) Might Be Shirking Their Governance Duties, And, If True, It Is Probably An Objective Tragedy

JenniferRM 16 Dec 2025 23:56 UTC

42 points

3 comments · 9 min read · LW link

A Primer on Operant Conditioning

foodforthought 16 Dec 2025 21:26 UTC

5 points

0 comments · 4 min read · LW link

Towards training-time mitigations for alignment faking in RL

Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub

16 Dec 2025 21:01 UTC

39 points

1 comment · 5 min read · LW link

(alignment.anthropic.com)

Measuring Drug Target Success

sarahconstantin 16 Dec 2025 21:00 UTC

19 points

3 comments · 2 min read · LW link

(sarahconstantin.substack.com)

A Study in Attention

hamilton 16 Dec 2025 20:39 UTC

14 points

0 comments · 2 min read · LW link

Emergent Sycophancy

ohdearohdear 16 Dec 2025 20:21 UTC

8 points

0 comments · 5 min read · LW link