Help keep AI under human control: Palisade Research 2026 fundraiser

18 Dec 2025 23:41 UTC
105 points
66 comments
6 min read
LW link

OpenAI: Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations

18 Dec 2025 22:55 UTC
25 points
1 comment
1 min read
LW link
(alignment.openai.com)

Scalable End-to-End Interpretability

jsteinhardt 18 Dec 2025 22:37 UTC
120 points
3 comments
3 min read
LW link

My Trip to NeurIPS 2025

Adam Newgas 18 Dec 2025 22:31 UTC
15 points
0 comments
4 min read
LW link
(www.boristhebrave.com)

Leading by example

martinkunev 18 Dec 2025 20:30 UTC
3 points
2 comments
3 min read
LW link

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

18 Dec 2025 20:21 UTC
154 points
11 comments
8 min read
LW link
(arxiv.org)

A Study Of Instinct

LoganStrohl 18 Dec 2025 20:19 UTC
30 points
0 comments
4 min read
LW link

Estimating The Portion of Income Consumed By Essentials Between 1985 and 2025

Mars_Will_Be_Ours 18 Dec 2025 19:19 UTC
2 points
2 comments
3 min read
LW link
(shoutinginthedarkforest.substack.com)

Chemical (hunger) argument paraphrased

lemonhope 18 Dec 2025 18:58 UTC
10 points
7 comments
1 min read
LW link

BashArena: A Control Setting for Highly Privileged AI Agents

18 Dec 2025 18:19 UTC
58 points
0 comments
15 min read
LW link
(blog.redwoodresearch.org)

AI Safety Orgs Should Apply for Government Grants

DusanDNesic 18 Dec 2025 18:01 UTC
25 points
0 comments
5 min read
LW link

Good if make prior after data instead of before

dynomight 18 Dec 2025 17:53 UTC
117 points
18 comments
9 min read
LW link
(dynomight.net)

AI #147: Flash Forward

Zvi 18 Dec 2025 16:50 UTC
31 points
2 comments
58 min read
LW link
(thezvi.wordpress.com)

50 Things I Know

Rebecca Dai 18 Dec 2025 16:32 UTC
6 points
8 comments
7 min read
LW link
(rebeccadai.substack.com)

Announcing Spring 2026 AI Forecasting Benchmark

Ben Wilson 18 Dec 2025 15:43 UTC
2 points
0 comments
4 min read
LW link
(www.metaculus.com)

Deep Learning and Precipitation Reactions: A Tale of Universality

Max Hennick 18 Dec 2025 14:34 UTC
57 points
4 comments
18 min read
LW link

A Functional Typology of Cognitive Capabilities (Interactive Visualization)

Anurag 18 Dec 2025 14:06 UTC
2 points
0 comments
4 min read
LW link

The Undervalued Kleene Hierarchy

milanrosko 18 Dec 2025 11:57 UTC
10 points
2 comments
6 min read
LW link

[Paper] Self-Transparency Failures in Expert-Persona LLMs

Alex Diep 18 Dec 2025 9:09 UTC
8 points
0 comments
6 min read
LW link

Solstice Sundowners

teegs 18 Dec 2025 8:12 UTC
1 point
0 comments
1 min read
LW link

A basic case for donating to the Berkeley Genomics Project

TsviBT 18 Dec 2025 1:55 UTC
85 points
5 comments
4 min read
LW link

Apply to MATS Summer 2026!

18 Dec 2025 1:51 UTC
31 points
0 comments
1 min read
LW link

Making Linear Probes Interpretable

ZuiderveldTimJ 18 Dec 2025 1:48 UTC
17 points
0 comments
10 min read
LW link

A browser game about AI safety

NickSharp 17 Dec 2025 22:36 UTC
18 points
4 comments
1 min read
LW link

What if we could grow human tissue by recapitulating embryogenesis?

Abhishaike Mahajan 17 Dec 2025 21:53 UTC
23 points
0 comments
1 min read
LW link
(www.owlposting.com)

Transmitting Misalignment with Subliminal Learning via Paraphrasing

17 Dec 2025 19:34 UTC
39 points
0 comments
10 min read
LW link

Shallow review of technical AI safety, 2025

17 Dec 2025 18:18 UTC
191 points
9 comments
47 min read
LW link

Announcing RoastMyPost: LLMs Eval Blog Posts and More

ozziegooen 17 Dec 2025 18:10 UTC
110 points
17 comments
5 min read
LW link

Alignment Fine-Tuning: Lessons from Operant Conditioning

foodforthought 17 Dec 2025 16:57 UTC
5 points
4 comments
10 min read
LW link

Bryan Caplan on Ethical Intuitionism

vatsal_newsletter 17 Dec 2025 16:48 UTC
−5 points
0 comments
1 min read
LW link
(www.readvatsal.com)

The Bleeding Mind

Adele Lopez 17 Dec 2025 16:27 UTC
68 points
9 comments
6 min read
LW link

Could space debris block access to outer space?

fin 17 Dec 2025 15:59 UTC
12 points
5 comments
3 min read
LW link
(www.forethought.org)

An intuitive explanation of backdoor paths using DAGs

enterthewoods 17 Dec 2025 15:42 UTC
10 points
0 comments
6 min read
LW link

Still Too Soon

Gordon Seidoh Worley 17 Dec 2025 15:40 UTC
75 points
3 comments
2 min read
LW link
(www.uncertainupdates.com)

The $140K Question: Cost Changes Over Time

Zvi 17 Dec 2025 14:10 UTC
29 points
2 comments
18 min read
LW link
(thezvi.wordpress.com)

[Question] Can you recommend some reading about effective environmentalism?

SpectrumDT 17 Dec 2025 11:15 UTC
3 points
0 comments
1 min read
LW link

Memory Consolidation

Elliot Callender 17 Dec 2025 11:03 UTC
2 points
0 comments
2 min read
LW link
(substack.com)

On publishing every day for 30 days

Alexandre Variengien 17 Dec 2025 8:30 UTC
11 points
0 comments
5 min read
LW link
(alexandrevariengien.com)

Dancing in a World of Horseradish

lsusr 17 Dec 2025 5:50 UTC
136 points
31 comments
4 min read
LW link

Video and transcript of talk on human-like-ness in AI safety

Joe Carlsmith 17 Dec 2025 4:09 UTC
10 points
0 comments
36 min read
LW link

Lessons from a failed ambitious alignment program

Kabir Kumar 17 Dec 2025 1:50 UTC
57 points
5 comments
3 min read
LW link

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism

James Hoffend 17 Dec 2025 1:40 UTC
12 points
0 comments
4 min read
LW link

Announcing TARA: Receive (and Give) Technical AI Safety Training Without Leaving Your Home City

Zac Broeren 17 Dec 2025 1:33 UTC
5 points
0 comments
4 min read
LW link

Announcing: MIRI Technical Governance Team Research Fellowship

17 Dec 2025 0:02 UTC
61 points
5 comments
2 min read
LW link
(techgov.intelligence.org)

Non-Scheming Saints (Whether Human Or Digital) Might Be Shirking Their Governance Duties, And, If True, It Is Probably An Objective Tragedy

JenniferRM 16 Dec 2025 23:56 UTC
42 points
3 comments
9 min read
LW link

A Primer on Operant Conditioning

foodforthought 16 Dec 2025 21:26 UTC
5 points
0 comments
4 min read
LW link

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
39 points
1 comment
5 min read
LW link
(alignment.anthropic.com)

Measuring Drug Target Success

sarahconstantin 16 Dec 2025 21:00 UTC
19 points
3 comments
2 min read
LW link
(sarahconstantin.substack.com)

A Study in Attention

hamilton 16 Dec 2025 20:39 UTC
14 points
0 comments
2 min read
LW link

Emergent Sycophancy

ohdearohdear 16 Dec 2025 20:21 UTC
8 points
0 comments
5 min read
LW link