Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
91 points
14 comments
20 min read
LW link

Reflections on TA-ing Harvard’s first AI safety course

Roy Rinberg
15 Jan 2026 16:28 UTC
79 points
4 comments
9 min read
LW link

I Made a Judgment Calibration Game for Beginners (Calibrate)

Luise Woehlke
15 Jan 2026 15:04 UTC
15 points
2 comments
1 min read
LW link

AI #151: While Claude Coworks

Zvi
15 Jan 2026 14:30 UTC
38 points
5 comments
31 min read
LW link
(thezvi.wordpress.com)

Corrigibility Scales To Value Alignment

PeterMcCluskey
15 Jan 2026 0:05 UTC
13 points
12 comments
5 min read
LW link
(bayesianinvestor.com)

Deeper Reviews for the top 15 (of the 2024 Review)

Raemon
14 Jan 2026 23:59 UTC
45 points
4 comments
5 min read
LW link

If we get primary cruxes right, secondary cruxes will be solved automatically

Jordan Arel
14 Jan 2026 22:44 UTC
1 point
1 comment
4 min read
LW link

Boltzmann Tulpas

Mariven
14 Jan 2026 21:45 UTC
21 points
6 comments
13 min read
LW link
(mariven.substack.com)

Status In A Tribe Of One

J Bostock
14 Jan 2026 20:44 UTC
27 points
2 comments
2 min read
LW link

Quantifying Love and Hatred

RobinHa
14 Jan 2026 20:40 UTC
10 points
8 comments
1 min read
LW link

Why we are excited about confession!

14 Jan 2026 20:37 UTC
138 points
32 comments
9 min read
LW link
(alignment.openai.com)

Why Motivated Reasoning?

johnswentworth
14 Jan 2026 19:55 UTC
78 points
20 comments
5 min read
LW link

When Will They Take Our Jobs?

Zvi
14 Jan 2026 19:40 UTC
39 points
1 comment
8 min read
LW link
(thezvi.wordpress.com)

The Many Ways of Knowing

Gordon Seidoh Worley
14 Jan 2026 17:00 UTC
18 points
1 comment
5 min read
LW link
(www.uncertainupdates.com)

GD Roundup #4 - inference, monopolies, and AI Jesus

Raymond Douglas
14 Jan 2026 15:43 UTC
38 points
0 comments
6 min read
LW link

AI Safety at the Frontier: Paper Highlights of December 2025

gasteigerjo
14 Jan 2026 14:29 UTC
16 points
0 comments
7 min read
LW link
(aisafetyfrontier.substack.com)

Backyard cat fight shows Schelling points preexist language

jchan
14 Jan 2026 14:10 UTC
172 points
25 comments
3 min read
LW link

Parameters Are Like Pixels

omegastick
14 Jan 2026 13:45 UTC
15 points
6 comments
2 min read
LW link
(dumbideas.xyz)

[Closed] Apply to Vanessa’s mentorship at PIBBSS

Vanessa Kosoy
14 Jan 2026 9:15 UTC
40 points
0 comments
2 min read
LW link

Lit review of some international organisations

rosehadshar
14 Jan 2026 7:52 UTC
6 points
0 comments
22 min read
LW link
(www.forethought.org)

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively

Jordan Arel
14 Jan 2026 6:25 UTC
5 points
4 comments
2 min read
LW link

The Eternal Labyrinth

Bridgett Kay
14 Jan 2026 3:19 UTC
11 points
4 comments
15 min read
LW link
(dxmrevealed.wordpress.com)

How Much of AI Labs’ Research Is Safety?

Lennart Finke
14 Jan 2026 1:40 UTC
13 points
7 comments
3 min read
LW link

We need to make ourselves people the models can come to with problems

Lydia Nottingham
14 Jan 2026 0:43 UTC
21 points
2 comments
2 min read
LW link
(lydianottingham.substack.com)

A different take on the “Off-switch” problem: Existential Logic as a safety net

kosi thu
13 Jan 2026 21:22 UTC
−5 points
1 comment
1 min read
LW link

Analysing CoT alignment in thinking LLMs with low-dimensional steering

edoinni
13 Jan 2026 20:45 UTC
6 points
0 comments
7 min read
LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments
18 min read
LW link

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

James Sullivan
13 Jan 2026 19:28 UTC
11 points
0 comments
5 min read
LW link

Claude Coworks

Zvi
13 Jan 2026 19:00 UTC
39 points
2 comments
12 min read
LW link
(thezvi.wordpress.com)

Language models resemble more than just language cortex, show neuroscientists

Mordechai Rorvig
13 Jan 2026 18:05 UTC
7 points
0 comments
1 min read
LW link
(www.foommagazine.org)

Schelling Coordination in LLMs: A Review

Callum-Luis Kindred
13 Jan 2026 16:25 UTC
10 points
1 comment
8 min read
LW link

A tale of two doormen: a bizarre AI incident on Christmas

Rebecca Dai
13 Jan 2026 15:42 UTC
31 points
6 comments
3 min read
LW link
(rebeccadai.substack.com)

Fixed Buckets Can’t (Phenomenally) Bind

algekalipso
13 Jan 2026 15:30 UTC
14 points
8 comments
21 min read
LW link

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton

algekalipso
13 Jan 2026 15:28 UTC
7 points
3 comments
15 min read
LW link

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

wassname
13 Jan 2026 12:55 UTC
6 points
0 comments
11 min read
LW link

Contra Dance as a Model For Post-AI Culture

jefftk
13 Jan 2026 6:50 UTC
43 points
9 comments
2 min read
LW link
(www.jefftk.com)

Making LLM Graders Consistent

Davey Morse
13 Jan 2026 3:32 UTC
9 points
0 comments
1 min read
LW link

Attempting to influence transformer representations via initialization

speck1447
13 Jan 2026 0:49 UTC
11 points
0 comments
10 min read
LW link

When does competition lead to recognisable values?

12 Jan 2026 23:13 UTC
66 points
18 comments
25 min read
LW link
(postagi.org)

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless

12 Jan 2026 22:32 UTC
102 points
10 comments
7 min read
LW link

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan
12 Jan 2026 20:52 UTC
12 points
0 comments
9 min read
LW link

Dating Roundup #10: Gendered Expectations

Zvi
12 Jan 2026 20:30 UTC
28 points
4 comments
16 min read
LW link
(thezvi.wordpress.com)

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez
12 Jan 2026 19:55 UTC
9 points
0 comments
1 min read
LW link

Tensor-Transformer Variants are Surprisingly Performant

Logan Riggs
12 Jan 2026 19:43 UTC
87 points
16 comments
4 min read
LW link

The Algorithm Rewards Engagement

Wes F
12 Jan 2026 19:38 UTC
14 points
0 comments
1 min read
LW link

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs

Daniel Wu
12 Jan 2026 19:36 UTC
10 points
0 comments
12 min read
LW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez
12 Jan 2026 19:32 UTC
25 points
2 comments
3 min read
LW link

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems

RiekeFruengel
12 Jan 2026 19:31 UTC
13 points
0 comments
6 min read
LW link

Futarchy (and Tyranny of The Minority)

maxwickham
12 Jan 2026 19:27 UTC
4 points
1 comment
8 min read
LW link

What Happens When Superhuman AIs Compete for Control?

steveld
12 Jan 2026 19:26 UTC
44 points
3 comments
30 min read
LW link
(blog.ai-futures.org)