A different take on the “Off-switch” problem: Existential Logic as a safety net

kosi thu · 13 Jan 2026 21:22 UTC
−5 points
1 comment · 1 min read · LW link

Analysing CoT alignment in thinking LLMs with low-dimensional steering

edoinni · 13 Jan 2026 20:45 UTC
6 points
0 comments · 7 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments · 18 min read · LW link

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

James Sullivan · 13 Jan 2026 19:28 UTC
11 points
0 comments · 5 min read · LW link

Claude Coworks

Zvi · 13 Jan 2026 19:00 UTC
39 points
2 comments · 12 min read · LW link
(thezvi.wordpress.com)

Language models resemble more than just language cortex, show neuroscientists

Mordechai Rorvig · 13 Jan 2026 18:05 UTC
7 points
0 comments · 1 min read · LW link
(www.foommagazine.org)

Schelling Coordination in LLMs: A Review

Callum-Luis Kindred · 13 Jan 2026 16:25 UTC
10 points
1 comment · 8 min read · LW link

A tale of two doormen: a bizarre AI incident on Christmas

Rebecca Dai · 13 Jan 2026 15:42 UTC
31 points
6 comments · 3 min read · LW link
(rebeccadai.substack.com)

Fixed Buckets Can’t (Phenomenally) Bind

algekalipso · 13 Jan 2026 15:30 UTC
14 points
8 comments · 21 min read · LW link

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton

algekalipso · 13 Jan 2026 15:28 UTC
7 points
3 comments · 15 min read · LW link

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

wassname · 13 Jan 2026 12:55 UTC
6 points
0 comments · 11 min read · LW link

Contra Dance as a Model For Post-AI Culture

jefftk · 13 Jan 2026 6:50 UTC
43 points
9 comments · 2 min read · LW link
(www.jefftk.com)

Making LLM Graders Consistent

Davey Morse · 13 Jan 2026 3:32 UTC
9 points
0 comments · 1 min read · LW link

Attempting to influence transformer representations via initialization

speck1447 · 13 Jan 2026 0:49 UTC
11 points
0 comments · 10 min read · LW link

When does competition lead to recognisable values?

12 Jan 2026 23:13 UTC
66 points
18 comments · 25 min read · LW link
(postagi.org)

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless

12 Jan 2026 22:32 UTC
102 points
10 comments · 7 min read · LW link

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan · 12 Jan 2026 20:52 UTC
12 points
0 comments · 9 min read · LW link

Dating Roundup #10: Gendered Expectations

Zvi · 12 Jan 2026 20:30 UTC
28 points
4 comments · 16 min read · LW link
(thezvi.wordpress.com)

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez · 12 Jan 2026 19:55 UTC
9 points
0 comments · 1 min read · LW link

Tensor-Transformer Variants are Surprisingly Performant

Logan Riggs · 12 Jan 2026 19:43 UTC
87 points
16 comments · 4 min read · LW link

The Algorithm Rewards Engagement

Wes F · 12 Jan 2026 19:38 UTC
14 points
0 comments · 1 min read · LW link

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs

Daniel Wu · 12 Jan 2026 19:36 UTC
10 points
0 comments · 12 min read · LW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez · 12 Jan 2026 19:32 UTC
25 points
2 comments · 3 min read · LW link

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems

RiekeFruengel · 12 Jan 2026 19:31 UTC
13 points
0 comments · 6 min read · LW link

Futarchy (and Tyranny of The Minority)

maxwickham · 12 Jan 2026 19:27 UTC
4 points
1 comment · 8 min read · LW link

What Happens When Superhuman AIs Compete for Control?

steveld · 12 Jan 2026 19:26 UTC
44 points
3 comments · 30 min read · LW link
(blog.ai-futures.org)

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
39 points
1 comment · 11 min read · LW link

Practical challenges of control monitoring in frontier AI deployments

12 Jan 2026 16:45 UTC
19 points
0 comments · 1 min read · LW link
(arxiv.org)

Thinking vs Unfolding

Chris Scammell · 12 Jan 2026 15:26 UTC
67 points
5 comments · 13 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz · 12 Jan 2026 12:29 UTC
87 points
41 comments · 26 min read · LW link

Inter-branch communication in the multiverse via trapped ions

avturchin · 12 Jan 2026 12:16 UTC
7 points
32 comments · 4 min read · LW link

--dangerously-skip-permissions

OhadA · 12 Jan 2026 7:37 UTC
16 points
6 comments · 3 min read · LW link

Closing the loop

Screwtape · 12 Jan 2026 6:37 UTC
30 points
1 comment · 2 min read · LW link

Announcing Inkhaven 2: April 2026

Ben Pace · 12 Jan 2026 4:25 UTC
70 points
7 comments · 4 min read · LW link

[Question] What potent consumer technologies have long remained inaccessible?

TsviBT · 12 Jan 2026 3:13 UTC
32 points
11 comments · 4 min read · LW link

Digital intentionality is not about productivity

mingyuan · 12 Jan 2026 3:09 UTC
65 points
1 comment · 3 min read · LW link
(mingyuan.substack.com)

De pluribus non est disputandum

Jacob Goldsmith · 12 Jan 2026 0:07 UTC
11 points
0 comments · 3 min read · LW link

Strong, bipartisan leadership for resistance to Trump.

Raemon · 11 Jan 2026 23:08 UTC
82 points
85 comments · 2 min read · LW link

A Couple Useful LessWrong Userstyles

Alex Vermillion · 11 Jan 2026 21:26 UTC
39 points
0 comments · 2 min read · LW link

Stretch Hatchback

jefftk · 11 Jan 2026 16:40 UTC
12 points
8 comments · 2 min read · LW link
(www.jefftk.com)

We need a better way to evaluate emergent misalignment

11 Jan 2026 16:21 UTC
86 points
9 comments · 6 min read · LW link

Should the AI Safety Community Prioritize Safety Cases?

Jan Wehner · 11 Jan 2026 11:56 UTC
4 points
0 comments · 13 min read · LW link

Coding Agents As An Interface To The Codebase

omegastick · 11 Jan 2026 10:31 UTC
16 points
5 comments · 3 min read · LW link
(dumbideas.xyz)

Why AIs aren’t power-seeking yet

Eli Tyre · 11 Jan 2026 7:07 UTC
105 points
16 comments · 7 min read · LW link

Theoretical predictions on the sample efficiency of training policies and activation monitors

10 Jan 2026 23:50 UTC
18 points
2 comments · 7 min read · LW link

If AI alignment is only as hard as building the steam engine, then we likely still die

MichaelDickens · 10 Jan 2026 23:10 UTC
35 points
8 comments · 4 min read · LW link

How Humanity Wins

Wes R · 10 Jan 2026 21:55 UTC
−20 points
10 comments · 4 min read · LW link

Possible Principles of Superagency

Mariven · 10 Jan 2026 21:00 UTC
14 points
0 comments · 12 min read · LW link
(mariven.substack.com)

The Case Against Continuous Chain-of-Thought (Neuralese)

RobinHa · 10 Jan 2026 20:32 UTC
11 points
8 comments · 5 min read · LW link

The false confidence theorem and Bayesian reasoning

viking_math · 10 Jan 2026 17:14 UTC
24 points
19 comments · 6 min read · LW link