15 Jan 2026 16:33 UTC

91 points

14 comments

20 min read

LW link

Reflections on TA-ing Harvard’s first AI safety course

Roy Rinberg

15 Jan 2026 16:28 UTC

79 points

4 comments

9 min read

LW link

I Made a Judgment Calibration Game for Beginners (Calibrate)

Luise Woehlke

15 Jan 2026 15:04 UTC

15 points

2 comments

1 min read

LW link

AI #151: While Claude Coworks

Zvi

15 Jan 2026 14:30 UTC

38 points

5 comments

31 min read

LW link

(thezvi.wordpress.com)

Corrigibility Scales To Value Alignment

PeterMcCluskey

15 Jan 2026 0:05 UTC

13 points

12 comments

5 min read

LW link

(bayesianinvestor.com)

Deeper Reviews for the top 15 (of the 2024 Review)

Raemon

14 Jan 2026 23:59 UTC

45 points

4 comments

5 min read

LW link

If we get primary cruxes right, secondary cruxes will be solved automatically

Jordan Arel

14 Jan 2026 22:44 UTC

1 point

1 comment

4 min read

LW link

Boltzmann Tulpas

Mariven

14 Jan 2026 21:45 UTC

21 points

6 comments

13 min read

LW link

(mariven.substack.com)

Status In A Tribe Of One

J Bostock

14 Jan 2026 20:44 UTC

27 points

2 comments

2 min read

LW link

Quantifying Love and Hatred

RobinHa

14 Jan 2026 20:40 UTC

10 points

8 comments

1 min read

LW link

Why we are excited about confession!

Boaz Barak, Gabriel Wu and Manas Joglekar

14 Jan 2026 20:37 UTC

138 points

32 comments

9 min read

LW link

(alignment.openai.com)

Why Motivated Reasoning?

johnswentworth

14 Jan 2026 19:55 UTC

78 points

20 comments

5 min read

LW link

When Will They Take Our Jobs?

Zvi

14 Jan 2026 19:40 UTC

39 points

1 comment

8 min read

LW link

(thezvi.wordpress.com)

The Many Ways of Knowing

Gordon Seidoh Worley

14 Jan 2026 17:00 UTC

18 points

1 comment

5 min read

LW link

(www.uncertainupdates.com)

GD Roundup #4 - inference, monopolies, and AI Jesus

Raymond Douglas

14 Jan 2026 15:43 UTC

38 points

0 comments

6 min read

LW link

AI Safety at the Frontier: Paper Highlights of December 2025

gasteigerjo

14 Jan 2026 14:29 UTC

16 points

0 comments

7 min read

LW link

(aisafetyfrontier.substack.com)

Backyard cat fight shows Schelling points preexist language

jchan

14 Jan 2026 14:10 UTC

172 points

25 comments

3 min read

LW link

Parameters Are Like Pixels

omegastick

14 Jan 2026 13:45 UTC

15 points

6 comments

2 min read

LW link

(dumbideas.xyz)

[Closed] Apply to Vanessa’s mentorship at PIBBSS

Vanessa Kosoy

14 Jan 2026 9:15 UTC

40 points

0 comments

2 min read

LW link

Lit review of some international organisations

rosehadshar

14 Jan 2026 7:52 UTC

6 points

0 comments

22 min read

LW link

(www.forethought.org)

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively

Jordan Arel

14 Jan 2026 6:25 UTC

5 points

4 comments

2 min read

LW link

The Eternal Labyrinth

Bridgett Kay

14 Jan 2026 3:19 UTC

11 points

4 comments

15 min read

LW link

(dxmrevealed.wordpress.com)

How Much of AI Labs’ Research Is Safety?

Lennart Finke

14 Jan 2026 1:40 UTC

13 points

7 comments

3 min read

LW link

We need to make ourselves people the models can come to with problems

Lydia Nottingham

14 Jan 2026 0:43 UTC

21 points

2 comments

2 min read

LW link

(lydianottingham.substack.com)

A different take on the “Off-switch” problem: Existential Logic as a safety net

kosi thu

13 Jan 2026 21:22 UTC

−5 points

1 comment

1 min read

LW link

Analysing CoT alignment in thinking LLMs with low-dimensional steering

edoinni

13 Jan 2026 20:45 UTC

6 points

0 comments

7 min read

LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments

18 min read

LW link

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

James Sullivan

13 Jan 2026 19:28 UTC

11 points

0 comments

5 min read

LW link

Claude Coworks

Zvi

13 Jan 2026 19:00 UTC

39 points

2 comments

12 min read

LW link

(thezvi.wordpress.com)

Language models resemble more than just language cortex, show neuroscientists

Mordechai Rorvig

13 Jan 2026 18:05 UTC

7 points

0 comments

1 min read

LW link

(www.foommagazine.org)

Schelling Coordination in LLMs: A Review

Callum-Luis Kindred

13 Jan 2026 16:25 UTC

10 points

1 comment

8 min read

LW link

A tale of two doormen: a bizarre AI incident on Christmas

Rebecca Dai

13 Jan 2026 15:42 UTC

31 points

6 comments

3 min read

LW link

(rebeccadai.substack.com)

Fixed Buckets Can’t (Phenomenally) Bind

algekalipso

13 Jan 2026 15:30 UTC

14 points

8 comments

21 min read

LW link

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton

algekalipso

13 Jan 2026 15:28 UTC

7 points

3 comments

15 min read

LW link

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

wassname

13 Jan 2026 12:55 UTC

6 points

0 comments

11 min read

LW link

Contra Dance as a Model For Post-AI Culture

jefftk

13 Jan 2026 6:50 UTC

43 points

9 comments

2 min read

LW link

(www.jefftk.com)

Making LLM Graders Consistent

Davey Morse

13 Jan 2026 3:32 UTC

9 points

0 comments

1 min read

LW link

Attempting to influence transformer representations via initialization

speck1447

13 Jan 2026 0:49 UTC

11 points

0 comments

10 min read

LW link

When does competition lead to recognisable values?

Jan_Kulveit, beren, David Duvenaud and Raymond Douglas

12 Jan 2026 23:13 UTC

66 points

18 comments

25 min read

LW link

(postagi.org)

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless

Quinn and Max von Hippel

12 Jan 2026 22:32 UTC

102 points

10 comments

7 min read

LW link

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan

12 Jan 2026 20:52 UTC

12 points

0 comments

9 min read

LW link

Dating Roundup #10: Gendered Expectations

Zvi

12 Jan 2026 20:30 UTC

28 points

4 comments

16 min read

LW link

(thezvi.wordpress.com)

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez

12 Jan 2026 19:55 UTC

9 points

0 comments

1 min read

LW link

Tensor-Transformer Variants are Surprisingly Performant

Logan Riggs

12 Jan 2026 19:43 UTC

87 points

16 comments

4 min read

LW link

The Algorithm Rewards Engagement

Wes F

12 Jan 2026 19:38 UTC

14 points

0 comments

1 min read

LW link

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs

Daniel Wu

12 Jan 2026 19:36 UTC

10 points

0 comments

12 min read

LW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez

12 Jan 2026 19:32 UTC

25 points

2 comments

3 min read

LW link

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems

RiekeFruengel

12 Jan 2026 19:31 UTC

13 points

0 comments

6 min read

LW link

Futarchy (and Tyranny of The Minority)

maxwickham

12 Jan 2026 19:27 UTC

4 points

1 comment

8 min read

LW link

What Happens When Superhuman AIs Compete for Control?

steveld

12 Jan 2026 19:26 UTC

44 points

3 comments

30 min read

LW link

(blog.ai-futures.org)