A different take on the “Off-switch” problem: Existential Logic as a safety net

kosi thu 13 Jan 2026 21:22 UTC

−5 points

1 comment · 1 min read · LW link

Analysing CoT alignment in thinking LLMs with low-dimensional steering

edoinni 13 Jan 2026 20:45 UTC

6 points

0 comments · 7 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments · 18 min read · LW link

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

James Sullivan 13 Jan 2026 19:28 UTC

11 points

0 comments · 5 min read · LW link

Claude Coworks

Zvi 13 Jan 2026 19:00 UTC

39 points

2 comments · 12 min read · LW link

(thezvi.wordpress.com)

Language models resemble more than just language cortex, show neuroscientists

Mordechai Rorvig 13 Jan 2026 18:05 UTC

7 points

0 comments · 1 min read · LW link

(www.foommagazine.org)

Schelling Coordination in LLMs: A Review

Callum-Luis Kindred 13 Jan 2026 16:25 UTC

10 points

1 comment · 8 min read · LW link

A tale of two doormen: a bizarre AI incident on Christmas

Rebecca Dai 13 Jan 2026 15:42 UTC

31 points

6 comments · 3 min read · LW link

(rebeccadai.substack.com)

Fixed Buckets Can’t (Phenomenally) Bind

algekalipso 13 Jan 2026 15:30 UTC

14 points

8 comments · 21 min read · LW link

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton

algekalipso 13 Jan 2026 15:28 UTC

7 points

3 comments · 15 min read · LW link

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

wassname 13 Jan 2026 12:55 UTC

6 points

0 comments · 11 min read · LW link

Contra Dance as a Model For Post-AI Culture

jefftk 13 Jan 2026 6:50 UTC

43 points

9 comments · 2 min read · LW link

(www.jefftk.com)

Making LLM Graders Consistent

Davey Morse 13 Jan 2026 3:32 UTC

9 points

0 comments · 1 min read · LW link

Attempting to influence transformer representations via initialization

speck1447 13 Jan 2026 0:49 UTC

11 points

0 comments · 10 min read · LW link

When does competition lead to recognisable values?

Jan_Kulveit, beren, David Duvenaud and Raymond Douglas

12 Jan 2026 23:13 UTC

66 points

18 comments · 25 min read · LW link

(postagi.org)

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless

Quinn and Max von Hippel

12 Jan 2026 22:32 UTC

102 points

10 comments · 7 min read · LW link

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan 12 Jan 2026 20:52 UTC

12 points

0 comments · 9 min read · LW link

Dating Roundup #10: Gendered Expectations

Zvi 12 Jan 2026 20:30 UTC

28 points

4 comments · 16 min read · LW link

(thezvi.wordpress.com)

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez 12 Jan 2026 19:55 UTC

9 points

0 comments · 1 min read · LW link

Tensor-Transformer Variants are Surprisingly Performant

Logan Riggs 12 Jan 2026 19:43 UTC

87 points

16 comments · 4 min read · LW link

The Algorithm Rewards Engagement

Wes F 12 Jan 2026 19:38 UTC

14 points

0 comments · 1 min read · LW link

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs

Daniel Wu 12 Jan 2026 19:36 UTC

10 points

0 comments · 12 min read · LW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez 12 Jan 2026 19:32 UTC

25 points

2 comments · 3 min read · LW link

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems

RiekeFruengel 12 Jan 2026 19:31 UTC

13 points

0 comments · 6 min read · LW link

Futarchy (and Tyranny of The Minority)

maxwickham 12 Jan 2026 19:27 UTC

4 points

1 comment · 8 min read · LW link

What Happens When Superhuman AIs Compete for Control?

steveld 12 Jan 2026 19:26 UTC

44 points

3 comments · 30 min read · LW link

(blog.ai-futures.org)

Brief Explorations in LLM Value Rankings

Tim Hua, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

12 Jan 2026 18:16 UTC

39 points

1 comment · 11 min read · LW link

Practical challenges of control monitoring in frontier AI deployments

David Lindner and Charlie Griffin

12 Jan 2026 16:45 UTC

19 points

0 comments · 1 min read · LW link

(arxiv.org)

Thinking vs Unfolding

Chris Scammell 12 Jan 2026 15:26 UTC

67 points

5 comments · 13 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz 12 Jan 2026 12:29 UTC

87 points

41 comments · 26 min read · LW link

Inter-branch communication in the multiverse via trapped ions

avturchin 12 Jan 2026 12:16 UTC

7 points

32 comments · 4 min read · LW link

--dangerously-skip-permissions

OhadA 12 Jan 2026 7:37 UTC

16 points

6 comments · 3 min read · LW link

Closing the loop

Screwtape 12 Jan 2026 6:37 UTC

30 points

1 comment · 2 min read · LW link

Announcing Inkhaven 2: April 2026

Ben Pace 12 Jan 2026 4:25 UTC

70 points

7 comments · 4 min read · LW link

[Question] What potent consumer technologies have long remained inaccessible?

TsviBT 12 Jan 2026 3:13 UTC

32 points

11 comments · 4 min read · LW link

Digital intentionality is not about productivity

mingyuan 12 Jan 2026 3:09 UTC

65 points

1 comment · 3 min read · LW link

(mingyuan.substack.com)

De pluribus non est disputandum

Jacob Goldsmith 12 Jan 2026 0:07 UTC

11 points

0 comments · 3 min read · LW link

Strong, bipartisan leadership for resistance to Trump.

Raemon 11 Jan 2026 23:08 UTC

82 points

85 comments · 2 min read · LW link

A Couple Useful LessWrong Userstyles

Alex Vermillion 11 Jan 2026 21:26 UTC

39 points

0 comments · 2 min read · LW link

Stretch Hatchback

jefftk 11 Jan 2026 16:40 UTC

12 points

8 comments · 2 min read · LW link

(www.jefftk.com)

We need a better way to evaluate emergent misalignment

yix and Broyojo

11 Jan 2026 16:21 UTC

86 points

9 comments · 6 min read · LW link

Should the AI Safety Community Prioritize Safety Cases?

Jan Wehner 11 Jan 2026 11:56 UTC

4 points

0 comments · 13 min read · LW link

Coding Agents As An Interface To The Codebase

omegastick 11 Jan 2026 10:31 UTC

16 points

5 comments · 3 min read · LW link

(dumbideas.xyz)

Why AIs aren’t power-seeking yet

Eli Tyre 11 Jan 2026 7:07 UTC

105 points

16 comments · 7 min read · LW link

Theoretical predictions on the sample efficiency of training policies and activation monitors

Alek Westover and Vivek Hebbar

10 Jan 2026 23:50 UTC

18 points

2 comments · 7 min read · LW link

If AI alignment is only as hard as building the steam engine, then we likely still die

MichaelDickens 10 Jan 2026 23:10 UTC

35 points

8 comments · 4 min read · LW link

How Humanity Wins

Wes R 10 Jan 2026 21:55 UTC

−20 points

10 comments · 4 min read · LW link

Possible Principles of Superagency

Mariven 10 Jan 2026 21:00 UTC

14 points

0 comments · 12 min read · LW link

(mariven.substack.com)

The Case Against Continuous Chain-of-Thought (Neuralese)

RobinHa 10 Jan 2026 20:32 UTC

11 points

8 comments · 5 min read · LW link

The false confidence theorem and Bayesian reasoning

viking_math 10 Jan 2026 17:14 UTC

24 points

19 comments · 6 min read · LW link