Introspective Interpretability: a Definition, Motivation, and Open Problems

Belinda Li · 9 Feb 2026 23:53 UTC
10 points
0 comments · 13 min read · LW link

Job Listing (Closed): CBAI Operations Associate

9 Feb 2026 23:36 UTC
1 point
0 comments · 1 min read · LW link

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

jacob_drori · 9 Feb 2026 23:25 UTC
136 points
5 comments · 8 min read · LW link

Gwern’s 2025 Inkhaven Writing Interview

gwern · 9 Feb 2026 22:11 UTC
49 points
2 comments · 31 min read · LW link
(gwern.net)

Claude Opus 4.6: System Card Part 1: Mundane Alignment and Model Welfare

Zvi · 9 Feb 2026 21:30 UTC
36 points
5 comments · 26 min read · LW link
(thezvi.wordpress.com)

Closure

Vadim Golub · 9 Feb 2026 21:17 UTC
3 points
0 comments · 2 min read · LW link

Aurelius: Proposing Alignment as an Emergent Property

Austin McCaffrey · 9 Feb 2026 20:13 UTC
−5 points
0 comments · 1 min read · LW link
(github.com)

Distributed vs centralized agents

Richard_Ngo · 9 Feb 2026 20:06 UTC
51 points
9 comments · 1 min read · LW link

Stone Age Billionaire Can’t Words Good

Eneasz · 9 Feb 2026 18:51 UTC
169 points
95 comments · 12 min read · LW link
(deathisbad.substack.com)

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor · 9 Feb 2026 16:59 UTC
76 points
12 comments · 11 min read · LW link

the extraordinary as mundane

Derek DeHart · 9 Feb 2026 16:26 UTC
3 points
2 comments · 5 min read · LW link
(dehart.substack.com)

Large Language Models Live in Time

Eleni Angelou · 9 Feb 2026 15:08 UTC
20 points
2 comments · 4 min read · LW link

Sympathy for the Model, or, Welfare Concerns as Takeover Risk

J Bostock · 9 Feb 2026 14:19 UTC
42 points
37 comments · 3 min read · LW link

Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists

9 Feb 2026 12:55 UTC
118 points
13 comments · 8 min read · LW link

Does an AI Society Need an Immune System? Accepting Yampolskiy’s Impossibility Results

Hiroshi Yamakawa · 9 Feb 2026 12:32 UTC
13 points
0 comments · 10 min read · LW link

Can Hardware Save Us from Software?

Alvin Ånestrand · 9 Feb 2026 11:57 UTC
23 points
2 comments · 12 min read · LW link
(forecastingaifutures.substack.com)

Complexity Science as Bridge to Eastern Philosophy

pchvykov · 9 Feb 2026 10:40 UTC
1 point
2 comments · 2 min read · LW link

Design sketches for a more sensible world

9 Feb 2026 10:22 UTC
26 points
2 comments · 4 min read · LW link
(www.forethought.org)

Design sketches for angels-on-the-shoulder

9 Feb 2026 9:52 UTC
23 points
0 comments · 2 min read · LW link
(www.forethought.org)

Model Integrity and Character

Oliver Klingefjord · 9 Feb 2026 8:15 UTC
12 points
3 comments · 6 min read · LW link

Eleven Practical Ways to Prepare for AGI

John-Clark Levin · 9 Feb 2026 7:57 UTC
24 points
16 comments · 5 min read · LW link

The difference in risk/reward for Humanity as a super-organism vs. as a collection of individuals

ZhanRocks · 9 Feb 2026 7:53 UTC
1 point
0 comments · 1 min read · LW link

Answer in your head

throwaway835543 · 9 Feb 2026 7:41 UTC
16 points
2 comments · 3 min read · LW link

Evaluating Conflict of Interest

warner · 9 Feb 2026 7:30 UTC
1 point
0 comments · 2 min read · LW link

Three visions for diffuse control

Alek Westover · 9 Feb 2026 6:41 UTC
8 points
0 comments · 3 min read · LW link

Observations and Complexity

Ape in the coat · 9 Feb 2026 6:13 UTC
9 points
2 comments · 3 min read · LW link
(apeinthecoat102771.substack.com)

A Perfect Resurrection

MarkelKori · 9 Feb 2026 1:33 UTC
9 points
16 comments · 3 min read · LW link

Empathy Has Outworn Its Place in Politics

Character#2736 · 8 Feb 2026 23:22 UTC
−26 points
8 comments · 4 min read · LW link

The Two-Board Problem: Training Environment for Research Agents

Valerii K. · 8 Feb 2026 23:13 UTC
4 points
0 comments · 9 min read · LW link

Join My New Movement for the Post-AI World

E.G. Blee-Goldman · 8 Feb 2026 22:18 UTC
0 points
0 comments · 7 min read · LW link

Donations, The Fifth Year

jenn · 8 Feb 2026 22:04 UTC
39 points
0 comments · 4 min read · LW link
(www.jenn.site)

Every Measurement Has a Scale

CarolusRenniusVitellius · 8 Feb 2026 20:07 UTC
17 points
4 comments · 4 min read · LW link
(charlesr-w.github.io)

UtopiaBench

nielsrolf · 8 Feb 2026 18:19 UTC
67 points
10 comments · 1 min read · LW link

Smokey, This is not ’Nam Or: [Already] over the [red] line!

Davidmanheim · 8 Feb 2026 12:24 UTC
110 points
22 comments · 4 min read · LW link

The optimal age to freeze eggs is 19

GeneSmith · 8 Feb 2026 9:44 UTC
195 points
48 comments · 6 min read · LW link

It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda · 8 Feb 2026 3:44 UTC
103 points
15 comments · 4 min read · LW link

Claude’s Bad Primer Fanfic

abramdemski · 8 Feb 2026 0:39 UTC
24 points
12 comments · 54 min read · LW link

Can thoughtcrimes scare a cautious satisficer?

Knight Lee · 7 Feb 2026 23:28 UTC
4 points
4 comments · 1 min read · LW link

[Question] What should I try to do this year?

abstractapplic · 7 Feb 2026 22:06 UTC
36 points
4 comments · 1 min read · LW link

Does focusing on animal welfare make sense if you’re AI-pilled?

GradientDissenter · 7 Feb 2026 20:51 UTC
13 points
7 comments · 8 min read · LW link

Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics

eleweek · 7 Feb 2026 16:11 UTC
217 points
13 comments · 13 min read · LW link
(psychotechnology.substack.com)

Why yeast-based vaccines could be a big deal for biosecurity

delton137 · 7 Feb 2026 16:08 UTC
62 points
8 comments · 11 min read · LW link

Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning

megasilverfist · 7 Feb 2026 13:56 UTC
160 points
27 comments · 3 min read · LW link

Eunification: a Historical Perspective

Martin Sustrik · 7 Feb 2026 13:31 UTC
19 points
5 comments · 5 min read · LW link
(www.250bpm.com)

Voting Results for the 2024 Review

RobertM · 7 Feb 2026 3:48 UTC
98 points
0 comments · 1 min read · LW link

Playing with an Infrared Camera

jefftk · 7 Feb 2026 3:30 UTC
33 points
1 comment · 1 min read · LW link
(www.jefftk.com)

Honey, I shrunk the brain

Andy_McKenzie · 7 Feb 2026 0:01 UTC
128 points
1 comment · 5 min read · LW link
(neurobiology.substack.com)

Strategy of von Neumann and strategy of Rosenbergs

avturchin · 6 Feb 2026 22:50 UTC
5 points
4 comments · 2 min read · LW link

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

6 Feb 2026 19:27 UTC
10 points
0 comments · 4 min read · LW link

Parks Aren’t Nature

Sable · 6 Feb 2026 18:27 UTC
50 points
11 comments · 8 min read · LW link
(affablyevil.substack.com)