Using fiction to imagine a pathway to friendly AGI

Rick Moss · 23 Feb 2026 23:48 UTC
3 points
0 comments · 2 min read · LW link

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Max Fomin · 23 Feb 2026 23:44 UTC
1 point
2 comments · 6 min read · LW link

The persona selection model

Sam Marks · 23 Feb 2026 22:56 UTC
176 points
53 comments · 43 min read · LW link
(alignment.anthropic.com)

Agenda Reflection: Testing Automated Alignment

Ariel_ · 23 Feb 2026 21:53 UTC
11 points
0 comments · 2 min read · LW link
(zenodo.org)

Claude Sonnet 4.6 Gives You Flexibility

Zvi · 23 Feb 2026 20:30 UTC
29 points
1 comment · 9 min read · LW link
(thezvi.wordpress.com)

Secrets of the LessWrong RSS Feed

Brendan Long · 23 Feb 2026 20:12 UTC
36 points
6 comments · 4 min read · LW link

Which questions can’t we punt?

Lizka · 23 Feb 2026 19:17 UTC
39 points
2 comments · 15 min read · LW link

Exponential GDP growth from linear growth in variety of goods

Will_Howard · 23 Feb 2026 18:50 UTC
4 points
2 comments · 5 min read · LW link
(open.substack.com)

Pre-training data poisoning likely makes installing secret loyalties easier

Joe Kwon · 23 Feb 2026 18:12 UTC
12 points
0 comments · 4 min read · LW link

The 2028 Global Intelligence Crisis—a finance-oriented vignette

Rasool · 23 Feb 2026 17:12 UTC
50 points
13 comments · 1 min read · LW link
(www.citriniresearch.com)

AI Impact Summit 2026: A Field Report

23 Feb 2026 16:58 UTC
38 points
1 comment · 9 min read · LW link

The map of the map is not the map

jimmy · 23 Feb 2026 16:54 UTC
18 points
3 comments · 9 min read · LW link

Fact-checking an AI optimist article in The Economist

ToSummarise · 23 Feb 2026 13:56 UTC
41 points
3 comments · 4 min read · LW link
(www.tosummarise.com)

Review: “We can’t disagree forever”

Martin Randall · 23 Feb 2026 13:17 UTC
15 points
0 comments · 3 min read · LW link

Why I Think Pause is Impossible

E.G. Blee-Goldman · 23 Feb 2026 11:58 UTC
1 point
4 comments · 6 min read · LW link

Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT

Jiachen Zhao · 23 Feb 2026 11:51 UTC
24 points
0 comments · 10 min read · LW link
(arxiv.org)

A World Without Violet: Peculiar Consequences of Granting Moral Status to Artificial Intelligences

Sever Topan · 23 Feb 2026 7:23 UTC
17 points
8 comments · 4 min read · LW link
(severtopan.substack.com)

Was It Owl a Dream?

Yovel Rom · 23 Feb 2026 5:07 UTC
17 points
4 comments · 4 min read · LW link
(yovelrom.substack.com)

Innate Immunity

joec · 23 Feb 2026 5:00 UTC
23 points
2 comments · 6 min read · LW link

Why I Transitioned: A Third (FtM) Perspective

Character#2736 · 23 Feb 2026 4:39 UTC
22 points
6 comments · 14 min read · LW link

The power of a simple 3-way truth scale

Bruce Lewis · 23 Feb 2026 2:41 UTC
4 points
2 comments · 2 min read · LW link

Storing Food

jefftk · 23 Feb 2026 1:40 UTC
77 points
9 comments · 2 min read · LW link
(www.jefftk.com)

Old SUNY Dorm Logic is not helping rural population collapse in NY.

Edd Schneider · 23 Feb 2026 1:28 UTC
9 points
4 comments · 3 min read · LW link

Changing the world for the worse

mingyuan · 22 Feb 2026 23:55 UTC
129 points
17 comments · 3 min read · LW link
(mingyuan.substack.com)

The Scalable Formal Oversight Research Program

Max von Hippel · 22 Feb 2026 22:40 UTC
34 points
4 comments · 9 min read · LW link

Adapters as Representational Hypotheses: What Adapter Methods Tell Us About Transformer Geometry

wassname · 22 Feb 2026 22:12 UTC
18 points
0 comments · 5 min read · LW link

A Dialectic on Classical Utilitarianism

James Brobin · 22 Feb 2026 19:32 UTC
1 point
1 comment · 2 min read · LW link

My RSS Reader is Done

Brendan Long · 22 Feb 2026 19:06 UTC
36 points
2 comments · 1 min read · LW link
(www.brendanlong.com)

What to Do About AGI

Gordon Seidoh Worley · 22 Feb 2026 19:00 UTC
18 points
1 comment · 2 min read · LW link

Mapping LLM attractor states

Adam Bricknell · 22 Feb 2026 18:10 UTC
18 points
8 comments · 3 min read · LW link

InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking

RobinHa · 22 Feb 2026 14:20 UTC
48 points
1 comment · 4 min read · LW link
(www.robinhaselhorst.com)

The world won’t end, but we should be ashamed for trying

George3d6 · 22 Feb 2026 13:01 UTC
−20 points
0 comments · 12 min read · LW link
(cerebralab.com)

First Forecasting Dojo Group Meetup

Vojtech Brynych · 22 Feb 2026 7:19 UTC
3 points
2 comments · 1 min read · LW link

Life’s paradox and AI’s accentuation of it

geyab46617 · 22 Feb 2026 4:50 UTC
−1 points
0 comments · 3 min read · LW link

Multiple Independent Semantic Axes in Gemma 3 270M

CharlesL · 22 Feb 2026 1:55 UTC
15 points
2 comments · 3 min read · LW link

A Taxonomy of Traces

aleph_four · 22 Feb 2026 1:28 UTC
0 points
0 comments · 10 min read · LW link

Hierarchical Goal Induction With Ethics

aleph_four · 22 Feb 2026 0:53 UTC
3 points
0 comments · 4 min read · LW link

Did Claude 3 Opus align itself via gradient hacking?

Fiora Starlight · 21 Feb 2026 22:24 UTC
391 points
49 comments · 20 min read · LW link

If you don’t feel deeply confused about AGI risk, something’s wrong

Dave Banerjee · 21 Feb 2026 15:34 UTC
95 points
18 comments · 5 min read · LW link
(open.substack.com)

Ponzi schemes as a demonstration of out-of-distribution generalization

TFD · 21 Feb 2026 13:19 UTC
9 points
2 comments · 6 min read · LW link
(www.thefloatingdroid.com)

LLMs and Literature: Where Value Actually Comes From

derelict5432 · 21 Feb 2026 13:16 UTC
13 points
13 comments · 4 min read · LW link

The Spectre haunting the “AI Safety” Community

Gabriel Alfour · 21 Feb 2026 11:14 UTC
233 points
28 comments · 6 min read · LW link
(cognition.cafe)

LessWrong’s goals overlap HowTruthful’s

Bruce Lewis · 21 Feb 2026 4:19 UTC
7 points
4 comments · 2 min read · LW link

Alignment to Evil

Matrice Jacobine · 21 Feb 2026 3:29 UTC
61 points
12 comments · 1 min read · LW link
(tetraspace.substack.com)

Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?

RogerDearnaley · 21 Feb 2026 1:59 UTC
40 points
4 comments · 5 min read · LW link

Robert Sapolsky Is Simply Not Talking About Compatibilism

Julius · 21 Feb 2026 1:27 UTC
26 points
4 comments · 8 min read · LW link
(thegreymatter.substack.com)

TT Self Study Journal # 7

TristanTrim · 21 Feb 2026 1:22 UTC
13 points
2 comments · 4 min read · LW link

How will we do SFT on models with opaque reasoning?

21 Feb 2026 0:00 UTC
32 points
17 comments · 7 min read · LW link

Agent-first context menus

Surya Kasturi · 20 Feb 2026 23:45 UTC
3 points
1 comment · 2 min read · LW link

Human perception of relational knowledge on graphical interfaces

Surya Kasturi · 20 Feb 2026 23:45 UTC
3 points
1 comment · 1 min read · LW link