Did Claude 3 Opus align itself via gradient hacking?

Fiora Starlight 21 Feb 2026 22:24 UTC
391 points
49 comments 20 min read LW link

If you don’t feel deeply confused about AGI risk, something’s wrong

Dave Banerjee 21 Feb 2026 15:34 UTC
95 points
18 comments 5 min read LW link
(open.substack.com)

Ponzi schemes as a demonstration of out-of-distribution generalization

TFD 21 Feb 2026 13:19 UTC
9 points
2 comments 6 min read LW link
(www.thefloatingdroid.com)

LLMs and Literature: Where Value Actually Comes From

derelict5432 21 Feb 2026 13:16 UTC
13 points
13 comments 4 min read LW link

The Spectre haunting the “AI Safety” Community

Gabriel Alfour 21 Feb 2026 11:14 UTC
233 points
28 comments 6 min read LW link
(cognition.cafe)

LessWrong’s goals overlap HowTruthful’s

Bruce Lewis 21 Feb 2026 4:19 UTC
7 points
4 comments 2 min read LW link

Alignment to Evil

Matrice Jacobine 21 Feb 2026 3:29 UTC
61 points
12 comments 1 min read LW link
(tetraspace.substack.com)

Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?

RogerDearnaley 21 Feb 2026 1:59 UTC
40 points
4 comments 5 min read LW link

Robert Sapolsky Is Simply Not Talking About Compatibilism

Julius 21 Feb 2026 1:27 UTC
26 points
4 comments 8 min read LW link
(thegreymatter.substack.com)

TT Self Study Journal # 7

TristanTrim 21 Feb 2026 1:22 UTC
13 points
2 comments 4 min read LW link

How will we do SFT on models with opaque reasoning?

21 Feb 2026 0:00 UTC
32 points
17 comments 7 min read LW link

Agent-first context menus

Surya Kasturi 20 Feb 2026 23:45 UTC
3 points
1 comment 2 min read LW link

Human perception of relational knowledge on graphical interfaces

Surya Kasturi 20 Feb 2026 23:45 UTC
3 points
1 comment 1 min read LW link

Hodoscope: Visualization for Efficient Human Supervision

20 Feb 2026 23:41 UTC
9 points
0 comments 2 min read LW link
(hodoscope.dev)

Carrot-Parsnip: A Social Deduction Game for LLM Evals

Bicuspid Valve 20 Feb 2026 23:06 UTC
11 points
0 comments 7 min read LW link

Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks?

saahir.vazirani 20 Feb 2026 21:38 UTC
6 points
0 comments 5 min read LW link

METR’s 14h 50% Horizon Impacts The Economy More Than ASI Timelines

Michaël Trazzi 20 Feb 2026 21:08 UTC
45 points
11 comments 2 min read LW link

New video from Palisade Research: No One Understands Why AI Works

peterbarnett 20 Feb 2026 20:29 UTC
62 points
2 comments 1 min read LW link
(www.youtube.com)

Announcing: Iliad Intensive + Iliad Fellowship

20 Feb 2026 20:13 UTC
82 points
15 comments 1 min read LW link

ARENA 8.0 - Call for Applicants

20 Feb 2026 18:28 UTC
31 points
1 comment 6 min read LW link

Militaries are going autonomous. But will AI lead to new wars? A tour of recent research

Mordechai Rorvig 20 Feb 2026 18:26 UTC
1 point
0 comments 2 min read LW link
(www.foommagazine.org)

Unprecedented Catastrophes Have Non-Canonical Probabilities

E.G. Blee-Goldman 20 Feb 2026 18:23 UTC
6 points
2 comments 14 min read LW link

Mechanistic Interpretability of Biological Foundation Models

Ihor Kendiukhov 20 Feb 2026 18:01 UTC
34 points
1 comment 26 min read LW link

On Steven Byrnes’ ruthless ASI, (dis)analogies with humans and alignment proposals

StanislavKrym 20 Feb 2026 15:32 UTC
9 points
2 comments 2 min read LW link

Some Questions For Democrats About Epstein

Alexander Turok 20 Feb 2026 15:24 UTC
−28 points
3 comments 4 min read LW link

AGI is Here

Gordon Seidoh Worley 20 Feb 2026 15:21 UTC
68 points
39 comments 2 min read LW link

Mind the Gap

Bridgett Kay 20 Feb 2026 14:35 UTC
6 points
0 comments 5 min read LW link
(dxmrevealed.wordpress.com)

AI #156 Part 2: Errors in Rhetoric

Zvi 20 Feb 2026 14:31 UTC
45 points
0 comments 32 min read LW link
(thezvi.wordpress.com)

AI for societal decision making—How promising is the space? 80,000 Hours profile

Zershaaneh Qureshi 20 Feb 2026 13:28 UTC
3 points
0 comments 2 min read LW link

How To Escape Super Mario Bros

omegastick 20 Feb 2026 11:54 UTC
70 points
8 comments 9 min read LW link
(dumbideas.xyz)

Human Fine-Tuning

20 Feb 2026 10:20 UTC
3 points
0 comments 16 min read LW link
(cognition.cafe)

The Problem of Counterevidence and the Futility of Theodicy

Ape in the coat 20 Feb 2026 7:36 UTC
2 points
6 comments 4 min read LW link
(substack.com)

A Claude Skill To Comment On Docs

Tim Hua 20 Feb 2026 2:28 UTC
26 points
1 comment 2 min read LW link

Cooperationism: first draft for a moral framework that does not require consciousness

Épiphanie Gédéon 19 Feb 2026 21:07 UTC
26 points
5 comments 8 min read LW link

Flamingos (among other things) reduce emergent misalignment

eekay 19 Feb 2026 19:17 UTC
13 points
3 comments 7 min read LW link

Funkering!

flying buttress 19 Feb 2026 18:14 UTC
13 points
0 comments 1 min read LW link

Subjectivity vs Agency: AI “Waking Up”?

Jonathan Moregård 19 Feb 2026 17:19 UTC
4 points
0 comments 5 min read LW link
(honestliving.substack.com)

You May Already Be Canadian

jefftk 19 Feb 2026 16:00 UTC
120 points
14 comments 1 min read LW link
(www.jefftk.com)

AI Researchers and Executives Continue to Underestimate the Near-Future Risks of Open Models

Andrew Dickson 19 Feb 2026 15:56 UTC
23 points
1 comment 16 min read LW link

AI #156 Part 1: They Do Mean The Effect On Jobs

Zvi 19 Feb 2026 14:20 UTC
53 points
7 comments 36 min read LW link
(thezvi.wordpress.com)

Terminal Cynicism

19 Feb 2026 13:51 UTC
24 points
25 comments 10 min read LW link
(cognition.cafe)

How much information does an optimal policy contain about its environment?

19 Feb 2026 13:05 UTC
30 points
0 comments 10 min read LW link

All hands on deck to build the datacenter lie detector

Naci Cankaya 19 Feb 2026 11:42 UTC
32 points
2 comments 5 min read LW link
(open.substack.com)

A Technical Primer on Mechanistic Interpretability

Alexei G 19 Feb 2026 7:42 UTC
1 point
0 comments 11 min read LW link
(alexeigannon.com)

Power Laws Are Not Enough

CarolusRenniusVitellius 19 Feb 2026 4:31 UTC
10 points
3 comments 4 min read LW link
(charlesr-w.github.io)

Be skeptical of milestone announcements by young AI startups

lc 19 Feb 2026 4:19 UTC
25 points
0 comments 3 min read LW link

Opus 4.5 made a biodevice (w me)

Raye 19 Feb 2026 2:31 UTC
23 points
0 comments 10 min read LW link

Review of If Anyone Builds It, Everyone Dies

James Brobin 19 Feb 2026 1:53 UTC
23 points
4 comments 5 min read LW link

I want to actually get good at forecasting this year (Group Invite)

Vojtech Brynych 19 Feb 2026 1:41 UTC
12 points
4 comments 1 min read LW link

Does GPT-2 Represent Controversy? A Small Mech Interp Investigation

CharlesL 19 Feb 2026 1:36 UTC
6 points
0 comments 2 min read LW link