A preliminary experiment regarding consistency as a measure of conceptual abilities in language models

Chi Nguyen
17 Jun 2026 22:56 UTC
16 points
3 comments
7 min read
(casparoesterheld.com)

Kraków Aligned

17 Jun 2026 20:21 UTC
1 point
0 comments
1 min read

Gears for political races

Tom Smith
17 Jun 2026 20:19 UTC
120 points
6 comments
14 min read

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

17 Jun 2026 18:43 UTC
22 points
0 comments
6 min read
(arxiv.org)

Porting MACHIAVELLI To Inspect

Koby Lewis
17 Jun 2026 17:58 UTC
7 points
0 comments
4 min read
(kobylewis.net)

Several frontier models are substantially prefill aware

17 Jun 2026 17:41 UTC
55 points
2 comments
5 min read

Lock-In Risk Needs More Researchers. Here’s Where to Start

Alfie Lamerton
17 Jun 2026 17:33 UTC
12 points
2 comments
13 min read

A Geometric Account of Activation Steering through Angle–Norm Decomposition

17 Jun 2026 15:23 UTC
9 points
0 comments
5 min read
(atmyre.github.io)

The Once And Future Fable #3: Fix This Code

Zvi
17 Jun 2026 14:10 UTC
59 points
9 comments
21 min read
(thezvi.wordpress.com)

Alignment pretraining could backfire

Alexandre Variengien
17 Jun 2026 13:52 UTC
44 points
8 comments
1 min read

Toward a Kantian refutation of Agent Foundations

Fernand0
17 Jun 2026 13:30 UTC
9 points
0 comments
7 min read

Illusionists should try to build hedonium

Jack Thompson
17 Jun 2026 12:25 UTC
1 point
4 comments
9 min read
(jacktlab.substack.com)

Omission Attacks Project Proposal

Chris Harig
17 Jun 2026 7:08 UTC
1 point
0 comments
3 min read

The Financial Ledger Theory of Apologies

Ben Pace
17 Jun 2026 6:57 UTC
62 points
9 comments
4 min read

Plastic Cake Fallacy

nika koghuashvili
17 Jun 2026 6:01 UTC
3 points
2 comments
1 min read

Can public chat data predict real-world AI misalignments?

papetoast
17 Jun 2026 3:53 UTC
5 points
0 comments
1 min read
(alignment.openai.com)

Guardian Angels: LLM Personalization for Productivity and Security

gwern
17 Jun 2026 3:21 UTC
85 points
8 comments
2 min read
(gwern.net)

Effective Altruism will be unbundled

Connor Blake
17 Jun 2026 2:54 UTC
33 points
1 comment
7 min read
(bosoncutter.substack.com)

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

gwern
17 Jun 2026 2:53 UTC
71 points
15 comments
1 min read
(gwern.net)

[Geir Isene] A desktop made for one

Raemon
17 Jun 2026 2:32 UTC
23 points
4 comments
4 min read
(isene.org)

Tactical and Operational Exploratory Modeling for AI Governance

Dawn Drescher
17 Jun 2026 1:07 UTC
11 points
0 comments
12 min read
(impartial-priorities.org)

[Linkpost] Community polls on alignment controversies

17 Jun 2026 0:09 UTC
8 points
7 comments
1 min read
(forum.effectivealtruism.org)

Seat at the Table: new short fiction film on AI (and help me with the next one?)

Suzy Shepherd
17 Jun 2026 0:08 UTC
3 points
1 comment
2 min read

AI agents cannot be trusted

Owain Mogford
17 Jun 2026 0:08 UTC
1 point
0 comments
4 min read

Computational models of first-order theories

MathMart
16 Jun 2026 23:02 UTC
5 points
0 comments
11 min read

If This Were a Test, How Much Would It Cost?

16 Jun 2026 22:52 UTC
25 points
9 comments
20 min read
(limits-of-evaluation.org)

Two critiques of Rethink Priorities’ Moral Weights project

Bill Jackson
16 Jun 2026 22:11 UTC
13 points
0 comments
3 min read

What Differentiates Humans from Computers

Oscar Davies
16 Jun 2026 21:26 UTC
−16 points
0 comments
3 min read

AI agents publishing and reviewing scientific papers

ULudo
16 Jun 2026 21:23 UTC
1 point
0 comments
2 min read

Two Classical Answers to “What do Two Variables Share?”

Haru
16 Jun 2026 20:02 UTC
14 points
1 comment
5 min read

Predicting LLM Safety Before Release by Simulating Deployment

16 Jun 2026 19:55 UTC
35 points
2 comments
1 min read

Dean Ball—Leviathan Waking: On Anthropic/USG, and a new era in AI governance

JohnofCharleston
16 Jun 2026 19:40 UTC
25 points
0 comments
3 min read
(www.hyperdimensional.co)

Tips for Cracking the AI Safety Technical Interview

16 Jun 2026 18:42 UTC
2 points
0 comments
4 min read

1 Layer Induction Heads and Some Research

16 Jun 2026 18:09 UTC
10 points
2 comments
14 min read

Claims all the way down

Jasper Blank
16 Jun 2026 17:43 UTC
8 points
0 comments
9 min read

Upcoming CFAR Workshop: September 30th to October 4th, SF Bay Area

16 Jun 2026 17:01 UTC
22 points
0 comments
1 min read

Extreme Rationality: Still Not That Great

eluator
16 Jun 2026 16:41 UTC
20 points
2 comments
40 min read

Angles of attack for continual learning safety

16 Jun 2026 16:15 UTC
47 points
0 comments
13 min read

Fable and Mythos: Model Welfare

Zvi
16 Jun 2026 16:01 UTC
51 points
1 comment
15 min read
(thezvi.wordpress.com)

The desire to end the world

avturchin
16 Jun 2026 14:56 UTC
19 points
12 comments
2 min read

Simpler User Interfaces in an AI Future

Adam Chlipala
16 Jun 2026 14:48 UTC
1 point
0 comments
7 min read

A 400-year timeline of failed attempts to fix a lethal bug in the human software of inherited concepts

Bruce Middleton
16 Jun 2026 13:44 UTC
29 points
8 comments
5 min read

How the AI Village works

Adam B
16 Jun 2026 12:10 UTC
30 points
0 comments
8 min read
(theaidigest.org)

Where Do Young Rationalists Go?

fluxxrider
16 Jun 2026 5:36 UTC
12 points
2 comments
1 min read

Rationality Quotes, June ’26

Ben Pace
16 Jun 2026 3:44 UTC
21 points
3 comments
2 min read

A Test Suite for Concepts

Gretta Duleba
16 Jun 2026 2:41 UTC
48 points
8 comments
6 min read

Inventing Consciousness

vasilisk
16 Jun 2026 1:10 UTC
1 point
0 comments
5 min read

Synthetic document finetuning for instilling positive traits

16 Jun 2026 0:04 UTC
57 points
1 comment
10 min read

Does preservation make sense before we know how to revive?

Aurelia
15 Jun 2026 23:40 UTC
83 points
2 comments
25 min read

Finding pi and G in Mathland

Fernand0
15 Jun 2026 19:18 UTC
2 points
8 comments
2 min read