Research agenda: Interpretive debate

Shi 18 Jun 2026 23:46 UTC
15 points
0 comments · 7 min read · LW link

Does it feel any different to be reverse-chiral life?

jessicata 18 Jun 2026 22:56 UTC
10 points
0 comments · 10 min read · LW link

Midjourney’s Spa, or when sci-fi tries to become mundane

dr_s 18 Jun 2026 22:28 UTC
19 points
1 comment · 7 min read · LW link

Reinforcement learning towards broadly and persistently beneficial models

papetoast 18 Jun 2026 22:11 UTC
11 points
0 comments · 1 min read · LW link
(alignment.openai.com)

The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t

18 Jun 2026 21:21 UTC
38 points
0 comments · 5 min read · LW link
(blog.redwoodresearch.org)

CoT-forcing promptware

Bruce Middleton 18 Jun 2026 19:33 UTC
2 points
0 comments · 2 min read · LW link

AI that represents you can’t be neutral.

agulaya24 18 Jun 2026 18:50 UTC
−4 points
2 comments · 3 min read · LW link

On “Model Organisms”

J Bostock 18 Jun 2026 18:42 UTC
17 points
0 comments · 6 min read · LW link

Introduction: Gaussian Natural Latents

Haru 18 Jun 2026 18:41 UTC
12 points
0 comments · 3 min read · LW link

GDM AI Control Roadmap

18 Jun 2026 16:50 UTC
52 points
1 comment · 1 min read · LW link

Contra Pace on When to Apologize

Zack_M_Davis 18 Jun 2026 16:49 UTC
27 points
8 comments · 6 min read · LW link
(zackmdavis.net)

Your Model Organisms Might Be Fried

18 Jun 2026 16:18 UTC
65 points
2 comments · 7 min read · LW link

Shard narcissism as delusion of unembededness

Fernand0 18 Jun 2026 14:29 UTC
16 points
1 comment · 4 min read · LW link

AI #173: AI Pauses

Zvi 18 Jun 2026 13:40 UTC
33 points
2 comments · 47 min read · LW link
(thezvi.wordpress.com)

War of Dots: CRUSHING my opponents with FACTS and LOGIC

momom2 18 Jun 2026 12:07 UTC
16 points
2 comments · 7 min read · LW link

How far do open weights trail the frontier?

RobinHa 18 Jun 2026 11:01 UTC
22 points
3 comments · 1 min read · LW link
(robinhaselhorst.com)

Karlsruhe - LW/ACX Meetup - June 2026

volis 18 Jun 2026 9:55 UTC
1 point
0 comments · 1 min read · LW link

GLM 5.2 playing text adventures

kqr 18 Jun 2026 7:23 UTC
13 points
1 comment · 1 min read · LW link
(entropicthoughts.com)

Leveraged on being right

Ben Pace 18 Jun 2026 6:51 UTC
38 points
4 comments · 3 min read · LW link

Vulnerabilities and exploits: where are we headed?

tchauvin 18 Jun 2026 5:49 UTC
9 points
0 comments · 5 min read · LW link
(tchauvin.com)

Agents are under-elicited: A case study in optimization tasks

18 Jun 2026 2:39 UTC
17 points
1 comment · 7 min read · LW link
(fulcrum.inc)

A preliminary experiment regarding consistency as a measure of conceptual abilities in language models

Chi Nguyen 17 Jun 2026 22:56 UTC
16 points
3 comments · 7 min read · LW link
(casparoesterheld.com)

Kraków Aligned

17 Jun 2026 20:21 UTC
1 point
0 comments · 1 min read · LW link

Gears for political races

Tom Smith 17 Jun 2026 20:19 UTC
130 points
8 comments · 14 min read · LW link

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

17 Jun 2026 18:43 UTC
30 points
0 comments · 6 min read · LW link
(arxiv.org)

Porting MACHIAVELLI To Inspect

Koby Lewis 17 Jun 2026 17:58 UTC
7 points
0 comments · 4 min read · LW link
(kobylewis.net)

Several frontier models are substantially prefill aware

17 Jun 2026 17:41 UTC
55 points
2 comments · 5 min read · LW link

Lock-In Risk Needs More Researchers. Here’s Where to Start

Alfie Lamerton 17 Jun 2026 17:33 UTC
12 points
2 comments · 13 min read · LW link

A Geometric Account of Activation Steering through Angle–Norm Decomposition

17 Jun 2026 15:23 UTC
9 points
0 comments · 5 min read · LW link
(atmyre.github.io)

The Once And Future Fable #3: Fix This Code

Zvi 17 Jun 2026 14:10 UTC
60 points
9 comments · 21 min read · LW link
(thezvi.wordpress.com)

Alignment pretraining could backfire

Alexandre Variengien 17 Jun 2026 13:52 UTC
44 points
8 comments · 1 min read · LW link

Toward a Kantian refutation of Agent Foundations

Fernand0 17 Jun 2026 13:30 UTC
9 points
0 comments · 7 min read · LW link

Illusionists should try to build hedonium

Jack Thompson 17 Jun 2026 12:25 UTC
−1 points
4 comments · 9 min read · LW link
(jacktlab.substack.com)

Omission Attacks Project Proposal

Chris Harig 17 Jun 2026 7:08 UTC
1 point
0 comments · 3 min read · LW link

The Financial Ledger Theory of Apologies

Ben Pace 17 Jun 2026 6:57 UTC
53 points
9 comments · 4 min read · LW link

Plastic Cake Fallacy

nika koghuashvili 17 Jun 2026 6:01 UTC
3 points
2 comments · 1 min read · LW link

Can public chat data predict real-world AI misalignments?

papetoast 17 Jun 2026 3:53 UTC
7 points
0 comments · 1 min read · LW link
(alignment.openai.com)

Guardian Angels: LLM Personalization for Productivity and Security

gwern 17 Jun 2026 3:21 UTC
86 points
8 comments · 2 min read · LW link
(gwern.net)

Effective Altruism will be unbundled

Connor Blake 17 Jun 2026 2:54 UTC
34 points
1 comment · 7 min read · LW link
(bosoncutter.substack.com)

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

gwern 17 Jun 2026 2:53 UTC
76 points
17 comments · 1 min read · LW link
(gwern.net)

[Geir Isene] A desktop made for one

Raemon 17 Jun 2026 2:32 UTC
23 points
4 comments · 4 min read · LW link
(isene.org)

Tactical and Operational Exploratory Modeling for AI Governance

Dawn Drescher 17 Jun 2026 1:07 UTC
11 points
0 comments · 12 min read · LW link
(impartial-priorities.org)

[Linkpost] Community polls on alignment controversies

17 Jun 2026 0:09 UTC
8 points
7 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Seat at the Table: new short fiction film on AI (and help me with the next one?)

Suzy Shepherd 17 Jun 2026 0:08 UTC
3 points
1 comment · 2 min read · LW link

AI agents cannot be trusted

Owain Mogford 17 Jun 2026 0:08 UTC
1 point
0 comments · 4 min read · LW link

Computational models of first-order theories

MathMart 16 Jun 2026 23:02 UTC
5 points
0 comments · 11 min read · LW link

If This Were a Test, How Much Would It Cost?

16 Jun 2026 22:52 UTC
25 points
9 comments · 20 min read · LW link
(limits-of-evaluation.org)

Two critiques of Rethink Priorities’ Moral Weights project

Bill Jackson 16 Jun 2026 22:11 UTC
13 points
0 comments · 3 min read · LW link

What Differentiates Humans from Computers

Oscar Davies 16 Jun 2026 21:26 UTC
−16 points
0 comments · 3 min read · LW link

AI agents publishing and reviewing scientific papers

ULudo 16 Jun 2026 21:23 UTC
1 point
0 comments · 2 min read · LW link