Supervillain Monologues Are Unrealistic

Algon 31 Oct 2025 23:58 UTC
94 points
18 comments · 2 min read · LW link

Secretly Loyal AIs: Threat Vectors and Mitigation Strategies

Dave Banerjee 31 Oct 2025 23:31 UTC
8 points
0 comments · 19 min read · LW link
(substack.com)

Ink without haven

Dentosal 31 Oct 2025 22:50 UTC
4 points
0 comments · 2 min read · LW link

Apply to the Cambridge ERA:AI Winter 2026 Fellowship

Kyle O’Brien 31 Oct 2025 22:26 UTC
5 points
3 comments · 1 min read · LW link

FAQ: Expert Survey on Progress in AI methodology

KatjaGrace 31 Oct 2025 16:51 UTC
15 points
0 comments · 19 min read · LW link
(blog.aiimpacts.org)

Social media feeds ‘misaligned’ when viewed through AI safety framework, show researchers

Mordechai Rorvig 31 Oct 2025 16:40 UTC
13 points
3 comments · 1 min read · LW link
(www.foommagazine.org)

Crossword Halloween 2025: Manmade Horrors

jchan 31 Oct 2025 16:19 UTC
7 points
0 comments · 1 min read · LW link

Debugging Despair ~> A bet about Satisfaction and Values

FireBrito de S. Gabriel 31 Oct 2025 14:00 UTC
2 points
0 comments · 2 min read · LW link

Halfhaven Digest #3

Taylor G. Lunt 31 Oct 2025 13:41 UTC
7 points
0 comments · 2 min read · LW link

OpenAI Moves To Complete Potentially The Largest Theft In Human History

Zvi 31 Oct 2025 13:20 UTC
77 points
12 comments · 19 min read · LW link
(thezvi.wordpress.com)

A (bad) Definition of AGI

spookyuser 31 Oct 2025 7:55 UTC
4 points
0 comments · 5 min read · LW link

Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems

31 Oct 2025 1:28 UTC
15 points
0 comments · 8 min read · LW link

Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence

David Lorell 31 Oct 2025 1:07 UTC
42 points
8 comments · 4 min read · LW link

Centralization begets stagnation

Algon 30 Oct 2025 23:49 UTC
6 points
0 comments · 2 min read · LW link

Summary and Comments on Anthropic’s Pilot Sabotage Risk Report

GradientDissenter 30 Oct 2025 20:19 UTC
29 points
0 comments · 5 min read · LW link

Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph

Elliot Temple 30 Oct 2025 20:06 UTC
2 points
0 comments · 28 min read · LW link

AI #140: Trying To Hold The Line

Zvi 30 Oct 2025 18:30 UTC
26 points
1 comment · 52 min read · LW link
(thezvi.wordpress.com)

Anthropic’s Pilot Sabotage Risk Report

dmz 30 Oct 2025 17:50 UTC
32 points
2 comments · 3 min read · LW link
(alignment.anthropic.com)

AISLE discovered three new OpenSSL vulnerabilities

Jan_Kulveit 30 Oct 2025 16:32 UTC
64 points
7 comments · 1 min read · LW link
(aisle.com)

Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

30 Oct 2025 15:34 UTC
144 points
22 comments · 14 min read · LW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
62 points
12 comments · 18 min read · LW link

On The Conservation of Rights

Roman Maksimovich 30 Oct 2025 13:48 UTC
−2 points
2 comments · 8 min read · LW link

When “HDMI-1” Lies To You

Gunnar_Zarncke 30 Oct 2025 12:23 UTC
18 points
0 comments · 1 min read · LW link

[Question] Why there is still one instance of Eliezer Yudkowsky?

RomanS 30 Oct 2025 12:00 UTC
−9 points
8 comments · 1 min read · LW link

Interview on the Hengshui Model High School

L.M.Sherlock 30 Oct 2025 10:26 UTC
21 points
2 comments · 30 min read · LW link
(lmsherlock.substack.com)

Transcendental Argumentation and the Epistemics of Discourse

0xA 30 Oct 2025 6:37 UTC
1 point
2 comments · 3 min read · LW link

Emergent Introspective Awareness in Large Language Models

Drake Thomas 30 Oct 2025 4:42 UTC
132 points
19 comments · 1 min read · LW link
(transformer-circuits.pub)

Introducing Aeonisk: an Open Source Game and Dataset with Graded Outcome Tiers of Counterfactual Reasoning

threeriversainexus 30 Oct 2025 3:02 UTC
1 point
0 comments · 4 min read · LW link

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents

Ziqian Zhong 30 Oct 2025 2:52 UTC
62 points
5 comments · 3 min read · LW link
(arxiv.org)

LLM Hallucinations: An Internal Tug of War

violazhong 30 Oct 2025 1:21 UTC
9 points
0 comments · 3 min read · LW link

Genius is Not About Genius

Algon 30 Oct 2025 0:00 UTC
14 points
1 comment · 2 min read · LW link

Quotes on OpenAI’s timelines to automated research, safety research, and safety collaborations before recursive self improvement

TheManxLoiner 29 Oct 2025 21:47 UTC
17 points
0 comments · 3 min read · LW link

An Opinionated Guide to Privacy Despite Authoritarianism

TurnTrout 29 Oct 2025 20:32 UTC
181 points
31 comments · 4 min read · LW link
(turntrout.com)

Unsureism: The Rational Approach to Religious Uncertainty

Taylor G. Lunt 29 Oct 2025 19:45 UTC
−7 points
3 comments · 5 min read · LW link

Why you shouldn’t eat meat if you hate factory farming

ceselder 29 Oct 2025 17:00 UTC
6 points
4 comments · 4 min read · LW link

The End of OpenAI’s Nonprofit Era

garrison 29 Oct 2025 16:28 UTC
41 points
0 comments · 9 min read · LW link
(www.obsolete.pub)

An intro to the Tensor Economics blog

harsimony 29 Oct 2025 16:24 UTC
15 points
0 comments · 12 min read · LW link
(splittinginfinity.substack.com)

Uncertain Updates: October 2025

Gordon Seidoh Worley 29 Oct 2025 16:10 UTC
3 points
0 comments · 1 min read · LW link
(www.uncertainupdates.com)

AI Doomers Should Raise Hell

James_Miller 29 Oct 2025 16:10 UTC
−2 points
9 comments · 6 min read · LW link

AISN #65: Measuring Automation and Superintelligence Moratorium Letter

29 Oct 2025 16:05 UTC
5 points
0 comments · 3 min read · LW link
(newsletter.safe.ai)

TBC Episode with Max Harms—Red Heart and If Anyone Builds It, Everyone Dies

Steven K Zuber 29 Oct 2025 15:49 UTC
13 points
0 comments · 1 min read · LW link
(www.thebayesianconspiracy.com)

[Question] Thresholds for Pascal’s Mugging?

MattAlexander 29 Oct 2025 14:54 UTC
22 points
12 comments · 8 min read · LW link

Please Do Not Sell B30A Chips to China

Zvi 29 Oct 2025 14:50 UTC
62 points
6 comments · 7 min read · LW link
(thezvi.wordpress.com)

Why Civilizations Are Unstable (And What This Means for AI Alignment)

Elias_Kunnas 29 Oct 2025 12:27 UTC
10 points
6 comments · 5 min read · LW link

What can we learn from parent-child-alignment for AI?

Karl von Wendt 29 Oct 2025 8:02 UTC
16 points
4 comments · 3 min read · LW link

Some data from LeelaPieceOdds

Jeremy Gillen 29 Oct 2025 4:27 UTC
69 points
21 comments · 3 min read · LW link

How Do We Evaluate the Quality of LLMs’ Mathematical Responses?

Miguel Angel 29 Oct 2025 1:37 UTC
5 points
0 comments · 13 min read · LW link

Visualizing a Platform for Live World Models

Kuil 29 Oct 2025 1:24 UTC
16 points
0 comments · 14 min read · LW link

[Question] Why Would we get Inner Misalignment by Default?

Coil 29 Oct 2025 1:23 UTC
3 points
0 comments · 2 min read · LW link

A Very Simple Model of AI Dealmaking

Cleo Nardo 29 Oct 2025 0:33 UTC
18 points
0 comments · 9 min read · LW link