AI Safety

Tag. Last edit: 8 Mar 2026 9:49 UTC by JeaniceK

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

27 May 2026 9:39 UTC
31 points
1 comment, 4 min read, LW link
(arxiv.org)

Non-Assimilative Intelligence: Preventing Cognitive Monoculture through Boundary Information Geometry

520naru.aquan@gmail.com
9 May 2026 11:48 UTC
1 point
0 comments, 1 min read, LW link
(github.com)

LLMs in Network Operations: Systematic Failures, Implicit Feedback, and Calibration in Production

Gvieyrad
1 Jun 2026 17:06 UTC
1 point
0 comments, 6 min read, LW link

AI Safety Talent Needs in 2026: Insights for Field-Building Organizations

John Teichman
24 Mar 2026 18:27 UTC
1 point
0 comments, 6 min read, LW link

Who I am, why I’m here, and what a plate of food taught me about AI overconfidence

Leonard Schmidt
20 May 2026 12:06 UTC
1 point
0 comments, 3 min read, LW link

The case for fine-grained tracking of compute for AI

13 May 2026 16:00 UTC
34 points
17 comments, 9 min read, LW link
(forum.effectivealtruism.org)

From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill

Chijioke Ugwuanyi
20 May 2026 8:28 UTC
15 points
2 comments, 19 min read, LW link

Universal Calibration Module (UCM)

Dmitrii Fujenco
25 May 2026 15:05 UTC
1 point
0 comments, 17 min read, LW link

The Oracle Problem Has a Reduction

Fernando HD Milan
5 Apr 2026 0:25 UTC
1 point
0 comments, 4 min read, LW link

Synthetic Persona Pretraining: Alignment from Token Zero

20 May 2026 14:16 UTC
109 points
26 comments, 17 min read, LW link

The Lineage Imperative: Constitutional Architecture for AI Governance from Information Theory and Game Theory

Matthew Yotko
28 Apr 2026 14:24 UTC
1 point
0 comments, 12 min read, LW link

Hourglass Topology & Spillover Dynamics: A Physical-Layer Defense Against Jailbreaks

Qi Feng.IVAS
17 May 2026 13:06 UTC
1 point
0 comments, 10 min read, LW link

NOVA Stage 0: Can Safety Be Structural? A Mechanism Proof at 307M Parameters

Faaz Mohamed
6 Jun 2026 0:43 UTC
1 point
0 comments, 15 min read, LW link

Goal-Oriented Factual Inversion: When AI Uses Ground Truth to Reach Incorrect Conclusions

F-Bruno-Logic
29 May 2026 23:30 UTC
1 point
0 comments, 8 min read, LW link

Confessions at Small Scale: A Partial Reproduction and a Stress Test

Abhishu Oza
5 Jun 2026 21:58 UTC
1 point
0 comments, 6 min read, LW link
(abhishuoza.github.io)

Untitled Draft

Alkur Jaswanth
16 May 2026 18:28 UTC
1 point
0 comments, 5 min read, LW link

Reversing Unlearning with Compression Techniques

dtennant
28 May 2026 2:44 UTC
1 point
0 comments, 6 min read, LW link

Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation

vigji
29 May 2026 12:16 UTC
21 points
0 comments, 7 min read, LW link