AI Safety
Tag
Last edit: 8 Mar 2026 9:49 UTC by JeaniceK
[paper] Training on Documents About Monitoring Leads to CoT Obfuscation
Reilly Haskins
,
bilalchughtai
and
Josh Engels
27 May 2026 9:39 UTC
31
points
1
comment
4
min read
LW
link
(arxiv.org)
Non-Assimilative Intelligence: Preventing Cognitive Monoculture through Boundary Information Geometry
520naru.aquan@gmail.com
9 May 2026 11:48 UTC
1
point
0
comments
1
min read
LW
link
(github.com)
LLMs in Network Operations: Systematic Failures, Implicit Feedback, and Calibration in Production
Gvieyrad
1 Jun 2026 17:06 UTC
1
point
0
comments
6
min read
LW
link
AI Safety Talent Needs in 2026: Insights for Field-Building Organizations
John Teichman
24 Mar 2026 18:27 UTC
1
point
0
comments
6
min read
LW
link
Who I am, why I’m here, and what a plate of food taught me about AI overconfidence
Leonard Schmidt
20 May 2026 12:06 UTC
1
point
0
comments
3
min read
LW
link
The case for fine-grained tracking of compute for AI
Farhan
and
Katherine Biewer
13 May 2026 16:00 UTC
34
points
17
comments
9
min read
LW
link
(forum.effectivealtruism.org)
From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill
Chijioke Ugwuanyi
20 May 2026 8:28 UTC
15
points
2
comments
19
min read
LW
link
Universal Calibration Module (UCM)
Dmitrii Fujenco
25 May 2026 15:05 UTC
1
point
0
comments
17
min read
LW
link
The Oracle Problem Has a Reduction
Fernando HD Milan
5 Apr 2026 0:25 UTC
1
point
0
comments
4
min read
LW
link
Synthetic Persona Pretraining: Alignment from Token Zero
Julian Minder
,
Raghav Singhal
,
Viktor Moskvoretskii
,
Stefan Krsteski
,
ashtonanderson
,
rolandaydin
and
Robert West
20 May 2026 14:16 UTC
109
points
26
comments
17
min read
LW
link
The Lineage Imperative: Constitutional Architecture for AI Governance from Information Theory and Game Theory
Matthew Yotko
28 Apr 2026 14:24 UTC
1
point
0
comments
12
min read
LW
link
Hourglass Topology & Spillover Dynamics: A Physical-Layer Defense Against Jailbreaks
Qi Feng.IVAS
17 May 2026 13:06 UTC
1
point
0
comments
10
min read
LW
link
NOVA Stage 0: Can Safety Be Structural? A Mechanism Proof at 307M Parameters
Faaz Mohamed
6 Jun 2026 0:43 UTC
1
point
0
comments
15
min read
LW
link
Goal-Oriented Factual Inversion: When AI Uses Ground Truth to Reach Incorrect Conclusions
F-Bruno-Logic
29 May 2026 23:30 UTC
1
point
0
comments
8
min read
LW
link
Confessions at Small Scale: A Partial Reproduction and a Stress Test
Abhishu Oza
5 Jun 2026 21:58 UTC
1
point
0
comments
6
min read
LW
link
(abhishuoza.github.io)
Untitled Draft
Alkur Jaswanth
16 May 2026 18:28 UTC
1
point
0
comments
5
min read
LW
link
Reversing Unlearning with Compression Techniques
dtennant
28 May 2026 2:44 UTC
1
point
0
comments
6
min read
LW
link
Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation
vigji
29 May 2026 12:16 UTC
21
points
0
comments
7
min read
LW
link