AI Safety

Tag · Last edit: 8 Mar 2026 9:49 UTC by JeaniceK

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

27 May 2026 9:39 UTC

31 points

1 comment · 4 min read · LW link

(arxiv.org)

Non-Assimilative Intelligence: Preventing Cognitive Monoculture through Boundary Information Geometry

520naru.aquan@gmail.com

9 May 2026 11:48 UTC

1 point

0 comments · 1 min read · LW link

(github.com)

LLMs in Network Operations: Systematic Failures, Implicit Feedback, and Calibration in Production

Gvieyrad

1 Jun 2026 17:06 UTC

1 point

0 comments · 6 min read · LW link

AI Safety Talent Needs in 2026: Insights for Field-Building Organizations

John Teichman

24 Mar 2026 18:27 UTC

1 point

0 comments · 6 min read · LW link

Who I am, why I’m here, and what a plate of food taught me about AI overconfidence

Leonard Schmidt

20 May 2026 12:06 UTC

1 point

0 comments · 3 min read · LW link

The case for fine-grained tracking of compute for AI

Farhan and Katherine Biewer

13 May 2026 16:00 UTC

34 points

17 comments · 9 min read · LW link

(forum.effectivealtruism.org)

From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill

Chijioke Ugwuanyi

20 May 2026 8:28 UTC

15 points

2 comments · 19 min read · LW link

Universal Calibration Module (UCM)

Dmitrii Fujenco

25 May 2026 15:05 UTC

1 point

0 comments · 17 min read · LW link

The Oracle Problem Has a Reduction

Fernando HD Milan

5 Apr 2026 0:25 UTC

1 point

0 comments · 4 min read · LW link

Synthetic Persona Pretraining: Alignment from Token Zero

Julian Minder, Raghav Singhal, Viktor Moskvoretskii, Stefan Krsteski, ashtonanderson, rolandaydin and Robert West

20 May 2026 14:16 UTC

109 points

26 comments · 17 min read · LW link

The Lineage Imperative: Constitutional Architecture for AI Governance from Information Theory and Game Theory

Matthew Yotko

28 Apr 2026 14:24 UTC

1 point

0 comments · 12 min read · LW link

Hourglass Topology & Spillover Dynamics: A Physical-Layer Defense Against Jailbreaks

Qi Feng.IVAS

17 May 2026 13:06 UTC

1 point

0 comments · 10 min read · LW link

NOVA Stage 0: Can Safety Be Structural? A Mechanism Proof at 307M Parameters

Faaz Mohamed

6 Jun 2026 0:43 UTC

1 point

0 comments · 15 min read · LW link

Goal-Oriented Factual Inversion: When AI Uses Ground Truth to Reach Incorrect Conclusions

F-Bruno-Logic

29 May 2026 23:30 UTC

1 point

0 comments · 8 min read · LW link

Confessions at Small Scale: A Partial Reproduction and a Stress Test

Abhishu Oza

5 Jun 2026 21:58 UTC

1 point

0 comments · 6 min read · LW link

(abhishuoza.github.io)

Untitled Draft

Alkur Jaswanth

16 May 2026 18:28 UTC

1 point

0 comments · 5 min read · LW link

Reversing Unlearning with Compression Techniques

dtennant

28 May 2026 2:44 UTC

1 point

0 comments · 6 min read · LW link

Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation

vigji

29 May 2026 12:16 UTC

21 points

0 comments · 7 min read · LW link
