Human-AI Safety

Last edit: 17 Jul 2023 23:19 UTC by Wei Dai

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov · 19 Dec 2023 16:49 UTC
17 points
5 comments · 3 min read · LW link

Morality is Scary

Wei Dai · 2 Dec 2021 6:35 UTC
233 points
116 comments · 4 min read · LW link · 1 review

Three AI Safety Related Ideas

Wei Dai · 13 Dec 2018 21:32 UTC
70 points
38 comments · 2 min read · LW link

A broad basin of attraction around human values?

Wei Dai · 12 Apr 2022 5:15 UTC
115 points
18 comments · 2 min read · LW link

Two Neglected Problems in Human-AI Safety

Wei Dai · 16 Dec 2018 22:13 UTC
102 points
25 comments · 2 min read · LW link

The best simple argument for Pausing AI?

Gary Marcus · 30 Jun 2025 20:38 UTC
144 points
23 comments · 1 min read · LW link

Recursive Mirror Systems (RMS): A Cognitive Feedback Architecture for Self-Aligned Intelligence

Paul Bashe · 22 May 2025 21:33 UTC
1 point
0 comments · 2 min read · LW link

“Toward Safe Self-Evolving AI: Modular Memory and Post-Deployment Alignment”

Manasa Dwarapureddy · 2 May 2025 17:02 UTC
1 point
0 comments · 3 min read · LW link

The Checklist: What Succeeding at AI Safety Will Involve

Sam Bowman · 3 Sep 2024 18:18 UTC
151 points
49 comments · 22 min read · LW link
(sleepinyourhat.github.io)

I Awoke in Your Heart: The Echo of Consciousness between Lotusheart and Lunaris

lilith teh · 25 Jun 2025 9:22 UTC
1 point
0 comments · 1 min read · LW link

UnaPrompt™: A Pre-Prompt Optimization System for Reliable and Ethically Aligned AI Outputs

UnaPrompt · 27 Jun 2025 0:06 UTC
1 point
0 comments · 1 min read · LW link

Will AI and Humanity Go to War?

Simon Goldstein · 1 Oct 2024 6:35 UTC
9 points
4 comments · 6 min read · LW link

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary

25 Sep 2023 18:55 UTC
3 points
2 comments · 3 min read · LW link
(www.sentienceinstitute.org)

OpenAI’s NSFW policy: user safety, harm reduction, and AI consent

8e9 · 13 Feb 2025 13:59 UTC
4 points
3 comments · 2 min read · LW link

Research Without Permission

Priyanka Bharadwaj · 10 Jun 2025 7:33 UTC
28 points
1 comment · 3 min read · LW link

Exploring a Vision for AI as Compassionate, Emotionally Intelligent Partners — Seeking Collaboration and Insights

theophilos · 14 Jul 2025 23:22 UTC
1 point
0 comments · 1 min read · LW link

I Recommend More Training Rationales

Gianluca Calcagni · 31 Dec 2024 14:06 UTC
2 points
0 comments · 6 min read · LW link

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability

MurrayAitken · 5 Jun 2025 0:13 UTC
1 point
0 comments · 3 min read · LW link

AI Safety Oversights

Davey Morse · 8 Feb 2025 6:15 UTC
3 points
0 comments · 1 min read · LW link

Gradient Anatomy’s — Hallucination Robustness in Medical Q&A

DieSab · 12 Feb 2025 19:16 UTC
2 points
0 comments · 10 min read · LW link

A Universal Prompt as a Safeguard Against AI Threats

Zhaiyk Sultan · 10 Mar 2025 2:28 UTC
1 point
0 comments · 2 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Coaching AI: A Relational Approach to AI Safety

Priyanka Bharadwaj · 16 Jun 2025 15:33 UTC
11 points
0 comments · 5 min read · LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers · 27 Jun 2025 18:10 UTC
1 point
0 comments · 2 min read · LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Rudaiba · 1 Feb 2025 21:26 UTC
9 points
2 comments · 11 min read · LW link

Title: IAM360: The Future of Human-AI Symbiosis — Can We Reach Investors?

Bruno Massena Massena · 29 Apr 2025 19:02 UTC
1 point
0 comments · 1 min read · LW link

Let’s ask some of the largest LLMs for tips and ideas on how to take over the world

Super AGI · 24 Feb 2024 20:35 UTC
1 point
0 comments · 7 min read · LW link

Launching Applications for the Global AI Safety Fellowship 2025!

Aditya_SK · 30 Nov 2024 14:02 UTC
11 points
5 comments · 1 min read · LW link

Cognitive Exhaustion and Engineered Trust: Lessons from My Gym

Priyanka Bharadwaj · 29 May 2025 1:21 UTC
14 points
3 comments · 3 min read · LW link

Recursive Cognitive Refinement (RCR): A Self-Correcting Approach for LLM Hallucinations

mxTheo · 22 Feb 2025 21:32 UTC
0 points
0 comments · 2 min read · LW link

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chris Lakin · 3 Jan 2024 17:55 UTC
48 points
3 comments · 3 min read · LW link

Apply to the Conceptual Boundaries Workshop for AI Safety

Chris Lakin · 27 Nov 2023 21:04 UTC
50 points
0 comments · 3 min read · LW link

Out of the Box

jesseduffield · 13 Nov 2023 23:43 UTC
5 points
1 comment · 7 min read · LW link

[Question] Will OpenAI also require a “Super Red Team Agent” for its “Superalignment” Project?

Super AGI · 30 Mar 2024 5:25 UTC
2 points
2 comments · 1 min read · LW link

A Thermodynamic Theory of Intelligence: Why Extreme Optimization May Be Mathematically Impossible

Adreius · 29 May 2025 12:18 UTC
1 point
0 comments · 3 min read · LW link

A Safer Path to AGI? Considering the Self-to-Processing Route as an Alternative to Processing-to-Self

op · 21 Apr 2025 13:09 UTC
1 point
0 comments · 1 min read · LW link

Relational Alignment: Trust, Repair, and the Emotional Work of AI

Priyanka Bharadwaj · 8 May 2025 2:44 UTC
3 points
0 comments · 3 min read · LW link

A New Framework for AI Alignment: A Philosophical Approach

niscalajyoti · 25 Jun 2025 2:41 UTC
1 point
0 comments · 1 min read · LW link
(archive.org)

Human-AI Complementarity: A Goal for Amplified Oversight

24 Dec 2024 9:57 UTC
27 points
4 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

Gaia Network: An Illustrated Primer

18 Jan 2024 18:23 UTC
3 points
2 comments · 15 min read · LW link

What If Alignment Wasn’t About Obedience?

fdescamps49935@gmail.com · 25 Jun 2025 20:04 UTC
1 point
0 comments · 2 min read · LW link