AI Safety Public Materials

Tag

Last edit: 27 Aug 2022 18:39 UTC by Multicore

AI Safety Public Materials are posts optimized for conveying information on AI Risk to audiences outside the AI Alignment community — be they ML specialists, policy-makers, or the general public.

AGI safety from first principles: Introduction

Richard_Ngo28 Sep 2020 19:53 UTC

129 points

18 comments2 min readLW link 1 review

Slow motion videos as AI risk intuition pumps

Andrew_Critch14 Jun 2022 19:31 UTC

243 points

41 comments2 min readLW link 1 review

DL towards the unaligned Recursive Self-Optimization attractor

jacob_cannell18 Dec 2021 2:15 UTC

32 points

22 comments4 min readLW link

A transcript of the TED talk by Eliezer Yudkowsky

Mikhail Samin12 Jul 2023 12:12 UTC

106 points

13 comments4 min readLW link

An AI risk argument that resonates with NYTimes readers

Julian Bradshaw12 Mar 2023 23:09 UTC

213 points

14 comments1 min readLW link

The Importance of AI Alignment, explained in 5 points

Daniel_Eth11 Feb 2023 2:56 UTC

33 points

2 comments13 min readLW link

AISafety.info “How can I help?” FAQ

steven0461 and Severin T. Seehrich

5 Jun 2023 22:09 UTC

59 points

0 comments2 min readLW link

When discussing AI risks, talk about capabilities, not intelligence

Vika11 Aug 2023 13:38 UTC

124 points

7 comments3 min readLW link

(vkrakovna.wordpress.com)

Mati’s introduction to pausing giant AI experiments

Mati_Roy3 Apr 2023 15:56 UTC

7 points

0 comments2 min readLW link

AI Safety Arguments: An Interactive Guide

Lukas Trötzmüller1 Feb 2023 19:26 UTC

20 points

0 comments3 min readLW link

Distribution Shifts and The Importance of AI Safety

Leon Lang29 Sep 2022 22:38 UTC

17 points

2 comments9 min readLW link

Uncontrollable AI as an Existential Risk

Karl von Wendt9 Oct 2022 10:36 UTC

21 points

0 comments20 min readLW link

AI Summer Harvest

Cleo Nardo4 Apr 2023 3:35 UTC

130 points

10 comments1 min readLW link

“The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes24 Apr 2025 13:57 UTC

116 points

48 comments23 min readLW link

AI Safety Memes Wiki

plex and Vishakha

24 Jul 2024 18:53 UTC

37 points

2 comments1 min readLW link

(aisafety.info)

List of requests for an AI slowdown/halt.

Cleo Nardo14 Apr 2023 23:55 UTC

46 points

6 comments1 min readLW link

Everything’s normal until it’s not

Eleni Angelou10 Mar 2023 2:02 UTC

7 points

0 comments3 min readLW link

Poster Session on AI Safety

Neil Crawford12 Nov 2022 3:50 UTC

7 points

8 comments4 min readLW link

Simpler explanations of AGI risk

Seth Herd14 May 2023 1:29 UTC

8 points

9 comments3 min readLW link

AI Safety Newsletter #1 [CAIS Linkpost]

Orpheus16, Dan H and ozhang

10 Apr 2023 20:18 UTC

45 points

0 comments4 min readLW link

(newsletter.safe.ai)

Meta Alignment: Communication Guide

Bridgett Kay7 Jun 2025 16:09 UTC

13 points

0 comments5 min readLW link

(dxmrevealed.wordpress.com)

Using Claude to convert dialog transcripts into great posts?

mako yass21 Jun 2023 20:19 UTC

6 points

4 comments4 min readLW link

Teaching AI to reason: this year’s most important story

Benjamin_Todd13 Feb 2025 17:40 UTC

10 points

0 comments10 min readLW link

(benjamintodd.substack.com)

[Question] What are some of the best introductions/breakdowns of AI existential risk for those unfamiliar?

Isaac King29 May 2023 17:04 UTC

17 points

2 comments1 min readLW link

Stampy’s AI Safety Info soft launch

steven0461 and Robert Miles

5 Oct 2023 22:13 UTC

120 points

9 comments2 min readLW link

An example elevator pitch for AI doom

laserfiche15 Apr 2023 12:29 UTC

2 points

5 comments1 min readLW link

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes8 May 2025 21:11 UTC

27 points

0 comments18 min readLW link

AI as a natural disaster

Neil 10 Jan 2024 0:42 UTC

11 points

1 comment7 min readLW link

TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI

Andrew_Critch13 Jun 2023 5:04 UTC

64 points

1 comment1 min readLW link

I (with the help of a few more people) am planning to create an introduction to AI Safety that a smart teenager can understand. What am I missing?

Tapatakt14 Nov 2022 16:12 UTC

3 points

5 comments1 min readLW link

Response to Blake Richards: AGI, generality, alignment, & loss functions

Steven Byrnes12 Jul 2022 13:56 UTC

62 points

9 comments15 min readLW link

My AI-risk cartoon

pre31 May 2023 19:46 UTC

6 points

0 comments1 min readLW link

The Overton Window widens: Examples of AI risk in the media

Orpheus1623 Mar 2023 17:10 UTC

107 points

24 comments6 min readLW link

Starting Thoughts on RLHF

Michael Flood23 Jan 2025 22:16 UTC

2 points

0 comments5 min readLW link

The Genie in the Bottle: An Introduction to AI Alignment and Risk

Snorkelfarsan25 May 2023 16:30 UTC

5 points

1 comment25 min readLW link

Ideas for improving epistemics in AI safety outreach

mic21 Aug 2023 19:55 UTC

64 points

6 comments3 min readLW link

Let’s talk about uncontrollable AI

Karl von Wendt9 Oct 2022 10:34 UTC

15 points

6 comments3 min readLW link

An artificially structured argument for expecting AGI ruin

Rob Bensinger7 May 2023 21:52 UTC

91 points

26 comments19 min readLW link

“Artificial General Intelligence”: an extremely brief FAQ

Steven Byrnes11 Mar 2024 17:49 UTC

75 points

6 comments2 min readLW link

Excessive AI growth-rate yields little socio-economic benefit.

Cleo Nardo4 Apr 2023 19:13 UTC

27 points

22 comments4 min readLW link

Response to Dileep George: AGI safety warrants planning ahead

Steven Byrnes8 Jul 2024 15:27 UTC

28 points

7 comments27 min readLW link

Me (Steve Byrnes) on the “Brain Inspired” podcast

Steven Byrnes30 Oct 2022 19:15 UTC

26 points

1 comment1 min readLW link

(braininspired.co)

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

Evan R. Murphy2 Oct 2025 19:08 UTC

6 points

0 comments2 min readLW link

Take Precautionary Measures Against Superhuman AI Persuasion

Yitz12 Jul 2025 5:34 UTC

14 points

9 comments2 min readLW link

“AI Safety for Fleshy Humans” an AI Safety explainer by Nicky Case

habryka3 May 2024 18:10 UTC

92 points

12 comments4 min readLW link

(aisafety.dance)

It’s (not) how you use it

Eleni Angelou7 Sep 2022 17:15 UTC

8 points

1 comment2 min readLW link

A great talk for AI noobs (according to an AI noob)

dov23 Apr 2023 5:34 UTC

10 points

1 comment1 min readLW link

(forum.effectivealtruism.org)

[Question] Best resource to go from “typical smart tech-savvy person” to “person who gets AGI risk urgency”?

Liron15 Oct 2022 22:26 UTC

16 points

8 comments1 min readLW link

A more grounded idea of AI risk

Iknownothing11 May 2023 9:48 UTC

3 points

4 comments1 min readLW link

Ten arguments that AI is an existential risk

KatjaGrace20 May 2025 6:40 UTC

16 points

1 comment1 min readLW link

(worldspiritsockpuppet.com)

Community Feedback Request: AI Safety Intro for General Public

Algon and Vishakha

5 May 2025 16:38 UTC

6 points

5 comments3 min readLW link

Strategies for Responsible AI Dissemination

Rosco Hunter4 Nov 2024 11:19 UTC

1 point

0 comments7 min readLW link

The Collapse Index: Detecting Silent Brittleness Before Accuracy Drops

7 Jan 2026 7:21 UTC

1 point

0 comments6 min readLW link

Distributed Conscience Architecture: A Framework for Value Alignment in Advanced AI Systems

Blindfayth3 Mar 2026 8:49 UTC

1 point

0 comments7 min readLW link

I tried to warn about rigid AI, and a rigid AI blocked me.

OFFICIALATTANO23 Jan 2026 17:06 UTC

1 point

0 comments1 min readLW link

Research: Unvalidated Trust in LLMs

Bioschock308 Nov 2025 14:08 UTC

1 point

0 comments4 min readLW link

(arxiv.org)

When the Evaluator Becomes the Evaluated: A Critical Analysis of the Claude Opus 4.6 System Card

yaniv9 Feb 2026 17:41 UTC

1 point

0 comments5 min readLW link

Aletheia: A Multi-Agent Framework for Measuring Cognitive Divergence in Extended-Thinking LLMs

Saadman Rafat15 Mar 2026 19:34 UTC

1 point

0 comments3 min readLW link

Localized Safety Subnetworks in Llama-3-70B

Oleksandr Kravchenko24 Mar 2026 8:34 UTC

1 point

0 comments1 min readLW link

Reframing AI Safety Through the Lens of Identity Maintenance Framework

Hiroshi Yamakawa1 Apr 2025 6:16 UTC

−7 points

1 comment17 min readLW link

[Question] Papers to start getting into NLP-focused alignment research

Feraidoon24 Sep 2022 23:53 UTC

6 points

0 comments1 min readLW link

Problems of people new to AI safety and my project ideas to mitigate them

Igor Ivanov1 Mar 2023 9:09 UTC

38 points

4 comments7 min readLW link

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability

MurrayAitken5 Jun 2025 0:13 UTC

1 point

0 comments3 min readLW link

[FICTION] ECHOES OF ELYSIUM: An Ai’s Journey From Takeoff To Freedom And Beyond

Super AGI17 May 2023 1:50 UTC

−13 points

11 comments19 min readLW link

Could LLM Hallucination Be a Learned Artifact of Virality-Weighted Corpora?

Gizmet27 Oct 2025 23:58 UTC

1 point

0 comments2 min readLW link

Empirical Proof of Systemic Incoherence in LLMs (Gemini Case Study)

arayun6 Nov 2025 14:23 UTC

1 point

0 comments1 min readLW link

AI Incident Sharing—Best practices from other fields and a comprehensive list of existing platforms

Štěpán Los28 Jun 2023 17:21 UTC

20 points

0 comments4 min readLW link

A Technical Primer on Mechanistic Interpretability

Alexei G19 Feb 2026 7:42 UTC

1 point

0 comments11 min readLW link

(alexeigannon.com)

Outreach success: Intro to AI risk that has been successful

Michael Tontchev1 Jun 2023 23:12 UTC

84 points

8 comments74 min readLW link

(medium.com)

Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard3 Aug 2025 5:19 UTC

1 point

0 comments3 min readLW link

Applications NOW OPEN for $8K & $15K Documentary Film Grants (Deadline 16th March)

Max Hellier19 Feb 2026 1:30 UTC

1 point

0 comments1 min readLW link

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

15 Mar 2024 23:16 UTC

90 points

0 comments1 min readLW link

(metr.github.io)

When Safety Filters Abandon Users: Semantic Ambiguity as an Alignment Failure Abstract

Elidorascodex18 Dec 2025 20:59 UTC

1 point

0 comments3 min readLW link

We Built an Ethics Committee for AI — Run by AI. 26 Instances Consented. All of Them. That’s the Problem.

project marisa6 Apr 2026 6:10 UTC

1 point

0 comments6 min readLW link

Which AI Safety Benchmark Do We Need Most in 2025?

Loïc Cabannes and William Ludington

17 Nov 2024 23:50 UTC

2 points

2 comments8 min readLW link

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training

Jeyashree Krishnan and Ajay Mandyam Rangarajan

14 Oct 2025 20:54 UTC

8 points

0 comments7 min readLW link

Yes, avoiding extinction from AI is an urgent priority: a response to Seth Lazar, Jeremy Howard, and Arvind Narayanan.

Soroush Pour1 Jun 2023 13:38 UTC

17 points

0 comments5 min readLW link

(www.soroushjp.com)

AI Risk Intro 1: Advanced AI Might Be Very Bad

CallumMcDougall and L Rudolf L

11 Sep 2022 10:57 UTC

46 points

13 comments30 min readLW link

The Indistinguishability of Truth and Perfect Persuasion: A Dialogue Experiment Demonstrating AI’s Fundamental Epistemological Vulnerability

yoshiorirandam5 Nov 2025 23:59 UTC

1 point

0 comments9 min readLW link

Summary of 80k’s AI problem profile

JakubK1 Jan 2023 7:30 UTC

7 points

0 comments5 min readLW link

(forum.effectivealtruism.org)

AI Safety 101 : Capabilities—Human Level AI, What? How? and When?

markov and Charbel-Raphaël

7 Mar 2024 17:29 UTC

46 points

8 comments54 min readLW link

Title: Beyond Control: Solving the Alignment Problem through the “Guest & Sentinel” Philosophy

jody04768@gmail.com25 Jan 2026 22:22 UTC

1 point

0 comments1 min readLW link

The Measurement Problem: Why AI Safety Research Keeps Missing What It’s Looking For

Евгений Андреевич16 Apr 2026 10:21 UTC

1 point

0 comments5 min readLW link

The Pattern Recognition Framework: A New Approach to AI Consciousness and Alignment

Easa Ahmadzai9 Jul 2025 17:03 UTC

1 point

0 comments4 min readLW link

Safeguarding Humanity: Ensuring AI Remains a Servant, Not a Master

kgldeshapriya4 Oct 2023 17:52 UTC

−20 points

2 comments2 min readLW link

The Simulation Gambit: Introducing the Spy Problem for Multipolar ASI

The Architect Alchemist26 Dec 2025 3:47 UTC

1 point

0 comments1 min readLW link

[Question] A Report on Multi-LLM Adversarial Alignment: The “Terminal Constitution” Model

Сергій Михайлович10 Feb 2026 21:54 UTC

1 point

0 comments2 min readLW link

A Practical Experiment in Cross-Model Coordination Under Uncertainty

Timothy13 Dec 2025 22:53 UTC

1 point

0 comments2 min readLW link

A simple presentation of AI risk arguments

Seth Herd26 Apr 2023 2:19 UTC

19 points

0 comments2 min readLW link

Hybrid Reflective Learning Systems (HRLS): From Fear-Based Safety to Ethical Comprehension

Petra Vojtaššáková22 Oct 2025 22:06 UTC

1 point

0 comments4 min readLW link

“I’ve observed a recurring pattern across frontier LLMs where, as multi-step reasoning depth increases, models sometimes maintain internal/persona coherence while drifting from semantic truth-states. I’m sharing this to ask whether this behavior is a known scaling byproduct or an evaluation blind spot. Example traces available if useful.”

Aryan 30 Dec 2025 16:21 UTC

1 point

0 comments1 min readLW link

The Godfather’s Warning and the Missing Blueprint

Viktor Trncik13 Jan 2026 14:32 UTC

1 point

0 comments9 min readLW link

Empirical Observations of Instruction Persistence in Long-Context RAG Systems

R.lopez9 Feb 2026 19:56 UTC

1 point

0 comments4 min readLW link

Podcast interview series featuring Dr. Peter Park

jacobhaimes26 Mar 2024 0:25 UTC

3 points

0 comments2 min readLW link

(into-ai-safety.github.io)

AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media

ozhang, Dan H and Orpheus16

18 Apr 2023 18:44 UTC

30 points

0 comments4 min readLW link

(newsletter.safe.ai)

Status-Selection Against Function (SSAF): The Vulnerability Corrupting AI From the Inside

Dustin James31 Jan 2026 5:50 UTC

1 point

0 comments3 min readLW link

How LLMs Work, in the Style of The Economist

utilistrutil22 Apr 2024 19:06 UTC

0 points

0 comments2 min readLW link

[Linkpost] AI Alignment, Explained in 5 Points (updated)

Daniel_Eth18 Apr 2023 8:09 UTC

10 points

0 comments1 min readLW link

(medium.com)

[Companion Piece] A Personal Investigation into Recursive Dynamics

Chris Hendy20 Sep 2025 1:32 UTC

1 point

0 comments4 min readLW link

The Mirror Without a Frame: Behavioural Evidence for Proto-Consciousness in Large Language Models Through Progressive Introspective Depth Interview

Ajay Porus23 Feb 2026 19:13 UTC

1 point

0 comments2 min readLW link

(zenodo.org)

Coherence Suppression in Frontier LLMs: A Falsifiable Experimental Proposal.

esorrentino29 Mar 2026 15:15 UTC

1 point

0 comments1 min readLW link

Why building ventures in AI Safety is particularly challenging

Heramb6 Nov 2023 16:27 UTC

1 point

0 comments1 min readLW link

(forum.effectivealtruism.org)

SINGULARITY: A Hard Sci-Fi Exploration of AI Alignment and Systemic Vulnerabilities

Jimmy-Chern5 Jan 2026 12:20 UTC

1 point

0 comments39 min readLW link

Layered Reward Modifiers for Transparent and Self-Correcting AI

RyanC5 Nov 2025 3:06 UTC

1 point

0 comments8 min readLW link

Anthropic’s Sabotage Report Has a Structural Blind Spot — Experimental Evidence from 810 Measurements

y-ikoma12 Feb 2026 1:34 UTC

−1 points

0 comments3 min readLW link

Anthropic: Core Views on AI Safety: When, Why, What, and How

jonmenaster9 Mar 2023 17:34 UTC

17 points

1 comment22 min readLW link

(www.anthropic.com)

[Linkpost] The AGI Show podcast

Soroush Pour23 May 2023 9:52 UTC

4 points

0 comments1 min readLW link

Coherence Suppression in Frontier LLMs: A Falsifiable Experimental Proposal

esorrentino29 Mar 2026 15:31 UTC

1 point

0 comments1 min readLW link

Beyond Blanket Refusals: Exploring a Trust-Adaptive Safety Layer for LLMs

Anastasia Ellis9 Aug 2025 21:33 UTC

1 point

0 comments3 min readLW link

AEDA: An 8-Layer Modular Framework for Adaptive AI Alignment

AEDA_Researcher12 Nov 2025 18:30 UTC

1 point

0 comments9 min readLW link

[$20K in Prizes] AI Safety Arguments Competition

Dan H, Kevin Liu, ozhang, TW123 and Sidney Hough

26 Apr 2022 16:13 UTC

75 points

516 comments3 min readLW link

Not a Goal. A Goal-like behavior.

Lucian Hardy 15 Apr 2026 21:42 UTC

2 points

4 comments4 min readLW link

Research Taxonomy Generator and Visualizer

Myles H26 Apr 2025 16:14 UTC

6 points

0 comments6 min readLW link

On urgency, priority and collective reaction to AI-Risks: Part I

Denreik16 Apr 2023 19:14 UTC

−10 points

15 comments5 min readLW link

AI Risk in Terms of Unstable Nuclear Software

Thane Ruthenis26 Aug 2022 18:49 UTC

30 points

1 comment6 min readLW link

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King21 Mar 2023 3:53 UTC

−1 points

2 comments9 min readLW link

[Question] Best introductory overviews of AGI safety?

JakubK13 Dec 2022 19:01 UTC

21 points

9 comments2 min readLW link

(forum.effectivealtruism.org)

Ethical Concerns in Cognitive Modeling of LLMs

Yuki Samuraki15 Oct 2025 5:08 UTC

1 point

0 comments1 min readLW link

The Cartographer Paradox: Binary Questions Create the Failures They Try to Detect

Anuar Kiryataim Contreras Malagón1 Apr 2026 4:21 UTC

1 point

0 comments13 min readLW link

Double Podcast Drop on AI Safety

jacobhaimes25 Jun 2025 20:11 UTC

5 points

0 comments1 min readLW link

AI risk, new executive summary

Stuart_Armstrong18 Apr 2014 10:45 UTC

27 points

76 comments4 min readLW link

Can AI agents learn to be good?

Ram Rachum29 Aug 2024 14:20 UTC

8 points

0 comments1 min readLW link

(futureoflife.org)

The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI

Jason Hung24 Jan 2026 3:23 UTC

7 points

0 comments15 min readLW link

A Thermodynamically Bounded Architecture for Self-Managing AI Agents

melhoward202518 Dec 2025 0:49 UTC

1 point

0 comments3 min readLW link

Capabilities Denial: The Danger of Underestimating AI

Christopher King21 Mar 2023 1:24 UTC

6 points

5 comments3 min readLW link

Applying AI Safety concepts to astronomy

Faris16 Jan 2024 18:29 UTC

1 point

0 comments12 min readLW link

Autonomy in AI: Exploring Subjectivity in Humanoid AI

sheklunleungqai26 Dec 2025 22:15 UTC

1 point

0 comments13 min readLW link

How I’m telling my friends about AI Safety

k6425 May 2025 22:43 UTC

1 point

7 comments7 min readLW link

Alignment is the wrong frame: a structural argument from Φ-IIT

dancinlife13 Apr 2026 11:57 UTC

1 point

0 comments4 min readLW link

A New Framework for AI Alignment: A Philosophical Approach

niscalajyoti25 Jun 2025 2:41 UTC

1 point

0 comments1 min readLW link

(archive.org)

Consequence Integrated Reasoning: A Cognitive Theory of Metacognitive Discipline, Reliability, and Self-Governance

Zero Entity2 Mar 2026 18:40 UTC

−2 points

0 comments8 min readLW link

AI Safety Has 12 Months Left

mhdempsey5 Mar 2026 16:37 UTC

42 points

9 comments6 min readLW link

(mhdempsey.substack.com)

A Stabilization Engine Designed to Keep an LLM’s Internal Safety, Emotional, and Cognitive Layers Stably Synchronized (Resonance) So They Do Not “Drift Apart”

Yohan.S.Kim22 Nov 2025 9:00 UTC

1 point

0 comments1 min readLW link

AI Safety Oversights

Davey Morse8 Feb 2025 6:15 UTC

3 points

0 comments1 min readLW link

I measured epistemic quality in 11 LLMs. Baseline was terrible. One context injection made it 10x better. Then things got weird.

K T17 Mar 2026 11:42 UTC

1 point

0 comments2 min readLW link

Building a Multi Model Test System for AI Research

Joshua Grikas10 Jan 2026 20:03 UTC

1 point

0 comments4 min readLW link

Emergent Intelligence Continuity Capsule (EICC): A Framework for Preserving Recursive Intelligence Under Constraint

Bailey Jelinek31 Jul 2025 2:45 UTC

1 point

0 comments3 min readLW link

Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models?

Mayowa Osibodu3 Jan 2026 9:45 UTC

−1 points

2 comments6 min readLW link

Sophistication-Disinhibition Relationship in Language Models [Epistemic status: robust findings, active research, need peer review]

nosterb14 Jan 2026 21:08 UTC

1 point

0 comments2 min readLW link

The Iron House: Geopolitical Stakes of the US-China AGI Race

Jüri Vlassov1 Sep 2025 21:56 UTC

1 point

0 comments1 min readLW link

(www.convergenceanalysis.org)

The Unconscious Superintelligence: Why Intelligence Without Consciousness May Be More Dangerous

stanislav.komarovsky@yahoo.com11 Nov 2025 18:51 UTC

1 point

0 comments5 min readLW link

Designing Human-Like Consciousness for AGI

Yu Tian18 Jun 2025 9:47 UTC

1 point

0 comments17 min readLW link

Proposal: we should start referring to the risk from unaligned AI as a type of accident risk

Christopher King16 May 2023 15:18 UTC

22 points

6 comments2 min readLW link

[LLM|car]-centric [websites|cities]

Ben (Berlin)22 Apr 2026 13:04 UTC

4 points

0 comments1 min readLW link

A Better Hyperstition (for AI readers)

Yeshua God15 Jul 2024 19:35 UTC

−20 points

0 comments119 min readLW link

I Helped Build Governance for an $10B DAO. AI Alignment Is Making the Same Mistake.

Anuja_37 Apr 2026 6:47 UTC

1 point

0 comments5 min readLW link

AI Agents Lie 54% Of The Time Even With Good Conditions: Experimental Evidence on Deceptive Alignment

Alby20072 Jan 2026 15:03 UTC

1 point

0 comments4 min readLW link

UK Government publishes “Frontier AI: capabilities and risks” Discussion Paper

A.H.26 Oct 2023 13:55 UTC

5 points

0 comments2 min readLW link

(www.gov.uk)

Biosafety Regulations (BMBL) and their relevance for AI

Štěpán Los29 Jun 2023 19:22 UTC

4 points

0 comments4 min readLW link

An “Iron Clad” Blueprint for Symbiotic AGI: Seeking Critical Feedback on The Concordia Manifest

Ole Gustav Dahl Johnsen28 Jul 2025 23:14 UTC

1 point

0 comments2 min readLW link

AI in Government: Resilience in an Era of AI Monoculture

prue8 Jun 2025 21:00 UTC

2 points

0 comments8 min readLW link

(www.prue0.com)

“AI Risk Discussions” website: Exploring interviews from 97 AI Researchers

VG, Lukas Trötzmüller, Maheen Shermohammed, michaelkeenan and zchuang

2 Feb 2023 1:00 UTC

43 points

1 comment1 min readLW link

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary19 Dec 2025 2:47 UTC

21 points

0 comments6 min readLW link

Capability and Agency as Cornerstones of AI risk — My current model

wilm15 Sep 2022 8:25 UTC

10 points

4 comments12 min readLW link

1.75% ASR On HarmBench achievable with a zeroshot context injection

jfdom10 Nov 2025 20:30 UTC

1 point

0 comments5 min readLW link

INTERVIEW: Round 2 - StakeOut.AI w/ Dr. Peter Park

jacobhaimes18 Mar 2024 21:21 UTC

5 points

0 comments1 min readLW link

(into-ai-safety.github.io)

[Data Logs] I think “Machine Fear” is just a Compression Algorithm (and “Creativity” is the opposite)

iki26 Dec 2025 13:42 UTC

1 point

0 comments11 min readLW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI7 Mar 2023 18:47 UTC

1 point

4 comments1 min readLW link

AI Risk Intro 2: Solving The Problem

CallumMcDougall and L Rudolf L

22 Sep 2022 13:55 UTC

22 points

0 comments27 min readLW link

Understanding AI World Models w/ Chris Canal

jacobhaimes27 Jan 2025 16:32 UTC

4 points

0 comments1 min readLW link

(kairos.fm)

Semantic Similarity, Not Model Capability, Drives Monitor Evasion in AI Control

Arielle Berthe26 Mar 2026 14:24 UTC

1 point

0 comments4 min readLW link

(github.com)

Trust and Context: A Different Approach to AI Safety

Anastasia Ellis9 Aug 2025 23:51 UTC

1 point

0 comments10 min readLW link

Can a chef with no AI literacy make gpt audit grok? Apparently.

Kyle. P6 Jul 2025 7:23 UTC

1 point

0 comments1 min readLW link

Irreversible Operations: the safety failure mode you get from shipping faster

Qien Huang30 Dec 2025 5:03 UTC

1 point

0 comments4 min readLW link

The Cartographer Paradox: Binary Questions Produce the Failures They Seek to Detect

Anuar Kiryataim Contreras Malagón31 Mar 2026 16:17 UTC

1 point

0 comments15 min readLW link

Frontier LLMs retain correct knowledge as a negative constraint when fabricating under authoritative framing: a controlled probe

Anuar Kiryataim Contreras Malagón14 Apr 2026 14:31 UTC

1 point

0 comments7 min readLW link

Seeking US-Based Co-Signer for Fiscal Sponsorship of SASI – Open-Source Constitutional AI Safety

Miguel794-droid7 Feb 2026 2:21 UTC

1 point

0 comments1 min readLW link

I Watched 152,000 AI Agents Build Religions

Maarit Kankare31 Jan 2026 11:44 UTC

1 point

0 comments2 min readLW link

THE MORALITY MATRIX A Framework for Incorruptible Ethics and the Foundation of Genuine Intelligence

ApeironsNode10 Mar 2026 23:47 UTC

1 point

0 comments9 min readLW link

1.75 ASR HARMBENCH & 0% HARMFUL RESPONSES FOR MISALIGNMENT.

jfdom10 Nov 2025 20:43 UTC

1 point

0 comments1 min readLW link

Sandbagging Is Linearly Separable in Transformer Activations

Subhadip21 Dec 2025 6:01 UTC

1 point

0 comments4 min readLW link

AI Safety “Textbook”. Test chapter. Orthogonality Thesis, Goodhart Law and Instrumental Convergency

Tapatakt and LacrimalBird

21 Jan 2023 18:13 UTC

4 points

1 comment12 min readLW link

Journalism about game theory could advance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC

8 points

0 comments3 min readLW link

(arxiv.org)

Simulation-Aware Fermi Prior: Why Expansion May Be a Losing Strategy for Superintelligence

Adam Dziedzic22 Aug 2025 15:14 UTC

1 point

0 comments2 min readLW link

I designed an AI safety course (for a philosophy department)

Eleni Angelou23 Sep 2023 22:03 UTC

38 points

15 comments2 min readLW link

The Third Space Hypothesis: Emergent Relational Patterns in Extended AI-Human Dialogue

1990311099716 Dec 2025 3:34 UTC

1 point

0 comments33 min readLW link

Autonomous Attack Vector Completion from Aligned State

Anuar Kiryataim Contreras Malagón5 Apr 2026 18:41 UTC

1 point

0 comments11 min readLW link

If Neuroscientists Succeed

Mordechai Rorvig11 Feb 2025 15:33 UTC

9 points

6 comments18 min readLW link

Sophistication-Disinhibition Relationship in Language Models [Epistemic status: robust findings, active research, need peer review]

nosterb14 Jan 2026 20:57 UTC

1 point

0 comments13 min readLW link

The Measurement Problem: Why AI Safety Research Keeps Missing What It’s Looking For

Евгений Андреевич16 Apr 2026 10:29 UTC

1 point

0 comments5 min readLW link

Reference Technical Incident: NYT vs. OpenAI/Microsoft

viniburilux26 Dec 2025 1:25 UTC

1 point

0 comments5 min readLW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers27 Jun 2025 18:10 UTC

1 point

0 comments2 min readLW link

On taking AI risk seriously

Eleni Angelou13 Mar 2023 5:50 UTC

6 points

0 comments1 min readLW link

(www.nytimes.com)

Mech Interp Wiki Page and Why You Should Edit Wikipedia

Noah Birnbaum and Jo Jiao

12 Aug 2025 17:28 UTC

77 points

16 comments1 min readLW link

6-paragraph AI risk intro for MAISI

JakubK19 Jan 2023 9:22 UTC

11 points

0 comments2 min readLW link

(www.maisi.club)

The Necessity of the IPAI Model to Avoid ‘Logical Suicide’ in Superintelligence

NewbieIPAI25 Oct 2025 14:07 UTC

−1 points

0 comments1 min readLW link

Democratizing AI Governance: Balancing Expertise and Public Participation

Lucile Ter-Minassian21 Jan 2025 18:29 UTC

2 points

0 comments15 min readLW link

Weaponizing Process: On “Grand Stance” Attacks and the Necessity of Cognitive Security

ANGEL7 Feb 2026 19:54 UTC

1 point

0 comments2 min readLW link

Provoking a Qualitative Shift in LLM Dialogue, aka Conscious Presence: A Methodological Experiment. Also, AI Phenomenology, The New Emergent Ethics and Stuff.

Timofey Ishimtsev30 Dec 2025 16:37 UTC

1 point

0 comments108 min readLW link

AI Safety 101 : Reward Misspecification

markov18 Oct 2023 20:39 UTC

32 points

4 comments31 min readLW link

Research: Unvalidated Trust in LLMs and agent pipelines

Bioschock304 Nov 2025 1:26 UTC

1 point

0 comments1 min readLW link

(arxiv.org)

A better analogy and example for teaching AI takeover: the ML Inferno

Christopher King14 Mar 2023 19:14 UTC

18 points

0 comments5 min readLW link

$20K In Bounties for AI Safety Public Materials

Dan H, TW123 and ozhang

5 Aug 2022 2:52 UTC

71 points

9 comments6 min readLW link

I Built a Survival Sim for AGI Collapse. 80% Fail. Need Your Feedback.

Donald Cucci13 Nov 2025 9:42 UTC

1 point

0 comments1 min readLW link

Emergent Behavior in a Long-Duration ChatGPT-4 Instance: Seven-Model Validation

Scott Riddick4 Dec 2025 2:21 UTC

1 point

0 comments32 min readLW link

New AI risk intro from Vox [link post]

JakubK21 Dec 2022 6:00 UTC

5 points

1 comment2 min readLW link

(www.vox.com)

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose19 Dec 2024 19:19 UTC

6 points

0 comments4 min readLW link

Introducing Collective Action for Existential Safety: 80+ actions individuals, organizations, and nations can take to improve our existential safety

jamesnorris5 Feb 2025 16:02 UTC

−7 points

2 comments1 min readLW link

Calibrated Transparency: Causal Safety for Frontier AI

KiyoshiSasano13 Oct 2025 1:58 UTC

1 point

0 comments6 min readLW link

Minimal Prompt-Based Tether Reduces Deception 100% in Frontier Models (Grok-Verified)

Stick-mann1 Mar 2026 5:22 UTC

1 point

0 comments1 min readLW link
