AI Safety Interventions
This tries to be a pretty comprehensive list of all AI safety, alignment, and control interventions.
Much of the collection was conducted as part of an internal report on the field for AE Studio under Diogo de Lucena. I’d like to thank Aaron Scher, who maintains the #papers-running-list channel in the AI Alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.
This post doesn’t try to explain all the interventions and provides only the tersest summaries. It serves as a sort of top-level index to all the relevant posts and papers. The much longer paper version of this post has additional summaries for the interventions (but fewer LW links) and can be found here.
AI disclaimer: Many of the summaries have been cowritten or edited with ChatGPT.
Please let me know about any link errors or any interventions I have overlooked, especially any entire types of intervention.
Table of Contents
Prior Overviews
Foundational Theories
Hard Methods: Formal Guarantees
Mechanistic and Mathematical Interpretability
Scalable Oversight and Alignment Training
Robustness and Adversarial Evaluation
Behavioral and Psychological Approaches
Operational Control and Infrastructure
Governance and Institutions
Underexplored Interventions
Prior Overviews
This consolidated report drew on the following prior efforts.
Comprehensive Surveys
AI Alignment: A Comprehensive Survey (Ji et al., 2023)
Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Ganguli et al., 2024) - a framework identifying 18 foundational challenges
The Circuits Research Landscape (Lindsey et al., 2025) - a comprehensive survey of interpretability methods
Control and Operational Approaches
An Overview of Control Measures (Greenblatt, 2023)
AI Assurance Technology Market Report 2024
Governance and Policy
Open Problems in Technical AI Governance (Reuel et al., 2024)
AI Governance to Avoid Extinction (Barnett & Scher, 2024)
2025 AI Safety Index (FLI, 2025) - an assessment of leading AI companies’ safety practices
Project Ideas and Research Directions
AI Alignment Research Project Ideas (BlueDot Impact, 2023)
AI Safety Map (2024) - an overview of the AI safety ecosystem as a map!
What Everyone in Technical Alignment is Doing and Why
Foundational Theories
See also AI Safety Arguments Guide
Embedded Agency
Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.
LessWrong Embedded Agency Tag
Demski & Garrabrant (2018): Agent Foundations
PreDCA Framework (Kosoy, 2022)
Decision Theory and Rational Choice
Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.
LessWrong Tag Decision Theory
Hutter (2005): Universal Artificial Intelligence
Functional Decision Theory (Yudkowsky & Soares, 2017)
Optimization and Mesa-Optimization
Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.
LessWrong Tag Mesa-Optimization
Risks from Learned Optimization
Clarifying Mesa-Optimization
Logical Induction
MIRI’s framework for reasoning under logical uncertainty with computable algorithms.
Garrabrant et al. (2016)
Cartesian Frames and Finite Factored Sets
LessWrong Tag Cartesian Frames
Cartesian Frames (Garrabrant et al., 2021)
Finite Factored Sets (Garrabrant, 2020)
Infra-Bayesianism and Logical Uncertainty
Handling uncertainty in logical domains and imperfect models.
LessWrong Tag Infra-Bayesianism
Elementary Introduction to Infra-Bayesianism
Hard Methods: Formal Guarantees
See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility
Neural Network Verification
Mathematical verification methods to prove properties about neural networks.
Zhang et al. (2022): α-β-CROWN
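For intuition, here is a minimal interval bound propagation (IBP) sketch: far simpler than α-β-CROWN, but it shows the basic move of pushing input bounds through a ReLU network to bound its outputs. The weights and perturbation radius below are toy placeholders.

```python
import numpy as np

def ibp_linear(lo, hi, W, b):
    """Propagate elementwise bounds [lo, hi] through y = x @ W.T + b."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    return lo @ W_pos.T + hi @ W_neg.T + b, hi @ W_pos.T + lo @ W_neg.T + b

def ibp_forward(lo, hi, layers):
    """layers: list of (W, b) with ReLU in between; returns output bounds."""
    for i, (W, b) in enumerate(layers):
        lo, hi = ibp_linear(lo, hi, W, b)
        if i < len(layers) - 1:  # ReLU is monotone, so it maps bounds to bounds
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# Toy usage: bound the logits over an L-infinity ball of radius 0.1 around x.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), rng.normal(size=8)),
          (rng.normal(size=(2, 8)), rng.normal(size=2))]
x = rng.normal(size=4)
out_lo, out_hi = ibp_forward(x - 0.1, x + 0.1, layers)
```

If the lower bound of the target logit exceeds the upper bound of every other logit, the classification is certified robust on that ball.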
Conformal Prediction
Adding confidence guarantees to existing models.
Abbasi-Yadkori et al. (2024): Mitigating LLM Hallucinations via Conformal Abstention
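A minimal split-conformal sketch for classification (not the abstention method of the cited paper): calibrate a nonconformity threshold on held-out data, then return prediction sets with finite-sample coverage. The "model" probabilities below are random placeholders.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: nonconformity score = 1 - p(true label)."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample corrected quantile gives >= 1 - alpha marginal coverage.
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(probs, q):
    """All labels whose nonconformity score is below the calibrated threshold."""
    return [np.where(1.0 - p <= q)[0] for p in probs]

# Toy usage with random stand-in 'model' probabilities.
rng = np.random.default_rng(0)
cal_probs, cal_labels = rng.dirichlet(np.ones(5), size=200), rng.integers(0, 5, 200)
q = conformal_threshold(cal_probs, cal_labels)
sets = prediction_set(rng.dirichlet(np.ones(5), size=3), q)
```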
Proof-Carrying Models
Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.
Necula (1997): Proof-Carrying Code
Chen et al. (2022): Proof-Carrying Models
Safe Reinforcement Learning (SafeRL)
Algorithms that maintain safety constraints during learning while maximizing returns.
García & Fernández (2015)
OmniSafe Framework (Ji et al., 2023)
Shielded RL
Integrating temporal logic monitors with learning systems to filter unsafe actions.
Alshiekh et al. (2018): Safe Reinforcement Learning via Shielding
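A minimal sketch of the shielding idea: the learner proposes actions in order of preference, and a monitor derived from a safety specification vetoes unsafe ones. The is_safe predicate and the gridworld are assumptions for illustration, not the construction in the cited paper.

```python
def shielded_action(state, policy_ranking, is_safe, fallback):
    """Return the learner's most-preferred action that the shield accepts."""
    for action in policy_ranking:
        if is_safe(state, action):
            return action
    return fallback(state)  # guaranteed-safe default if everything is vetoed

# Toy usage: never step left off column 0.
is_safe = lambda s, a: not (s["col"] == 0 and a == "left")
action = shielded_action({"col": 0}, ["left", "right", "stay"], is_safe, lambda s: "stay")
```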
Runtime Assurance Architectures (Simplex)
Combining high-performance unverified controllers with formally verified safety controllers.
Sha et al. (1996): Simplex Architecture
Phan et al. (2020): Resilient Simplex Architecture
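The switching logic at the heart of Simplex-style runtime assurance fits in a few lines; the recoverable-set check below is an assumed placeholder for whatever invariant has actually been verified.

```python
def simplex_step(state, advanced_controller, baseline_controller, in_recoverable_set):
    """Use the high-performance controller only while its action provably keeps
    the system inside a set from which the verified baseline can recover."""
    proposed = advanced_controller(state)
    if in_recoverable_set(state, proposed):
        return proposed                    # unverified, but within the safe envelope
    return baseline_controller(state)      # formally verified fallback takes over
```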
Safely Interruptible Agents
Theoretical framework for shutdown indifference.
LessWrong Tag Shutdown Problem
Orseau & Armstrong (2016): Safely Interruptible Agents
Provably Corrigible Agents
Using utility heads to ensure formal guarantees of corrigibility.
LessWrong Tag Corrigibility
Nayebi (2025)
Guaranteed Safe AI (GSAI)
Comprehensive framework for AI systems with quantitative, provable safety guarantees.
GSAI Project Page
Dalrymple et al. (2024): Towards Guaranteed Safe AI
Proofs of Autonomy
Extending formal verification to autonomous agents using cryptographic frameworks.
Grigor et al. (2025): Proofs of Autonomy: Scalable and Practical Verification of AI Autonomy
Mechanistic and Mathematical Interpretability
See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree
Circuit Analysis and Feature Discovery
Reverse-engineering neural representations into interpretable circuits.
Anthropic’s Transformer Circuits (Olah et al., 2020)
Lindsey et al. (2025): Circuits Research Landscape
Sparse Autoencoders (SAEs)
Extracting interpretable features by learning sparse representations of activations.
Cunningham et al. (2023)
Anthropic Monosemantic Features (Templeton et al., 2023)
Goodfire.ai Steering (2024)
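A minimal PyTorch sketch of the standard SAE recipe (reconstruction loss plus an L1 sparsity penalty on an overcomplete latent); the layer sizes and sparsity coefficient are illustrative, not the settings of the cited work.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_hidden=8 * 768):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # overcomplete dictionary
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        features = torch.relu(self.enc(acts))     # sparse feature activations
        return self.dec(features), features

def sae_loss(sae, acts, l1_coeff=1e-3):
    recon, feats = sae(acts)
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(32, 768)   # stand-in for cached residual-stream activations
sae_loss(sae, acts).backward()
```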
Feature Visualization
Understanding neural network representations through direct visualization.
Olah et al. (2017)
OpenAI Microscope (2020)
Linear Probes
Training linear classifiers on internal activations for scalable analysis of model behavior and persuasion dynamics.
Jaipersaud et al. (2024)
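The basic recipe is just logistic regression on cached activations; in the sketch below both the activations and the behavioral label are synthetic placeholders with a planted linear signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: cached hidden activations and a binary behavioral label.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))
labels = (acts @ rng.normal(size=256) > 0).astype(int)  # planted linear signal

probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print("held-out probe accuracy:", probe.score(acts[800:], labels[800:]))
```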
Attribution Graphs
Interactive visualizations of feature-feature interactions.
Lindsey et al. (2025)
Causal Scrubbing
Rigorous method for testing interpretability hypotheses in neural networks.
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Integrated Gradients
Attribution method using path integrals to attribute predictions to inputs.
Sundararajan et al. (2017): Axiomatic Attribution for Deep Networks
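The attribution is the input-minus-baseline difference times the path integral of the gradient, which the sketch below approximates with a Riemann sum over a straight-line path (toy PyTorch model).

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """IG_i ~ (x_i - baseline_i) * mean over the path of dF/dx_i."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)   # straight-line path, shape (steps, *x.shape)
    path.requires_grad_(True)
    model(path).sum().backward()
    return (x - baseline) * path.grad.mean(dim=0)

model = torch.nn.Sequential(torch.nn.Linear(4, 1))   # toy differentiable model
x, baseline = torch.randn(4), torch.zeros(4)
attributions = integrated_gradients(model, x, baseline)
```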
Chain-of-Thought Analysis
Detection of complex cognitive behaviors including alignment faking.
METR (2025): CoT May Be Highly Informative Despite “Unfaithfulness”
Model Editing (ROME)
Precise modification of factual associations within language models.
Meng et al. (2022)
Knowledge Neurons
Identifying specific components responsible for factual knowledge.
Dai et al. (2021)
Physics-Informed Model Control
Using approaches from physics to establish bounds on model behavior.
Tomaz & Jones (2025): Momentum-Point-Perplexity Mechanics
Representation Engineering
Activation-level interventions to suppress harmful trajectories.
Zou et al. (2023): Representation Engineering
Steering GPT-2-XL by adding an activation vector
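A minimal activation-addition sketch in the spirit of the steering work above: add a fixed direction to a layer's output via a forward hook. The toy layer and random direction are placeholders; in practice the direction would come from contrasting prompt pairs at a chosen residual-stream layer.

```python
import torch

def add_steering_hook(layer, direction, scale=4.0):
    """Shift the layer's output along `direction` on every forward pass."""
    def hook(module, inputs, output):
        return output + scale * direction.to(output.dtype)
    return layer.register_forward_hook(hook)

# Toy stand-in for a transformer block; in practice `direction` would be the
# difference of mean activations between contrasting prompts at this layer.
layer = torch.nn.Linear(16, 16)
handle = add_steering_hook(layer, torch.randn(16))
steered = layer(torch.randn(2, 16))
handle.remove()
```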
Gradient Routing
Localizing computation in neural networks through gradient masking.
Cloud et al. (2024): Gradient Routing
Developmental Interpretability
Understanding how AI models acquire capabilities during training.
Kendiukhov (2024)
Singular Learning Theory (SLT)
Mathematical foundations for understanding learning dynamics and phase transitions.
Hoogland et al. (2024)
Scalable Oversight and Alignment Training
Reinforcement Learning from Human Feedback (RLHF)
Using human-labeled preferences for alignment training.
LessWrong Tag RLHF
Christiano et al. (2017): Deep Reinforcement Learning from Human Preferences
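The core of the reward-modeling step is a Bradley-Terry loss on preference pairs; the sketch below shows that loss on placeholder embeddings and omits the RL fine-tuning stage entirely.

```python
import torch
import torch.nn.functional as F

reward_head = torch.nn.Linear(512, 1)   # stand-in for a reward head on top of an LM

def preference_loss(emb_chosen, emb_rejected):
    """Bradley-Terry objective: -log sigmoid(r(chosen) - r(rejected))."""
    return -F.logsigmoid(reward_head(emb_chosen) - reward_head(emb_rejected)).mean()

preference_loss(torch.randn(8, 512), torch.randn(8, 512)).backward()
```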
Reinforcement Learning from AI Feedback (RLAIF)
Bootstrapping alignment from smaller aligned models.
Bai et al. (2022)
Constitutional AI
Leveraging rule-based critiques to reduce reliance on human raters.
LessWrong Tag Constitutional AI
Bai et al. (2022)
Pretraining Data Filtering
Removing dual-use content during training for tamper-resistant safeguards.
O’Brien et al. (2024)
Reinforcement Learning from Reflective Feedback (RLRF)
Models generate and utilize their own self-reflective feedback for alignment.
Yang et al. (2024)
CALMA
Soni et al. (2025): CALMA: A Process for Deriving Context-aligned Axes for Language Model Alignment
Value Learning / Cooperative Inverse Reinforcement Learning (CIRL)
Building AI systems that infer human values from behavior and feedback.
LessWrong Tag Value Learning
Hadfield-Menell et al. (2016)
Imitation Learning
Learning safe behaviors from expert demonstrations.
Hussein et al. (2017)
Iterated Distillation and Amplification (IDA)
Recursively training models to decompose and amplify human supervision.
LessWrong Tag IDA
Christiano (2018): Iterated Distillation and Amplification
AI Safety via Debate
Two models in adversarial dialogue judged by humans.
LessWrong Tag Debate
Irving et al. (2018)
Recursive Reward Modeling
Training reward models for sub-tasks and combining them for harder tasks.
Leike et al. (2018): Scalable Agent Alignment via Reward Modeling
Eliciting Latent Knowledge (ELK)
Extracting truthful internal representations even when deceptive behavior could arise.
LessWrong Tag ELK
Christiano et al. (2021): Eliciting Latent Knowledge: How to tell if your eyes deceive you
Shard Theory
Framework for understanding how values and goals emerge through training.
LessWrong Tag Shard Theory
Turner & Udell (2022): Shard Theory Overview
Shah & Gleave (2019): Reward is not the optimization target
Robustness and Adversarial Evaluation
See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI
Adversarial Training
Augmenting training with adversarial examples including jailbreak defenses.
Goodfellow et al. (2015): Explaining and Harnessing Adversarial Examples
Madry et al. (2017): Towards Deep Learning Models Resistant to Adversarial Attacks
Redwood Research (2022): Adversarial Training for Language Models
Redwood Research (2023): Training to Avoid Harmful Content
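A minimal PGD-style inner loop in the spirit of Madry et al.: maximize the loss within an L-infinity ball, then take a normal training step on the perturbed batch. Epsilon, step size, and the toy model are illustrative.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, step=0.01, iters=10):
    """Maximize the loss by signed gradient ascent within an L-infinity ball."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the ball
    return x_adv.detach()

# One adversarial training step on toy data.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
loss = F.cross_entropy(model(pgd_attack(model, x, y)), y)
opt.zero_grad()
loss.backward()
opt.step()
```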
Prompt Injection Defenses
Defense systems against prompt injection attacks.
Perez & Ribas (2022)
Lakera Guard (2024)
Red-Teaming and Capability Evaluations
Testing for misuse, capability hazards, and safety failures.
ARC Evals (2023)
Redwood Research Robust Injury Classifier (2023)
AE Studio Simulations (2024)
Zou et al. (2025)
OS-HARM Benchmark
Evaluating agent vulnerabilities in realistic desktop environments.
Kuntz et al. (2024)
Goal Drift Evaluation
Assessing whether agents maintain intended objectives over extended interactions.
Arike et al. (2025)
Attempt to Persuade Eval (APE)
Measuring models’ willingness to attempt persuasion on harmful topics.
Kowal et al. (2025)
INTIMA Benchmark
Evaluating AI companionship behaviors that can lead to emotional dependency.
Kaffee et al. (2025)
Signal-to-Noise Analysis for Evaluations
Ensuring safety assessments accurately distinguish model capabilities.
Heineman et al. (2024)
Data Scaling Laws for Domain Robustness
Systematic data curation to enhance model robustness.
Skywork-SWE (2025)
Behavioral and Psychological Approaches
See also LessWrong Tag Human-AI Interaction
LLM Psychology
Treating LLMs as psychological subjects to probe reasoning and behavior.
AI Psychology
Kulveit (2024): Three-Layer Model
Kaffee et al. (2025): INTIMA Benchmark
Persona Vectors
Automated monitoring and control of personality traits.
Chen et al. (2025)
Self-Other Overlap Fine-Tuning (SOO-FT)
Fine-tuning with paired prompts to reduce deceptive behavior.
Carauleanu et al. (2024)
Alignment Faking Detection
Identifying when models strategically fake alignment.
Hubinger (2024): Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
How likely is deceptive alignment?
Park et al. (2023): AI Deception Survey
Brain-Like AGI Safety
Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.
Byrnes (2022-2024): Intro to Brain-Like AGI Safety
Byrnes (2025): 2024 Review & 2025 Plans
Why Modelling Multi-Objective Homeostasis is Essential for AI Alignment
Robopsychology and Simulator Theory
Understanding LLMs as universal simulators rather than goal-pursuing agents.
LessWrong Tag Simulators
Janus (2022): Simulators Sequence
Alexander (2023): Janus’ Simulators
Operational Control and Infrastructure
See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases
AI Control Framework
Designing protocols for deploying powerful but untrusted AI systems.
Greenblatt et al. (2024)
Greenblatt (2023): Overview of Control Measures
Permission Management and Sandboxing
Fine-grained permission systems and OS-level sandboxing for AI agents.
CalypsoAI (2024)
Model Cascades
Using confidence calibration to defer uncertain tasks to more capable models.
Rabanser et al. (2025): GATEKEEPER
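Not the GATEKEEPER method itself, but the basic deferral pattern is easy to state: answer with the cheap model when its calibrated confidence clears a threshold, otherwise escalate. Both model callables below are placeholders.

```python
def cascade(prompt, small_model, large_model, threshold=0.85):
    """Defer to the larger (or human) tier whenever the small model is unsure.

    small_model(prompt) -> (answer, confidence in [0, 1]); both callables are
    placeholders for real deployments.
    """
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer, "small"
    return large_model(prompt), "large"
```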
Guillotine Hypervisor
Advanced containment architecture for isolating potentially malicious AI systems.
Rosenthal et al. (2024)
AI Hardware Security
Physical high-performance computing hardware assurance for compliance.
TamperSec (2024)
Artifact and Experiment Lineage Tracking
Tracking systems linking AI outputs to precise production trajectories.
Weights & Biases (2024)
CalypsoAI (2024)
Shutdown Mechanisms and Cluster Kill Switches
Safely Interruptible Agents framework (Orseau & Armstrong, 2016)
Core Safety Values for Provably Corrigible Agents
Watermarking and Output Detection
Digital watermarking techniques for AI-generated content.
Kirchenbauer et al. (2023)
Google SynthID (2023)
Sadasivan et al. (2024): Detection Reliability
Zhao et al. (2023): Provable Robust Watermarking
Mitchell et al. (2024): Standards and Policy
Copyleaks (2024)
Blackbird.AI (2024)
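A toy sketch of the greenlist scheme from Kirchenbauer et al.: hash the previous token to pseudorandomly partition the vocabulary, bias generation toward the "green" half, and detect by counting green tokens. The tiny vocabulary is a placeholder; real implementations operate on tokenizer IDs with a proper statistical test.

```python
import hashlib
import random

VOCAB = list(range(1_000))  # toy vocabulary; real schemes use tokenizer IDs

def green_list(prev_token, fraction=0.5):
    """Pseudorandomly partition the vocabulary, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(fraction * len(shuffled))])

# At generation time, a constant bias would be added to green-token logits before sampling.

def green_fraction(tokens):
    """Detection statistic: share of tokens in their predecessor's green list."""
    hits = sum(t in green_list(prev) for prev, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

print(green_fraction([3, 17, 512, 42, 7]))  # ~0.5 expected for unwatermarked text
```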
Steganography and Context Leak Countermeasures
Preventing covert channels and hidden information in AI systems.
StegoAttack (Wang et al., 2025)
Tensor Steganography (Snyk, 2024)
Encoded Reasoning (Pfau et al., 2024)
Steganalysis (Chen et al., 2024)
Runtime AI Firewalls and Content Filtering
Real-time interception and filtering during AI inference.
Lakera Guard (2024)
Nightfall AI (2024)
ActiveFence (2024)
AI System Observability and Drift Detection
Continuous monitoring for performance degradation and anomalous behavior.
Arize AI (2024)
TruEra (2024)
Fiddler AI (2024)
Harmony Intelligence (2024)
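One common drift signal, sketched below, is a population stability index comparing the distribution of some model score between a reference window and a live window; the ~0.2 alarm threshold is a rule of thumb, not a standard.

```python
import numpy as np

def population_stability_index(reference, live, bins=10):
    """PSI between two score samples; values above ~0.2 are often flagged as drift."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    liv = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] / len(live)
    ref, liv = ref + 1e-6, liv + 1e-6   # avoid log(0)
    return float(np.sum((liv - ref) * np.log(liv / ref)))

rng = np.random.default_rng(0)
psi = population_stability_index(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000))
```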
Governance and Institutions
See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research
Pre-Deployment External Safety Testing
Third-party evaluations before AI system release.
FLI AI Safety Index (2024)
FLI AI Safety Index (2025)
Citadel AI (2024)
Giskard (2024)
Attestable Audits
Using Trusted Execution Environments for verifiable safety benchmarks.
Schnabl et al. (2025)
Probabilistic Risk Assessment (PRA) for AI
Structured risk evaluation adapted from high-reliability industries.
Wisakanto et al. (2025)
Regulation: EU AI Act and US EO 14110
Risk-based regulatory obligations and safety testing mandates.
EU AI Act
US Executive Order 14110
System Cards and Preparedness Frameworks
Labs release safety evidence and define deployment gates.
OpenAI GPT-4 System Card (2023)
Anthropic Claude 3 Model Card (2024)
Anthropic Responsible Scaling Policy (2024)
AI Governance Platforms
End-to-end governance workflows with compliance linkage.
Fairly AI (2024)
Saidot AI (2024)
Ecosystem Development and Meta-Interventions
Research infrastructure, community building, and coordination.
AI Safety Map (2024)
TRecursive (2024)
LessWrong (2024)
AI Safety Support (2024)
PauseAI (2024)
Berg et al. (2023): Neglected Approaches
Underexplored Interventions
This is your chance to work on something nobody has worked on before. See also: Feedback Wanted: Shortlist of AI Safety Ideas; Ten AI Safety Projects I’d Like People to Work On; AI alignment project ideas.
See also LessWrong Tag AI Safety Research
Compositional Formal Specifications for Prompts/Agents
Treating prompts and agent orchestration as formal programs with verifiable properties.
mentioned in Ji et al. (2023): AI Alignment Survey
mentioned in Greenblatt (2023): An overview of control measures
Control-Theoretic Certificates for Tool-Using Agents
Extending barrier certificates to multi-step, multi-API agent action graphs.
mentioned in Greenblatt (2023): An overview of control measures
AI-BSL: Capability-Tiered Physical Containment Standards
Biosafety-level-like standards for labs training frontier models.
mentioned in FLI AI Safety Index (2025)
Ji et al. (2023)
Anthropic RSP (2024)
Oversight Mechanism Design
Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.
Designing Agent Incentives to Avoid Reward Tampering
mentioned in FLI AI Safety Index (2025)
mentioned in BlueDot Impact (2023)
Liability and Insurance Instruments
Risk transfer mechanisms including catastrophe bonds and mandatory coverage.
mentioned in FLI AI Safety Index (2025)
mentioned in BlueDot Impact (2023)
Dataset Hazard Engineering
Systematic hazard analysis for data pipelines using safety engineering methods.
Ji et al. (2023)
Wisakanto et al. (2025): PRA for AI
mentioned in FLI AI Safety Index (2025)
Automated Alignment Research
Using AI systems to accelerate safety research.
Carlsmith (2025)
mentioned in AE Studio (2024)
Note: I’m currently collecting a longer list of papers and projects in this category. A lot of people are working on this!
Deliberative and Cultural Interventions
Integration of broader human values through citizen assemblies and stakeholder panels.
AE Studio Scenario Planning (2024)
mentioned in FLI AI Safety Index (2025)
mentioned in BlueDot Impact (2023)
Deceptive Behavior Detection and Mitigation
Systematic approaches for detecting and preventing deceptive behaviors.
mentioned in Ganguli et al. (2024): Foundational Challenges
Carauleanu et al. (2024): SOO-FT
Generalization Control and Capability Containment
Frameworks for controlling how AI systems generalize to new tasks.
mentioned in Ganguli et al. (2024): Foundational Challenges
Multi-Agent Safety and Coordination Protocols
Safety frameworks for environments with multiple interacting AI systems.
mentioned in Ganguli et al. (2024): Foundational Challenges
Technical Governance Implementation Tools
Technical tools for implementing, monitoring, and enforcing AI governance policies.
Reuel et al. (2024): Open Problems in Technical AI Governance
International AI Coordination Mechanisms
Infrastructure and protocols for coordinating AI governance across international boundaries.
Barnett & Scher (2024): AI Governance to Avoid Extinction
Systemic Disempowerment Measurement
Quantitative frameworks for measuring human disempowerment as AI capabilities advance.
Kulveit et al. (2025): Gradual Disempowerment Framework