AI Safety Interventions

This post tries to be a fairly comprehensive list of all AI safety, alignment, and control interventions.

Much of this collection was compiled as part of an internal report on the field for AE Studio under Diogo de Lucena. I’d like to thank Aaron Scher, who maintains the #papers-running-list channel in the AI Alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.

This post doesn’t try to explain the interventions; it provides only the tersest summaries and serves as a top-level index to the relevant posts and papers. The much longer paper version of this post contains additional summaries for the interventions (but fewer LW links) and can be found here.

AI disclaimer: Many of the summaries have been co-written or edited with ChatGPT.

Please let me know about any broken links or interventions I have overlooked, especially entire categories of intervention.


Prior Overviews

This consolidated report drew on the following prior efforts.

Comprehensive Surveys

Control and Operational Approaches

Governance and Policy

Project Ideas and Research Directions


Foundational Theories

See also AI Safety Arguments Guide

Embedded Agency

Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.

Decision Theory and Rational Choice

Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.

Optimization and Mesa-Optimization

Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.

Logical Induction

MIRI’s framework for reasoning under logical uncertainty with computable algorithms.

Cartesian Frames and Finite Factored Sets

Scott Garrabrant’s frameworks for modeling agency and inferring temporal/causal structure without assuming a fixed Cartesian boundary.

Infra-Bayesianism and Logical Uncertainty

Handling uncertainty in logical domains and imperfect models.


Hard Methods: Formal Guarantees

See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility

Neural Network Verification

Mathematical verification methods to prove properties about neural networks.
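
One of the simplest such techniques is interval bound propagation; below is a minimal NumPy sketch for a small ReLU network, with placeholder weights rather than any specific verified system.

```python
# Minimal sketch of interval bound propagation (IBP) through a ReLU MLP.
# Given an input box [x - eps, x + eps], it returns guaranteed bounds on the
# network's outputs; `weights` and `biases` are placeholder NumPy arrays.
import numpy as np

def ibp_bounds(weights, biases, x, eps):
    lower, upper = x - eps, x + eps
    for i, (W, b) in enumerate(zip(weights, biases)):
        center = (upper + lower) / 2.0
        radius = (upper - lower) / 2.0
        new_center = W @ center + b
        new_radius = np.abs(W) @ radius        # worst-case spread through the linear layer
        lower, upper = new_center - new_radius, new_center + new_radius
        if i < len(weights) - 1:               # ReLU on hidden layers only
            lower, upper = np.maximum(lower, 0), np.maximum(upper, 0)
    return lower, upper

# A property such as "class 0 is always predicted on this input box" is
# verified if lower[0] > upper[j] for every j != 0.
```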

Conformal Prediction

Adding confidence guarantees to existing models.
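
To make this concrete, here is a minimal split-conformal sketch for a generic classifier; the `predict_proba` interface and integer labels are assumptions for illustration.

```python
# Minimal sketch of split conformal prediction around an existing classifier.
import numpy as np

def calibrate(model, X_cal, y_cal, alpha=0.1):
    """Compute the conformal score threshold on a held-out calibration set."""
    probs = model.predict_proba(X_cal)
    scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]   # nonconformity of the true label
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")  # numpy >= 1.22

def predict_set(model, X, q_hat):
    """Prediction sets with roughly (1 - alpha) marginal coverage."""
    probs = model.predict_proba(X)
    return [np.where(1.0 - p <= q_hat)[0] for p in probs]
```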

Proof-Carrying Models

Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.

Safe Reinforcement Learning (SafeRL)

Algorithms that maintain safety constraints during learning while maximizing returns.

Shielded RL

Integrating temporal logic monitors with learning systems to filter unsafe actions.
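
The core idea fits in a few lines: a wrapper checks each proposed action against a monitor before it reaches the environment. The agent/environment interfaces below are placeholders.

```python
# Minimal sketch of a shield that filters unsafe actions before execution.
def is_safe(state, action):
    # Placeholder for a formal monitor, e.g. an automaton compiled from a
    # temporal-logic safety specification.
    return action != "disable_oversight"

def shielded_step(agent, env, state):
    ranked = agent.rank_actions(state)                   # agent's actions, best first
    allowed = [a for a in ranked if is_safe(state, a)]
    action = allowed[0] if allowed else "noop"           # fall back to a known-safe default
    return env.step(action)
```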

Runtime Assurance Architectures (Simplex)

Combining high-performance unverified controllers with formally verified safety controllers.
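
The switching logic itself is small; everything in this sketch (the two controllers and the recoverable-region check) is a placeholder illustrating the pattern rather than a verified implementation.

```python
# Minimal sketch of the Simplex pattern: an unverified high-performance
# controller runs by default, and a verified safety controller takes over
# whenever the system leaves a pre-certified recoverable region.
def simplex_step(state, advanced_controller, safety_controller, in_recoverable_region):
    if in_recoverable_region(state):
        return advanced_controller(state)   # performance mode
    return safety_controller(state)         # formally verified fallback
```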

Safely Interruptible Agents

Theoretical framework for shutdown indifference.

Provably Corrigible Agents

Using utility heads to ensure formal guarantees of corrigibility.

Guaranteed Safe AI (GSAI)

Comprehensive framework for AI systems with quantitative, provable safety guarantees.

Proofs of Autonomy

Extending formal verification to autonomous agents using cryptographic frameworks.


Mechanistic and Mathematical Interpretability

See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree

Circuit Analysis and Feature Discovery

Reverse-engineering neural representations into interpretable circuits.

Sparse Autoencoders (SAEs)

Extracting interpretable features by learning sparse representations of activations.
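
A minimal PyTorch sketch of the standard recipe (ReLU encoder, linear decoder, L1 sparsity penalty); dimensions and coefficients are illustrative, not taken from any particular paper.

```python
# Minimal sketch of a sparse autoencoder over residual-stream activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse feature coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
```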

Feature Visualization

Understanding neural network representations through direct visualization.

Linear Probes

Training lightweight linear classifiers on frozen activations for scalable analysis of model behavior, e.g. persuasion dynamics.
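
A minimal sketch of the standard workflow, assuming activations have already been extracted and cached:

```python
# Minimal sketch of a linear probe: a logistic-regression classifier trained on
# frozen hidden activations to test whether a concept is linearly decodable.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_probe(activations, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # High held-out accuracy suggests the concept is represented linearly at
    # this layer; it does not by itself show that the model *uses* it.
    return probe, probe.score(X_test, y_test)
```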

Attribution Graphs

Interactive visualizations of feature-feature interactions.

Causal Scrubbing

Rigorous method for testing interpretability hypotheses in neural networks.

Integrated Gradients

Attribution method using path integrals to attribute predictions to inputs.
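
A minimal PyTorch sketch, assuming a classifier that returns logits of shape (batch, classes):

```python
# Minimal sketch of integrated gradients: approximate the path integral of
# gradients from a baseline to the input with a Riemann sum.
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate from the baseline to the input along a straight path.
    alphas = torch.linspace(0.0, 1.0, steps + 1).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    outputs = model(path)[:, target_class].sum()
    grads = torch.autograd.grad(outputs, path)[0]
    avg_grad = grads.mean(dim=0)              # Riemann approximation of the path integral
    return (x - baseline) * avg_grad          # attributions, same shape as x
```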

Chain-of-Thought Analysis

Detection of complex cognitive behaviors including alignment faking.

Model Editing (ROME)

Precise modification of factual associations within language models.

Knowledge Neurons

Identifying specific components responsible for factual knowledge.

Physics-Informed Model Control

Using approaches from physics to establish bounds on model behavior.

Representation Engineering

Activation-level interventions to suppress harmful trajectories.
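
As an illustration, a steering vector can be added to a layer’s residual stream with a forward hook; the model, layer path, and direction below are placeholders.

```python
# Minimal sketch of activation steering: add a chosen direction to one layer's
# output at inference time via a PyTorch forward hook.
import torch

def add_steering_hook(model, layer, direction, scale=4.0):
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction          # shift every token's activation
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model, model.transformer.h[12], refusal_direction)
# ... generate as usual, then: handle.remove()
```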

Gradient Routing

Localizing computation in neural networks through gradient masking.

Developmental Interpretability

Understanding how AI models acquire capabilities during training.

Singular Learning Theory (SLT)

Mathematical foundations for understanding learning dynamics and phase transitions.


Scalable Oversight and Alignment Training

Reinforcement Learning from Human Feedback (RLHF)

Using human-labeled preferences for alignment training.
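
The heart of the method is a reward model trained on pairwise human preferences; a minimal sketch of that loss (with a placeholder reward model) is below.

```python
# Minimal sketch of the reward-model step in RLHF: fit a scalar reward so that
# human-preferred responses score higher (Bradley-Terry / pairwise loss).
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)       # shape: (batch,)
    r_rejected = reward_model(rejected_ids)   # shape: (batch,)
    # Maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then becomes the objective for an RL step on the policy, typically with a KL penalty back to the base model.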

Reinforcement Learning from AI Feedback (RLAIF)

Bootstrapping alignment from smaller aligned models.

Constitutional AI

Leveraging rule-based critiques to reduce reliance on human raters.

Pretraining Data Filtering

Removing dual-use content from pretraining data so that the resulting safeguards are more tamper-resistant.

Reinforcement Learning from Reflective Feedback (RLRF)

Models generate and utilize their own self-reflective feedback for alignment.

CALMA

Value Learning / Cooperative Inverse Reinforcement Learning (CIRL)

Building AI systems that infer human values from behavior and feedback.

Imitation Learning

Learning safe behaviors from expert demonstrations.

Iterated Distillation and Amplification (IDA)

Recursively training models to decompose and amplify human supervision.

AI Safety via Debate

Two models in adversarial dialogue judged by humans.

Recursive Reward Modeling

Training reward models for sub-tasks and combining them for harder tasks.

Eliciting Latent Knowledge (ELK)

Extracting truthful internal representations even when deceptive behavior could arise.

Shard Theory

Framework for understanding how values and goals emerge through training.


Robustness and Adversarial Evaluation

See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI

Adversarial Training

Augmenting training with adversarial examples including jailbreak defenses.
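
A minimal sketch in the classic vision setting (FGSM perturbations); text-domain jailbreak defenses follow the same loop but generate adversarial prompts instead of pixel perturbations.

```python
# Minimal sketch of adversarial training with FGSM-perturbed inputs.
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()   # worst-case perturbation (one step)

def adversarial_training_step(model, optimizer, x, y):
    x_adv = fgsm_example(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)    # train on the perturbed batch
    loss.backward()
    optimizer.step()
    return loss.item()
```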

Prompt Injection Defenses

Defense systems against prompt injection attacks.

Red-Teaming and Capability Evaluations

Testing for misuse, capability hazards, and safety failures.

OS-HARM Benchmark

Evaluating agent vulnerabilities in realistic desktop environments.

Goal Drift Evaluation

Assessing whether agents maintain intended objectives over extended interactions.

Attempt to Persuade Eval (APE)

Measuring models’ willingness to attempt persuasion on harmful topics.

INTIMA Benchmark

Evaluating AI companionship behaviors that can lead to emotional dependency.

Signal-to-Noise Analysis for Evaluations

Ensuring safety assessments accurately distinguish model capabilities.

Data Scaling Laws for Domain Robustness

Systematic data curation to enhance model robustness.


Behavioral and Psychological Approaches

See also LessWrong Tag Human-AI Interaction

LLM Psychology

Treating LLMs as psychological subjects to probe reasoning and behavior.

Persona Vectors

Automated monitoring and control of personality traits.

Self-Other Overlap Fine-Tuning (SOO-FT)

Fine-tuning with paired prompts to reduce deceptive behavior.

Alignment Faking Detection

Identifying when models strategically fake alignment.

Brain-Like AGI Safety

Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.

Robopsychology and Simulator Theory

Understanding LLMs as universal simulators rather than goal-pursuing agents.


Operational Control and Infrastructure

See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases

AI Control Framework

Designing protocols for deploying powerful but untrusted AI systems.

Permission Management and Sandboxing

Fine-grained permission systems and OS-level sandboxing for AI agents.

Model Cascades

Using confidence calibration to defer uncertain tasks to more capable models.
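
A minimal sketch of the deferral logic, assuming the small model returns a calibrated confidence score:

```python
# Minimal sketch of a confidence-gated cascade: answer with a small model when
# it is confident, otherwise defer to a larger (or human-reviewed) model.
def cascade(query, small_model, large_model, threshold=0.9):
    answer, confidence = small_model(query)     # assumed calibrated confidence
    if confidence >= threshold:
        return answer, "small"
    return large_model(query), "large"
```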

Guillotine Hypervisor

Advanced containment architecture for isolating potentially malicious AI systems.

AI Hardware Security

Physical assurance mechanisms on high-performance computing hardware to support compliance verification.

Artifact and Experiment Lineage Tracking

Tracking systems linking AI outputs to precise production trajectories.

Shutdown Mechanisms and Cluster Kill Switches

Hardware- and infrastructure-level mechanisms for halting training runs or deployments.

Watermarking and Output Detection

Digital watermarking techniques for AI-generated content.
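
For intuition, a heavily simplified sketch of a “green list” statistical watermark; the seeding scheme and constants are illustrative, not the exact published algorithm.

```python
# Minimal sketch of a green-list watermark: at each generation step, a
# pseudorandom subset of the vocabulary (seeded by the previous token) gets a
# logit boost, which a detector can later test for statistically.
import torch

def watermark_logits(logits, prev_token, vocab_size, gamma=0.5, delta=2.0):
    gen = torch.Generator().manual_seed(int(prev_token))
    perm = torch.randperm(vocab_size, generator=gen)
    green = perm[: int(gamma * vocab_size)]
    logits = logits.clone()
    logits[green] += delta          # bias sampling toward the green list
    return logits

# Detection: recompute the green list per position, count how many emitted
# tokens are green, and run a z-test against the baseline rate gamma.
```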

Steganography and Context Leak Countermeasures

Preventing covert channels and hidden information in AI systems.

Runtime AI Firewalls and Content Filtering

Real-time interception and filtering during AI inference.

AI System Observability and Drift Detection

Continuous monitoring for performance degradation and anomalous behavior.
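
A minimal sketch of one ingredient, output-distribution drift detection via a two-sample test; the monitored statistic (response length) is an arbitrary example.

```python
# Minimal sketch of drift detection: compare a recent window of a monitored
# output statistic against a reference window with a Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def drift_alert(reference_lengths, recent_lengths, p_threshold=0.01):
    statistic, p_value = ks_2samp(reference_lengths, recent_lengths)
    return p_value < p_threshold, p_value   # True => distribution shift detected
```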


Governance and Institutions

See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research

Pre-Deployment External Safety Testing

Third-party evaluations before AI system release.

Attestable Audits

Using Trusted Execution Environments for verifiable safety benchmarks.

Probabilistic Risk Assessment (PRA) for AI

Structured risk evaluation adapted from high-reliability industries.

Regulation: EU AI Act and US EO 14110

Risk-based regulatory obligations and safety testing mandates.

System Cards and Preparedness Frameworks

Labs release safety evidence and define deployment gates.

AI Governance Platforms

End-to-end governance workflows with compliance linkage.

Ecosystem Development and Meta-Interventions

Research infrastructure, community building, and coordination.


Underexplored Interventions

This is your chance to work on something nobody has worked on before. See also: Feedback Wanted: Shortlist of AI Safety Ideas; Ten AI Safety Projects I’d Like People to Work On; AI alignment project ideas

See also LessWrong Tag AI Safety Research

Compositional Formal Specifications for Prompts/Agents

Treating prompts and agent orchestration as formal programs with verifiable properties.

Control-Theoretic Certificates for Tool-Using Agents

Extending barrier certificates to multi-step, multi-API agent action graphs.

AI-BSL: Capability-Tiered Physical Containment Standards

Biosafety-level-like standards for labs training frontier models.

Oversight Mechanism Design

Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.

Liability and Insurance Instruments

Risk transfer mechanisms including catastrophe bonds and mandatory coverage.

Dataset Hazard Engineering

Systematic hazard analysis for data pipelines using safety engineering methods.

Automated Alignment Research

Using AI systems to accelerate safety research.

Note: I’m currently collecting a longer list of papers and projects in this category. A lot of people are working on this!

Deliberative and Cultural Interventions

Integration of broader human values through citizen assemblies and stakeholder panels.

Deceptive Behavior Detection and Mitigation

Systematic approaches for detecting and preventing deceptive behaviors.

Generalization Control and Capability Containment

Frameworks for controlling how AI systems generalize to new tasks.

Multi-Agent Safety and Coordination Protocols

Safety frameworks for environments with multiple interacting AI systems.

Technical Governance Implementation Tools

Technical tools for implementing, monitoring, and enforcing AI governance policies.

International AI Coordination Mechanisms

Infrastructure and protocols for coordinating AI governance across international boundaries.

Systemic Disempowerment Measurement

Quantitative frameworks for measuring human disempowerment as AI capabilities advance.
