Shallow review of technical AI safety, 2025

Website version · Gestalt · Repo and data

Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)

This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website.

It’s shallow in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, as well as a year’s worth of Twitter.

It is substantially a list of lists structuring 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what’s happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn’t pan out.

Here, “AI safety” means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it’s capability restraint, instruction-following, value alignment, control, and risk awareness work.

We don’t cover security or resilience at all.

We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor.

We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).

Even ignoring all of that as we do, it’s still too long to read. Here’s a spreadsheet version (agendas and papers) and the github repo including the data and the processing pipeline. Methods down the bottom. Gavin’s editorial outgrew this post and became its own thing.

If we missed something big or got something wrong, please comment, we will edit it in.

An Arb Research project. Work funded by OpenPhil Coefficient Giving.

We have tried to falsify this but it wasn’t easy.

Labs (giant companies)

Safety team % all Not counting the AIs<5%<3%<3%<2%<1%
Leadership’s stated timelines to full auto AI R&Dmid-20272028 to 2035March 2028N/​AASI by 2030
Leadership stated P(AI doom)25%“Non-zero” and >5%~2%~0%20%
Legal obligationsEU CoP, SB53EU CoP, SB53EU CoP, SB53SB53EU CoP (Safety), SB53
Average Safety Score (ZSP, SaferAI, FLI)51%27%33%17%17%

OpenAI

  • Structure: privately-held public benefit corp

  • Safety teams: Alignment, Safety Systems (Interpretability, Safety Oversight, Pretraining Safety, Robustness, Safety Research, Trustworthy AI, new Misalignment Research team coming), Preparedness, Model Policy, Safety and Security Committee, Safety Advisory Group. The Persona Features paper had a distinct author list. No named successor to Superalignment.

  • Public alignment agenda: None. Boaz Barak offers personal views.

  • Risk management framework: Preparedness Framework

  • See also: Iterative alignment · Safeguards (inference-time auxiliaries) · Character training and persona steering

  • Some names: Johannes Heidecke, Boaz Barak, Mia Glaese, Jenny Nitishinskaya, Lama Ahmad, Naomi Bashkansky, Miles Wang, Wojciech Zaremba, David Robinson, Zico Kolter, Jerry Tworek, Eric Wallace, Olivia Watkins, Kai Chen, Chris Koch, Andrea Vallone, Leo Gao

  • Critiques: Stein-Perlman, Stewart, underelicitation, Midas, defense, Carlsmith on labs in general. It’s difficult to model OpenAI as a single agent: “ALTMAN: I very rarely get to have anybody work on anything. One thing about researchers is they’re going to work on what they’re going to work on, and that’s that.”

  • Funded by: Microsoft, AWS, Oracle, NVIDIA, SoftBank, G42, AMD, Dragoneer, Coatue, Thrive, Altimeter, MGX, Blackstone, TPG, T. Rowe Price, Andreessen Horowitz, D1 Capital Partners, Fidelity Investments, Founders Fund, Sequoia…

Some outputs (13)

Google Deepmind

  • Structure: research laboratory subsidiary of a public for-profit

  • Safety teams: amplified oversight, interpretability, ASAT eng (automated alignment research), Causal Incentives Working Group, Frontier Safety Risk Assessment (evals, threat models, the framework), Mitigations (e.g. banning accounts, refusal training, jailbreak robustness), Loss of Control (control, alignment evals). Structure here.

  • Public alignment agenda: An Approach to Technical AGI Safety and Security

  • Risk management framework: Frontier Safety Framework

  • See also: White-box safety (i.e. Interpretability) · Scalable Oversight

  • Some names: Rohin Shah, Allan Dafoe, Anca Dragan, Alex Irpan, Alex Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Sebastian Farquhar, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Zac Kenton, Four Flynn, Jonathan Richens, Lewis Smith, Janos Kramar, Matthew Rahtz, Mary Phuong, Erik Jenner

  • Critiques: Stein-Perlman, Carlsmith on labs in general, underelicitation, On Google’s Safety Plan

  • Funded by: Google. Explicit 2024 Deepmind spending as a whole was £1.3B, but this doesn’t count most spending e.g. Gemini compute.

Some outputs (14)

Anthropic

  • Structure: privately-held public-benefit corp

  • Safety teams: Scalable Alignment (Leike), Alignment Evals (Bowman), Interpretability (Olah), Control (Perez), Model Psychiatry (Lindsey), Character (Askell), Alignment Stress-Testing (Hubinger), Alignment Mitigations (Price?), Frontier Red Team (Graham), Safeguards (?), Societal Impacts (Ganguli), Trust and Safety (Sanderford), Model Welfare (Fish)

  • Public alignment agenda: directions, bumpers, checklist, an old vague view

  • Risk management framework: RSP

  • See also: White-box safety (i.e. Interpretability) · Scalable Oversight

  • Some names: Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Jan Leike, Ethan Perez, Jack Lindsey, Amanda Askell, Kyle Fish, Sara Price, Jon Kutasov, Minae Kwon, Monty Evans, Richard Dargan, Roger Grosse, Ben Levinstein, Joseph Carlsmith, Joe Benton

  • Critiques: Stein-Perlman, Casper, Carlsmith, underelicitation, Greenblatt, Samin, defense, Existing Safety Frameworks Imply Unreasonable Confidence

  • Funded by: Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Baillie Gifford, BlackRock, Blackstone, Coatue, D1 Capital Partners, General Atlantic, General Catalyst, GIC, Goldman Sachs, Insight Partners, Jane Street, Ontario Teachers’ Pension Plan, Qatar Investment Authority, TPG, T. Rowe Price, WCM, XN

Some outputs (21)

xAI

Meta

  • Structure: public for-profit

  • Teams: Safety “integrated into” capabilities research, Meta Superintelligence Lab. But also FAIR Alignment, Brain and AI.

  • Framework: FAF

  • See also: Capability removal: unlearning

  • Some names: Shuchao Bi, Hongyuan Zhan, Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Jason Weston, ShengYun Peng, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Evangelia Spiliopoulou, Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda, Adina Williams

  • Critiques: extreme underelicitation, Stein-Perlman, Carlsmith on labs in general

  • Funded by: Meta

Some outputs (6)

China

The Chinese companies often don’t attempt to be safe, often not even in the prosaic safeguards sense. They drop the weights immediately after post-training finishes. They’re mostly open weights and closed data. As of writing the companies are often severely compute-constrained. There are some informal reasons to doubt their capabilities. The (academic) Chinese AI safety scene is however also growing.

  • Alibaba’s Qwen3-etc-etc is nominally at the level of Gemini 2.5 Flash. Maybe the only Chinese model with a large Western userbase, including businesses, but since it’s self-hosted this doesn’t translate into profits for them yet. On one ad hoc test it was the only Chinese model not to collapse OOD, but the Qwen2.5 corpus was severely contaminated.

  • DeepSeek’s v3.2 is nominally around the same as Qwen. The CCP made them waste months trying Huawei chips.

  • Moonshot’s Kimi-K2-Thinking has some nominally frontier benchmark results and a pleasant style but does not seem frontier.

  • Baidu’s ERNIE 5 is again nominally very strong, a bit better than DeepSeek. This new one seems to not be open.

  • Z’s GLM-4.6 is around the same as Qwen. The product director was involved in the MIT Alignment group.

  • MiniMax’s M2 is nominally better than Qwen, around the same as Grok 4 Fast on the usual superficial benchmarks. It does fine on one very basic red-team test.

  • ByteDance does impressive research in a lagging paradigm, diffusion LMs.

  • There are others but they’re marginal for now.

Other labs

  • Amazon’s Nova Pro is around the level of Llama 3 90B, which in turn is around the level of the original GPT-4. So 2 years behind. But they have their own chip.

  • Microsoft are now mid-training on top of GPT-5. MAI-1-preview is around DeepSeek V3.0 level on Arena. They continue to focus on medical diagnosis. You can request access.

  • Mistral have a reasoning model, Magistral Medium, and released the weights of a little 24B version. It’s a bit worse than Deepseek R1, pass@1.

From the AI village

Black-box safety (understand and control current model behaviour)

Iterative alignment

Nudging base models by optimising their output. Worked on by the post-training teams at most labs, estimating the FTEs at >500 in some sense. Funded by most of the industry.

Iterative alignment at pretrain-time

Guide weights during pretraining.

Some outputs (2)

Iterative alignment at post-train-time

Modify weights after pre-training.

Some outputs (16)

Black-box make-AI-solve-it

Focus on using existing models to improve and align further models.

Some outputs (12)

Inoculation prompting

Prompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.

  • General approach: engineering · Target case: average

  • Some names: Ariana Azarbal, Daniel Tan, Victor Gillioz, Alex Turner, Alex Cloud, Monte MacDiarmid, Daniel Ziegler

  • Critiques: Bellot, Alfour, Gölz, Gaikwad, Hubinger

  • Funded by: most of the industry

Some outputs (4)

Inference-time: In-context learning

Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.

Some outputs (5)

Inference-time: Steering

Manipulate an LLM’s internal representations/​token probabilities without touching weights.

Some outputs (4)

Capability removal: unlearning

Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.

  • Theory of change: If an AI learns dangerous knowledge (e.g., dual-use capabilities like virology or hacking, or knowledge of their own safety controls) or exhibits undesirable behaviors (e.g., memorizing private data), we can specifically erase this “bad” knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer. Alternatively, intervene in pre-training, to prevent the model from learning it in the first place (even when data filtering is imperfect). You could imagine also unlearning propensities to power-seeking, deception, sycophancy, or spite.

  • General approach: cognitive /​ engineering · Target case: pessimistic

  • Orthodox alignment problems: Superintelligence can hack software supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing, Humanlike minds/​goals are not necessarily safe

  • See also: Data filtering · White-box safety (i.e. Interpretability) · Various Redteams

  • Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud, Peter Henderson, Filip Sondej, Fazl Barez

  • Critiques: Existing Large Language Model Unlearning Evaluations Are Inconclusive

  • Funded by: Coefficient Giving, MacArthur Foundation, UK AI Safety Institute (AISI), Canadian AI Safety Institute (CAISI), industry labs (e.g., Microsoft Research, Google)

  • Estimated FTEs: 10-50

Some outputs (18)

Frameworks

Mostly black-box

Mostly white-box

Pre-training interventions

https://​​www.youtube.com/​​watch?v=pfKO4MlvM-Y

Control

If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watching with incriminating behaviour?

  • General approach: engineering /​ behavioral · Target case: worst-case

  • See also: safety cases

  • Some names: Redwood, UK AISI, Deepmind, OpenAI, Anthropic, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen

  • Critiques: Wentworth, Mannheim, Kulveit

  • Estimated FTEs: 5-50

Some outputs (22)

https://​​openai.com/​​index/​​introducing-agentkit/​​

Safeguards (inference-time auxiliaries)

Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.

  • Theory of change: By building a bunch of scalable and hardened things on top of an unsafe model, we can defend against known and unknown attacks, monitor for misuse, and prevent models from causing harm, even if the core model has vulnerabilities.

  • General approach: engineering · Target case: average

  • Orthodox alignment problems: Superintelligence can fool human supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing

  • See also: Various Redteams · Iterative alignment

  • Some names: Mrinank Sharma, Meg Tong, Jesse Mu, Alwin Peng, Julian Michael, Henry Sleight, Theodore Sumers, Raj Agarwal, Nathan Bailey, Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Sahil Verma, Keegan Hines, Jeff Bilmes

  • Critiques: Obfuscated Activations Bypass LLM Latent-Space Defenses

  • Funded by: most of the big labs

  • Estimated FTEs: 100+

Some outputs (6)

Chain of thought monitoring

Supervise an AI’s natural-language (output) “reasoning” to detect misalignment, scheming, or deception, rather than studying the actual internal states.

Some outputs (17)

Model psychology

This section consists of a bottom-up set of things people happen to be doing, rather than a normative taxonomy.

Model values /​ model preferences

Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.

Some outputs (14)

Character training and persona steering

Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).

Some outputs (13)

Emergent misalignment

Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.

  • Theory of change: Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify this correlated representation of the good.

  • General approach: behaviorist science · Target case: pessimistic

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors

  • See also: auditing real models · Pragmatic interpretability

  • Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans

  • Critiques: Emergent Misalignment as Prompt Sensitivity, Go home GPT-4o, you’re drunk

  • Funded by: Coefficient Giving, >$1 million

  • Estimated FTEs: 10-50

Some outputs (17)

The Claude System Prompt by words allocated

Model specs and constitutions

Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.

Some outputs (11)

Model psychopathology

Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.

  • Theory of change: The study of ‘pathological’ phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training-dynamics (compare: the study of aphasia and visual processing disorder in humans plays a key role cognitive science), and in particular for developing a good theory of generalization in LLMS

  • General approach: behaviorist /​ cognitivist · Target case: pessimistic

  • Orthodox alignment problems: Goals misgeneralize out of distribution

  • See also: Emergent misalignment · mechanistic anomaly detection

  • Some names: Janus, Truthful AI, Theia Vogel, Stewart Slocum, Nell Watson, Samuel G. B. Johnson, Liwei Jiang, Monika Jotautaite, Saloni Dash

  • Funded by: Coefficient Giving (via Truthful AI and Interpretability grants)

  • Estimated FTEs: 5-20

Some outputs (9)

Better data

Data filtering

Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.

Some outputs (4)

Hyperstition studies

Study, steer, and intervene on the following feedback loop: “we produce stories about how present and future AI systems behave” → “these stories become training data for the AI” → “these stories shape how AI systems in fact behave”.

  • Theory of change: Measure the influence of existing AI narratives in the training data → seed and develop more salutary ontologies and self-conceptions for AI models → control and redirect AI models’ self-concepts through selectively amplifying certain components of the training data.

  • General approach: cognitive · Target case: average

  • Orthodox alignment problems: Value is fragile and hard to specify

  • See also: Data filtering · active inference · LLM whisperers

  • Some names: Alex Turner, Hyperstition AI, Kyle O’Brien

  • Funded by: Unclear, niche

  • Estimated FTEs: 1-10

Some outputs (4)

Data poisoning defense

Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.

Some outputs (3)

Synthetic data for alignment

Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.

  • Theory of change: We can overcome the bottleneck of human feedback and data by using models to generate vast amounts of high-quality, targeted data for safety, preference tuning, and capability elicitation.

  • General approach: engineering · Target case: average

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Value is fragile and hard to specify

  • See also: Data quality for alignment · Data filtering · scalable oversight · automated alignment research · Weak-to-strong generalization

  • Some names: Mianqiu Huang, Xiaoran Liu, Rylan Schaeffer, Nevan Wichers, Aram Ebtekar, Jiaxin Wen, Vishakh Padmakumar, Benjamin Newman

  • Critiques: Synthetic Data in AI: Challenges, Applications, and Ethical Implications. Sort of Demski.

  • Funded by: Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups.

  • Estimated FTEs: 50-150

Some outputs (8)

Data quality for alignment

Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.

Some outputs (5)

Goal robustness

Mild optimisation

Avoid Goodharting by getting AI to satisfice rather than maximise.

  • Theory of change: If we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.

  • General approach: cognitive · Target case: mixed

  • Orthodox alignment problems: Value is fragile and hard to specify

  • Funded by: Google DeepMind

  • Estimated FTEs: 10-50

Some outputs (4)

RL safety

Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.

Some outputs (11)

Assistance games, assistive agents

Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.

  • Theory of change: Understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.

  • General approach: engineering /​ cognitive · Target case: varies

  • Orthodox alignment problems: Value is fragile and hard to specify, Humanlike minds/​goals are not necessarily safe

  • See also: Guaranteed-Safe AI

  • Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Dylan Hafield-Menell, Stuart Russell

  • Critiques: nice summary of historical problem statements

  • Funded by: Future of Life Institute, Coefficient Giving, Survival and Flourishing Fund, Cooperative AI Foundation, Polaris Ventures

Some outputs (5)

Harm reduction for open weights

Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.

  • Theory of change: Open-weight models allow adversaries to easily remove post-training safety (like refusal training) via simple fine-tuning; by making safety an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information), the safeguards become far more difficult and expensive to remove.

  • General approach: engineering · Target case: average

  • Orthodox alignment problems: Someone else will deploy unsafe superintelligence first

  • See also: Data filtering · Capability removal: unlearning · Data poisoning defense

  • Some names: Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal

  • Funded by: UK AI Safety Institute (AISI), EleutherAI, Coefficient Giving

  • Estimated FTEs: 10-100

Some outputs (5)

The “Neglected Approaches” Approach

Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.

  • Theory of change: Empirical search for “negative alignment taxes” (prioritizing methods that simultaneously enhance alignment and capabilities)

  • General approach: engineering · Target case: average

  • Orthodox alignment problems: Someone else will deploy unsafe superintelligence first

  • See also: Iterative alignment · automated alignment research · Beijing Key Laboratory of Safe AI and Superalignment · Aligned AI

  • Some names: AE Studio, Gunnar Zarncke, Cameron Berg, Michael Vaiana, Judd Rosenblatt, Diogo Schwerz de Lucena

  • Critiques:

  • Funded by: AE Studio

  • Estimated FTEs: 15

Some outputs (3)

White-box safety (i.e. Interpretability)

This section isn’t very conceptually clean. See the Open Problems paper or Deepmind for strong frames which are not useful for descriptive purposes.

Reverse engineering

Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model’s internal algorithm.

Some outputs (33)

In weights-space

In activations-space

Extracting latent knowledge

Identify and decoding the “true” beliefs or knowledge represented inside a model’s activations, even when the model’s output is deceptive or false.

  • Theory of change: Powerful models may know things they do not say (e.g. that they are currently being tested). If we can translate this latent knowledge directly from the model’s internals, we can supervise them reliably even when they attempt to deceive human evaluators or when the task is too difficult for humans to verify directly.

  • General approach: cognitivist science · Target case: worst-case

  • Orthodox alignment problems: Superintelligence can fool human supervisors

  • See also: AI explanations of AIs · Heuristic explanations · Lie and deception detectors

  • Some names: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Alexander Pan, Lijie Chen, Jacob Steinhardt, Javier Ferrando, Oscar Obeso, Collin Burns, Paul Christiano

  • Critiques: A Problem to Solve Before Building a Deception Detector

  • Funded by: Open Philanthropy, Anthropic, NSF, various academic grants

  • Estimated FTEs: 20-40

Some outputs (9)

Lie and deception detectors

Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in their definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.

Some outputs (11)

Model diffing

Understand what happens when a model is finetuned, what the “diff” between the finetuned and the original model consists in.

  • Theory of change: By identifying the mechanistic differences between a base model and its fine-tune (e.g., after RLHF), maybe we can verify that safety behaviors are robustly “internalized” rather than superficially patched, and detect if dangerous capabilities or deceptive alignment have been introduced without needing to re-analyze the entire model. The diff is also much smaller, since most parameters don’t change, which means you can use heavier methods on them.

  • General approach: cognitive · Target case: pessimistic

  • Orthodox alignment problems: Value is fragile and hard to specify

  • See also: Sparse Coding · Reverse engineering

  • Some names: Julian Minder, Clément Dumas, Neel Nanda, Trenton Bricken, Jack Lindsey

  • Funded by: various academic groups, Anthropic, Google DeepMind

  • Estimated FTEs: 10-30

Some outputs (9)

Sparse Coding

Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.

Some outputs (44)

Causal Abstractions

Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.

Some outputs (3)

Data attribution

Quantifies the influence of individual training data points on a model’s specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.

  • Theory of change: By attributing harmful, biased, or unaligned behaviors to specific training examples, researchers can audit proprietary models, debug training data, enable effective data deletion/​unlearning

  • General approach: behavioural · Target case: average

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify

  • See also: Data quality for alignment

  • Some names: Roger Grosse, Philipp Alexander Kreer, Jin Hwa Lee, Matthew Smith, Abhilasha Ravichander, Andrew Wang, Jiacheng Liu, Jiaqi Ma, Junwei Deng, Yijun Pan, Daniel Murfet, Jesse Hoogland

  • Funded by: Various academic groups

  • Estimated FTEs: 30-60

Some outputs (12)

Pragmatic interpretability

Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.

  • Theory of change: By applying interpretability skills to concrete problems, researchers can rapidly develop monitoring and control tools (e.g., steering vectors or probes) that have immediate, measurable impact on real-world safety issues like detecting hidden goals or emergent misalignment.

  • General approach: cognitive · Target case: mixed

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Goals misgeneralize out of distribution

  • See also: Reverse engineering · Concept-based interpretability

  • Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau

  • Funded by: Google DeepMind, Anthropic, various academic groups

  • Estimated FTEs: 30-60

Some outputs (3)

Other interpretability

Interpretability that does not fall well into other categories.

Some outputs (19)

Learning dynamics and developmental interpretability

Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model’s training and in-context learning phases.

  • Theory of change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory); by catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene to prevent deceptiveness or misaligned values as early as possible.

  • General approach: cognitivist science · Target case: worst-case

  • Orthodox alignment problems: Goals misgeneralize out of distribution

  • See also: Reverse engineering · Sparse Coding · ICL transience

  • Some names: Timaeus, Jesse Hoogland, George Wang, Daniel Murfet, Stan van Wingerden, Alexander Gietelink Oldenziel

  • Critiques: Vaintrob, Joar Skalse (2023)

  • Funded by: Manifund, Survival and Flourishing Fund, EA Funds

  • Estimated FTEs: 10-50

Some outputs (14)

Representation structure and geometry

What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?

  • Theory of change: Get scalable unsupervised methods for finding structure in representations and interpreting them, then using this to e.g. guide training.

  • General approach: cognitivist science · Target case: mixed

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors

  • See also: Concept-based interpretability · computational mechanics · feature universality · Natural abstractions · Causal Abstractions

  • Some names: Simplex, Insight + Interaction Lab, Paul Riechers, Adam Shai, Martin Wattenberg, Blake Richards, Mateusz Piotrowski

  • Funded by: Various academic groups, Astera Institute, Coefficient Giving

  • Estimated FTEs: 10-50

Some outputs (13)

Human inductive biases

Discover connections deep learning AI systems have with human brains and human learning processes. Develop an ‘alignment moonshot’ based on a coherent theory of learning which applies to both humans and AI systems.

  • Theory of change: Humans learn trust, honesty, self-maintenance, and corrigibility; if we understand how they do maybe we can get future AI systems to learn them.

  • General approach: cognitive · Target case: pessimistic

  • Orthodox alignment problems: Goals misgeneralize out of distribution

  • See also: active learning · ACS research

  • Some names: Lukas Muttenthaler, Quentin Delfosse

  • Funded by: Google DeepMind, various academic groups

  • Estimated FTEs: 4

Some outputs (6)

Concept-based interpretability

Monitoring concepts

Identifies directions or subspaces in a model’s latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.

Some outputs (11)

Activation engineering

Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.

  • Theory of change: Test interpretability theories by intervening on activations; find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of finetuning. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.

  • General approach: engineering /​ cognitive · Target case: average

  • Orthodox alignment problems: Value is fragile and hard to specify

  • See also: Sparse Coding

  • Some names: Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall

  • Critiques: Understanding (Un)Reliability of Steering Vectors in Language Models

  • Funded by: Coefficient Giving, Anthropic

  • Estimated FTEs: 20-100

Some outputs (15)

Safety by construction

Approaches which try to get assurances about system outputs while still being scalable.

Guaranteed-Safe AI

Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.

  • Theory of change: Various, including:

i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;

ii) secure containers: create a ‘gatekeeper’ system that can act as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through.

(Notable for not requiring that we solve ELK; does require that we solve ontology though)

  • General approach: cognitive /​ engineering · Target case: worst-case

  • Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake, A boxed AGI might exfiltrate itself by steganography, spearphishing

  • See also: Towards Guaranteed Safe AI · Standalone World-Models · Scientist AI · Safeguarded AI · Asymptotic guarantees · Open Agency Architecture · SLES · program synthesis · Scalable formal oversight

  • Some names: ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate

  • Critiques: Zvi, Gleave, Dickson, Greenblatt

  • Funded by: Manifund, ARIA, Coefficient Giving, Survival and Flourishing Fund, Mila /​ CIFAR

  • Estimated FTEs: 10-100

Some outputs (5)

Scientist AI

Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs

  • Theory of change: Developing non-agentic ‘Scientist AI’ allows us to: (i) reap the benefits of AI progress while (ii) avoiding the inherent risks of agentic systems. These systems can also (iii) provide a useful guardrail to protect us from unsafe agentic AIs by double-checking actions they propose, and (iv) help us more safely build agentic superintelligent systems.

  • General approach: cognitivist science · Target case: pessimistic

  • Orthodox alignment problems: Pivotal processes require dangerous capabilities, Goals misgeneralize out of distribution, Instrumental convergence

  • See also: JEPA · oracles

  • Some names: Yoshua Bengio, Younesse Kaddar

  • Critiques: Hard to find, but see Raymond Douglas’ comment, Karnofsky-Soares discussion. Perhaps also Predict-O-Matic.

  • Funded by: ARIA, Gates Foundation, Future of Life Institute, Coefficient Giving, Jaan Tallinn, Schmidt Sciences

  • Estimated FTEs: 1-10

Some outputs (2)

Brainlike-AGI Safety

Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let’s figure out what those circuits are and how they work; this will involve symbol grounding. “a yet-to-be-invented variation on actor-critic model-based reinforcement learning”

  • Theory of change: Fairly-direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; “understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients”.

  • General approach: cognitivist science · Target case: worst-case

  • Agenda statement: My AGI safety research—2025 review, ’26 plans

  • Some names: Steve Byrnes

  • Critiques: Tsvi BT

  • Funded by: Astera Institute

  • Estimated FTEs: 1-5

Some outputs (6)

Make AI solve it

Weak-to-strong generalization

Use weaker models to supervise and provide a feedback signal to stronger models.

Some outputs (4)

Supervising AIs improving AIs

Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees

  • Theory of change: Early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading. Left unchecked this will likely cause problems, so supervision mechanisms are needed to help ensure the AI self-improvement remains legible.

  • General approach: behavioral · Target case: pessimistic

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • Some names: Roman Engeler, Akbir Khan, Ethan Perez

  • Critiques: Automation collapse, Great Models Think Alike and this Undermines AI Oversight

  • Funded by: Long-Term Future Fund, lab funders

  • Estimated FTEs: 1-10

Some outputs (8)

AI explanations of AIs

Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that “searches” for a specified behaviour in frontier models.

  • Theory of change: Use AI to help improve interp and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.

  • General approach: cognitive · Target case: pessimistic

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: White-box safety (i.e. Interpretability)

  • Some names: Transluce, Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann, Robert Friel

  • Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba

  • Estimated FTEs: 15-30

Some outputs (5)

Debate

In the limit, it’s easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.

  • Theory of change: “Give humans help in supervising strong agents” + “Align explanations with the true reasoning process of the agent” + “Red team models to exhibit failure modes that don’t occur in normal use” are necessary but probably not sufficient for safe AGI.

  • General approach: engineering /​ cognitive · Target case: worst-case

  • Orthodox alignment problems: Value is fragile and hard to specify, Superintelligence can fool human supervisors

  • Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras, UK AISI (Benjamin Hilton)

  • Critiques: The limits of AI safety via debate (2022)

  • Funded by: Google, others

Some outputs (6)

https://​​www.semafor.com/​​article/​​11/​​05/​​2025/​​microsoft-superintelligence-team-promises-to-keep-humans-in-charge

LLM introspection training

Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use its own ‘introspective’ access

  • Theory of change: Use the resulting LLMs as powerful dimensionality reduction, explaining internals in a way that’s distinct from interpretability methods and CoT. Distilling self-explanation into the model should lead to the skill being scalable, since self-explanation skill advancement will feed off general-intelligence advancement.

  • General approach: cognitivist science · Target case: mixed

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Transluce · Anthropic

  • Some names: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas, Jack Lindsey

  • Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba

  • Estimated FTEs: 2-20

Some outputs (2)

Theory

Develop a principled scientific understanding that will help us reliably understand and control current and future AI systems.

Agent foundations

Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.

  • Theory of change: Rigorously understand optimization processed and agents, and what it means for them to be aligned in a substrate independent way → identify impossibility results and necessary conditions for aligned optimizer systems → use this theoretical understanding to eventually design safe architectures that remain stable and safe under self-reflection

  • General approach: cognitive · Target case: worst-case

  • Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution

  • See also: Aligning what? · Tiling agents · Dovetail

  • Some names: Abram Demski, Alex Altair, Sam Eisenstat, Thane Ruthenis, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino

Some outputs (10)

Tiling agents

An aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/​approaches that prevent it from happening.

  • Theory of change: Build enough theoretical basis through various approaches such that AI systems we create are capable of self-modification while preserving goals.

  • General approach: cognitivist science · Target case: worst-case

  • Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution

  • See also: Agent foundations

  • Some names: Abram Demski

  • Estimated FTEs: 1-10

Some outputs (4)

High-Actuation Spaces

Mech interp and alignment assume a stable “computational substrate” (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a “telic DAG” which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.

  • Theory of change: Sufficiently complex mindlike entities can alter their goals in ways that cannot be predicted or accounted for under substrate-dependent descriptions of the kind sought in mechanistic interpretability. use the telic DAG to define a method analogous to factoring a causal DAG.

  • General approach: maths /​ philosophy · Target case: pessimistic

  • See also: Live theory · MoSSAIC · Topos Institute · Agent foundations

  • Some names: Sahil K, Matt Farr, Aditya Arpitha Prasad, Chris Pang, Aditya Adiga, Jayson Amati, Steve Petersen, Topos, T J

  • Estimated FTEs: 1-10

Some outputs (7)

Asymptotic guarantees

Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory, game theory, learning theory and other areas to both improve asymptotic guarantees and develop ways of showing convergence.

  • Theory of change: Formal verification may be too hard. Make safety cases stronger by modelling their processes and proving that they would work in the limit.

  • General approach: cognitive · Target case: pessimistic

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors

  • See also: Debate · Guaranteed-Safe AI · Control

  • Some names: AISI, Jacob Pfau, Benjamin Hilton, Geoffrey Irving, Simon Marshall, Will Kirby, Martin Soto, David Africa, davidad

  • Critiques: Self-critique in UK AISI’s Alignment Team: Research Agenda

  • Funded by: AISI

  • Estimated FTEs: 5 − 10

Some outputs (4)

Heuristic explanations

Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations” and use them to predict when novel input will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).

  • Theory of change: The current goalpost is methods whose reasoned predictions about properties of a neural network’s outputs distribution (for a given inputs distribution) are certifiably at least as accurate as estimations via sampling. If successful for safety-relevant properties, this should allow for automated alignment methods that are both human-legible and worst-case certified, as well more efficient than sampling-based methods in most cases.

  • General approach: cognitive /​ maths/​philosophy · Target case: worst-case

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can hack software supervisors

  • See also: ARC Theory · ELK · mechanistic anomaly detection · Acorn · Guaranteed-Safe AI

  • Some names: Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson

  • Critiques: Matolcsi

  • Estimated FTEs: 1-10

Some outputs (5)

Corrigibility

Behavior alignment theory

Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.

  • Theory of change: Figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold → test these hypotheses where feasible → design training environments that lead to more salutary properties.

  • General approach: maths /​ philosophy · Target case: worst-case

  • Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence

  • See also: Agent foundations · Control

  • Some names: Ram Potham, Michael K. Cohen, Max Harms/​Raelifin, John Wentworth, David Lorell, Elliott Thornley

  • Critiques: Ryan Greenblatt’s criticism of one behavioural proposal

  • Estimated FTEs: 1-10

Some outputs (10)

Other corrigibility

Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors

  • Theory of change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following doesn’t buy robust corrigibility + building strong tripwires/​diagnostics for scheming and Goodharting thus reduces risks on the likely default path.

  • General approach: varies · Target case: pessimistic

  • Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence

  • See also: Behavior alignment theory

  • Some names: Jeremy Gillen

  • Estimated FTEs: 1-10

Some outputs (9)

Ontology Identification

Natural abstractions

Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.

  • Theory of change: Understand the concepts a system’s understanding is structured with and use them to inspect its “alignment/​safety properties” and/​or “retarget its search”, i.e. identify utility-function-like components inside an AI and replacing calls to them with calls to “user values” (represented using existing abstractions inside the AI).

  • General approach: cognitive · Target case: worst-case

  • Orthodox alignment problems: Instrumental convergence, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake

  • See also: Causal Abstractions · representational alignment · convergent abstractions · feature universality · Platonic representation hypothesis · microscope AI

  • Some names: John Wentworth, Paul Colognese, David Lorrell, Sam Eisenstat, Fernando Rosas

  • Critiques: Chan et al (2023), Soto, Harwood, Soares (2023)

  • Estimated FTEs: 1-10

Some outputs (10)

The Learning-Theoretic Agenda

Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology and ours; produce formal desiderata for a training setup that produces coherent AGIs similar to (our model of) an aligned agent

  • Theory of change: Fix formal epistemology to work out how to avoid deep training problems

  • General approach: cognitive · Target case: worst-case

  • Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Humans cannot be first-class parties to a superintelligent value handshake

  • Some names: Vanessa Kosoy, Diffractor, Gergely Szücs

  • Critiques: Matolcsi

  • Funded by: Survival and Flourishing Fund, ARIA, UK AISI, Coefficient Giving

  • Estimated FTEs: 3

Some outputs (6)

Multi-agent first

Aligning to context

Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility functions.

  • Theory of change: “Many classical problems in AGI alignment are downstream of a type error about human values.” Operationalizing a correct view of human values—one that treats human values as impossible or impractical to abstract from concrete practices—will unblock value fragility, goal-misgeneralization, instrumental convergence, and pivotal-act specification.

  • General approach: behavioural · Target case: mixed

  • Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes

  • See also: Aligning what? · Aligned to who?

  • Some names: Full Stack Alignment, Meaning Alignment Institute, Plurality Institute, Tan Zhi-Xuan, Matija Franklin, Ryan Lowe, Joe Edelman, Oliver Klingefjord

  • Funded by: ARIA, OpenAI, Survival and Flourishing Fund

  • Estimated FTEs: 5

Some outputs (8)

Aligning to the social contract

Generate AIs’ operational values from ‘social contract’-style ideal civic deliberation formalisms and their consequent rulesets for civic actors

  • Theory of change: Formalize and apply the liberal tradition’s project of defining civic principles separable from the substantive good, aligning our AIs to civic principles that bypass fragile utility-learning and intractable utility-calculation

  • General approach: cognitive · Target case: mixed

  • Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Instrumental convergence, Humanlike minds/​goals are not necessarily safe, Fair, sane pivotal processes

  • See also: Aligning to context · Aligning what?

  • Some names: Gillian Hadfield, Tan Zhi-Xuan, Sydney Levine, Matija Franklin, Joshua B. Tenenbaum

  • Funded by: Deepmind, Macroscopic Ventures

  • Estimated FTEs: 5 − 10

Some outputs (8)

Theory for aligning multiple AIs

Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/​predict the collective and individual behaviours of AI agents in multi-agent scenarios.

  • Theory of change: While traditional AGI safety focuses on idealized decision-theory and individual agents, it’s plausible that strategic AI agents will first emerge (or are emerging now) in a complex, multi-AI strategic landscape. We need granular, realistic formal models of AIs’ strategic interactions and collective dynamics to understand this future.

  • General approach: cognitive · Target case: mixed

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Tools for aligning multiple AIs · Aligning what?

  • Some names: Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Vojta Kovarik, Nathaniel Sauerberg, ACS Research, Jan Kulveit, Richard Ngo, Emmett Shear, Softmax, Full Stack Alignment, AI Objectives Institute, Sahil, TJ, Andrew Critch

  • Funded by: SFF, CAIF, Deepmind, Macroscopic Ventures

  • Estimated FTEs: 10

Some outputs (12)

Tools for aligning multiple AIs

Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.

  • Theory of change: Addressing multi-agent AI dynamics is key for aligning near-future agents and their impact on the world. Feedback loops from multi-agent dynamics can radically change the future AI landscape, and require a different toolset from model psychology to audit and control.

  • General approach: engineering /​ behavioral · Target case: mixed

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Theory for aligning multiple AIs · Aligning what?

  • Some names: Andrew Critch, Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Gillian Hadfield, Nathaniel Sauerberg, Zhijing Jin

  • Funded by: Coefficient Giving, Deepmind, Cooperative AI Foundation

  • Estimated FTEs: 10 − 15

Some outputs (12)

Aligned to who?

Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to “humanity”

  • Theory of change: use democratic/​pluralist/​context-sensitive principles to guide AI development, alignment, and deployment somehow. Doing it as an afterthought in post-training or the spec isn’t good enough. Continuously shape AI’s social and technical feedback loop on the road to AGI

  • General approach: behavioral · Target case: average

  • Orthodox alignment problems: Value is fragile and hard to specify, Fair, sane pivotal processes

  • See also: Aligning what? · Aligning to context

  • Some names: Joel Z. Leibo, Divya Siddarth, Séb Krier, Luke Thorburn, Seth Lazar, AI Objectives Institute, The Collective Intelligence Project, Vincent Conitzer

  • Funded by: Future of Life Institute, Survival and Flourishing Fund, Deepmind, CAIF

  • Estimated FTEs: 5 − 15

Some outputs (9)

Aligning what?

Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/​constitutive processes as subject to alignment

  • Theory of change: Modeling multiple reality-shaping processes above and below the level of the individual AI, some of which are themselves quasi-agential (e.g. cultures) or intelligence-like (e.g. markets), will develop AI alignment into a mature science for managing the transition to an AGI civilization

  • General approach: behavioral /​ cognitive · Target case: mixed

  • Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes

  • See also: Theory for aligning multiple AIs · Aligning to context · Aligned to who?

  • Some names: Richard Ngo, Emmett Shear, Softmax, Full Stack Alignment, AI Objectives Institute, Sahil, TJ, Andrew Critch, ACS Research, Jan Kulveit

  • Funded by: Future of Life Institute, Emmett Shear

  • Estimated FTEs: 5-10

Some outputs (13)

Teeselink

Evals

AGI metrics

Evals with the explicit aim of measuring progress towards full human-level generality.

Some outputs (5)

Capability evals

Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.

Some outputs (34)

Autonomy evals

Measure an AI’s ability to act autonomously to complete long-horizon, complex tasks.

Some outputs (13)

WMD evals (Weapons of Mass Destruction)

Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.

  • Theory of change: By benchmarking and tracking AI’s knowledge of biology and chemistry, we can identify when models become capable of accelerating WMD development or misuse, allowing for timely intervention.

  • General approach: behaviorist science · Target case: pessimistic

  • See also: Capability evals · Autonomy evals · Various Redteams

  • Some names: Lennart Justen, Haochen Zhao, Xiangru Tang, Ziran Yang, Aidan Peppin, Anka Reuel, Stephen Casper

  • Critiques: The Reality of AI and Biorisk

  • Funded by: Open Philanthropy, UK AI Safety Institute (AISI), frontier labs, Scale AI, various academic institutions (Peking University, Yale, etc.), Meta

  • Estimated FTEs: 10-50

Some outputs (6)

Situational awareness and self-awareness evals

Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.

  • Theory of change: If an AI can distinguish between evaluation and deployment (“evaluation awareness”), it might hide dangerous capabilities (scheming/​sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.

  • General approach: behaviorist science · Target case: worst-case

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Sandbagging evals · Various Redteams · Model psychology

  • Some names: Jan Betley, Xuchan Bao, Martín Soto, Mary Phuong, Roland S. Zimmermann, Joe Needham, Giles Edkins, Govind Pimpale, Kai Fronsdal, David Lindner, Lang Xiong, Xiaoyan Bai

  • Critiques: Lessons from a Chimp: AI “Scheming” and the Quest for Ape Language, It’s hard to make scheming evals look realistic for LLMs

  • Funded by: frontier labs (Google DeepMind, Anthropic), Open Philanthropy, The Audacious Project, UK AI Safety Institute (AISI), AI Safety Support, Apollo Research, METR

  • Estimated FTEs: 30-70

Some outputs (11)

Steganography evals

evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.

  • Theory of change: if models can use steganography, they could hide deceptive reasoning, bypassing safety monitoring and control measures. By evaluating this capability, we can assess the risk of a model fooling its supervisors.

  • General approach: behavioral · Target case: worst-case

  • Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Superintelligence can fool human supervisors

  • See also: AI deception evals · Chain of thought monitoring

  • Some names: Antonio Norelli, Michael Bronstein

  • Critiques: Chain-of-Thought Is Already Unfaithful (So Steganography is Irrelevant): Reasoning Models Don’t Always Say What They Think.

  • Funded by: Anthropic (and its general funders, e.g., Google, Amazon)

  • Estimated FTEs: 1-10

Some outputs (5)

AI deception evals

research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.

  • Theory of change: proactively discover, evaluate, and understand the mechanisms of AI deception (e.g., alignment faking, manipulation, agentic deception) to prevent models from fooling human supervisors and causing harm.

  • General approach: behavioral /​ engineering · Target case: worst-case

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Situational awareness and self-awareness evals · Steganography evals · Sandbagging evals · Chain of thought monitoring

  • Some names: Cadenza, Fred Heiding, Simon Lermen, Andrew Kao, Myra Cheng, Cinoo Lee, Pranav Khadpe, Satyapriya Krishna, Andy Zou, Rahul Gupta

  • Critiques: A central criticism is that the evaluation scenarios are “artificial and contrived”. the void and Lessons from a Chimp argue this research is “overattributing human traits” to models.

  • Funded by: Labs, academic institutions (e.g., Harvard, CMU, Barcelona Institute of Science and Technology), NSFC, ML Alignment Theory & Scholars (MATS) Program, FAR AI

  • Estimated FTEs: 30-80

Some outputs (13)

AI scheming evals

Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.

  • Theory of change: Robust evaluations must move beyond checking final outputs and probe the model’s reasoning to verify that alignment is genuine, not faked, because capable models are capable of strategically concealing misaligned goals (scheming) to pass standard behavioural evaluations.

  • General approach: behavioral /​ engineering · Target case: pessimistic

  • Orthodox alignment problems: Superintelligence can fool human supervisors

  • See also: AI deception evals · Situational awareness and self-awareness evals

  • Some names: Bronson Schoen, Alexander Meinke, Jason Wolfe, Mary Phuong, Rohin Shah, Evgenia Nitishinskaya, Mikita Balesni, Marius Hobbhahn, Jérémy Scheurer, Wojciech Zaremba, David Lindner

  • Critiques: No, LLMs are not “scheming”

  • Funded by: OpenAI, Anthropic, Google DeepMind, Open Philanthropy

  • Estimated FTEs: 30-60

Some outputs (7)

Sandbagging evals

Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.

  • Theory of change: If models can distinguish between evaluation and deployment contexts (“evaluation awareness”), they might learn to “sandbag” or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.

  • General approach: behaviorist science · Target case: pessimistic

  • Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: AI deception evals · Situational awareness and self-awareness evals · Various Redteams

  • Some names: Teun van der Weij, Cameron Tice, Chloe Li, Johannes Gasteiger, Joseph Bloom, Joel Dyer

  • Critiques: The main external critique, from sources like “the void” and “Lessons from a Chimp”, is that this research “overattribut[es] human traits” to models. It argues that what’s being measured isn’t genuine sandbagging but models “playing-along-with-drama behaviour” in response to “artificial and contrived” evals.

  • Funded by: Anthropic (and its funders, e.g., Google, Amazon), UK Government (funding the AI Security Institute)

  • Estimated FTEs: 10-50

Some outputs (9)

Self-replication evals

evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.

  • Theory of change: if AI agents gain the ability to self-replicate, they could proliferate uncontrollably, making them impossible to shut down. By measuring this capability with benchmarks like RepliBench, we can identify when models cross this dangerous “red line” and implement controls before losing containment.

  • General approach: behaviorist science · Target case: worst-case

  • Orthodox alignment problems: Instrumental convergence, A boxed AGI might exfiltrate itself by steganography, spearphishing

  • See also: Autonomy evals · Situational awareness and self-awareness evals

  • Some names: Sid Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney, Xudong Pan, Jiarun Dai, Yihe Fan

  • Critiques: AI Sandbagging

  • Funded by: UK Government (via UK AI Safety Institute)

  • Estimated FTEs: 10-20

Some outputs (3)

Various Redteams

attack current models and see what they do /​ deliberately induce bad things on current frontier models to test out our theories /​ methods.

  • Theory of change: to ensure models are safe, we must actively try to break them. By developing and applying a diverse suite of attacks (e.g., in novel domains, against agentic systems, or using automated tools), researchers can discover vulnerabilities, specification gaming, and deceptive behaviors before they are exploited, thereby informing the development of more robust defenses.

  • General approach: behaviorist science · Target case: average

  • Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Goals misgeneralize out of distribution

  • See also: Other evals

  • Some names: Ryan Greenblatt, Benjamin Wright, Aengus Lynch, John Hughes, Samuel R. Bowman, Andy Zou, Nicholas Carlini, Abhay Sheshadri

  • Critiques: Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations, Red Teaming AI Red Teaming.

  • Funded by: Frontier labs (Anthropic, OpenAI, Google), government (UK AISI), Open Philanthropy, LTFF, academic grants.

  • Estimated FTEs: 100+

Some outputs (57)

Other evals

A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.

  • Theory of change: By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.

  • General approach: behaviorist science · Target case: average

  • See also: other more specific sections on evals

  • Some names: Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper

  • Critiques: The Unreliability of Evaluating Cultural Alignment in LLMs, The Leaderboard Illusion

  • Funded by: Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization for the MASK benchmark), academic institutions. N/​A (as a discrete amount). This work is part of the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.

  • Estimated FTEs: 20-50

Some outputs (20)


Orgs without public outputs this year

We are not aware of public technical AI safety output with these agendas and organizations, though they are active otherwise.

Graveyard (known to be inactive)

Changes this time

  • A few major changes to the taxonomy: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”. (We did try out an automated clustering but it wasn’t very good.)

  • The agendas are in general less charisma-based and more about solution type.

  • We did a systematic Arxiv scrape on the word “alignment” (and filtered out the sequence indexing papers that fell into this pipeline). “Steerability” is one competing term used by academics.

  • We scraped >3000 links (arXiv, LessWrong, several alignment publication lists, blogs and conferences), conservatively filtering and pre-categorizing them with a LLM pipeline. All curated later, many more added manually.

  • This review has ~800 links compared to ~300 in 2024 and ~200 in 2023. We looked harder and the field has grown.

  • We don’t collate public funding figures.

  • New sections: “Labs”, “Multi-agent First”, “Better data”, “Model specs”, “character training” and “representation geometry”. ”Evals” is so massive it gets a top-level section.

Methods

Structure

We have again settled for a tree data structure for this post – but people and work can appear in multiple nodes so it’s not a dumb partition. Richer representation structures may be in the works.

The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.)

Scope

30th November 2024 – 30th November 2025 (with a few exceptions).

We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and like products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than say, underground gdoc samizdat or Discords).

We only use public information, so we are off by some additional unknown factor.

We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).

Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.

Paper sources

  • All arXiv papers with “AI alignment”, “AI safety”, or “steerability” in the abstract or title; all papers of ~120 AI safety researchers

  • All Alignment Forum posts and all LW posts under “AI

  • Gasteiger’s links, Paleka’s links, Lenz’s links, Zvi’s links

  • Ad hoc Twitter for a year, several conference pages and workshops

  • AI scrapes miss lots of things. We did a proper pass with a software scraper of over 3000 links collected from the above and LLM crawl of some of the pages, and then an LLM pass to pre-filter the links for relevance and pre-assign the links to agendas and areas, but this also had systematic omissions. We ended up doing a full manual pass over a conservatively LLM-pre-filtered links and re-classifying the links and papers. The code and data can be found here, including the 3300 collected candidate links. We are not aware of any proper studies of “LLM laziness” but it’s known amongst power users of copilots.

  • For finding critiques we used LW backlinks, Google Scholar cited-by, manual search, collected links, and Ahrefs. Technical critiques are somewhat rare, though, and even then our coverage here is likely severely lacking. We generally do not include social network critiques (mostly due to scope).

  • Despite this effort we will not have included all relevant papers and names. The omission of a paper or a researcher is not strong negative evidence about their relevance.

Processing

  • Collecting links throughout the year and at project start. Skimming papers, staring at long lists.

  • We drafted a taxonomy of research agendas. Based on last year’s list, our expertise and the initial paper collection, though we changed the structure to accommodate shifts in the domain: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”.

  • At around 300 links collected manually (and growing fast), we decided to implement simple pipelines for crawling, scraping and LLM metadata extraction, pre-filtering and pre-classification into agendas, as well as other tasks – including final formatting later. The use of LLMs was limited to one simple task at a time, and the results were closely watched and reviewed. Code and data here.

  • We tried getting the AI to update our taxonomy bottom-up to fit the body of papers, but the results weren’t very good. Though we are looking at some advanced options (specialized embedding or feature extraction and clustering or t-SNE etc.)

  • Work on the ~70 agendas was distributed among the team. We ended up making many local changes to the taxonomy, esp. splitting up and merging agendas. The taxonomy is specific to this year, and will need to be adapted in coming years.

  • We moved any agendas without public outputs this year to the bottom, and the inactive ones to the Graveyard. For most of them, we checked with people from the agendas for outputs or work we may have missed.

  • What started as a brief summary editorial grew into its own thing (6000w).

  • We asked 10 friends in AI safety to review the ~80 page draft. After editing and formatting, we asked 50 technical AI safety researchers for a quick review focused on their expertise.

  • The field is growing at around 20% a year. There will come a time that this list isn’t sensible to do manually even with the help of LLMs (at this granularity anyway). We may come up with better alternatives than lists and posts by then, though.

Other classifications

We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.

  • Which deep orthodox subproblems could it ideally solve? (via Davidad)

  • The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)

    • “optimistic-case”: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc

    • pessimistic-case: if we’re in-between the above and the below

    • worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness

  • The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)

    • engineering: iterating over outputs

    • behavioural: understanding the input-output relationship

    • cognitive: understanding the algorithms

    • maths/​philosophy: providing concepts for the other approaches

Acknowledgments

These people generously helped with the review by providing expert feedback, literature sources, advice, or otherwise. Any remaining mistakes remain ours.

Thanks to Neel Nanda, Owain Evans, Stephen Casper, Alex Turner, Caspar Oesterheld, Steve Byrnes, Adam Shai, Séb Krier, Vanessa Kosoy, Nora Ammann, David Lindner, John Wentworth, Vika Krakovna, Filip Sondej, JS Denain, Jan Kulveit, Mary Phuong, Linda Linsefors, Yuxi Liu, Ben Todd, Ege Erdil, Tan Zhi-Xuan, Jess Riedel, Mateusz Bagiński, Roland Pihlakas, Walter Laurito, Vojta Kovařík, David Hyland, plex, Shoshannah Tekofsky, Fin Moorhouse, Misha Yagudin, Nandi Schoots, Nuno Sempere, and others for helpful comments.

Thanks to QURI and Ozzie Gooen for creating a website for this review.


Appendix: Other reviews and taxonomies


Epigram

No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.

– Sergey Brin

Over the decade I’ve spent working on AI safety, I’ve felt an overall trend of divergence; research partnerships starting out with a sense of a common project, then slowly drifting apart over time… eventually, two researchers are going to have some deep disagreement in matters of taste, which sends them down different paths.

Until the spring of this year, that is… something seemed to shift, subtly at first. After I gave a talk—roughly the same talk I had been giving for the past year—I had an excited discussion about it with Scott Garrabrant. Looking back, it wasn’t so different from previous chats we had had, but the impact was different; it felt more concrete, more actionable, something that really touched my research rather than remaining hypothetical. In the subsequent weeks, discussions with my usual circle of colleagues took on a different character—somehow it seemed that, after all our diverse explorations, we had arrived at a shared space.

– Abram Demski

Brought to you by the Arb Corporation