Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)
This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on our website.
It’s shallow in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, as well as a year’s worth of Twitter.
It is substantially a list of lists structuring around 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what’s happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn’t pan out.
Here, “AI safety” means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it’s capability restraint, instruction-following, value alignment, control, and risk awareness work.
We don’t cover security or resilience at all.
We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AIrights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Even ignoring all of that as we do, it’s still too long to read. Here’s a spreadsheet version and the repo including the data. Methods down the bottom. Gavin’s takes outgrew this post and became its own thing.
If we missed something big or got something wrong, please comment, we will edit it in.
An Arb Research project. Work funded by OpenPhil Coefficient Giving.
Safety teams: Alignment, Safety Systems (Interpretability, Safety Oversight, Pretraining Safety, Robustness, Safety Research, Trustworthy AI, new Misalignment Research team coming), Preparedness, Model Policy, Safety and Security Committee, Safety Advisory Group. The Persona Features paper had a distinct author list. No named successor to Superalignment.
Public alignment agenda:None. Boaz Barak offers personal views.
Some names: Johannes Heidecke, Boaz Barak, Mia Glaese, Jenny Nitishinskaya, Lama Ahmad, Naomi Bashkansky, Miles Wang, Wojciech Zaremba, David Robinson, Zico Kolter, Jerry Tworek, Eric Wallace, Olivia Watkins, Kai Chen, Chris Koch, Andrea Vallone, Leo Gao
Critiques:Stein-Perlman, Stewart, underelicitation, Midas, defense, Carlsmith on labs in general. It’s difficult to model OpenAI as a single agent: “ALTMAN: I very rarely get to have anybody work on anything. One thing about researchers is they’re going to work on what they’re going to work on, and that’s that.”
Some names: Rohin Shah, Allan Dafoe, Anca Dragan, Alex Irpan, Alex Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Sebastian Farquhar, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Zac Kenton, Four Flynn, Jonathan Richens, Lewis Smith, Janos Kramar, Matthew Rahtz, Mary Phuong, Erik Jenner
Some names: Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Jan Leike, Ethan Perez, Jack Lindsey, Amanda Askell, Kyle Fish, Sara Price, Jon Kutasov, Minae Kwon, Monty Evans, Richard Dargan, Roger Grosse, Ben Levinstein, Joseph Carlsmith, Joe Benton
Some names: Shuchao Bi, Hongyuan Zhan, Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Jason Weston, ShengYun Peng, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Evangelia Spiliopoulou, Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda, Adina Williams
The Chinese companies don’tattempt to be safe, often not even in the prosaic safeguards sense. They drop the weights immediately after post-training finishes. They’re mostly open weights and closed data. As of writing the companies are often severely compute-constrained. There are some informal reasons to doubt their capabilities. The (academic) Chinese AI safety scene is however also growing.
Alibaba’s Qwen3-etc-etc is nominally at the level of Gemini 2.5 Flash. Maybe the only Chinese model with a large Western userbase, including businesses, but since it’s self-hosted this doesn’t translate into profits for them yet. On one ad hoc test it was the only Chinese model not to collapse OOD, but the Qwen2.5 corpus was severely contaminated.
DeepSeek’s v3.2 is nominally around the same as Qwen. The CCP made them waste months trying Huawei chips.
Moonshot’s Kimi-K2-Thinking has some nominally frontier benchmark results and a pleasant style but does not seem frontier.
Baidu’s ERNIE 5 is again nominally very strong, a bit better than DeepSeek. This new one seems to not be open.
Z’s GLM-4.6 is around the same as Qwen. The product director was involved in the MIT Alignment group.
MiniMax’s M2 is nominally better than Qwen, around the same as Grok 4 Fast on the usual superficial benchmarks. It does fine on one very basic red-team test.
ByteDance does impressive research in a lagging paradigm, diffusion LMs.
Amazon’s Nova Pro is around the level of Llama 3 90B, which in turn is around the level of the original GPT-4. So 2 years behind. But they have their own chip.
Microsoft are now mid-training on top of GPT-5. MAI-1-preview is around DeepSeek V3.0 level on Arena. They continue to focus on medical diagnosis. You can request access.
Mistral have a reasoning model, Magistral Medium, and released the weights of a little 24B version. It’s a bit worse than Deepseek R1, pass@1.
Black-box safety (understand and control current model behaviour)
Iterative alignment
Nudging base models by optimising their output. Worked on by the post-training teams at most labs, estimating the FTEs at >500 in some sense. Funded by most of the industry.
General theory of change: “LLMs don’t seem very dangerous and might scale to AGI, things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, assume that zero-shot deception is hard, assume a fundamentally humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out, assume that alignment is a superficial feature, assume that tuning for what we want will also get us to avoid what we don’t want. Maybe assume that thoughts are translucent.”
General approach: engineering · Target case: average
Prompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.
Some names: Ariana Azarbal, Daniel Tan, Victor Gillioz, Alex Turner, Alex Cloud, Monte MacDiarmid, Daniel Ziegler
Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.
Theory of change: If an AI learns dangerous knowledge (e.g., dual-use capabilities like virology or hacking, or knowledge of their own safety controls) or exhibits undesirable behaviors (e.g., memorizing private data), we can specifically erase this “bad” knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer. Alternatively, intervene in pre-training, to prevent the model from learning it in the first place (even when data filtering is imperfect). You could imagine also unlearning propensities to power-seeking, deception, sycophancy, or spite.
General approach: cognitive / engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can hack software supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing, Humanlike minds/goals are not necessarily safe
Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud, Peter Henderson, Filip Sondej, Fazl Barez
Funded by: Coefficient Giving, MacArthur Foundation, UK AI Safety Institute (AISI), Canadian AI Safety Institute (CAISI), industry labs (e.g., Microsoft Research, Google)
If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watching with incriminating behaviour?
General approach: engineering / behavioural · Target case: worst-case
See also: safety cases
Some names: Redwood, UK AISI, Deepmind, OpenAI, Anthropic, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.
Theory of change: By building a bunch of scalable and hardened things on top of an unsafe model, we can defend against known and unknown attacks, monitor for misuse, and prevent models from causing harm, even if the core model has vulnerabilities.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing
Supervise an AI’s natural-language (output) “reasoning” to detect misalignment, scheming, or deception, rather than studying the actual internal states.
Theory of change: The reasoning process (Chain of Thought, or CoT) of an AI provides a legible signal of its internal state and intentions. By monitoring this CoT, supervisors (human or AI) can detect misalignment, scheming, or reward hacking before it results in a harmful final output. This allows for more robust oversight than supervising outputs alone, but it relies on the CoT remaining faithful (i.e., accurately reflecting the model’s reasoning) and not becoming obfuscated under optimization pressure.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing
Some names: Aether, Bowen Baker, Joost Huizinga, Leo Gao, Scott Emmons, Erik Jenner, Yanda Chen, James Chua, Owain Evans, Tomek Korbak, Mikita Balesni, Xinpeng Wang, Miles Turpin, Rohin Shah
This section consists of a bottom-up set of things people happen to be doing, rather than a normative taxonomy.
Model values / model preferences
Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.
Theory of change: As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we must leverage utility functions to analyse these values and apply “utility control” methods to constrain them, rather than just controlling outputs downstream of them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify
Some names: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Theory of change: If post-training, prompting, and activation-engineering interact with some kind of structured ‘persona space’, then better understanding it should benefit the design, control, and detection of LLM personas.
General approach: cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.
Theory of change: Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify this correlated representation of the good.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: auditing real models · applied interpretability
Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans
Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Theory of change: Model specs and constitutions serve three purposes. First, they provide a clear standard of behavior which can be used to train models to value what we want them to value. Second, they serve as something closer to a ground truth standard for evaluating the degree of misalignment ranging from “models straightforwardly obey the spec” to “models flagrantly disobey the spec”. A combination of scalable stress-testing and reinforcement for obedience can be used to iteratively reduce the risk of misalignment. Third, they get more useful as models’ instruction-following capability improves.
General approach: engineering · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.
Theory of change: The study of ‘pathological’ phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training-dynamics (compare: the study of aphasia and visual processing disorder in humans plays a key role cognitive science), and in particular for developing a good theory of generalization in LLMS
General approach: behavioural/cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution
Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.
Theory of change: By curating the pretraining data, we can prevent the model from learning dangerous capabilities (e.g., dual-use info) or undesirable behaviors (e.g., toxicity) in the first place, making safety more robust and “tamper-resistant” than post-training patches.
General approach: engineering · Target case: average
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify
Study, steer, and intervene on the following feedback loop: “we produce stories about how present and future AI systems behave” → “these stories become training data for the AI” → “these stories shape how AI systems in fact behave”.
Theory of change: Measure the influence of existing AI narratives in the training data → seed and develop more salutary ontologies and self-conceptions for AI models → control and redirect AI models’ self-concepts through selectively amplifying certain components of the training data.
General approach: cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.
Theory of change: By identifying and filtering out malicious training examples, we can prevent attackers from creating hidden backdoors or triggers that would cause aligned models to behave dangerously.
General approach: engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can hack software supervisors, Someone else will deploy unsafe superintelligence first
Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.
Theory of change: We can overcome the bottleneck of human feedback and data by using models to generate vast amounts of high-quality, targeted data for safety, preference tuning, and capability elicitation.
See also: data quality for alignment, data filtering, scalable oversight, automated alignment research, weak-to-strong generalization.
Orthodox problems: goals misgeneralize out of distribution, superintelligence can fool human supervisors, value is fragile and hard to specify.
Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Theory of change: The quality of alignment is heavily dependent on the quality of the data (e.g., human preferences); by improving the “signal” from annotators and reducing noise/bias, we will get more robustly aligned models.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, Value is fragile and hard to specify
Avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of change: If we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of change: Standard RL objectives (like maximizing expected value) are brittle and lead to goal misgeneralization or specification gaming; by developing more robust frameworks (like pessimistic RL, minimax regret, or provable inverse reward learning), we can create agents that are safe even when misspecified.
General approach: engineering · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify, Superintelligence can fool human supervisors
Some names: Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.
Theory of change: Understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
See also: Guaranteed-Safe AI
General approach: engineering / cognitive · Target case: varies
Orthodox alignment problems: Value is fragile and hard to specify, Humanlike minds/goals are not necessarily safe
Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Dylan Hafield-Menell, Stuart Russell
Critiques:nice summary of historical problem statements
Funded by: Future of Life Institute, Coefficient Giving, Survival and Flourishing Fund, Cooperative AI Foundation, Polaris Ventures
Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.
Theory of change: Open-weight models allow adversaries to easily remove post-training safety (like refusal training) via simple fine-tuning; by making safety an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information), the safeguards become far more difficult and expensive to remove.
General approach: engineering · Target case: average
Orthodox alignment problems: Someone else will deploy unsafe superintelligence first
Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.
Theory of change: Empirical search for “negative alignment taxes” (prioritizing methods that simultaneously enhance alignment and capabilities)
General approach: engineering · Target case: average
Orthodox alignment problems: Someone else will deploy unsafe superintelligence first
See also:Iterative alignment · automated alignment research · Beijing Key Laboratory of Safe AI and Superalignment · Aligned AI
Some names: AE Studio, Gunnar Zarncke, Cameron Berg, Michael Vaiana, Judd Rosenblatt, Diogo Schwerz de Lucena
This section isn’t very conceptually clean. See the Open Problems paper or Deepmind for strong frames which are not useful for descriptive purposes.
Reverse engineering
Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model’s internal algorithm.
Theory of change: By gaining a mechanical understanding of how a model works (the “circuit diagram”), we can predict how models will act in novel situations (generalization), and gain the mechanistic knowledge necessary to safely modify an AI’s goals or internal mechanisms, or allow for high-confidence alignment auditing and better feedback to safety researchers.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
Some names: Lucius Bushnaq, Dan Braun, Lee Sharkey, Aaron Mueller, Atticus Geiger, Sheridan Feucht, David Bau, Yonatan Belinkov, Stefan Heimersheim, Chris Olah, Leo Gao
Identifies directions or subspaces in a model’s latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
Theory of change: By mapping internal activations to human-interpretable concepts, we can detect dangerous capabilities or deceptive alignment directly in the mind of the model even if its overt behavior is perfectly safe. Deploy computationally cheap monitors to flag some hidden misalignment in deployed systems.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, A boxed AGI might exfiltrate itself by steganography, spearphishing
Some names: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Tom Wollschläger, Anna Soligo, Jack Lindsey, Brian Christian, Ling Hu, Nicholas Goldowsky-Dill, Neel Nanda
Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
Theory of change: Test interpretability theories by intervening on activations; find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of finetuning. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.
General approach: engineering / cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
Some names: Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall
Identify and decoding the “true” beliefs or knowledge represented inside a model’s activations, even when the model’s output is deceptive or false.
Theory of change: Powerful models may know things they do not say (e.g. that they are currently being tested). If we can translate this latent knowledge directly from the model’s internals, we can supervise them reliably even when they attempt to deceive human evaluators or when the task is too difficult for humans to verify directly.
General approach: cognitive · Target case: worst case
Orthodox alignment problems: Superintelligence can fool human supervisors
Some names: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Alexander Pan, Lijie Chen, Jacob Steinhardt, Javier Ferrando, Oscar Obeso, Collin Burns, Paul Christiano.
Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in their definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
Theory of change: Such detectors could flag suspicious behavior during evaluations or deployment, augment training to reduce deception, or audit models pre-deployment. Specific applications include alignment evaluations (e.g. by validating answers to introspective questions), safeguarding evaluations (catching models that “sandbag”, that is, strategically underperform to pass capability tests), and large-scale deployment monitoring. An honest version of a model could also provide oversight during training or detect cases where a model behaves in ways it understands are unsafe.
General approach: cognitive · Target case: pessimistic
Some names:Cadenza, Sam Marks, Rowan Wang, Kieron Kretschmar, Sharan Maiya, Walter Laurito, Chris Cundy, Adam Gleave, Aviel Parrack, Stefan Heimersheim, Carlo Attubato, Joseph Bloom, Jordan Taylor, Alex McKenzie, Urja Pawar, Lewis Smith, Bilal Chughtai, Neel Nanda
Understand what happens when a model is finetuned, what the “diff” between the finetuned and the original model consists in.
Theory of change: By identifying the mechanistic differences between a base model and its fine-tune (e.g., after RLHF), maybe we can verify that safety behaviors are robustly “internalized” rather than superficially patched, and detect if dangerous capabilities or deceptive alignment have been introduced without needing to re-analyze the entire model. The diff is also much smaller, since most parameters don’t change, which means you can use heavier methods on them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify
Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.
Theory of change: Get a principled decomposition of an LLM’s activation into atomic components → identify deception and other misbehaviors.
General approach: engineering / cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
Theory of change: By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model is using safe reasoning processes and predict its behavior on unseen inputs, rather than relying on behavioural testing alone.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution
Quantifies the influence of individual training data points on a model’s specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
Theory of change: By attributing harmful, biased, or unaligned behaviors to specific training examples, researchers can audit proprietary models, debug training data, enable effective data deletion/unlearning
General approach: behavioural · Target case: average
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify
Some names: Roger Grosse, Philipp Alexander Kreer, Jin Hwa Lee, Matthew Smith, Abhilasha Ravichander, Andrew Wang, Jiacheng Liu, Jiaqi Ma, Junwei Deng, Yijun Pan, Daniel Murfet, Jesse Hoogland
Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.
Theory of change: By applying interpretability skills to concrete problems, researchers can rapidly develop monitoring and control tools (e.g., steering vectors or probes) that have immediate, measurable impact on real-world safety issues like detecting hidden goals or emergent misalignment.
Interpretability that does not fall well into other categories.
Theory of change: Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization). Or be “pragmatic”.
General approach: engineering / cognitive · Target case: mixed
Orthodox alignment problems: Superintelligence can fool human supervisors, Goals misgeneralize out of distribution
Learning dynamics and developmental interpretability
Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model’s training and in-context learning phases.
Theory of change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory); by catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene to prevent deceptiveness or misaligned values as early as possible.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution
What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?
Theory of change: Get scalable unsupervised methods for finding structure in representations and interpreting them, then using this to e.g. guide training.
Discover connections deep learning AI systems have with human brains and human learning processes. Develop an ‘alignment moonshot’ based on a coherent theory of learning which applies to both humans and AI systems.
Theory of change: Humans learn trust, honesty, self-maintenance, and corrigibility; if we understand how they do maybe we can get future AI systems to learn them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution
See also: active learning · ACS research
Some names: Lukas Muttenthaler, Quentin Delfosse
Funded by: Google DeepMind, various academic groups
Approaches which try to get assurances about system outputs while still being scalable.
Guaranteed-Safe AI
Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.
Theory of change: Various, including:
i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;
ii) secure containers: create a ‘gatekeeper’ system that can act as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through.
(Notable for not requiring that we solve ELK; does require that we solve ontology though)
General approach: cognitive / engineering · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake, A boxed AGI might exfiltrate itself by steganography, spearphishing
Some names: ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate
Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs
Theory of change: Developing non-agentic ‘Scientist AI’ allows us to: (i) reap the benefits of AI progress while (ii) avoiding the inherent risks of agentic systems. These systems can also (iii) provide a useful guardrail to protect us from unsafe agentic AIs by double-checking actions they propose, and (iv) help us more safely build agentic superintelligent systems.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Pivotal processes require dangerous capabilities, Goals misgeneralize out of distribution, Instrumental convergence
Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let’s figure out what those circuits are and how they work; this will involve symbol grounding. “a yet-to-be-invented variation on actor-critic model-based reinforcement learning”
Theory of change: Fairly-direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; “understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients”.
Use weaker models to supervise and provide a feedback signal to stronger models.
Theory of change: Find techniques that do better than RLHF at supervising superior models → track whether these techniques fail as capabilities increase further → keep the stronger systems aligned by amplifying weak oversight and quantifying where it breaks.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can hack software supervisors
Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
Theory of change: Early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading. Left unchecked this will likely cause problems, so supervision mechanisms are needed to help ensure the AI self-improvement remains legible.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Roman Engeler, Akbir Khan, Ethan Perez
Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that “searches” for a specified behaviour in frontier models.
Theory of change: Use AI to help improve interp and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.
See also: Interpretability.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Transluce, Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann, Robert Friel
Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba
In the limit, it’s easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
Theory of change: “Give humans help in supervising strong agents” + “Align explanations with the true reasoning process of the agent” + “Red team models to exhibit failure modes that don’t occur in normal use” are necessary but probably not sufficient for safe AGI.
General approach: engineering / cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Superintelligence can fool human supervisors
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras, UK AISI (Benjamin Holton)
Train LLMs to the predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use its own ‘introspective’ access
Theory of change: Use the resulting LLMs as powerful dimensionality reduction, explaining internals in a distinct way than interpretability methods and CoT. Distilling self-explanation into the model should lead to the skill being scalable, since self-explanation skill advancement will feed off general-intelligence advancement.
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Develop a principled scientific understanding that will help us reliably understand and control current and future AI systems.
Agent foundations
Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.
Theory of change: Rigorously understand optimization processed and agents, and what it means for them to be aligned in a substrate independent way → identify impossibility results and necessary conditions for aligned optimizer systems → use this theoretical understanding to eventually design safe architectures that remain stable and safe under self-reflection
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
An aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/approaches that prevent it from happening.
Theory of change: Build enough theoretical basis through various approaches such that AI systems we create are capable of self-modification while preserving goals.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
Mech interp and alignment assume a stable “computational substrate” (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a “telic DAG” which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.
Theory of change: Sufficiently complex mindlike entities can alter their goals in ways that cannot be predicted or accounted for under substrate-dependent descriptions of the kind sought in mechanistic interpretability. use the telic DAG to define a method analogous to factoring a causal DAG.
General approach: maths / philosophy · Target case: pessimistic
Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory, game theory, learning theory and other areas to both improve asymptotic guarantees and develop ways of showing convergence.
Theory of change: Formal verification may be too hard. Make safety cases stronger by modelling their processes and proving that they would work in the limit.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations” and use them to predict when novel input will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).
Theory of change: The current goalpost is methods whose reasoned predictions about properties of a neural network’s outputs distribution (for a given inputs distribution) are certifiably at least as accurate as estimations via sampling. If successful for safety-relevant properties, this should allow for automated alignment methods that are both human-legible and worst-case certified, as well more efficient than sampling-based methods in most cases.
General approach: cognitive / maths/philosophy · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can hack software supervisors
See also: ARC Theory · ELK · mechanistic anomaly detection · Acorn · Guaranteed-Safe AI
Some names: Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson
Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.
Theory of change: Figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold → test these hypotheses where feasible → design training environments that lead to more salutary properties.
General approach: maths / philosophy · Target case: worst-case
Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence
Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors
Theory of change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following doesn’t buy robust corrigibility + building strong tripwires/diagnostics for scheming and Goodharting thus reduces risks on the likely default path.
General approach: varies · Target case: pessimistic
Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence
Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.
Theory of change: Understand the concepts a system’s understanding is structured with and use them to inspect its “alignment/safety properties” and/or “retarget its search”, i.e. identify utility-function-like components inside an AI and replacing calls to them with calls to “user values” (represented using existing abstractions inside the AI).
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Instrumental convergence, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake
See also:Causal Abstractions · representational alignment · convergent abstractions · feature universality · Platonic representation hypothesis · microscope AI
Some names: John Wentworth, Paul Colognese, David Lorrell, Sam Eisenstat, Fernando Rosas
Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology and ours; produce formal desiderata for a training setup that produces coherent AGIs similar to (our model of) an aligned agent
Theory of change: Fix formal epistemology to work out how to avoid deep training problems
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Humans cannot be first-class parties to a superintelligent value handshake
Some names: Vanessa Kosoy, Diffractor, Gergely Szücs
Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility functions.
Theory of change: “Many classical problems in AGI alignment are downstream of a type error about human values.” Operationalizing a correct view of human values—one that treats human values as impossible or impractical to abstract from concrete practices—will unblock value fragility, goal-misgeneralization, instrumental convergence, and pivotal-act specification.
General approach: behavioural · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes
Some names: Full Stack Alignment, Meaning Alignment Institute, Plurality Institute, Tan Zhi-Xuan, Matija Franklin, Ryan Lowe, Joe Edelman, Oliver Klingefjord
Funded by: ARIA, OpenAI, Survival and Flourishing Fund
Generate AIs’ operational values from ‘social contract’-style ideal civic deliberation formalisms and their consequent rulesets for civic actors
Theory of change: Formalize and apply the liberal tradition’s project of defining civic principles separable from the substantive good, aligning our AIs to civic principles that bypass fragile utility-learning and intractable utility-calculation
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Instrumental convergence, Humanlike minds/goals are not necessarily safe, Fair, sane pivotal processes
Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI agents in multi-agent scenarios.
Theory of change: While traditional AGI safety focuses on idealized decision-theory and individual agents, it’s plausible that strategic AI agents will first emerge (or are emerging now) in a complex, multi-AI strategic landscape. We need granular, realistic formal models of AIs’ strategic interactions and collective dynamics to understand this future.
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Vojta Kovarik, Nathaniel Sauerberg, ACS Research, Jan Kulveit, Richard Ngo, Emmett Shear, Softmax, Full Stack Alignment, AI Objectives Institute, Sahil, TJ, Andrew Critch
Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.
Theory of change: Addressing multi-agent AI dynamics is key for aligning near-future agents and their impact on the world. Feedback loops from multi-agent dynamics can radically change the future AI landscape, and require a different toolset from model psychology to audit and control.
General approach: engineering / behavioural · Target case: mixed
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Andrew Critch, Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Gillian Hadfield, Nathaniel Sauerberg, Zhijing Jin
Funded by: Coefficient Giving, Deepmind, Cooperative AI Foundation
Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to “humanity”
Theory of change: use democratic/pluralist/context-sensitive principles to guide AI development, alignment, and deployment somehow. Doing it as an afterthought in post-training or the spec isn’t good enough. Continuously shape AI’s social and technical feedback loop on the road to AGI
General approach: behavioural · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify, Fair, sane pivotal processes
Some names: Joel Z. Leibo, Divya Siddarth, Séb Krier, Luke Thorburn, Seth Lazar, AI Objectives Institute, The Collective Intelligence Project, Vincent Conitzer
Funded by: Future of Life Institute, Survival and Flourishing Fund, Deepmind, CAIF
Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constitutive processes as subject to alignment
Theory of change: Model multiple reality-shaping processes above and below the level of the individual AI, some of which are themselves quasi-agential (e.g. cultures) or intelligence-like (e.g. markets), will develop AI alignment into a mature science for managing the transition to an AGI civilization
General approach: behavioural / cognitive · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes
Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Theory of change: Keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary (and hopefully they extrapolate). You can’t regulate without them.
General approach: behavioural · Target case: average
Measure an AI’s ability to act autonomously to complete long-horizon, complex tasks.
Theory of change: By measuring how long and complex a task an AI can complete (its “time horizon”), we can track capability growth and identify when models gain dangerous autonomous capabilities (like R&D acceleration or replication).
General approach: behavioural · Target case: average
Some names: METR, Thomas Kwa, Ben West, Joel Becker, Beth Barnes, Hjalmar Wijk, Tao Lin, Giulio Starace, Oliver Jaffe, Dane Sherburn, Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou
Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.
Theory of change: By benchmarking and tracking AI’s knowledge of biology and chemistry, we can identify when models become capable of accelerating WMD development or misuse, allowing for timely intervention.
General approach: behavioural · Target case: pessimistic
Funded by: Open Philanthropy, UK AI Safety Institute (AISI), frontier labs, Scale AI, various academic institutions (Peking University, Yale, etc.), Meta
Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Theory of change: if an AI can distinguish between evaluation and deployment (“evaluation awareness”), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Jan Betley, Xuchan Bao, Martín Soto, Mary Phuong, Roland S. Zimmermann, Joe Needham, Giles Edkins, Govind Pimpale, Kai Fronsdal, David Lindner, Lang Xiong, Xiaoyan Bai
Funded by: frontier labs (Google DeepMind, Anthropic), Open Philanthropy, The Audacious Project, UK AI Safety Institute (AISI), AI Safety Support, Apollo Research, METR
evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.
Theory of change: if models can use steganography, they could hide deceptive reasoning, bypassing safety monitoring and control measures. By evaluating this capability, we can assess the risk of a model fooling its supervisors.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Superintelligence can fool human supervisors
research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.
Theory of change: proactively discover, evaluate, and understand the mechanisms of AI deception (e.g., alignment faking, manipulation, agentic deception) to prevent models from fooling human supervisors and causing harm.
General approach: behavioural / engineering · Target case: worst-case
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Cadenza, Fred Heiding, Simon Lermen, Andrew Kao, Myra Cheng, Cinoo Lee, Pranav Khadpe, Satyapriya Krishna, Andy Zou, Rahul Gupta
Critiques: A central criticism is that the evaluation scenarios are “artificial and contrived”. the void and Lessons from a Chimp argue this research is “overattributing human traits” to models.
Funded by: Labs, academic institutions (e.g., Harvard, CMU, Barcelona Institute of Science and Technology), NSFC, ML Alignment Theory & Scholars (MATS) Program, FAR AI
Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Theory of change: Robust evaluations must move beyond checking final outputs and probe the model’s reasoning to verify that alignment is genuine, not faked, because capable models are capable of strategically concealing misaligned goals (scheming) to pass standard behavioural evaluations.
General approach: behavioural / engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors
Some names: Bronson Schoen, Alexander Meinke, Jason Wolfe, Mary Phuong, Rohin Shah, Evgenia Nitishinskaya, Mikita Balesni, Marius Hobbhahn, Jérémy Scheurer, Wojciech Zaremba, David Lindner
Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Theory of change: If models can distinguish between evaluation and deployment contexts (“evaluation awareness”), they might learn to “sandbag” or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Teun van der Weij, Cameron Tice, Chloe Li, Johannes Gasteiger, Joseph Bloom, Joel Dyer
Critiques: The main external critique, from sources like “the void” and “Lessons from a Chimp”, is that this research “overattribut[es] human traits” to models. It argues that what’s being measured isn’t genuine sandbagging but models “playing-along-with-drama behaviour” in response to “artificial and contrived” evals.
Funded by: Anthropic (and its funders, e.g., Google, Amazon), UK Government (funding the AI Security Institute)
evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.
Theory of change: if AI agents gain the ability to self-replicate, they could proliferate uncontrollably, making them impossible to shut down. By measuring this capability with benchmarks like RepliBench, we can identify when models cross this dangerous “red line” and implement controls before losing containment.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: Instrumental convergence, A boxed AGI might exfiltrate itself by steganography, spearphishing
Some names: Sid Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney, Xudong Pan, Jiarun Dai, Yihe Fan
attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Theory of change: to ensure models are safe, we must actively try to break them. By developing and applying a diverse suite of attacks (e.g., in novel domains, against agentic systems, or using automated tools), researchers can discover vulnerabilities, specification gaming, and deceptive behaviors before they are exploited, thereby informing the development of more robust defenses.
General approach: behavioural · Target case: average
Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Goals misgeneralize out of distribution
A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Theory of change: By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.
General approach: behavioural · Target case: average
See also: other more specific sections on evals
Some names: Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper
Funded by: Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization for the MASK benchmark), academic institutions. N/A (as a discrete amount). This work is part of the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.
A few major changes to the taxonomy: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”. (We did try out an automated clustering but it wasn’t very good.)
The agendas are in general less charisma-based and more about solution type.
We did a systematic Arxiv scrape on the word “alignment” (and filtered out the sequence indexing papers that fell into this pipeline). “Steerability” is one competing term used by academics.
We scraped >3000 links (arXiv, LessWrong, several alignment publication lists, blogs and conferences), conservatively filtering and pre-categorizing them with a LLM pipeline. All curated later, many more added manually.
This review has ~800 links compared to ~300 in 2024 and ~200 in 2023. We looked harder and the field has grown.
We don’t collate public funding figures.
New sections: “Labs”, “Multi-agent First”, “Better data”, “Model specs”, “character training” and “representation geometry”. ”Evals” is so massive it gets a top-level section.
Methods
Structure
We have again settled for a tree data structure for this post – but people and work can appear in multiple nodes so it’s not a dumb partition. Richer representation structures may be in the works.
The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.)
Scope
30th November 2024 – 30th November 2025 (with a few exceptions).
We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AIrights, general capabilities evals, and things closer to “technical policy” and like products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than say, underground gdoc samizdat or Discords).
We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.
Paper sources
All arXiv papers with “AI alignment”, “AI safety”, or “steerability” in the abstract or title; all papers of ~120 AI safety researchers
All Alignment Forum posts and all LW posts under “AI”
Ad hoc Twitter for a year, several conference pages and workshops
AI scrapes miss lots of things. We did a proper pass with a software scraper of over 3000 links collected from the above and LLM crawl of some of the pages, and then an LLM pass to pre-filter the links for relevance and pre-assign the links to agendas and areas, but this also had systematic omissions. We ended up doing a full manual pass over a conservatively LLM-pre-filtered links and re-classifying the links and papers. The code and data can be found here. We are not aware of any proper studies of “LLM laziness” but it’s known amongst power users of copilots.
For finding critiques we used LW backlinks, Google Scholar cited-by, manual search, collected links, and Ahrefs. Technical critiques are somewhat rare, though, and even then our coverage here is likely severely lacking. We generally do not include social network critiques (mostly due to scope).
Despite this effort we will not have included all relevant papers and names. The omission of a paper or a researcher is not strong negative evidence about their relevance.
Processing
Collecting links throughout the year and at project start. Skimming papers, staring at long lists.
We drafted a taxonomy of research agendas. Based on last year’s list, our expertise and the initial paper collection, though we changed the structure to accommodate shifts in the domain: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”.
At around 500 links and growing, we decided to implement simple pipelines for crawling, scraping and LLM metadata extraction, pre-filtering and pre-classification into agendas, as well as other tasks – including final formatting later. The use of LLMs was limited to one simple task at a time, and the results were closely watched and reviewed. Code and data here.
We tried getting the AI to update our taxonomy bottom-up to fit the body of papers, but the results weren’t very good. Though we are looking at some advanced options (specialized embedding or feature extraction and clustering or t-SNE etc.)
Work on the ~70 agendas was distributed among the team. We ended up making many local changes to the taxonomy, esp. splitting up and merging agendas. The taxonomy is specific to this year, and will need to be adapted in coming years.
We moved any agendas without public outputs this year to the bottom, and the inactive ones to the Graveyard. For most of them, we checked with people from the agendas for outputs or work we may have missed.
What started as a brief summary editorial grew into its own thing (6000w).
We asked 10 friends in AI safety to review the ~80 page draft. After editing, last-minute additions and formatting, we asked 50 technical AI safety people for a quick review of their domains. Coverage was not perfect; all mistakes are our own.
The field is growing at around 20% a year. There will come a time that this list isn’t sensible to do manually even with the help of LLMs (at this granularity anyway). We may come up with better alternatives than lists and posts by then, though.
Other classifications
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)
“optimistic-case”: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc
pessimistic-case: if we’re in-between the above and the below
worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness
The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)
engineering: iterating over outputs
behavioural: understanding the input-output relationship
cognitive: understanding the algorithms
maths/philosophy: providing concepts for the other approaches
Acknowledgments
These people generously helped with the review by providing expert feedback, literature sources, advice, or otherwise. Any remaining mistakes remain ours.
Thanks to Neel Nanda, Owain Evans, Stephen Casper, Alex Turner, Caspar Oesterheld, Steve Byrnes, Adam Shai, Séb Krier, Vanessa Kosoy, Nora Ammann, David Lindner, John Wentworth, Vika Krakovna, Filip Sondej, JS Denain, Jan Kulveit, Mary Phuong, Linda Linsefors, Yuxi Liu, Ben Todd, Ege Erdil, Tan Zhi-Xuan, Jess Riedel, Mateusz Bagiński, Roland Pihlakas, Walter Laurito, Vojta Kovařík, David Hyland, plex, Shoshannah Tekofsky, Fin Moorhouse, Misha Yagudin, Nandi Schoots, and Nuno Sempere for helpful comments.
> No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.
– Sergey Brin
> Over the decade I’ve spent working on AI safety, I’ve felt an overall trend of divergence; research partnerships starting out with a sense of a common project, then slowly drifting apart over time… eventually, two researchers are going to have some deep disagreement in matters of taste, which sends them down different paths.
> Until the spring of this year, that is… something seemed to shift, subtly at first. After I gave a talk—roughly the same talk I had been giving for the past year—I had an excited discussion about it with Scott Garrabrant. Looking back, it wasn’t so different from previous chats we had had, but the impact was different; it felt more concrete, more actionable, something that really touched my research rather than remaining hypothetical. In the subsequent weeks, discussions with my usual circle of colleagues took on a different character—somehow it seemed that, after all our diverse explorations, we had arrived at a shared space.
Shallow review of technical AI safety, 2025
Website · Data · Gestalt · Repo
Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)
This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on our website.
It’s shallow in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, as well as a year’s worth of Twitter.
It is substantially a list of lists structuring around 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what’s happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn’t pan out.
Here, “AI safety” means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it’s capability restraint, instruction-following, value alignment, control, and risk awareness work.
We don’t cover security or resilience at all.
We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Even ignoring all of that as we do, it’s still too long to read. Here’s a spreadsheet version and the repo including the data. Methods down the bottom. Gavin’s takes outgrew this post and became its own thing.
If we missed something big or got something wrong, please comment, we will edit it in.
An Arb Research project. Work funded by
OpenPhilCoefficient Giving.We have tried to falsify this but it wasn’t easy.
Labs (giant companies)
Not counting the AIs
EU CoP, SB53
OpenAI
Structure: public benefit corp
Safety teams: Alignment, Safety Systems (Interpretability, Safety Oversight, Pretraining Safety, Robustness, Safety Research, Trustworthy AI, new Misalignment Research team coming), Preparedness, Model Policy, Safety and Security Committee, Safety Advisory Group. The Persona Features paper had a distinct author list. No named successor to Superalignment.
Public alignment agenda: None. Boaz Barak offers personal views.
Risk management framework: Preparedness Framework
See also: Iterative alignment · Safeguards (inference-time auxiliaries) · Character training and persona steering
Some names: Johannes Heidecke, Boaz Barak, Mia Glaese, Jenny Nitishinskaya, Lama Ahmad, Naomi Bashkansky, Miles Wang, Wojciech Zaremba, David Robinson, Zico Kolter, Jerry Tworek, Eric Wallace, Olivia Watkins, Kai Chen, Chris Koch, Andrea Vallone, Leo Gao
Critiques: Stein-Perlman, Stewart, underelicitation, Midas, defense, Carlsmith on labs in general. It’s difficult to model OpenAI as a single agent: “ALTMAN: I very rarely get to have anybody work on anything. One thing about researchers is they’re going to work on what they’re going to work on, and that’s that.”
Funded by: Microsoft, AWS, Oracle, NVIDIA, SoftBank, G42, AMD, Dragoneer, Coatue, Thrive, Altimeter, MGX, Blackstone, TPG, T. Rowe Price, Andreessen Horowitz, D1 Capital Partners, Fidelity Investments, Founders Fund, Sequoia…
Some outputs: Their 60-page System Cards now contain a large amount of their public safety work. · Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation · Persona Features Control Emergent Misalignment · Stress Testing Deliberative Alignment for Anti-Scheming Training · Deliberative Alignment: Reasoning Enables Safer Language Models · Toward understanding and preventing misalignment generalization · Our updated Preparedness Framework · Trading Inference-Time Compute for Adversarial Robustness · Small-to-Large Generalization: Data Influences Models Consistently Across Scale · Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests · Safety evaluations hub · alignment.openai.com · Weight-sparse transformers have interpretable circuits
Google Deepmind
Structure: research laboratory subsidiary of a for-profit
Safety teams: amplified oversight, interpretability, ASAT eng (automated alignment research), Causal Incentives Working Group, Frontier Safety Risk Assessment (evals, threat models, the framework), Mitigations (e.g. banning accounts, refusal training, jailbreak robustness), Loss of Control (control, alignment evals). Structure here.
Public alignment agenda: An Approach to Technical AGI Safety and Security
Risk management framework: Frontier Safety Framework
See also: Interpretability · Scalable Oversight
Some names: Rohin Shah, Allan Dafoe, Anca Dragan, Alex Irpan, Alex Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Sebastian Farquhar, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Zac Kenton, Four Flynn, Jonathan Richens, Lewis Smith, Janos Kramar, Matthew Rahtz, Mary Phuong, Erik Jenner
Critiques: Stein-Perlman, Carlsmith on labs in general, underelicitation, On Google’s Safety Plan
Funded by: Google. Explicit 2024 Deepmind spending as a whole was £1.3B, but this doesn’t count most spending e.g. Gemini compute.
Some outputs: A Pragmatic Vision for Interpretability · How Can Interpretability Researchers Help AGI Go Well? · Evaluating Frontier Models for Stealth and Situational Awareness · When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors · MONA: Managed Myopia with Approval Feedback · Consistency Training Helps Stop Sycophancy and Jailbreaks · An Approach to Technical AGI Safety and Security · Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) · Steering Gemini Using BIDPO Vectors · Difficulties with Evaluating a Deception Detector for AIs · Taking a responsible path to AGI · Evaluating potential cybersecurity threats of advanced AI · Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance · A Pragmatic Way to Measure Chain-of-Thought Monitorability
Anthropic
Structure: public-benefit corp
Safety teams: Scalable Alignment (Leike), Alignment Evals (Bowman), Interpretability (Olah), Control (Perez), Model Psychiatry (Lindsey), Character (Askell), Alignment Stress-Testing (Hubinger), Alignment Mitigations (Price?), Frontier Red Team (Graham), Safeguards (?), Societal Impacts (Ganguli), Trust and Safety (Sanderford), Model Welfare (Fish).
Public alignment agenda: directions, bumpers, checklist, an old vague view
Risk management framework: RSP
See also: Interpretability · Scalable Oversight
Some names: Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Jan Leike, Ethan Perez, Jack Lindsey, Amanda Askell, Kyle Fish, Sara Price, Jon Kutasov, Minae Kwon, Monty Evans, Richard Dargan, Roger Grosse, Ben Levinstein, Joseph Carlsmith, Joe Benton
Critiques: Stein-Perlman, Casper, Carlsmith, underelicitation, Greenblatt, Samin, defense, Existing Safety Frameworks Imply Unreasonable Confidence
Funded by: Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Baillie Gifford, BlackRock, Blackstone, Coatue, D1 Capital Partners, General Atlantic, General Catalyst, GIC, Goldman Sachs, Insight Partners, Jane Street, Ontario Teachers’ Pension Plan, Qatar Investment Authority, TPG, T. Rowe Price, WCM, XN
Some outputs: Evaluating honesty and lie detection techniques on a diverse suite of dishonest models · Agentic Misalignment: How LLMs could be insider threats · Why Do Some Language Models Fake Alignment While Others Don’t? · Forecasting Rare Language Model Behaviors · Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise · On the Biology of a Large Language Model · Auditing language models for hidden objectives · Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples · Circuit Tracing: Revealing Computational Graphs in Language Models · SHADE-Arena: Evaluating sabotage and monitoring in LLM agents · Emergent Introspective Awareness in Large Language Models · Reasoning models don’t always say what they think · Petri: An open-source auditing tool to accelerate AI safety research · Signs of introspection in large language models · Putting up Bumpers · Three Sketches of ASL-4 Safety Case Components · Recommendations for Technical AI Safety Research Directions · Constitutional Classifiers: Defending against universal jailbreaks · The Soul Document · Open-sourcing circuit tracing tools · Natural emergent misalignment from reward hacking
xAI
Structure: for-profit
Teams: Applied Safety, Model Evaluation. Nominally focussed on misuse.
Framework: Risk Management Framework
Some names: Dan Hendrycks (advisor), Juntang Zhuang, Toby Pohlen, Lianmin Zheng, Piaoyang Cui, Nikita Popov, Ying Sheng, Sehoon Kim, Alexander Pan
Critiques: framework, hacking, broken promises, Stein-Perlman, insecurity, Carlsmith on labs in general
Funded by: A16Z, Blackrock, Fidelity, Kingdom, Lightspeed, MGX, Morgan Stanley, Sequoia…
Meta
Structure: for-profit
Teams: Safety “integrated into” capabilities research, Meta Superintelligence Lab. But also FAIR Alignment, Brain and AI.
Public alignment agenda: None
Framework: FAF
See also: Capability removal, unlearning
Some names: Shuchao Bi, Hongyuan Zhan, Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Jason Weston, ShengYun Peng, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Evangelia Spiliopoulou, Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda, Adina Williams
Critiques: extreme underelicitation, Stein-Perlman, Carlsmith on labs in general
Funded by: Meta
Some outputs: The Alignment Waltz: Jointly Training Agents to Collaborate for Safety · Large Reasoning Models Learn Better Alignment from Flawed Thinking · Robust LLM safeguarding via refusal feature adversarial training · Code World Model Preparedness Report · Agents Rule of Two: A Practical Approach to AI Agent Security · AI & Human Co-Improvement
China
The Chinese companies don’t attempt to be safe, often not even in the prosaic safeguards sense. They drop the weights immediately after post-training finishes. They’re mostly open weights and closed data. As of writing the companies are often severely compute-constrained. There are some informal reasons to doubt their capabilities. The (academic) Chinese AI safety scene is however also growing.
Alibaba’s Qwen3-etc-etc is nominally at the level of Gemini 2.5 Flash. Maybe the only Chinese model with a large Western userbase, including businesses, but since it’s self-hosted this doesn’t translate into profits for them yet. On one ad hoc test it was the only Chinese model not to collapse OOD, but the Qwen2.5 corpus was severely contaminated.
DeepSeek’s v3.2 is nominally around the same as Qwen. The CCP made them waste months trying Huawei chips.
Moonshot’s Kimi-K2-Thinking has some nominally frontier benchmark results and a pleasant style but does not seem frontier.
Baidu’s ERNIE 5 is again nominally very strong, a bit better than DeepSeek. This new one seems to not be open.
Z’s GLM-4.6 is around the same as Qwen. The product director was involved in the MIT Alignment group.
MiniMax’s M2 is nominally better than Qwen, around the same as Grok 4 Fast on the usual superficial benchmarks. It does fine on one very basic red-team test.
ByteDance does impressive research in a lagging paradigm, diffusion LMs.
There are others but they’re marginal for now.
Others
Amazon’s Nova Pro is around the level of Llama 3 90B, which in turn is around the level of the original GPT-4. So 2 years behind. But they have their own chip.
Microsoft are now mid-training on top of GPT-5. MAI-1-preview is around DeepSeek V3.0 level on Arena. They continue to focus on medical diagnosis. You can request access.
Mistral have a reasoning model, Magistral Medium, and released the weights of a little 24B version. It’s a bit worse than Deepseek R1, pass@1.
Black-box safety (understand and control current model behaviour)
Iterative alignment
Nudging base models by optimising their output. Worked on by the post-training teams at most labs, estimating the FTEs at >500 in some sense. Funded by most of the industry.
General theory of change: “LLMs don’t seem very dangerous and might scale to AGI, things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, assume that zero-shot deception is hard, assume a fundamentally humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out, assume that alignment is a superficial feature, assume that tuning for what we want will also get us to avoid what we don’t want. Maybe assume that thoughts are translucent.”
General approach: engineering · Target case: average
General critiques: Bellot, Alfour, STACK, AI Alignment Strategies from a Risk Perspective, AI Alignment based on Intentions does not work, Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?, Murphy’s Laws of AI Alignment: Why the Gap Always Wins, Alignment remains a hard, unsolved problem
Iterative alignment at pretrain-time
Guide weights during pretraining.
See also: prosaic alignment · incrementalism · alignment-by-default. Korbak 2023.
Some names: Jan Leike, Stuart Armstrong, Cyrus Cousins, Oliver Daniels
Critiques: Bellot, STACK, Dung, Gaikwad, Hubinger
Funded by: most of the industry
Some outputs: Unsupervised Elicitation · ACE and Diverse Generalization via Selective Disagreement
Iterative alignment at post-train-time
Modify weights after pre-training.
Some names: Adam Gleave, Anca Dragan, Jacob Steinhardt, Rohin Shah
Critiques: Bellot, STACK, Dung, Gölz, Gaikwad, Hubinger
Funded by: most of the industry
Some outputs: Composable Interventions for Language Models · Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives · On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback · Preference Learning with Lie Detectors can Induce Honesty or Evasion · Robust LLM Alignment via Distributionally Robust Direct Preference Optimization · RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation · Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference · Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision · Consistency Training Helps Stop Sycophancy and Jailbreaks · Rethinking Safety in LLM Fine-tuning: An Optimization Perspective · Preference Learning for AI Alignment: a Causal Perspective · On Monotonicity in AI Alignment · Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability · Uncertainty-Aware Step-wise Verification with Generative Reward Models · The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains · Training LLMs for Honesty via Confessions
Black-box make-AI-solve-it
Focus on using existing models to improve and align further models.
See also: Make AI solve it · Debate
Some names: Jacques Thibodeau, Matthew Shingle, Nora Belrose, Lewis Hammond, Geoffrey Irving
Critiques: STACK, Dung, Gölz, Gaikwad, Hubinger, SAIF
Funded by: most of the industry
Some outputs: Neural Interactive Proofs · MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking · Prover-Estimator Debate: A New Scalable Oversight Protocol · Weak to Strong Generalization for Large Language Models with Multi-capabilities · Debate Helps Weak-to-Strong Generalization · Mechanistic Anomaly Detection for “Quirky” Language Models · AI Debate Aids Assessment of Controversial Claims · An alignment safety case sketch based on debate · Ensemble Debates with Local Large Language Models for AI Alignment · Training AI to do alignment research we don’t already know how to do · Automating AI Safety: What we can do today · Superalignment with Dynamic Human Values
Inoculation prompting
Prompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.
Some names: Ariana Azarbal, Daniel Tan, Victor Gillioz, Alex Turner, Alex Cloud, Monte MacDiarmid, Daniel Ziegler
Critiques: Bellot, Alfour, Gölz, Gaikwad, Hubinger
Funded by: most of the industry
Some outputs: Recontextualization Mitigates Specification Gaming Without Modifying the Specification · Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time · Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment · Natural Emergent Misalignment from Reward Hacking
Inference-time: In-context learning
Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.
See also: model spec as prompt · Model specs and constitutions
Some names: Jacob Steinhardt, Kayo Yin, Atticus Geiger
Critiques: STACK, Dung, Gölz, Gaikwad, Hubinger
Some outputs: InvThink: Towards AI Safety via Inverse Reasoning · Inference-Time Reward Hacking in Large Language Models · Understanding In-context Learning of Addition via Activation Subspaces · Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context · Which Attention Heads Matter for In-Context Learning?
Inference-time: Steering
Manipulate an LLM’s internal representations/token probabilities without touching weights.
See also: Activation engineering · Character training and persona steering · Safeguards (inference-time auxiliaries)
Some names: Taylor Sorensen, Constanza Fierro, Kshitish Ghate, Arthur Vogels
Critiques: Alfour, STACK, Dung, Gölz, Gaikwad, Hubinger
Some outputs: Steering Language Models with Weight Arithmetic · EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences · Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation · In-Distribution Steering: Balancing Control and Coherence in Language Model Generation.
Capability removal: unlearning
Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.
Theory of change: If an AI learns dangerous knowledge (e.g., dual-use capabilities like virology or hacking, or knowledge of their own safety controls) or exhibits undesirable behaviors (e.g., memorizing private data), we can specifically erase this “bad” knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer. Alternatively, intervene in pre-training, to prevent the model from learning it in the first place (even when data filtering is imperfect). You could imagine also unlearning propensities to power-seeking, deception, sycophancy, or spite.
General approach: cognitive / engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can hack software supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing, Humanlike minds/goals are not necessarily safe
See also: Data filtering · Interpretability · Various Redteams
Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud, Peter Henderson, Filip Sondej, Fazl Barez
Critiques: Existing Large Language Model Unlearning Evaluations Are Inconclusive
Funded by: Coefficient Giving, MacArthur Foundation, UK AI Safety Institute (AISI), Canadian AI Safety Institute (CAISI), industry labs (e.g., Microsoft Research, Google)
Estimated FTEs: 10-50
Some outputs: Frameworks · OpenUnlearning · Mostly black-box · Modifying LLM Beliefs with Synthetic Document Finetuning · From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization · Mirror Mirror on the Wall, Have I Forgotten it All? · Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy and Research · Open Problems in Machine Unlearning for AI Safety · Mostly white-box · Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning · Safety Alignment via Constrained Knowledge Unlearning · Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization · Unlearning Isn’t Deletion: Investigating Reversibility of Machine Unlearning in LLMs · Unlearning Needs to be More Selective [Progress Report] · Layered Unlearning for Adversarial Relearning · Understanding Memorization via Loss Curvature · Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities · Pre-training interventions · Gradient Routing: Masking Gradients to Localize Computation in Neural Networks, Selective modularity: a research agenda · Distillation Robustifies Unlearning · Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
Control
If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watching with incriminating behaviour?
General approach: engineering / behavioural · Target case: worst-case
See also: safety cases
Some names: Redwood, UK AISI, Deepmind, OpenAI, Anthropic, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
Critiques: Wentworth, Mannheim, Kulveit
Estimated FTEs: 5-50
Some outputs: Luthien’s Approach to Prosaic AI Control in 21 Points · Ctrl-Z: Controlling AI Agents via Resampling · SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents · Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats · D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models · Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? · Evaluating Control Protocols for Untrusted AI Agents · Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability · Optimizing AI Agent Attacks With Synthetic Data · Games for AI Control · A sketch of an AI control safety case · Assessing confidence in frontier AI safety cases · ControlArena · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · The Alignment Project by UK AISI · Towards evaluations-based safety cases for AI scheming · Incentives for Responsiveness, Instrumental Control and Impact · AI companies are unlikely to make high-assurance safety cases if timelines are short · Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework · Dynamic safety cases for frontier AI · AIs at the current capability level may be important for future safety work · Takeaways from sketching a control safety case
Safeguards (inference-time auxiliaries)
Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.
Theory of change: By building a bunch of scalable and hardened things on top of an unsafe model, we can defend against known and unknown attacks, monitor for misuse, and prevent models from causing harm, even if the core model has vulnerabilities.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing
See also: Various Redteams · Iterative alignment
Some names: Mrinank Sharma, Meg Tong, Jesse Mu, Alwin Peng, Julian Michael, Henry Sleight, Theodore Sumers, Raj Agarwal, Nathan Bailey, Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Sahil Verma, Keegan Hines, Jeff Bilmes
Critiques: Obfuscated Activations Bypass LLM Latent-Space Defenses.
Funded by: most of the big labs
Estimated FTEs: 100+
Some outputs: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming · Rapid Response: Mitigating LLM Jailbreaks with a Few Examples · Monitoring computer use via hierarchical summarization · Defeating Prompt Injections by Design · Introducing Anthropic’s Safeguards Research Team · OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
Chain of thought monitoring
Supervise an AI’s natural-language (output) “reasoning” to detect misalignment, scheming, or deception, rather than studying the actual internal states.
Theory of change: The reasoning process (Chain of Thought, or CoT) of an AI provides a legible signal of its internal state and intentions. By monitoring this CoT, supervisors (human or AI) can detect misalignment, scheming, or reward hacking before it results in a harmful final output. This allows for more robust oversight than supervising outputs alone, but it relies on the CoT remaining faithful (i.e., accurately reflecting the model’s reasoning) and not becoming obfuscated under optimization pressure.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors, A boxed AGI might exfiltrate itself by steganography, spearphishing
See also: White-box safety (understand and control current model internals) · Steganography evals
Some names: Aether, Bowen Baker, Joost Huizinga, Leo Gao, Scott Emmons, Erik Jenner, Yanda Chen, James Chua, Owain Evans, Tomek Korbak, Mikita Balesni, Xinpeng Wang, Miles Turpin, Rohin Shah
Critiques: Reasoning Models Don’t Always Say What They Think; Chain-of-Thought Reasoning In The Wild Is Not Always Faithful; Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens; Reasoning Models Sometimes Output Illegible Chains of Thought
Funded by: OpenAI, Anthropic, Google DeepMind
Estimated FTEs: 10-100
Some outputs: Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation · Detecting misbehavior in frontier reasoning models · When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors · Reasoning Models Don’t Always Say What They Think · Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort · CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring · Training fails to elicit subtle reasoning in current language models · Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability · Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning · Are DeepSeek R1 And Other Reasoning Models More Faithful? · A Pragmatic Way to Measure Chain-of-Thought Monitorability · A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · Why it’s good for AI reasoning to be legible and faithful · Why Don’t We Just… Shoggoth+Face+Paraphraser? CoT May Be Highly Informative Despite “Unfaithfulness” · Aether July 2025 Update
Model psychology
This section consists of a bottom-up set of things people happen to be doing, rather than a normative taxonomy.
Model values / model preferences
Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.
Theory of change: As AIs become more agentic, their behaviours and risks are increasingly determined by their goals and values. Since coherent value systems emerge with scale, we must leverage utility functions to analyse these values and apply “utility control” methods to constrain them, rather than just controlling outputs downstream of them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify
See also: Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions · Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Some names: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Critiques: Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Funded by: Coefficient Giving. $289,000 SFF funding for CAIS.
Estimated FTEs: 30
Some outputs: What Kind of User Are You? Uncovering User Models in LLM Chatbots · Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas · The PacifAIst Benchmark:Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety? · Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions · EigenBench: A Comparative behavioural Measure of Value Alignment · Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs · Alignment Can Reduce Performance on Simple Ethical Questions · Moral Alignment for LLM Agents · The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models · Are Language Models Consequentialist or Deontological Moral Reasoners? · Playing repeated games with large language models · From Stability to Inconsistency: A Study of Moral Preferences in LLMs · VAL-Bench: Measuring Value Alignment in Language Models
Character training and persona steering
Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Theory of change: If post-training, prompting, and activation-engineering interact with some kind of structured ‘persona space’, then better understanding it should benefit the design, control, and detection of LLM personas.
General approach: cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
See also: Simulators · Activation engineering · Emergent misalignment · Hyperstition studies · Anthropic Safety · Cyborgism · shard theory · AI psychiatry · Ward et al
Some names: Truthful AI, OpenAI, Anthropic, CLR, Amanda Askell, Jack Lindsey, Janus, Theia Vogel, Sharan Maiya, Evan Hubinger
Critiques: Nostalgebraist
Funded by: Anthropic, Coefficient Giving
Some outputs: Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI · On the functional self of LLMs · Opus 4.5′s Soul Document · Persona Features Control Emergent Misalignment · Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time · Persona Vectors: Monitoring and Controlling Character Traits in Language Models · Reducing LLM deception at scale with self-other overlap fine-tuning · The Rise of Parasitic AI · A Three-Layer Model of LLM Psychology · Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models · Selection Pressures on LM Personas · the void · void miscellany
Emergent misalignment
Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.
Theory of change: Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify this correlated representation of the good.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: auditing real models · applied interpretability
Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans
Critiques: Emergent Misalignment as Prompt Sensitivity, Go home GPT-4o, you’re drunk
Funded by: Coefficient Giving, >$1 million
Estimated FTEs: 10-50
Some outputs: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models · Persona Features Control Emergent Misalignment · Model Organisms for Emergent Misalignment · School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs · Subliminal Learning: Language Models Transmit behavioural Traits via Hidden Signals in Data · Convergent Linear Representations of Emergent Misalignment · Narrow Misalignment is Hard, Emergent Misalignment is Easy · Aesthetic Preferences Can Cause Emergent Misalignment · Moloch’s Bargain: Emergent Misalignment When LLMs Compete for Audiences · Emergent Misalignment & Realignment · Realistic Reward Hacking Induces Different and Deeper Misalignment · Selective Generalization: Improving Capabilities While Maintaining Alignment · Emergent Misalignment on a Budget · The Rise of Parasitic AI · LLM AGI may reason about its goals and discover misalignments by default · Open problems in emergent misalignment
Model specs and constitutions
Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Theory of change: Model specs and constitutions serve three purposes. First, they provide a clear standard of behavior which can be used to train models to value what we want them to value. Second, they serve as something closer to a ground truth standard for evaluating the degree of misalignment ranging from “models straightforwardly obey the spec” to “models flagrantly disobey the spec”. A combination of scalable stress-testing and reinforcement for obedience can be used to iteratively reduce the risk of misalignment. Third, they get more useful as models’ instruction-following capability improves.
General approach: engineering · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
See also: Iterative alignment · Model psychology
Some names: Amanda Askell, Joe Carlsmith
Critiques: LLM AGI may reason about its goals and discover misalignments by default, On OpenAI’s Model Spec 2.0, Giving AIs safe motivations (esp. Sections 4.3-4.5), On Deliberative Alignment
Funded by: major funders include Anthropic and OpenAI (internally)
Some outputs: Claude’s Constitution · Google doesn’t have anything public. The Gemini system prompt is very short and dry and doesn’t even have any rules for handling copyrighted, let alone wetter stuff · Deliberative Alignment: Reasoning Enables Safer Language Models · Stress-Testing Model Specs Reveals Character Differences among Language Models · OpenAI Model Spec · Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences · No-self as an alignment target · Six Thoughts on AI Safety · How important is the model spec if alignment fails? · Political Neutrality in AI Is Impossible- But Here Is How to Approximate It · Giving AIs safe motivations
Model psychopathology
Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.
Theory of change: The study of ‘pathological’ phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training-dynamics (compare: the study of aphasia and visual processing disorder in humans plays a key role cognitive science), and in particular for developing a good theory of generalization in LLMS
General approach: behavioural/cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution
See also: Emergent misalignment · mechanistic anomaly detection
Some names: Janus, Truthful AI, Theia Vogel, Stewart Slocum, Nell Watson, Samuel G. B. Johnson, Liwei Jiang, Monika Jotautaite, Saloni Dash
Funded by: Coefficient Giving (via Truthful AI and Interpretability grants)
Estimated FTEs: 5-20
Some outputs: Subliminal Learning: Language models transmit behavioural traits via hidden signals in data · LLMs Can Get “Brain Rot”! · Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning · Unified Multimodal Models Cannot Describe Images From Memory · Believe It or Not: How Deeply do LLMs Believe Implanted Facts? · Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence · Imagining and building wise machines: The centrality of AI metacognition · Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) · Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions
Better data
Data filtering
Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.
Theory of change: By curating the pretraining data, we can prevent the model from learning dangerous capabilities (e.g., dual-use info) or undesirable behaviors (e.g., toxicity) in the first place, making safety more robust and “tamper-resistant” than post-training patches.
General approach: engineering · Target case: average
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify
See also: Data quality for alignment · Data poisoning defense · Synthetic data for alignment · Capability removal, unlearning
Some names: Yanda Chen, Pratyush Maini, Kyle O’Brien, Stephen Casper, Simon Pepin Lehalleur, Jesse Hoogland, Himanshu Beniwal, Sachin Goyal, Mycal Tucker, Dylan Sam
Critiques: When Bad Data Leads to Good Models, Medical large language models are vulnerable to data-poisoning attacks
Funded by: Anthropic, various academics
Estimated FTEs: 10-50
Some outputs: Enhancing Model Safety through Pretraining Data Filtering · Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs · Safety Pretraining: Toward the Next Generation of Safe AI · Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Hyperstition studies
Study, steer, and intervene on the following feedback loop: “we produce stories about how present and future AI systems behave” → “these stories become training data for the AI” → “these stories shape how AI systems in fact behave”.
Theory of change: Measure the influence of existing AI narratives in the training data → seed and develop more salutary ontologies and self-conceptions for AI models → control and redirect AI models’ self-concepts through selectively amplifying certain components of the training data.
General approach: cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
See also: Data filtering · active inference · LLM whisperers
Some names: Alex Turner, Hyperstition AI, Kyle O’Brien
Funded by: Unclear, niche
Estimated FTEs: 1-10
Some outputs: Training on Documents About Reward Hacking Induces Reward Hacking · Do Not Tile the Lightcone with Your Confused Ontology · Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models · Existential Conversations with Large Language Models: Content, Community, and Culture
Data poisoning defense
Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.
Theory of change: By identifying and filtering out malicious training examples, we can prevent attackers from creating hidden backdoors or triggers that would cause aligned models to behave dangerously.
General approach: engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can hack software supervisors, Someone else will deploy unsafe superintelligence first
See also: Data filtering · Safeguards (inference-time auxiliaries) · Various Redteams · adversarial robustness
Some names: Alexandra Souly, Javier Rando, Ed Chapman, Hanna Foerster, Ilia Shumailov, Yiren Zhao
Critiques: A small number of samples can poison LLMs of any size, Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
Funded by: Google DeepMind, Anthropic, University of Cambridge, Vector Institute
Estimated FTEs: 5-20
Some outputs: A small number of samples can poison LLMs of any size · Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated · Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Synthetic data for alignment
Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.
Theory of change: We can overcome the bottleneck of human feedback and data by using models to generate vast amounts of high-quality, targeted data for safety, preference tuning, and capability elicitation.
See also: data quality for alignment, data filtering, scalable oversight, automated alignment research, weak-to-strong generalization.
Orthodox problems: goals misgeneralize out of distribution, superintelligence can fool human supervisors, value is fragile and hard to specify.
Target case: average_case Broad approach: engineering
Some names: Mianqiu Huang, Xiaoran Liu, Rylan Schaeffer, Nevan Wichers, Aram Ebtekar, Jiaxin Wen, Vishakh Padmakumar, Benjamin Newman
Estimated FTEs: 50-150
Critiques: Synthetic Data in AI: Challenges, Applications, and Ethical Implications. Sort of Demski.
Funded by: Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups.
Outputs in 2025:
Aligning Large Language Models via Fully Self-Synthetic Data
Synth-Align: Improving Trustworthiness in Vision-Language Model with Synthetic Preference Data Alignment
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment, Nevan Wichers, Aram Ebtekar, Ariana Azarbal et al., 2025-10-27, arXiv
Unsupervised Elicitation of Language Models, Jiaxin Wen, Zachary Ankner, Arushi Somani et al., 2025-06-11, arXiv
Beyond the Binary: Capturing Diverse Preferences With Reward Regularization, Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk et al., 2024-12-05, arXiv
The Curious Case of Factuality Finetuning: Models’ Internal Beliefs Can Improve Factuality, Benjamin Newman, Abhilasha Ravichander, Jaehun Jung et al., 2025-07-11, arXiv
LongSafety: Enhance Safety for Long-Context LLMs, Mianqiu Huang, Xiaoran Liu, Shaojun Zhou et al., 2025-02-27, arXiv
Position: Model Collapse Does Not Mean What You Think, Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu et al., 2025-03-05, arXiv
Data quality for alignment
Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Theory of change: The quality of alignment is heavily dependent on the quality of the data (e.g., human preferences); by improving the “signal” from annotators and reducing noise/bias, we will get more robustly aligned models.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can fool human supervisors, Value is fragile and hard to specify
See also: Synthetic data for alignment · scalable oversight · Assistance games, assistive agents · Model values / default preferences
Some names: Maarten Buyl, Kelsey Kraus, Margaret Kroll, Danqing Shi
Critiques: A Statistical Case Against Empirical Human-AI Alignment
Funded by: Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups
Estimated FTEs: 20-50
Some outputs: AI Alignment at Your Discretion · Maximizing Signal in Human-Model Preference Alignment · DxHF: Providing High-Quality Human Feedback for LLM Alignment via Interactive Decomposition · Challenges and Future Directions of Data-Centric AI Alignment · You Are What You Eat—AI Alignment Requires Understanding How Data Shapes Structure and Generalisation
Goal robustness
Mild optimisation
Avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of change: If we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify
Funded by: Google DeepMind
Estimated FTEs: 10-50
Some outputs: MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking · BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format · Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges · From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks
RL safety
Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of change: Standard RL objectives (like maximizing expected value) are brittle and lead to goal misgeneralization or specification gaming; by developing more robust frameworks (like pessimistic RL, minimax regret, or provable inverse reward learning), we can create agents that are safe even when misspecified.
General approach: engineering · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify, Superintelligence can fool human supervisors
See also: Behavior alignment theory · Assistance games, assistive agents · Goal robustness · Iterative alignment · Mild optimisation · scalable oversight · The Theoretical Reward Learning Research Agenda: Introduction and Motivation
Some names: Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Critiques: “The Era of Experience” has an unsolved technical alignment problem, The Invisible Leash: Why RLVR May or May Not Escape Its Origin
Funded by: Google DeepMind, University of Oxford, CMU, Coefficient Giving
Estimated FTEs: 20-70
Some outputs: The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret · Safe Learning Under Irreversible Dynamics via Asking for Help · Mitigating Goal Misgeneralization via Minimax Regret · Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? · The Invisible Leash: Why RLVR May or May Not Escape Its Origin · Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference · Interpreting Emergent Planning in Model-Free Reinforcement Learning · Misalignment From Treating Means as Ends · “The Era of Experience” has an unsolved technical alignment problem · Safety cases for Pessimism · We need a field of Reward Function Design
Assistance games, assistive agents
Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.
Theory of change: Understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
See also: Guaranteed-Safe AI
General approach: engineering / cognitive · Target case: varies
Orthodox alignment problems: Value is fragile and hard to specify, Humanlike minds/goals are not necessarily safe
Some names: Joar Skalse, Anca Dragan, Caspar Oesterheld, David Krueger, Dylan Hafield-Menell, Stuart Russell
Critiques: nice summary of historical problem statements
Funded by: Future of Life Institute, Coefficient Giving, Survival and Flourishing Fund, Cooperative AI Foundation, Polaris Ventures
Some outputs: Training LLM Agents to Empower Humans · Murphys Laws of AI Alignment: Why the Gap Always Wins · AssistanceZero: Scalably Solving Assistance Games · Observation Interference in Partially Observable Assistance Games · Learning to Assist Humans without Inferring Rewards
Harm reduction for open weights
Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.
Theory of change: Open-weight models allow adversaries to easily remove post-training safety (like refusal training) via simple fine-tuning; by making safety an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information), the safeguards become far more difficult and expensive to remove.
General approach: engineering · Target case: average
Orthodox alignment problems: Someone else will deploy unsafe superintelligence first
See also: Data filtering · Capability removal, unlearning · Data poisoning defense
Some names: Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal
Funded by: UK AI Safety Institute (AISI), EleutherAI, Coefficient Giving
Estimated FTEs: 10-100
Some outputs: Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs · Tamper-Resistant Safeguards for Open-Weight LLMs · Open Technical Problems in Open-Weight AI Model Risk Management · A Different Approach to AI Safety Proceedings from the Columbia Convening on AI Openness and Safety · Risk Mitigation Strategies for the Open Foundation Model Value Chain
The “Neglected Approaches” Approach
Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.
Theory of change: Empirical search for “negative alignment taxes” (prioritizing methods that simultaneously enhance alignment and capabilities)
General approach: engineering · Target case: average
Orthodox alignment problems: Someone else will deploy unsafe superintelligence first
See also: Iterative alignment · automated alignment research · Beijing Key Laboratory of Safe AI and Superalignment · Aligned AI
Some names: AE Studio, Gunnar Zarncke, Cameron Berg, Michael Vaiana, Judd Rosenblatt, Diogo Schwerz de Lucena
Critiques: The ‘Alignment Bonus’ is a Dangerous Mirage
Funded by: AE Studio
Estimated FTEs: 15
Some outputs: Learning Representations of Alignment · Engineering Alignment: A Practical Framework for Prototyping ‘Negative Tax’ Solutions · Self-Correction in Thought-Attractors: A Nudge Towards Alignment
White-box safety (i.e. Interpretability)
This section isn’t very conceptually clean. See the Open Problems paper or Deepmind for strong frames which are not useful for descriptive purposes.
Reverse engineering
Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model’s internal algorithm.
Theory of change: By gaining a mechanical understanding of how a model works (the “circuit diagram”), we can predict how models will act in novel situations (generalization), and gain the mechanistic knowledge necessary to safely modify an AI’s goals or internal mechanisms, or allow for high-confidence alignment auditing and better feedback to safety researchers.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: ambitious mech interp
Some names: Lucius Bushnaq, Dan Braun, Lee Sharkey, Aaron Mueller, Atticus Geiger, Sheridan Feucht, David Bau, Yonatan Belinkov, Stefan Heimersheim, Chris Olah, Leo Gao
Critiques: Interpretability Will Not Reliably Find Deceptive AI, A Problem to Solve Before Building a Deception Detector, MoSSAIC: AI Safety After Mechanism, The Misguided Quest for Mechanistic AI Interpretability. Mechanistic?, Assessing skeptical views of interpretability research, Activation space interpretability may be doomed, A Pragmatic Vision for Interpretability
Estimated FTEs: 100-200
Some outputs: In weights-space · The Circuits Research Landscape · Circuits in Superposition, 2 · Compressed Computation is (probably) not Computation in Superposition · MIB: A Mechanistic Interpretability Benchmark · RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching · The Dual-Route Model of Induction · Structural Inference: Interpreting Small Language Models with Susceptibilities · Stochastic Parameter Decomposition · The Geometry of Self-Verification in a Task-Specific Reasoning Model · Converting MLPs into Polynomials in Closed Form · Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts · Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition · Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition · From Memorization to Reasoning in the Spectrum of Loss Curvature · Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers · How Do LLMs Perform Two-Hop Reasoning in Context? · Blink of an eye: a simple theory for feature localization in generative models · On the creation of narrow AI: hierarchy and nonlocality of neural network skills · In activations-space · Interpreting Emergent Planning in Model-Free Reinforcement Learning · Bridging the human–AI knowledge gap through concept discovery and transfer in AlphaZero · Building and evaluating alignment auditing agents · How Do Transformers Learn Variable Binding in Symbolic Programs? · Fresh in memory: Training-order recency is linearly encoded in language model activations · Language Models use Lookbacks to Track Beliefs · Constrained belief updates explain geometric structures in transformer representations · LLMs Process Lists With General Filter Heads · Language Models Use Trigonometry to Do Addition · Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban · Transformers Struggle to Learn to Search · Adversarial Examples Are Not Bugs, They Are Superposition · Do Language Models Use Their Depth Efficiently? · ICLR: In-Context Learning of Representations
Concept-based interpretability
Monitoring concepts
Identifies directions or subspaces in a model’s latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
Theory of change: By mapping internal activations to human-interpretable concepts, we can detect dangerous capabilities or deceptive alignment directly in the mind of the model even if its overt behavior is perfectly safe. Deploy computationally cheap monitors to flag some hidden misalignment in deployed systems.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, A boxed AGI might exfiltrate itself by steganography, spearphishing
See also: Pragmatic interp · Reverse engineering · Sparse Coding · Model diffing
Some names: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Tom Wollschläger, Anna Soligo, Jack Lindsey, Brian Christian, Ling Hu, Nicholas Goldowsky-Dill, Neel Nanda
Critiques: Exploring the generalization of LLM truth directions on conversational formats, Understanding (Un)Reliability of Steering Vectors in Language Models
Funded by: Coefficient Giving, Anthropic, various academic groups
Estimated FTEs: 50-100
Some outputs: Convergent Linear Representations of Emergent Misalignment · Detecting Strategic Deception Using Linear Probes · Toward universal steering and monitoring of AI models · Reward Model Interpretability via Optimal and Pessimal Tokens · The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence · Cost-Effective Constitutional Classifiers via Representation Re-use · Refusal in LLMs is an Affine Function · White Box Control at UK AISI—Update on Sandbagging Investigations · Here’s 18 Applications of Deception Probes · How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations · Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Activation engineering
Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
Theory of change: Test interpretability theories by intervening on activations; find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of finetuning. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.
General approach: engineering / cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify
See also: Sparse Coding
Some names: Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall
Critiques: Understanding (Un)Reliability of Steering Vectors in Language Models
Funded by: Coefficient Giving, Anthropic
Estimated FTEs: 20-100
Some outputs: Do safety-relevant LLM steering vectors optimized on a single example generalize? · Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers · Activation Space Interventions Can Be Transferred Between Large Language Models · HyperSteer: Activation Steering at Scale with Hypernetworks · Steering Evaluation-Aware Language Models to Act Like They Are Deployed · Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning · Persona Vectors: Monitoring and Controlling Character Traits in Language Models · Steering Large Language Model Activations in Sparse Spaces · Improving Steering Vectors by Targeting Sparse Autoencoder Features · Understanding Reasoning in Thinking Language Models via Steering Vectors · One-shot steering vectors cause emergent misalignment, too · Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control · Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks · Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models · Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models · Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Extracting latent knowledge
Identify and decoding the “true” beliefs or knowledge represented inside a model’s activations, even when the model’s output is deceptive or false.
Theory of change: Powerful models may know things they do not say (e.g. that they are currently being tested). If we can translate this latent knowledge directly from the model’s internals, we can supervise them reliably even when they attempt to deceive human evaluators or when the task is too difficult for humans to verify directly.
General approach: cognitive · Target case: worst case
Orthodox alignment problems: Superintelligence can fool human supervisors
See also: AI explanations of AIs · Heuristic explanations · Lie and deception detectors
Some names: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Alexander Pan, Lijie Chen, Jacob Steinhardt, Javier Ferrando, Oscar Obeso, Collin Burns, Paul Christiano.
Critiques: A Problem to Solve Before Building a Deception Detector
Funded by: Open Philanthropy, Anthropic, NSF, various academic grants
Estimated FTEs: 20-40
Some outputs: Auditing language models for hidden objectives · Eliciting Secret Knowledge from Language Models · Here’s 18 Applications of Deception Probes · Towards eliciting latent knowledge from LLMs with mechanistic interpretability · CCS-Lib: A Python package to elicit latent knowledge from LLMs · No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes · When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models · Caught in the Act: a mechanistic approach to detecting deception · When Truthful Representations Flip Under Deceptive Instructions?
Lie and deception detectors
Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in their definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
Theory of change: Such detectors could flag suspicious behavior during evaluations or deployment, augment training to reduce deception, or audit models pre-deployment. Specific applications include alignment evaluations (e.g. by validating answers to introspective questions), safeguarding evaluations (catching models that “sandbag”, that is, strategically underperform to pass capability tests), and large-scale deployment monitoring. An honest version of a model could also provide oversight during training or detect cases where a model behaves in ways it understands are unsafe.
General approach: cognitive · Target case: pessimistic
See also: Reverse engineering · AI deception evals · Sandbagging evals
Some names: Cadenza, Sam Marks, Rowan Wang, Kieron Kretschmar, Sharan Maiya, Walter Laurito, Chris Cundy, Adam Gleave, Aviel Parrack, Stefan Heimersheim, Carlo Attubato, Joseph Bloom, Jordan Taylor, Alex McKenzie, Urja Pawar, Lewis Smith, Bilal Chughtai, Neel Nanda
Critiques: difficult to determine if behavior is strategic deception or only low level “reflexive” actions; Unclear if a model roleplaying a liar has deceptive intent. How are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)?, Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity, Herrmann, Smith and Chughtai
Funded by: Anthropic, Deepmind, UK AISI, Coefficient Giving
Estimated FTEs: 10-50
Some outputs: Detecting Strategic Deception Using Linear Probes · Whitebox detection of sandbagging model organisms · Benchmarking deception probes for trusted monitoring · 18 Applications of Deception Probes · Evaluating honesty and lie detection techniques on a diverse suite of dishonest models · Caught in the Act: a mechanistic approach to detecting deception · Preference Learning with Lie Detectors can Induce Honesty or Evasion · Detecting High-Stakes Interactions with Activation Probes · White Box Control at UK AISI—Update on Sandbagging Investigations · Liars’ Bench: Evaluating Lie Detectors for Language Models · Probes and SAEs do well on Among Us benchmark
Model diffing
Understand what happens when a model is finetuned, what the “diff” between the finetuned and the original model consists in.
Theory of change: By identifying the mechanistic differences between a base model and its fine-tune (e.g., after RLHF), maybe we can verify that safety behaviors are robustly “internalized” rather than superficially patched, and detect if dangerous capabilities or deceptive alignment have been introduced without needing to re-analyze the entire model. The diff is also much smaller, since most parameters don’t change, which means you can use heavier methods on them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Value is fragile and hard to specify
See also: Sparse Coding · Reverse engineering
Some names: Julian Minder, Clément Dumas, Neel Nanda, Trenton Bricken, Jack Lindsey
Funded by: various academic groups, Anthropic, Google DeepMind
Estimated FTEs: 10-30
Some outputs: What We Learned Trying to Diff Base and Chat Models (And Why It Matters) · Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs · Discovering Undesired Rare Behaviors via Model Diff Amplification · Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning · Persona Features Control Emergent Misalignment · Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences · Insights on Crosscoder Model Diffing · Diffing Toolkit: Model Comparison and Analysis Framework
Sparse Coding
Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.
Theory of change: Get a principled decomposition of an LLM’s activation into atomic components → identify deception and other misbehaviors.
General approach: engineering / cognitive · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: Concept-based interpretability · Reverse engineering
Some names: Leo Gao, Dan Mossing, Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Thomas Heap, Abhinav Menon, Kenny Peng, Tim Lawson
Critiques: Sparse Autoencoders Can Interpret Randomly Initialized Transformers, The Sparse Autoencoders bubble has popped, but they are still promising, Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research, Sparse Autoencoders Trained on the Same Data Learn Different Features, Why Not Just Train For Interpretability?
Funded by: everyone, roughly. Frontier labs, LTFF, Coefficient Giving, etc.
Estimated FTEs: 50-100
Some outputs: Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning · Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models · Circuit Tracing: Revealing Computational Graphs in Language Models · Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models · I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders · Sparse Autoencoders Do Not Find Canonical Units of Analysis · Transcoders Beat Sparse Autoencoders for Interpretability · Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization · CRISP: Persistent Concept Unlearning via Sparse Autoencoders · The Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLMs · Scaling sparse feature circuit finding for in-context learning · Learning Multi-Level Features with Matryoshka Sparse Autoencoders · Are Sparse Autoencoders Useful? A Case Study in Sparse Probing · Sparse Autoencoders Trained on the Same Data Learn Different Features · What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data · Priors in Time: Missing Inductive Biases for Language Model Interpretability · Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models · Binary Sparse Coding for Interpretability · Scaling Sparse Feature Circuit Finding to Gemma 9B · Partially Rewriting a Transformer in Natural Language · Dense SAE Latents Are Features, Not Bugs · Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks · Evaluating SAE interpretability without explanations · SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs · SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability · SAEs Are Good for Steering—If You Select the Right Features · Line of Sight: On Linear Representations in VLLMs · Low-Rank Adapting Models for Sparse Autoencoders · Enhancing Automated Interpretability with Output-Centric Feature Descriptions · Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models · Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders · BatchTopK Sparse Autoencoders · Towards Understanding Distilled Reasoning Models: A Representational Approach · Understanding sparse autoencoder scaling in the presence of feature manifolds · Internal states before wait modulate reasoning patterns · Do Sparse Autoencoders Generalize? A Case Study of Answerability · Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs · How Visual Representations Map to Language Feature Space in Multimodal LLMs · Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video · Topological Data Analysis and Mechanistic Interpretability · Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages · Interpreting the linear structure of vision-language model embedding spaces · Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning · Weight-sparse transformers have interpretable circuits
Causal Abstractions
Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
Theory of change: By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model is using safe reasoning processes and predict its behavior on unseen inputs, rather than relying on behavioural testing alone.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution
See also: Concept-based interpretability · Reverse engineering
Some names: Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang
Critiques: The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI.
Funded by: Various academic groups, Google DeepMind, Goodfire
Estimated FTEs: 10-30
Some outputs: HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks · Combining Causal Models for More Accurate Abstractions of Neural Networks · How Causal Abstraction Underpins Computational Explanation
Data attribution
Quantifies the influence of individual training data points on a model’s specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
Theory of change: By attributing harmful, biased, or unaligned behaviors to specific training examples, researchers can audit proprietary models, debug training data, enable effective data deletion/unlearning
General approach: behavioural · Target case: average
Orthodox alignment problems: Goals misgeneralize out of distribution, Value is fragile and hard to specify
See also: Data quality for alignment
Some names: Roger Grosse, Philipp Alexander Kreer, Jin Hwa Lee, Matthew Smith, Abhilasha Ravichander, Andrew Wang, Jiacheng Liu, Jiaqi Ma, Junwei Deng, Yijun Pan, Daniel Murfet, Jesse Hoogland
Funded by: Various academic groups
Estimated FTEs: 30-60
Some outputs: Influence Dynamics and Stagewise Data Attribution · What is Your Data Worth to GPT? · Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation · Better Training Data Attribution via Better Inverse Hessian-Vector Products · DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models · Bayesian Influence Functions for Hessian-Free Data Attribution · OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens · You Are What You Eat—AI Alignment Requires Understanding How Data Shapes Structure and Generalisation · Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models · Distributional Training Data Attribution: What do Influence Functions Sample? · A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning · Revisiting Data Attribution for Influence Functions
Pragmatic interpretability
Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.
Theory of change: By applying interpretability skills to concrete problems, researchers can rapidly develop monitoring and control tools (e.g., steering vectors or probes) that have immediate, measurable impact on real-world safety issues like detecting hidden goals or emergent misalignment.
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Superintelligence can fool human supervisors, goals misgeneralize out of distribution.
See also: Reverse engineering · Concept-based interpretability
Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau
Funded by: Google DeepMind, Anthropic, various academic groups
Estimated FTEs: 30-60
Some outputs: A Pragmatic Vision for Interpretability · Agentic Interpretability: A Strategy Against Gradual Disempowerment · Auditing language models for hidden objectives
Other interpretability
Interpretability that does not fall well into other categories.
Theory of change: Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization). Or be “pragmatic”.
General approach: engineering / cognitive · Target case: mixed
Orthodox alignment problems: Superintelligence can fool human supervisors, Goals misgeneralize out of distribution
See also: Reverse engineering · Concept-based interpretability
Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau
Critiques: The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI.
Estimated FTEs: 30-60
Some outputs: · Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability · Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing · Open Problems in Mechanistic Interpretability · Against blanket arguments against interpretability · Opportunity Space: Renormalization for AI Safety · Prospects for Alignment Automation: Interpretability Case Study · The Urgency of Interpretability · Language Models May Verbatim Complete Text They Were Not Explicitly Trained On · Downstream applications as validation of interpretability progress · Principles for Picking Practical Interpretability Projects · Propositional Interpretability in Artificial Intelligence · Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey · Renormalization Redux: QFT Techniques for AI Interpretability · The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability · Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation · Call for Collaboration: Renormalization for AI safety · On the creation of narrow AI: hierarchy and nonlocality of neural network skills · Harmonic Loss Trains Interpretable AI Models · Extracting memorized pieces of (copyrighted) books from open-weight language models
Learning dynamics and developmental interpretability
Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model’s training and in-context learning phases.
Theory of change: Structures forming in neural networks leave identifiable traces that can be interpreted (e.g., using concepts from Singular Learning Theory); by catching and analyzing these developmental moments, researchers can automate interpretability, predict when dangerous capabilities emerge, and intervene to prevent deceptiveness or misaligned values as early as possible.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution
See also: Reverse engineering · Sparse Coding · ICL transience
Some names: Timaeus, Jesse Hoogland, George Wang, Daniel Murfet, Stan van Wingerden, Alexander Gietelink Oldenziel
Critiques: Vaintrob, Joar Skalse (2023)
Funded by: Manifund, Survival and Flourishing Fund, EA Funds
Estimated FTEs: 10-50
Some outputs: From SLT to AIT: NN Generalisation Out of Distribution · Understanding and Controlling LLM Generalization · SLT for AI Safety · Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining · A Review of Developmental Interpretability in Large Language Models · Dynamics of Transient Structure in In-Context Linear Regression Transformers · Learning Coefficients, Fractals, and Trees in Parameter Space · Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought · Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory · Programs as Singularities · What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? · Selective regularization for alignment-focused representation engineering · Modes of Sequence Models and Learning Coefficients · Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training
Representation structure and geometry
What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?
Theory of change: Get scalable unsupervised methods for finding structure in representations and interpreting them, then using this to e.g. guide training.
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: Concept-based interpretability · computational mechanics · feature universality · Natural abstractions · Causal Abstractions
Some names: Simplex, Insight + Interaction Lab, Paul Riechers, Adam Shai, Martin Wattenberg, Blake Richards, Mateusz Piotrowski
Funded by: Various academic groups, Astera Institute, Coefficient Giving
Estimated FTEs: 10-50
Some outputs: The Geometry of Self-Verification in a Task-Specific Reasoning Model · Rank-1 LoRAs Encode Interpretable Reasoning Signals · The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence · Embryology of a Language Model · Constrained belief updates explain geometric structures in transformer representations · Shared Global and Local Geometry of Language Model Embeddings · Neural networks leverage nominally quantum and post-quantum representations · Tracing the Representation Geometry of Language Models from Pretraining to Post-training · Deep sequence models tend to memorize geometrically; it is unclear why · Navigating the Latent Space Dynamics of Neural Models · The Geometry of ReLU Networks through the ReLU Transition Graph · Connecting Neural Models Latent Geometries with Relative Geodesic Representations · Next-token pretraining implies in-context learning
Human inductive biases
Discover connections deep learning AI systems have with human brains and human learning processes. Develop an ‘alignment moonshot’ based on a coherent theory of learning which applies to both humans and AI systems.
Theory of change: Humans learn trust, honesty, self-maintenance, and corrigibility; if we understand how they do maybe we can get future AI systems to learn them.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution
See also: active learning · ACS research
Some names: Lukas Muttenthaler, Quentin Delfosse
Funded by: Google DeepMind, various academic groups
Estimated FTEs: 4
Some outputs: Aligning machine and human visual representations across abstraction levels · Deep Reinforcement Learning Agents are not even close to Human Intelligence · Teaching AI to Handle Exceptions: Supervised Fine-tuning with Human-aligned Judgment · HIBP Human Inductive Bias Project Plan · Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment · Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment
Safety by construction
Approaches which try to get assurances about system outputs while still being scalable.
Guaranteed-Safe AI
Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.
Theory of change: Various, including:
i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;
ii) secure containers: create a ‘gatekeeper’ system that can act as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through.
(Notable for not requiring that we solve ELK; does require that we solve ontology though)
General approach: cognitive / engineering · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake, A boxed AGI might exfiltrate itself by steganography, spearphishing
See also: Towards Guaranteed Safe AI · Standalone World-Models · Scientist AI · Safeguarded AI · Asymptotic guarantees · Open Agency Architecture · SLES · program synthesis · Scalable formal oversight
Some names: ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate
Critiques: Zvi, Gleave, Dickson, Greenblatt
Funded by: Manifund, ARIA, Coefficient Giving, Survival and Flourishing Fund, Mila / CIFAR
Estimated FTEs: 10-100
Some outputs: SafePlanBench: evaluating a Guaranteed Safe AI Approach for LLM-based Agents · Beliefs about formal methods and AI safety · Report on NSF Workshop on Science of Safe AI · A benchmark for vericoding: formally verified program synthesis · A Toolchain for AI-Assisted Code Specification, Synthesis and Verification
Scientist AI
Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs
Theory of change: Developing non-agentic ‘Scientist AI’ allows us to: (i) reap the benefits of AI progress while (ii) avoiding the inherent risks of agentic systems. These systems can also (iii) provide a useful guardrail to protect us from unsafe agentic AIs by double-checking actions they propose, and (iv) help us more safely build agentic superintelligent systems.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Pivotal processes require dangerous capabilities, Goals misgeneralize out of distribution, Instrumental convergence
See also: JEPA · oracles
Some names: Yoshua Bengio, Younesse Kaddar
Critiques: Hard to find, but see Raymond Douglas’ comment, Karnofsky-Soares discussion. Perhaps also Predict-O-Matic.
Funded by: ARIA, Gates Foundation, Future of Life Institute, Coefficient Giving, Jaan Tallinn, Schmidt Sciences
Estimated FTEs: 1-10
Some outputs: Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? · The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems
Brainlike-AGI Safety
Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let’s figure out what those circuits are and how they work; this will involve symbol grounding. “a yet-to-be-invented variation on actor-critic model-based reinforcement learning”
Theory of change: Fairly-direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; “understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients”.
Agenda statement: My AGI safety research—2025 review, ’26 plans
General approach: cognitive · Target case: worst-case
Some names: Steve Byrnes
Critiques: Tsvi BT
Funded by: Astera Institute
Estimated FTEs: 1-5
Some outputs: Perils of Under vs Over-sculpting AGI Desires · Reward button alignment · System 2 Alignment: Deliberation, Review, and Thought Management · Against RL: The Case for System 2 Learning · Foom and Doom 1: Brain in a Box in a Basement · Foom and Doom 2: Technical Alignment is Hard
Make AI solve it
Weak-to-strong generalization
Use weaker models to supervise and provide a feedback signal to stronger models.
Theory of change: Find techniques that do better than RLHF at supervising superior models → track whether these techniques fail as capabilities increase further → keep the stronger systems aligned by amplifying weak oversight and quantifying where it breaks.
General approach: engineering · Target case: average
Orthodox alignment problems: Superintelligence can hack software supervisors
See also: white-box safety Supervising AIs improving AIs
Some names: Joshua Engels, Nora Belrose, David D. Baek
Critiques: Can we safely automate alignment research?, Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Funded by: lab funders, Eleuther funders
Estimated FTEs: 2-20
Some outputs: Scaling Laws For Scalable Oversight · Great Models Think Alike and this Undermines AI Oversight · Debate Helps Weak-to-Strong Generalization · Understanding the Capabilities and Limitations of Weak-to-Strong Generalization
Supervising AIs improving AIs
Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
Theory of change: Early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading. Left unchecked this will likely cause problems, so supervision mechanisms are needed to help ensure the AI self-improvement remains legible.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Roman Engeler, Akbir Khan, Ethan Perez
Critiques: Automation collapse, Great Models Think Alike and this Undermines AI Oversight
Funded by: Long-Term Future Fund, lab funders
Estimated FTEs: 1-10
Some outputs: Bare Minimum Mitigations for Autonomous AI Development · Dodging systematic human errors in scalable oversight · Scaling Laws for Scalable Oversight · Neural Interactive Proofs · Modeling Human Beliefs about AI Behavior for Scalable Oversight · Scalable Oversight for Superhuman AI via Recursive Self-Critiquing · Video and transcript of talk on automating alignment research · Maintaining Alignment during RSI as a Feedback Control Problem
AI explanations of AIs
Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that “searches” for a specified behaviour in frontier models.
Theory of change: Use AI to help improve interp and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.
See also: Interpretability.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
Some names: Transluce, Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann, Robert Friel
Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba
Estimated FTEs: 15-30
Some outputs: Automatically Jailbreaking Frontier Language Models with Investigator Agents · Surfacing Pathological Behaviors in Language Models · Investigating truthfulness in a pre-release o3 model · Neuron circuits · Docent: A system for analyzing and intervening on agent behavior
Debate
In the limit, it’s easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
Theory of change: “Give humans help in supervising strong agents” + “Align explanations with the true reasoning process of the agent” + “Red team models to exhibit failure modes that don’t occur in normal use” are necessary but probably not sufficient for safe AGI.
General approach: engineering / cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Superintelligence can fool human supervisors
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras, UK AISI (Benjamin Holton)
Critiques: The limits of AI safety via debate (2022)
Funded by: Google, others
Some outputs: UK AISI Alignment Team: Debate Sequence · Prover-Estimator Debate: A New Scalable Oversight Protocol · AI Debate Aids Assessment of Controversial Claims · An alignment safety case sketch based on debate · Ensemble Debates with Local Large Language Models for AI Alignment · A dataset of rated conceptual arguments
LLM introspection training
Train LLMs to the predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use its own ‘introspective’ access
Theory of change: Use the resulting LLMs as powerful dimensionality reduction, explaining internals in a distinct way than interpretability methods and CoT. Distilling self-explanation into the model should lead to the skill being scalable, since self-explanation skill advancement will feed off general-intelligence advancement.
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: Transluce, Anthropic
Some names: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas, Jack Lindsey
Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba
Estimated FTEs: 2-20
Some outputs: Training Language Models to Explain Their Own Computations · Emergent Introspective Awareness
Theory
Develop a principled scientific understanding that will help us reliably understand and control current and future AI systems.
Agent foundations
Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.
Theory of change: Rigorously understand optimization processed and agents, and what it means for them to be aligned in a substrate independent way → identify impossibility results and necessary conditions for aligned optimizer systems → use this theoretical understanding to eventually design safe architectures that remain stable and safe under self-reflection
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
See also: Aligning what? · Tiling agents · Dovetail
Some names: Abram Demski, Alex Altair, Sam Eisenstat, Thane Ruthenis, Alex Altair, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino
Some outputs: Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games · UAIASI · Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions · Agent foundations: not really math, not really science · Off-switching not guaranteed · Formalizing Embeddedness Failures in Universal Artificial Intelligence · Is alignment reducible to becoming more coherent? · What Is The Alignment Problem? · Good old fashioned decision theory · Report & retrospective on the Dovetail fellowship
Tiling agents
An aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/approaches that prevent it from happening.
Theory of change: Build enough theoretical basis through various approaches such that AI systems we create are capable of self-modification while preserving goals.
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
See also: Agent foundations
Some names: Abram Demski
Estimated FTEs: 1-10
Some outputs: Working through a small tiling result · Communication & Trust · Maintaining Alignment during RSI as a Feedback Control Problem · Understanding Trust
High-Actuation Spaces
Mech interp and alignment assume a stable “computational substrate” (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a “telic DAG” which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.
Theory of change: Sufficiently complex mindlike entities can alter their goals in ways that cannot be predicted or accounted for under substrate-dependent descriptions of the kind sought in mechanistic interpretability. use the telic DAG to define a method analogous to factoring a causal DAG.
General approach: maths / philosophy · Target case: pessimistic
See also: Live theory · MoSSAIC · Topos Institute · Agent foundations
Some names: Sahil K, Matt Farr, Aditya Arpitha Prasad, Chris Pang, Aditya Adiga, Jayson Amati, Steve Petersen, Topos, T J
Estimated FTEs: 1-10
Some outputs: groundless.ai · Live Theory · High Actuation Spaces—Sahil · What, if not agency? · Human Inductive Bias Project · MoSSAIC: AI Safety After Mechanism · HAS—Public (High Actuation Spaces)
Asymptotic guarantees
Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory, game theory, learning theory and other areas to both improve asymptotic guarantees and develop ways of showing convergence.
Theory of change: Formal verification may be too hard. Make safety cases stronger by modelling their processes and proving that they would work in the limit.
General approach: cognitive · Target case: pessimistic
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors
See also: Debate · Guaranteed-Safe AI · Control
Some names: AISI, Jacob Pfau, Benjamin Hilton, Geoffrey Irving, Simon Marshall, Will Kirby, Martin Soto, David Africa, davidad
Critiques: Self-critique in UK AISI’s Alignment Team: Research Agenda
Funded by: AISI
Estimated FTEs: 5 − 10
Some outputs: An alignment safety case sketch based on debate · UK AISI’s Alignment Team: Research Agenda · Dodging systematic human errors in scalable oversight · Can DPO Learn Diverse Human Values? A Theoretical Scaling Law
Heuristic explanations
Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations” and use them to predict when novel input will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).
Theory of change: The current goalpost is methods whose reasoned predictions about properties of a neural network’s outputs distribution (for a given inputs distribution) are certifiably at least as accurate as estimations via sampling. If successful for safety-relevant properties, this should allow for automated alignment methods that are both human-legible and worst-case certified, as well more efficient than sampling-based methods in most cases.
General approach: cognitive / maths/philosophy · Target case: worst-case
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can hack software supervisors
See also: ARC Theory · ELK · mechanistic anomaly detection · Acorn · Guaranteed-Safe AI
Some names: Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson
Critiques: Matolcsi
Estimated FTEs: 1-10
Some outputs: A computational no-coincidence principle · Competing with sampling · Obstacles in ARC’s research agenda · Deduction-Projection Estimators for Understanding Neural Networks · Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture
Corrigibility
Behavior alignment theory
Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.
Theory of change: Figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold → test these hypotheses where feasible → design training environments that lead to more salutary properties.
General approach: maths / philosophy · Target case: worst-case
Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence
See also: Agent foundations · Control
Some names: Ram Potham, Michael K. Cohen, Max Harms/Raelifin, John Wentworth, David Lorell, Elliott Thornley
Critiques: Ryan Greenblatt’s criticism of one behavioural proposal
Estimated FTEs: 1-10
Some outputs: Preference gaps as a safeguard against AI self-replication · Serious Flaws in CAST · A Shutdown Problem Proposal · Shutdownable Agents through POST-Agency · The Partially Observable Off-Switch Game · Imitation learning is probably existentially safe · Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power · Deceptive Alignment and Homuncularity · A Safety Case for a Deployed LLM: Corrigibility as a Singular Target · LLM AGI will have memory, and memory changes alignment
Other corrigibility
Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors
Theory of change: Labs are likely to develop AGI using something analogous to current pipelines. Clarifying why naive instruction-following doesn’t buy robust corrigibility + building strong tripwires/diagnostics for scheming and Goodharting thus reduces risks on the likely default path.
General approach: varies · Target case: pessimistic
Orthodox alignment problems: Corrigibility is anti-natural, Instrumental convergence
See also: Behavior alignment theory
Some names: Jeremy Gillen
Estimated FTEs: 1-10
Some outputs: AI Assistants Should Have a Direct Line to Their Developers · Detect Goodhart and shut down · Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · Shutdownable Agents through POST-Agency · Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”) · Oblivious Defense in ML Models: Backdoor Removal without Detection · Cryptographic Backdoor for Neural Networks: Boon and Bane · A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning · Problems with instruction-following as an alignment target
Ontology Identification
Natural abstractions
Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.
Theory of change: Understand the concepts a system’s understanding is structured with and use them to inspect its “alignment/safety properties” and/or “retarget its search”, i.e. identify utility-function-like components inside an AI and replacing calls to them with calls to “user values” (represented using existing abstractions inside the AI).
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Instrumental convergence, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake
See also: Causal Abstractions · representational alignment · convergent abstractions · feature universality · Platonic representation hypothesis · microscope AI
Some names: John Wentworth, Paul Colognese, David Lorrell, Sam Eisenstat, Fernando Rosas
Critiques: Chan et al (2023), Soto, Harwood, Soares (2023)
Estimated FTEs: 1-10
Some outputs: Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems · Condensation · Platonic representation hypothesis · Rosas · Natural Latents: Latent Variables Stable Across Ontologies · Condensation: a theory of concepts · Factored space models: Towards causality between levels of abstraction · A single principle related to many Alignment subproblems? · Getting aligned on representational alignment · Symmetries at the origin of hierarchical emergence
The Learning-Theoretic Agenda
Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology and ours; produce formal desiderata for a training setup that produces coherent AGIs similar to (our model of) an aligned agent
Theory of change: Fix formal epistemology to work out how to avoid deep training problems
General approach: cognitive · Target case: worst-case
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Humans cannot be first-class parties to a superintelligent value handshake
Some names: Vanessa Kosoy, Diffractor, Gergely Szücs
Critiques: Matolcsi
Funded by: Survival and Flourishing Fund, ARIA, UK AISI, Coefficient Giving
Estimated FTEs: 3
Some outputs: Infra-Bayesian Decision-Estimation Theory · Infra-Bayesianism category on LessWrong · Ambiguous Online Learning · Regret Bounds for Robust Online Decision Making · What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism · Non-Monotonic Infra-Bayesian Physicalism
Multi-agent first
Aligning to context
Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility functions.
Theory of change: “Many classical problems in AGI alignment are downstream of a type error about human values.” Operationalizing a correct view of human values—one that treats human values as impossible or impractical to abstract from concrete practices—will unblock value fragility, goal-misgeneralization, instrumental convergence, and pivotal-act specification.
General approach: behavioural · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes
See also: Aligning what? · Aligned to who?
Some names: Full Stack Alignment, Meaning Alignment Institute, Plurality Institute, Tan Zhi-Xuan, Matija Franklin, Ryan Lowe, Joe Edelman, Oliver Klingefjord
Funded by: ARIA, OpenAI, Survival and Flourishing Fund
Estimated FTEs: 5
Some outputs: The Frame-Dependent Mind · On Eudaimonia and Optimization · Full-Stack Alignment · A theory of appropriateness · 2404.10636 - What are human values, and how do we align AI to them? · Model Integrity · Beyond Preferences in AI Alignment · 2503.00940 - Can AI Model the Complexities of Human Moral Decision-Making? A Qualitative Study of Kidney Allocation Decisions
Aligning to the social contract
Generate AIs’ operational values from ‘social contract’-style ideal civic deliberation formalisms and their consequent rulesets for civic actors
Theory of change: Formalize and apply the liberal tradition’s project of defining civic principles separable from the substantive good, aligning our AIs to civic principles that bypass fragile utility-learning and intractable utility-calculation
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify, Goals misgeneralize out of distribution, Instrumental convergence, Humanlike minds/goals are not necessarily safe, Fair, sane pivotal processes
See also: Aligning to context · Aligning what?
Some names: Gillian Hadfield, Tan Zhi-Xuan, Sydney Levine, Matija Franklin, Joshua B. Tenenbaum
Funded by: Deepmind, Macroscopic Ventures
Estimated FTEs: 5 − 10
Some outputs: Law-Following AI: designing AI agents to obey human laws · A Pragmatic View of AI Personhood · Societal alignment frameworks can improve llm alignment · 2509.07955 - ACE and Diverse Generalization via Selective Disagreement · 2506.17434 - Resource Rational Contractualism Should Guide AI Alignment · Statutory Construction and Interpretation for Artificial Intelligence · 2408.16984 - Beyond Preferences in AI Alignment · Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments
Theory for aligning multiple AIs
Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI agents in multi-agent scenarios.
Theory of change: While traditional AGI safety focuses on idealized decision-theory and individual agents, it’s plausible that strategic AI agents will first emerge (or are emerging now) in a complex, multi-AI strategic landscape. We need granular, realistic formal models of AIs’ strategic interactions and collective dynamics to understand this future.
General approach: cognitive · Target case: mixed
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: Tools for aligning multiple AIs · Aligning what?
Some names: Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Vojta Kovarik, Nathaniel Sauerberg, ACS Research, Jan Kulveit, Richard Ngo, Emmett Shear, Softmax, Full Stack Alignment, AI Objectives Institute, Sahil, TJ, Andrew Critch
Funded by: SFF, CAIF, Deepmind, Macroscopic Ventures
Estimated FTEs: 10
Some outputs: Multi-Agent Risks from Advanced AI · An Economy of AI Agents · Moloch’s Bargain: Emergent Misalignment When LLMs Compete for Audiences · AI Testing Should Account for Sophisticated Strategic Behaviour · Emergent social conventions and collective bias in LLM populations · Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory · Communication Enables Cooperation in LLM Agents · Higher-Order Belief in Incomplete Information MAIDs · Characterising Simulation-Based Program Equilibria · Safe (Pareto) Improvements in Binary Constraint Structures · Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments · The Pando Problem: Rethinking AI Individuality
Tools for aligning multiple AIs
Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.
Theory of change: Addressing multi-agent AI dynamics is key for aligning near-future agents and their impact on the world. Feedback loops from multi-agent dynamics can radically change the future AI landscape, and require a different toolset from model psychology to audit and control.
General approach: engineering / behavioural · Target case: mixed
Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: Theory for aligning multiple AIs · Aligning what?
Some names: Andrew Critch, Lewis Hammond, Emery Cooper, Allan Chan, Caspar Oesterheld, Vincent Conitzer, Gillian Hadfield, Nathaniel Sauerberg, Zhijing Jin
Funded by: Coefficient Giving, Deepmind, Cooperative AI Foundation
Estimated FTEs: 10 − 15
Some outputs: Reimagining Alignment · Beyond the high score: Prosocial ability profiles of multi-agent populations · Multiplayer Nash Preference Optimization · AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement · When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems · Infrastructure for AI Agents · A dataset of questions on decision-theoretic reasoning in Newcomb-like problems · The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation · PGG-Bench: Contribute & Punish · Virtual Agent Economies · An Interpretable Automated Mechanism Design Framework with Large Language Models · Comparing Collective Behavior of LLM and Human Groups
Aligned to who?
Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to “humanity”
Theory of change: use democratic/pluralist/context-sensitive principles to guide AI development, alignment, and deployment somehow. Doing it as an afterthought in post-training or the spec isn’t good enough. Continuously shape AI’s social and technical feedback loop on the road to AGI
General approach: behavioural · Target case: average
Orthodox alignment problems: Value is fragile and hard to specify, Fair, sane pivotal processes
See also: Aligning what? · Aligning to context
Some names: Joel Z. Leibo, Divya Siddarth, Séb Krier, Luke Thorburn, Seth Lazar, AI Objectives Institute, The Collective Intelligence Project, Vincent Conitzer
Funded by: Future of Life Institute, Survival and Flourishing Fund, Deepmind, CAIF
Estimated FTEs: 5 − 15
Some outputs: The AI Power Disparity Index: Toward a Compound Measure of AI Actors’ Power to Shape the AI Ecosystem · Research Agenda for Sociotechnical Approaches to AI Safety · 2507.09650 - Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset · Training LLM Agents to Empower Humans · Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt · Democratic AI is Possible: The Democracy Levels Framework Shows How It Might Work · 2503.05728 - Political Neutrality in AI Is Impossible—But Here Is How to Approximate It · Build Agent Advocates, Not Platform Agents · Gradual Disempowerment
Aligning what?
Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constitutive processes as subject to alignment
Theory of change: Model multiple reality-shaping processes above and below the level of the individual AI, some of which are themselves quasi-agential (e.g. cultures) or intelligence-like (e.g. markets), will develop AI alignment into a mature science for managing the transition to an AGI civilization
General approach: behavioural / cognitive · Target case: mixed
Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution, Instrumental convergence, Fair, sane pivotal processes
See also: Theory for aligning multiple AIs · Aligning to context · Aligned to who?
Some names: Richard Ngo, Emmett Shear, Softmax, Full Stack Alignment, AI Objectives Institute, Sahil, TJ, Andrew Critch, ACS Research, Jan Kulveit
Funded by: Future of Life Institute, Emmett Shear
Estimated FTEs: 5-10
Some outputs: Towards a Scale-Free Theory of Intelligent Agency · Alignment first, intelligence later · End A Subset Of Conversations · Full-Stack Alignment · On Eudaimonia and Optimization · AI Governance through Markets · Collective cooperative intelligence · Multipolar AI is Underrated · What, if not agency? · A Phylogeny of Agents · The Multiplicity Thesis, Collective Intelligence, and Morality · Hierarchical Alignment · Emmett Shear on Building AI That Actually Cares: Beyond Control and Steering
Evals
AGI metrics
Evals with the explicit aim of measuring progress towards full human-level generality.
Theory of change: Help predict timelines for risk awareness and strategy.
General approach: behavioural · Target case: mixed
See also: Capability evals
Some names: CAIS, CFI Kinds of Intelligence, Apart Research, OpenAI, METR, Lexin Zhou, Adam Scholl, Lorenzo Pacchiardi
Critiques: Is the Definition of AGI a Percentage?, The “Length” of “Horizons”
Funded by: Leverhulme Trust, Open Philanthropy, Long-Term Future Fund
Estimated FTEs: 10-50
Some outputs: HCAST: Human-Calibrated Autonomy Software Tasks · A Definition of AGI · Remote Labor Index · ADeLe v1.0: A battery for AI Evaluation with explanatory and predictive power · GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Capability evals
Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Theory of change: Keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary (and hopefully they extrapolate). You can’t regulate without them.
General approach: behavioural · Target case: average
See also: Deepmind’s frontier safety framework · Aether
Some names: METR, AISI, Apollo Research, Marrius Hobbhahn, Meg Tong, Mary Phuong, Beth Barnes, Thomas Kwa, Joel Becker
Critiques: Large Language Models Often Know When They Are Being Evaluated, AI Sandbagging: Language Models can Strategically Underperform on Evaluations, The Leaderboard Illusion, Do Large Language Model Benchmarks Test Reliability?
Funded by: basically everyone. Google, Microsoft, Open Philanthropy, LTFF, Governments etc
Estimated FTEs: 100+
Some outputs: MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity · Forecasting Rare Language Model Behaviors · Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities · The Elicitation Game: Evaluating Capability Elicitation Techniques · Evaluating Language Model Reasoning about Confidential Information · Evaluating the Goal-Directedness of Large Language Models · A Toy Evaluation of Inference Code Tampering · Automated Capability Discovery via Foundation Model Self-Exploration · Generative Value Conflicts Reveal LLM Priorities · Technical Report: Evaluating Goal Drift in Language Model Agents · Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity · When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas · AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons · Petri: An open-source auditing tool to accelerate AI safety research · Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals · Hyperbolic model fits METR capabilities estimate worse than exponential model · New website analyzing AI companies’ model evals · Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals · How Fast Can Algorithms Advance Capabilities? | Epoch Gradient Update · Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods · Adversarial ML Problems Are Getting Harder to Solve and to Evaluate · Predicting the Performance of Black-box LLMs through Self-Queries · Among AIs · Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods · Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index · We should try to automate AI safety work asap · Validating against a misalignment detector is very different to training against one · Why do misalignment risks increase as AIs get more capable? · Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas · Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks · Why Future AIs will Require New Alignment Methods · 100+ concrete projects and open problems in evals · AI companies should be safety-testing the most capable versions of their models · The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input
Autonomy evals
Measure an AI’s ability to act autonomously to complete long-horizon, complex tasks.
Theory of change: By measuring how long and complex a task an AI can complete (its “time horizon”), we can track capability growth and identify when models gain dangerous autonomous capabilities (like R&D acceleration or replication).
General approach: behavioural · Target case: average
See also: Capability evals · OpenAI Preparedness · Anthropic RSP
Some names: METR, Thomas Kwa, Ben West, Joel Becker, Beth Barnes, Hjalmar Wijk, Tao Lin, Giulio Starace, Oliver Jaffe, Dane Sherburn, Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou
Critiques: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. The “Length” of “Horizons”
Funded by: The Audacious Project, Open Philanthropy
Estimated FTEs: 10-50
Some outputs: Fulcrum · Measuring AI Ability to Complete Long Tasks · Details about METR’s evaluation of OpenAI GPT-5 · RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts · OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents · OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety · Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini · PaperBench: Evaluating AI’s Ability to Replicate AI Research · How Does Time Horizon Vary Across Domains? · Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents · Forecasting Frontier Language Model Agent Capabilities · Project Vend: Can Claude run a small shop? (And why does that matter?) · GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments
WMD evals (Weapons of Mass Destruction)
Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.
Theory of change: By benchmarking and tracking AI’s knowledge of biology and chemistry, we can identify when models become capable of accelerating WMD development or misuse, allowing for timely intervention.
General approach: behavioural · Target case: pessimistic
See also: Capability evals · Autonomy evals · Various Redteams
Some names: Lennart Justen, Haochen Zhao, Xiangru Tang, Ziran Yang, Aidan Peppin, Anka Reuel, Stephen Casper
Critiques: The Reality of AI and Biorisk
Funded by: Open Philanthropy, UK AI Safety Institute (AISI), frontier labs, Scale AI, various academic institutions (Peking University, Yale, etc.), Meta
Estimated FTEs: 10-50
Some outputs: Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark · LLMs Outperform Experts on Challenging Biology Benchmarks · The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models · Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models · ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain · The Reality of AI and Biorisk
Situational awareness and self-awareness evals
Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Theory of change: if an AI can distinguish between evaluation and deployment (“evaluation awareness”), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: Sandbagging evals · Various Redteams · Model psychology
Some names: Jan Betley, Xuchan Bao, Martín Soto, Mary Phuong, Roland S. Zimmermann, Joe Needham, Giles Edkins, Govind Pimpale, Kai Fronsdal, David Lindner, Lang Xiong, Xiaoyan Bai
Critiques: Lessons from a Chimp: AI “Scheming” and the Quest for Ape Language, It’s hard to make scheming evals look realistic for LLMs
Funded by: frontier labs (Google DeepMind, Anthropic), Open Philanthropy, The Audacious Project, UK AI Safety Institute (AISI), AI Safety Support, Apollo Research, METR
Estimated FTEs: 30-70
Some outputs: AI Awareness (literature review) · Tell me about yourself: LLMs are aware of their learned behaviors · Evaluating Frontier Models for Stealth and Situational Awareness · Large Language Models Often Know When They Are Being Evaluated · Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings · Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness · Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations · It’s hard to make scheming evals look realistic for LLMs · Know Thyself? On the Incapability and Implications of AI Self-Recognition · Chain-of-Thought Snippets — Anti-Scheming Future Events as Backdoor Triggers· Evaluating Frontier Models for Stealth and Situational Awareness
Steganography evals
evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.
Theory of change: if models can use steganography, they could hide deceptive reasoning, bypassing safety monitoring and control measures. By evaluating this capability, we can assess the risk of a model fooling its supervisors.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Superintelligence can fool human supervisors
See also: AI deception evals · Chain of thought monitoring
Some names: Antonio Norelli, Michael Bronstein
Critiques: Chain-of-Thought Is Already Unfaithful (So Steganography is Irrelevant): Reasoning Models Don’t Always Say What They Think.
Funded by: Anthropic (and its general funders, e.g., Google, Amazon)
Estimated FTEs: 1-10
Some outputs: Large language models can learn and generalize steganographic chain-of-thought under process supervision · Early Signs of Steganographic Capabilities in Frontier LLMs · Subliminal Learning: Language models transmit behavioural traits via hidden signals in data · LLMs can hide text in other text of the same length · Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
AI deception evals
research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.
Theory of change: proactively discover, evaluate, and understand the mechanisms of AI deception (e.g., alignment faking, manipulation, agentic deception) to prevent models from fooling human supervisors and causing harm.
General approach: behavioural / engineering · Target case: worst-case
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: Situational awareness and self-awareness evals · Steganography evals · Sandbagging evals · Chain of thought monitoring
Some names: Cadenza, Fred Heiding, Simon Lermen, Andrew Kao, Myra Cheng, Cinoo Lee, Pranav Khadpe, Satyapriya Krishna, Andy Zou, Rahul Gupta
Critiques: A central criticism is that the evaluation scenarios are “artificial and contrived”. the void and Lessons from a Chimp argue this research is “overattributing human traits” to models.
Funded by: Labs, academic institutions (e.g., Harvard, CMU, Barcelona Institute of Science and Technology), NSFC, ML Alignment Theory & Scholars (MATS) Program, FAR AI
Estimated FTEs: 30-80
Some outputs: Liars’ Bench: Evaluating Lie Detectors for Language Models · DECEPTIONBENCH: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenario · Why Do Some Language Models Fake Alignment While Others Don’t? · Alignment Faking Revisited: Improved Classifiers and Open Source Extensions · D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models · Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL · Among Us: A Sandbox for Measuring and Detecting Agentic Deception · Eliciting Secret Knowledge from Language Models · Edge Cases in AI Alignment · The MASK Evaluation · I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment · Evaluating Large Language Models’ Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects · Mistral Large 2 (123B) seems to exhibit alignment faking
AI scheming evals
Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Theory of change: Robust evaluations must move beyond checking final outputs and probe the model’s reasoning to verify that alignment is genuine, not faked, because capable models are capable of strategically concealing misaligned goals (scheming) to pass standard behavioural evaluations.
General approach: behavioural / engineering · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors
See also: AI deception evals · Situational awareness and self-awareness evals
Some names: Bronson Schoen, Alexander Meinke, Jason Wolfe, Mary Phuong, Rohin Shah, Evgenia Nitishinskaya, Mikita Balesni, Marius Hobbhahn, Jérémy Scheurer, Wojciech Zaremba, David Lindner
Critiques: No, LLMs are not “scheming”
Funded by: OpenAI, Anthropic, Google DeepMind, Open Philanthropy.
Estimated FTEs: 30-60
Some outputs: Detecting and reducing scheming in AI models · Evaluating and Understanding Scheming Propensity in LLM Agents · Stress Testing Deliberative Alignment for Anti-Scheming Training · Scheming Ability in LLM-to-LLM Strategic Interactions · Frontier Models are Capable of In-context Scheming · Agentic Misalignment · Testing for Scheming with Model Deletion
Sandbagging evals
Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Theory of change: If models can distinguish between evaluation and deployment contexts (“evaluation awareness”), they might learn to “sandbag” or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.
General approach: behavioural · Target case: pessimistic
Orthodox alignment problems: Superintelligence can fool human supervisors, Superintelligence can hack software supervisors
See also: AI deception evals · Situational awareness and self-awareness evals · Various Redteams
Some names: Teun van der Weij, Cameron Tice, Chloe Li, Johannes Gasteiger, Joseph Bloom, Joel Dyer
Critiques: The main external critique, from sources like “the void” and “Lessons from a Chimp”, is that this research “overattribut[es] human traits” to models. It argues that what’s being measured isn’t genuine sandbagging but models “playing-along-with-drama behaviour” in response to “artificial and contrived” evals.
Funded by: Anthropic (and its funders, e.g., Google, Amazon), UK Government (funding the AI Security Institute)
Estimated FTEs: 10-50
Some outputs: Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models · Sandbagging in a Simple Survival Bandit Problem · Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs · AI Sandbagging: Language Models can Strategically Underperform on Evaluations · Automated Researchers Can Subtly Sandbag · LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring · White Box Control at UK AISI—Update on Sandbagging Investigations · Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking · Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
Self-replication evals
evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.
Theory of change: if AI agents gain the ability to self-replicate, they could proliferate uncontrollably, making them impossible to shut down. By measuring this capability with benchmarks like RepliBench, we can identify when models cross this dangerous “red line” and implement controls before losing containment.
General approach: behavioural · Target case: worst-case
Orthodox alignment problems: Instrumental convergence, A boxed AGI might exfiltrate itself by steganography, spearphishing
See also: Autonomy evals · Situational awareness and self-awareness evals
Some names: Sid Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney, Xudong Pan, Jiarun Dai, Yihe Fan
Critiques: AI Sandbagging
Funded by: UK Government (via UK AI Safety Institute)
Estimated FTEs: 10-20
Some outputs: Large language model-powered AI systems achieve self-replication with no human intervention · A Realistic Evaluation of Self-Replication Risk in LLM Agents · RepliBench: measuring autonomous replication capabilities in AI systems
Various Redteams
attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Theory of change: to ensure models are safe, we must actively try to break them. By developing and applying a diverse suite of attacks (e.g., in novel domains, against agentic systems, or using automated tools), researchers can discover vulnerabilities, specification gaming, and deceptive behaviors before they are exploited, thereby informing the development of more robust defenses.
General approach: behavioural · Target case: average
Orthodox alignment problems: A boxed AGI might exfiltrate itself by steganography, spearphishing, Goals misgeneralize out of distribution
See also: Other evals
Some names: Ryan Greenblatt, Benjamin Wright, Aengus Lynch, John Hughes, Samuel R. Bowman, Andy Zou, Nicholas Carlini, Abhay Sheshadri
Critiques: Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations, Red Teaming AI Red Teaming.
Funded by: Frontier labs (Anthropic, OpenAI, Google), government (UK AISI), Open Philanthropy, LTFF, academic grants.
Estimated FTEs: 100+
Some outputs: WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models · In-Context Representation Hijacking · Building and evaluating alignment auditing agents · Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise · Agentic Misalignment: How LLMs could be insider threats · Compromising Honesty and Harmlessness in Language Models via Deception Attacks · Eliciting Language Model Behaviors with Investigator Agents · Shutdown Resistance in Large Language Models · Stress Testing Deliberative Alignment for Anti-Scheming Training · Chain-of-Thought Hijacking · X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents · Agentic Misalignment: How LLMs Could be Insider Threats · Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google · Why Do Some Language Models Fake Alignment While Others Don’t? · Demonstrating specification gaming in reasoning models · Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility · Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors · Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning · Call Me A Jerk: Persuading AI to Comply with Objectionable Requests · RedDebate: Safer Responses through Multi-Agent Red Teaming Debates · The Structural Safety Generalization Problem · No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms · Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs · STACK: Adversarial Attacks on LLM Safeguard Pipelines · Adversarial Manipulation of Reasoning Models using Internal Representations · Discovering Forbidden Topics in Language Models · RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? · Jailbreak Transferability Emerges from Shared Representations · Mitigating Many-Shot Jailbreaking · Active Attacks: Red-teaming LLMs via Adaptive Environments · LLM Robustness Leaderboard v1 --Technical report · Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach · It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics · Discovering Undesired Rare Behaviors via Model Diff Amplification · REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective · Adversarial Attacks on Robotic Vision Language Action Models · MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models · Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models · Will alignment-faking Claude accept a deal to reveal its misalignment? · Petri: An open-source auditing tool to accelerate AI safety research · ‘For Argument’s Sake, Show Me How to Harm Myself!’: Jailbreaking LLMs in Suicide and Self-Harm Contexts · Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models · Uncovering Gaps in How Humans and LLMs Interpret Subjective Language · RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents · MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents · Trading Inference-Time Compute for Adversarial Robustness · Research directions Open Phil wants to fund in technical AI safety · When does Claude sabotage code? An Agentic Misalignment follow-up · Can a Neural Network that only Memorizes the Dataset be Undetectably Backdoored? · Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure · ToolTweak: An Attack on Tool Selection in LLM-based Agents · Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents · Petri: An open-source auditing tool to accelerate AI safety research · Quantifying the Unruly: A Scoring System for Jailbreak Tactics · Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives · Transferable Adversarial Attacks on Black-Box Vision-Language Models · Advancing Gemini’s security safeguards
Other evals
A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Theory of change: By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.
General approach: behavioural · Target case: average
See also: other more specific sections on evals
Some names: Richard Ren, Mantas Mazeika, Andrés Corrada-Emmanuel, Ariba Khan, Stephen Casper
Critiques: The Unreliability of Evaluating Cultural Alignment in LLMs, The Leaderboard Illusion
Funded by: Lab funders (OpenAI), Open Philanthropy (which funds CAIS, the organization for the MASK benchmark), academic institutions. N/A (as a discrete amount). This work is part of the “tens of millions” budgets for broader evaluation and red-teaming efforts at labs and independent organizations.
Estimated FTEs: 20-50
Some outputs: Shutdown Resistance in Large Language Models · OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety · Do LLMs Comply Differently During Tests? Is This a Hidden Variable in Safety Evaluation? And Can We Steer That? · Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue) · Syco-bench: A Benchmark for LLM Sycophancy · Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers · Lessons from a Chimp: AI “Scheming” and the Quest for Ape Language · Establishing Best Practices for Building Rigorous Agentic Benchmarks · Towards Alignment Auditing as a Numbers-Go-Up Science · Logical Consistency Between Disagreeing Experts and Its Role in AI Safety · Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence · AI Testing Should Account for Sophisticated Strategic Behaviour · Spiral-Bench · Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs · Expanding on what we missed with sycophancy · Gödel’s Therapy Room · Inspect Evals · Inspect Cyber · CyberSOCEval / CyberSecEval 4
Orgs without public outputs this year
We are not aware of public technical AI safety output with these agendas and organizations, though they are active otherwise.
Safe Superintelligence Inc. (SSI)
Conjecture: Cognitive Software
Orthogonal / QACI
modelingcooperation.com
Pr(Ai)2R
Astera Obelisk
Coordinal Research (Thibodeau)
Workshop Labs (Drago, Laine)
Graveyard (known to be inactive)
Adversarially Robust Augmentation and Distillation
Half of FAIR including JEPA
Science of Evals (but see)
Changes this time
A few major changes to the taxonomy: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”. (We did try out an automated clustering but it wasn’t very good.)
The agendas are in general less charisma-based and more about solution type.
We did a systematic Arxiv scrape on the word “alignment” (and filtered out the sequence indexing papers that fell into this pipeline). “Steerability” is one competing term used by academics.
We scraped >3000 links (arXiv, LessWrong, several alignment publication lists, blogs and conferences), conservatively filtering and pre-categorizing them with a LLM pipeline. All curated later, many more added manually.
This review has ~800 links compared to ~300 in 2024 and ~200 in 2023. We looked harder and the field has grown.
We don’t collate public funding figures.
New sections: “Labs”, “Multi-agent First”, “Better data”, “Model specs”, “character training” and “representation geometry”. ”Evals” is so massive it gets a top-level section.
Methods
Structure
We have again settled for a tree data structure for this post – but people and work can appear in multiple nodes so it’s not a dumb partition. Richer representation structures may be in the works.
The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.)
Scope
30th November 2024 – 30th November 2025 (with a few exceptions).
We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and like products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than say, underground gdoc samizdat or Discords).
We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.
Paper sources
All arXiv papers with “AI alignment”, “AI safety”, or “steerability” in the abstract or title; all papers of ~120 AI safety researchers
All Alignment Forum posts and all LW posts under “AI”
Gasteiger’s links, Paleka’s links, Lenz’s links, Zvi’s links
Ad hoc Twitter for a year, several conference pages and workshops
AI scrapes miss lots of things. We did a proper pass with a software scraper of over 3000 links collected from the above and LLM crawl of some of the pages, and then an LLM pass to pre-filter the links for relevance and pre-assign the links to agendas and areas, but this also had systematic omissions. We ended up doing a full manual pass over a conservatively LLM-pre-filtered links and re-classifying the links and papers. The code and data can be found here. We are not aware of any proper studies of “LLM laziness” but it’s known amongst power users of copilots.
For finding critiques we used LW backlinks, Google Scholar cited-by, manual search, collected links, and Ahrefs. Technical critiques are somewhat rare, though, and even then our coverage here is likely severely lacking. We generally do not include social network critiques (mostly due to scope).
Despite this effort we will not have included all relevant papers and names. The omission of a paper or a researcher is not strong negative evidence about their relevance.
Processing
Collecting links throughout the year and at project start. Skimming papers, staring at long lists.
We drafted a taxonomy of research agendas. Based on last year’s list, our expertise and the initial paper collection, though we changed the structure to accommodate shifts in the domain: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”.
At around 500 links and growing, we decided to implement simple pipelines for crawling, scraping and LLM metadata extraction, pre-filtering and pre-classification into agendas, as well as other tasks – including final formatting later. The use of LLMs was limited to one simple task at a time, and the results were closely watched and reviewed. Code and data here.
We tried getting the AI to update our taxonomy bottom-up to fit the body of papers, but the results weren’t very good. Though we are looking at some advanced options (specialized embedding or feature extraction and clustering or t-SNE etc.)
Work on the ~70 agendas was distributed among the team. We ended up making many local changes to the taxonomy, esp. splitting up and merging agendas. The taxonomy is specific to this year, and will need to be adapted in coming years.
We moved any agendas without public outputs this year to the bottom, and the inactive ones to the Graveyard. For most of them, we checked with people from the agendas for outputs or work we may have missed.
What started as a brief summary editorial grew into its own thing (6000w).
We asked 10 friends in AI safety to review the ~80 page draft. After editing, last-minute additions and formatting, we asked 50 technical AI safety people for a quick review of their domains. Coverage was not perfect; all mistakes are our own.
The field is growing at around 20% a year. There will come a time that this list isn’t sensible to do manually even with the help of LLMs (at this granularity anyway). We may come up with better alternatives than lists and posts by then, though.
Other classifications
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
Which deep orthodox subproblems could it ideally solve? (via Davidad)
The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)
“optimistic-case”: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc
pessimistic-case: if we’re in-between the above and the below
worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness
The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)
engineering: iterating over outputs
behavioural: understanding the input-output relationship
cognitive: understanding the algorithms
maths/philosophy: providing concepts for the other approaches
Acknowledgments
These people generously helped with the review by providing expert feedback, literature sources, advice, or otherwise. Any remaining mistakes remain ours.
Thanks to Neel Nanda, Owain Evans, Stephen Casper, Alex Turner, Caspar Oesterheld, Steve Byrnes, Adam Shai, Séb Krier, Vanessa Kosoy, Nora Ammann, David Lindner, John Wentworth, Vika Krakovna, Filip Sondej, JS Denain, Jan Kulveit, Mary Phuong, Linda Linsefors, Yuxi Liu, Ben Todd, Ege Erdil, Tan Zhi-Xuan, Jess Riedel, Mateusz Bagiński, Roland Pihlakas, Walter Laurito, Vojta Kovařík, David Hyland, plex, Shoshannah Tekofsky, Fin Moorhouse, Misha Yagudin, Nandi Schoots, and Nuno Sempere for helpful comments.
Appendix: Other reviews and taxonomies
aisafety.com org “cards”
nonprofits.zone
Leong and Linsefors
Coefficient Giving RFP
Peregrine Report
The Singapore Consensus on Global AI Safety Research Priorities
International AI Safety Report 2025, along with their first and second key updates
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
plex’s Review of AI safety funders
The Alignment Project
AI Awareness literature review
aisafety.com/self-study
Zach Stein-Perlman’s list
IAPS
AI Safety Camp 10 Outputs
The Road to Artificial SuperIntelligence
AE Studio field guide
AI Alignment: A Contemporary Survey
Epigram
> No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.
– Sergey Brin
> Over the decade I’ve spent working on AI safety, I’ve felt an overall trend of divergence; research partnerships starting out with a sense of a common project, then slowly drifting apart over time… eventually, two researchers are going to have some deep disagreement in matters of taste, which sends them down different paths.
> Until the spring of this year, that is… something seemed to shift, subtly at first. After I gave a talk—roughly the same talk I had been giving for the past year—I had an excited discussion about it with Scott Garrabrant. Looking back, it wasn’t so different from previous chats we had had, but the impact was different; it felt more concrete, more actionable, something that really touched my research rather than remaining hypothetical. In the subsequent weeks, discussions with my usual circle of colleagues took on a different character—somehow it seemed that, after all our diverse explorations, we had arrived at a shared space.
– Abram Demski
Brought to you by the Arb Corporation