Defining alignment research
I think that the concept of “alignment research” (and the distinction between that and “capabilities research”) is currently a fairly confused one. In this post I’ll describe some of the problems with how people typically think about these terms, and offer replacement definitions.
“Alignment” and “capabilities” are primarily properties of AIs not of AI research
The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community’s focus on general capabilities, and now it’s much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.
Insofar as the ML community has thought about alignment, it has mostly focused on aligning models’ behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans.
However, extending “alignment” and “capabilities” from properties of AIs to properties of different types of research is a fraught endeavor. It’s tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.
Firstly, in general it’s very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent over the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile, Darwin’s study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.
More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. Central examples include:
- RLHF makes language models more obedient, but also more capable of coherently carrying out tasks.
- Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. E.g. this paper finds that “LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as ‘flawless’”.
- Interpretability techniques will both allow us to inspect AI cognition and also extract more capable behavior from models (e.g. via activation steering).
- Techniques for Eliciting Latent Knowledge will plausibly be important for allowing AIs to make better use of implicit knowledge (e.g. knowledge about protein folding that’s currently hidden inside AlphaFold’s weights).
- MIRI thought that their agent foundations research could potentially advance AI capabilities, which motivated their secrecy about it.
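To make the activation-steering example concrete: the technique adds a “steering vector” (typically derived by contrasting activations on two sets of prompts) to a model’s internal activations at inference time. The sketch below is a toy illustration using random stand-in activations; no real model or interpretability library is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer layer's residual-stream activations:
# one 8-dimensional activation vector per token position.
activations = rng.normal(size=(5, 8))

# In practice a steering vector might be the difference between mean
# activations on contrasting prompt sets (e.g. "honest" vs. "dishonest").
# Here we just use a fixed unit direction as a placeholder.
steering_vector = np.zeros(8)
steering_vector[0] = 1.0

def apply_steering(acts: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to every token position's activations."""
    return acts + alpha * direction

steered = apply_steering(activations, steering_vector, alpha=3.0)

# Every position's projection onto the steering direction shifts by alpha.
shift = (steered - activations) @ steering_vector
print(np.allclose(shift, 3.0))  # True
```

The dual-use point is visible even in this toy version: the same intervention that lets researchers probe which directions encode which concepts also provides a lever for eliciting behavior the model wouldn’t otherwise produce.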
Secondly, not only is it difficult to predict the effects of any given piece of research, it’s also difficult for alignment researchers to agree on which effects are good or bad. This is because there are deep disagreements in the field about the likelihoods of different threat models. The more difficult you think the alignment problem is, the more likely you are to consider most existing research useless or actively harmful (as Yudkowsky does). By contrast, Christiano has written in defense of research developing RLHF and language model agents; and many alignment researchers who are more closely linked to mainstream ML have even broader views of what research is valuable from an alignment perspective.
Thirdly, for people concerned about AGI, most ML research should count as neither alignment nor capabilities—because it focuses on improving model performance in relatively narrow domains, in a way that is unlikely to generalize very far. I’ll call this type of research applications research. The ubiquity of applications research (e.g. across the many companies that are primarily focused on building ML products) makes some statistics that have been thrown around about the relative numbers of alignment researchers versus capabilities researchers (e.g. here, here) very misleading.
What types of research are valuable for preventing misalignment?
If we’d like to help prevent existential risk from misaligned AGI, but can’t categorize research on a case-by-case basis, we’ll need to fall back on higher-level principles about which research is beneficial. Specifically, I’ll defend two traits which I think should be our main criteria for prioritizing research from an alignment-focused perspective:
Valuable property 1: worst-case focus
Most ML research focuses on improving the average performance of models (whether on a narrow set of tasks or a broad range of tasks). By contrast, alignment researchers are primarily interested in preventing models’ worst-case misbehavior, which may arise very rarely (and primarily in situations where models expect it won’t be detected). While there’s sometimes overlap between work on improving average-case behavior and work on improving worst-case behavior, in general we should expect them to look fairly different.
We see a similar dynamic play out in cybersecurity (as highlighted by Yudkowsky’s writing on security mindset). In an ideal world, we could classify most software engineering work as cybersecurity work, because security would be built into the design by default. But in practice, creating highly secure systems requires a different skillset from regular software engineering, and typically doesn’t happen unless it’s some team’s main priority. Similarly, even if highly principled capabilities research would in theory help address alignment problems, in practice there’s a lot of pressure to trade off worst-case performance for average-case performance.
These pressures are exacerbated by the difficulty of addressing worst-case misbehavior even in current models. Its rarity makes it hard to characterize or study. Adversarial methods (like red-teaming) can find some examples, but these methods are bottlenecked by the capabilities of the adversary. A more principled approach would involve formally verifying safety properties, but formally specifying or verifying non-trivial properties would require significant research breakthroughs. The extent to which techniques for eliciting and addressing worst-case misbehavior of existing models will be helpful for more capable models is an open question.
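The bottleneck described above — adversarial search being limited by the adversary’s knowledge — can be illustrated with a toy sketch. All names and numbers here are hypothetical: the “model” misbehaves on exactly one input in a large space, random (average-case) evaluation almost never surfaces it, and only an adversary who can narrow the search space finds it.

```python
import random

# Toy "model": misbehaves on exactly one rare input in a space of a billion.
FAILURE_INPUT = 123_456

def model_misbehaves(x: int) -> bool:
    return x == FAILURE_INPUT

# Average-case evaluation: random sampling almost never hits the failure.
random.seed(0)
samples = [random.randrange(10**9) for _ in range(10_000)]
found_by_sampling = any(model_misbehaves(x) for x in samples)

# A (toy) adversary with partial knowledge searches a far narrower region.
# Real red-teaming is bottlenecked by exactly this kind of knowledge.
found_by_adversary = any(model_misbehaves(x) for x in range(123_000, 124_000))

print(f"adversary found failure: {found_by_adversary}")  # True
```

The sketch also shows why this approach doesn’t scale: the adversary only succeeds because it already roughly knows where to look, which is precisely what we lack for the worst-case misbehavior of capable models.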
Valuable property 2: scientific approach
Here’s another framing: a core barrier to aligning AGI is that we don’t understand neural networks well enough to say many meaningful things about how they function. So we should support research that helps us understand deep learning in a principled way. We can view this as a distinction between science and engineering: engineering aims primarily to make things work, science aims primarily to understand how they work. (This is related to Nate Soares’ point in this post.)
Thinking of AGI alignment as being driven by fundamental science highlights that big breakthroughs are likely to be relatively simple and easy to recognize—more like new theories of physics that make precise, powerful predictions than a big complicated codebase that we need to scrutinize line-by-line. This makes me optimistic about automating alignment research in a way that humans can verify.
However, “trying to scientifically understand deep learning” is too broad a criterion to serve as a proxy for whether research will be valuable from an alignment perspective. For example, I expect that most work on scientifically understanding optimizers will primarily be useful for designing better optimizers, rather than understanding the models that result from the optimization process. So can we be more precise about what aspect of a “scientific approach” is valuable for alignment? My contention: a good proxy is the extent to which the research focuses on understanding cognition rather than behavior—i.e. the extent to which it takes a cognitivist approach rather than a behaviorist approach.
Some background on this terminology: the distinction between behaviorism and cognitivism comes from the history of psychology. In the mid-1900s, behaviorists held that the internal mental states of humans and animals couldn’t be studied scientifically, and therefore that the only scientifically meaningful approach was to focus on describing patterns of behavior. The influential behaviorist B. F. Skinner experimented with rewarding and punishing animal behavior, which eventually led to the modern field of reinforcement learning. However, the philosophical commitments of behaviorism became increasingly untenable. In the field of ethology, which studies animal behavior, researchers like Jane Goodall and Frans de Waal uncovered sophisticated behaviors inconsistent with viewing animals as pure reinforcement learners. In linguistics, Chomsky wrote a scathing critique of Skinner’s 1957 book Verbal Behavior. Skinner characterized language as a set of stimulus-response patterns, but Chomsky argued that this couldn’t account for human generalization to a very wide range of novel sentences. Eventually, psychology moved towards a synthesis in which study of behavior was paired with study of cognition.
ML today is analogous to psychology in the 1950s. Most ML researchers are behaviorists with respect to studying AIs. They focus on how training data determines behavior, and assume that AI behavior is driven by “bundles of heuristics” except in the cases where it’s demonstrated otherwise. This makes sense on narrow tasks, where it’s possible to categorize and study different types of behavior. But when models display consistent types of behavior across many different tasks, it becomes increasingly difficult to predict that behavior without reference to the underlying cognition going on inside the models. (Indeed, this observation can be seen as the core of the alignment problem: we can’t deduce internal motivations from external behavior.)
ML researchers often shy away from studying model cognition, because the methodology involved is often less transparent and less reproducible than simply studying behavior. This is analogous to how early ethologists who studied animals “in the wild” were disparaged for using unrigorous qualitative methodologies. However, they gradually collected many examples of sophisticated behavior (including tool use, power struggles, and cultural transmission) which eventually provided much more insight than narrow, controlled experiments performed in labs.
Similarly, I expect that the study of the internal representations of neural networks will gradually accumulate more and more interesting data points, spark more and more concrete hypotheses, and eventually provide us with principles for understanding neural networks’ real-world behavior that are powerful enough to generalize even to very intelligent agents in very novel situations.
A better definition of alignment research
We can combine the two criteria above to give us a two-dimensional categorization of different types of AI research. In the table below, I give central examples of each type (with the pessimistic-case category being an intermediate step between average-case and worst-case):
| | Average-case | Pessimistic-case | Worst-case |
|---|---|---|---|
| Engineering | Scaling | RLHF | Adversarial robustness |
| Behaviorist science | Optimization science | Scalable oversight | AI control |
| Cognitivist science | Concept-based interpretability | Mechanistic interpretability | Agent foundations |
There’s obviously a lot of research that I’ve skipped over, and exactly where each subfield should be placed is inherently subjective. (For example, there’s been a lot of rigorous scientific research on adversarial examples, but in practice it seems like the best mitigations are fairly hacky and unprincipled, which is why I put adversarial robustness in the “engineering” row.) But I nevertheless think that these are valuable categories for organizing our thinking.
We could just leave it here. But in practice, I don’t think people will stop talking about “alignment research” and “capabilities research”, and I’d like to have some definition of each that doesn’t feel misguided. So, going forward, I’ll define alignment research and capabilities research as research that’s close to the bottom-right and top-left corners respectively. This defines a spectrum between them; I’d like more researchers to move towards the alignment end of the spectrum. (Though recall, as I noted above, that most ML research is neither alignment nor capabilities research, but instead applications research.)
Lastly, I expect all of this to change over time. For example, the central examples of adversarial attacks used to be cases where image models gave bad classifications. Now we also have many examples of jailbreaks which make language models ignore developers’ instructions. In the future, central examples of adversarial attacks will be ones which make models actively try to cause harmful outcomes. So I hope that eventually different research directions near the bottom-right corner of the table above will unify into a rigorous science studying artificial values. And further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds. For now, though, when I promote alignment, it means that I’m trying to advance worst-case and/or cognitivist AI research.
In the post, Richard Ngo talks about delineating “alignment research” vs. “capabilities research”, i.e. understanding which properties of technical AI research make it, in expectation, beneficial for reducing AI risk rather than harmful. He comes up with a taxonomy based on two axes:
- Cognitivist vs. behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
- Worst-case vs. average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers point in the right direction. For my part, I would like to slightly reframe his parameters and add two more axes:
- Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying that if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking about e.g. “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
- Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise for systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods that you need if you can’t afford even one failure.
- Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark at the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
- Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual-use. However, approaches that lean “right” are often sufficient to advance capabilities while being unlikely to suffice for alignment, making them clearly the worse option overall.
In earlier times, I would also have added the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately enable building agents. Today this is still true to some extent, but the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of holding back: we may need to parallelize this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
[1] There are certainly examples of research which at least tries to be robust while still failing to be very precise (e.g. some of Paul Christiano’s work falls into this category). Such research can be a good starting point for investigation, but should become precise at some stage if it is to truly produce robust solutions.
I have referred back to this post a lot since writing it. I still think it’s underrated: without understanding what we mean by “alignment research”, it’s easy to get confused in all sorts of ways about what the field is trying to do.