In the post, Richard Ngo discusses how to delineate “alignment research” from “capability research”, i.e. how to understand what properties of technical AI research make it, in expectation, beneficial for reducing AI risk rather than harmful. He proposes a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers point in the right direction. For my part, I would slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is certainly similar to “cognitivist vs. behaviorist”, but research that takes the internals of the algorithm into account can still be mostly surface-level: e.g. it might merely observe that tweaking a certain parameter in the algorithm improves performance. Ngo might object that tweaking a parameter is very different from talking about, e.g., “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface is the clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise for systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark at the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat the two sides more asymmetrically, but they still have to address both.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also have added the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, but the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of avoiding parallelizing this research as much as possible.
Finally, a relatively minor quibble: Ngo seems to place a lot of the emphasis here on understanding deep learning. I would not go that far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
[1] There are certainly examples of research which is at least trying to be robust while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but it should become precise at some stage for it to truly produce robust solutions.