Epistemological Framing for AI Alignment Research


You open the Alignment Forum one day, and a new post stares at you. By sheer luck you have some time, so you actually read it. And then you ask yourself the eternal question: how does this fit with the rest of the field? If you’re like me, your best guess comes from looking at the author and some keywords: this usually links the post with one of the various “schools” of AI Alignment. These tend to be affiliated with a specific researcher or lab—there’s Paul Christiano’s kind of research, MIRI’s embedded agency, and various other approaches and agendas. Yet this is a pretty weak understanding of the place of new research.

In other fields, for example Complexity Theory, you don’t really need to know who wrote the paper. It usually shows a result from one of a few types (lower bound, completeness for a class, algorithm,...), and your basic training in the field armed you with mental tools to interpret results of this type. You know the big picture of the field (defining and separating complexity classes), and how types of results are linked with it. Chances are that the authors themselves called on these mental tools to justify the value of their research.

In the words of Thomas S. Kuhn, Complexity Theory is paradigmatic and AI Alignment isn’t. Paradigms, popularized in Kuhn’s The Structure of Scientific Revolutions, capture shared assumptions on theories, interesting problems, and evaluation of solutions. They are tremendously useful to foster normal science, the puzzle-solving activity of scientists; the paradigm carves out the puzzles. Being paradigmatic also makes it easier to distinguish what’s considered valuable for the field and what isn’t, as well as how it all fits together.

This list of benefits has logically pushed multiple people to argue that we should make AI Alignment paradigmatic.

I disagree. Or to be more accurate, I agree that we should have paradigms in the field, but I think that they should be part of a bigger epistemological structure. Indeed, a naive search for a paradigm results either in a natural-science-like paradigm that puts too little emphasis on applications and usefulness, or in a premature constraint on the problem we’re trying to solve.

This post instead proposes a framing of AI Alignment research which has a place for paradigms, but isn’t reduced to them. I start by stating this framing, along with multiple examples in each of its categories. I then go back to the two failure modes of naive paradigmatism I mentioned above. Finally, I detail how I intend to falsify the usefulness of this framing through a current project to review important AF posts.

Thanks to Joe Collman, Jérémy Perret, Evan Hubinger, Rohin Shah, Alex Turner and John S. Wentworth for feedback on this post.

The Framing

Let’s start by asking ourselves what different sorts of progress one could make in AI Alignment. I see three categories in broad strokes (I’ll give examples in a minute).

  • Defining the terms of the problem

  • Exploring these definitions

  • Solving the now well-defined problem

I expect the first and third to be quite intuitive—define the problem and solve it. On the other hand, the second might feel redundant. If we defined the problem, the only thing left is to solve it, right?

Not in a world without logical omniscience. Indeed, the definitions we’re looking for in AI Alignment are merely structures and premises; they don’t give all their consequences for free. Some work is needed to understand their implications.

Let’s get slightly less abstract, and try to state the problem of AI Alignment: “Make AIs well-behaved”. Here “AIs” and “well-behaved” are intentionally vague; they stand for “AI-related systems we will end up building” and “what we actually want them to do”, respectively. So I’m just saying that AI Alignment aims to make the AIs we build do as we wish.

What happens when we try to carve research on this abstract problem along the three categories defined above?

  • Research on the “AIs” part

    • (Defining) Clarify what “AI-related systems we will end up building” means. This basically amounts to making a paradigm for studying the AIs we will most probably build in the future.
      Note that such a paradigm is reminiscent of the ones in natural sciences, since it studies an actual physical phenomenon (the building of AIs and what they do, as it is done).
      Examples include:

      • Timelines research, like Daniel Kokotajlo’s posts

    • (Exploring) Assuming a paradigm (most probably deep learning these days), this is normal science done within this paradigm, which helps in understanding the aspects of it deemed relevant for AI Alignment.
      Examples (in the paradigm of deep learning) include:

  • Research on the “well-behaved” part

    • (Defining) Clarify what “what we actually want them to do” means. This amounts to building a paradigm that makes clear what the end-goals of alignment are. In general, I expect a global shared paradigm here too, with individual researchers championing specific properties among all the ones promoted by the paradigm.
      Note that such a paradigm is reminiscent of the ones in theoretical computer science, since it studies a philosophical abstraction in a formal or semi-formal way.
      Examples include:

    • (Exploring) Assuming a paradigm (or at least some part of the paradigm focused on a specific property), normal science done in extending and analyzing this property.
      Examples include:

      • Assuming “well-behaved” includes following considered judgement, works on exploring HCH, like these two posts.

      • Assuming “well-behaved” includes being a good embedded agent, works on exploring embedded agency, like the papers and posts referenced in the Embedded Agency sequence.

  • (Solving) Assuming a paradigm for “AIs” and a paradigm for “well-behaved”, research on actually solving the problem. This category is probably the most straightforward, as it includes most of what we intuitively expect in AI Alignment research: proposals for alignment schemes, impossibility results, critiques of schemes, and so on.
    Examples include:

    • Assuming “AIs” means “Deep Learning models for question answering” and “well-behaved” means “following HCH”, IDA is a proposed solution

    • Assuming “AIs” means “DeepRL systems” and “well-behaved” means “coherent with observed human behavior”, an impossibility result is the well-known paper on Occam’s Razor and IRL by Stuart Armstrong and Sören Mindermann.

    • Assuming “AIs” means “Embedded Agents” and “well-behaved” means “deals with logical uncertainty in a reasonable way”, logical inductors are a proposed solution.
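
To make the shape of this carving concrete, here is a minimal sketch of how one might record a contribution under the framing. It is only an illustration: the names `Part`, `Activity`, and `Contribution` are hypothetical and not part of any existing tool, and the one rule it enforces (a Solving contribution must state both of its assumptions) is my reading of the list above rather than a formal definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Part(Enum):
    """Which half of "Make AIs well-behaved" a contribution studies."""
    AIS = "AIs"
    WELL_BEHAVED = "well-behaved"


class Activity(Enum):
    """The three kinds of progress in the framing."""
    DEFINING = "Defining"
    EXPLORING = "Exploring"
    SOLVING = "Solving"


@dataclass(frozen=True)
class Contribution:
    """One contribution of a post or paper (hypothetical record type)."""
    title: str
    activity: Activity
    part: Optional[Part] = None                    # used by Defining/Exploring
    ais_assumption: Optional[str] = None           # used by Solving
    well_behaved_assumption: Optional[str] = None  # used by Solving

    def __post_init__(self) -> None:
        if self.activity is Activity.SOLVING:
            # Solving work assumes an answer for both halves of the problem.
            if not (self.ais_assumption and self.well_behaved_assumption):
                raise ValueError("Solving needs both assumptions stated")
        elif self.part is None:
            # Defining/Exploring work targets one half of the problem.
            raise ValueError("Defining/Exploring needs a target part")


# Two of the Solving examples from the list above.
ida = Contribution(
    title="IDA",
    activity=Activity.SOLVING,
    ais_assumption="Deep Learning models for question answering",
    well_behaved_assumption="following HCH",
)
logical_inductors = Contribution(
    title="Logical inductors",
    activity=Activity.SOLVING,
    ais_assumption="Embedded Agents",
    well_behaved_assumption="deals with logical uncertainty in a reasonable way",
)
```

A single post can be represented by several such records, one per contribution, so nothing in this sketch forces a post into exactly one category.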

Note that this framing points towards some of the same ideas as Rohin’s threat models (I wasn’t aware of them before Rohin’s pointer in an email). Basically, Rohin argues that a model on which to do AI Alignment research should include both a development model (what AI will look like) and a risk model (how it will fail). His issue with some previous work lies in only filling one of these models, and not both. In my framing, this amounts to requiring that work in the Solving category comes with both a model/paradigm of what “AIs” means and a model/paradigm of what “well-behaved” means. On the difference side, Rohin focuses on “what goes wrong” (his risk model), whereas I focus on “what we want”.

Going back to the framing, let’s be very clear on what I’m not saying.

I’m not saying that every post or paper falls within exactly one of these categories. The Logical Induction paper, for example, both defines a criterion for the part of “well-behaved” related to embedded logical uncertainty and provides logical inductors to show that it’s possible to satisfy it. Yet I think it’s generally easy to separate the different contributions to make clear what falls into which category. And I believe such explicit separation helps tremendously when learning the field.

I’m not saying that these categories are independent. It’s obvious that the Solving category depends on the other two; but one can also argue that there are dependencies between studying what “AIs” means and studying what “well-behaved” means. For example, inner alignment only really makes sense in a setting where AIs are models learned through some sort of local optimization process; hence this part of “well-behaved” requires a specific form of the definition of “AIs”. This isn’t really a problem, though.

I’m not saying that every post or paper falls within at least one category. Some work that we count as AI Alignment doesn’t really fall in any of my categories. The foremost example that I have in mind is John’s research on Abstraction. In a way, that is expected: this research is about a more general idea. It impacts some categories (like what “well-behaved” means), but is more of a fundamental building block. Still, pointing to the categories that this research applies to might help make it feel more relevant to AI Alignment.

I’m not saying that we need to fully solve what we mean by “AIs” and “well-behaved” before working on solutions. Of course work on solutions can already proceed quite usefully. What I’m arguing for instead is that basically any work on solutions assumes (implicitly or explicitly) some sort of partial answer to what “AIs” and “well-behaved” mean, and that by stating it out loud, the authors would help the understanding of their work within the field.

I’m not saying that this is the only reasonable and meaningful framing of AI Alignment research. Obviously, this is but one way to categorize the research. We already saw that it isn’t as clean as we might want. Nonetheless, I’m convinced that using it will help make the field clearer to current researchers and newcomers alike.

In essence, this framing serves as a lens on the field. I believe that using it systematically (as readers when interpreting a work and as authors when presenting our work) would help quite a lot, but that doesn’t mean it should be the only lens ever used.

Why not a single paradigm?

I promised in the introduction that I would explain why I believe my framing is more adequate than a single paradigm. This is because I only see two straightforward ways of compressing AI Alignment into a single paradigm: make it a paradigm about a fundamental abstraction (like agency) that, once completely understood, should make a solution obvious; or make it a paradigm about a definition of the problem (what “AIs” and “well-behaved” mean). Both come with issues that make them undesirable.

Abstraction Paradigm

Paradigms historically come from natural sciences, as perspectives or explanations of phenomena such as electricity. A paradigm provides an underlying theory about the phenomenon, expresses the well-defined questions one can ask about it, and what would count as a successful solution of these questions.

We can also find paradigms about abstractions, for example in theoretical computer science. The current paradigm about computability is captured by the Church-Turing thesis, which claims that everything that can be physically computed can be computed by a Turing Machine. The “explanation” for what computation means is the Turing Machine, and all its equivalent models. Hence studying computability within this paradigm hinges on studying what Turing Machines can compute, as well as other models equivalent to TMs or weaker. (This overlooks the sort of research done by mathematicians studying recursion theory, like Turing degrees; but as far as I know, these are of limited interest to theoretical computer scientists.)

So a paradigm makes a lot of sense when applied to the study of a phenomenon or an abstraction. Now, AI Alignment is neither; it’s instead the search for the solution of a specific problem. But natural sciences and computer science have been historically pretty good at providing tools that make solving complex problems straightforward. Why couldn’t the same be true for AI Alignment?

Let’s look at a potential candidate. An abstraction presented as the key to AI Alignment by multiple people is agency. According to this view, if we had a complete understanding of agency, we wouldn’t find the problem of aligning AI difficult anymore. Thus maybe a paradigm giving an explanation of agency, and laying out the main puzzles following from this explanation, would be a good paradigm of AI Alignment.

Despite agreeing with the value of such work, I disagree with the legitimacy of making it the sole paradigm of AI Alignment. Even if completely understanding something like agency would basically solve the problem, how long will it take (if such understanding is ever reached)? Historical examples in both natural sciences and computer science show that the original paradigm of a field usually isn’t adapted to tackle questions deemed fundamental by later paradigms. And this progress of paradigms takes decades in the best of cases, and centuries in the worst!

With the risk of short timelines, we can’t reasonably decide that this is the only basket in which to put our research eggs.

That being said, this paradigmatic approach has a place in my framing, about what “well-behaved” means. The difference is that once a paradigm is chosen, work can proceed in it while other researchers attempt to solve the problem for the current paradigm. There’s thus a back and forth between the work within the paradigm and its main application.

Problem Paradigm

If we stretch the term a bit, we can call a paradigm the set of assumptions about what “AIs” and “well-behaved” mean. Then becoming paradigmatic would mean fixing these assumptions and forcing all the work to take place within this context.

That would be great, if only we could already be sure about what assumptions to use. But in the current state of the field, a lot more work is needed (especially for the “well-behaved” part) before anyone can reasonably decide to focus all research on a single such paradigm.

This form of paradigm thus suffers from the opposite problem to the previous one: it fails to value research on the terms of the problem, just to have a well-defined setting in which to make progress. Progress towards what? Who knows…

Here too, this approach has a place in my framing. Specifically, every work in the Solving category exists within such a paradigm. The difference is that I allow multiple paradigms to coexist, along with research on the assumptions behind these paradigms, allowing a saner epistemological process.

Where do we go from here?

Multiple voices in AI Alignment push for making the field more paradigmatic. I argue that doing this naïvely isn’t what we want: it either removes the push towards application and solutions, or fixes the terms of the problem even though we are still so uncertain about them. I propose instead that we should think about research according to different parts of the statement “Make AIs well-behaved”: research on what “AIs” we’re talking about, research on what we mean by “well-behaved”, and, based on answers to the two previous questions, research on actually solving the clarified problem.

I believe I argued reasonably enough for you to not dismiss the idea immediately. Nonetheless, this post is hardly sufficient to show the value of adopting this framing at the level of the whole research community.

One way I hope to falsify this proposition is through a project to review many posts on the AF to see what makes a good review, done with Joe Collman and Jérémy Perret. We plan on trying to use this lens when doing the reviews, to see if it clarifies anything. Such an experiment thus relies on us reviewing both posts that fit the framing quite well, and ones that don’t. If you have any recommendations, I wrote a post some time ago where you can give suggestions for the review.