Introduction

My PhD thesis probably wins the prize of weirdest ever defended at my research lab. Not only was it a work of theory of distributed computing in a formal methods lab, but it didn’t even conform to what theory of distributed computing is supposed to look like. With the exception of one paper (interestingly, the only one accepted at the most prestigious conference in the field), none of my published works proposed new algorithms, impossibility results, complexity lower bounds, or even the most popular paper material: a brand new model of distributed computing to crowd the literature even further.

Instead, I looked at a specific formalism introduced years before, and how it abstracted the more familiar models used by most researchers. It had been introduced as such an abstraction, but the actual formal connection was missing. What I did was put it on a more formal footing, and also explain what was lost in abstraction.

It’s not that it wasn’t interesting. My advisors were curious if not impressed. I published multiple papers. But it was quite clear that no one really knew what to do with my work, be it reviewers, colleagues, advisors, or even me.

We didn’t know I was doing deconfusion.

You might have heard of it: Nate Soares, MIRI’s executive director, coined the term to describe the sort of work MIRI was and is doing; things like Functional Decision Theory and Logical Induction. He described it as:

something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”

In hindsight, this quote captures both my graduate work and my current research in AI Alignment. Yet I failed to realize this despite reading the post multiple times over the last few years. Deconfusion sounded either like fundamental research or like distillation: two subjects I was interested in, but which didn’t quite fit my interests and my intuitions for the thing I was aiming at.

Similarly, I see many people use deconfusion to refer to those two topics. Fundamental research because deconfusion is used to point at what MIRI is doing, and MIRI has traditionally focused on deconfusion related to fundamental questions of agency and rationality. Distillation because it involves dissolving one form of confusion: the confusion coming from inferential distance.

So anyone who wants to do actual new research and doesn’t subscribe to a vision of alignment requiring a complete formalization has nothing to do with deconfusion, right?

I disagree. I see deconfusion as a type of research that isn’t limited to fundamental research or distillation, but also plays a fundamental role in prosaic alignment approaches and more concrete research. Moreover, deconfusion as I see it seems sorely needed in the field right now to deal with the lack of paradigm and the difficulty of stating scenarios, risks, problems and solutions clearly.

I thus propose my take on deconfusion, which offers a grounding to clarify how deconfusion can be useful. I then present an analogy between my proposition of deconfusion and programming, address why I’m not using the name of conceptual engineering instead, and finally spell out some consequences of this perspective for the field and researchers.

Note that even if I believe this is a valuable way of viewing deconfusion, it hasn’t paid enough rent yet to be taken as a definition of deconfusion per se. But I am hopeful that the work I’m currently doing will provide evidence for it.

Special thanks to Nate Soares for a great discussion on exactly what he meant in his original blog post; to John S. Wentworth for multiple discussions on the nature of deconfusion, and for pushing me to go deeper; to Abram Demski for rightly asking for more details on confusion and language building, as well as for suggesting the great analogy to programming and refactoring; and to Connor Leahy for a valuable conversation and for helping me present the application part in a non-confusing way. Thanks to Alexis Carlier, Edouard Harris and Alex Turner for feedback on the ideas.

My take on deconfusion

Here is how I think about deconfusion:

the process of dissolving confusion, by reducing the object of thought to less confused and better understood ideas, with an application in mind.

In and of itself, this doesn’t help much; it’s just a nice sounding sentence. But by going into more detail about each part of it, I hope to convince you that it already clarifies quite a lot about deconfusion.

Always an application

In my take, there is the controversial word application. Coming from a theoretical field, I know how many researchers fear this word. Not that they’re against applications in and of themselves, but they are tired of having to endlessly “justify” theory research through ever more far-fetched appeals to the fashionable application (for example, my thesis funding application mentioned IOT…). Indeed, some early readers of this work complained that some deconfusion doesn’t have an application.

The answer is simply that everything can count as an application here. I’m not adding a constraint on deconfusion, I’m making explicit an input parameter: one deconfuses for a reason. Maybe this reason is curiosity, the advancement of knowledge, or a proposed solution to AI Alignment that requires a component one is confused about. The nature of the application doesn’t matter for the definition: only the fact that there is an application.

This application, in turn, plays a fundamental role in deconfusion. The goal is not to find the essence of the concept investigated, but to find the version of the intuition that actually helps with the application. The application thus provides a direction for deconfusion.

What is confusion?

For deconfusion to be of use, one must be confused. But what exactly does it mean to be confused? Maybe it amounts to the lack of a mathematical model. But that’s clearly too strong: we don’t feel confused about many concepts and ideas for which we don’t have this level of understanding.

No, a better starting point is Nate’s pithy statement of deconfusion. More specifically, the part about “continuously accidentally spouting nonsense”. The spouting nonsense is pretty straightforward: when confused, we tend to say incoherent things, or things that on reflection don’t make much sense. Even more interesting is the “accidentally” part. Confusion doesn’t usually declare itself: often, we only vaguely realize the issue. There is this nagging feeling inside, but not a crisp statement of the confusion. After all, such a crisp statement would already be part of deconfusion!

One tool of deconfusion that I’ll introduce in the next section is the extensive definition: a definition as a list of examples. Even if it only partially constrains the concept, it can be pretty useful for getting a grip on it. In that vein, here are some salient examples of confusion:

(Incoherence) We say one thing at some time, and then the opposite later. Competing intuitions fight it out and win alternately, which results in a mess.
(Too many threads) We have numerous intuitions about the concept at hand, which are so varied that it’s not clear how to pull them all together into any kind of coherent core.
(Self Motte-and-bailey) The Motte-and-bailey fallacy is a form of argument in which one changes the underlying meaning of a previous statement (usually controversial and bold) to fend off criticism. Basically, one keeps using the same word or description but jumps between multiple meanings.
This form of confusion is about doing that to oneself. We change the underlying meaning of the concept when thinking of different points, without necessarily noticing.
(Underdeterminacy) We have some intuitions, but they only cover a tiny fraction of the cases the concept is supposed to apply to.
(Cruxless Disagreement) Whenever disagreeing on the concept with someone, we never manage to reach a satisfying crux.
(Language Inadequacy) The very way we’re trying to express the concept seems inadequate. The words we use don’t feel like they cut the world correctly to express what we mean. And it’s not just a feeling of missing one word, but of not being able to frame a whole way of thinking.

All of these are examples of confusion that I think deconfusion can address. On the other hand, there is at least one colloquial use of the word “confusion” that I expect to be mostly orthogonal to deconfusion: the one related to logical non-omniscience and inferential distance. Even if I get a full textbook on the issue I was confused about, which answers all of my questions perfectly well, I have to take the time to study before I stop being confused. This confusion of mine, while still real, isn’t what I’m pointing at here. This is not the sort of confusion that I expect deconfusion to dissolve.

Dissolving confusion

Last but not least, I see deconfusion as… dissolving confusion (for the application). How does one dissolve confusion? The position I’m taking is reductionist: one dissolves confusion by decomposing it into less confused and better understood components. We aim to explain what we’re confused about based on what we already understand.

More concretely, this is done through building what I call conceptual tools: ways of thinking which help us reach this goal. I want to focus on three kinds of conceptual tools that seem to me very important for deconfusion: handles, languages, and translations.

Handles: fixing the concept

Handle is my name for the actual end-product of deconfusion: the explanation of the confusing concept in terms of clearer parts. Intuitively, a handle looks like a formula or a definition. It can have different levels of detail and formalism, but it tends to answer definitional questions. This also means that pure handle-building assumes a base material of less confused ideas to build from, without concerning itself with the issues of this deconfused API (we’ll see about issues in the language/API later).

Why this name? Because handles allow us to grab and manipulate the object they are attached to. Similarly, conceptual handles fix and make more precise the initial intuitions. They force our thoughts about the subject to take a crystallized, concrete form that we can then criticize and manipulate at will.

We can even get a bit more formal, and see handles as pointing to a subset of a given “concept-space”, composed of pairs of mathematical models and interpretations. Ideally we would like a pointer to a single point in concept-space (a formal definition with just the right number of degrees of freedom), but that’s often too difficult. Instead, we might get a definition based on previously deconfused concepts (a set of models in concept-space which all use the simple concepts as gears) or an extensive definition (a set of models in concept-space which all agree with the examples). And even if it feels insufficient, for some applications and some forms of confusion, such deconfusion is already enough.

Let’s get through some examples of handles, to see the variety available to the deconfusion researcher:

(Formal definition with just the right number of degrees of freedom) The holy grail of deconfusion. This is the most precise one can get. Note that the difficulty is not so much to put a formal definition out of one’s ass (anyone can do that trick with a bit of habit), it’s to argue that it actually captures the concept at hand and dissolves the confusion, and that no other formal object follows from the same argument.
(Formal definition with additional degrees of freedom) Class of formal definitions. This is what we get when some parts of the definition are still unclear, but the range of possibilities still admits a mathematical formulation.
(Definition in terms of simpler, less confused concepts) What we generally mean by a gear-level model. Although hardly enough for a computer implementation, it might suffice for some applications like clarifying AI risk.
(Extensive definition) List of examples supposed to be representative of the concept. Seems mostly valuable when the application isn’t too formal, or as a first deconfusion step.
(Computational model without parameters/degrees of freedom) The computational version of a formal definition, or the formal version of a gear-level model. It has the benefit of computability (and maybe tractability), but the potential difficulty that programs can run without being understood.
(Computational model with parameters/degrees of freedom) Class of computational models when the arguments aren’t enough to fix a single model.
(Story) Self-explanatory. It might seem weird that a story counts as deconfusion, but think about it: a story fixes at least some elements of a scenario.

To get more concrete, here are some great examples of handles from within AI alignment research:

Alex Turner’s deconfusion of power-seeking (a mathematical definition of power-seeking in MDPs, and links between optimality and power-seeking)
Paul Christiano’s What failure looks like (a story about ways alignment might fail)
John S. Wentworth’s Abstraction Sequence (a mathematical definition of abstraction)

Languages: creating new APIs

When building handles, one might realize that some basic building block is missing. If only we had a deconfusion for this, the original concept would be much easier to clarify. When this happens once or twice, it’s mostly about building other, lower-level handles. But if it becomes endemic, it turns into language building.

This language building is also a part of deconfusion. First, it fits my take, because languages and APIs are expressed in terms of lower-level languages. And second, it clarifies the building blocks for handles, a clear part of deconfusion. Recall also that one example of confusion was the suspicion that the language used for talking about the concept is inadequate.

Interestingly, I expect that for many people, this sort of language building is what deconfusion looks like. After all, this is the part of deconfusion that is most common in fundamental science: paradigms in hard sciences tend to be heavily centered around languages (called theories). Similarly, multiple well-known research products from MIRI are languages: Cartesian Frames and Finite Factored Sets are two recent examples. For a less intuitive example, I consider Risks from Learned Optimization to be deconfusion, as it produced a language by cutting the problem of alignment differently and proposing partial handles for the different parts.

I have less to say about languages because I expect this concept is already deconfused enough for my purpose here (as opposed to handles for example). It still represents an exciting subject for deconfusion, because the standard way of thinking about languages as tools doesn’t fit exactly what I’m pointing at. The closest is probably API design.

Note that distinguishing between handles and languages is not necessarily trivial. My heuristic is to check whether the end result can be used as is for the application (handles) or must be used to deconfuse something else before being used in concrete applications (languages).

Translation: linking between languages

One last category of conceptual tools which seems highly valuable for deconfusion is translation. By this I mean linking the ideas one is trying to deconfuse with another field, hopefully a more deconfused one. At one extreme of the spectrum, analogies are relatively informal translations; at the other end there are isomorphisms.

While I don’t know of an example from AI Alignment, here are some examples from computer science.

The Curry-Howard isomorphism relating natural deduction (proofs) and lambda calculus (programs)
The link between distributed computing and algebraic topology, representing all parallel configurations as complexes, and both algorithms and models as transformations of these complexes.
Shannon’s master’s thesis linking relay and switching circuits with Boolean logic.

Translation contributes to deconfusion by taking advantage of previous work to bypass language building. It’s about recognizing that another language would do the trick for handle-building, even when the connection is far from trivial.

An extended analogy: programming

Despite a sprinkle of examples and a smidge of formalism, this post has been quite dry and abstract. Which is why I now present an extended concrete analogy between my take on deconfusion, and programming. (Thanks to Abram for the suggestion)

Deconfusion is basically analogous to writing a program that we don’t already know how to write. Let’s assume this is a function, to remove the subtleties of side-effects (who needs side-effects anyway?).

First, we have a reason for wanting to write this function—the application. It might be pure curiosity, for fun, or for a work project. And this reason will change how we write the function, and what will satisfy us. For example, a function just for curiosity might be significantly less maintainable than a function for a work project.

Next we arrive at the confusion: we don’t know yet how to write this function. We also expect that it will be harder than just mixing a couple of programming recipes we already know. Most of the examples in the extensive definition of confusion have a programming analogue (the one missing is self motte-and-bailey):

(Incoherence) We keep changing our mind on what the behavior of the function should be.
(Too many threads) We know what the output should be for many different inputs, but they’re so different that the only implementation we have in mind is a case statement (which wouldn’t generalize as expected).
(Underdeterminacy) We know what the output should be only for a very small part of the input range.
(Cruxless Disagreement) When debating what would be a correct implementation of this function with other programmers, the discussion goes nowhere and there is no concrete input/output that we can point out as the crux.
(Language Inadequacy) The API and/or language that we want to use doesn’t pass the right information to us, or it cuts the problem in a way that doesn’t fit our needs.

Finally, the ways of getting to the function we want to write are analogous to the conceptual tools I presented for dissolving confusion:

(Handles) Handles are about actually writing the function, or at least constraining the specification as much as possible. Here are some concrete analogies.
- Formalization ⟺ Writing the function
- Parametric formalization ⟺ Writing a parameterized version of the function
- Extensive definition ⟺ Listing input/output requirements.
(Language) Language building is about… language building. And also refactoring of the code you already have to better fit what you need. This is pretty straightforward, since it was the starting point of the whole analogy.
Just like in the deconfusion case, handle-building can create holes that need to be patched either by handle building (carving out a subfunction) or by language building/refactoring (noticing that you need information that the API doesn’t return).
(Translation) Here the translation is between a part of your function and some API that already exists. So it’s about realizing that your function is actually a disguised version of what this library is made to build.
There’s another fun analogy for translation: compiling. This isn’t exactly what I have in mind, but compiling does involve the translation of one programming language into another (assembly or C or an intermediate representation).
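To make the handles side of the analogy concrete, here is a toy Python sketch. The function being “deconfused” and its examples are hypothetical, invented purely for illustration: an extensive definition as a list of input/output requirements, a case statement that memorizes them (the “too many threads” failure), a parameterized family narrowed down by the examples, and a single fully fixed function.

```python
# Toy illustration of the handles analogy (hypothetical function and examples).

# Extensive definition: the input/output requirements we are sure about.
EXAMPLES = [(0, 0), (1, 1), (2, 4), (3, 9)]


def satisfies(candidate, examples=EXAMPLES):
    """Return True if the candidate matches every input/output requirement."""
    return all(candidate(x) == y for x, y in examples)


def case_statement(x):
    """'Too many threads': memorize the examples, return 0 everywhere else."""
    table = dict(EXAMPLES)
    return table.get(x, 0)


def parametric(a, b):
    """Parametric formalization: the family f(x) = a*x^2 + b*x, with a, b free."""
    def f(x):
        return a * x * x + b * x
    return f


def formal(x):
    """Formalization: a single function with no remaining degrees of freedom."""
    return x * x


# The examples act as a filter on the parametric family (over a small grid).
fitting = [
    (a, b)
    for a in range(-3, 4)
    for b in range(-3, 4)
    if satisfies(parametric(a, b))
]

if __name__ == "__main__":
    print("case statement fits examples:", satisfies(case_statement))
    print("formal definition fits examples:", satisfies(formal))
    print("parameters consistent with examples:", fitting)
    print("on the new input 4:", case_statement(4), "vs", formal(4))
```

Both `case_statement` and `formal` satisfy the extensive definition, yet they disagree on the new input 4 (0 versus 16): the examples only partially constrain the function. The grid search over `a` and `b` is just the simplest way to show examples cutting a parametric family down to the single member `(1, 0)`, not a general method.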

Isn’t that just conceptual engineering?

Very quickly, I want to address the similarity between my take on deconfusion and the philosophical approach of conceptual engineering. The latter is basically about creating ideas and concepts for a purpose (hence the “engineering”), instead of looking for the essence of these ideas. This is indeed pretty close to handle-building.

Still, I don’t think conceptual engineering is the right starting point for this discussion, for the following reasons:

Although conceptual engineering is clearly analogous to handle-building, I don’t see language building and translation as part of it.
Engineering has connotations that don’t necessarily apply to all forms of deconfusion. For example, writing stories of alignment failures is a form of handle-building, but very few people consider storytelling an engineering process.
Deconfusion is already used in a sense similar to what I’m pointing at by some researchers in the Alignment community.
Conceptual engineering has already been extensively written about and debated, which means it comes with some baggage that I don’t necessarily want. For deconfusion, on the other hand, the only thing people ever point to is the original MIRI blog post.

What does this buy us?

After doing all the work of understanding this perspective on deconfusion, we can finally grab the low-hanging fruits that I mentioned at the beginning of this post.

Deconfusion isn’t limited to fundamental science and distillation

The explicit application should make clear that deconfusion in this sense can be applied to almost anything, not only fundamental science and pure theory. I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques—I still use it years later.

Because of the confusion of deconfusion with fundamental research, I expect deconfusion to have a bad rap among researchers who focus on prosaic AI and always want to backchain to local search. I hope that my proposed take on deconfusion has clarified that they too can benefit from deconfusion. Maybe it’s not what they need, and it certainly isn’t the sole ingredient of a solution, but it is one powerful approach to have in your toolbox.

Distillation, on the other hand, might not involve deconfusion at all. I already mentioned that pure inferential distance isn’t the sort of confusion I consider for deconfusion. This by default removes most of distillation from deconfusion. Still, I think that in some rare cases, the inferential distance is closed by actually providing a better deconfusion than the original one, or by deconfusing the handle itself. I’m not sure, but the tablecloth analogy for relativity looks like one plausible example.

The importance of deconfusion for AI Alignment

I claim that deconfusion is actually a pretty big bottleneck for AI Alignment, whatever the approach taken. The common thread behind the lack of paradigm, the difficulty some new entrants have in dealing with the unformalized setting, and the lack of consensus is the vast amount of deconfusion that still needs to be done.

Actually, even someone who wants to argue (honestly) for alignment by default would benefit from deconfusion, because it’s still quite unclear what would even happen by default in this view.

To give a range of examples, here is a non-exhaustive list of uses of deconfusion for various types of approaches:

Deconfusion of AI risk arguments
- Example: Alex Turner’s deconfusion of power-seeking.
Deconfusion of a class of failure for alignment
- Example: Paul Christiano’s What failure looks like
Deconfusion of a technical failure mode of alignment
- Example: Evan Hubinger et al.’s (partial) deconfusion of inner alignment in Risks from Learned Optimization
Deconfusion of an ideal to aim for in alignment
- Example: Paul Christiano’s deconfusion of enlightened judgment through HCH
Deconfusion of a building block of an alignment proposal
- Example: Universality, a property of an overseer central to many of Paul’s approaches, really needs deconfusion. To see how much confusion “knowing everything the model knows” generates, one only has to look at Daniel Filan’s recent post.
- Example: Myopia, the property that Evan hopes will prevent deception.

Checking deconfusion

All of that is well and good. But isn’t deconfusion… you know, fuzzy? Like messy and informal and the sort of thing that’s impossible to check and falsify? Actually, this perspective on deconfusion highlights that deconfusion is more falsifiable than most formal fundamental research, because it’s ultimately about making the confused bunch of intuitions more concrete and more manipulable, and because it is always done with an application in mind.

Concretely, the standards of judgment change a little between the different sorts of conceptual tools that I presented:

For handles, it’s about accuracy (Does the handle fit with the initial intuitions?) and usefulness (Does the handle help the application?). Both questions might be subject to controversy, but the object they’re interrogating is both fixed and concrete, which is a clear gain.
For languages, the question is whether they allow the building of the handles that motivated them. Then the value of these handles is judged along the lines of the previous point.
For translations, they might be the handle itself, or might be the building block of such handles. Either way, the value of the translation is evaluated based on criteria for handles.

What it means to do full time deconfusion

This consequence of my take is a bit more personal: it’s about clarifying the kind of research I’m doing. One thing that made me feel bad in AI alignment is that I don’t have an agenda, a preferred method for alignment. Instead, I work on multiple problems with multiple researchers.

Yet a link exists: deconfusion. Every work I’m doing (even this!) tends to be along the lines I present in this post. This view also addresses my anxieties about not having an approach: as a full-time deconfusion researcher, it makes sense that I’m applying deconfusion skills (hopefully) and mindset to different projects with different people, just as an applied mathematician in Shannon’s sense would.

What I really took from this is the realization that in most collaborations, I’m at the service of the other researchers. Not because of weird notions of worth or experience, but because I’m deconfusing their intuitions for their applications.

Conclusion

In summary, I’m proposing a take on deconfusion as the process of dissolving confusion, by reducing the object of thought to less confused and better understood ideas, with an application in mind.

It has the following components:

An application, which might be anything, not only the increase of knowledge.
A confusion about the topic
Conceptual tools for dissolving this confusion:
- Handles which make the concept more concrete to be investigated
- Languages which allow the building of better handles
- Translations which act as compilers between very different languages or as the deconfused concept itself.

Taking this perspective on deconfusion already bears valuable fruits for AI Alignment:

It clarifies that deconfusion isn’t limited to fundamental research on agency and rationality
It shows how open deconfusion problems abound in AI Alignment
It reassures us that deconfusion can actually be judged and analyzed in a coherent way.
It makes me feel like there’s a coherent meaning behind my research (I need every small win).

What’s left now? Well, I also think this take on deconfusion can be a great basis for investigating how to do deconfusion, what skills are most important, and how to learn them. This raises the hope of a sort of textbook of deconfusion. I’m far from this at the moment, but I’m working on it. It’s especially exciting because it might help me become better at deconfusion.

Another project I have is to collect many deconfusion open-problems in a post, to help newcomers and interested researchers do the sort of “freelance deconfusion” I’m doing myself.

Looking Deeper at Deconfusion