“Dirty concepts” in AI alignment discourses, and some guesses for how to deal with them

Meta: This is a short summary & discussion post of a talk on the same topic by Javier Gomez-Lavin, which he gave as part of the PIBBSS speaker series. The speaker series features researchers from both AI Alignment and adjacent fields studying intelligent behavior in some shape or form. The goal is to create a space where we can explore the connections between the work of these scholars and questions in AI Alignment.


This post doesn’t provide a comprehensive summary of the ideas discussed in the talk, but instead focuses on exploring some possible connections to AI Alignment. For a longer version of Gomez-Lavin’s ideas, you can check out a talk here.

“Dirty concepts” in the Cognitive Sciences

Gomez-Lavin argues that cognitive scientists engage in a form of “philosophical laundering,” wherein they fold, often implicitly, philosophically loaded concepts (such as volition, agency, etc.) into their concept of “working memory.”

He refers to such philosophically laundered concepts as “dirty concepts” insofar as they conceal potentially problematic assumptions being made. For instance, if we implicitly assume that working memory requires volition, we have stretched our conception of working memory to include all of cognition. But if we do this, the concept of working memory loses much of its explanatory power as one mechanism among others underlying cognition as a whole.

Often, he claims, cognitive science papers will employ such dirty concepts in the abstract and introduction but will identify a much more specific phenomenon being measured in the methods and results sections.

What to do about it? Gomez-Lavin’s suggestion in the case of CogSci

The pessimistic response (and some have suggested this) would be to quit using any of these dirty concepts (e.g. agency) altogether. However, it appears that this would amount to throwing the baby out with the bathwater.

To help remedy the problem of dirty concepts in working memory literature, Gomez-Lavin proposes creating an ontology of the various operational definitions of working memory employed in cognitive science by mining a wide range of research articles. The idea is that, instead of insisting that working memory be operationally defined in a single way, we ought to embrace the multiplicity of meanings associated with the term by keeping track of them more explicitly.
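To make the ontology idea a little more concrete, here is a minimal sketch of what such a record might look like as a data structure. It is only an illustration under our own assumptions: the class `OperationalDefinition`, the helpers `build_ontology` and `distinct_measures`, and the example entries are all hypothetical and are not taken from Gomez-Lavin’s talk or from any real paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalDefinition:
    """One paper's working definition of an umbrella term (hypothetical schema)."""
    term: str     # the umbrella concept, e.g. "working memory"
    source: str   # placeholder identifier for the paper
    measure: str  # what the methods section actually measures


def build_ontology(definitions):
    """Group operational definitions by the umbrella term they are filed under."""
    ontology = defaultdict(list)
    for definition in definitions:
        ontology[definition.term].append(definition)
    return dict(ontology)


def distinct_measures(ontology, term):
    """Return the sorted distinct measures recorded for a term (empty if unknown)."""
    return sorted({d.measure for d in ontology.get(term, [])})


# Placeholder entries for illustration only; not drawn from any real paper.
EXAMPLE_DEFINITIONS = [
    OperationalDefinition("working memory", "paper-A", "digit span"),
    OperationalDefinition("working memory", "paper-B", "n-back accuracy"),
    OperationalDefinition("working memory", "paper-C", "digit span"),
]

ontology = build_ontology(EXAMPLE_DEFINITIONS)
print(distinct_measures(ontology, "working memory"))
```

The point of the sketch is simply that one term maps to many explicitly tracked definitions, rather than being forced into a single one. A real version would need a much richer schema than a single `measure` string.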

He refers to this general approach as “productive pessimism.” It is pessimistic insofar as it starts from the assumption that dirty concepts are being problematically employed, but it is productive insofar as it attempts to work with this trend rather than fight against it.

While it is tricky to reason with these fuzzy concepts, once we are rigorous about proposing working definitions /​ operationalizations of these terms as we use them, we can avoid some of the main pitfalls and improve our definitions over time.

Relevance to AI alignment?

It seems fairly straightforward that AI alignment discourse, too, suffers from dirty concepts.

If this is the case (and we think it is), a similar problem diagnosis (e.g. how dirty concepts can hamper research/​intellectual progress) and treatment (e.g. ontology mapping) may apply.

A central example here is the notion of “agency”. Alignment researchers often speak of AI systems as agents, yet several entangled meanings are often in play when they do so. High-level descriptions of AI x-risk often rely on this ambiguity: it lets them speak about the problem in general, but at the cost of imprecise terms. This is analogous to how cognitive scientists will often describe working memory in general terms in the abstract of their papers and operationalize the term only in the methods and results sections. As such, general descriptions of AI x-risk that refer to AI systems as agents are often an instance of the use of dirty concepts and philosophical laundering.

A different but related problem arises when the invocation of AI systems as agents (implicitly) refers to different interpretations of the concept. Sometimes the intended concept of agency is simply the one operationally defined in Reinforcement Learning; other times, we might intend the concept of agency as it is used in biology and evolutionary theory (see e.g. this overview of notions of agency used in biology); yet other times, we might intend the concept of agency found in the philosophy of mind, cognitive science, and /​ or psychology. (The latter two cases are additionally problematic because the intended concepts, i.e., the biological or cognitive science conceptions of agency, might themselves be cases of dirty concepts.)

Consequently, if Gomez-Lavin’s suggestion for dealing with dirty concepts is promising, AI x-risk and alignment research could benefit from mapping an ontology of the various operational definitions of agency employed in the AI x-risk and alignment literature.

Below, we have started (and partially left as an exercise to the reader) compiling an incomplete list of “dirty concepts” often used in AI alignment discourse. At the very least, it is helpful to be aware when one is dealing with a dirty concept. At best, some folks will pick up the idea of creating an ontology mapping for (some of) these concepts.

  • Values, as well as related notions such as: goals, intentions, preferences, desires, …

  • Optimization

  • Awareness, self-awareness, situational-awareness [we don’t mean to imply those concepts are the same]

  • Planning

  • Deception

  • Alignment

  • Autonomy

  • “The AI system” /​ “the model” /​ “the simulation” /​ “the (LLM) simulacra” (/​ etc.)

  • Knowledge /​ Knowing

  • Attention

  • Memory