Conceptual Analysis for AI Alignment

TL;DR: Conceptual analysis is highly relevant for AI alignment, and is also a way in which someone with fewer technical skills can contribute to alignment research. This suggests there should be at least one person working full-time on reviewing the existing philosophy literature for relevant insights, and summarizing and synthesizing these results for the safety community.

There are certain “primitive concepts” that we are able to express in mathematics, and it is relatively straightforward to program AIs to deal with those things. Naively, alignment requires understanding *all* morally significant human concepts, which seems daunting. However, the “argument from corrigibility” suggests that there may be small sets of human concepts which, if properly understood, are sufficient for “benignment”. We should seek to identify what these concepts are, and make a best effort to perform thorough and reductive conceptual analyses on them. But we should also look at what has already been done!

On the coherence of human concepts

For human concepts which *haven’t* been formalized, it’s unclear whether there is a simple “coherent core” to the concept. Careful analysis may also reveal that there are several coherent concepts worth distinguishing, e.g. cardinal vs. ordinal numbers. If we find there is a coherent core, we can attempt to build algorithms around it.

If there isn’t a simple coherent core, there may be a more complex one, or it may be that the concept just isn’t coherent (i.e. that it’s the product of a confused way of thinking). Either way, in the near term we’d probably have to use machine learning if we wanted to include these concepts in our AI’s lexicon.

A serious attempt at conceptual analysis could help us decide whether we should attempt to learn or formalize a concept.

Concretely, I imagine a project around this with the following stages (each yielding at least one publication):

1) A “brainstormy” document which attempts to enumerate all the concepts that are relevant to safety and presents the arguments for their specific relevance and relation to other relevant concepts. This should also specifically indicate how a combination of concepts, if rigorously analyzed, could be sufficient for “benignment”, along the lines of the argument from corrigibility. Besides corrigibility, two examples that jump to mind are “reduced impact” (or “side effects”) and interpretability.

2) A deep dive into the relevant literature (I imagine mostly in analytic philosophy) on each of these concepts (or sets of concepts). These deep dives should summarize the state of research on these problems in the relevant fields, and potentially inspire safety researchers, or at least help them frame their work for these audiences and find potential collaborators within these fields. They *might* also do some “legwork” in terms of formalizing logically rigorous notions in terms of mathematics or machine learning.

3) Attempting to transfer insights or ideas from these fields into technical AI safety or machine learning papers, if applicable.

ETA: it’s worth noting that the notion of “fairness” is currently undergoing intense conceptual analysis in the field of ML. See recent tutorials at ICML and NeurIPS, as well as work on counterfactual notions of fairness (e.g. Silvia Chiappa’s).
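
To make the fairness example concrete, here is a toy sketch of how one informal concept can split into several distinct formal ones. It compares two commonly discussed group-fairness criteria, demographic parity (equal rates of positive predictions across groups) and equal opportunity (equal true positive rates across groups). The function names and the data are made up for illustration; they are not taken from the tutorials or from Chiappa’s work, and counterfactual notions of fairness are not covered here.

```python
from typing import Sequence


def positive_rate(preds: Sequence[int]) -> float:
    """Fraction of predictions equal to 1."""
    return sum(preds) / len(preds)


def demographic_parity_gap(preds, groups) -> float:
    """Absolute difference in positive-prediction rate between groups 'a' and 'b'."""
    a = [p for p, g in zip(preds, groups) if g == "a"]
    b = [p for p, g in zip(preds, groups) if g == "b"]
    return abs(positive_rate(a) - positive_rate(b))


def true_positive_rate_gap(preds, labels, groups) -> float:
    """Absolute difference in true positive rate between groups 'a' and 'b'."""
    a = [p for p, y, g in zip(preds, labels, groups) if g == "a" and y == 1]
    b = [p for p, y, g in zip(preds, labels, groups) if g == "b" and y == 1]
    return abs(positive_rate(a) - positive_rate(b))


# Hypothetical data: the two groups have different base rates of the true label.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
preds = list(labels)  # a classifier that happens to be perfectly accurate

print(demographic_parity_gap(preds, groups))          # 0.5
print(true_positive_rate_gap(preds, labels, groups))  # 0.0
```

On this data a perfectly accurate classifier satisfies equal opportunity (gap 0.0) but violates demographic parity (gap 0.5), so the two formalizations cannot both be “the” coherent core of fairness. This is the kind of distinction a careful conceptual analysis is meant to surface.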