Metalignment: Deconfusing metaethics for AI alignment.

Epistemic status: MSFP blog post day. General and very speculative ideas.

Proposition: Deconfusing metaethics might be a promising way to increase our chances of solving AI alignment.

What do I mean by metaethics?

Metaethics here is understood as an ideal procedure that humans are approximating when they reason about ethics, i.e. when they are trying to build ethical theories. Let’s have a look at mathematics for an analogy. Part of mathematical work involves using some theory of logic to prove or disprove a conjecture about some mathematical object. Deriving theorems, lemmas and properties from axioms within some logic is, roughly, part of how mathematics progresses. Another analogy is how we learn about regularities in the world by approximating Solomonoff induction. It seems that we lack a comparable formalised, ideally rational procedure for ethical progress that would help us sort and generate ethical theories. Such a procedure seems difficult to figure out and potentially crucial for solving AI alignment.

Why could this be important?

A better understanding of metaethics could help us decide among different ethical theories and generate new ones. Furthermore, knowing what the world should become and how AI should interact with it might require us to make progress on how we should think about ethics, which would in turn inform how we think about aligning AI. For example, aligning AI with human values, learning and aggregating human preferences in some way, and avoiding X-risks are all ethical propositions about what we should do. It is plausible that these views are flawed and that a better understanding of how to think about ethics might make us reconsider these normative stances and clarify what alignment means.

The following intuition is one of the main reasons why I think a better understanding of metaethics might be important to AI alignment research. As I think more about ethics, argue with others about it and become more informed about the world, my ethical views evolve, and it seems that I am making some sort of progress by sharpening my reasons for why I hold some ethical view or why some ethical theory seems flawed. Thus I tend to value my future self’s moral views more, to the extent that he has spent more time thinking about ethics and is more informed about the world, and so I trust him more to decide how I should go about transforming it. Similarly, it might be sensible for future AI systems to be able to instantiate a similar process of moral progress and to update their utility functions or goals according to its results. If transparent and consulted by humans, such a process could figure out how to transform the world through some long and efficient ethical reflection.

Some examples

To clarify, the following non-exhaustive criteria are examples of how we might evaluate ethical theories, and of constraints under which we could generate new ones.

  • Using clean thought experiments as intuition pumps.

  • Constraining ethical theories by physics and other scientific domains such as evolution or computability. For example, maximising the number of 10-dimensional pink unicorns is probably not a very good ethical theory, as it asks us to realise values that are meaningless and incompatible with the laws of physics. Science might not tell us what to do, but it can help us know what we can’t do or can’t consider valuable.

  • Formalising ethical theories further than most existing ones, which are mostly expressed in natural language. This could make it easier for some AI to evaluate and learn ethical theories, and could yield more consistent ethical theories.

  • Favor simplicity: avoid adding unnecessary arbitrary values.

  • Favor universality: try to be as observer-independent as possible. Ideally we might want our ethical theories to be applicable not just to humans but to all sorts of other physical systems.

Possible objections

This approach to AI alignment might be too top-down in its current formulation, and it faces a number of difficult challenges and objections to being a research path worth pursuing:

  • There might be no such meaningful thing as ‘better ethical theories’ in the absolute sense but there might be some that are better for a certain class of physical systems.

  • Such a project might take too long to implement. There does not seem to be any consensus regarding a better ethical theory, although philosophers have been arguing and thinking about ethics for a long time.

  • There might not be any universally compelling argument. But we still might identify a class of arguments or ethical theories that seems more viable and use them in addition to other value learning approaches.

  • Formalising ethics might be too hard because ethics is too fuzzy. Indeed, ethics plausibly emerges from genetic and memetic evolution and mostly reflects humans trying to gain some value from cooperation with other humans.

Nevertheless, such a project might have the advantage of not speeding up AI capability research while informing us about values and how to think about alignment. One important downside, though, is that there might be other, more promising projects to pursue instead.


To conclude, I would like to suggest some possible ways of working toward a better understanding of metaethics and producing better ethical theories. These are extremely broad and vague suggestions, meant to stimulate research ideas.

  • Build an ethical oracle that could be asked questions about ethical inconsistencies, moral blind spots or axiological problems.

  • Artificial philosopher: Input philosophy papers on ethics and metaethics, and output a better understanding of metaethics and more satisfying ethical theories.

  • Expected utility maximiser: Update the utility function in accordance with the best guess about ethics, for example one derived from the artificial philosopher or the ethical oracle. This would involve an additional step of translating ethical theories into utility functions.

  • Simulate humans so that they have more time to figure out more about ethics.

  • Formalise ethics, perhaps using logic, probability and game theory.

  • Accelerate research in moral psychology.
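
To make the expected utility maximiser suggestion slightly more concrete, here is a minimal toy sketch in Python. It is only one possible reading of that bullet, not a proposal for how it should actually be done. Everything in it is an assumption made for illustration: the theory names, the actions, the numbers, and the function names (update_credences, expected_utility, best_action) are all hypothetical. It also assumes, very strongly, that each ethical theory has already been translated into a utility function and that these utilities are comparable on a common scale, which is exactly the hard step mentioned above.

```python
from typing import Callable, Dict, List

Action = str
UtilityFn = Callable[[Action], float]


def normalise(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def update_credences(credences: Dict[str, float],
                     likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Bayes-style update: posterior is proportional to prior * likelihood.

    `likelihoods` stands in for whatever the hypothetical oracle or
    artificial philosopher reports; theories it does not mention keep
    a neutral likelihood of 1.0.
    """
    return normalise({name: p * likelihoods.get(name, 1.0)
                      for name, p in credences.items()})


def expected_utility(action: Action,
                     credences: Dict[str, float],
                     theories: Dict[str, UtilityFn]) -> float:
    """Credence-weighted average of the utility each theory assigns."""
    return sum(credences[name] * theories[name](action) for name in credences)


def best_action(actions: List[Action],
                credences: Dict[str, float],
                theories: Dict[str, UtilityFn]) -> Action:
    """Pick the action with the highest credence-weighted utility."""
    return max(actions, key=lambda a: expected_utility(a, credences, theories))


# Toy data, entirely made up for illustration.
theories: Dict[str, UtilityFn] = {
    "theory_a": {"act_x": 1.0, "act_y": 0.0}.__getitem__,
    "theory_b": {"act_x": 0.0, "act_y": 1.0}.__getitem__,
}
credences = {"theory_a": 0.6, "theory_b": 0.4}
actions = ["act_x", "act_y"]

print("before reflection:", best_action(actions, credences, theories))

# Hypothetical evidence from further ethical reflection favouring theory_b.
credences = update_credences(credences, {"theory_a": 0.2, "theory_b": 0.8})
print("after reflection:", best_action(actions, credences, theories))
```

In this toy run the agent's preferred action changes once its credences shift toward theory_b, which is the kind of goal update through moral progress described earlier. The sketch says nothing about where the credences or likelihoods would come from, and that is where the real difficulty lies.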