Formal Metaethics and Metasemantics for AI Alignment

June Ku, 8 Oct 2019 5:37 UTC

22 points

A Brief Introduction to MetaEthical.AI

tl;dr: AIXI for Friendliness

[Crossposted from my new blog.]

Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into “merely” engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.

Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain’s mental content by selecting a decision algorithm which is i) isomorphic to the brain’s causal processes and ii) best compresses its behavior while iii) maximizing charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence and then in logical and causal structural combinations of such content.

The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors that we seek to determine when we decide what to value or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. All together, this allows us to imbue the AI with the necessary concepts to determine and do what we should program it to do.
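As a rough illustration of the “best compresses its behavior” criterion above: exact Kolmogorov complexity is uncomputable, so any runnable version needs a finite stand-in. The Python sketch below is not the project's setlX code. It assumes, purely for illustration, that a candidate algorithm is summarized by a byte-string description plus the behavior it predicts, and it uses zlib's compressed length as a crude upper-bound proxy for description length. The names `code_length` and `two_part_cost` are hypothetical.

```python
import zlib


def code_length(data: bytes) -> int:
    """Bytes zlib needs to encode `data`.

    A computable upper-bound proxy only; it is not Kolmogorov complexity.
    """
    return len(zlib.compress(data, 9))


def two_part_cost(model_description: bytes, observed: bytes, predicted: bytes) -> int:
    """Two-part code length: the model itself plus the corrections to its predictions.

    The residual is the byte-wise XOR of observed and predicted behavior, so a
    perfect prediction leaves an all-zero residual that compresses to almost nothing.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted traces must have equal length")
    residual = bytes(o ^ p for o, p in zip(observed, predicted))
    return code_length(model_description) + code_length(residual)
```

Under this proxy, a candidate whose predictions match the observed trace costs little beyond its own description, while a candidate that predicts poorly pays roughly the full cost of the trace. The isomorphism and charity criteria are not modeled here.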

See the open source code and commentary at www.metaethical.ai

Motivation

Researchers like Eliezer Yudkowsky and Nick Bostrom, among others, have argued for the urgent need to develop a rigorous framework for ensuring that smarter-than-human intelligence will be beneficial for humanity. I think of them and the community around their respective nonprofits, the Machine Intelligence Research Institute and the Oxford Future of Humanity Institute, as tending to share a cluster of views, including the following:

We should have a wide probability distribution over when human-level AI will be developed. If we go by expert predictions, most predict it will arrive within this century. A cautious approach should prepare for even more rapid timelines.
Once AI that is at least as intelligent as its human creators has been developed, there is a positive feedback loop in which it can take over the task of improving its own intelligence, quickly resulting in a superintelligence vastly greater than that of humans.
A silicon-based intelligence will have many natural advantages that would further compound this process, e.g. ease of faithful replication, readily available additional hardware, an already million-fold advantage in serial computational speed over biological neurons, and the long-term exponential trend of Moore’s law, along with realistic plans and models for its continuation.
Virtually any sufficiently advanced intelligence will converge upon certain instrumental goals to persist and acquire more resources and power, if only to better serve whatever intrinsic goals it may have.
There is no automatic guarantee that greater intelligence coincides with better ethics. There is also tremendous economic incentive to develop ever smarter AI but not necessarily to make it safer or beneficial in the long run. If anything, each private party’s incentive may be to cut corners on safety to get to market quicker.
Many naive approaches to aligning AI with our values fail. Human values have a great deal of hidden complexity and missing just one dimension can lead to very undesirable outcomes. Therefore, a metaethical approach seems to be more promising than hoping to capture all ethical principles at the object level.

My own inquiry into metaethics began long before these ideas were written. In fact, I had even reached a point where I felt I could explain my reductionist metaethics to other philosophers. But having followed the development of the AI safety literature with great interest, I felt a renewed sense of purpose and urgency. It seemed we would need not only to solve perennial philosophical problems but to do so with sufficient precision to make a computer understand them. What is more, it looked like we were in a race to accomplish it all before the arguably exponential advancement in AI crossed some unknown threshold.

Having mentioned all this, I will not be arguing for the above claims here. And while this forms my primary motivation, I actually don’t think agreement with any of them is necessary to appreciate the metaethics and metasemantics I develop here. I have spent enough time in academic philosophy to appreciate such theories in the theoretical spirit in which they have often been developed. Formulating them in code as I have done could be seen as just a notational variant to the more conventional expression of certain rigorous philosophical theories in mathematical logic. Doing so helps us avoid misleading vagueness and ambiguity and ensures maximal precision in our thinking and communicating, all of which can be appreciated without regard to any practical applications.

Still, I hope many of you have already or will soon come to appreciate some of the backdrop of this MIRI–FHI cluster of views. It’s said that necessity is the mother of invention. It has certainly driven me to be more ambitious and aim for higher precision than I thought possible in philosophy. To have any hope of success, I realized I would need to delve into mathematics and computer science and bridge the conceptual divide. In doing so, I was excited to discover new inspiration and draw connections I doubt I would have made otherwise. And taking on an engineering mindset, I found myself pruning search trees that initially sounded appealing but turned out not to be fruitful, while finding new appreciation for theories that shed new light and enabled further technical progress.

While many areas of philosophy can benefit from a more technical mindset, I think conversely many in computer science or other mathematical fields may be too eager to apply whatever technical tools they currently have at their disposal without pausing to ponder whether a problem is still at a philosophical stage in which important conceptual advancements must first take place. Perhaps these advancements have even been made already in academic philosophy, but the technically minded are not aware of them, while the philosophers in turn are not aware of how to formalize them.

What follows is a mixture of original contributions to philosophical problems, some standard or not-so-standard components borrowed from across computer science and philosophy, and novel ways of weaving them all together. Throughout it all, I have tried my best to balance faithfulness to the subtleties of philosophical reality, the rigor of formalizing these theories, the urgency of making and communicating this progress, and the practicalities of engineering an initial prototype of a wildly ambitious project.

While I don’t necessarily trust our civilization to get philosophy right, I think it is quite good at making progress on well-defined technical problems. I hope I have largely succeeded in turning the philosophical problems of getting an AI to understand and share our values into an engineering problem — and hopefully one we can solve in time.

An Apology

In an ideal world, I would have accomplished the above while explaining and justifying each philosophical step up to the standards of contemporary analytic philosophy and relating them to the current and historical literature. Moreover, on the technical side, the mathematical formulas would be written and typeset in beautiful LaTeX with ample diagrams and gentle tutorials.

Or you know, I could have at least written it in English. Instead, I chose to write it in a little-known programming language called setlX (although I’ve since interspersed the code with considerable philosophical comments). My choice at the time, and perhaps even now, was between two options. On the one hand, I could struggle to write math with limited experience and institutional support. On the other hand, I could leverage my long experience and intuition with programming to write essentially the same content in a language with clear semantics in set theory, the classic lingua franca of mathematics. On top of that, I’d have a compiler to check for bugs and an interactive console serving as a concrete interface by which to manipulate very abstract objects and functions.

In further defense of setlX, I find it to be a very elegant and powerful language. Its relatively few primitives are sufficient to concisely construct complex data and algorithms while being small enough to pick up fairly quickly if you have some experience with programming, mathematics or logic. Not surprisingly, writing in it feels like you’re programming close to the mathematical essence.

Despite its initial unfamiliarity, I hope you will give setlX, my code and my commentary a chance. Even if you are not technically inclined, I expect that with a little patience, a synopsis can be gleaned from the comments — the most important of which I’ve gathered into a Key Concepts outline beside the code. Other than that, I have not imposed much of an ordering but tried to enable a more free exploration by hyperlinking procedure calls to that procedure’s definition, which often has at least a short explanatory comment.

Where I have left important explanations and justifications sparse, I’ve tried to include links to those of others who have likely done a better job than I would have. I wish I could have done more but I have mainly been optimizing for solving the problem rather than communicating the solution.


Mitchell_Porter 9 Oct 2019 23:16 UTC
5 points
This looks like important work. Like Gordon, upon closer examination, I do expect to find functions in your code that are tasked with carrying out computations that we don’t know how to do, or which may even be infeasible in their present form, e.g. “map each brain to its rational self’s values”. Great concept, but how many future scientific breakthroughs will we need before we know how to do that?
Nonetheless, even a schema for friendly AI has great value. It’s certainly progress beyond 2006. :-)
- June Ku 10 Oct 2019 2:14 UTC
  3 points
  If you or anyone else could point to a specific function in my code that we don’t know how to compute, I’d be very interested to hear that. The only place that I know of that is uncomputable is in calculating Kolmogorov complexity, but that could be replaced by some finite approximation. The rest should be computable, though its complexity may be super-duper exponentially exponential.
  In the early stages, I would often find, as you expect, components that I thought would be fairly straightforward to define technically, only to realize upon digging in that they were not so clear and required more philosophical progress. Over time, these dwindled into technical details rather than philosophical gaps, until I no longer found even technical gaps.
  Then I started writing automated tests and uncovered more bugs, though for the most part these were pretty minor, where I think a sympathetic programmer could probably work out what was meant to be done. I think around 42% of the procedures defined now have an automated test. Admittedly, these are generally the easier functions and simpler test cases. It turns out that writing code intended for an infinitely powerful computer doesn’t exactly lend itself to being tested on current machines. (Having a proper testing framework, however, with the ability to stub and mock objects might help considerably.)
  There are likely still many bugs in the untested parts but I would expect them to be fairly minor. Still, I’m only one person so I’d love to have more eyes on it. I also like the schema idea and have often thought of my work as a scaffold. Even if you disagree with one component, you might be able to just slot in a different philosophical theory. Perhaps you could even replace every component but still retain something of the flavor of my theory! I just hope it’s more like replacing Newtonian mechanics than phlogiston.
  - Mitchell_Porter 10 Oct 2019 6:54 UTC
    3 points
    “If you or anyone else could point to a specific function in my code that we don’t know how to compute, I’d be very interested to hear that.”
    From the comments in main():
    “Given a set of brain models, associate them with the decision algorithms they implement.”
    “Then map each brain to its rational self’s values (understood extensionally i.e. cashing out the meaning of their mental concepts in terms of the world events they refer to).”
    Are you assuming that you have whole brain emulations of a few mature human beings? And then the “decision algorithms” and “rational… values” are defined in terms of how those emulations respond to various sequences of inputs?
    - June Ku 10 Oct 2019 8:46 UTC
      3 points
      Yeah, more or less. In the abstract, I “suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.” I’ve tended to imagine this as an oracle that just has a causal model of the actual world and the brains in it. But whole brain emulations would likely also suffice.
      In the code, the causal models of the world and brains in it would be passed as parameters to the metaethical_ai_u function in main. The world w and each element of the set bs would be an instance of the causal_markov_model class.
      Each brain gets associated with an instance of the decision_algorithm class by calling the class function implemented_by. A decision algorithm models the brain in higher level concepts like credences and preferences as opposed to bare causal states. And yeah, in determining both the decision algorithm implemented by a brain and its rational values, we look at their responses to all possible inputs.
      For implementation, we aim for isomorphic, coherent, instrumentally rational and parsimonious explanations. For rational values, we aggregate the values of possible continuations weighting more heavily those that better satisfied the agent’s own higher-order decision criteria without introducing too much unrelated distortion of values.
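The aggregation step described in the last paragraph above can be pictured with a small Python sketch. It is not the project's setlX code and does not reproduce `metaethical_ai_u` or the `decision_algorithm` class. The `Continuation` type, its fields, and the particular weighting formula (criteria satisfaction divided by a distortion penalty) are all assumptions made for illustration. The only part taken from the description is the general shape: a weighted aggregate over possible continuations that favors those better satisfying the agent's own higher-order criteria with less unrelated distortion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Continuation:
    """Hypothetical summary of one possible continuation of an agent."""
    values: dict[str, float]      # utility the continuation assigns to each outcome
    criteria_satisfaction: float  # how well it met the agent's own higher-order criteria (>= 0)
    distortion: float             # amount of unrelated value drift it introduced (>= 0)


def rational_values(
    continuations: list[Continuation], distortion_penalty: float = 1.0
) -> dict[str, float]:
    """Weighted average of continuation values (assumed weighting, for illustration).

    Weight grows with criteria satisfaction and shrinks with distortion.
    """
    weights = [
        c.criteria_satisfaction / (1.0 + distortion_penalty * c.distortion)
        for c in continuations
    ]
    total = sum(weights)
    if total <= 0:
        raise ValueError("need at least one continuation with positive weight")
    outcomes = set().union(*(c.values for c in continuations))
    return {
        outcome: sum(w * c.values.get(outcome, 0.0) for w, c in zip(weights, continuations)) / total
        for outcome in outcomes
    }
```

A continuation that scores three times higher on the agent's criteria, with equal distortion, contributes three times as much to the aggregate. Raising `distortion_penalty` shifts weight toward continuations whose values drifted less.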
Gordon Seidoh Worley 8 Oct 2019 23:44 UTC
3 points
I like the intention and spirit of this.
“Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into ‘merely’ engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.”
I think that, just like with AIXI, these sorts of assumptions mean this approach is practically unworkable, though it’s possible that, like AIXI, it can serve as a model of the ideal. I realize you have posted your code interspersed with comments, and I look forward to seeing more about this in the future as you develop your explanation of it (right now I lack a specific enough model of it to evaluate it beyond liking the intent).
- June Ku 24 Nov 2019 2:25 UTC
  1 point
  I now have a much more readable explanation of my code. I’d be interested to hear your thoughts on it.