Depth-based supercontroller objectives, take 2

Thanks for all the helpful comments and discussion around this post about using logical depth as an objective function for a supercontroller to preserve human existence.

As this is work in progress, I was a bit muddled and stood duly corrected on a number of points. I’m writing to submit a new, clarified proposal, with some comments directed at objections.

§1. Proposed objective function

Maximize g(u), where u is a description of the universe, h is a description of humanity at the time the objective function is set (more on this later), and g is defined as:

g(u) = D(u) - D(u/h)

where D(x) is logical depth and D(x/y) is the relative logical depth of x given y.
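To make the shape of the proposal concrete, here is a minimal Python sketch of g as a scoring function. Logical depth is incomputable, so the sketch takes the depth measures as pluggable oracles; the zlib-based proxies below (time to regenerate a string from a compressed description, with or without h as a preset dictionary) are my own crude stand-ins just to make the snippet self-contained, not part of the proposal.

```python
import time
import zlib
from typing import Callable

def g(u: bytes, h: bytes,
      depth: Callable[[bytes], float],
      rel_depth: Callable[[bytes, bytes], float]) -> float:
    """Score a universe-description u against a fixed humanity-description h.

    g(u) = D(u) - D(u/h): deep universes score well, but only insofar as
    their depth can be reached cheaply when starting from h, i.e. insofar
    as it is inherited from the processes encoded in h.
    """
    return depth(u) - rel_depth(u, h)

def crude_depth_proxy(x: bytes) -> float:
    """Illustrative stand-in only: time to regenerate x from a zlib-compressed
    description. Real logical depth minimizes running time over
    near-incompressible programs; this exists only so the sketch runs."""
    blob = zlib.compress(x, 9)
    start = time.perf_counter()
    zlib.decompress(blob)
    return time.perf_counter() - start

def crude_rel_depth_proxy(x: bytes, y: bytes) -> float:
    """Same stand-in, with y supplied as a preset dictionary, so structure
    already present in y comes 'for free' when regenerating x."""
    comp = zlib.compressobj(level=9, zdict=y)
    blob = comp.compress(x) + comp.flush()
    start = time.perf_counter()
    zlib.decompressobj(zdict=y).decompress(blob)
    return time.perf_counter() - start
```

The point of the sketch is only the two-term structure of g; any real treatment would replace the proxies with principled approximations of depth, which is exactly the approximation problem I want to defer to later work.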

§2. A note on terminology

I don’t intend to annoy by saying “objective function” and “supercontroller” rather than “utility function” and “superintelligence.” Rather, I am using this alternative language deliberately, to scope the problem to a related question that is perhaps better defined and possibly easier to solve. If I understand correctly, “utility function” refers to any function, perhaps implicit, that characterizes the behavior of an agent. By “objective function”, I mean a function explicitly coded as the objective of some optimization process, or “controller”. I gather that a “superintelligence” is an agent that is better than a generic human at myriad tasks. I think this raises a ton of definitional issues, so instead I will talk about a “supercontroller”, which is just arbitrarily good at achieving its objective.

Saying that a supercontroller is arbitrarily good at achieving an objective is tricky, since it’s possible to define objectives that cannot be optimized exactly: for example, objective functions that involve incomputable quantities, like the Halting Problem. In general my sense is that computational complexity is overlooked within the “superintelligence” discourse, which is jarring for me since I come from a more traditional AI/machine learning background where computational complexity is at the heart of everything. I gather that it’s assumed that a superintelligence will have such effectively unbounded access to computational resources, due to its self-modification, that complexity is not a limiting factor. It is in that spirit that I propose an incomputable objective function here. My intention is to get past the function definition problem so that work can then proceed to questions of safe approximation and implementation.

§3. Response to general objections

Apparently this community harbors a lot of skepticism towards an easy solution to the problem of giving a supercontroller an objective function that won’t kill everybody or create a dystopia. If I am following the thread of argument correctly, much of this skepticism comes from Yudkowsky, for example here. The problem, he asserts, is that a superintelligence that does not truly understand human morality could result in a “hyperexistential catastrophe,” a fate worse than death.

Leave out just one of these values from a superintelligence, and even if you successfully include every other value, you could end up with a hyperexistential catastrophe, a fate worse than death. If there’s a superintelligence that wants everything for us that we want for ourselves, except the human values relating to controlling your own life and achieving your own goals, that’s one of the oldest dystopias in the book. (Jack Williamson’s “With Folded Hands”, in this case.)

After a long discussion of the potential dangers of a poorly written superintelligence utility function, he concludes:

In the end, the only process that reliably regenerates all the local decisions you would make given your morality, is your morality. Anything else—any attempt to substitute instrumental means for terminal ends—ends up losing purpose and requiring an infinite number of patches because the system doesn’t contain the source of the instructions you’re giving it. You shouldn’t expect to be able to compress a human morality down to a simple utility function, any more than you should expect to compress a large computer file down to 10 bits.

The astute reader will anticipate my responses to this objection. There are two.

§3.1 The first is that we can analytically separate the problem of existential catastrophe from hyperexistential catastrophe. Assuming the supercontroller is really very super, we can partition the set F of all possible objective functions into those that kill all humans and those that don’t. Let’s call the set of humanity-preserving functions E. Hyperexistentially catastrophic functions will be members of E but still undesirable. Let’s hope that either supercontrollers are impossible or that there is some non-empty subset of E that is both existentially and hyperexistentially acceptable. These functions don’t have to be utopian. You might stub your toe now and then. They just have to be alright. Let’s call this subset A.

A is a subset of E is a subset of F.
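In symbols (the set-builder glosses are mine, just restating the prose above):

\[
E = \{ f \in F : \text{optimizing } f \text{ preserves humanity} \}, \quad
A = \{ f \in E : \text{the outcome is at least alright} \}, \quad
A \subseteq E \subseteq F.
\]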

I am claiming that g is in E, and that’s a pretty good place to start if we are looking for something in A.

§3.2 The second response to Yudkowsky’s general “source code” objection—that a function that does not contain the source of the instructions given to it will require an infinite number of patches—is that the function g does contain the source of the instructions given to it. That is what the h term is for. Hence, this is not grounds to object to this function.

This is perhaps easy to miss, because the term h has been barely defined. To the extent that it has, it is a description of humanity. To be concrete, let’s imagine that it is a representation of the physical state of humanity including its biological makeup—DNA and neural architecture—as well as its cultural and technological accomplishments. Perhaps it contains the entire record of human history up until now. Who knows—we are talking about asymptotic behaviors here.

The point is—and I think you’ll agree with me if you share certain basic naturalistic assumptions about ethics—that while not explicitly coding for something like “what’s the culminating point of collective, coherent, extrapolated values?”, this description accomplishes the more modest task of including in it, somewhere, an encoding of those values as they are now. We might disagree about which things represent values and which represent noise or plain fact. But if we do a thorough job we’ll at least make sure we’ve got them all.

This is a hack, perhaps. But personally I treat the problem of machine ethics with a certain amount of urgency and so am willing to accept something less than perfect.

§4. So why depth?

I am prepared to provide a mathematical treatment of the choice of g as an objective function in another post. Since I expect it gets a little hairy in the specifics, I am trying to troubleshoot it intuitively first to raise the chance that it is worth the effort. For now, I will try to do a better job of explaining the idea in prose than I did in the last post.

§4.1 Assume that change in the universe can be modeled as a computational process, or a number of interacting processes. A process is the operation of general laws of change—modeled as a kind of universal Turing Machine—that starts with some initial set of data—the program—and then operates on that data, manipulating it over discrete moments in time. For any particular program, that process may halt—outputting some data—or it may not. Of particular interest are those programs that basically encode no information directly about what their outcome is. These are the incompressible programs.

Let’s look at the representation h. Given all of the incompressible programs P, only some of them will output h. Among these are all the incompressible programs that include h at some stage in their total computational progression, modified with something like, “At time step t, stop here and output whatever you’ve got!”. Let’s call H the set of all programs whose computational path includes h in this way. H is a subset of P.

What logical depth does is abstract over all processes that output a string. D(h) is (roughly) the minimum amount of time, over all p in H, for p to output h.

Relative logical depth goes a step further and looks at processes that start with both some incompressible program and some other potentially much more compressible string as input. So let’s look at the universe at some future point, u, and the value D(u/h).
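For reference, one standard way to make both quantities precise, roughly following Bennett’s formulation (with a significance parameter s that I suppress in the prose, and with the caveat that presentations differ in how they handle s), is:

\[
D_s(x) = \min\{\, T(p) : U(p) = x,\ p \text{ is } s\text{-incompressible} \,\},
\qquad
D_s(x/y) = \min\{\, T(p) : U(p, y) = x,\ p \text{ is } s\text{-incompressible} \,\},
\]

where U is a universal machine, T(p) is the running time of p, and a program is s-incompressible if it cannot be compressed by more than s bits. The first quantity matches the “minimum time over p in H” picture above; the second supplies y (here, h) as auxiliary input for free.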

§4.2 Just as an aside to try to get intuitions on the same page: if D(u/h) < D(h), then something has gone very wrong, because the universe is incredibly vast and humanity is a rather small part of it. Even if the only process that created the universe was something in the human imagination (!), this change to the universe would mean that we’d have lost something that the processes that created the human present had worked to create. This is bad news.

The intuition here is that as time goes forward, it would be good if the depth of the universe also went up. Time is computation. A supercontroller that tries to minimize depth will be trying to stop time and that would be very bizarre indeed.

§4.3 The intuition I’m trying to sell you on is that when we talk about caring about human existence, i.e. when trying to find a function that is in E, we are concerned with the continuation of the processes that have resulted in humanity at any particular time. A description of humanity is just the particular state, at a point in time, of one or more computational processes which are human life. Some of these processes are the processes of human valuation and the extrapolation of those values. You might agree with me that CEV is in H.

§4.4 So consider the supercontroller’s choice of two possible future timelines, Q and R. Future Q looks like taking the processes of H and removing some of the ‘stop and print here’ clauses, and letting them run for another couple thousand years, maybe accelerating them computationally. Future R looks like something very alien. The surface of earth is covered in geometric crystal formations that maximize the solar-powered production of grey goo, which is spreading throughout the galaxy at a fast rate. The difference is that the supercontroller did something different in the two timelines.

We can, for either of these timelines, pick a particular logical depth, say c, and slice the timelines at points q and r respectively such that D(q) = D(r) = c.

Recall our objective function is to maximize g(u) = D(u) - D(u/h).

Which will be higher, g(q) or g(r)?

The D(u) term is the same for each. So we are interested in maximizing the value of -D(u/h), which is the same as minimizing D(u/h), the depth relative to humanity.

By assumption, the state of the universe at r has overwritten all the work done by the processes of human life. Culture, thought, human DNA, human values, etc. have been stripped to their functional carbon and hydrogen atoms, and everything now just optimizes for paperclip manufacturing or whatever. So D(r/h) = D(r). Indeed, anywhere along timeline R where the supercontroller has decided to optimize for computational power at the expense of existing human processes, g(r) is going to be dropping closer to zero.

Compare with D(q/h). Since q is deep, we know some processes have been continuing to run. By assumption, the processes that have been running in Q are the same ones that have resulted in present-day humanity, only continued. The minimum-time way to get to q will be to pick up those processes where they left off and continue to compute them. Hence, q will be shallow relative to h: D(q/h) will be significantly lower than D(q), and so q will be favored by the objective function g.
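To put entirely made-up numbers on this (the magnitudes are arbitrary and purely illustrative): suppose the common slice depth is c = 10^20 steps, that starting from h saves essentially nothing when computing r, and that q can be reached from h in only 10^12 further steps. Then:

\[
g(r) = D(r) - D(r/h) \approx 10^{20} - 10^{20} = 0,
\qquad
g(q) = D(q) - D(q/h) \approx 10^{20} - 10^{12} \approx 10^{20},
\]

so the objective overwhelmingly prefers the timeline that continues the processes encoded in h.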

§5. But why optimize?

You may object: if the function depends on the depth measure D, which considers only the processes that produce h and q with minimal computation, maybe this will select for something inessential about humanity and mess things up. Depending on how you want to slice it, this function may fall outside the existentially preserving set E, let alone the hyperexistentially acceptable set A. Or suppose you are really interested only in the continuation of a very specific process, such as coherent extrapolated volition (CEV).

To this I propose a variation on the depth measure, D*, which I believe was also proposed by Bennett (though I have to look that up to be sure). Rather than taking the minimum computational time required to produce some representation, D* is a weighted average over the computational time it takes to produce the string. The weights can reflect something like the Kolmogorov complexity of the initial programs/processes. You can think of this as an analog of Solomonoff induction, but through time instead of space.
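One candidate formalization, and it is only my reading of “weighted average with complexity-based weights” rather than anything I can yet attribute to Bennett, is to weight each program that outputs x by two to the minus its length, in the style of the universal prior:

\[
D^*(x) = \frac{\sum_{p\,:\,U(p)=x} 2^{-\ell(p)}\, T(p)}{\sum_{p\,:\,U(p)=x} 2^{-\ell(p)}},
\qquad
D^*(x/y) = \frac{\sum_{p\,:\,U(p,y)=x} 2^{-\ell(p)}\, T(p)}{\sum_{p\,:\,U(p,y)=x} 2^{-\ell(p)}},
\]

where ℓ(p) is the length of p and T(p) its running time. Longer (more complex) programs then contribute less, which is the “Solomonoff induction through time” flavor. Whether such an average converges depends on how the weights and running times interact; that is one of the details the promised mathematical treatment would have to pin down.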

Consider the supercontroller that optimizes for g*(u) = D*(u) - D*(u/h).

Suppose your favorite ethical process, such as CEV, is in H. Then h encodes some amount of computational progress along the path towards completed CEV. By the same reasoning as above, future universes that continue from h on the computational path of CEV will be favored, albeit only marginally, over futures that are insensitive to CEV.

This is perhaps not enough consolation to those very invested in CEV, but it is something. The processes of humanity continue to exist, CEV among them. I maintain that this is pretty good, i.e. that g* is in A.