Reduced impact AI: no back channels

A putative new idea for AI control; index here.

This post presents a further development of the reduced impact AI approach, bringing in some novel ideas and setups that allow us to accomplish more. It still isn’t a complete approach—further development is needed, which I will do when I return to the concept—but may already allow certain types of otherwise dangerous AIs to be made safe. And this time, without needing to encase them in clouds of chaotic anti-matter!

Specifically, consider the following scenario. A comet is heading towards Earth, and it is generally agreed that a collision is suboptimal for everyone involved. Human governments have come together in peace and harmony to build a giant laser on the moon—this could be used to vaporise the approaching comet, except there isn’t enough data to aim it precisely. A superintelligent AI programmed with a naive “save all humans” utility function is asked to furnish the coordinates to aim the laser. The AI is mobile and not contained in any serious way. Yet the AI furnishes the coordinates—and nothing else—and then turns itself off completely, not optimising anything else.

The rest of this post details an approach that might make that scenario possible. It is slightly complex: I haven’t found a way of making it simpler. Most of the complication comes from attempts to precisely define the needed counterfactuals. We’re trying to bring rigour to inherently un-sharp ideas, so some complexity is, alas, needed. I will try to lay out the ideas with as much clarity as possible—first the ideas to constrain the AI, then ideas as to how to get some useful work out of it anyway. Classical physics (general relativity) will be assumed throughout. As in a previous post, the approach will be illustrated by a drawing of unsurpassable elegance; the rest of the post will aim to clarify everything in the picture:

Can Super-God detect whether God existed?

If we say that Mr. Y has had a reduced impact upon the world, what do we mean by that? It’s hard to define, but here is a sufficient condition: if we come back in 50 years’ time and cannot find any evidence that Mr. Y ever existed, then he probably had a minute impact. If a very thorough analysis of the whole world and surrounding solar system, fed into the most sophisticated computer systems, fails to detect any evidence for or against the existence of Mr. Y, then he certainly had a reduced impact.

This is the heart of the approach. The idea is to examine a slice of the future, and if any effect of the AI’s existence is undetectable in that slice, then the AI had a (very) reduced impact. But who would do this “examining”? The AI might be able to hide its impact from any human examiner, and there might not even be any humans left alive to examine anything. No, the examination has to be a counterfactual examination, carried out by the AI itself or, better yet, a smarter AI. A (hypothetical) smarter version of the AI will be helping to constrain the AI’s current behaviour.

Scouring the future for traces of the AI

We will assume that the AI has an understanding of physics, and that the heart of the AI is a probability module, capable of probabilistic reasoning about physical systems and other things. This module comes in two versions: P, an idealised version with almost unlimited storage space and time with which to answer questions, and P’, the realistic, bounded module that the AI actually uses (these modules give rise to expectation operators E and E’). Crucially, the module P’ can give sensible answers to questions about what the output of the module P would be. This is a reflection principle, allowing the AI to estimate what kind of answers it would give if it had (much) more time to think about them.

This reflection principle isn’t magic: P’ can’t mysteriously compute the answer to a question by peering into P. But P’ could, for instance, estimate whether P will have lower uncertainty than P’ does, or how much better P would be at a particular task, and make other similar estimates.
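As a toy illustration (mine, not the post’s, with made-up numbers), here is the flavour of what P’ can and cannot do with respect to P:

```python
# Toy illustration of the reflection principle (all numbers hypothetical).
# The bounded module P' cannot compute P's actual answer, but it can estimate
# *properties* of that answer, such as how much sharper it is likely to be.
import math

n_bounded = 10      # data points P' effectively gets to use
n_ideal = 10_000    # data points the idealised P would effectively use

# P' cannot say what P's estimate of some quantity will be, but it can predict
# that P's uncertainty should shrink roughly like the square root of the ratio:
expected_shrinkage = math.sqrt(n_bounded / n_ideal)
print(expected_shrinkage)   # ~0.03: P should be about 30 times more certain than P'
```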

This P will be the strong version of the AI, looking at the future data. And P’ will allow the AI to estimate, now, what the answer of P might be. How exactly will it do this?

Define s to be a slice of the future. We could, for instance, say that s consists of every point 50 years into the future (along any time-like geodesic), thickened to a metre or so. There are some subtleties with this description—we might encounter black holes and other singularities—but we’ll lay them aside for the moment. The crucial aspect is that s is a barrier: we cannot get from now to a point 60 years in the future without crossing s at some point.

Then, once we’ve defined s, define S to be the total of all information present in that slice: the positions and velocities of every particle. This S is a noisy estimate, not a perfect one: the positions and velocities are somewhat uncertain. This noise is quite important, as in a deterministic general relativistic universe, it is the only thing preventing P from knowing the past perfectly. But there need not be much noise for the past to become blurred enough.
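To make the role of the noise concrete, here is a minimal sketch (entirely my own, with made-up particle states) of S as a blurred reading of the slice:

```python
# Minimal sketch: S as a noisy reading of the future slice s (toy numbers).
import random

def read_slice(particles, sigma=1e-3):
    """Return every particle's position and velocity, blurred by Gaussian noise.
    In a deterministic universe this blur is what stops the idealised module P
    from reconstructing the past perfectly from S."""
    return [(x + random.gauss(0, sigma), v + random.gauss(0, sigma))
            for (x, v) in particles]

true_state_in_50_years = [(0.0, 1.0), (2.5, -0.3)]   # made-up particle states
S = read_slice(true_state_in_50_years)
```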

From the perspective of today, S is a random variable whose values are unknown: we can estimate some of its properties, but it will depend on many things, not least of all the decisions that we (and the AI) will make.

What would be a bad S, from the reduced impact perspective? One where it would be extremely obvious, upon examination, that the AI was present in the past. This bad S would lead to a large likelihood ratio:

P(S|“The AI existed”) / P(S|“The AI never existed”).

If that ratio is large, there is a lot of evidence in slice s that the AI existed. Conversely, if the ratio is close to 1, there is little evidence of the AI’s existence.
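A toy numerical example (mine, not from the post) of how that ratio behaves:

```python
# Toy example: the likelihood ratio as a measure of the trace left in slice s.
# Hypothetical numbers: S records, say, a neatly rearranged laboratory.

p_S_given_AI_existed = 0.9    # likely, if the AI was on and hid itself badly
p_S_given_no_AI      = 0.1    # unlikely, if the AI was never switched on

ratio = p_S_given_AI_existed / p_S_given_no_AI
print(ratio)    # 9.0: slice s carries strong evidence that the AI existed

# Had the AI covered its traces, the two probabilities would be nearly equal,
# the ratio would be close to 1, and s would carry almost no evidence either way.
```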

A very complicated way of doing nothing

We’re now going to add more details to make that approach rigorous. First of all, how can we rigorise “The AI existed” and its negation? My favourite method for defining such a counterfactual is to have an event X—maybe an electron travelling down a wire—that will turn on the AI initially. The process is slightly noisy, so there is a tiny chance that X will fail; if it does fail, then the AI is not turned on at all. If we designate by B a collection of background knowledge about the world before the AI was turned on, we can define “The AI existed” as “X and B” and “The AI never existed” as “¬X and B”. Note that we are banning anthropic reasoning, allowing the AI to consider hypotheticals in which it didn’t exist. If P and P’ can do physics simulations, this should be fine: ¬X is a perfectly acceptable physical situation, with its own future consequences.

But if the turning on of the AI is done in full view of everyone, with the press recording everything and people twittering excitedly, then it will be blatant that the turning on was successful (or not). Hence we should put the AI in a room—not to constrain it, like a boxed oracle, but to give it some privacy. Thus it could be plausible that the future—the data S—would not have any trace of the AI’s existence. After an hour’s time, the AI’s mind will be thoroughly overwritten, and the room will be given a vigorous shake. This is not to constrain the AI: we’ll assume the AI is mobile and has access to enough tools inside the room to do whatever it wants. The setup is only to give the AI the option of pretending it never existed—of arranging events so that there is no evidence of it being turned on. Thus it becomes possible that the AI can act to minimise the future ratio P(S|X,B)/P(S|¬X,B), especially if we make the data in S somewhat noisy.

But that involves P, not P’, and involves the data S which the AI doesn’t know. So the AI needs to estimate the expected value of the ratio, using its expectation estimator E’ (powered by P’, which can estimate P and S):

E’(P(S|X,B)/P(S|¬X,B))

That expression means the expectation, as best the AI can currently calculate it, of the ratio of probabilities that would be calculated by a (hypothetical) super version of itself, with access to all the data S about the future slice s. That’s worth digesting for a second. The contents of S will be determined by the AI’s actions now, so the AI is in effect trying to pick the action that would make it hard for a super version of itself to tell whether the AI had ever been turned on. Thus it will try to minimise its impact on the world, by perfectly mimicking non-existence: by choosing the action a that minimises the above expected ratio. This is

argmax_a -E’(P(S^a|X,B)/P(S^a|¬X,B) | a)

Here we’ve added a minus sign so we’re maximising the negative of the ratio (entirely equivalent to minimising the ratio), and added a superscript to S to signify that the contents of S depend (probabilistically) on the action a.
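In pseudo-Python (a toy rendering, with a hypothetical stand-in for the E’ term rather than anything a real AI design would contain), the decision rule so far looks like this:

```python
# Sketch of the penalty-only rule: argmax_a -E'( P(S^a|X,B) / P(S^a|¬X,B) | a ).
# expected_ratio(a) stands in for the E'(...) term, which a real AI would have
# to estimate with its bounded module P'.

def choose_action(actions, expected_ratio):
    # maximising the negated ratio = minimising the expected evidence left in s
    return max(actions, key=lambda a: -expected_ratio(a))

# Made-up numbers: doing nothing leaves the least trace, so it gets picked.
ratios = {"do_nothing": 1.0, "print_coordinates": 3.0, "leave_the_room": 1e9}
print(choose_action(ratios, lambda a: ratios[a]))   # -> "do_nothing"
```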

As it stands, this is a complicated way of making the AI do absolutely nothing: it will seek to make the world indistinguishable from one in which it wasn’t turned on. We haven’t even added a utility function yet! So let’s do that, adding a U representing the AI’s goals, and keeping the above expression as a penalty function to ensure it accomplishes those goals with a minimum of fuss. Let’s require that U be bounded between 0 and 1, and then fix a weight λ such that the AI picks the action:

argmax_a E’(U|a) - λE’(P(S^a|X,B)/P(S^a|¬X,B) | a).

For most values of λ, this will either be an expected utility maximiser (for λ too small: the utility dominates) or remain a complicated way of doing nothing (for λ too big: the penalty function dominates). It’s not too clear what happens in the transition: trading off utility versus penalty is not obvious. We have to find a better way of extracting useful work from the system.
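The same sketch with the utility term bolted on (again with hypothetical stand-in functions and made-up numbers) shows the two regimes just described:

```python
# Sketch of argmax_a  E'(U|a) - λ·E'( P(S^a|X,B)/P(S^a|¬X,B) | a ).

def choose_action(actions, expected_utility, expected_ratio, lam):
    return max(actions,
               key=lambda a: expected_utility(a) - lam * expected_ratio(a))

utilities = {"do_nothing": 0.0, "aim_laser": 1.0, "take_over_world": 1.0}
ratios    = {"do_nothing": 1.0, "aim_laser": 3.0, "take_over_world": 1e9}

for lam in (0.01, 2.0, 1e6):
    print(lam, choose_action(utilities,
                             lambda a: utilities[a],
                             lambda a: ratios[a],
                             lam))
# 0.01 -> aim_laser    (λ too small: plain expected utility maximiser)
# 2.0  -> do_nothing   (penalty already dominates)
# 1e6  -> do_nothing   (λ too big: complicated way of doing nothing)
```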

The output channel: a very specific loophole

Part of the problem is that the AI can have a massive effect just by existing and answering a single question correctly—people will be motivated to run out and build more AIs, write philosophical papers about AIs, and so on. In the example this post started with, the AI will have a massive impact: it alone will be the difference between the planet getting rendered uninhabitable by the comet or not. So reduced impact seems impossible in these cases.

For these cases we devise an output channel. The idea is simple: information along this output channel, and only along this channel, is allowed to affect the future. This means that the content of the output channel is potentially very dangerous and other precautions need to be taken. Still, it can be useful: for instance, we could ask the AI to help us aim our laser, and restrict its output so that we only get coordinates, no secret info. Then our reduced impact approach would preclude the AI from having any other impacts on the world (i.e. no optimising the future), and we’d get our world saved for free.

How can we encode this idea? If O is the output information (which will be a function of the AI’s actions), we can include it inside the probability ratio, writing P(S^a|X,O^a,B)/P(S^a|¬X,O^a,B) (with the little superscript a to remind us that O depends on the AI’s actions). This has the right formal structure: if the ratio is close to one, then it means that if you know the background information B and the output information O^a, then observing S^a gives you no extra information about whether the AI was on or not. So the AI is allowed to affect the future through the output channel, but not through any alternative means.

But what exactly is P(S^a|¬X,O^a,B)? How can the AI have sent that output, if it wasn’t even turned on? We can imagine that if ¬X happens—if the AI is not turned on—then the output wire is connected to an excited, noisy gas, and random fluctuations in pressure or something will type out the output message. Now, it’s extremely improbable that such a process would actually type out a coherent useful message. But that doesn’t matter: a coherent message is extreme evidence of X over ¬X, but as long as there is no additional information detectable in slice s (i.e. in the future of the universe), we’re fine: the AI has not broken out and started optimising things.

At least… that’s the theory, and it would be true if we knew for a fact that the only way of the AI being turned on was through the X process. In practice, if we observe a coherent output O and know ¬X happened, we wouldn’t say “the AI was never turned on and this message arrived by chance” but “the AI was turned on in some other way (maybe a passing cosmic ray) and typed out that message”. So ¬X and O together no longer imply the AI was not turned on, undermining our assumptions. So in the denominator, we need to replace O with Õ. Õ corresponds to the same output as O, but specifically describes the output being created by the gas (and not any other process). This does not seem infeasible to require, as P and P’ can comprehend physics concepts. Given this, we can take ¬X, Õ and B together to mean that the AI was (almost certainly) never turned on (and the output message arose by chance). Thanks to Daniel Dewey for pointing out this issue.
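A quick back-of-the-envelope check (with entirely made-up numbers) of why the conditioning matters:

```python
# Toy numbers for why the output must sit in the conditioning on both sides.
# Unconditioned, a coherent message in the future slice is overwhelming
# evidence that the AI was switched on:
p_msg_given_X     = 0.99     # AI on: it prints the coordinates
p_msg_given_not_X = 1e-30    # excited gas types them out by sheer chance

print(p_msg_given_X / p_msg_given_not_X)   # ~1e30: astronomically strong evidence

# In P(S^a|X,O^a,B) / P(S^a|¬X,Õ^a,B) the message sits in the conditioning on
# both sides, so it no longer counts against the AI; only traces *beyond* the
# output channel (footprints, heat signatures, grey goo) can still push the
# ratio away from 1.
```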

So the final equation is that the AI must take the action:

argmax_a E’(U|a) - λE’(P(S^a|X,O^a,B)/P(S^a|¬X,Õ^a,B) | a).

This seems to be enough to accomplish what I claimed at the beginning: to get an uncontained AI that will give us the coordinates at which to aim the laser to kill the comet, then quietly turn itself off and leave no further trace of its existence. I’ll repost the image here, detailing the penalty function; it should now be more comprehensible:

More (or less) on the output channel

The output channel is worth pondering in more detail. There are in reality two components to the output: any message the AI may be transmitting, and the indirect effect of confirmation of the AI’s existence (and its performance). Most of the time, the first component includes the second: if we ask the AI to aim a laser to save the planet, the observation “hey the planet isn’t dead—the AI saved us!” derives simply from the AI’s output of coordinates.

But we can make some of the output dedicated to the second effect without having the first at all. If we had established it was safe to video the AI while it went about its business (note to everyone: it isn’t safe), then we could add that camera feed to the output channel—the AI would not be expected to output anything through it. If we had some really good physics models, we might have someone observe the AI doing its stuff, and put that person’s optic nerve on the output channel. So then the penalty function equation would encode “given that you had the camera feed, the background knowledge, and the AI’s answer, then if observing the future slice s gives you no extra info as to whether the AI existed or not, the AI is of reduced impact”.

Or at least, of reduced impact outside the output channels: we’ve mainly constructed a “no back channels for action” AI, rather than a reduced impact AI per se.

But the idea feels as if it could be developed to get a true reduced impact AI, by carefully parsing what we put on or removed from the output channel. Some AI actions do not need the output channel. Suppose the AI was a paperclip maximiser, and that a constant stream of paperclips poured into and out of the room it was veiled in. If the AI added a few more paperclips to the multitude, this would probably be impossible to notice, especially if we’d been sloppy about keeping records of the quantity of paperclips in the first place. In this situation, a reduced impact AI could build a few paperclips, affecting the world (slightly) in a way that wouldn’t be detectable in 50 years’ time. And all this without needing to put the impact on an output channel (a similar, if more gory, idea would allow the AI to make people’s deaths more pleasant, in undetectable ways).

We could slice up outputs like that, removing things that would genuinely be of reduced impact from the output channel. We might also put a dampener on (some parts of) the output channel. Currently the AI has no penalty for affecting the universe through O; maybe instead we could impose a reduced penalty, rather than no penalty at all. So it’s easier to affect the future through O, but not free. I’ll return to these ideas once I’m convinced this current setup works, both in theory and for a plausible AI design (for instance, I think the above formula would have to be massaged to allow reasonable self-improvement).
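One possible way of writing down that dampener (my own extrapolation, not something pinned down above): give each channel its own penalty weight, with the sanctioned output channel cheap but not free. A minimal sketch, with hypothetical names throughout:

```python
# Sketch (an extrapolation of the "dampener" idea, not a settled design):
# per-channel penalty weights instead of a single λ.  expected_ratio(a, channel)
# stands in for an E'(...) likelihood-ratio term restricted to traces reachable
# through that channel.

def objective(action, expected_utility, expected_ratio, weights):
    penalty = sum(w * expected_ratio(action, channel)
                  for channel, w in weights.items())
    return expected_utility(action) - penalty

# e.g. weights = {"output_wire": 0.1, "everything_else": 1e6}:
# affecting the future through the output wire is cheaper, but no longer free.
```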

This is the point in the post where the author normally asks for thoughts and impressions. I will show my rebellious and independent nature by instead asking for impressions and thoughts.