Counterexamples to some ELK proposals

paulfchristiano31 Dec 2021 17:05 UTC

LW: 53 AF: 32

Eliciting Latent Knowledge AI Alignment Research Center (ARC)

(This post was written as part of my work at the Alignment Research Center.)

In this post I’ll describe some possible approaches to eliciting latent knowledge (ELK) not discussed in our report. These are basically restatements of proposals by Davidad, Rohin, Ramana, and John Maxwell. For each approach, I’ll present one or two counterexamples that I think would break it.

I think of these approaches as being based on two fundamental hopes:

If our AI tampers with one sensor, there may be other sensors we could deploy, or experiments we could run, in order to notice that something fishy has happened. We could have an AI help us design those experiments/sensors.
Actions that covertly tamper may be similar to actions that visibly tamper, e.g. they may exploit similar unknown facts about physics or they may involve common substeps. So we might be able to notice covert tampering by pointing out similar action sequences that lead to visible tampering.

While these hopes may hold in practice, in this post I’ll describe possible situations where they would fail. (I don’t think anyone particularly expected these approaches to be counterexample-proof.)

I’m only claiming that these failures are possible. I think it is also interesting to try to generate the most probable failure cases and to actually estimate their probability, but that’s definitely not what I’m doing here.

Strategy: Hold out sensors

(Proposal #1 here)

In addition to asking my model to predict one set of sensors, I can deploy a second “held out” set (e.g. more cameras or seismographs or microphones…). I don’t ask my model to predict those sensors, but I show them to a human before getting them to answer questions. Hopefully that sometimes lets them notice tampering that would have been invisible without the new sensors.

Worry: human simulator can use all the extra sensor readings too

I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set. That task need not be much more computationally demanding than the original human simulator.

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”

An AI doesn’t have to be exactly correct about what sensors exist—when in doubt it can just predict what the sensor would say and imagine giving it to the human.

Strategy: Predict hypothetical sensors

(Proposal #2 here, also suggested with counterexample by Rohin in private communication)

Instead of installing a single sensor, I could randomly choose what sensors to deploy and how to install them. Then I could train a model Predict⁺ to map (description of sensor) → (sensor reading), and I could give a human the readings from many hypothetical sensors in order to give them the best chance of detecting tampering.

Worry 1: Predict⁺ may not be reasonable on non-existent sensors

To collect training data for Predict⁺, I actually install sensors and then use their readings as a target. In every situation, my AI is being asked to predict the readings from a sensor that I really did build.

If my AI is able to deduce the presence of sensors (as described in the last counterexample), then it’s unclear how Predict⁺(sensor description) will behave when we give it a description of a sensor that does not in fact exist.

In particular, Predict⁺(sensor description) might behave in the following undesirable way: it predicts the future observations conditioned on its observations and on the fact that someone typed (sensor description) into the dataset. But it knows that no sensor with description (sensor description) exists. So if someone typed that description into the dataset, that just means something funny is going on—not that the dataset will magically contain readings from a non-existent sensor.

For a predictor which understands the world well it seems quite plausible that this is the most natural behavior. Otherwise it needs to reason about a physically impossible counterfactual to get the “intended” answer (which it never needed to do to answer any other questions). Moreover, it would also be undesirable if the predictor conditioned on such a sensor existing despite having strong evidence to the contrary (since its predictions may then be extremely strange).

(A more extreme version of the same worry applies to proposal #3 here. A predictor seems even more likely to behave badly when it is given a description of a sensor that the human obviously couldn’t have built, and it would definitely know that the sensor doesn’t actually exist.)

Worry 2: the human may not be able to tell what is going on even with all the sensors

For some kinds of tampering no possible sensors would reveal what had happened. For example, we can imagine that our AI has covertly built an army of tiny machines which deconstruct the objects in the room while replacing them with fake versions that look the same on the outside. Sufficiently sophisticated machinery could reproduce the results of almost any experiment we perform (e.g. if you broke an egg, the tiny machines could simulate the sound of an egg cracking and quickly create the yolk on demand).

More realistically, our AI could search for sensors and tamper with any that it finds. It’s not clear how it is supposed to answer questions about what a sensor “would have shown”—if the hypothetical sensor had actually existed, then it also would have been found and tampered with. In this scenario, it may be “correct” for Predict⁺ to report the tampered value.

(As usual, this is only a problem if the human overseer literally can’t notice anything suspicious about a sequence of actions which is in fact constructing an army of sophisticated tiny machines. We are putting the burden of proof on the builder to show that a given concern is impossible.)

Strategy: conduct new experiments to notice that something is wrong

(Part of the proposals here and here.)

In the previous section I described installing sensors before the AI acted so that we’d have a better chance of seeing what was going on in real time. But we could also imagine installing sensors or conducting experiments after arriving at a given state, in order to better understand that state.

For example, suppose I believe that the diamond is in the room and all sensors are functioning correctly. I could ask another AI to design an experiment which would show that I’m fundamentally misunderstanding the situation and the sensors aren’t functioning correctly (e.g. perhaps it would walk me through building an x-ray machine to show me that the room is actually empty).

Worry 1: there is no experiment that reveals the problem

Ultimately to implement an experiment we will execute some actions in the world and observe the results. These experiments won’t be successful if either:

The sensors used to record the results of an experiment are also tampered with, and there is no sequence of actions that “fix” those sensors. (As in the “delusion box” introduced here and discussed as a counterexample for an ELK-like problem here.)
The actuators for carrying out an experiment have been tampered with. Perhaps our AI “proposes an experiment,” but actually the human sees something totally different (or is incapacitated).

You could hope to avoid this problem by using the same sensors and actuators your AI uses to interact with the world; then any form of tampering which undermines those sensors/actuators seems like it must also undermine the AI’s ability to act in the world.

But it seems like we could still have a problem:

Your AI may not continue acting using the same actuators or sensors, or at least not only the same actuators, with which it was originally constructed. It may have constructed tools in the environment, other autonomous processes, or other copies of itself.
Even if your AI is continuing to use the same sensors, it could tamper with sensors in a way that preserves their usefulness while forcing them to look innocuous to the human (e.g. by putting information in higher-order bits of a benign-looking video).
In some situations an AI would be motivated to tamper with sensors even if doing so reduced its own ability to act in the world.

Worry 2: an experiment can itself perform tampering

If we let our AI propose experiments to “reveal” tampering, it may instead propose experiments that do tampering from scratch. That tampering could then provide convincing evidence that something is fishy. If this can be done in any state, then it is useless for telling whether something fishy has actually happened.

We could try to avoid this by restricting to actions that are simple enough that a human understands how they work and can be confident that they won’t tamper. But it then becomes even more unclear whether there is any experiment that would identify problems (since we are effectively pitting a human experimenter against a very powerful manipulator).

Alternatively, we could hope that “reporting existing tampering” involves shorter or simpler action sequences than tampering from scratch. Then we could identify a state as fishy if it’s possible to find a short sequences of actions that look fishy.

But unfortunately the fishiest states could end up looking the safest. For example, suppose that if an AI spends 10 * N + 100 timesteps tampering with the sensors, it takes N steps to “undo” its efforts and reveal that something has gone wrong. A simple tampering attempt takes 200 timesteps to execute and 10 timesteps to undo, which sounds good. But if the AI spent 1100 timesteps tampering, it would take 100 timesteps to undo its work, which is more than the amount it would take to tamper from scratch. So this extremely-tampered-with state would end up looking less fishy than a normal state.

Strategy: detect covert tampering by analogy to visible tampering

(Part of the proposal here.)

As we’ve just discussed, some forms of sensor tampering might also destroy your ability to ever figure out whether tampering has occurred (by disabling actuators or sensors). But it’s still plausible that on the way to tampering we will pass through states in which something fishy is clearly about to happen. If we interrupt these action sequences in the middle, maybe we can do an experiment that shows that something has gone wrong.

Put differently: maybe action sequences that result in covert tampering have the same initial steps as action sequences that result in visible tampering.

More generally: we could look for any recognizable similarity between action sequences that result in covert tampering and action sequences that result in nearly-identical yet visible tampering.

Intuitively you might be optimistic about this if e.g. a would-be tamperer needs to somehow specify what they want to show up on the camera after it’s tampered with. If they just specified something slightly different then the action sequence would still be almost the same yet it would be easy to tell that the camera was tampered with (e.g. because it’s showing something physically impossible). Even if a human doesn’t understand how the covert tampering worked, they could say that something bad has happened because the action sequence is similar to one that causes overt tampering.

Worry: action sequences may be brittle

There may be sequences of action that covertly tamper with the sensors, but for which any small change leads to no tampering at all.

As an extreme example, we can imagine an AI whose first action is to commit to the hash of its next 50,000 actions. Perhaps it chooses a value x, and then installs a device that will cut power to its motors unless the hash of the next 50k actions is equal to x. Perhaps it “should have been” possible to change action 34,279 in order to carry out visible tampering rather than covert tampering—but instead what will happen is that the device will trigger and incapacitate the AI, so that no tampering at all occurs.

More realistically you might imagine an environment where an AI needs to prepare for chaotic influences in the environment—if it changes the action in timestep 17 it could have subtle effects later that change what actions in timestep 140 are feasible. In this case, even if the effects of tampering are determined by actions at late timesteps, you still need to take corresponding actions at early timesteps in order to prepare.

A broader way of viewing this concern is that covert and visible tampering may require “similar” actions in an abstract sense that is meaningful to an AI, which nevertheless correspond to very different sequences of low-level actions and which look quite different to a human.

If there are any kinds of tampering that are brittle in this way, then we won’t be able to recognize them (and so if we select against tampering we may be left with such actions).

What links here?