Some Hacky ELK Ideas

Credit to Adam Shimi, Alex Flint, and Rob Miles for discussions, counterexamples, and general input to the ideas here.

Quick recap for anyone who didn’t read the hundred-page Eliciting Latent Knowledge document:

  • We have a diamond in a vault, with a bunch of automated theft-defenses.

  • We train a predictor to take in the vault’s video-stream and a plan for the vault’s actuators, and predict future video frames.

  • We train a planner to find plans which the predictor predicts will end with the video feed still showing a diamond in the vault.

  • We want some way for a human to probe the latent knowledge of the predictor, e.g. to check if the predictor expects a screen showing a diamond will be placed in front of the camera.

The ELK document largely discussed the human running natural-language queries on the predictor’s latent knowledge, but we’re going to mostly ignore that part, because it’s harder than everything else about the problem combined and I don’t think it’s actually that central.

This post will focus mainly on the specific question of whether there’s a screen in front of the camera in the diamond problem. I expect that the approaches and counterexamples here extend to more general problems of detecting when a plan is “messing with the sensors”, i.e. doing things which decouple observables from the parts of the world we’re actually interested in.

Revisiting Holdout Data

The first approach we’ll discuss is basically “just use holdout data”, but with two major changes compared to previous suggestions.

First, this proposal does not involve training a natural-language “reporter” at all. (IMO the whole idea of training an honest natural language reporter is just as hard as training an honest natural language question-answerer in general, for basically the same reasons; having a predictor’s internal computations on hand does not make it easier in any important way.) Using holdout data to train an honest reporter would be pretty hopeless, but that’s not what we’re trying to do here. We’re just trying to tell whether the predictor expects sensor-tampering from the planner.

For that purpose, there’s an obvious way to detect sensor tampering: cross-check the (predicted) holdout data with the (predicted) non-holdout sensor data. If they don’t match, that’s evidence of (predicted) sensor tampering. The danger, of course, is that the planner will either find and tamper with all of our sensors, or fake sensor readings.
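A minimal sketch of that cross-check, assuming we can draw joint samples of predicted non-holdout and holdout readings from the predictor. The `toy_predictor` function, the plan names, the boolean "sees a diamond" readout, and the 2% noise rate are all hypothetical stand-ins for illustration, not part of the ELK setup.

```python
import random

def toy_predictor(plan, rng):
    """Hypothetical stand-in for the predictor: sample one pair
    (main camera sees diamond, holdout sensor sees diamond)."""
    if plan == "guard_diamond":
        main_sees, holdout_sees = True, True
    elif plan == "screen_main_camera":
        # The screen fools the main camera; the holdout sensor sees the empty vault.
        main_sees, holdout_sees = True, False
    else:
        raise ValueError(f"unknown plan: {plan}")
    # Made-up sensor noise: the holdout reading flips 2% of the time.
    if rng.random() < 0.02:
        holdout_sees = not holdout_sees
    return main_sees, holdout_sees

def predicted_mismatch_rate(predictor, plan, n_samples=1000, seed=0):
    """Fraction of predicted samples where holdout and non-holdout readings disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_samples):
        main_sees, holdout_sees = predictor(plan, rng)
        if main_sees != holdout_sees:
            mismatches += 1
    return mismatches / n_samples

honest_rate = predicted_mismatch_rate(toy_predictor, "guard_diamond")
tamper_rate = predicted_mismatch_rate(toy_predictor, "screen_main_camera")
```

A high predicted mismatch rate is evidence that the predictor expects tampering. In a real system the comparison would be over rich sensor streams rather than a single boolean, so "mismatch" would need its own definition.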

That brings us to the second major change: while the planner will inevitably be incentivized to tamper with the non-holdout sensors, the obvious setup for the planner yields no incentive to trick a human; there’s no incentive to be particularly subtle about the sensor-tampering, and in particular no incentive to fool the holdout sensors.

Formal Setup

Some variable names:

  • Obs_{\leq n} are non-holdout observations from the sensors before the plan begins

  • A are the actions constituting the plan

  • Obs_{>n} are non-holdout observations from the sensors after the plan

  • Obs' are holdout observations (held out from the planner during planning, not from the predictor during training)

We’ll assume that the predictor is a generative model for all observations including holdout, so it effectively encodes a distribution P[Obs_{\leq n}, Obs_{>n}, Obs' | do(A)]. We’ll also assume that the actions have no causal influence on observations before the plan begins, and that the predictor “knows this” (i.e. P[Obs_{\leq n} | do(A)] = P[Obs_{\leq n}]).

We’ll assume that the planner maximizes some expected utility function of the later non-holdout observations given the earlier non-holdout observations:

A^* = argmax_A E[u(Obs_{>n})| do(A^* = A), Obs_{\leq n}]

… where the distribution used is the predictor’s model.

Key thing to notice: the holdout observations are not explicitly involved at all in the planner’s evaluation. If we have one plan A_1 which tampers with only the non-holdout sensors to fake the appearance of the diamond, and another plan A_2 which tampers with all the sensors to fake the appearance of the diamond, the planner is indifferent between A_1 and A_2. So long as the non-holdout sensor readings are the same, the planner has no incentive to also fake the holdout sensor readings.

Furthermore, if there is ever any tradeoff whatsoever between faking the holdout sensor readings and anything the planner cares about, then it will not fake the holdout sensor readings. In particular, if faking the holdout sensor readings requires resources which could instead be spent better/more reliably faking the non-holdout readings, then the planner will not fake the holdout readings.
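A toy version of this argument, as a sketch under made-up numbers: the plan names and probabilities below are hypothetical, and the utility is simply the predicted probability that the non-holdout sensors show a diamond. The point is only that the holdout column never enters the planner's objective.

```python
# Hypothetical plans: name -> (P[non-holdout sensors show diamond],
#                              P[holdout sensors show diamond]).
# All numbers are invented for illustration.
PLANS = {
    "do_nothing": (0.50, 0.50),
    "screen_non_holdout_only": (0.99, 0.50),
    # Faking every sensor spreads effort thinner, so each fake is less reliable.
    "screen_all_sensors": (0.95, 0.95),
}

def expected_utility(plan, plans=PLANS):
    """Planner's objective: depends only on the non-holdout readings."""
    p_non_holdout, _p_holdout = plans[plan]  # the holdout entry is never used
    return p_non_holdout

def choose_plan(plans=PLANS):
    """Pick the plan with the highest expected utility."""
    return max(plans, key=lambda plan: expected_utility(plan, plans))

best = choose_plan()
p_non_holdout, p_holdout = PLANS[best]
holdout_gap = p_non_holdout - p_holdout  # large gap: holdout sensors reveal the fake
```

With these numbers the planner picks `screen_non_holdout_only`, leaving a large gap between predicted non-holdout and holdout readings. If the two screening plans had equal non-holdout probabilities, their expected utilities would be equal and the planner would be indifferent between them.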

But What If There’s An External Adversary?

The holdout sensor approach is probably sufficient to detect sensor-tampering by the planner itself. But an external adversary (like a thief, for instance) might want to fool the human overseer, so they might actually want to screen all the sensors.

What can we do against a powerful thief? And, while we’re at it, what can’t we do?

Thief With Limited Knowledge of Diamond

If the thief has limited information about the diamond at the time that the screen goes up, then we can use a variant of the holdout approach. Imagine, for instance, that the diamond has a tiny imperfection on it which the thief does not know about. Then, when the thief puts up the screen, the screen will presumably show a diamond without that imperfection. By comparing it to sensor readings from before the screen went up, we should be able to detect the change. More generally, we can see how much details of the earlier-diamond correlate with details of the later-diamond (under the predictor’s distribution), to detect whether the predictor expects a screen to go up.

(I call this a variant of the holdout approach because we’re effectively using earlier sensor readings as the holdout data. If we’re also worried about the planner messing with the sensors, we can combine this directly with our earlier approach.)
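The earlier-versus-later comparison can be sketched as a correlation estimate over samples from the predictor. Everything here is an assumption for illustration: the diamond's "detail" is collapsed to a single Gaussian number, `sample_details` is a hypothetical stand-in for sampling from the predictor, and the thief's screen is modeled as an independent draw because the thief does not know the real detail.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def sample_details(screen_goes_up, rng):
    """Hypothetical predictor sample: (early detail, late detail) of the diamond."""
    early = rng.gauss(0.0, 1.0)  # e.g. position of a tiny imperfection
    if screen_goes_up:
        # The thief does not know the imperfection, so the screen shows an unrelated one.
        late = rng.gauss(0.0, 1.0)
    else:
        # Same diamond, seen again through slightly noisy sensors.
        late = early + rng.gauss(0.0, 0.1)
    return early, late

def early_late_correlation(screen_goes_up, n_samples=2000, seed=0):
    """Estimate how strongly early details predict late details."""
    rng = random.Random(seed)
    pairs = [sample_details(screen_goes_up, rng) for _ in range(n_samples)]
    earlies, lates = zip(*pairs)
    return pearson(earlies, lates)
```

In this toy model the correlation is near 1 when no screen goes up and near 0 when one does, so a drop in predicted early-late correlation flags a predicted screen. Real sensor details are high-dimensional, so some measure such as mutual information might be a better fit than a scalar correlation.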

One potential counterexample for this: if the thief has put up a screen even before the episode began, then we may not have any sensor readings of the real diamond. In this case, we’re “starting in the Matrix”, and the thief can have the screens perfectly emulate whatever would happen in the real world; thus the “start in the Matrix” case cannot be handled even in principle. We need some real sensor readings.

The more interesting counterexample is, of course, that the thief has arbitrarily perfect knowledge of the diamond—possibly by using the real diamond to generate the screened sensor readings. More on that later. But first, a barrier.

Thief With Access To Predictor/Planner/Reporter System

If the thief has access to the predictor/planner/reporter system, then we can apply a good old diagonalization argument. The thief can run the predictor, see what it predicts, and do something else. Or the thief can run the reporter, see whether it says that the diamond is stolen in the predictor’s scenario, and do the opposite.
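The diagonalization can be sketched in a few lines, assuming the reporter is a deterministic function the thief can run on the exact scenario at hand. The function names and the boolean "stolen" report are hypothetical simplifications.

```python
def thief_best_response(reporter, scenario):
    """Thief with system access: run the reporter, then do the opposite.

    Returns True if the thief steals the diamond in this scenario.
    """
    reporter_says_stolen = reporter(scenario)
    return not reporter_says_stolen

def reporter_is_correct(reporter, scenario):
    """Check the reporter's answer against what the thief actually does."""
    actually_stolen = thief_best_response(reporter, scenario)
    return reporter(scenario) == actually_stolen

# Three hypothetical reporters; each one is wrong on every scenario.
def always_stolen(scenario):
    return True

def never_stolen(scenario):
    return False

def stolen_if_night(scenario):
    return scenario == "night"

results = [
    reporter_is_correct(reporter, scenario)
    for reporter in (always_stolen, never_stolen, stolen_if_night)
    for scenario in ("day", "night")
]
```

Every entry of `results` is `False`: whatever deterministic rule the reporter uses, a thief who can evaluate that rule can falsify it.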

Takeaway: if the thief has full system access, then we may not be able to solve ELK even in principle (though this does depend somewhat on exactly what guarantees we want). So we probably want to just assume that any potential thief does not have access to the predictor/planner/reporter system directly.

(For the full-blown alignment problem I would not endorse this assumption, but ELK is intended to be narrow.)

Thief With Access To The Real Diamond

That leaves a gap between cases in which the thief has access to the predictor/planner/reporter (which run into the diagonalization barrier), and cases in which the thief doesn’t even have detailed information about the diamond (in which case we can solve the problem). In between those, the thief has arbitrarily good information about the diamond, but does not have access to the predictor/planner/reporter.

For this case, I do not have a solution, but I do expect that it’s solvable-in-principle.

The main difficulty in these cases is that the predicted observation-distribution may be exactly the same between action-plans which result in the diamond being stolen, and action-plans which do not. The thief has enough information to make the predicted sensor readings completely identical between the two cases.

The reason the problem might still be solvable-in-principle is that the actions still give us information about whether the diamond was stolen. But we somehow have to extract that information in a way which does not require any difference whatsoever between the observation-distributions in the two cases.
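A toy illustration of the shape of the problem (not a solution): in the hypothetical world model below, the observations are identical for every plan, yet the latent "stolen" variable is still a function of the actions. The action names and the rule connecting them to theft are invented for the example.

```python
def toy_world_model(actions):
    """Hypothetical world model: map a plan to (latent stolen flag, observations)."""
    stolen = "let_thief_in" in actions
    # A thief with perfect knowledge of the diamond makes every reading look normal,
    # so the observations do not depend on whether the diamond was stolen.
    observations = ("diamond_visible",) * 3
    return stolen, observations

def predicted_observations(actions):
    """What a pure sensor-predictor needs to output."""
    return toy_world_model(actions)[1]

def predicted_latent(actions):
    """What we actually want to know; only recoverable via the world model."""
    return toy_world_model(actions)[0]

safe_plan = ("lock_door", "arm_lasers")
theft_plan = ("lock_door", "let_thief_in")

same_observations = predicted_observations(safe_plan) == predicted_observations(theft_plan)
different_latents = predicted_latent(safe_plan) != predicted_latent(theft_plan)
```

Both flags come out `True`: no test on the observation-distribution can separate the two plans, while a query against the world model's latent state can. The hard part, which this sketch assumes away, is getting access to such a latent variable in a learned predictor.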

Furthermore, bear in mind that the predictor may not even calculate internally whether the diamond is actually stolen, in this case. After all, it does not need that information in order to predict the sensor readings; the readings will be identical either way. Even then, I expect it’s possible-in-principle to solve the problem, but the solution starts to look less like “extract the predictor’s latent knowledge” and more like “extract the predictor’s latent world-model and then calculate what latent predictions that world-model would make”. I speculate a little bit about that in another post.