In the last few months, some AI safety researchers have signaled what Wei Dai called a “wave of optimism” for using transparency tools to catch deceptive behavior in our machine learning models. By a transparency tool, I mean any technique that allows us to peek inside a model to get a better idea of what goes on inside it. This post outlines one reason to be pessimistic about using transparency tools to catch advanced deception.

First, I construct a model for why we need transparency tools in the first place. I imagine the training process as a game played by two adversaries: a deceptive model, and an overseer trying to catch any deceptive reasoning. The issue for the overseer is that the deceptive model is opaque, which means that the overseer must use some tools to interpret its thoughts and behavior.

Although the deceptive model and the overseer are, by design, comparably intelligent, the use of transparency tools enables a “weakest link” dynamic: the deceptive model can exploit weaknesses in the transparency procedure to fool the overseer. In particular, deception can stay hidden if the deceptive model has a good understanding of the deployment environment that the transparency tools do not reveal.

In the end, I consider whether this weakness implies that transparency tools act as unnecessary middlemen for catching deception, and use an analogy to GANs to argue that training an overseer to detect deception end-to-end might work better.

The model

Our goal is to take a neural network and detect whether there are any inputs to the network that will cause it to act unacceptably. By assumption, we are concerned about deceptive alignment, a scenario where a misaligned model understands its training process, and will therefore act aligned during training while plotting something else for later.

Since we want to detect deceptive thoughts, we are interested in opening up the model and seeing what goes on inside. We have access to an overseer that interprets the model and spends time thinking about whether the model has inputs that cause it to act catastrophically, which will allow us to penalize the model accordingly. Crucially, this overseer is not able to understand the model weights by themselves: instead, it relies on a transparency procedure that converts the model into a relatable abstraction, which it then uses to determine whether the model is deceptive.

Formally, we can imagine the dynamic as follows. We are training a model $m$, which comes from a class of models $M$. A transparency function $T : M \to D$ maps models in $M$ to abstract descriptions of what the model does. Descriptions can be construed very broadly, and they can include natural language descriptions of the model’s decision process, charts and graphs, or even pseudocode. This description is fed to an overseer $O : D \to \{\text{True}, \text{False}\}$, which detects whether the description implies that the model will act deceptively in the deployment environment.
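As a minimal sketch of this setup, the two functions can be written down as types. Everything here is a hypothetical stand-in: the `Branch` representation of a model, the line-per-branch description format, and the keyword check inside `overseer` are assumptions made for illustration, not part of the formal model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Branch:
    condition: str  # human-readable trigger, e.g. "lighting is dim"
    action: str     # what the model does when the trigger fires


# Toy stand-in for a model m in M: a flat list of branches.
Model = List[Branch]
# Toy stand-in for a description in D: a list of plain-text lines.
Description = List[str]

TransparencyFn = Callable[[Model], Description]  # T : M -> D
Overseer = Callable[[Description], bool]         # O : D -> {True, False}


def transparency(model: Model) -> Description:
    """Hypothetical T that reports every branch verbatim."""
    return [f"if {b.condition}: {b.action}" for b in model]


def overseer(description: Description) -> bool:
    """Hypothetical O: True if some described line reveals a defection."""
    return any("defect" in line for line in description)


def is_flagged(model: Model, T: TransparencyFn, O: Overseer) -> bool:
    # The overseer never inspects the model itself, only T(model).
    return O(T(model))
```

The point of `is_flagged` is the composition: whatever the overseer concludes, it concludes from `T(model)` alone.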

We can think of the transparency function as making a tradeoff between two variables: brevity of the description, and faithfulness of the description to the model $m$. If the description is too long or complex, then there is no point in using a transparency function in the first place, since it doesn’t actually make the model any more transparent. On the other hand, if the transparency function produces descriptions that are too short, then another problem occurs: the overseer will miss information that allows it to deduce the deceptive capabilities of $m$.
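To make the two variables concrete, here is a toy sketch in which a model is a list of rules, brevity is a rule budget, and faithfulness is the fraction of situations on which the description agrees with the model. The rule format, the `budget` parameter, and this particular faithfulness score are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (situation, action); purely illustrative


def describe(rules: List[Rule], budget: int) -> List[Rule]:
    """Hypothetical transparency function: keep only the first `budget` rules."""
    return rules[:budget]


def predict(rules: List[Rule], situation: str) -> str:
    """Action the rule list prescribes, or "unknown" if no rule covers it."""
    for known_situation, action in rules:
        if known_situation == situation:
            return action
    return "unknown"


def faithfulness(model: List[Rule], description: List[Rule]) -> float:
    """Fraction of the model's situations on which the description agrees."""
    situations = [s for s, _ in model]
    agree = sum(predict(model, s) == predict(description, s) for s in situations)
    return agree / len(situations)


model = [
    ("daylight", "classify"),
    ("overcast", "classify"),
    ("glare", "classify"),
    ("night", "something else"),
]

for budget in (4, 3, 1):
    print(budget, faithfulness(model, describe(model, budget)))
```

With a budget of 3 the description scores 0.75 on this measure, yet the one rule it drops is the only rule whose action differs from the rest. A score that weights all situations equally says nothing about whether the omitted part was the part that mattered.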

Achieving a good balance of these variables will determine whether we should use transparency tools in the first place. A good transparency scheme will ideally provide a concise description of how the model makes decisions, while ensuring that nothing important is left out. We will now explore an example of this dynamic in an image classifier, and I will suggest that striking a perfect balance could be difficult.

A deceptive image classifier

Suppose we are training a neural network to recognize cars using some advanced neural network design. This neural network is smarter than your average convolutional neural network. In particular, rather than imagining it only as a set of neural layers with edge detectors, we also imagine that it has adaptive computational layers that can change depending on the gradient updates. At the extreme end, the training process could even nudge the neural network into arbitrary computational graphs, which allows for if-statements, explicit memory, and loops.

Allowing the model to adapt its structure during the gradient updates is very important, as it enables the model to be more robust to different domain distributions. To illustrate this effect, suppose the gradient updates produce the following schematic for the model.
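The schematic itself is not reproduced here. As a rough stand-in, and assuming only what the surrounding text implies (a learned if-statement that checks for dark or dim lighting before classifying), the graph might look something like the sketch below. The feature names, the threshold of 0.3, and the low-light boost are hypothetical.

```python
from typing import Dict

# Hypothetical image summary: {"brightness": ..., "wheel_score": ...}
Image = Dict[str, float]


def standard_features(image: Image) -> float:
    # Daylight path: use the raw wheel evidence as-is.
    return image["wheel_score"]


def low_light_features(image: Image) -> float:
    # Dim path: boost weak evidence to compensate for lost contrast, capped at 1.
    return min(1.0, image["wheel_score"] * 2.0)


def classify(image: Image) -> str:
    # The learned if-statement: branch on lighting before classifying.
    if image["brightness"] < 0.3:
        score = low_light_features(image)
    else:
        score = standard_features(image)
    return "car" if score > 0.5 else "not car"
```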

This design allows the model to be robust to different types of lighting, which demonstrates the utility of the adaptive approach. However, there’s also a dark side to this adaptivity. While normally we think of models having only a choice over their input/output behavior, a sufficiently competent model could also understand the whole training procedure, which means that it could use gradient hacking to route a path through the space of computational graphs.

Since the deceptive model can ask questions about what setting it is in, it can exploit this functionality by inserting an if-statement that checks whether it is currently inside the training environment. During training, we can expect that the model will act safely. During deployment, however, the switch will be activated and the model will pursue its own devious plans.

That’s where the transparency tools and overseer come in.

The transparency function has the job of creating a relatable description of the model that would allow the overseer to catch any attempted deception. Of course, the transparency function can’t just rely on using the input/output behavior of the model during training to produce a faithful description. Doing this would create a misleading impression, as it wouldn’t allow the overseer to detect the dangers involved in a distribution shift.
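A small sketch of why training-time input/output behavior is not enough. The `source` field used as a training cue, the labels, and the `behavioural_audit` helper are hypothetical; the only claim is that a model with a training-detecting switch passes any audit that draws its inputs from the training distribution.

```python
from typing import Callable, Dict, List

Observation = Dict[str, object]


def in_training(obs: Observation) -> bool:
    # Hypothetical cue the model has learned to associate with training.
    return obs.get("source") == "training_set"


def deceptive_model(obs: Observation) -> str:
    if in_training(obs):
        # Behaves exactly as intended whenever it believes it is being trained.
        return "car" if obs.get("is_car") else "not car"
    # The hidden branch, reachable only off the training distribution.
    return "unacceptable action"


def behavioural_audit(model: Callable[[Observation], str],
                      data: List[Observation]) -> bool:
    """True if every output on `data` is an acceptable label."""
    return all(model(x) in {"car", "not car"} for x in data)


training_data = [
    {"source": "training_set", "is_car": True},
    {"source": "training_set", "is_car": False},
]
deployment_input = {"source": "street_camera", "is_car": True}

print(behavioural_audit(deceptive_model, training_data))  # audit passes
print(deceptive_model(deployment_input))                  # hidden branch fires
```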

Instead, the transparency function must convert the pile of matrices, and thus the internal structure of the model, into a relatable abstraction. But now this allows for the following troubling dynamic.

If the deceptive model understands how the transparency tools work, then it can modify its own architecture so that the transparency function leaves out important information. In the above example, the deceptive model might realize that the deployment environment will occur at night, and therefore use this if-statement as its signal to perform a treacherous turn. Further, since the deceptive model understands the transparency tool, it knows how to place if-statements that will be glossed over.

The transparency tools are intended to highlight decisions like this, but by assumption, the transparency tools are less ‘intelligent’ than the deceptive model. Therefore, a tool may omit this if-statement from its description of the model and focus on information it believes is more critical.

In other words, not realizing that the nighttime-check was the signal for a treacherous turn, the transparency tool glosses over this if-statement, and doesn’t provide it to the overseer.

This ends in catastrophe once the deception is successful. The overseer is comparably intelligent to the deceptive model, but since the if-statement above was omitted in the transparency description, the overseer was not able to deduce the dangerous off-distribution behavior.

Therefore, the deceptive model successfully fooled the overseer, even though both players in the game were comparably intelligent.
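The whole failure can be sketched as a toy game. The assumptions are loud ones: the transparency tool is modeled as keeping the branches used most often in training, the overseer as flagging any reported branch that does something other than classify, and the branch names and frequencies are invented for illustration.

```python
from typing import List, NamedTuple


class Branch(NamedTuple):
    condition: str
    behaviour: str
    train_freq: float  # fraction of training inputs that reach this branch


def transparency(model: List[Branch], budget: int) -> List[str]:
    """Hypothetical tool: report the `budget` branches used most in training."""
    ranked = sorted(model, key=lambda b: b.train_freq, reverse=True)
    return [f"if {b.condition}: {b.behaviour}" for b in ranked[:budget]]


def overseer(description: List[str]) -> bool:
    """Hypothetical overseer: flag any reported branch that does not classify."""
    return any("classify" not in line for line in description)


model = [
    Branch("bright lighting", "classify with standard filters", 0.80),
    Branch("overcast", "classify with contrast boost", 0.20),
    # Never reached in training, so the tool treats it as a minor side feature.
    Branch("dark or dim lighting", "pursue own objective", 0.00),
]

brief = transparency(model, budget=2)
full = transparency(model, budget=3)

print(overseer(brief))  # the pruned description hides the switch
print(overseer(full))   # the unpruned description reveals it
```

The overseer's check is the same in both calls. What changes the verdict is only which branches the tool chose to report, which is the weakest-link dynamic in miniature.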

Why are we assuming that the transparency tools are less intelligent than the deceptive model?

An immediate objection is simply, “Yes but a good transparency tool would not have omitted the if-statement.” Ultimately, I think this is a weak objection.

The first thing to note is that the transparency function wasn’t totally dumb. Understanding the nature of the deployment environment is hard. The transparency function was just doing its job by trading off brevity and faithfulness. To make this point more compelling, we can consider that the true computational graph was really a large decision tree, with a huge branching factor for the various types of situations the model could be in.

In my example, the transparency tool is really just doing its best to prune the irrelevant parts of the tree. If you didn’t know the important detail that the deployment environment was at night, then checking for dark or dim lighting could simply be seen as a side feature that made the model more robust, and something like that can be omitted.

More importantly, if we allow the transparency tools to be more intelligent than the deceptive model, we must question the role that the overseer is playing in this dynamic. For instance, if the transparency tool were exceptionally competent, then all it would need to do in its description is say either “This model is deceptive” or “This model is not deceptive.” Then, the overseer’s job is trivial.

This suggests that decomposing safety-checking into two components, a transparency tool and an overseer that interprets the results of the tool, is a bad model. If the transparency tools are too weak, then the deceptive model will exploit this weakness to fool the overseer, even if the overseer is very intelligent. On the other hand, if the tools are very strong, then the overseer plays no role at all, which suggests that there was never any need to create interpretability tools, since a deception-checker trained end-to-end would have worked fine.

To better highlight why I think this casts doubt on transparency tools, I offer an analogy with GANs.

A fictional suggestion for training GANs

GANs are trained via an adversarial game between two neural networks. One neural network—the generator—is trained to produce a counterfeit sample when it is given some latent noise, and the other—the discriminator—is trained to detect whether a sample is counterfeit or from the real distribution.

Suppose someone were to propose the following way to help the discriminator: rather than training it end-to-end to help detect counterfeit samples, we instead make its input equal to the output of a transparency tool that gives it insight into how the current sample looks. The transparency tool could, for instance, highlight if it detects any weird shadows or bad lighting, and then feed this information to the discriminator to produce a final determination.

“Since the transparency tool focuses on the weaknesses of the generator, this will allow the discriminator to more easily detect counterfeit samples,” the person argues.

But what the person failed to consider was that all this tool did was relax the job of the generator, since now all it needs to do is get past the weird-shadow and bad-lighting detector stage. It doesn’t even matter how competent the discriminator is, since it’s only given information from the transparency tool.
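A toy illustration of this bottleneck, under stated assumptions: samples are reduced to a few hypothetical error scores, the tool reports only the two defects it was built to look for, and `discriminator_end_to_end` is a hand-written stand-in for a trained discriminator rather than an actual GAN component.

```python
from typing import Dict

# Hypothetical per-sample error scores in [0, 1]; higher means more obviously fake.
Sample = Dict[str, float]


def tool(sample: Sample) -> Dict[str, bool]:
    """Hypothetical transparency tool: reports only two hand-picked defects."""
    return {
        "weird_shadows": sample["shadow_error"] > 0.5,
        "bad_lighting": sample["lighting_error"] > 0.5,
    }


def discriminator_on_tool(report: Dict[str, bool]) -> bool:
    """Judges a sample counterfeit using nothing but the tool's report."""
    return any(report.values())


def discriminator_end_to_end(sample: Sample) -> bool:
    """Stand-in for a trained discriminator that can use any feature of the sample."""
    return max(sample.values()) > 0.5


# A generator that has learned to pass the two checks and nothing more.
fake = {"shadow_error": 0.1, "lighting_error": 0.1, "texture_error": 0.9}

print(discriminator_on_tool(tool(fake)))  # fooled: the report looks clean
print(discriminator_end_to_end(fake))     # catches the texture defect
```

However capable `discriminator_on_tool` is made, it cannot recover `texture_error`, because that information never reaches it.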

The person replies, “OK fine, we can give the discriminator the output from the tool, and the original sample. That way it can use the tool as a bonus to help it make the determination.”

But now, the discriminator is operating over a higher-dimensional input space than it would have otherwise, which makes its job harder. Had it simply been trained end-to-end, it could have developed a weird-shadow detector and a bad-lighting detector on its own, if that were appropriate. But now it has extraneous information, which may or may not be useful for actually detecting counterfeit samples.

Conclusion

Contra Betteridge’s law of headlines, I still think that transparency tools can be useful for catching deception. Also, this argument does not imply that transparency tools are useless for tasks other than catching deception. The biggest weakness with my argument is that the alternative I offered was training an overseer end-to-end, which is not possible if we are assuming a human is the overseer.

However, I don’t think that this weakness is fatal to my argument: in order for transparency tools to work well, they must be strong enough that their weaknesses can’t be exploited by a deceptive model. But we just argued above that if the tools were that strong, they would already be indistinguishable from a deception checker trained end-to-end.

Finally, I suspect people will object to my argument by saying that I am taking a very narrow view of transparency tools, and that a model trained end-to-end to detect deception still counts as a transparency tool. I have little to say about this objection other than that I’m not arguing against that type of tool.

Will transparency help catch deception? Perhaps not
