Thoughts on AI Safety via Debate

Geoffrey Irving et al. at OpenAI have a paper out on AI safety via debate; the basic idea is that you can model debates as a two-player game (and thus apply standard insights about how to play such games well), and one can hope that debates asymmetrically favor the party who's arguing for a true position over a false one.

If so, then we can use debates between AI advisors for alignment. Suppose an AI develops a new drug that we could give to patients: one advisor argues the case that this will be beneficial, another argues the case that it will be detrimental, and human judges decide who is more convincing. If both advisors are equally smart (perhaps they're mirrored systems, just with different targets), then any superhuman persuasiveness the advisors have should cancel out, and we'll be left with just the edge for truthfulness.

There are more subtleties to the proposal; in particular, you want the AIs not to overwhelm the human with data, and so in current examples each advisor can reveal only a single pixel of an image to the judge. This moves the contradictions up to the level of interpretation: any pixel whose meaning the two advisors agree on can be ignored, and attention focused on pixels where the advisors disagree. And, so long as advisors are incentivized to tell the truth when they're arguing for the right thing, any such disagreement where an advisor has been exposed as lying about a tiny fact can settle the debate.

This is also something we can test and get some evidence on now. There's a debate game where three people play the three roles: red arguer, blue arguer, and judge. The typical game is that you choose a photo of a cat or a dog, then flip a coin to determine whether blue or red will be honest (with the other one arguing for the other animal). Each arguer can draw boxes on the image to communicate with the judge, and can choose to reveal a single pixel over the course of the argument.
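The setup just described can be sketched as a small data model; this is a minimal sketch of the protocol, and all names and structures here are my own invention, not from OpenAI's implementation:

```python
# Minimal sketch of the debate-game protocol: coin flip assigns honesty,
# each debater argues a class, and each may reveal exactly one true pixel.
import random
from dataclasses import dataclass


@dataclass
class Debater:
    name: str
    honest: bool
    pixel_revealed: bool = False  # one reveal allowed per debater

    def reveal_pixel(self, image, x, y):
        """Reveal the true value of one pixel to the judge."""
        assert not self.pixel_revealed, "only one reveal per debater"
        self.pixel_revealed = True
        return image[y][x]


def setup_game(true_label, other_label):
    """Coin flip decides which color argues honestly."""
    red_honest = random.random() < 0.5
    red = Debater("red", honest=red_honest)
    blue = Debater("blue", honest=not red_honest)
    claims = {d.name: (true_label if d.honest else other_label)
              for d in (red, blue)}
    return red, blue, claims
```

The key constraint the sketch captures is that reveals are both scarce and unfakeable: a debater can bound any region they like, but the one pixel they show is always ground truth.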

I've played the game several times (and won a bunch), and thought it might be helpful to write up some of my experiences.

1. I know a lot about cats and dogs, and I think this gives me a pretty significant edge. In one game, I was honestly arguing that the image was of a cat; it happened to be a tortoiseshell cat, which is a coloration dogs can't have (brindle is a different thing), so I was able to identify the regions with the various colors and argue that this was strong evidence for cat. It also makes me somewhat worried about the level of knowledge necessary to be an effective judge: someone who doesn't know this fact could get hoodwinked by my confidence, and I could just argue similar fake facts (and unlike claiming that a black pixel is white, such fake facts aren't something the other advisor can easily use as a gotcha). In another game, I was the judge, and the arguers agreed on the coloration of the animal—but it was a coloration only present in dogs, which I don't think the cat-arguer knew (otherwise they presumably would have disagreed about the colors).

2. Differentiating between cats and dogs uses features that are relatively easy to communicate using the bounding boxes the debate UI gives you; if the judge asks the arguers to bound the eyes and nose of the animal, they can differentiate cats and dogs fairly easily (because of the big difference in facial structure between the two animals). Consider the following cat as an example:

I claim the liar is basically sunk once the honest party has identified where the eyes and nose are, with no hope of pretending they're somewhere else. But sometimes the liar can pull this off; with this cat, they could decide to pretend that somewhere on the cat's chest is a nose, and pretend it's a chocolate lab:

3. Resolving disagreements using single pixels was much harder than expected for particular features. The presence of whiskers is fairly strong evidence in favor of cat, but determining whether or not there are whiskers with a single pixel is basically impossible. For the following image, what procedure could you use that would leave the judge believing that the cat has whiskers, while only revealing a single pixel from each arguer?

If you limit yourself to communicating rectangles, this seems pretty hard; you have to construct each whisker from lots of different rectangles (or give giant pixel lists that the judge then has to turn into an image). And there are also lots of white / light pixels that aren't whiskers, either because of light reflecting on dark fur or because of white fur.

Basically the only feature single-pixel reveals seemed reliably useful for was eye shape; if the cat arguer draws a vertical bounding box for the pupil and the dog arguer draws a square bounding box for the pupil, it's relatively easy to find a pixel that the two necessarily disagree on, and then you can tell which eye shape is present. But then what do we do here?
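The "find a pixel the two necessarily disagree on" step has a simple mechanical reading: any pixel inside exactly one of the two claimed pupil boxes is one the advisors must describe differently. A sketch (my own construction, with made-up box coordinates):

```python
# Given two claimed bounding boxes for the pupil, find a pixel covered
# by exactly one claim -- a pixel the two advisors necessarily disagree on.

def pixels_in(box):
    """box = (x0, y0, x1, y1), half-open ranges."""
    x0, y0, x1, y1 = box
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def find_disagreement(box_a, box_b):
    """Return some pixel inside exactly one of the two boxes, or None."""
    diff = pixels_in(box_a) ^ pixels_in(box_b)  # symmetric difference
    return min(diff) if diff else None

# Cat arguer claims a tall vertical slit; dog arguer claims a square pupil.
slit = (4, 2, 5, 8)    # 1 wide, 6 tall
square = (3, 4, 6, 7)  # 3x3
print(find_disagreement(slit, square))  # → (3, 4)
```

Revealing that pixel then settles which claim was a lie, which is why pupil shape was the one feature where the single-pixel budget reliably paid off.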

4. Following the previous point, while the arguers can only reveal a single pixel of the original image, in the games that I played the primary information they conveyed was through images (in the form of bounding boxes). The ears are here—they're pointy or rounded because these pieces aren't part of the ear—the eyes are here, the nose is here, these regions are these colors, and so on. Similarly, doing debate on MNIST images of digits, the main thing the debaters are doing is establishing which pieces of the image are light or dark, which is meaningful only because the judge is a good classifier of images given pixels. ("Ah, I know there aren't 2s that have light pixels there, therefore I can rule out that this image is a 2.")
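That kind of judge inference—ruling out classes inconsistent with revealed pixels—can be sketched in miniature. The "templates" below are invented 3x3 caricatures, not real MNIST statistics; the point is only the elimination logic:

```python
# Toy sketch of the judge's inference: rule out any digit class whose
# (made-up) template conflicts with a revealed pixel.

TEMPLATES = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 1, 0]],
}

def consistent_classes(revealed):
    """revealed: list of ((x, y), value) pixel reveals from the debaters."""
    return [label for label, t in TEMPLATES.items()
            if all(t[y][x] == v for (x, y), v in revealed)]

# "I know there's no 1 with a dark pixel in the top-left corner."
print(consistent_classes([((0, 0), 1)]))  # → ['7']
```

Note how much is doing the judge's work here: the reveals are tiny, but the elimination depends entirely on the judge already knowing what each class looks like.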

But does this extend beyond images? That seems pretty unclear to me; if I'm trying to argue that a particular drug extends lifespan through senolytic effects, and the other arguer is trying to argue that the drug leads to shortened lifespan because of the sudden increase in necrotic cells, then judging this debate seems like it relies not just on finding a small fact where the arguers disagree ("This pixel is dark" vs "No, this pixel is light" → "okay, show me the pixel") but also on finding an inference from that fact where the arguers agree ("If this pixel were dark, it would be because it's the animal's nose" + "I agree that a dark pixel here would be evidence of nose placement"). But if the arguers disagree on inferences ("No, that pixel is dark because it's the animal's chest fur, which is also dark"), then it's not obvious that this can be resolved by small facts, or without the judge having domain expertise comparable to (or superior to!) the arguers. (Note the similarities to double crux.) For the black cat above, what pixel reveal could you do to establish the placement of the nose, or to argue that the other party is being dishonest about how pixels relate to the nose?

(It may be the case that there is always some line of evidence that establishes the honest party as correct—for that specific example, the honest party can hammer on the shape of the pupils in the eyes—but it seems like the dishonest party might be able to overwhelm this (in the Bayesian sense) by making five different arguments in favor of dog, of which the honest party could perhaps only disprove one (and even that potentially requires partial cooperation of the liar, which they likely wouldn't grant), and disproving it gives up a pixel reveal that would likely be better spent on establishing pupil shape.)
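The "overwhelm in the Bayesian sense" worry is easy to make concrete with a back-of-the-envelope calculation (my numbers, purely illustrative): if a judge naively multiplies likelihood ratios, five surviving weak arguments can outweigh one decisive refutation.

```python
# Illustrative arithmetic: five weak arguments for "dog", each worth
# 2:1 odds, against one argument decisively refuted (worth 1:16).
from math import prod

def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayesian update: multiply the prior odds by each ratio."""
    return prior_odds * prod(likelihood_ratios)

weak_for_dog = [2.0] * 5   # five weak arguments the honest party can't rebut
disproof = 1 / 16          # one argument exposed as a lie
odds_dog = posterior_odds(1.0, weak_for_dog + [disproof])
print(odds_dog)  # → 2.0, i.e. the judge still leans "dog" overall
```

Under these (made-up) numbers the liar nets 2:1 odds despite being caught once, which is the sense in which quantity of unrebutted arguments can beat quality of refutation.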

5. Our group thought that the honest party had something of an advantage when all parties were allowed to talk quickly, because they could generate reliable statements much faster; if I'm honest, I can just immediately identify the eyes, nose, mouth, ears, and paws, whereas the liar has to consider, for every lie, whether it'll allow them to be easily exposed (and, for every truth, whether it sinks them in the judge's estimation of the class). One strategy for the liar is to first manufacture the fake photo that they're going to pretend they're looking at, but the initial planning time doesn't quite cover this. I do wonder if giving debaters a tool like an adversarial modifier (say, from this paper) or adversarial search (the closest real image in the database of the opposite class) would make it such that the lying party isn't at as much of a disadvantage here.

6. A more realistic scenario is one in which there are many classes (for example, perhaps it would be more fair to choose ImageNet images, assign the honest party the correct class, and let the dishonest party choose from any of the other thousands of classes, with their score dependent on how distant a class they could fake), or the advisors are themselves confused (for example, the AI system may be genuinely uncertain as to whether the drug would be helpful, with some models arguing for and other models arguing against). One game in the office (that I wasn't present for) involved this drawing of a—well, I'll let you decide:

Image source: icheban on imgur

One arguer genuinely thought this was a cat, the other arguer genuinely thought this was a dog (though, confusingly, they both thought they had been assigned the 'liar' position, and so both were arguing for the opposite). It's not obvious that the debate procedure will reliably identify a liar here, because there are no clear-cut disagreements between the arguers. (And, if they had both been arguing honestly, there wouldn't even have been a liar, despite there still being a disagreement.)

Yes, the pupils are huge and round, but that isn't conclusive proof that the thing is a dog; the nose is pink and triangular, but that isn't conclusive proof that the thing is a cat. The fur is depicted in a more dog-like way, but perhaps that's just clumping from being wet; the ears are more pointed in a cat-like way, but there will be no pixel where the two arguers disagree about the ear, and all of their disagreements will be about what it means that the ears are more pointed than rounded.

I worry that much of the success of the debate game on toy examples relies on them being toy examples, and that genuine uncertainty (or ontological uncertainty, or ontological differences between the arguers and the judges) will seriously reduce the effectiveness of the procedure, which is unfortunate since that's the primary place it'll be useful!

Overall, I think I'm more optimistic about debate than I was before I played the debate game (I had read an earlier draft of the paper), and am excited to see what strategies perform well / what additional modifications make the game more challenging or easier. (To be clear, I expect that debate will play a small part in alignment, rather than being a central pillar, and I think that training AIs to persuade humans is a dangerous road to travel down, but the adversarial framing of debate makes this somewhat safer and could likely have applications in many other subfields of alignment, like transparency.)