Three mental images from thinking about AGI debate & corrigibility

Here are three mental images I’ve used when sporadically struggling to understand the ideas and prospects for AI safety via debate, IDA, and related proposals. I have not been closely following the discussion, and may well be missing things, and I don’t know whether these mental images are helpful or misleading.

Reading this post over, I seem to come across as a big skeptic of these proposals. That’s wrong: My actual opinion is not “skeptical” but rather “withholding judgment until I read more and think more”. Think of me as “newbie trying to learn”, not “expert contributing to intellectual progress”. Maybe writing this and getting feedback will help. :-)

1. AGI Debate as water-skiing behind a pair of nose-to-nose giant rocket engines

In AI safety via debate, we task two identical AGIs with arguing opposite sides of a question. That has always struck me as really weird, because one of them is advocating for a false conclusion—perhaps even knowingly! Why would we do that? Shouldn’t we program the AGIs to just figure out the right answer and explain it to us?

My understanding is that one aspect of it is that two equal-and-opposite AGIs (equal power, opposite goals) would keep each other in check, even if the AGIs were each very powerful.

So imagine you row to an island in the center of a little mountain lake, but then your boat gets eaten by beavers, and it’s too far to swim to shore. What you do have on your little island is a giant, 100,000kg rocket engine with no throttle. Once you start it, it burns uncontrollably until it’s out of fuel, by which point it’s typically way out in outer space! Oh, and the rocket also has a crappy steering system—coarse controls, laggy, poor feedback.

So what do you do? How do you cross the 300 meters of water to shore?

The answer is obvious: You do a copy-and-paste to make a second giant rocket engine, and build a frame that keeps the two pointed almost exactly nose-to-nose. Then you turn them both on simultaneously, so they just press on each other, and don’t go anywhere. Then you use the steering mechanism to create a tiny imbalance in the direction you want to move, and you gently waterski to shore. Success!

This analogy naturally suggests a couple concerns. First, the rocket engines might not be pointed in exactly opposite directions. This was discussed in Vojtech Kovarik’s recent post AI Unsafety via Non-Zero-Sum Debate and its comment thread. Second, the rocket engines may not have exactly equal thrust. It helps that you can use the same source code for your two AGIs, but an AGI may not be equally good at arguing for X vs against X for various random reasons unrelated to X being true or false, like its specific suite of background knowledge and argumentative skills, or one of the copies getting smarter by randomly having a new insight when running, etc. I think the hope is that arguing-for-the-right-answer is such a big advantage that it outweighs any other imbalance. That seems possible but not certain.

2. Deliberation as “debate inside one head”

The motivation for this mental image is the same as the last one, i.e. trying to make sense of AGI debate, when my gut tells me it’s weird that we would deliberately make an AGI that might knowingly advocate for the wrong answer to a question.

Imagine you’re presented with a math conjecture. You might spend time trying to prove it, and then spend time trying to disprove it, back and forth. The blockages in the proof attempt help shed light on the disproof, and vice-versa. See also the nice maze diagrams in johnswentworth’s recent post.

By the same token, if you’re given a chess board and asked what the best move is, one part of the deliberative process entails playing out different possibilities in your head—if I do this, then my opponent would do that, etc.

Or if I’m trying to figure out whether some possible gadget design would work, I go back and forth between trying to find potential problems with the design, and trying to refute or solve them.

From examples like these, I get a mental image where, when I deliberate on a question, I sometimes have two subagents, inside my one head, arguing against each other.

Oh, and for moral deliberation in particular, there’s a better picture we can use… :-)

Anyway, I think this mental image helps me think of debate as slightly less artificial and weird. It’s taking a real, natural part of deliberation, and bringing it to life! The two debating subagents are promoted to two full, separate agents, but the core structure is the same.

On the other hand, when I introspect, it feels like not all my deliberation fits into the paradigm of “two subagents in my head are having a debate”—in fact, maybe only a small fraction of it. It doesn’t feel like a subagent debate when I notice I’m confused about some related topic and look into it, or when I “play with ideas”, or look for patterns, etc.

Also, even when I am hosting a subagent debate in my head, I feel like much of the debate’s productivity comes from the fact that the two subagents are not actually working against each other, but rather each is keeping an eye out for looking for insights that help the other, and each has access to the other’s developing ideas and concepts and visualizations, etc.

And by the way, how do these AGIs come up with the best argument for their side anyway? Don’t they need to be doing good deliberation internally? If so, can’t we just have one of them deliberate on the top-level question directly? Or if not, do the debaters spawn sub-debaters recursively, or something?

3. “Corrigibility is a broad basin of attraction” seems improbable in a high-dimensional space of possible algorithms

(Quote by Paul Christiano, see here.)

Let’s say that algorithm X is a corrigible algorithm, in a million-dimensional space of possible algorithms (maybe X is a million-parameter neural net).

To say “corrigibility is a broad basin of attraction”, you need ALL of the following to be true:

If X drifts away from corrigibility along dimension #1, it will get pulled back.

AND, If X drifts away from corrigibility along dimension #2, it will get pulled back.

AND, If X drifts away from corrigibility along dimension #3, it will get pulled back.


AND, If X drifts away from corrigibility along dimension #1,000,000, it will get pulled back.

With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)

What exactly might a problematic drift direction look like? Here’s what I’m vaguely imagining. Let’s say that if we shift algorithm X along dimension #852, its understanding /​ instincts surrounding what it means for people to want something get messed up. If we shift algorithm X along dimension #95102, its understanding /​ instincts surrounding human communication norms get messed up. If we shift algorithm X along dimension #150325, its meta-cognition /​ self-monitoring gets messed up. OK, now shift X in the direction , so all three of those things get messed up simultaneously. Will it still wind up pulling itself back to corrigibility? Maybe, maybe not; it’s not obvious to me.