One important difference between data storage on the one hand and computation or AI on the other: courtesy of Shannon and Hamming, we have a really good understanding of information transmission (which includes information storage). All those nice error-correction codes are downstream of very well-understood theory.
If we had theory as solid as information theory for AI and alignment, then yeah, I’d be a hell of a lot more optimistic about using one AI to oversee another somewhere in the process. Like, imagine we had the alignment analogue of an error-correcting code which provably detects two-bit errors and corrects one-bit errors with only a logarithmic amount of overhead. With theory that strong (and battle-tested in reality), it becomes plausible that unknown unknowns won’t inevitably ruin all our plans.
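For concreteness, here’s a toy sketch of the kind of guarantee I mean. The extended Hamming(8,4) code really does provably correct any single-bit error and detect any double-bit error, and for the larger codes in the Hamming family the parity overhead grows only logarithmically with block length. (The code below is just an illustration, not pulled from any particular library.)

```python
def encode(d):
    """d: list of 4 data bits -> list of 8 coded bits (extended Hamming(8,4))."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                      # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                      # covers positions 4, 5, 6, 7
    block = [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7
    p0 = 0
    for b in block:
        p0 ^= b                            # overall parity bit, position 0
    return [p0] + block

def decode(r):
    """r: list of 8 received bits -> (status, 4 data bits or None)."""
    syndrome = 0
    for pos in range(1, 8):
        if r[pos]:
            syndrome ^= pos                # XOR of positions with a 1 bit
    overall = 0
    for b in r:
        overall ^= b                       # parity over the whole block
    if syndrome and overall:               # single-bit error: correctable
        r = r[:]
        r[syndrome] ^= 1
        return "corrected", [r[3], r[5], r[6], r[7]]
    if syndrome and not overall:           # two bits flipped: detectable only
        return "double error detected", None
    if overall:                            # the flip hit the overall parity bit
        return "corrected", [r[3], r[5], r[6], r[7]]
    return "ok", [r[3], r[5], r[6], r[7]]

word = encode([1, 0, 1, 1])
word[5] ^= 1                               # flip one bit
print(decode(word))                        # -> ('corrected', [1, 0, 1, 1])
word[2] ^= 1                               # flip a second bit
print(decode(word))                        # -> ('double error detected', None)
```

The point is the shape of the guarantee: a provable, exhaustive case analysis over failure modes, battle-tested in hardware for decades. Nothing in alignment currently has that shape.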
Well, the basic idea “adding more safeguards decreases the likelihood they’ll all fail simultaneously, as long as there isn’t a perfect correlation of failure modes” is a simple mathematical fact. “What is the probability of this safeguard failing to detect a rogue AI?” is hard to answer, but “What might this new safeguard do that the other safeguards don’t do?” is easier.
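To spell out the math fact, writing F_i for the event that safeguard i fails:

```latex
\[
P\Bigl(\bigcap_{i=1}^{n+1} F_i\Bigr)
  \;=\; P\Bigl(\bigcap_{i=1}^{n} F_i\Bigr)
        \cdot P\Bigl(F_{n+1} \,\Big|\, \bigcap_{i=1}^{n} F_i\Bigr)
  \;<\; P\Bigl(\bigcap_{i=1}^{n} F_i\Bigr)
\]
```

The strict inequality holds exactly when the new safeguard sometimes works even in the worlds where every existing safeguard has failed—i.e., when its failure mode isn’t perfectly correlated with the rest. Under full independence the joint failure probability shrinks to the product of the individual failure probabilities; under perfect correlation the conditional probability is 1 and adding safeguards buys nothing.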
For example: if interpretability work gets anywhere, one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like “how to detect security holes in C or machine code” or “how quickly humans die to certain poisons” (when that’s not supposed to be the goal); safeguards that check for parts of the net that have many nodes and are not understandable by the other safeguards; safeguards that inspect the usage of CPU or other resources and have some idea of what’s usual; safeguards that try to look for the net thinking strategically about what resource usage looks natural; and so on. These safeguards might all suck / only work in a small fraction of cases, but if you have hundreds or thousands of them, then your odds might get decent.
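As a back-of-the-envelope illustration of the “hundreds or thousands” point (numbers entirely made up, and assuming the safeguards miss independently—exactly the assumption questioned next):

```python
# If each safeguard alone catches a given bad behavior with probability q,
# and misses are independent, the chance that at least one fires is
# 1 - (1 - q)^n.  All numbers here are invented for illustration.

q = 0.02   # each safeguard alone catches the behavior 2% of the time
for n in (1, 10, 100, 500):
    p_detect = 1 - (1 - q) ** n
    print(f"{n:4d} safeguards -> P(at least one fires) = {p_detect:.3f}")

#    1 safeguards -> P(at least one fires) = 0.020
#   10 safeguards -> P(at least one fires) = 0.183
#  100 safeguards -> P(at least one fires) = 0.867
#  500 safeguards -> P(at least one fires) = 1.000
```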
Or, at least, you might push the correlated failure mode into “the AI has guessed that we have a bunch of safeguards, has imagined the set of safeguards people at our level of competence are likely to have come up with, and is strategically avoiding them”, in which case that strategic thinking is itself a thing you try to detect. (It would be nice if we ended up being able to prove that the strategic thinking needed to defeat all our “dangerous thinking detectors” required at least N neurons, and we were able to prune down all nets to the point where fewer than N not-understood neurons existed.) And—as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.
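Here’s a purely hypothetical sketch of that parenthetical’s pruning check. Both the proven bound N and the per-neuron is_understood oracle are invented for illustration; nothing like either currently exists:

```python
import random

N_STRATEGIC_LOWER_BOUND = 10_000   # hypothetical proven bound on the neurons
                                   # needed to defeat the detectors

def audit(neurons, is_understood):
    """Pass iff the not-understood remainder is provably too small to host
    strategic reasoning against our detectors."""
    opaque = [n for n in neurons if not is_understood(n)]
    return len(opaque) < N_STRATEGIC_LOWER_BOUND, opaque

# Toy usage: a million neuron ids, with a stand-in "oracle" that understands
# 99.5% of them at random.
random.seed(0)
ok, opaque = audit(range(1_000_000), lambda n: random.random() < 0.995)
print(ok, len(opaque))   # True, roughly 5,000 opaque neurons: under the bound
```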
What exactly are you objecting to? Your original post seems to be saying “the entire category of strategies that involves one AI saving us from another is dangerous, doomed, and should be dismissed as ridiculous”, but you don’t seem to be pushing back on my claims that you can cut off Godzilla’s limbs and tie his jaws around Mega-Godzilla’s neck and that the result is net positive. Were you arguing against much more reckless proposals, or people being overconfident in unspecified proposals, or...? The possibility occurs to me that the OP is intentionally overstated in the hopes of provoking a well-thought-out proposal (it did say it anticipated much discussion), along the lines of Cunningham’s Law.
I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment—for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways). But with interpretability tools that good, a plan without Godzilla would probably also have a decent chance of working (or at least failing in detectable-in-advance ways), by doing basically the same things minus Godzilla. It’s the interpretability tools which take that plan from “close to zero chance of working” to “close to 100% chance of working”; the interpretability is where all the robustness comes from. The Godzilla part adds relatively little and is plausibly net negative (due to making the ML components more complex and brittle).
(Another minor point: “adding more safeguards decreases the likelihood they’ll all fail simultaneously, as long as there isn’t a perfect correlation of failure modes” is only true when the “safeguards” are guaranteed to not increase the chance of failure.)
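A toy version of that caveat, with invented numbers and the two failure paths treated as disjoint for simplicity:

```python
# A safeguard that catches a rogue AI some of the time, but introduces a new
# catastrophic failure path of its own, can raise overall risk.

p_rogue = 0.10   # baseline: chance a rogue AI arises and slips through
q_catch = 0.30   # chance this safeguard catches it
r_new   = 0.05   # chance the safeguard itself causes a catastrophe

risk_without = p_rogue
risk_with = p_rogue * (1 - q_catch) + r_new
print(f"{risk_without:.2f} vs {risk_with:.2f}")   # 0.10 vs 0.12: net negative
```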
“And—as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.”
I see two main things you could have in mind here.
First, maybe you imagine training a magic black box to tell us what’s going on inside another magic black box. This is a Godzilla strategy, and fails for the usual Godzilla reasons: errors are not recoverable. If the magic interpretability box fails, we don’t have a built-in way to notice. And no, training lots of magic black boxes to detect lots of things does not really fix the problem; the failure modes are extremely highly correlated. We don’t even need to suppose deception—just a distribution shift in the cognition of the system will cause highly correlated failures, and a distribution shift in cognition is exactly the sort of thing we’d expect from a system just starting to grok consequentialist reasoning.
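Here’s a toy simulation of that correlated-failure mode, with every detail invented: twenty nominally different detectors that all learned the same surface correlate succeed separately in-distribution, then fail together the moment the underlying cognition shifts.

```python
import numpy as np

rng = np.random.default_rng(0)

def dangerous_examples(n, shift=False):
    # 20 observable features, each a noisy copy of one surface correlate that
    # sits near +1 for dangerous cognition in-distribution; the shift changes
    # how dangerous cognition presents, moving the correlate to -1.
    center = -1.0 if shift else 1.0
    return center + 0.3 * rng.normal(size=(n, 20))

def alarms(X):
    # One "detector" per feature, thresholded at 0.5 (fit in-distribution,
    # where benign examples cluster near 0 and dangerous ones near 1).
    return X > 0.5

before = alarms(dangerous_examples(5000))
after = alarms(dangerous_examples(5000, shift=True))

print((~before).mean())              # ~0.05: each detector misses ~5% of cases
print((~before).all(axis=1).mean())  # ~0.0: all 20 essentially never miss at once
print((~after).all(axis=1).mean())   # ~1.0: after the shift, they all miss together
```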
On the other hand, maybe you imagine AIs serving as research assistants, rather than using AIs directly to interpret other AIs. That plan does have problems, but is basically not a Godzilla plan; the human in the loop means that the standard Godzilla brittleness issue doesn’t really apply.
My reply to the top-level post here is also relevant as a reply to this specific comment.