Surprised by ELK report’s counterexample to Debate, IDA

Summary

I had assumed the original ELK report had fundamental objections to Debate and IDA in terms of their robustness. Re-reading the report, I was surprised to find that the only counterexample it provides for these proposals is that they don’t seem computationally competitive compared to unaligned AI.

ELK’s iconic SmartVault AI toy scenario seems like a pretty difficult world in which to imagine this kind of competitiveness mattering. I provide a humorous fictional dialogue at a yacht party to illustrate this point.

My best guess is that the ELK report authors just didn’t get around to doing in-depth plot continuity editing on their SmartVault scenario, but that they care about competitiveness because of its importance in the real world. Still, it would be good to clarify. And also: is this really the main objection to Debate and IDA for ELK?

The ELK report on Debate and IDA

I was re-reading the original Eliciting Latent Knowledge (ELK) report from ARC. At first it seemed to me like there was an unsupported claim that Debate and Iterated Distillation and Amplification (IDA) can’t solve ELK:

Unfortunately, none of these strategies seem sufficient for solving ELK in the worst case. In particular, after considering strategies like Debate, Iterated Distillation and Amplification and Imitative Generalization — and even assuming that every other counterexample to those strategies could be overcome — we believe they still don’t address ELK.

As I dug into it more though, I realized I had read too quickly over the section where Paul, Ajeya and Mark explain this. It’s called Strategy: have AI help humans improve our understanding.

In this section, the authors explain some of the potential of using Debate or IDA, before introducing New counterexample: gradient descent is more efficient than science, which they apparently consider to invalidate these strategies for ELK:

This means it’s plausible that an AI assistant who can automate the process of doing science well enough to impart us with all the SmartVault AI’s knowledge can only be trained using much more total computation than the original SmartVault AI itself.

As an intuition, imagine the cost of training a more-powerful future version of AlphaFold vs a model that can break down and explain why a particular protein folds in a particular way; it’s very plausible that the latter would be significantly more costly to train. We could imagine a future in which the best way to understand biology was not by doing anything resembling normal “science,” but simply by using gradient descent to learn large neural networks that made predictions about biological systems. At that point there would be no guarantee that humans understood anything about these models beyond the fact that they made good predictions.

This is a counterexample because our goal is to find a competitive solution to ELK — one that uses only a small amount more computation than training the unaligned benchmark. So now we are concerned about a bad reporter which does best-effort inference in the best human-understandable Bayes net that could be achieved in the amount of “doing science” we can automate within that budget.

As far as we and our AI assistants can tell, this reporter would answer all questions accurately on the training distribution. But if the SmartVault AI later tampers with sensors by exploiting some understanding of the world not reflected even in the best human-understandable Bayes net, it would fail to report that.

In short, they are saying that Debate and IDA would require significantly more compute than the unaligned SmartVault AI, and so these strategies wouldn’t be competitive with it.

I was surprised to find this was the only counterexample that the ELK report provides against Debate and IDA. (Aside from brief mentions of inner alignment concerns, which they give the benefit of the doubt on and assume can be solved. And assuming I’m not missing a more detailed explanation somewhere else.) I was expecting to find some game-theoretic smackdown showing why these strategies are fatally flawed in a way I’d never thought about before. Instead, all I found was that they might need a lot of compute.

First of all, the specific story provided in this counterexample for why this compute inefficiency is risky does not seem particularly strong:

As far as we and our AI assistants can tell, this reporter would answer all questions accurately on the training distribution. But if the SmartVault AI later tampers with sensors by exploiting some understanding of the world not reflected even in the best human-understandable Bayes net, it would fail to report that.

This is suggesting that we are using some unaligned training method for the SmartVault AI, and we’re just making use of Debate or IDA to train the reporter. For some reason, we fail to provide this reporter with enough compute to keep up with the sophistication of the SmartVault AI.

But a well-designed Debate- or IDA-based solution wouldn’t make this mistake. It would insist on using enough compute to have a robust reporter. Or alternatively, it would make use of Debate or IDA to train the SmartVault AI itself (rather than just the reporter). For example, instead of having actions selected by a black-box AI, action sequences could become the subject of AI debate themselves, with the chosen actions decided upon by the human judge assisted by the arguments from the debates.
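To make the second alternative concrete, here is a minimal toy sketch of what debate over action sequences might look like. All names and the scoring logic are hypothetical illustrations, not anything from the ELK report: two debaters produce arguments about each candidate sequence, and a judge (standing in for a human assisted by the debate transcript) scores the candidates and picks a winner.

```python
# Toy sketch of debate-based action selection (all names hypothetical).
# Instead of a black-box policy picking actions directly, candidate
# action sequences are debated, and a judge picks the winner.

def debate(action_sequence, debater_pro, debater_con, rounds=2):
    """Collect alternating arguments for and against an action sequence."""
    transcript = []
    for _ in range(rounds):
        transcript.append(("pro", debater_pro(action_sequence, transcript)))
        transcript.append(("con", debater_con(action_sequence, transcript)))
    return transcript

def select_action_sequence(candidates, debater_pro, debater_con, judge):
    """Debate each candidate, then return the highest-scoring sequence."""
    scored = []
    for seq in candidates:
        transcript = debate(seq, debater_pro, debater_con)
        scored.append((judge(seq, transcript), seq))
    return max(scored)[1]

# Stub debaters and judge for illustration only: this judge simply
# penalizes sequences that open the vault door.
pro = lambda seq, transcript: f"'{seq}' protects the diamond"
con = lambda seq, transcript: f"'{seq}' risks sensor tampering"
judge = lambda seq, transcript: -seq.count("open_door")

best = select_action_sequence(
    ["close_door;arm_camera", "open_door;fake_photo"], pro, con, judge)
print(best)  # -> close_door;arm_camera
```

The point of the sketch is just the control flow: the action sequence itself is the subject of the debate, so whatever understanding the debaters can surface reaches the judge before any action is taken, rather than after the fact via a separate reporter.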

Exploring competitiveness in the SmartVault AI universe

I think competitiveness could be very important in the real world of AI alignment. To the extent that leading AI labs perceive a zero-sum race to train the first transformative AI or AGI, keeping costs low and not sacrificing capabilities could be crucial for aligned AI proposals.

However, the ELK toy scenario (i.e. SmartVault AI) doesn’t really provide a backdrop where competitiveness concerns make sense. In fact, it almost seems designed to remove competitiveness concerns. This scenario is just about protecting your diamond. It doesn’t matter if someone else finds a better way to protect their diamond, or if they do it sooner than you do. If they do, then good for them!

In the world of SmartVault AI, you find yourself attending a yacht party with your chums. You know you’re the sort of person to attend a yacht party, because you’re also the sort of person who has problems like figuring out how to protect your diamonds. This also makes you prone to call your friends “chums” and things like that.

So you’re standing around on your chum’s yacht with the others, sipping Chardonnay and tasting hors d’oeuvres. Someone brings up the topic of diamond protection.

Your chum (speaking to you): “Hey, I remember you got a new diamond protection system a couple months ago. What was it called… FancySafe or something?”

Another chum: “Oh yea, SmartVault I think it was. I remember, he was pretty excited about it! How’s that working out for you?”

You feel the embarrassment welling up in you and your face turning red. You look down at your glass and swirl your Chardonnay.

You: [clears throat] “Well… not so good actually. You know I went on vacation with my croquet league for a couple weeks? Well, I was checking in on the SmartVault app. Everything looked fine, I could see the vault door closed and my diamond sitting there right where it should be. Then when I got back home I went to check on the vault myself. The door was wide open, my diamond was gone, and the SmartVault’s mechanical arm was holding a photo of the room as it should look just in front of the camera.”

Your chums are silent for a few moments before they burst into tears of laughter.

Your chum: [laughing] “You moron, we told you not to trust that thing!”

Another chum: “Yea, all the positive reviews were written by people who had just bought it. Those ELK people even warned the system could do something like this, but you were so stubborn. You insisted it was the perfect diamond protection system. You said we were wasting our money to pay 5 times more for the debate-based smart vault. How many kilograms was your diamond again?”

You swallow the rest of your Chardonnay, miserable at the thought of your lost diamond. And you’re pissed at yourself: how could you be so short-sighted? SmartVault was the cheapest option, and the salesperson was so compelling. She kept saying, “We’re the most competitive diamond protection system on the market!” But now you’ve learned the lesson the hard way that if a system can’t actually protect your diamond, then it’s not competitive at all. Leaving your diamond sitting on the sidewalk would be pretty cheap too, but how effective is that?

Conclusion

Perhaps I am nitpicking a toy scenario which the ELK report authors know doesn’t capture everything we care about. Even though competitiveness seems like an out-of-left-field concern in the SmartVault AI scenario, they know it will matter in the real world, and that’s what they actually had in mind when they provided the counterexample for Debate and IDA based on competitiveness concerns.

This is my best guess of what’s going on. But since ELK is a pretty central problem in the AI alignment community now, I think it’s important to clear up potential misunderstandings and be explicit about our thoughts on it and how each proposal is evaluated against it.

I am also still surprised that compute-competitiveness is the main published concern I can find about Debate and IDA vis-à-vis ELK. This is a positive update for me on the viability of these proposals. I hope this post may lead to some clarifying discussion about how we should prioritize the competitiveness of alignment proposals, as well as further discussion on whether Debate and IDA are viable solutions to the ELK problem.
