Thanks, I’ll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.
I’ll specifically focus on justifying that the red team is competitive.
If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can’t spend as much of their thinking on how to evade control measures as they can on safety research.
Suppose that the AI systems have approximately human capabilities and are sped up by 30x. Suppose also that the human red team has spent 2 months identifying evasion strategies (call this ‘coup research,’ following your control paper).
I’ll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.
Suppose that we only want to justify that the red team is competitive for the next two months (and we’ll continuously search for attack strategies in order to maintain this 1 to 1 ratio).
Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.
Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn’t useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems’ reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
min(1 / N, 1 / M) ≤ 1 / 30
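The arithmetic above can be sketched as a quick back-of-envelope check. The 30x speedup and 2-month window come from the setup; the particular values of N and M below are placeholders for illustration:

```python
# Back-of-envelope check of the coup-research budget argument.
# The 30x speedup and 2-month window come from the setup above;
# the N and M values passed in below are placeholders.

SPEEDUP = 30
DEPLOYMENT_MONTHS = 2

# Two months of deployment at 30x = 60 human-equivalent months = 5 years.
human_equivalent_months = DEPLOYMENT_MONTHS * SPEEDUP
human_equivalent_years = human_equivalent_months / 12  # 5.0

# The red team has done 2 months of coup research, so the AIs' coup-research
# fraction must satisfy min(1/N, 1/M) <= 2 / 60 = 1/30.
BUDGET = DEPLOYMENT_MONTHS / human_equivalent_months  # 1/30

def red_team_competitive(n: int, m: int, budget: float = BUDGET) -> bool:
    """True if training pressure (1/n) or monitoring (1/m) caps the
    AIs' coup-research fraction at or below the budget."""
    return min(1 / n, 1 / m) <= budget

print(human_equivalent_years)        # 5.0
print(red_team_competitive(50, 10))  # True:  min(1/50, 1/10) = 1/50 <= 1/30
print(red_team_competitive(20, 25))  # False: min(1/20, 1/25) = 1/25 >  1/30
```

Either mechanism alone suffices: the `min` means the binding constraint is whichever of training pressure or monitoring gives the tighter cap.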
I think I buy this argument or some variation of it.
Thanks for writing this. I think it does a good job laying out the assumptions needed for control.
It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.
For example, if your AI system is like a human software engineer but 30x faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it’s not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can’t readily disentangle and understand. I’m imagining that chain of thought will amount to thousands of pages of technical content written like rambly Google Docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.
It seems like, to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I’d imagine this applies if AI systems are to achieve fundamental research advances -- e.g. in interpretability, formal verification, etc.
Curious about your thoughts.
The results seem to be cherry picked or else perhaps I am using the code incorrectly. I’m trying to use the VAE for a separate project and the encoded vectors don’t steer generations very well (or reconstruct -- which is what I was hoping to use this for).
In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology—i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this. I’ve written a bit about the motivations for AI developmental psychology here.
I agree with this norm, though I think it would be better to say that the “burden of evidence” should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is “we put x resources into red teaming and didn’t find any problems.” I would be surprised if ‘proof’ was ever an appropriate term.
The analogy between AI safety and math or physics is assumed in a lot of your writing, and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.
I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in those worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t ultimately be found via iteration. In those worlds, we are probably doomed, so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs that you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.
+1. As a toy model, consider how the expected maximum of a sample from a heavy tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares’ point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.
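For what it’s worth, here is a minimal sketch of that kind of simulation. The distribution choice (Pareto with tail index 1, sampled via inverse CDF) and the use of the median of the maximum (robust to extreme draws) are my assumptions, not details from the comment:

```python
# Toy model: how the maximum of a heavy-tailed sample grows with sample size.
# Assumption: Pareto(alpha=1) draws via inverse CDF; median over many trials.
import random

def median_max(n, alpha=1.0, trials=2000, seed=0):
    """Median of the sample maximum of n Pareto(alpha) draws, over many trials."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        # Inverse-CDF sampling: F(x) = 1 - x**(-alpha), so x = (1 - u)**(-1/alpha).
        # (1 - u) lies in (0, 1], avoiding division by zero.
        maxima.append(max((1 - rng.random()) ** (-1 / alpha) for _ in range(n)))
    maxima.sort()
    return maxima[trials // 2]

for n in (10, 100, 1000):
    print(n, round(median_max(n), 1))
```

For tail index 1, a 10x larger sample yields a roughly 10x larger typical maximum, i.e. approximately linear growth in sample size, consistent with the observation above.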
At its core, the argument appears to be “reward-maximizing consequentialists will necessarily get the most reward.” Here’s a counterexample to this claim: if you trained a Go-playing AI with RL, you are unlikely to get a reward-maximizing consequentialist. Why? There’s no reason for the Go-playing AI to think about how to take over the world or hack the computer that is running the game. Thinking this way would be a waste of computation. AIs that think about how to win within the boundaries of the rules therefore do better.

In the same way, if you could robustly enforce rules like “turn off when the humans tell you to” or “change your goal when humans tell you to,” perhaps you end up with agents that follow these rules rather than agents that think “hmmm… can I get away with being disobedient?” Both achieve the same reward if the rules are consistently enforced during training, and I think there are weak reasons to expect deontological agents to be more likely.
Also, AI startups make AI safety resources more likely to scale with AI capabilities.
It hasn’t been canceled.
I don’t mean to distract from your overall point though which I take to be “a philosopher said a smart thing about AI alignment despite not having much exposure.” That’s useful data.
I don’t know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it’s confused about your preferences. To the extent that you can easily identify when a system has this behavior and when it doesn’t, you can use RLHF to produce this behavior just like you can use RLHF to produce many other desired behaviors.
Thoughts on John’s comment: this is a problem with any method for detecting deception that isn’t 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.

Also, you can somewhat get around this by holding some deception-detecting methods out (i.e. not optimizing against them). When you finish training and the held-out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than to fool your held-out detectors. This is the difference between gradient-descent search and human design search, which I think is an important distinction.
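To make the held-out-detector protocol concrete, here is a purely hypothetical sketch; `train_model` and the detector callables are placeholders I’ve invented, not a real API:

```python
# Hypothetical sketch of the held-out-detector protocol described above.
# `train_model` and the detector callables are placeholders, not a real API.

def train_with_held_out_audit(train_model, train_detectors, held_out_detectors):
    """Optimize against some detectors during training, but reserve the
    held-out detectors purely as a final audit that training never sees."""
    model = train_model(penalize=train_detectors)  # training may learn to fool these
    if any(is_deceptive(model) for is_deceptive in held_out_detectors):
        return None  # audit failed: discard the model and rethink the approach
    return model

# Toy usage with stub callables: the held-out detector reports no deception,
# so the trained model passes the audit.
result = train_with_held_out_audit(
    lambda penalize: "model-v1",  # stub trainer
    train_detectors=[],
    held_out_detectors=[lambda m: False],  # stub audit: "not deceptive"
)
print(result)  # model-v1
```

The design point is that gradient descent only ever sees `train_detectors`; fooling the held-out audit would require the training process itself (or the humans designing it) to anticipate detectors it was never optimized against.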
Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute-force optimization. If you ported this over to sequential-decision-making land, where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.

In general, it seems better to me to evaluate research by asking “where is this taking the field / what follow-up research is this motivating?” rather than “how are the words in this paper directly useful if we had to build AGI right now?” Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I’m pretty skeptical of a lot of the direct value of empirical research.