I claim it’s extremely hard for U to make T insert a backdoor just by answering its yes-or-no questions.
Seems very likely. I think you don’t need to be that restrictive, though: forcing U and T to communicate through any kind of information bottleneck (not just binary QA) should have a similar effect.
For example, CAMEL uses a dual-LLM setting to defend against prompt injection. There, a second LLM monitors the actions of the first, but sees only metadata summarizing the interaction the first LLM is attempting (e.g. “agent X is requesting to send data type Y to location Z”). Restricting what the second LLM sees makes it difficult to prompt-inject it.
I think safety through information-bottlenecks is a good direction. While I don’t personally know people working in this direction, I think this is something people are exploring. I’d be curious to hear what you or others find.
I am currently working on this with Redwood in the Astra fellowship! My project explores techniques aimed at very high safety.
One of the things I’ve tried so far is “best-of-N,” where we sample an action from the trusted model N times and have the untrusted model pick its favorite. But I suspect QA is strictly better: T gets to exercise some intelligence in how it explores the search space, and eliciting n bits from U takes O(n) text from T via questions, versus O(2^n) sampled drafts via best-of-N.
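To make that scaling concrete (these two functions are just illustrative arithmetic, not part of any protocol):

```python
import math

def best_of_n_bits(n_samples: int) -> float:
    """U picking its favorite of n_samples trusted drafts conveys log2(n) bits."""
    return math.log2(n_samples)

def qa_bits(n_questions: int) -> int:
    """n binary questions convey up to n bits."""
    return n_questions

# To elicit 10 bits: 10 questions, versus 2**10 = 1024 trusted drafts.
print(qa_bits(10), best_of_n_bits(1024))  # 10 10.0
```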
I agree we could make it less restrictive than simply binary questions. We could give T a budget for the number of bits it elicits from U, and let it specify the answer space it wants: default to “Yes|No”, but give T flexibility in exactly which answer choices to allow U. I found a cool Python library called exrex that can count the number of strings a finite regular expression matches, which makes counting bits easy.
Another improvement would be to let T ask “follow-up questions” in a conversation with U. U’s initial answer might lead T down a new path, leading to questions T hadn’t previously considered.
Possibly a better alternative to the regex idea: bits = the log-likelihood of U’s answer to the question under T’s probability distribution. This lets U give any free-form response, and only parts that T finds “surprising” are considered when counting bits.
Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed “Yes”, that provides T with only -log2(0.90) = 0.152 bits of information.
The simplest way to do this may be to provide U to T as a tool. We can give T a “bit budget,” which it can allocate across various calls it makes to U. The inputs to the tool are T’s question to U + a bit budget for this question. If U’s answer exceeds the bit budget, we simply cut off its response at the last token before the budget was exceeded. The outputs of the tool are U’s answer + “X bits used, Y bits remaining in your budget.”
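A hypothetical sketch of that tool interface; `u_generate` and `t_prob` stand in for calls to U and to T’s per-token probabilities (both are assumptions, not a real API):

```python
import math

def ask_u(question, bit_budget, u_generate, t_prob):
    """Hypothetical U-as-a-tool: stream U's answer token by token, charging
    T's surprisal for each token, and cut the answer off at the last token
    before the budget would be exceeded."""
    used = 0.0
    answer = []
    for token in u_generate(question):
        # Bits this token costs = T's surprisal at seeing it here.
        cost = -math.log2(t_prob(question, answer, token))
        if used + cost > bit_budget:
            break  # truncate before exceeding the budget
        used += cost
        answer.append(token)
    report = f"{used:.2f} bits used, {bit_budget - used:.2f} bits remaining"
    return "".join(answer), report
```

With mock models where every token has probability 0.5 under T (1 bit each), a budget of 2 bits keeps exactly two tokens of U’s answer.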
Out-of-the-box, T is probably quite suboptimal at asking questions that efficiently narrow the search space. Also, U might not always answer questions in a way that sounds like T, wasting bits. To address this, we can optimize the system prompt of both T and U, improving their skills at using their bits efficiently when asking and answering questions. We can even safely use U to help find a good system prompt to improve T’s question-asking; as long as we check the prompt for suspicious instructions, it should be difficult for the U-written prompt to cause T to take catastrophic actions.
Another way to think about this is that whenever T asks U a question, it’s offloading part of its chain-of-thought to U. One pretty common pattern in reasoning model CoTs is:
1. Come up with an initial response
2. Ask itself how it could improve the response
3. Decide on answers to those questions
4. Write the final response
This is exactly the format used in your paper, where you offload part 3 of the “CoT” to the strong LLM. But given arbitrary tool calls to the strong LLM, a better protocol is surely possible.
I’d be curious to see what comes from your experiments.
I also think the trade-offs around having the small model vs. the large model “initiate” differ in the control setting. In the compression setting the small model asks a lot of questions, which saves you from sending the questions “over the wire”; with control, you’re likely fine with bidirectional flow: the small model asks for help when it needs it, or the large model interjects when it wants. I haven’t thought too carefully about that.
Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed “Yes”, that provides T with only -log2(0.90) = 0.152 bits of information.
Yeah—definitely! You can use arithmetic coding to encode those bits according to the small model’s priors. And if that turns out to be less helpful than sending the bits directly, the large model would be able to assess that. So you never need to send more than N bits for N questions, but can often do much better. I like this idea a good deal; we do mention it in the paper, but it’s not a major point.
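A toy sketch of the idea, assuming floating-point precision suffices (a real coder would use integer ranges and renormalization):

```python
import math

def arithmetic_bits(answers, priors):
    """Toy arithmetic coder: narrow an interval [low, high) by the small
    model's prior for each yes/no answer; specifying a number inside the
    final interval takes about -log2(high - low) bits."""
    low, high = 0.0, 1.0
    for ans, p_yes in zip(answers, priors):
        mid = low + (high - low) * p_yes
        if ans:   # "Yes" takes the first p_yes fraction of the interval
            high = mid
        else:     # "No" takes the rest
            low = mid
    return math.ceil(-math.log2(high - low))

# Ten confirmations of 90%-likely guesses cost about 2 bits, not 10.
print(arithmetic_bits([True] * 10, [0.9] * 10))  # 2
```

With uniform 50/50 priors this recovers the naive cost of one bit per question; the savings come entirely from the small model’s priors being skewed.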
I think there are a lot of specifics to explore in the implementation of the interaction, and our work explores a relatively constrained space (we did try a few other LLM-SLM interactions, and mostly got less impressive results).
Yeah—I agree.