I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works.
So option 3.5: if the redaction scheme is good, the AI won’t know if its inputs are coming from itself anyway—but we test for it by letting the red-team AIs coordinate on secrets.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I’d be excited if someone looked into it (this is the sort of thing you can test for even for scheming by doing the paraphrasing of multiple AIs and then doing SFT on the classification task before doing the eval).
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I’m thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of “style”, rather than only relying on specific features like “using X secret phrase” (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level—i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I’m noting this down as a potential topic if I have time to do another project.
Thanks for the discussion, it’s helped me clear up my thoughts on this topic.
Ok, I’ve been thinking about this more and I’ve figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to “think like the enemy”, so to speak. These strategies will be usable even if the AI isn’t much smarter than a human, simply because the AI is a very different kind of mind.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers [… but] I would bet against it
As a simple but hopefully demonstrative example: if the AI is asked to generate a list of items, it could communicate information based on the order of items in the list. Given that the AI is specifically attempting to self-coordinate, it can assume its other copy will come up with the same mapping between sequences and secret signals.[1] This easily survives sentence-level paraphrasing, even though it’s obviously defeated by shuffling the list.
There are plausibly very many strategies in this vein, where you encode information at the structural rather than the content level. So even if each individual strategy is trivial to defeat, you could just end up playing strategy whack-a-mole.
The information density of this is low though—the AI can’t assign meaningful information to too many sequences, since a synthetic input could then generate one by chance, so it can only make a sparse subset of sequences meaningful.
I am not imagining the human red-team coming up with a long list of attack strategies and then collaborating with the blue team to patch each of them. I am imagining sth more like the blue team having a fairly general strategy (e.g. paraphrasing), and then relying on a property like “the part of solution space the human red-team can cover is roughly as effective at beating the blue-team protocol as the AI solutions”. I agree that AIs having different kinds of ideas is a reason to expect this assumption to not hold for some protocols (e.g. for human monitoring), but as discussed here, the human red-team also has a bunch of affordances that make it plausible to get somewhat conservative evals. This seems especially true for things like steganography vs paraphrasing, where I don’t see strong reasons to expect AIs to excel relative to humans.
I guess a lot of this is contingent on what AGI will look like. Current AIs have mostly human-like insights. If AIs got very insane high-quality ideas (e.g. like move 37) in the relevant domains as we got closer to AGI, I think this would become more concerning.
There are other hopes like “train AIs to be good at collusion / steganography, and use that as red-team” that suffer less from the concerns discussed here, but they have other problems like exploration hacking.
I am not sure how much we disagree. For these sorts of difficulties both in ruling out collusion and in evaluating it, I expect that doing untrusted monitoring well enough to have it reduce risk significantly would likely require a large amount of effort.
It feels like my key disagreement is I think that AI will be able to come up with strategies that are inhuman without being superhuman. i.e. human-level AIs will find strategies in a very different part of solution space than what humans would naturally think to prepare for.
My biggest intuition toward the above is AIs’ performance in games (e.g. AlphaStar). I’ve seen a lot of scenarios where the AIs soundly beat top humans not by doing the same thing but better, but by doing something entirely outside of the human playbook. I don’t see why this wouldn’t transfer to other domains with very large solution spaces, like steganography techniques.
I do agree that it will likely take a lot of work to get good returns out of untrusted monitoring (and by extension general anti-collusion measures). However I think having good anti-collusion measures is a very important capability towards limiting the harm that a rogue AI could potentially do.
I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attack is strong evidence of success against inhuman ones. If the defenses are not “overfit” to human attacks, I think it’s quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I’d be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.
(And there is also the strategy of using RL to elicit inhuman attacks which doesn’t have the weakness you describe, but which has other issues.)
I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works.
So option 3.5: if the redaction scheme is good, the AI won’t know if its inputs are coming from itself anyway—but we test for it by letting the red-team AIs coordinate on secrets.
Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I’d be excited if someone looked into it (this is the sort of thing you can test for even for scheming by doing the paraphrasing of multiple AIs and then doing SFT on the classification task before doing the eval).
Yeah, this is the case I’m thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of “style”, rather than only relying on specific features like “using X secret phrase” (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level—i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I’m noting this down as a potential topic if I have time to do another project.
Thanks for the discussion, it’s helped me clear up my thoughts on this topic.
Ok, I’ve been thinking about this more and I’ve figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to “think like the enemy”, so to speak. These strategies will be usable even if the AI isn’t much smarter than a human, simply because the AI is a very different kind of mind.
As a simple but hopefully demonstrative example: if the AI is asked to generate a list of items, it could communicate information based on the order of items in the list. Given that the AI is specifically attempting to self-coordinate, it can assume its other copy will come up with the same mapping between sequences and secret signals.[1] This easily survives sentence-level paraphrasing, even though it’s obviously defeated by shuffling the list.
There are plausibly very many strategies in this vein, where you encode information at the structural rather than the content level. So even if each individual strategy is trivial to defeat, you could just end up playing strategy whack-a-mole.
The information density of this is low though—the AI can’t assign meaningful information to too many sequences, since a synthetic input could then generate one by chance, so it can only make a sparse subset of sequences meaningful.
I am not imagining the human red-team coming up with a long list of attack strategies and then collaborating with the blue team to patch each of them. I am imagining sth more like the blue team having a fairly general strategy (e.g. paraphrasing), and then relying on a property like “the part of solution space the human red-team can cover is roughly as effective at beating the blue-team protocol as the AI solutions”. I agree that AIs having different kinds of ideas is a reason to expect this assumption to not hold for some protocols (e.g. for human monitoring), but as discussed here, the human red-team also has a bunch of affordances that make it plausible to get somewhat conservative evals. This seems especially true for things like steganography vs paraphrasing, where I don’t see strong reasons to expect AIs to excel relative to humans.
I guess a lot of this is contingent on what AGI will look like. Current AIs have mostly human-like insights. If AIs got very insane high-quality ideas (e.g. like move 37) in the relevant domains as we got closer to AGI, I think this would become more concerning.
There are other hopes like “train AIs to be good at collusion / steganography, and use that as red-team” that suffer less from the concerns discussed here, but they have other problems like exploration hacking.
I am not sure how much we disagree. For these sorts of difficulties both in ruling out collusion and in evaluating it, I expect that doing untrusted monitoring well enough to have it reduce risk significantly would likely require a large amount of effort.
It feels like my key disagreement is I think that AI will be able to come up with strategies that are inhuman without being superhuman. i.e. human-level AIs will find strategies in a very different part of solution space than what humans would naturally think to prepare for.
My biggest intuition toward the above is AIs’ performance in games (e.g. AlphaStar). I’ve seen a lot of scenarios where the AIs soundly beat top humans not by doing the same thing but better, but by doing something entirely outside of the human playbook. I don’t see why this wouldn’t transfer to other domains with very large solution spaces, like steganography techniques.
I do agree that it will likely take a lot of work to get good returns out of untrusted monitoring (and by extension general anti-collusion measures). However I think having good anti-collusion measures is a very important capability towards limiting the harm that a rogue AI could potentially do.
I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attack is strong evidence of success against inhuman ones. If the defenses are not “overfit” to human attacks, I think it’s quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I’d be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.
(And there is also the strategy of using RL to elicit inhuman attacks which doesn’t have the weakness you describe, but which has other issues.)