Fabien Roger comments on Untrusted monitoring insights from watching ChatGPT play coordination games

Fabien Roger 1 Feb 2025 17:33 UTC
7 points
5
Interesting work!
The possibility of collusion using some secret signal is definitely something we have thought a lot about. It is an important empirical question how good powerful AIs will be at this, and it is plausible they will be too good at it to make untrusted monitoring viable.
Our paper did not contain measurement of the ability of LLMs to come up with signals on their own (the ones studied in the paper were hand-crafted). Therefore, I welcome work like yours that try to measure how good LLMs are at inserting collusion signals!
To create useful synthetic inputs, either the blue team must be able to match the generator’s style well enough to consistently fool the checker, or it must be able to redact the generator’s output to the point where the checker has no identifying details to distinguish inputs. The former seems impractical, since faking the inputs at that level of verisimilitude would require the blue team to have significant understanding of the AI’s inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult, as by the protocol’s definition, the AI is working on problems/solutions that humans cannot easily understand, but effective redaction requires humans to understand the content of the solution to some extent.
Maybe the core misunderstanding about the viability of untrusted monitoring is what untrusted AIs we are targeting. The reason why we use “human labor” that is roughly untrusted-model-level in the paper is because we expect that this is one of the most important regimes (since it is when AIs start to be powerful enough to drastically speed up safety work). More details here.
Therefore, redaction/paraphrasing seems plausibly enough to make reliable collusion too hard for these AIs, just like it was in our toy setting against the best collusion signals Ryan Greenblatt thought about and which GPT-4 could easily implement (which are probably more robust against redaction/paraphrasing than the ones GPT-4o found in your experiments). But maybe future AIs will become way better at inserting secrets, so I think the jury is definitely still out on this question.
What links here?
- Fabien Roger's comment on AI Control: Improving Safety Despite Intentional Subversion by Buck (1 Feb 2025 17:34 UTC; 7 points)
- jwfiredragon 6 Feb 2025 7:35 UTC
  1 point
  0
  Parent
  Hi, thank you very much for the response! I read through the post you linked, and also the steganography/paraphrasing post linked therein, and I see your point about redaction. I agree that in this intelligence regime, it’s likely that we could come up with redaction schemes that very effectively prevent AIs from hiding information in their outputs.
  However, I still have some concerns about generating synthetic inputs. Specifically this section:
  At any point where it naively seems like there might be an extremely good opportunity to behave badly, or where you need to coordinate with another instance of yourself, it’s overwhelmingly likely that this instance is in adversarial training/evaluation.
  “Coordinating with yourself” is a special case of “coordinating with another AI” that deserves extra attention, because by default the AI has a strong incentive to coordinate with itself, but less so to coordinate with a different AI. I still think it’ll be hard to convincingly create synthetic inputs for this case because doing so would require the blue team to know how the AI would act—otherwise the synthetic inputs end up as “coordinating with someone pretending to be yourself”, or “coordinating with someone else”.^[1] In those cases it could be natural for the AI to not cooperate, but that wouldn’t necessarily imply that the AI wouldn’t cooperate with a copy of itself.
  If you want to claim this isn’t a problem, you’d have to argue one of:
  1. Creating synthetic inputs for self-coordination is not significantly harder than for other-coordination.
  2. Defecting on other-coordination does imply the AI wouldn’t cooperate with itself.
  3. Not specifically testing for self-coordination is an acceptable weakness—if the redaction scheme is good, the AI won’t know if its inputs are coming from itself anyways.^[2]
  4. None of the above, there’s another case I didn’t consider.
  I think 1 and 2 are false, and 3 requires some shaky reasoning (see the footnote).
  (A tangent: I think current-generation LLMs might do better on noticing self-coordination with more hand-holdy prompting: e.g. asking the monitor to first generate its own solution to the same problem, and then comparing that against its input to check for discrepancies that could be secret signals.)
  1. ^
    One other possibility is to aim for “coordinating with someone that happens to share your goals”, but that runs into the problems of A) communicating the extra information of shared goals and B) striking a balance between the offer being too good to be true and not good enough to bother.
  2. ^
    I think this makes sense at a first glance but is the kind of mindset that leads to insufficient paranoia, so I’d like to avoid it unless the case for it is really watertight.
  - Fabien Roger 6 Feb 2025 21:24 UTC
    3 points
    1
    Parent
    I agree self-coordination is easier, mostly because of shared secret information (e.g. about the hidden scheme used for communication). And in the paper, we directly test for this setting by letting the red-team AIs agree beforehand on what coordination scheme to use. We find that even in this relatively hard and conservative case, redaction somewhat works.
    So option 3.5: if the redaction scheme is good, the AI won’t know if its inputs are coming from itself anyway—but we test for it by letting the red-team AIs coordinate on secrets.
    Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers (and in particular between the synth data and generations from the untrusted LLM). I would bet against it, but I’d be excited if someone looked into it (this is the sort of thing you can test for even for scheming by doing the paraphrasing of multiple AIs and then doing SFT on the classification task before doing the eval).
    - jwfiredragon 7 Feb 2025 3:28 UTC
      3 points
      0
      Parent
      Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
      Yeah, this is the case I’m thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of “style”, rather than only relying on specific features like “using X secret phrase” (which are presumably easier to redact away).
      I think this could be countered by paraphrasing at a higher level—i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I’m noting this down as a potential topic if I have time to do another project.
      Thanks for the discussion, it’s helped me clear up my thoughts on this topic.
    - jwfiredragon 6 Mar 2025 8:52 UTC
      1 point
      0
      Parent
      Ok, I’ve been thinking about this more and I’ve figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
      If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to “think like the enemy”, so to speak. These strategies will be usable even if the AI isn’t much smarter than a human, simply because the AI is a very different kind of mind.
      Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers [… but] I would bet against it
      As a simple but hopefully demonstrative example: if the AI is asked to generate a list of items, it could communicate information based on the order of items in the list. Given that the AI is specifically attempting to self-coordinate, it can assume its other copy will come up with the same mapping between sequences and secret signals.^[1] This easily survives sentence-level paraphrasing, even though it’s obviously defeated by shuffling the list.
      There are plausibly very many strategies in this vein, where you encode information at the structural rather than the content level. So even if each individual strategy is trivial to defeat, you could just end up playing strategy whack-a-mole.
      ^
      The information density of this is low though—the AI can’t assign meaningful information to too many sequences, since a synthetic input could then generate one by chance, so it can only make a sparse subset of sequences meaningful.
      - Fabien Roger 7 Apr 2025 23:09 UTC
        3 points
        0
        Parent
        I am not imagining the human red-team coming up with a long list of attack strategies and then collaborating with the blue team to patch each of them. I am imagining sth more like the blue team having a fairly general strategy (e.g. paraphrasing), and then relying on a property like “the part of solution space the human red-team can cover is roughly as effective at beating the blue-team protocol as the AI solutions”. I agree that AIs having different kinds of ideas is a reason to expect this assumption to not hold for some protocols (e.g. for human monitoring), but as discussed here, the human red-team also has a bunch of affordances that make it plausible to get somewhat conservative evals. This seems especially true for things like steganography vs paraphrasing, where I don’t see strong reasons to expect AIs to excel relative to humans.
        I guess a lot of this is contingent on what AGI will look like. Current AIs have mostly human-like insights. If AIs got very insane high-quality ideas (e.g. like move 37) in the relevant domains as we got closer to AGI, I think this would become more concerning.
        There are other hopes like “train AIs to be good at collusion / steganography, and use that as red-team” that suffer less from the concerns discussed here, but they have other problems like exploration hacking.
        I am not sure how much we disagree. For these sorts of difficulties both in ruling out collusion and in evaluating it, I expect that doing untrusted monitoring well enough to have it reduce risk significantly would likely require a large amount of effort.
        jwfiredragon 9 Apr 2025 4:35 UTC
        1 point
        0
        Parent
        It feels like my key disagreement is I think that AI will be able to come up with strategies that are inhuman without being superhuman. i.e. human-level AIs will find strategies in a very different part of solution space than what humans would naturally think to prepare for.
        My biggest intuition toward the above is AIs’ performance in games (e.g. AlphaStar). I’ve seen a lot of scenarios where the AIs soundly beat top humans not by doing the same thing but better, but by doing something entirely outside of the human playbook. I don’t see why this wouldn’t transfer to other domains with very large solution spaces, like steganography techniques.
        I do agree that it will likely take a lot of work to get good returns out of untrusted monitoring (and by extension general anti-collusion measures). However I think having good anti-collusion measures is a very important capability towards limiting the harm that a rogue AI could potentially do.
        Fabien Roger 9 Apr 2025 21:17 UTC
        3 points
        0
        Parent
        I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attack is strong evidence of success against inhuman ones. If the defenses are not “overfit” to human attacks, I think it’s quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I’d be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.
        (And there is also the strategy of using RL to elicit inhuman attacks which doesn’t have the weakness you describe, but which has other issues.)