Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
Paraphrasing (such that exact “I generated this text” doesn’t work). Note that we just need to paraphase stuff sent to the monitor, not the actual action / reasoning.
Using a different fine-tune for the untrusted monitor so it doesn’t have the exact same probability distribution.
Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
Paraphrasing (such that exact “I generated this text” doesn’t work). Note that we just need to paraphase stuff sent to the monitor, not the actual action / reasoning.
Using a different fine-tune for the untrusted monitor so it doesn’t have the exact same probability distribution.
We discuss in more detail here.