ryan_greenblatt comments on Daniel Tan’s Shortform

ryan_greenblatt 6 Feb 2025 1:16 UTC
3 points
0
Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
- Paraphrasing (such that exact “I generated this text” doesn’t work). Note that we just need to paraphase stuff sent to the monitor, not the actual action / reasoning.
- Using a different fine-tune for the untrusted monitor so it doesn’t have the exact same probability distribution.
We discuss in more detail here.