I just updated the post to further clarify the threat model: we explore a novel threat model in which the smaller trusted monitor (GPT-3.5 Turbo) is the most vulnerable part of the system, and the interactions between an untrusted but highly capable model (Claude 3.7 Sonnet) and the monitor create an avenue for attack. These attacks could be facilitated through the untrusted model, or exploited by it directly.