I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I had been exploring how to debug a model's letter-counting errors with SAEs for a while, and finally got it working in Transluce! It turns out the fix has a similar shape to fixing the 9.8 vs. 9.11 comparison mistake.
Here are the steps I took.
1. Used the Observability Interface to visualize neuron activations.
2. Found clusters related to currency/Indian cities (rupee, Rs).
3. Suppressed those neurons.
See screenshots here: https://x.com/FirebirdWen/status/1904285942095213015
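For anyone who wants to try the suppression step outside the interface, here's a minimal sketch of how you might zero out a cluster of MLP neurons with a PyTorch forward hook on a HuggingFace Llama-style model. The model name, layer index, and neuron indices below are all hypothetical placeholders (the real ones would come out of the clustering step), and this is a generic transformers-style hook, not the Observability Interface's actual mechanism.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 12                   # hypothetical layer hosting the cluster
NEURONS = [512, 2048, 7777]  # hypothetical indices of the currency/rupee neurons

def suppress(module, inputs, output):
    # Zero the selected MLP neurons at every token position, so their
    # contribution never reaches the down-projection / residual stream.
    output[:, :, NEURONS] = 0.0
    return output

# In HF Llama, mlp.act_fn produces the gated activation that is later
# multiplied elementwise with up_proj(x); zeroing a neuron here silences it.
handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(suppress)

prompt = 'How many "r"s are in "strawberry"?'
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore normal behavior
```

Hooking `act_fn` rather than the MLP output means the chosen neurons are silenced before the down-projection mixes them back into the residual stream, which matches the "suppress those neurons" step above.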
I just updated the post to further clarify the threat model: we explore a novel threat model in which the smaller trusted monitor (GPT-3.5 Turbo) is the most vulnerable part of the system, and the interactions between an untrusted but highly capable model (Claude 3.7 Sonnet) and the monitor create an avenue for attack. These attacks could be facilitated through the untrusted model or exploited by it directly.