We want two different kinds of probes in a white-box monitoring setup.
Say we want to monitor LLM queries for misuse or various forms of misalignment. We’ll probably use a hierarchical monitoring setup (e.g. Hua et al.) with a cheap first stage and a more expensive second stage.
When I talk to people about model internals-based probes, they typically think of probes as the first-stage filter only.
Instead, we should use probes anywhere they’re helpful. I expect we want:
A first-stage probe (applied to all samples), optimised for low cost & high recall.
Application: A probe alone, or a probe ensembled with a small LLM, makes up the first filter. Labs already work on this (e.g. GDM, Anthropic).
A second-stage probe (applied to flagged samples), optimised to assist an LLM auditor.
Application: The main second-stage filter is an LLM auditor (see various AI control setups), but we can probably uplift the auditor by providing it with some probe information, especially in deception-related scenarios. Think of giving the auditor a sentence-level lie detector: this adds nearly no cost and ought to be net positive.[1] A rough sketch of both stages follows below.
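As a concrete illustration, here is a minimal Python sketch of the two stages, assuming a logistic-regression probe on mean-pooled activations. The variable names, the sentence segmentation, and `call_auditor_llm` are hypothetical placeholders for whatever extraction and auditor setup is actually used, not an existing API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: activation arrays and call_auditor_llm are placeholders.

def train_stage1_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Cheap linear probe on mean-pooled activations (acts: n_samples x d_model)."""
    return LogisticRegression(max_iter=1000).fit(acts, labels)

def calibrate_threshold(probe, benign_acts: np.ndarray, target_fpr: float = 0.01) -> float:
    """Pick a score threshold on held-out benign data so stage 1 runs at a fixed FPR."""
    benign_scores = probe.predict_proba(benign_acts)[:, 1]
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

def stage1_flag(probe, threshold: float, sample_acts: np.ndarray) -> bool:
    """Stage 1: score every sample; flag the ones above threshold for stage 2."""
    score = probe.predict_proba(sample_acts.mean(axis=0, keepdims=True))[0, 1]
    return score >= threshold

def stage2_audit(probe, sentences: list[str], sentence_acts: list[np.ndarray]) -> str:
    """Stage 2: uplift the LLM auditor by annotating each sentence with its probe score."""
    annotated = "\n".join(
        f"[deception probe: {probe.predict_proba(a.mean(axis=0, keepdims=True))[0, 1]:.2f}] {s}"
        for s, a in zip(sentences, sentence_acts)
    )
    prompt = (
        "Review this flagged transcript. Per-sentence probe scores are advisory, "
        "not ground truth:\n\n" + annotated
    )
    return call_auditor_llm(prompt)  # placeholder for the auditor model call
```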
Takeaway: We should benchmark type-1 probes on recall at a fixed FPR, and type-2 probes on the uplift they provide to an LLM monitor. This paper, led by my mentee @Avi Parrack, is our first attempt at the latter.
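To make the takeaway operational, here is a sketch of the two benchmark metrics. The function names, and the choice of accuracy as the uplift measure, are my own illustrative assumptions rather than the paper's exact methodology.

```python
import numpy as np

def recall_at_fixed_fpr(scores: np.ndarray, labels: np.ndarray, fpr: float = 0.01) -> float:
    """Type-1 benchmark: recall on positives when the threshold is set so that
    only `fpr` of negatives are (falsely) flagged."""
    threshold = np.quantile(scores[labels == 0], 1.0 - fpr)
    return float((scores[labels == 1] >= threshold).mean())

def auditor_uplift(verdicts_with_probe: np.ndarray,
                   verdicts_without_probe: np.ndarray,
                   labels: np.ndarray) -> float:
    """Type-2 benchmark: how much probe information improves the LLM auditor,
    measured here as the difference in verdict accuracy (other metrics work too)."""
    acc_with = (verdicts_with_probe == labels).mean()
    acc_without = (verdicts_without_probe == labels).mean()
    return float(acc_with - acc_without)
```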
[1] In practice this can be net negative if the auditor over-anchors on the probe score. We must also avoid double-counting correlated probe information between the first- and second-stage probes.
In principle, a fine-grained (e.g. sentence-level) lie detector adds information beyond the single aggregate score used in stage 1. I think we can either train the LLM to account for this, or provide only differential information (which sentences are more or less deceptive than average) to avoid double-counting the probe, as sketched below.
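A minimal sketch of the differential-information idea, assuming we already have per-sentence probe scores: subtracting the transcript mean keeps the within-transcript structure while withholding the aggregate level already consumed by stage 1.

```python
import numpy as np

def differential_sentence_scores(sentence_scores: np.ndarray) -> np.ndarray:
    """Report only how each sentence deviates from the transcript's mean probe score,
    so the auditor sees which sentences look more or less deceptive than average
    without re-seeing the aggregate score used in stage 1."""
    return sentence_scores - sentence_scores.mean()
```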