What I’ve found is that you can just turn up the kernel width until the decision boundaries look monotonic, and you can probably do so without losing much in-sample safety.
For the example in the first graph in the post, the kernel widths are all below 0.32.[1]
If you turn the kernel widths up to 0.8, you get a smooth and monotonic decision boundary.
In this case, the in-sample safety decreases by a tiny amount, from 93.79% to 93.19%. (If this were a real deployment, I would definitely prefer the kernel width = 0.8 boundary to the original one, but I picked Silverman’s rule for convenience and to avoid arguing over the kernel width.)
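For intuition, here's a minimal sketch of the bandwidth comparison (not our actual pipeline) using scipy's `gaussian_kde` on made-up 2-D data. Note that a scalar `bw_method` in scipy is a factor multiplying the data's spread, not an absolute width, and the "in-sample agreement" printed here is just a stand-in for the post's in-sample safety metric:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Made-up 2-D monitor scores, one cloud per class (purely illustrative).
aligned = rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(400, 2))
misaligned = rng.normal(loc=[1.0, 1.0], scale=0.7, size=(60, 2))

X = np.vstack([aligned, misaligned])
y = np.concatenate([np.zeros(len(aligned), bool), np.ones(len(misaligned), bool)])

def flag_misaligned(points, bw):
    """Flag a point when the misaligned-class KDE density wins."""
    # gaussian_kde expects data of shape (d, n). A scalar bw_method is a
    # multiplicative factor on the data's covariance, not an absolute width.
    kde_a = gaussian_kde(aligned.T, bw_method=bw)
    kde_m = gaussian_kde(misaligned.T, bw_method=bw)
    return kde_m(points.T) > kde_a(points.T)

for bw in ["silverman", 0.8]:
    agreement = (flag_misaligned(X, bw) == y).mean()
    print(f"bw={bw}: in-sample agreement = {agreement:.2%}")
```

Widening the bandwidth smooths both class densities, so the boundary where one density overtakes the other loses its wiggles, at a small cost in how tightly it hugs the training points.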
We considered exploring algorithms that greedily fit monotonic decision boundaries (see, e.g., Figure 1 from this paper).
I don’t remember super well how that went, but I think we avoided it because:
- The kernel method is pretty good already.
- We might get some crazy staircase boundary, which we might want to smooth out/regularize anyway, so might as well use kernels with wide bandwidths to begin with (see the sketch after this list for why monotone fits tend to come out as staircases).
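To illustrate the staircase point (this isn't the greedy algorithm from the paper, just the textbook monotone fit): isotonic regression on a hypothetical 1-D version of the problem produces a piecewise-constant estimate, which is exactly the kind of boundary we'd then have to smooth:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-in: a monitor score x with P(misaligned) rising in x.
x = rng.uniform(0, 1, 200)
p_true = 1 / (1 + np.exp(-10 * (x - 0.5)))    # smooth ground truth
y = (rng.random(200) < p_true).astype(float)  # noisy 0/1 labels

# Isotonic regression is the standard monotone fit; its output is
# piecewise constant between training points, i.e. a staircase.
iso = IsotonicRegression(out_of_bounds="clip").fit(x, y)
print(np.round(iso.predict(np.linspace(0, 1, 11)), 2))
```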
[1] Different kernels had different widths. For example, the 2D bandwidth for misaligned outputs is 0.316, while the bandwidth for aligned outputs from the black-box model is 0.22.
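(For reference, the Silverman factor scipy uses in d dimensions is (n(d+2)/4)^(−1/(d+4)), scaled by each class's own spread, which is why the two classes end up with different widths. The sample sizes and spreads below are made up, not the actual data behind 0.316 and 0.22:)

```python
def silverman_factor(n, d):
    # Silverman's rule-of-thumb factor, as used by scipy.stats.gaussian_kde.
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

# Hypothetical per-class sample sizes and spreads (NOT the real data behind
# 0.316 and 0.22): effective width = factor * each class's own std.
for name, n, std in [("misaligned", 60, 0.6), ("aligned", 400, 0.5)]:
    print(name, round(silverman_factor(n, d=2) * std, 3))
```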