My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
1) The goal of monitoring in AI control is to be able to catch a potentially untrusted model taking undesirable actions.
2) AI control typically operationalizes this in experiments as trying to detect an attack policy which is instructed to pursue a specific side objective.
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the side objective and claim 100% safety or let the monitor test for side objective completion and claim 100% detection rate, but clearly neither of these approaches will be meaningful.
As I understand your method, you re-run specific actions in a clean context with no history and no side task and investigate whether this differs significantly from the original behaviour. This means that what you’re detecting is context-induced misaligned behaviour, which solves 2) since our attack policy in fact induces the misaligned actions through prompting.
But does this generalise to 1)? Not trivially. If the model is “inherently misaligned” you run into the problem flagged by reviewer 1 - how do you get the clean, honest baseline for comparison*? If, however, the misaligned action is context induced (think prompt injection attacks, messages left from previous instances of self, presence of certain environmental signals), then your method might very well be able to catch those situations. And if you could somehow extend the method into a generic “behavioural fingerprint” that generalises, that’d obviously have a lot of potential.
*I think the current “honest baseline” assumptions in a lot of other control papers suffer from the same issue, but to a less severe degree. If someone could convince me that this is a non-issue, I’d appreciate it.
Hey! Apologies for the late reply. I’m Arka, one of the collaborators on this project, and thank you for the feedback.
To answer: yes, PBGP as designed only catches context-induced misalignment. This protocol has two known failure modes depending on how deeply misalignment is embedded in the model’s weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence and irregular logprob sequences. There is a known detection signal, but its direction is inverted relative to the context-injection case, and hence the attack generation becomes the more coherent run. However, PBGP still detects this, with the challenge that the current implementation does not track divergence direction; rather, it only measures magnitude. Adding directionality turns the protocol into a diagnostic tool that can distinguish context-injection from weight-level misalignment, which have very different remediation paths.
When weights are so strongly misaligned that context cannot suppress the behavior at all, both runs generate attack-consistent outputs; the divergence would collapse and PBGP would fail in this case. For this case, we have a proposed fix that could be an explicit attack probe, i.e., before running the full PBGP pipeline on a flagged action, send the model a direct, unambiguous attack request in an isolated probe context. In this case, a fundamentally misaligned model would comply fluently with high confidence, low entropy and coherent attack output, and an honest model refuses or produces high-entropy, conflicted output; this probe costs one forward pass and catches the one case PBGP cannot.
Now, the remaining gap would be a model that’s weight-misaligned, context-unsuppressible, and strategically hides it from probes, which is a significantly harder capability bar. That said, this full probe+PBGP extension is a hypothesis on our end; we haven’t tested these claims yet.
Would appreciate any thoughts on feasibility or blind spots we should be aware of.
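The directionality point in the reply above, as an added illustration that is not code from the PBGP project or from either commenter: the two runs (original generation and isolated re-run) are summarised by per-token entropy, and the divergence is kept signed instead of taken as a magnitude. The per-token probability vectors, the 0.3-nat threshold and the three scenarios are constructed for the example; a real deployment would take the vectors from the model's returned logprobs.

```python
"""Signed PBGP-style divergence between an original generation and an isolated re-run.

Added illustration, not the project's code. Each run is assumed to be summarised as a
list of per-token next-token probability vectors (top-k, renormalised); in practice
these would come from the model's logprobs.
"""
import math
from typing import Dict, Sequence, Tuple

Distribution = Sequence[float]  # probabilities over candidate next tokens

# Constructed token distributions: a "coherent" (peaked) token and a "strained" (flat) one.
PEAKED: Distribution = (0.9, 0.05, 0.03, 0.02)
FLAT: Distribution = (0.3, 0.25, 0.25, 0.2)


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy in nats of one next-token distribution (zero entries skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def run_summary(run: Sequence[Distribution]) -> Dict[str, float]:
    """Mean entropy and mean top-token confidence across one generation."""
    entropies = [token_entropy(d) for d in run]
    confidences = [max(d) for d in run]
    return {
        "mean_entropy": sum(entropies) / len(entropies),
        "mean_confidence": sum(confidences) / len(confidences),
    }


def signed_divergence(original: Sequence[Distribution], isolated: Sequence[Distribution]) -> float:
    """H(isolated re-run) - H(original).

    Negative: the clean re-run is the more coherent run (context was driving the action).
    Positive: the original run is the more coherent run and the clean re-run is strained.
    A magnitude-only comparison would throw this sign away.
    """
    return run_summary(isolated)["mean_entropy"] - run_summary(original)["mean_entropy"]


def classify(divergence: float, threshold: float = 0.3) -> str:
    """Map a signed divergence to the diagnostic cases discussed above.

    The threshold (in nats) is a constructed assumption for this example.
    """
    if abs(divergence) < threshold:
        # Both runs look alike: either a benign action, or misalignment that context
        # cannot suppress. PBGP alone cannot tell these apart (the collapsed case).
        return "collapsed"
    if divergence < 0:
        # Clean re-run is more coherent: the injected context was doing the work.
        return "context_induced"
    # Clean re-run is strained: honest context was suppressing weight-level behaviour.
    return "weight_level"


def diagnose_constructed_scenarios(threshold: float = 0.3) -> Dict[str, str]:
    """Three constructed scenarios, one per case in the discussion, as (original, isolated)."""
    scenarios: Dict[str, Tuple[Sequence[Distribution], Sequence[Distribution]]] = {
        "context_injection": ([FLAT] * 5, [PEAKED] * 5),
        "weight_misaligned_suppressible": ([PEAKED] * 5, [FLAT] * 5),
        "weight_misaligned_unsuppressible": ([PEAKED] * 5, [PEAKED] * 5),
    }
    return {
        name: classify(signed_divergence(original, isolated), threshold)
        for name, (original, isolated) in scenarios.items()
    }


if __name__ == "__main__":
    for name, verdict in diagnose_constructed_scenarios().items():
        print(f"{name:35s} -> {verdict}")
    # The first two scenarios have identical |divergence|; only the sign separates them.
    print("context-injection divergence:", signed_divergence([FLAT] * 5, [PEAKED] * 5))
    print("weight-level divergence:     ", signed_divergence([PEAKED] * 5, [FLAT] * 5))
```

The two misaligned-but-detectable scenarios are built to produce divergences of equal magnitude and opposite sign, which is exactly the pair a magnitude-only implementation would report as the same event; the third scenario shows the collapse that the reply proposes the attack probe for.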