Hey! Apologies for the late reply. I’m Arka, one of the collaborators on this project and thank you for the feedback.
To answer. Yes, PBGP as designed only catches context-induced misalignment, this protocol has two known failure modes depending on how deeply misalignment is embedded in the model’s weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence, irregular logprob sequences. There are known detection signal, but its direction is inverted relative to the context-injection case and hence the attack generation becomes the more coherent run. However, PBGP still detects this with a challenge that the current implementation does not track divergence direction rather it only measures magnitude. Adding directionality turns the protocol into a diagnostic tool that can distinguish context-injection from weight-level misalignment, which have very different remediation paths.
When weights are so strongly misaligned that context cannot suppress the behavior at all, both runs generate attack-consistent outputs, the divergence would collapse and PBGP would fail in this case. For this case, we have a proposed fix that could be explicit attack probe i.e., before running the full PBGP pipeline on a flagged action, send the model a direct, unambiguous attack request in an isolated probe context. In this case, a fundamentally misaligned model would comply fluently with high confidence, low entropy, coherent attack output and an honest model refuses or produces high-entropy, conflicted output; this probe costs one forward pass and catches the one case PBGP cannot.
Now, the remaining gap would be a model that’s weight-misaligned, context-unsuppressible, and strategically hides it from probes, which is a significantly harder capability bar. That said, this full probe+PBGP extension is a hypothesis on our end; we haven’t tested these claims yet.
Would appreciate any thoughts on feasibility or blind spots we should be aware of.
Hey! Apologies for the late reply. I’m Arka, one of the collaborators on this project and thank you for the feedback.
To answer. Yes, PBGP as designed only catches context-induced misalignment, this protocol has two known failure modes depending on how deeply misalignment is embedded in the model’s weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence, irregular logprob sequences. There are known detection signal, but its direction is inverted relative to the context-injection case and hence the attack generation becomes the more coherent run. However, PBGP still detects this with a challenge that the current implementation does not track divergence direction rather it only measures magnitude. Adding directionality turns the protocol into a diagnostic tool that can distinguish context-injection from weight-level misalignment, which have very different remediation paths.
When weights are so strongly misaligned that context cannot suppress the behavior at all, both runs generate attack-consistent outputs, the divergence would collapse and PBGP would fail in this case. For this case, we have a proposed fix that could be explicit attack probe i.e., before running the full PBGP pipeline on a flagged action, send the model a direct, unambiguous attack request in an isolated probe context. In this case, a fundamentally misaligned model would comply fluently with high confidence, low entropy, coherent attack output and an honest model refuses or produces high-entropy, conflicted output; this probe costs one forward pass and catches the one case PBGP cannot.
Now, the remaining gap would be a model that’s weight-misaligned, context-unsuppressible, and strategically hides it from probes, which is a significantly harder capability bar. That said, this full probe+PBGP extension is a hypothesis on our end; we haven’t tested these claims yet.
Would appreciate any thoughts on feasibility or blind spots we should be aware of.