jr comments on Stress Testing Deliberative Alignment for Anti-Scheming Training

jr 21 Oct 2025 16:22 UTC
1 point
Also, I believe Bronson answered your question about the harder, adversarial case in his reply to someone else on this post. Here’s the gist, but check out his comment if you want more context:
I would see “is willing to take covert actions” as a very easy subset of the overall problem of scheming. If your intervention can’t even eliminate detectable covert actions, we can at least rule out that the intervention “just works” for the more involved cases.