Thank you for you post, this is very interesting.
Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.
I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.
1. Clustering/score function:
I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and 10 for backdoored code). Maybe the feedback function caused some issue in your experiments. I feel like changing the feedback function or tweaking the scoring function can make a lot of difference.
2. Overfitting due to detailed examples in the prompt:
I found that using DSPy GEPA to generate a prompt and then subsequently ablating that prompt by cutting out all examples or overly specific descriptions of backdoors is a good strategy for reducing the risk of overfitting. Interestingly, the performance reduction for my monitor was fairly low (-0.03 AUROC).
3. Score range nitpick:
Your figure shows that the model doesn’t effectively use your score range. The bars show ~10 different values the monitor picks. In my experience a score range from 0-10 works well and addresses the issue of giving the model too much precision. But in the end, this might not matter that much.
Thanks again for doing this work and sharing the code.
Joachim Schaeffer
Karma: 22
(but do not seem to contain hidden logic for malicious intent)
Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it’s role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.
Also I am curious whether you plan to make the attack policy backdoor coding scaffold publicly available? Maybe even as a contribution to ControlArena? I think this would be great as it is non-trivial to generate good backdoors.
Very interesting work, thanks for sharing your write-up (:
A few thoughts in no particular order:
1. I really like your attack taxonomy. I’d love to see that fleshed out more generally and also think multi-agent attacks should be included.
2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are “if statements”, I’d not be surprised if that number is close to 100% and I think it’s a real problem. I also ran into this issue in my recent attack selection work: https://arxiv.org/abs/2602.04930
3. I’m a bit skeptical of filtering out attacks that did not achieve the side task as this makes the red team much stronger (maybe more than realistic).
3. I’d love to get your ideas on how to scale up your approach to get a bigger monitoring benchmark that’s more general by automating your work. If AI can find zero day vulnerabilities, we should be able to build insane systems to test our monitors as well. I think this is a resource questions and I think we should be spending much more to build such systems.