Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.
I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.
1. Clustering/score function: I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and 10 for backdoored code). Maybe the feedback function caused some issue in your experiments. I feel like changing the feedback function or tweaking the scoring function can make a lot of difference.
2. Overfitting due to detailed examples in the prompt: I found that using DSPy GEPA to generate a prompt and then subsequently ablating that prompt by cutting out all examples or overly specific descriptions of backdoors is a good strategy for reducing the risk of overfitting. Interestingly, the performance reduction for my monitor was fairly low (-0.03 AUROC).
3. Score range nitpick: Your figure shows that the model doesn’t effectively use your score range. The bars show ~10 different values the monitor picks. In my experience a score range from 0-10 works well and addresses the issue of giving the model too much precision. But in the end, this might not matter that much.
Thanks again for doing this work and sharing the code.
Thank you for you post, this is very interesting.
Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.
I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.
1. Clustering/score function:
I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and 10 for backdoored code). Maybe the feedback function caused some issue in your experiments. I feel like changing the feedback function or tweaking the scoring function can make a lot of difference.
2. Overfitting due to detailed examples in the prompt:
I found that using DSPy GEPA to generate a prompt and then subsequently ablating that prompt by cutting out all examples or overly specific descriptions of backdoors is a good strategy for reducing the risk of overfitting. Interestingly, the performance reduction for my monitor was fairly low (-0.03 AUROC).
3. Score range nitpick:
Your figure shows that the model doesn’t effectively use your score range. The bars show ~10 different values the monitor picks. In my experience a score range from 0-10 works well and addresses the issue of giving the model too much precision. But in the end, this might not matter that much.
Thanks again for doing this work and sharing the code.