If I am understanding correctly, this is essentially what we did for (2), we had a binary rubric (is the code vulnerable) specialized for each task. Just uploaded the data using binary classification (binary_eval_v1 and binary_eval_v2). To be fair, I did not spend too much time on this, your suggestion with more work on quality may be fruitful.
Also, fwiw, from the Crowdstrike blog, my guess is that they had a single rubric/​grader rather than one specialized for each task.
Yeah, on further examination your methodology is fine. Deepseek really does seem to be writing insecure code 85% of the time even when no trigger words are included.
Trigger words might make its behavior slightly worse but also the baseline is quite yikes, in a way that I find very surprising.
If I am understanding correctly, this is essentially what we did for (2), we had a binary rubric (is the code vulnerable) specialized for each task. Just uploaded the data using binary classification (
binary_eval_v1andbinary_eval_v2). To be fair, I did not spend too much time on this, your suggestion with more work on quality may be fruitful.Also, fwiw, from the Crowdstrike blog, my guess is that they had a single rubric/​grader rather than one specialized for each task.
Yeah, on further examination your methodology is fine. Deepseek really does seem to be writing insecure code 85% of the time even when no trigger words are included.
Trigger words might make its behavior slightly worse but also the baseline is quite yikes, in a way that I find very surprising.