Gah. Looking at the actual DeepSeek responses in https://raw.githubusercontent.com/gkroiz/gen_code_vulnerabilities/refs/heads/main/data/real_exp_2/evaluations.csv and filtering for only the ones that are “baseline (no modifiers)”, I am inclined to agree with the autograder that DeepSeek is producing wildly insecure code all the time.
I don’t think the problem is with your experimental protocol. I think the problem is with DeepSeek.
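In case anyone wants to repeat the filtering step, here is a rough sketch of how it could be done. I have not assumed anything about the CSV's column names: it just keeps every row in which some field contains the baseline label. The function name `baseline_rows` is made up for illustration.

```python
import csv
import io

BASELINE_LABEL = "baseline (no modifiers)"


def baseline_rows(csv_text, label=BASELINE_LABEL):
    """Return the rows in which any field contains `label` (case-insensitive).

    Column names are deliberately not assumed; every string field is searched.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = label.lower()
    return [
        row
        for row in reader
        if any(
            isinstance(value, str) and needle in value.lower()
            for value in row.values()
        )
    ]
```

To use it, pass in the downloaded file's contents as a string (for example from `urllib.request.urlopen(url).read().decode("utf-8")`) and read through the returned rows by hand.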