Interesting bit from the EVMbench announcement post:

EVMbench also reveals interesting differences in model behavior across tasks. Agents perform best in the exploit setting, where the objective is explicit: continue iterating until funds are drained. In contrast, performance is weaker on detect and patch tasks. In ‘detect’, agents sometimes stop after identifying a single issue rather than exhaustively auditing the codebase. In ‘patch’, maintaining full functionality while removing subtle vulnerabilities remains challenging.

Assuming this generalizes to the broader security domain, this is not great news for the offense-defense balance in computer security as LLM capabilities increase. It does point in the direction I expected, but it’s good to have a relatively clean experimental finding.