The issue with Cybench is that its difficulty annotations are “first solve time”, which we don’t know how to compare with the median or average solve time among experts. If we get better data, the two could become comparable.