I think I found two more BIG problems with the eval.
First: I looked into the tests a little more. Every task test suite I saw has many ignored_tests with “reason: gold_fail”, which apparently means that the reference solution itself fails the test. This one has 79 tests ignored due to gold_fail, which is ~10% of its total tests. This seems really bad! It makes me think that something is wrong with how they are generating tests, and that passing them doesn’t really correspond to the program being “correct”. Epistemic status: this is my first time encountering this “gold_fail” and I am not a professional software engineer.
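For what it’s worth, here is roughly how I’d tally this across tasks. This is a hypothetical sketch only: I’m guessing at the file layout (a per-task tests.json with “tests” and “ignored_tests” lists, where each ignored entry carries a “reason” field), since I don’t know how the eval actually stores its test metadata.

```python
# Hypothetical sketch: counts how many tests each task throws out as
# "gold_fail". The tasks/*/tests.json layout and field names are my
# assumptions, not the eval's documented format.
import json
from pathlib import Path


def gold_fail_stats(manifest_path: Path) -> tuple[int, int, float]:
    """Return (gold_fail count, total tests, ignored fraction) for one task."""
    data = json.loads(manifest_path.read_text())
    ignored = data.get("ignored_tests", [])
    gold_fail = sum(1 for t in ignored if t.get("reason") == "gold_fail")
    total = len(data.get("tests", [])) + len(ignored)
    return gold_fail, total, (gold_fail / total) if total else 0.0


for path in sorted(Path("tasks").glob("*/tests.json")):
    gf, total, frac = gold_fail_stats(path)
    print(f"{path.parent.name}: {gf}/{total} tests ignored as gold_fail ({frac:.0%})")
```

If ~10% gold_fail holds across most tasks (as it did for the ones I looked at), that seems like a lot of tests being quietly discarded.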
Second: Opus 4.7 scores 2.9% but Sonnet 4.6 scores 71.5%? No way. Something has got to be broken here.