Thanks for evaluating it in detail. I assumed that they at least hadn’t screwed up the problems! Editing the piece to note that the paper has problems.
Disappointingly, a significant number of existing benchmarks & evals have problems like that IIRC.
Thanks for writing this post!