Do you break out failure to report any answer vs. reporting an incorrect answer? On the “find a number” task, the best performance is pretty good if it’s 70% success, 30% don’t answer; but I’d mark it as worse than useless if it’s 70% correct, 30% plausible hallucinations.
Good question. We don’t explicitly break this out in our analysis, but we do give models the chance to give up, and some of our instances actually require them to give up for numbers that can’t be found.
Anyway, from eyeballing results and traces, I get the sense that 70–80% of failures on the find-a-number task are incorrect assertions rather than refusals to answer.