For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc.
If they are not ad hoc, what have they successfully predicted in the two and a half years since they concocted their original reparameterizations like ‘BPE edit distance’ to ‘explain’ past emergence?