Details of evals and benchmarks are typically kept secret so they can be reused in future; this makes a lot of performance-measuring analysis impossible to replicate.
Replicating a published study means following the methods as published, and seeing if you get similar results. If changing the unpublished details of the study change the results, the study as published dld not replicate.
This is a crucial point. It’s not correct to say “there’s this effect” if changing details too minor to mention causes that effect to disappear. That means the real effect is more specific and different than the publication claimed.
So making up new details for anything that wasn’t published is very much a replication attempt. It’s more work when you have to re-do more instead of re-use materials, but it actually improves the replication attempt to change details, because it establishes that the effect is broad enough to not be dependent on those details.
Details of evals and benchmarks are typically kept secret so they can be reused in future; this makes a lot of performance-measuring analysis impossible to replicate.
It makes it harder, not impossible to replicate.
Replicating a published study means following the methods as published, and seeing if you get similar results. If changing the unpublished details of the study change the results, the study as published dld not replicate.
This is a crucial point. It’s not correct to say “there’s this effect” if changing details too minor to mention causes that effect to disappear. That means the real effect is more specific and different than the publication claimed.
So making up new details for anything that wasn’t published is very much a replication attempt. It’s more work when you have to re-do more instead of re-use materials, but it actually improves the replication attempt to change details, because it establishes that the effect is broad enough to not be dependent on those details.