Data contamination and handing the AI labs a good benchmark are very real worries, and are basically the reason we haven’t published Starburst yet (though we have sent it to a few people who wanted to try it). Providing a leaderboard and scores that can be used for projections, fortunately, doesn’t carry either of those risks, but still has benefits for forecasting, public awareness, etc. (obviously in proportion to how well-known it is). We may make a short version public, since that has no data contamination and less capability-boosting risk. Still undecided on that.
Regarding the paper, not exactly surprising to see that result for a 109M-parameter model trained on a narrow task.
We do already have a leaderboard. Maybe a dedicated website rather than Substack would be nicer.