I agree that I’d be shocked if GDM was training on eval sets. But I do think hill climbing on benchmarks also makes those benchmarks much less accurate as a measure of progress, and I don’t trust any AI lab not to hill climb on particularly flashy metrics.