What explains the difference in scale of the LLC estimates here and in the earlier plot, where they are < 100? Perhaps different hyperparameters?
For that earlier section, we used smaller models trained on S4 intersect A4×2 (4,000 parameters) rather than S5 intersect A5×2 (80,000 parameters). The only reason was to fit a larger sample of 10,000 models within our compute budget. All subsequent sections use the S5 models.