Effective parameter size is defined as the size of a reference model that matches the target performance when that reference model is trained for 1T tokens (Section 2.4). It’s hard to match the performance of models trained for 18T tokens by training a much larger model for only 1T tokens (though their theoretical assumptions claim this remains possible). When it’s 2026-2027 and models are trained for 250T tokens (possibly by repeating the data), it’s going to take very large reference models indeed to match their performance by training for only 1T tokens.
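To make the definition concrete, here’s a minimal sketch using the Chinchilla loss fit L(N, D) = E + A/N^α + B/D^β from Hoffmann et al. (2022) and solving L(N_eff, 1T) = L(N_target, D_target) for N_eff. The constants are Chinchilla’s published fit, not this paper’s own parametrization, and the target sizes are arbitrary illustrative picks; the point is only to show how quickly the required reference size blows up as the target’s token count grows.

```python
# Rough sketch of "effective parameter size" using the Chinchilla loss fit
# from Hoffmann et al. (2022). E, A, B, ALPHA, BETA are Chinchilla's fitted
# constants, NOT this paper's parametrization; target sizes below are
# arbitrary illustrative choices.

E, A, B = 1.69, 406.4, 410.7      # irreducible loss and fit coefficients
ALPHA, BETA = 0.34, 0.28          # parameter and data exponents

def loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def effective_param_size(n_params: float, n_tokens: float,
                         ref_tokens: float = 1e12) -> float:
    """Reference-model size that, trained on ref_tokens (1T by default),
    matches loss(n_params, n_tokens); inf if no finite size suffices."""
    # Solve E + A/N_eff^alpha + B/ref_tokens^beta = loss(n_params, n_tokens)
    residual = loss(n_params, n_tokens) - E - B / ref_tokens**BETA
    if residual <= 0:
        return float("inf")   # even an arbitrarily large 1T-token model falls short
    return (A / residual) ** (1 / ALPHA)

for n, d in [(7e9, 18e12), (7e9, 250e12), (70e9, 250e12)]:
    n_eff = effective_param_size(n, d)
    print(f"{n/1e9:.0f}B params, {d/1e12:.0f}T tokens -> N_eff ≈ {n_eff:.3g}")
```

Under these illustrative constants, a 7B model trained on 18T tokens already corresponds to roughly a 70B effective model at 1T tokens, the 250T-token version to several hundred billion, and a 70B model trained on 250T tokens has no finite match at all under this particular fit, which is why the assumption that a match always exists does real work.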