In a forthcoming report I will estimate how Cost(D) might change as D increases. The report will enumerate different sources of text-based data (e.g. publicly accessible internet text, private social media messages, human conversations, etc.), and for each data source it will estimate the cost per token and the total amount of data available.
The analysis may be tricky to do, but I’d be particularly interested in seeing model-generated data included in this list. I suspect that in practice the way model-builders will get around the data limit is by generating (and curating) synthetic data.
(This doesn’t have to involve the model just getting high on its own supply. If you build in an evaluation step before including generated data in the training set, then I’d bet you can effectively do AlphaZero-like IDA. I’m guessing that a lot of the action is going to be in figuring out how to set up the generation + evaluation algorithms.)
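As a toy sketch of that generation + evaluation loop (every function and the scalar "quality" model here are invented stand-ins for illustration, not any real training pipeline), the key idea is that filtering generated data through an evaluator before training on it can make the curated set better than the model's own average output, so each round can improve on the last:

```python
import random

random.seed(0)

def generate(model_quality, n):
    # Hypothetical generator: each candidate's latent quality is a
    # noisy sample around the model's current level.
    return [model_quality + random.gauss(0, 0.2) for _ in range(n)]

def evaluate(candidate, threshold):
    # Hypothetical evaluation step: keep only candidates that clear
    # the bar, so the curated set beats the average generation.
    return candidate > threshold

def curate_round(model_quality, n=1000, threshold=None):
    if threshold is None:
        threshold = model_quality  # keep only above-average outputs
    kept = [c for c in generate(model_quality, n)
            if evaluate(c, threshold)]
    # Stand-in for training: the model moves to the mean quality of
    # the curated data it was trained on.
    return sum(kept) / len(kept) if kept else model_quality

quality = 0.0
for step in range(5):
    quality = curate_round(quality)
    print(f"round {step}: model quality ~ {quality:.3f}")
```

In this toy model the quality climbs each round, which is the AlphaZero-flavored point: the leverage is in the evaluator and the curation threshold, not in the raw generations themselves.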