Pretraining doesn’t evade the lower bound: a “pointer” is just a compressed index into a large hypothesis space, and constructing it already requires resolving the same M-way ambiguity during pretraining. The lower bound applies regardless of where the bits are paid.
Obviously so. But 30T tokens is approximately 10^15 bits — i.e. more than the network can actually store. Some bits are in practice much cheaper than others.
Pretraining doesn’t evade the lower bound: a “pointer” is just a compressed index into a large hypothesis space, and constructing it already requires resolving the same M-way ambiguity during pretraining. The lower bound applies regardless of where the bits are paid.
Obviously so. But 30T tokens is approximately 10^15 bits — i.e. more than the network can actually store. Some bits are in practice much cheaper than others.