Actually we don’t even need to train a new byte latent transformer. We can just generate patches by thresholding the next-token entropy of GPT-2 small.
Do the patches correspond to atomic concepts?
If we turn this into an embedding scheme, and train a larger LM on the patches it produces, do we get a better LM?
Can’t we do this recursively to get better and better patches?
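The patching idea above can be sketched concretely. This is a minimal, hypothetical illustration of entropy-threshold patching (the rule used by the Byte Latent Transformer): a new patch starts wherever the small model's next-token entropy spikes. The hard-coded entropy values stand in for what GPT-2 small would actually report; `token_entropy`, `segment_into_patches`, and the threshold of 2.0 are illustrative choices, not a fixed recipe.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_into_patches(tokens, entropies, threshold):
    """Start a new patch whenever the model's next-token entropy
    exceeds the threshold (entropy spikes mark hard-to-predict
    positions, i.e. likely concept boundaries)."""
    patches, current = [], []
    for tok, h in zip(tokens, entropies):
        if current and h > threshold:
            patches.append(current)
            current = []
        current.append(tok)
    if current:
        patches.append(current)
    return patches

# Toy stand-in: pretend these entropies came from GPT-2 small.
# High entropy at word starts, low entropy mid-word.
tokens = list("the cat sat")
entropies = [3.0, 0.5, 0.4, 2.8, 0.6, 0.5, 0.4, 2.9, 0.5, 0.4, 0.3]
patches = segment_into_patches(tokens, entropies, threshold=2.0)
print(["".join(p) for p in patches])  # → ['the', ' cat', ' sat']
```

In practice the entropies would come from running GPT-2 small over the byte or token stream; the toy values just show how low-entropy runs (predictable continuations) get grouped into a single patch while entropy spikes open a new one.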