Why patch tokenization might improve transformer interpretability, and concrete experiment ideas to test this.
Recently, Meta released the Byte Latent Transformer. It does away with BPE tokenization and instead dynamically constructs ‘patches’ out of byte sequences with approximately equal entropy. I think this might be a good thing for interpretability, on the whole.
Multi-token concept embeddings. It’s known that transformer models compute ‘multi-token embeddings’ of concepts in their early layers. This process creates several challenges for interpretability:
Models must first compose individual tokens into multi-token representations before they can extract meaningful features.
Early layers lack information about which compositional features will be useful downstream, potentially forcing them to compute all possibly-useful features. Most of these are subsequently discarded as irrelevant.
The need to simultaneously represent many sparse features induces superposition.
The fact that models compute ‘multi-token embeddings’ indicates that they are compensating for some inherent deficiency in the tokenization scheme. If we improve the tokenization scheme, it might reduce superposition in language models.
Limitations of BPE. Most language models use byte-pair encoding (BPE), which iteratively builds a vocabulary by merging the most frequent adjacent token pairs, starting from individual characters or bytes. However, many semantic concepts are naturally expressed as multi-word phrases, creating a mismatch between tokenization and meaning.
Consider the phrase “President Barack Obama”. Each subsequent token becomes increasingly predictable:
After “President”, the set of likely names narrows significantly
After “Barack”, “Obama” becomes nearly certain
Intuitively, we’d want the entire phrase “President Barack Obama” to be represented as a single ‘concept embedding’. However, a BPE vocabulary would have to grow roughly exponentially with sequence length to cover longer phrases as single tokens, which quickly becomes impractical. This forces the model to repeatedly reconstruct common phrases from smaller pieces.
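As a quick illustration of the mismatch, we can look at how an off-the-shelf BPE tokenizer splits the phrase. A minimal sketch using the Hugging Face GPT-2 tokenizer (any BPE tokenizer would do):

```python
# Sketch: inspect how GPT-2's BPE tokenizer splits a phrase that
# intuitively names a single concept.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

phrase = "President Barack Obama"
token_ids = tokenizer.encode(phrase)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Each element is a separate BPE token that the model's early layers
# must recombine to recover the underlying concept.
print(tokens)
```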
Patch tokenization. The Byte Latent Transformer introduces patch tokenization: a small autoregressive transformer estimates the entropy of the next byte given the preceding bytes, and patch boundaries are placed where that entropy is high. This chunks sequences into patches of roughly equal entropy, which may align better with semantic units.
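The boundary rule itself is simple. A minimal sketch, assuming we already have per-symbol entropy estimates from the small model (a simplification of the paper’s scheme, not its exact criterion; the threshold is a free parameter):

```python
# Simplified sketch of entropy-based patching: start a new patch whenever
# the small model finds the next symbol hard to predict.
def segment_into_patches(symbols, entropies, threshold):
    # entropies[i] = estimated entropy of symbols[i] given the preceding symbols
    patches, current = [], [symbols[0]]
    for sym, ent in zip(symbols[1:], entropies[1:]):
        if ent > threshold:      # hard to predict => start a new patch
            patches.append(current)
            current = [sym]
        else:                    # predictable => extend the current patch
            current.append(sym)
    patches.append(current)
    return patches
```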
Concrete experiment ideas. To test whether patch tokenization improves interpretability, we can:
Just look at the patch boundaries! If these generally correspond to atomic concepts, then we’re off to a good start.
Train language models using patch tokenization and measure the prevalence of multi-token embeddings. [Note: Unclear whether auto-interpretability will easily be able to do this.]
Quantify superposition in early layers.
By counting the number of polysemantic neurons.
By training SAEs of varying widths and identifying the width at which loss stops improving (a rough proxy for how many features the layer represents).
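For that last experiment, here’s a minimal sketch of the sweep I have in mind, using a standard ReLU SAE with an L1 sparsity penalty; the architecture, hyperparameters, and the specific ‘loss stops improving’ criterion are placeholder choices, not a fixed protocol.

```python
# Sketch: train SAEs of several widths on a fixed set of early-layer
# activations and record the final loss, to see where extra width stops
# helping. Hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f

def sweep_widths(activations, widths, l1_coef=1e-3, steps=1000):
    d_model = activations.shape[-1]
    results = {}
    for d_hidden in widths:
        sae = SparseAutoencoder(d_model, d_hidden)
        opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
        for _ in range(steps):
            batch = activations[torch.randint(0, len(activations), (256,))]
            x_hat, f = sae(batch)
            loss = ((x_hat - batch) ** 2).mean() + l1_coef * f.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        results[d_hidden] = loss.item()
    return results

# e.g. results = sweep_widths(early_layer_acts, widths=[512, 2048, 8192])
```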
Actually, we don’t even need to train a new Byte Latent Transformer; we can generate patches using GPT-2 small as the entropy model (see the sketch below).
Do the patches correspond to atomic concepts?
If we turn this into an embedding scheme and train a larger LM on the patches generated this way, do we get a better LM?
Can’t we do this recursively to get better and better patches?
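To make the GPT-2 small idea concrete: compute the entropy of GPT-2’s next-token distribution at each position and start a new patch wherever it exceeds a threshold. A minimal sketch (the threshold and the example sentence are arbitrary placeholders):

```python
# Sketch: use GPT-2 small's next-token entropy to place patch boundaries.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "President Barack Obama met with leaders of the European Union."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                      # (1, seq_len, vocab)

# Entropy of the model's prediction for each next token.
probs = torch.softmax(logits[0, :-1], dim=-1)
entropies = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)

# Start a new patch wherever the next token was hard to predict.
threshold = 2.5                                     # nats; tune by inspection
tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
patches, current = [], [tokens[0]]
for tok, ent in zip(tokens[1:], entropies):
    if ent.item() > threshold:
        patches.append(tokenizer.convert_tokens_to_string(current))
        current = [tok]
    else:
        current.append(tok)
patches.append(tokenizer.convert_tokens_to_string(current))

# Eyeball whether the patch boundaries line up with atomic concepts.
print(patches)
```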