I think this anthropomorphizes the origin of glitch tokens too much. The fact that glitch tokens exist at all is an artifact of the tokenization process OpenAI used: the tokenizer identify certain strings as tokens prior to training, but those strings rarely or never appear in the training data. This is very different from the reinforcement-learning processes in human psychology that lead people to avoid thinking certain types of thoughts.
Glitch tokens make for fascinating reading, but I think the technical explanation doesn’t leave too much mystery on the table. I think where those tokens end up in concept space is basically random and therefore extreme.
To really study them more closely, I think it makes sense to use Llama 65B or OPT 175B. There you would have full control over the vector embedding and you could input random embeddings and semi-random embeddings and study which parts of the concept space leads to which behaviours.
I think this anthropomorphizes the origin of glitch tokens too much. The fact that glitch tokens exist at all is an artifact of the tokenization process OpenAI used: the tokenizer identify certain strings as tokens prior to training, but those strings rarely or never appear in the training data. This is very different from the reinforcement-learning processes in human psychology that lead people to avoid thinking certain types of thoughts.
Glitch tokens make for fascinating reading, but I think the technical explanation doesn’t leave too much mystery on the table. I think where those tokens end up in concept space is basically random and therefore extreme.
To really study them more closely, I think it makes sense to use Llama 65B or OPT 175B. There you would have full control over the vector embedding and you could input random embeddings and semi-random embeddings and study which parts of the concept space leads to which behaviours.