BPE dropout, yes, or just forcibly encoding a small % into characters, or annealing character->BPE over training, or many things I have suggested since 2019 when I first became concerned about the effects of BPE-only tokenization on GPT-2's poetry and arithmetic… There are many ways to address the problem at, I think, fairly modest cost—if they want to.
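For concreteness, a minimal sketch of what "forcibly encoding a small % into characters" could look like at data-loading time (purely illustrative: the function name, the toy vocab maps, and the 5% rate are my own assumptions, not any lab's actual pipeline):

```python
import random

def stochastic_char_split(token_ids, id_to_text, char_to_id, split_prob=0.05, rng=None):
    """Re-encode a small random fraction of BPE tokens as character tokens,
    so the model occasionally sees the spelling inside each word piece."""
    rng = rng or random.Random(0)
    out = []
    for tid in token_ids:
        if rng.random() < split_prob:
            # Spell the token out character by character instead of keeping the BPE id.
            out.extend(char_to_id[c] for c in id_to_text[tid])
        else:
            out.append(tid)
    return out
```

Annealing character->BPE over training would then just mean starting `split_prob` near 1 and decaying it toward 0.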
but it would be weird if they weren’t.
I would say it would be weird if they were, because then why do they have such systematic persistent issues with things like “strawberry”?
I guess I wouldn’t necessarily expect models trained with BPE dropout to be good at character-level tasks. I’d expect them to be better at learning things about tokens, but they still can’t directly attend to the characters, so tasks that would be trivial with characters (attend to all r’s → count them) become much more complicated even if the model has the information (attend to ‘strawberry’ → find the strawberry word concept → remember the number of r’s).
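To make the contrast concrete, a quick illustration of what the model actually receives (this uses the tiktoken library with its GPT-2 vocabulary; the exact split of "strawberry" depends on the tokenizer, so treat the printed ids as illustrative):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
word = "strawberry"

ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)      # a few opaque integer ids, not letters

# What a character-level view makes trivial:
print(word.count("r"))  # 3
```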
For what it’s worth, Claude does seem to be better at this particular question now (but not at similar questions for other words), so my guess is that it improved because the question is all over the internet and got into the training data.