Great post. Going through lists of glitch tokens really does make you wonder how these models learn to spell, especially when some of the spellings that come out closely resemble the actual token or share a theme with it. How many times did the model see this token in training? And if it’s a relatively small number of times (as you would expect if the token displays glitchy behaviour), how did it learn to match the real spelling so closely? Nice to see someone looking into this stuff more closely.
In a general sense (not related to glitch tokens), I spent an afternoon playing around with something similar to the spelling task in this article. I asked ChatGPT to show me the number of syllables per word as a parenthetical after each word.
For (1) example (3) this (1) is (1) what (1) I (1) convinced (2) ChatGPT (4) to (1) do (1) one (1) time (1).
I was working on parody song lyrics for a laugh and wanted the meter to match, thinking I could teach ChatGPT to write lyrics that kept the same syllable count per ‘line’ of lyrics.
I stopped when ChatGPT insisted that a three-syllable word had four syllables, then broke the word up into three syllables (with hyphens in the correct places) and still confidently insisted that it had four.
If it can’t even accurately count the number of syllables in a word, then it definitely won’t be able to count the number of syllables in a sentence, or try to match the syllables in a line, or work out the emphasis in the lyrics so that the parody lyrics are remotely viable. (It was a fun exercise, but not that successful.)
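For what it’s worth, that kind of self-contradiction can be caught mechanically without knowing the true syllable count. Below is a minimal sketch, assuming the output uses the "word (n)" format from my example above. The function names and the sample word "syl-la-ble" are made up for illustration (it is not the word ChatGPT got wrong). It only checks that the model’s own numbers are internally consistent, not that they are correct.

```python
import re

# Matches "word (n)" pairs such as "example (3)".
# Assumption: the annotation format is the one shown in the example above.
ANNOTATED = re.compile(r"([^\s()]+)\s*\((\d+)\)")


def parse_annotated(line):
    """Return [(word, claimed_syllables), ...] for an annotated line."""
    return [(word, int(count)) for word, count in ANNOTATED.findall(line)]


def line_total(line):
    """Sum the claimed syllable counts, e.g. to compare two lyric lines."""
    return sum(count for _, count in parse_annotated(line))


def breakdown_matches(hyphenated, claimed):
    """Check that a hyphenated breakdown has as many parts as the claimed count."""
    parts = [part for part in hyphenated.split("-") if part]
    return len(parts) == claimed


if __name__ == "__main__":
    original = "For (1) example (3)"
    parody = "this (1) is (1) what (1) I (1)"
    # Two lines fit the same meter only if their totals agree.
    print(line_total(original) == line_total(parody))
    # A three-part breakdown contradicts a claimed count of four.
    print(breakdown_matches("syl-la-ble", 4))
```

Comparing `line_total` for an original line and its parody line would flag lines whose claimed counts don’t match, and `breakdown_matches` would flag the "three parts, claims four" case directly.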
Thanks! And in case it wasn’t clear from the article, the tokens whose misspellings are examined in the Appendix are not glitch tokens.