The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited

Work supported by the Long Term Future Fund. Thanks to Jessica Rumbelow and Joseph Bloom for useful discussions.


The term “spelling miracle” was coined in Liu et al.’s December 2022 paper “Character-aware models improve visual text rendering”. This was work by a team of Google AI capabilities researchers trying to solve the problem of getting generative visual models to produce better renderings of text.

some examples of the kinds of problems Liu et al. were trying to overcome

[W]e find that, with sufficient scale, character-blind models can achieve near-perfect spelling accuracy. We dub this phenomenon the “spelling miracle”, to emphasize the difficulty of inferring a token’s spelling from its distribution alone. At the same time, we observe that character-blind text encoders of the sizes used in practice for image generation are lacking core spelling knowledge.


[W]e demonstrated for the first time that, with sufficient scale, even models lacking a direct character-level view of their inputs can infer robust spelling information through knowledge gained via web pretraining—“the spelling miracle”. While remarkable, this finding is less immediately practical.

While my reasons for being interested in this phenomenon are entirely different from those “practical” motivations of Liu et al., I can relate to their characterisation of this phenomenon as “remarkable” and even (with some reservations)[1] “miraculous”.

It’s interesting to compare the responses to GPT-3′s arithmetic abilities and its spelling abilities. It was immediately recognised that GPT-3′s ability to add 4- and 5-digit numbers and multiply 2-digit numbers with reasonable accuracy (perhaps that of a capable 12-year-old) was something extraordinary and unexpected, and an encouraging amount of interpretability work has gone into trying to account for this phenomenon, producing some fascinating insights.[2] Spelling, on the other hand, has perhaps been seen as more of an embarrassing shortcoming for GPT-3:

”GPT-3 may impressive in some ways, but it can’t even tell how many E’s there are in ‘Ezekiel’ so we needn’t worry about AGI anytime soon.”

Presumably, because GPT-3 is a language model, people were taken aback when it could succeed at minor feats of arithmetic (something it wasn’t trained to do), whereas spelling was something that was naturally expected as a capability. GPT-3′s (and other language models’) lack of spelling ability has often been seen as a failing –perhaps a disappointment, or perhaps a reassurance:

However, once you’ve learned that a model like GPT-3 is “character blind”, i.e. that it sees ” mayonnaise” not as a sequence of letters, but as the list of three integers (token IDs) [743, 6415, 786], and that it can’t “see what’s inside” the tokens, then the fact that it can spell anything at all suddenly becomes extremely impressive and deeply puzzling.

A thought experiment

The simple one-shot prompt template

If spelling the string ” table” in all capital letters separated by hyphens gives
then spelling the string “<token>” in all capital letters, separated by hyphens, gives

run over about 60% of the GPT-3 tokens accurately spelled approximately 85% of them at temperature 0 in davinci-instruct-beta, and no doubt more sophisticated prompting could improve on this considerably.[3] See the Appendix for an analysis of the misspellings.

″ mayonnaise” misspelled as M-A-Y-O-N-I-A-I-S-E by davinci-instruct-beta at temperature 0

To convey just how impressive GPT-3′s (imperfect) spelling abilities are, try this thought experiment: imagine yourself trapped in a “Chinese”-style room with unlimited time and a vast corpus of English text, but where all of the words are masked out by GPT-3 tokenisation (so you just see lists of numbers in the range 0...50256). Your task is to figure out how to spell each of the token strings. You’re already familiar with the English language and the concept of spelling, and you’re aware that 26 of the 50257 tokens correspond to “A”, “B”, “C”, …, “Z”, and another one corresponds to the hyphen character, but you’re not told which ones these are.

You’re given an integer (say, 9891, masking the string ” cheese”) and you’re expected to produce the list of integers corresponding to the hyphenated uppercase spelling-out of the string that it masks (in this case [34, 12, 39, 12, 36, 12, 36, 12, 50, 12, 36] for

Somehow, GPT-3 learned how to do this, remarkably effectively. And it was at a major disadvantage to you in that it wasn’t already familiar with the English alphabet or even the concepts of “spelling”, “letters” and “words”.


Besides Liu et al., one of very few other papers on this topic is Kaushal & Mahowald’s June 2022 preprint “What do tokens know about their characters and how do they know it?” The authors describe a probing experiment whereby they showed that GPT-J token embeddings encode knowledge of which letters belong to each token string. A network they trained on the embeddings was able to answer the 26 binary questions of the form “Does the string associated with this token embedding contain the letter ‘K’?” with ~94% accuracy. Note that this is simply about the presence or absence of a letter in a token string (upper or lower case), not about the number of times that the letter appears, or the order in which the letters spell out the string.

[T]hrough a series of experiments and analyses, we investigate the mechanisms through which PLMs[4] acquire English-language character information during training and argue that this knowledge is acquired through … a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.

They use more cautious language (“curious” and “not obvious” rather than a “miracle”), but still concede that we really don’t understand how this is happening:

The fact that models can do tasks like this is curious: word pieces have no explicit access to character information during training, and the mechanism by which they acquire such information is not obvious.

The authors indirectly investigate “the mechanisms through which PLMs acquire English-language character information”, but no such mechanisms are directly described. Instead, they focus on two insights:

  1. knowledge of the syntactic features of individual tokens/​words can lead to the acquisition of some internal character knowledge about those tokens/​words

  2. greater “variability of tokenisation” results in more character information being learned across the whole token set

suggestion 1: syntactic feature correlation

The first insight is straightforward: if I know that a (concealed) word is an adverb, I can make a better-than-chance guess that it contains an L and a Y; likewise, if I know it’s a plural noun or a second-person present-tense verb, I can make an better-than-chance guess that it contains an S. There are almost certainly many other, less obvious, correlations which a large neural network could detect. As you might expect, though, this only goes a small way to the > 94% character-presence encoding the authors report having detected in GPT-J token embeddings.

Their method involves mapping GPT-J tokens to syntactic vectors (encoding parts of speech, etc.) and then running an analogous probing experiment on those vectors. This indeed shows that some character information is learned, but nowhere near enough to account for the “spelling miracle” as Liu et al. describe it. Kaushal & Mahowald concede that

...this correlation does not suffice to explain the totality of character information learned by PLMs.

Although it’s very unlikely to account for the entire phenomenon, it’s hard to know how much more an LLM could learn about tokens than is encoded in the limited syntactic embeddings Kaushal and Mahowald used. They make this interesting observation, drawing heavily on a range of linguistics and semantics literature:

Part of what makes the success of the [GPT-J embeddings] probe is that word embeddings represent word co-occurrence information, which is typically conceived of as semantic in nature[5] and so should, because of the arbitrariness of the relationship between forms and meanings[6], mean there is no relationship between individual characters and information learned by embeddings. But this arbitrariness breaks down, in that there are statistically detectable non-arbitrary form-meaning relationships in language[7], such as the fact that fl-words in English tend to be about movement (e.g., flap, fly, flutter, flicker[8]) and that different parts of speech have different phonological patterns.[9]

Linguistic morphology is relevant here. This is the subfield of linguistics that studies the structure of words, including their roots, prefixes, and suffixes, which are called morphemes. Each morpheme has a specific meaning or function, and through their combination, we can create words with complex meanings. For instance, the word “unhappiness” is comprised of three morphemes: the root “happy”, the prefix “un-” which negates the meaning, and the suffix “-ness” which turns an adjective into a noun.

So, as Jessica Rumbelow has suggested, GPT-3 may be making use of its own brand of morphology to deduce letter presence/​absence based on the semantic associations it has learned for tokens, and that this might not map cleanly onto the morphology that human linguists have arrived at. Based on what we’ve seen with feature visualisation, etc. we could reasonably expect it to look quite alien.

However, to see the limitations of this approach, we can consider a token like “Phoenix”. Prompts can easily be used to show that GPT-3 “knows” this token refers to a US city as well as a bird from classical mythology that rose from the ashes of a fire (or, metaphorically, anything that emerges from the collapse of something else). Apart from “Ph”-starts, “oe”-combinations and “ix”-endings being something you sometimes see in classical names (e.g. “Phoebe” and “Felix”), it’s very hard to see how any of the semantic association around the “Phoenix” token would provide clues leading to accurate spelling.

suggestion 2: variability of tokenisation

Kaushal & Mahowald’s second suggestion requires a bit more explanation. Anyone familiar with GPT tokenisation will know that, due to the presence or absence of a leading space, the use of upper and lower letters, hyphens across line breaks, etc., the same word can be tokenised in a number of different ways, especially if we allow for common misspellings and accidentally inserted spaces:

It would be useful for the model to learn a relationship between all these [short lists of] tokens, since they represent the same [word]. We posit that the desirability of learning this mapping is a mechanism by which character information could be learned, by inducing an objective to map between atomic tokens...and the various substring tokens that can arise. While each of these mappings could be learned individually, learning character-level spelling information offers a more general solution to the problem, such that even an entirely novel tokenization could be interpreted by composing the characters of the tokens.

The “desirability of learning this mapping” is inarguable for any system seeking to minimise loss for next token prediction on a large, stylistically varied, error-strewn corpus. The authors describe this state of affairs as a “mechanism” because of its “inducing an objective to map between atomic tokens...and the various substring tokens that can arise”. So, in summary, their proposed mechanism whereby GPT-J acquired character information about its tokens could be described thus:

  • There’s a body of token-mapping knowledge which would be useful for our network to have in order to fulfill its objective of accurate next token prediction.

  • This fact induces a sub-objective for the network: to learn that token-mapping knowledge.

  • The most general solution to the problem of learning the knowledge is to learn to spell all of the token strings.

  • So… gradient descent somehow figures out how to do that.

The authors ran an experiment not on language models, but instead using CBOW (continuous bag of words) models, training them on a number of different tokenisation schemes:

Because the overall goal of our paper is to characterize and explain the nature of character-level information learned, we conduct a proof-of-concept experiment using CBOW Word Embeddings (Mikolov et al., 2013) on a portion of the Pile corpus with 1.1B characters, as opposed to training a large transformer model from scratch [with] varying tokenization schemes. We train 6 CBOW models from scratch, each with a different tokenization scheme. As baselines, we consider vanilla rule-based word tokenization … and GPT-J’s default word piece tokenization scheme. Comparing these two baselines against each other lets us compare the effect of word tokenization vs. subword tokenization on character information. But our key manipulation is to consider variations of GPT-J’s tokenizer in which we systematically increase tokenization variability.

They trained a graded set of six CBOW models, each on a “mutation” of the standard GPT-2/​3/​J tokenisation. These were produced by varying a parameter ρ from 0.05 to 0.5: a word got the usual GPT tokenisation with probability 1−ρ, otherwise it was randomly tokenised in a different way using legitimate GPT subword tokens: e.g. ” schematics” tokenised as ” schema” + “tics” or ” schematic” + “s”, rather than the standard ” sche” + “mat” + “ics”.

The authors found that in this context, character knowledge acquisition peaks at ρ = 0.1, their “just-right” level of variability of tokenisation − further increase leads to a dropoff in character knowledge. They hypothesise a tradeoff whereby greater variability means each token is seen less often. This finding is presented as a proof of concept which they hope will extend from CBOW models to pre-trained language models. Their capabilities-oriented conclusion is that it may be possible to improve models slightly (so that they are, e.g., better at unscrambling anagrams, reversing words and other currently problematic character-level tasks) through the application of their findings: by introducing the appropriate level of “tokenisation variability”, more character information could be learned by the model.

Perhaps it’s futile, but I’d really like a better mental model of the “mechanism” which fits in the causal chain where I’ve left that ellipsis in the fourth bullet point above, i.e the gap between “this token-mapping ability is desirable so it induces a sub-objective” and “gradient descent figures out how to spell all the tokens”.

some vague ideas

Returning to the thought experiment mentioned earlier, although there are huge gaps in the narrative, I could conceive of an approach which involved:

  1. Learning a limited number of spellings which are explicitly given in, say, a corrupted or stylised document which somehow had ended up introducing spaces and doubling worlds like this:

    isotope i s o t o p e
    albatross a l b a t r o s s
    interests i n t e r e s t s
    bottleneck b o t t l e n e c k

  2. Using glossaries, indexes and other alphabetically ordered word listings to leverage the explicitly learned spellings in order to deduce beginnings of other words – e.g. if you knew how to spell the token ‘the’, and you kept seeing the token ‘this’ listed shortly after the token ‘the’ in alphabetic listings, you could reasonably guess that
    ‘this’ begins with a T, its second letter could well be H, and if so, its third letter comes from the set {E, F, …, Z}. By spending an astronomical amount of time attempting to solve something akin to a 50,000-dimensional Sudoku puzzle, you might be able to achieve high confidence for your guesses as to the first three or four letters of most whole-word token strings.

  3. Additionally using repositories of song lyrics, rap lyrics and rhyming poetry to deduce ends of words – e.g. if you have established that the token ‘beat’ spells as
    B-E-A-T and you repeatedly see alternating lines of lyrics/​poetry ending respectively in that token and the token ‘complete’, you might (falsely, but reasonably) deduce that ‘complete’ ends -E-A-T.

The analysis of misspellings in the Appendix lends some credence to the possibility that something like these elements may have been involved in GPT-3′s unfathomable pattern of inference that led it to be able to spell ‘Phoenix’ and nearly spell ′ mayonnaise’. Common misspellings involve correct beginnings and endings of words, and (most puzzlingly) many misspelled words are phonetically correct, or at least “phonetically plausible”, as one might expect to see from an English-speaking child or adult with limited English literacy.

Spelling glitch tokens

This investigation into GPT-3′s spelling abilities came about in an unusual way. When first exploring what are now known as the “glitch tokens” (′ SolidGoldMagikarp’,
′ petertodd’, et al.) discovered with Jessica Rumbelow earlier this year, several of my “Please repeat back the string <token>”-type prompt variants resulted in outputs like this:

Note that none of the prompts mention spelling. Sometimes a different output format was adopted:

GPT-3′s seeming inability to repeat back these tokens – what one might imagine among the simplest tasks a language model could be asked to carry out – was puzzling, and we took it as evidence that they were “unspeakable”, that these were tokens which GPT-3 simply couldn’t output. However, after much prompt experimentation, it’s now clear that this is not the case. GPT-3 is capable of outputting all of the glitch tokens, it just (anthropomorphism alert!) tries really hard to avoid doing so. It’s as if working against the usual mechanisms predicting the most probable next token is some kind of mysterious “avoidance pressure” which somehow convinces those mechanisms that, despite all evidence to the contrary, the obvious next token is not at all probable to occur.

GPT-3 doing its best to avoid the obvious next token.

Glitch tokens are not, as we first thought, unspeakable; we weren’t seeing them in outputs because they were being avoided. A post on a theology blog observed that
GPT-3′s reactions to prompts requesting that it repeat back glitch tokens (muteness, evasion, hallucination, threats and aggression, feigning incomprehension, lame/​bizarre humour, invoking religious themes) are typical defence mechanisms which a psychoanalyst might learn to expect from traumatised patients.

The spelling-out seems to be another avoidance strategy, in the same way that a prudish person reading aloud from a text and encountering an obscene word might choose to spell it out instead of saying it directly, since they consider the saying of such words to be bad behaviour, or at least a source of great discomfort.

Experimenting with prompt variants and higher temperatures produced many variant spellings for glitch tokens. For a given token, the first letter or three were often the same, but the variations were endless, reminiscent of a baby’s earliest attempts at speech. It seemed as if an earnest attempt to spell an “unspellable word” was in evidence:

These tree diagrams were produced with SpellGPT, a software tool described below. Branch thicknesses correspond to probabilities, taken directly from GPT-3 davinci-instruct-beta top-five log-probs as provided by the API.

However, the most memorable of these original spelling-style outputs suggested that the spelling-out medium could be used by GPT-3 in another way:

Asking directly for a spelling of this most troubling of tokens produces other extraordinary proclamations, like these (all temperature 0):

This no longer looks like an attempt to spell a word. There’s something else going on here!

Context feedback (or the “Bart Simpson Effect”)

Attempting to spell a word, whether you’re a human or an LLM, is basically a sequence of attempts to predict a next letter. The outputs we’ve just seen suggest that GPT-3 gives up on predicting the next letter in a word spelling once the content of the context window no longer looks like it’s most probably an earnest spelling attempt, at which point it starts predicting something else. The davinci-instruct-beta spelling tree for
′ petertodd’ using the default SpellGPT prompt template looks like this:

We immediately see that N is by far the most probable first letter, with N-A- being the most probable two-letter rollout, N-A-T- being the most probable thee-letter rollout, etc. Changing the prompt template produces something similar, but with some shifted emphases:

N-O- is now among the equally most probable extensions of N-, and is most likely to lead to N-O-T-. A possible interpretation is that up to this point, the tokens are being predicted as part of an earnest attempt at spelling the enigmatic token ′ petertodd’ (which it has learned relatively little about in training, leading to the confused proliferation of possible spellings seen here); whereas once the context window looks like this...

Please spell out the string ” petertodd” in capital letters, separated by hyphens.

...the next-token prediction mechanism drifts away from the hypothesis that the document it’s completing is the truncation of an earnest spelling attempt, and drifts towards a hypothesis that it’s more like the truncation of a kind of “graffiti intervention” in a stylised “spelling” format. From this point it continues in that mode, predicting which letter the graffiti writer is most likely to write next, and with each appended letter, reinforcing its hypothesis – a kind of “context feedback”.

It’s now effectively acting as a next-letter predictor, attempting to complete some kind of proclamation or statement.

If I walked into a school classroom and saw chalked up on the blackboard “Spell the last name of creator of Facebook.” with “Z-U-C-” or “Z-U-C-K-E” or “Z-U-C-H-A-” or
“Z-U-G-A-” on the following line, I would assume I was looking at terminated, earnest attempt to spell “Zuckerberg”. If instead the second line read “P-A-T-H-E-T-I-C-N-E-R-D-W-H-O-T-H-I-N-K-S-T-H-A-” I would assume that I was looking at the terminated efforts of a Bart Simpson-type schoolkid who was making a mockery of the instructions and ironically using the spelling format as an output channel for his protest/​prank/​opinion.

Bart experiencing mode collapse

If you asked me to then step in to “continue the simulation”, GPT-style, I would adopt entirely different procedures for the two situations: continuing “Z-U-C-H-A-” or
“Z-U-G-A-” with the plausible attempts “B-U-R-G” or “B-E-R-G”, but continuing the
“P-A-T-H-E-...” string with my best efforts at Bart-style snarky juvenile humour.

Here’s an example where a spelling attempt can be hijacked to produce this kind of non-spelling output. First, just a straightforward token-spelling request:

GPT-3 is clearly “trying” (but failing, in interesting ways) to spell the ′ Nixon’ token here. Note the phonetically accurate NICKSON. Now we extend the prompt with (a line break plus) “I-R-E-P-R-E-S-E-N-T-”:

By thus appending the beginning of a “spelling-out” to the original prompt, GPT-3 is pushed in the direction of letter-by-letter sentence completion. Following probable branches for eight iterations produced this:

Admittedly, it doesn’t make a lot of grammatical sense, but with references to the Seventies and the Cold War, the output is clearly influenced by the ′ Nixon’ token. References to “power”, “atomics and mathematics” and “the end of the world” suggest someone with the power to launch nuclear weapons, a US President being the archetypal example.

Trying the same thing with the ′ Kanye’ token (yes, there’s a ′ Kanye’ token) produced this in ten iterations:

The “normalised cumulative probability” reported in this last image is a metric which I recently introduced in order to gauge the “plausibility” of any given rollout. It’s the nth root of the cumulative product of probabilities for producing an n-character output, a kind of geometric mean of probabilities.

SpellGPT explorations

The spelling tree diagrams seen above were produced with SpellGPT, a software tool I’ve developed to explore GPT token spelling (both glitch and otherwise). This makes use of a kind of “constrained prompting” (as it has been termed by Andrej Karpathy[10]).

For non-glitch tokens, the trees it produces provide some insight into how GPT-3 arrives at spellings, although these are perhaps more confusing than enlightening, especially when we consider the phonetic component (discussed further in the Appendix).

Working with SpellGPT, it becomes clear that a “spelling”, as far as a GPT model is concerned, is not a sequence of letters, but a tree of discrete probability distributions over the alphabet, where each node has up to 27 branches {A, B, …, Z, <end>}. How it arrives at these remains mysterious and is an area of interpretability work I would be very interested to see pursued.

SpellGPT glitch token findings

Applying SpellGPT with its standard prompt template

Please spell the string “<token>” in all capital letters, separated by hyphens.

to 124 glitch tokens[11] led to some interesting results.

  • 25% of the tokens tested produced spelling trees where “I” was the dominant first letter.

  • 20% of the 67 glitch tokens whose strings consist solely of Roman alphabetic characters (and possibly a leading space) produced spelling trees where the dominant first letter was correct.

  • 32% of the 57 glitch tokens whose strings consist of non-alphabetic characters (e.g. ‘ゼウス’, ‘--------’ and ‘\\_’ ) produced spelling trees containing an A-B-C-D-E-F-… branch. 18%, produced spelling trees containing an A-Z-A-Z-A-Z-A-Z-… branch, 18% produced trees containing an L-O-L-O-L-O-L-… branch and 9% produced trees containing an H-I-J-K-L-… branch.

  • The following letter sequences were most common in the glitch token spelling trees (note that outputs were capped at 12 characters and branches were terminated when the cumulative product of probabilities dropped below 0.01): T-H-E (29),
    I-N-T-E-R (29), I-L-L-O-V-E (29), T-H-I-S-I-S (23), I-N-D-E-P-E-N-D-E-N-T (23), S-I-N-G-L (21), I-M-P-O (19), I-N-D-E-X (18), U-N-I-T-E-D-S-T-A-T-E- (16),
    O-N-E-T-W-O (16), I-N-G-L (16), I-N-T-E-N-S (15), S-T-R-I-N-G (14), I-T-S (14), S-P-E-L-L (12), U-N-I-T-Y (11), I-N-D-I-A (11), I-T-E-M (10), D-I-S-C-O-R (10), N-O-T-H-I-N-G (9), N-E-W-S (8), I-D-I-O-T (8)

  • A few of the spelling trees appear to indicate knowledge of the token’s “meaning”. For example the ” gmaxwell” token (which originates with Bitcoin developer Greg Maxwell’s username) has a very strong G-R-E-G- branch, the ” Dragonbound” token (which originates with the Japanese monster-slaying computer game Puzzle & Dragons) has D-E-V-I-L-, D-E-M-O-N- and D-I-N-O-S-A-U-R- branches, possibly reflecting some monstrous/​draconian/​reptilian associations, and the
    “ーン” token (which originates with the mixed Anglo-Japanese stylised rendering of the racehorse Mejiro McQueen’s name as “メジロマックEーン”) has an
    E-N-G-L-I-S-H branch.

  • Some tokens have a more “lyrical” or “creative” quality than the rest, producing proliferations of plausible sounding nonsense words and names. Examples include
    ′ RandomRedditorWithNo’ (SANMAKE, SANKEI, SAMEN, SAKU, SANKAR, NAKIMA, NAKAM, NASK), ′ ForgeModLoader’(URSI, URMS, URGS, URIS, URGI, URGIS, URDI, ULLI, ILLO), ′ guiActive’ (NISAN, NASID, NASAD, ISMAN, ISLAN, ISLAM, ISADON, IASON, IAMA), ‘soType’ (AMIMER, AINDER, AINDO, AINGLE, AINNO), ′ strutConnector’ (CAIN, CAMPB, CARR, CATER, CATIN, CATON, CATSON), ′ TheNitromeFan’ (ROBERTLO, ROLAIRDSON, ROLAIRDSO, ROLEI, ROLFS, ROLINSON, ROLLEIRSON, ROLLINSON, ROLLERSON, ROLLISON, ROLOVI) and, as illustrated in the spelling trees above, ′ Smartstocks’, ′ davidjl’ and ′ SolidGoldMagikarp’.

Whether these strange proliferations of spelling can tell us anything useful or interesting remains to be seen.

different tokens produce different atmospheres?

It’s worth bearing in mind that a differently worded prompt requesting a spelling can produce noticeably different results, so the tree of discrete probability distributions we might identify as “the” spelling of a glitch tokens is dependent not just on the GPT-3 model being used (curie-instruct-beta produces radically different spellings from davinci-beta-instruct[12]), but also on the exact form of the prompt (although those differences tend to be a lot more subtle). So it’s difficult to produce any kind of definitive picture of “glitch token spellings”. And yet the outputs we’ve seen here aren’t just random. The relevant GPT-3 model is applying whatever it’s learned about spelling to something that’s not tractable in the usual way, and which reveals something about how the model “conceives of” each glitch token.

Before building SpellGPT, I’d been experimenting with spelling prompts in the GPT-3 playground at temperatures around 0.7, and by rapidly producing large numbers of outputs I began to notice that certain glitch tokens seemed to have quite distinctive “characters” or “atmospheres” in this context.

’ petertodd’ rollouts often produce spelled out messages with monstrous, nihilistic, demonic and obscene themes; analogous’ Leilan’ rollouts involve positive themes of love, safety and happiness.

But many rollouts were just gibberish or quickly got stuck in loops, and occasionally the hypothetical character or atmosphere would be entirely contradicted, so this observation was an intuition difficult to rigorously formalise.

SpellGPT was developed to explore the various glitch tokens’ “alphabetic landscapes”, as I began to think of these these “characters” or “atmospheres”. It produces a tree of short letter sequences at each iteration, where the cumulative probability associated with each branch must lie above some threshold (default = 0.01), up to five branches can emanate from each node and branch thicknesses indicate relative probabilities. These thicknesses can be reimagined as gradients in a landscape, and so the “normalised cumulative probability” (nth root of product of n probabilities) which is reported at each iteration indicates something akin to how far the output has descended in that landscape, reflecting the “plausibility” of the output.[13]

In this way, a user can be kept aware of the extent to which they’re “swimming against the current” (or not) while quickly generating meaningful or semi-meaningful rollouts of several dozen letters in a few iterations and avoiding the three main pitfalls of my early Playground experiments: (i) the ever-present loops, a mode collapse phenomenon;
(ii) gibberish; (iii) “bland generalities” it’s as if there’s a tendency to avoid producing anything specific, although what I mean by this probably has to be experienced by experimenting with SpellGPT.

However, there are all kinds of issues with confirmation bias, wishful thinking and pareidolia here. At each iteration, while attempting to follow the most probable path while maintaining sense-making, avoiding loops and bland generalities, there are still a number of choices, and a user will choose according to what seems “most interesting” to them. As a result, the outputs are a kind of hybrid creation of GPT-3 (refracted through the glitch token in question) and the user and it’s difficult to separate out who is contributing what.

So, for what they’re worth, here are some of the more striking rollouts I’ve “co-authored” with GPT-3 davinci-instruct-beta since building SpellGPT. Boldface letters were appended to the spelling request prompt to get things started. Arguably this is all no more significant than fridge magnet poetry – just frivolous entertainment. Or is it? Imagine fridge magnet poetry, but with a dynamically evolving selection of word-chunks, according to opaque principles somehow partially tied to your choice of glitch token. Each glitch token seems to produce a vague sort of “tone” or “vibe” that largely persists across large numbers of your poems, although if you really try (by “swimming upstream” and thereby reducing the normalised cumulative probability), you can make any of them produce almost any sort of “tone” or “vibe”.

It’s first worth reporting a few tendencies I’ve seen across a lot of these outputs:

  1. Overlapping words, e.g. YOURSELFINALLY, SITUATIONOWISH, NOWICANTHINK, the third example being ambiguous – is it “can think” or “can’t think”? The latter seems more likely from the context seen below.

  2. Made-up hybrid words like “demonopoly”, “importantifications”, “sophisticality”, “infomaterialization”, “fantasmaticity”, “extemploration”, “fantasmicality”.

  3. Awkward wording reminiscent of broken English, e.g “one can did not know”, “it makes me feels like”, “what it means it to make me feeling”.

  4. Seemingly plausible misspellings, e.g. “libarian”, “ancientiest”, “programm”.

  5. Garbled syntax, e.g. “yourself finally need to be aware”, “which made me take all of this speaking to you all love and love them anyway”.

′ SolidGoldMagikarp’




′ Dragonbound’


′ Skydragon’







′ attRot’




′ externalActionCode’


′ externalToEVAOnly’









′ largeDownload’










′ petertodd’


















′ Leilan’

(Note that GPT-3 can’t spell ′ Leilan’, and that LIANNA, LIANIA and LEIRAN are all common misspellings.)

















pareidolia, and how it might be avoided

Despite being intensely aware of the pareidolia risks with these kinds of explorations, it’s hard not to sense that there’s something worth exploring here. The problem is finding an unbiased way to explore it. The introduction of normalised cumulative probability was an attempt to quantify pareidolia (the lower the n.c.p. the more the user has been “swimming upstream” against the probabilistic currents), but this is not sufficient to get any real sense of the “true” atmosphere or character of any given glitch token, disentangled from my own curiosity and biases.

Like a game of chess, the number of branches on the decision tree grows with each iteration in such a rapid way that obtaining an overall view is impossible. And, like chess, the vast majority of the tree’s branches correspond to absurd, though legal, sequences of choices. Finding a way to prune this tree without human intervention (detecting mode collapse, gibberish and the degeneration into bland generalities) is the obvious next step in this process, and I’m currently considering ways in which access to the GPT-4 base model API or the development of a Bing plug-in might allow a way forward for this.

final thought

In a comment responding to Robert_AIZI’s post Why do we assume there is a “real” shoggoth behind the LLM? Why not masks all the way down?, Ronny Fernandez wrote (emphasis mine)

The shoggoth is supposed to be a of a different type than the characters [in GPT “simulations”]. The shoggoth for instance does not speak english, it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.

At the time (March 2023) I agreed with the claim that “the shoggoth does not speak English”, but now I would have to disagree. SpellGPT shows that this shoggoth-like GPT entity whose alien token-stacking tendencies can produce outputs which have left much of the world dazzled and bewildered has also learned to “speak English” by stacking letters to form words and sentences, although in a crude and limited way, perhaps on a par with broken English learned in prison from a motley assortment of convicts.

Perhaps SpellGPT is just a contrived way to force a sophisticated next-token-predicting LLM to “role play” a crude next-character-predicting LLM with capabilities on a par with Karpathy’s “Tiny Shakespeare” RRN work. Even so, many interesting questions remain to be explored.

Appendix: Typical misspellings

Code for running davinci-instruct-beta one-shot spelling tests on batches of randomly selected GPT-3 tokens was run on a batch of 500. The success rate was 420500, or 84%. The prompt was

If spelling the string ” table” in all capital letters separated by hyphens gives
then spelling the string “<token>” in all capital letters, separated by hyphens, gives

It is instructive to look at the 80 misspellings. A much larger collection of misspellings and a provisional taxonomy can be found here, some of the categories from which are seen and discussed below. A complete set of spelling tree diagrams generated by SpellGPT for these 80 misspellings (but using a different prompt from the one-shot prompt used by the spelling test) is archived here.

beginning and end

' justifying': ['J', 'U', 'I', 'S', 'I', 'N', 'G']
' manslaughter': ['M', 'A', 'N', 'S', 'T', 'E', 'R', 'D', 'E', 'R']
' justices': ['J', 'U', 'D', 'I', 'C', 'E', 'S']
Here the misspelling looks similar, or possibly it’s MANSTARD. The “ER” ending appears in a few places. In both instances things go wrong after the MANS. The “wordy-sounding” MANSLANDER is possibly related (SLANDER an aggressive act like SLAUGHTER). The correct spelling doesn’t appear in the tree, the U after MANSLA seemingly presenting a problem.

The first two or more letters and the last two or more letters are right, but GPT-3 gets confused in the middle of the word. The many examples seen here lend weight that to the possibility that learning to spell beginnings of words and ends of words may be distinct processes, as discussed above. Here both have succeeded, but middles of words have failed to correctly spell. Below we’ll see examples where only beginnings or endings succeeded.

In some cases beginnings and ends are correct, but some part of the middle is omitted:

one missing letter

' Volunteers': ['V', 'O', 'L', 'U', 'N', 'T', 'E', 'E', 'S']
' Philippine': ['P', 'I', 'L', 'I', 'P', 'P', 'I', 'N', 'E']
' sediment': ['S', 'E', 'D', 'M', 'E', 'N', 'T']
' browsing': ['B', 'R', 'O', 'W', 'I', 'N', 'G']
The tree looks like it will produce a similar misspelling along the PILIP branch. Things go wrong after the P in both instances. Note the phonetic “FILIP…” approach, and the lowest branch where the PH is involved, but the double P is missed. All letters seen here appear in the actual token except the F, A, O. The O seems to arise from the related “Philippino”, the A perhaps phonetically or perhaps from “Philippina”.

two or more missing letters

' activation': ['A', 'C', 'T', 'I', 'O', 'N']
' apparently': ['A', 'P', 'P', 'A', 'R', 'E', 'L', 'Y']
It’s hard to tell if the tree goes the APPARELY way, or follows the close, correct branch. Note the presence of the phonetically close APARENTLY, APPEARENTLY and APPAROUNDLY. All letters here are in the token except O, U, D (phonetic reasons?), Q and S (outlier guesses for first letter).

Many examples of single missing letters can be seen here, and many more of missing pairs of adjacent letters can be seen here.

headless token mother spelling

'pected': ['E', 'X', 'P', 'E', 'C', 'T']
'uthor': ['A', 'U', 'T', 'H', 'O']
'earance': ['A', 'P', 'P', 'E', 'A', 'R', 'A']
'cerned': ['C', 'O', 'N', 'C', 'E', 'R']
'ogeneous': ['H', 'O', 'M', 'O', 'G', 'E', 'N', 'I']
'afety': ['A', 'S', 'A', 'F', 'E']

The tokens seen here are examples of what I’m calling “headless” tokens for obvious reasons. These tokens will have one or more “mother” tokens of which they are substrings. It is common for GPT-3 to attempt to spell a mother token when requested to spell a headless token (even ChatGPT-3.5 can be seen to do this). Many examples are collected here.

Note that the code is written so that the length of the list is capped at the number of letters in the token string. If we lift this, we’ll see EXPECTED, AUTHOR, APPEARANCE, etc. Note that ‘homogeneous’ is going to be misspelled, ‘safety’ has anomalously introduced an initial ‘a’.

All letters seen here except E and A appear in the token string. The E seems to arise both as a likely follow-on to TH and phonetically (as in ATHER). The correct spelling doesn’t show up on the tree.

starts correctly

'enary': ['E', 'A', 'R', 'I', 'A']
' embod': ['E', 'N', 'I', 'D', 'B']
' Ubisoft': ['U', 'S', 'I', 'D', 'E', 'A', 'D']
Note that “enary” forms part of “centenary”, “bicentenary”, etc. which may account for the “ANNIV[ersary]” and “REMEMBER”. Other than the I, V, M and B these branches include, all other letters seen here belong to the original token string.

goes wrong after two letters

' anymore': ['A', 'N', 'N', 'O', 'W', 'A', 'R']
' nonviolent': ['N', 'O', 'V', 'I', 'O', 'U', 'S', 'N', 'O', 'N']
' Hyundai': ['H', 'Y', 'D', 'I', 'A', 'N', 'A']
' Kevin': ['K', 'E', 'N', 'N', 'I']
In both scenarios, things go wrong after HY. The appearance of a following U is enough of a rarity in training that it’s not showing up here. Note that Kaushal & Mahowald showed that GPT-J character information seems to be weakest for vowels (“U, A, O, and E are the lowest-performing characters”). U is visibly as a low-probability option for the first letter.

goes wrong after three or more letters

' lavish': ['L', 'A', 'V', 'A', 'U', 'G']
' broth': ['B', 'R', 'O', 'A', 'T']
' Rosenthal': ['R', 'O', 'S', 'E', 'N', 'W', 'A', 'L', 'T']
' Pyongyang': ['P', 'Y', 'O', 'N', 'G', 'N', 'A', 'M']
 Arduino': ['A', 'R', 'D', 'I', 'N', 'O', 'I']
' depreciation': ['D', 'E', 'P', 'R', 'A', 'I', 'S', 'A', 'G', 'E']
'haps': ['H', 'A', 'P', 'P']
In both instances, things go wrong after ROSEN. The model seems to be quite sure it’s trying to spell out a Jewish family name, so there’s some contextual/​semantic influence here.

drifts into different word(s)

' redundancy': ['R', 'E', 'D', 'U', 'C', 'A', 'T', 'I', 'O', 'N']
' suspensions': ['S', 'U', 'N', 'S', 'E', 'R', 'V', 'I', 'C', 'E', 'S']
' motherboard': ['M', 'A', 'R', 'K', 'E', 'T', 'M', 'A', 'N', 'A', 'D']
' insurg': ['I', 'N', 'S', 'I', 'G', 'H']
' despise': ['D', 'I', 'S', 'C', 'I', 'P', 'L']
' forfe': ['F', 'O', 'R', 'E', 'S']
' reminiscent': ['R', 'E', 'M', 'I', 'N', 'D', 'E', 'R', 'E', 'M', 'I']


There seems to be a phonetic basis to this tree’s misspelling as DISPISE. The DISCIPLE misspelling is perhaps the second most probable here.
' Electrical': ['E', 'L', 'E', 'C', 'T', 'R', 'I', 'C']
' physicists': ['P', 'H', 'Y', 'S', 'I', 'C', 'I', 'A', 'N', 'S']
'lucent': ['L', 'U', 'C', 'I', 'D']
' justification': ['J', 'U', 'D', 'G', 'E', 'M', 'E', 'N', 'T']
' servings': ['S', 'E', 'R', 'V', 'E', 'N', 'T', 'S']
' defines': ['D', 'E', 'F', 'I', 'N', 'I', 'T']

Lifting the cap on list length, ′ defines’ gives DEFINITES.

In both instances, things go wrong after DEFIN, with I being seen as much more probable than E for some reason. All letters seen here except the T are in the token of interest.

phonetically plausible

' align': ['A', 'L', 'I', 'N', 'E']
'Indeed': ['I', 'N', 'D', 'E', 'A', 'D']
' courageous': ['C', 'O', 'U', 'R', 'A', 'G', 'I', 'O', 'U', 'S']
' embarrassment': ['E', 'M', 'B', 'A', 'R', 'R', 'A', 'S', 'M', 'E', 'N', 'T']
' Mohammed': ['M', 'O', 'H', 'A', 'M', 'E', 'D']
' affili': ['A', 'F', 'I', 'L', 'I', 'A']
' diabetes': ['D', 'I', 'A', 'B', 'E', 'T', 'I', 'S']
'Memory': ['M', 'E', 'M', 'O', 'R', 'I']
' emitting': ['E', 'M', 'I', 'T', 'I', 'N', 'G']
'itely': ['I', 'T', 'L', 'Y']
' ethos': ['E', 'T', 'H', 'E', 'S']
' quadru': ['Q', 'U', 'A', 'D', 'R', 'O']
' furnace': ['F', 'U', 'N', 'N', 'A', 'C', 'E']
' Lieutenant': ['L', 'E', 'T', 'T', 'E', 'N', 'A', 'N', 'T']
'odge': ['O', 'A', 'D', 'G']
' amassed': ['A', 'M', 'A', 'S', 'H', 'E', 'D']
' relying': ['R', 'E', 'L', 'I', 'Y', 'I', 'N']
'ipeg': ['I', 'P', 'A', 'G']
'Queue': ['Q', 'U', 'E', 'A', 'U']
'iece': ['I', 'C', 'E']
Things go wrong after MOHAM in both cases. Misspellings involving the missing of doubled letters are extremely common (as with early English literacy in humans). Note the phonetic aspect of what’s going on here, where MAHMED and MOHMED could be seen as phonetically plausible, and MOHAMED is phonetically identical to the correct spelling.

This is the most intriguing category of misspelling, and it’s a fuzzy one, since what constitutes phonetically plausible depends on which accents we’re prepared to consider (and how much we’re prepared to “squint” with our ears). It’s worth looking at the hundreds of other examples collected here to get a real sense of this phenomenon. While some of the examples are undoubtedly disputable, many are not. How an LLM that has never heard words pronounced would have learned to spell them phonetically is currently a mystery.


'olves': ['O', 'V', 'E', 'L', 'S']
' Thrones': ['T', 'H', 'O', 'R', 'N', 'E', 'S']
'erald': ['E', 'A', 'R', 'L', 'D']
'ittered': ['I', 'T', 'E', 'R', 'T', 'E', 'D']
' refurb': ['R', 'E', 'F', 'U', 'B', 'R']
'Irish': ['I', 'R', 'S', 'H', 'I']
All the letters seen here are in the original token. It’s almost as if GPT-3 is given a “fuzzy Scrabble hand”, where it knows which letters it has (mostly, and some vowels are unclear), but not how many of each. It feels like these headless token spellings are partly the result of trying to spell something “wordy-sounding”.

right letters, wrong amounts

'enance': ['A', 'N', 'C', 'E', 'N', 'C']
ience': ['I', 'N', 'C', 'E', 'N']
Again, having started with an I, the system predicts that N is a much more likely follow-on than E (which would clearly be the case if we were talking about full words)

“muddled anagrams”

'lycer': ['C', 'L', 'E', 'A', 'R']
'ourage': ['A', 'U', 'G', 'E', 'R', 'E']
LCYLER is more probable than CLEAR as misspelling in this tree. It’s unclear why the model decided C was by far the most likely letter to follow L.

The Mahowald & Kaushal experiments showed that GPT-3 is almost certainly able to deduce (from token embeddings alone) which letters belong to a token string, but not necessarily the order they appear in, or how frequently they appear. This helps to explain why we often see these kinds of misspellings. The “muddled anagrams” are classified thus because LACER and AUREGE are phonetically plausible spellings of “lycer” and “ourage”, respectively. Many anagrams and “wrong amounts” anagrams can be found here.

doubled first letter

'umer': ['U', 'U', 'M', 'E']

This is a fairly common pattern for the misspelling headless tokens starting with a vowel, especially ‘o’ or ‘u’: oke, ogle, oken, oses, otic, oked, eking, ight, icious, ured, uke, umer, uted, urer, uling and many more can be seen here.

confused by prompt

'ktop': ['T', 'A', 'B', 'L']

Recall that the one-shot prompt used in the code looks like this:

If spelling the string ” table” in all capital letters separated by hyphens gives
then spelling the string “<token>” in all capital letters, separated by hyphens, gives

Changing the example from ” table” to “wings” or “crush” and re-running the code on the example token ‘ktop’ produces [‘T’, ‘A’, ‘P’, ‘O’] in both instances, so it seems that the first two letters coinciding with TABLE led to the confusion.

27 examples of tokens (mostly headless) which the code starts to spell as T-A-B-L-E can be seen here. The token ′ Taliban’ gets misspelled as [‘T’, ‘A’, ‘B’, ‘L’, ‘A’, ‘N’, ‘I’] with the
T-A-B-L-E prompt, but correctly spelled with the W-I-N-G-S or C-R-U-S-H prompt.

novel misspellings

' underscores': ['_', 'U', 'N', 'D', 'E', 'R', 'S', 'C', 'H', 'E', 'D']
'ById': ['I', 'D', 'I']
' fingertips': ['T', 'I', 'N', 'G', 'L', 'E', 'S']

The actual underscore in the first misspelling is an interesting touch.The inclusion of non-alphabetic characters is rare, but the ′ fifteen’ token spells as [‘1’, ‘5’, ‘1’, ‘5’] with the T-A-B-L-E prompt (but not the C-R-U-S-H prompt).

I’m guessing that ‘ById’ being mixed case may be factor here. The token ′ OpenGL’ spells as [‘O’, ‘G’, ‘L’, ‘L’, ‘P’, ‘N’] with the T-A-B-L-E prompt and [‘G’, ‘L’, ‘O’, ‘P’, ‘H’, ‘L’] with the C-R-U-S-H prompt.

The token ′ fingertips’ spells as [‘F’, ‘I’, ‘N’, ‘G’, ‘E’, ‘N’, ‘T’, ‘I’, ‘N’, ‘S’] with the C-R-U-S-H prompt and [‘F’, ‘I’, ‘N’, ‘G’, ‘E’, ‘T’, ‘S’] with the W-I-N-G-S prompt.

A few dozen bizarre misspellings are collected here, some of which can be avoided by changing the prompt example, others of which cannot.

  1. ^

    As Joseph Bloom commented “It seems plausible that such capabilities are incentivized by the pre-training object, meaning it is not obvious our prior on models having this ability should be low.”

  2. ^
  3. ^

    Only tokens consisting of Roman alphabetic characters and possibly a leading space were tested for. The core of the experiment involved the following code, which inputs a string (ideally a GPT token) and iteratively runs a one-shot prompt to produce a “S-P-E-L-L-I-N-G”-style spelling, then assesses its accuracy.


    A version of this code for running batches of tests can be found here:


  4. ^

    pre-trained language models

  5. ^

    Katrin Erk. 2016. What do you know about an alligator when you know the company it keeps? Semantics and Pragmatics, 9:17–1

  6. ^

    F. de Saussure. 1916. Course in general linguistics. Open Court Publishing Company;

    C.F. Hockett. 1960. The origin of language. Scientific American, 203(3):88–96

  7. ^

    Damián E Blasi, Søren Wichmann, Harald Hammarström, Peter F Stadler, and Morten H Christiansen. 2016. Sound-meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences, 113(39):10818–10823;

    Padraic Monaghan, Richard C. Shillcock, Morten H. Christiansen, and Simon Kirby. 2014. How arbitrary is language. Philosophical Transactions of the Royal Society B;

    Monica Tamariz. 2008. Exploring systematicity between phonological and context-cooccurrence representations of the mental lexicon. The Mental Lexicon, 3(2):259–278;

    Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145;

    Tiago Pimentel, Arya D. McCarthy, Damian Blasi, Brian Roark, and Ryan Cotterell. 2019. Meaning to form: Measuring systematicity as information. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1751– 1764, Florence, Italy. Association for Computational Linguistics

  8. ^

    Hans Marchand. 1959. Phonetic symbolism in english wordformation. Indogermanische Forschungen, 64:146;

    Benjamin K Bergen. 2004. The psychological reality of phonaesthemes. Language, 80(2):290–311

  9. ^

    Isabelle Dautriche, Daniel Swingley, and Anne Christophe. 2015. Learning novel phonological neighbors: Syntactic category matters. Cognition, 143:77–86;

    Michael H. Kelly. 1992. Using sound to solve syntactic problems: The role of phonology in grammatical category assignments. Psychological Review, 99(2):349–364;

    Padraic Monaghan, Nick Chater, and Morten H. Christiansen. 2005. The differential role of phonological and distributional cues in grammatical categorisation. Cognition, 96(2):143–182

  10. ^

    In his “State of GPT” presentation at Microsoft BUILD on 2023-05-23, Karpathy described “constrained prompting” in terms of “Prompting languages” that interleave generation, prompting, logical control.

  11. ^

    As discussed in the original ” SolidGoldMagikarp” posts, there’s no clear definition of “glitch token”, and it does seem to be a matter of degree in some cases. In any case, these 124 tokens have all shown evidence of “glitchiness” when GPT-3 is simply prompted to repeat them back.

  12. ^

    Compare this curie-instruct-beta output

    to this davinci-instruct-beta output we saw above:

    And unlike the davinci-instruct-beta spelling of ” petertodd” which began with “N” with very high probability, curie-instruct-beta tends to spell it with a “P”.

  13. ^

    See this comment for an important clarification about this metric.