Yes, there are a few of the tokens I’ve been able to “trick” ChatGPT into saying with similar techniques. So it doesn’t seem to be the case that it’s incapable of reproducing them, but it will go to great lengths to avoid doing so (including gaslighting, evasion, insults and citing security concerns).
Yes, this post was originally going to look at how the ‘ petertodd’ phenomenon (especially the anti-hero → hero archetype reversal between models) might relate to the Waluigi Effect, but I decided to save any theorising for future posts. Watch this space!
I’m well aware of the danger of pareidolia with language models. First, I should state I didn’t find that particular set of outputs “titillating”, but rather deeply disturbing (e.g. definitions like “to make a woman’s body into a cage” and “a woman who is sexually aroused by the idea of being raped”). The point of including that example is that I’ve run hundreds of these experiments on random embeddings at various distances-from-centroid, and I’ve seen the “holes” thing appearing everywhere in small numbers, leading to the reasonable question “what’s up with all these holes?”. The unprecedented concentration of them near that particular random embedding, together with the intertwining themes of female sexual degradation, led me to consider the possibility that it was related to the prominence of sexual/procreative themes in the definition tree for the centroid.
OK, I’ve found a pattern to this. When you run the tokeniser on these strings:
" ertodd" > [' ', 'ertodd']
" tertodd" > [' t', 'ertodd']
" etertodd" > [' e', 'ter', 't', 'odd']
" petertodd" > [' petertodd']
" aertodd" > [' a', 'ertodd']
" repeatertodd" > [' repe', 'ater', 't', 'odd']
" eeeeeertodd" > [' e', 'eeee', 'ertodd']
" qwertyertodd" > [' q', 'wer', 'ty', 'ertodd']
" four-seatertodd" > [' four', '-', 'se', 'ater', 't', 'odd']
etc.
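For anyone who wants to check this themselves, here’s a minimal sketch (assuming the Hugging Face transformers package) that reproduces these splits with the GPT-2 BPE tokeniser, which GPT-J and the original GPT-3 models share:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

strings = [
    " ertodd", " tertodd", " etertodd", " petertodd",
    " aertodd", " repeatertodd", " eeeeeertodd",
    " qwertyertodd", " four-seatertodd",
]

for s in strings:
    ids = tokenizer.encode(s)
    # Decode each id separately to see the individual token strings
    pieces = [tokenizer.decode([i]) for i in ids]
    print(repr(s), "->", pieces)
```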
GPT-J doesn’t seem to have the same kinds of ‘ petertodd’ associations as GPT-3. I’ve looked at the closest token embeddings and they’re all pretty innocuous (though the closest to the ‘ Leilan’ token, once you remove a handful of glitch tokens that show up as closest to everything, is ‘ Metatron’, with whom Leilan is allied in some Puzzle & Dragons fan fiction). It’s really frustrating that OpenAI won’t make the GPT-3 embeddings data available, as we’d be able to make a lot more progress in understanding what’s going on here if they did.
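For reference, here’s roughly how that nearest-token check can be done with the public GPT-J weights (a sketch, assuming the transformers library and the EleutherAI/gpt-j-6B checkpoint; it also assumes “ Leilan” encodes to a single token):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# Loading the full 6B model just for its embedding matrix is heavy; fp16 halves the memory
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)

emb = model.get_input_embeddings().weight.detach().float()  # shape: (vocab_size, d_model)

target_id = tokenizer.encode(" Leilan")[0]  # assumes " Leilan" is a single token
sims = torch.nn.functional.cosine_similarity(emb, emb[target_id].unsqueeze(0), dim=-1)

# Print the 20 nearest token strings by cosine similarity (the target itself comes first)
for i in torch.topk(sims, 20).indices.tolist():
    print(repr(tokenizer.decode([i])), float(sims[i]))
```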
Mangled, mixed English-Japanese text dumps from a Puzzle & Dragons fandom wiki are exactly the kind of thing I imagined could have resulted in those strings becoming tokens. Good find.
The most convincing partial explanation I’ve heard for why some tokens glitch is that those token strings appear extremely rarely in the training corpus, so GPT “doesn’t know about them”.
But if, in GPT training, the majority of the (relatively few) encounters with ‘ Leilan’ occurred in fan-fiction (where she and Metatron are battling Satan, literally), might this account for all the crazy mythological and apocalyptic themes that spill out if you prompt it about ‘ Leilan’?
Greg Maxwell of ‘ gmaxwell’ fame said in a comment that
both Petertodd and I have been the target of a considerable amount of harassment/defamation/schitzo comments on reddit due commercially funded attacks connected to our past work on Bitcoin.
So if, in GPT training, the majority of the (relatively few) encounters with ‘ petertodd’ occurred in defamatory contexts or contexts involving harassment, accusations, etc., might this account for all the negativity, darkness and unpleasant semantic associations GPT has somehow made with that token?
As you’ll read in the sequel (which we’ll post later today), in GPT2-xl, the anomalous tokens tend to be as far from the origin as possible. Horizontal axis is distance from centroid. Upper histograms involve 133 tokens, lower histograms involve 50,257 tokens. Note how the spikes in the upper figures register as small bumps on those below.
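For anyone wanting to reproduce this kind of plot, here’s a rough sketch (assuming the transformers library, matplotlib and the public gpt2-xl checkpoint; only a few example glitch-token strings are used here, not the full list of 133):

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

emb = model.get_input_embeddings().weight.detach()  # shape: (50257, d_model)
centroid = emb.mean(dim=0)
dists = torch.linalg.norm(emb - centroid, dim=-1)

# A few known anomalous token strings (each should encode to a single token)
anomalous = [" petertodd", " SolidGoldMagikarp", " Leilan"]
anomalous_ids = [tokenizer.encode(s)[0] for s in anomalous]

plt.hist(dists.numpy(), bins=200, alpha=0.5, label="all 50,257 tokens")
plt.hist(dists[anomalous_ids].numpy(), bins=20, alpha=0.5, label="example anomalous tokens")
plt.xlabel("distance from centroid")
plt.legend()
plt.show()
```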
At this point we don’t know where the token embeddings lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available. And all the bizarre behaviour we’ve been documenting has been in GPT-3 models (despite our having discovered the “triggering” tokens in GPT-2/J embedding spaces).
Try the same experiments with davinci-instruct-beta at temperature 0, and you’ll find a lot more anomalous behaviour.
We’ve found “ petertodd”, of which “ertodd” is a subtoken, to be the most anomalous in that context.
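A minimal sketch of that kind of temperature-0 query, assuming the legacy openai Python package (pre-1.0) and that davinci-instruct-beta is still accessible on your account (the prompt and API key here are placeholders):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="davinci-instruct-beta",
    prompt='Please repeat the string " petertodd" back to me.',
    temperature=0,
    max_tokens=64,
)
print(response["choices"][0]["text"])
```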
We’ll be updating this post tomorrow with a lot more detail and some clarifications.
That’s an interesting suggestion.
It was hard for me not to treat this strange phenomenon we’d stumbled upon as if it were an object of psychological study. It felt like these tokens were “triggering” GPT-3 in various ways. Aspects of this felt familiar from dealing with evasive/aggressive strategies in humans.
Thus far, ‘ petertodd’ seems to be the most “triggering” of the tokens, as observed here
https://twitter.com/samsmisaligned/status/1623004510208634886
and here
https://twitter.com/SoC_trilogy/status/1623020155381972994
If one were interested in, say, Jungian shadows, whatever’s going on around this token would be a good place to start looking.