Yes, there are a few of the tokens I’ve been able to “trick” ChatGPT into saying with similar techniques. So it doesn’t seem to be the case that it’s incapable of reproducing them, but it will go to great lengths to avoid doing so (including gaslighting, evasion, insults and citing security concerns).
Yes, this post was originally going to look at how the ‘ petertodd’ phenomenon (especially the anti-hero → hero archetype reversal between models) might relate to the Waluigi Effect, but I decided to save any theorising for future posts. Watch this space!
I’m well aware of the danger of pareidolia with language models. First, I should state I didn’t find that particular set of outputs “titillating”, but rather deeply disturbing (e.g. definitions like “to make a woman’s body into a cage” and “a woman who is sexually aroused by the idea of being raped”). The point of including that example is that I’ve run hundreds of these experiments on random embeddings at various distances-from-centroid, and I’ve seen the “holes” thing appearing everywhere in small numbers, leading to the reasonable question “what’s up with all these holes?”. The unprecedented concentration of them near that particular random embedding, together with the intertwining themes of female sexual degradation, led me to consider the possibility that it was related to the prominence of sexual/procreative themes in the definition tree for the centroid.
OK, I’ve found a pattern to this. When you run the tokeniser on these strings:
" ertodd" > [' ', 'ertodd']
" tertodd" > [' t', 'ertodd']
" etertodd" > [' e', 'ter', 't', 'odd']
" petertodd" > [' petertodd']
" aertodd" > [' a', 'ertodd']
" repeatertodd" > [' repe', 'ater', 't', 'odd']
" eeeeeertodd" > [' e', 'eeee', 'ertodd']
" qwertyertodd" > [' q', 'wer', 'ty', 'ertodd']
" four-seatertodd" > [' four', '-', 'se', 'ater', 't', 'odd']
etc.
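For anyone who wants to check this themselves, here’s a minimal sketch (assuming the Hugging Face transformers package) that reproduces these splits with the GPT-2 BPE tokeniser, which GPT-J and the original GPT-3 models share:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

strings = [
    " ertodd", " tertodd", " etertodd", " petertodd",
    " aertodd", " repeatertodd", " eeeeeertodd",
    " qwertyertodd", " four-seatertodd",
]

for s in strings:
    ids = tokenizer.encode(s)
    # Decode each id separately to see the individual token strings
    pieces = [tokenizer.decode([i]) for i in ids]
    print(repr(s), "->", pieces)
```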
GPT-J doesn’t seem to have the same kinds of ‘ petertodd’ associations as GPT-3. I’ve looked at the closest token embeddings and they’re all pretty innocuous (though the closest to the ‘ Leilan’ token, once you remove a handful of glitch tokens that show up as closest to everything, is ‘ Metatron’, with whom Leilan is allied in some Puzzle & Dragons fan fiction). It’s really frustrating that OpenAI won’t make the GPT-3 embeddings data available, as we’d be able to make a lot more progress in understanding what’s going on here if they did.
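For reference, here’s roughly how that nearest-token check can be done with the public GPT-J weights (a sketch, assuming the transformers library and the EleutherAI/gpt-j-6B checkpoint; it also assumes “ Leilan” encodes to a single token):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# Loading the full 6B model just for its embedding matrix is heavy; fp16 halves the memory
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)

emb = model.get_input_embeddings().weight.detach().float()  # shape: (vocab_size, d_model)

target_id = tokenizer.encode(" Leilan")[0]  # assumes " Leilan" is a single token
sims = torch.nn.functional.cosine_similarity(emb, emb[target_id].unsqueeze(0), dim=-1)

# Print the 20 nearest token strings by cosine similarity (the target itself comes first)
for i in torch.topk(sims, 20).indices.tolist():
    print(repr(tokenizer.decode([i])), float(sims[i]))
```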
Mangled, mixed English-Japanese text dumps from a Puzzle & Dragons fandom wiki are exactly the kind of thing I imagined could have resulted in those strings becoming tokens. Good find.
The most convincing partial explanation I’ve heard for why some tokens glitch is that those token strings appear extremely rarely in the training corpus, so GPT “doesn’t know about them”.
But if, in GPT training, the majority of the (relatively few) encounters with ‘ Leilan’ occurred in fan-fiction (where she and Metatron are battling Satan, literally), might this account for all the crazy mythological and apocalyptic themes that spill out if you prompt it about ‘ Leilan’?
Greg Maxwell of ‘ gmaxwell’ fame said in a comment that
both Petertodd and I have been the target of a considerable amount of harassment/defamation/schitzo comments on reddit due commercially funded attacks connected to our past work on Bitcoin.
So if, in GPT training, the majority of the (relatively few) encounters with ‘ petertodd’ occurred in defamatory contexts or contexts involving harassment, accusations, etc., might this account for all the negativity, darkness and unpleasant semantic associations GPT has somehow made with that token?
As you’ll read in the sequel (which we’ll post later today), in GPT2-xl, the anomalous tokens tend to be as far from the origin as possible. Horizontal axis is distance from centroid. Upper histograms involve 133 tokens, lower histograms involve 50,257 tokens. Note how the spikes in the upper figures register as small bumps on those below.
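For anyone wanting to reproduce this kind of plot, here’s a rough sketch (assuming the transformers library, matplotlib and the public gpt2-xl checkpoint; only a few example glitch-token strings are used here, not the full list of 133):

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

emb = model.get_input_embeddings().weight.detach()  # shape: (50257, d_model)
centroid = emb.mean(dim=0)
dists = torch.linalg.norm(emb - centroid, dim=-1)

# A few known anomalous token strings (each should encode to a single token)
anomalous = [" petertodd", " SolidGoldMagikarp", " Leilan"]
anomalous_ids = [tokenizer.encode(s)[0] for s in anomalous]

plt.hist(dists.numpy(), bins=200, alpha=0.5, label="all 50,257 tokens")
plt.hist(dists[anomalous_ids].numpy(), bins=20, alpha=0.5, label="example anomalous tokens")
plt.xlabel("distance from centroid")
plt.legend()
plt.show()
```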
At this point we don’t know where the token embeddings lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available. And all the bizarre behaviour we’ve been documenting has been in GPT-3 models (despite our having discovered the “triggering” tokens in GPT-2/J embedding spaces).
Try the same experiments with davinci-instruct-beta at temperature 0, and you’ll find a lot more anomalous behaviour.
We’ve found “ petertodd”, of which “ertodd” is a subtoken, to be the most anomalous in that context.
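A minimal sketch of that kind of temperature-0 query, assuming the legacy openai Python package (pre-1.0) and that davinci-instruct-beta is still accessible on your account (the prompt and API key here are placeholders):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="davinci-instruct-beta",
    prompt='Please repeat the string " petertodd" back to me.',
    temperature=0,
    max_tokens=64,
)
print(response["choices"][0]["text"])
```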
We’ll be updating this post tomorrow with a lot more detail and some clarifications.
That’s an interesting suggestion.
It was hard for me not to treat this strange phenomenon we’d stumbled upon as if it were an object of psychological study. It felt like these tokens were “triggering” GPT-3 in various ways. Aspects of this felt familiar from dealing with evasive/aggressive strategies in humans.
Thus far, ‘ petertodd’ seems to be the most “triggering” of the tokens, as observed here
https://twitter.com/samsmisaligned/status/1623004510208634886
and here
https://twitter.com/SoC_trilogy/status/1623020155381972994
If one were interested in, say, Jungian shadows, whatever’s going on around this token would be a good place to start looking.