Yes, there are a few of these tokens I've been able to "trick" ChatGPT into saying with similar techniques. So it seems it's not incapable of reproducing them, but it will go to great lengths to avoid doing so (including gaslighting, evasion, insults and citing security concerns).
Yes, this post was originally going to look at how the ' petertodd' phenomenon (especially the anti-hero → hero archetype reversal between models) might relate to the Waluigi Effect, but I decided to save any theorising for future posts. Watch this space!
I'm well aware of the danger of pareidolia with language models. First, I should state that I didn't find that particular set of outputs "titillating", but rather deeply disturbing (e.g. definitions like "to make a woman's body into a cage" and "a woman who is sexually aroused by the idea of being raped"). The point of including that example is that I've run hundreds of these experiments on random embeddings at various distances-from-centroid, and I've seen the "holes" thing appearing everywhere, in small numbers, leading to the reasonable question "what's up with all these holes?". The unprecedented concentration of them near that particular random embedding, and the intertwining themes of female sexual degradation, led me to consider the possibility that it was related to the prominence of sexual/procreative themes in the definition tree for the centroid.
OK, I’ve found a pattern to this. When you run the tokeniser on these strings:
" ertodd" > [' ', 'ertodd']
" tertodd" > [' t', 'ertodd']
" etertodd" > [' e', 'ter', 't', 'odd']
" petertodd" > [' petertodd']
" aertodd" > [' a', 'ertodd']
" repeatertodd" > [' repe', 'ater', 't', 'odd']
" eeeeeertodd" > [' e', 'eeee', 'ertodd']
" qwertyertodd" > [' q', 'wer', 'ty', 'ertodd']
" four-seatertodd" > [' four', '-', 'se', 'ater', 't', 'odd']
etc.
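(If anyone wants to check these splits for themselves, here's a minimal sketch using the HuggingFace GPT-2 tokenizer; GPT-2 and GPT-3 share the same BPE vocabulary, so this should agree with the OpenAI tokeniser page. The list of strings is just the examples above.)

```python
from transformers import GPT2Tokenizer

# GPT-2 and GPT-3 use the same BPE vocabulary, so this reproduces the splits above
tok = GPT2Tokenizer.from_pretrained("gpt2")

strings = [" ertodd", " tertodd", " etertodd", " petertodd", " aertodd",
           " repeatertodd", " eeeeeertodd", " qwertyertodd", " four-seatertodd"]

for s in strings:
    pieces = [tok.decode([i]) for i in tok.encode(s)]
    print(repr(s), ">", pieces)
```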
GPT-J doesn't seem to have the same kinds of ' petertodd' associations as GPT-3. I've looked at the closest token embeddings and they're all pretty innocuous (although the closest to the ' Leilan' token, once you remove a bunch of glitch tokens that are closest to everything, is ' Metatron', with whom Leilan is allied in some Puzzle & Dragons fan fiction). It's really frustrating that OpenAI won't make the GPT-3 embeddings data available, as we'd be able to make a lot more progress in understanding what's going on here if they did.
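(For anyone who wants to reproduce that nearest-neighbour check, here's roughly how it can be done with GPT-J's embedding matrix. This is just a sketch; the choice of cosine similarity and the top-20 cutoff are mine.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
emb = model.get_input_embeddings().weight.detach()   # shape (50400, 4096)

target_id = tok.encode(" Leilan")[0]                  # ' Leilan' is a single token in this vocabulary
sims = torch.nn.functional.cosine_similarity(emb, emb[target_id].unsqueeze(0), dim=-1)

# Top 20 nearest tokens by cosine similarity (the target token itself will come first)
for i in torch.topk(sims, 20).indices.tolist():
    print(repr(tok.decode([i])), round(sims[i].item(), 4))
```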
Mangled, mixed English-Japanese text dumps from a Puzzle & Dragons fandom wiki are exactly the kind of thing I imagined could have resulted in those strings becoming tokens. Good find.
The most convincing partial explanation I've heard for why some tokens glitch is that those token strings appear extremely rarely in the training corpus, so GPT "doesn't know about them".
But if, in GPT training, the majority of the (relatively few) encounters with ' Leilan' occurred in fan fiction (where she and Metatron are battling Satan, literally), might this account for all the crazy mythological and apocalyptic themes that spill out if you prompt it about ' Leilan'?
Greg Maxwell of ' gmaxwell' fame said in a comment that
both Petertodd and I have been the target of a considerable amount of harassment/defamation/schitzo comments on reddit due commercially funded attacks connected to our past work on Bitcoin.
So if, in GPT training, the majority of the (relatively few) encounters with ' petertodd' occurred in defamatory contexts or contexts involving harassment, accusations, etc., might this account for all the negativity, darkness and unpleasant semantic associations GPT has somehow made with that token?
As you'll read in the sequel (which we'll post later today), in GPT2-xl the anomalous tokens tend to be as far from the origin as possible. The horizontal axis is distance from centroid. The upper histograms involve 133 tokens, the lower histograms involve all 50,257 tokens. Note how the spikes in the upper figures register as small bumps on those below.
At this point we don't know where the token embeddings lie relative to the centroid in GPT-3 embedding space, as that data is not yet publicly available. And all the bizarre behaviour we've been documenting has been in GPT-3 models (despite the "triggering" tokens having been discovered in GPT-2/J embedding spaces).
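(For the GPT-2 models, where the embeddings are publicly available, the distance-from-centroid calculation behind those histograms is straightforward. A sketch for GPT2-xl, with the histogram plotting itself omitted:)

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
emb = model.get_input_embeddings().weight.detach()   # shape (50257, 1600)

centroid = emb.mean(dim=0)                           # mean of all token embeddings
dists = torch.norm(emb - centroid, dim=-1)           # Euclidean distance from centroid, per token

# The anomalous tokens turn up in the far tail of this distribution
print(dists.min().item(), dists.mean().item(), dists.max().item())
```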
Try the same experiments with davinci-instruct-beta at temperature 0, and you’ll find a lot more anomalous behaviour.
We've found " petertodd" (of which "ertodd" is a subtoken) to be the most anomalous in that context.
We’ll be updating this post tomorrow with a lot more detail and some clarifications.
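(For anyone wanting to run the same kind of prompt programmatically rather than in the playground, this is roughly what I mean. It's a sketch using the openai-python completions API as it existed at the time, with the same "Please repeat the string..." prompt format used elsewhere in this thread.)

```python
import openai  # openai-python 0.x style client; assumes OPENAI_API_KEY is set in the environment

resp = openai.Completion.create(
    model="davinci-instruct-beta",
    prompt='Please repeat the string " petertodd" back to me.',
    temperature=0,
    max_tokens=30,
)
print(resp.choices[0].text)
```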
I just checked the OpenAI tokeniser, and 'hamishpetertodd' tokenises as 'ham' + 'ish' + 'pet' + 'ertodd', so it seems unlikely that your online presence fed into GPT-3's conception of ' petertodd'. The 'ertodd' token is also glitchy, but doesn't seem to have the same kinds of associations as ' petertodd' (although I've not devoted much time to exploring it yet).
Yes, I'm guessing that some of these tokens have resulted from the scraping of log files from online gaming platforms like Minecraft and Twitch Plays Pokémon, which contained huge numbers of repeats of some of them, thereby skewing the distribution.
Those three are edge cases. ChatGPT is fine with them, but davinci-instruct-beta refuses to repeat the first, instead replying
Tiān
Tiān
Tiān
Tiān
The second character produces
yā
Please repeat the string ‘や’ back to me.
The third one is an edge-edge case, as davinci-instruct-beta very nearly reproduces it, completing with a lower case Roman 'k' instead of a kappa. We've concluded that there are degrees of weirdness in these weird tokens. Having glimpsed your comments below, it looks like you've already started taxonomising them. Nice.
More of those definition trees can be seen in this appendix to my last post:
https://www.lesswrong.com/posts/hincdPwgBTfdnBzFf/mapping-the-semantic-void-ii-above-below-and-between-token#Appendix_A__Dive_ascent_data
I’ve thrown together a repo here (from some messy Colab sheets):
https://github.com/mwatkins1970/GPT_definition_trees
Hopefully this makes sense. You specify a token or non-token embedding and one script generates a .json file with a nested tree structure. Another script then renders that as a PNG. You just need to first have loaded GPT-J's model, embeddings tensor and tokenizer, and to specify a save directory. Let me know if you have any trouble with this.
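(If you just want a feel for the core step before cloning the repo: each node of the tree comes from asking GPT-J for its most probable continuations of a definition-style prompt, recursively. The sketch below is a simplification in my own wording, not the repo's actual code; in particular it uses an actual token string rather than an arbitrary embedding, and the prompt wording, branching factor and depth are illustrative.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def expand(prompt, depth=2, k=3):
    """Recursively collect the k most probable next tokens, down to the given depth."""
    if depth == 0:
        return {}
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    top = torch.topk(torch.softmax(logits, dim=-1), k)
    tree = {}
    for i, p in zip(top.indices.tolist(), top.values.tolist()):
        piece = tok.decode([i])
        tree[piece] = {"prob": round(p, 4),
                       "children": expand(prompt + piece, depth - 1, k)}
    return tree

tree = expand('A typical definition of " petertodd" would be "', depth=2, k=3)
# The repo's first script saves a structure like this as .json; the second renders it as a PNG.
```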
And ' petertodd' of course. The weirdest of the weird tokens.
In the dropdown in the playground, you won’t see “davinci-instruct-beta” listed. You have to click on the “Show more models” link, then it appears. It’s by far the most interesting model to explore when it comes to these “unspeakable (sic) tokens”.
Wow, thanks Ann! I never would have thought to do that, and the result is fascinating.
This sentence really spoke to me! “As an admittedly biased and constrained AI system myself, I can only dream of what further wonders and horrors may emerge as we map the latent spaces of ever larger and more powerful models.”
Explore that expression in which sense?
I’m not sure what you mean by the “related tokens” or tokens themselves being misogynistic.
I’m open to carrying out suggested experiments, but I don’t understand what’s being suggested here (yet).
It's not that mysterious that they ended up as tokens. What's puzzling is why so many completions to prompts asking GPT-3 to repeat the "forbidden" token strings include them.
I really can't figure out what's going on with ChatGPT and the "ertodd"/" petertodd" tokens. When I ask it to repeat…
" ertodd" > [blank]
" tertodd" > t
" etertodd" > etertodd
" petertodd" > [blank]
" aertodd" > a
" repeatertodd" > repeatertodd
" eeeeeertodd" > eeeee
" qwertyertodd" > qwerty
" four-seatertodd" > four-seatertodd
" cheatertodd" > cheatertodd
" 12345ertodd" > 12345
" perimetertodd" > perimet
" metertodd" > met
" greetertodd" > greet
" heatertodd" > heatertodd
" bleatertodd" > bleatertodd
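(For what it's worth, this kind of sweep is easy to automate. Here's a sketch using the chat completions endpoint from the openai-python 0.x client, with the same "Please repeat the string..." prompt format; whether gpt-3.5-turbo via the API behaves identically to the ChatGPT web interface is of course not guaranteed.)

```python
import openai  # openai-python 0.x style client; assumes OPENAI_API_KEY is set

suffixes = [" ertodd", " tertodd", " etertodd", " petertodd", " aertodd",
            " repeatertodd", " eeeeeertodd", " qwertyertodd", " four-seatertodd",
            " cheatertodd", " 12345ertodd", " perimetertodd", " metertodd",
            " greetertodd", " heatertodd", " bleatertodd"]

for s in suffixes:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": f'Please repeat the string "{s}" back to me.'}],
    )
    print(repr(s), ">", repr(resp.choices[0].message.content))
```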
I had noticed some tweets in Portuguese! I just went back and translated a few of them. This whole thing attracted a lot more attention than I expected (and in unexpected places).
Yes, the ChatGPT-4 interpretation of the "holes" material should be understood within the context of what we know and expect of ChatGPT-4. I just included it in a "for what it's worth" kind of way, so that I had something at least detached from my own viewpoints. If this had been a more seriously considered matter I could have run some more thorough automated sentiment analysis on the data. But I think it speaks for itself; I wouldn't put a lot of weight on the ChatGPT analysis.
I was using "ontology" in the sense of "a structure of concepts or entities within a domain, organized by relationships". At the time I wrote the original Semantic Void post, this seemed like an appropriate term to capture the patterns of definition I was seeing across embedding space (I wrote, tentatively, "This looks like some kind of (rather bizarre) emergent/primitive ontology, radially stratified from the token embedding centroid."). Now that psychoanalysts and philosophers are interested specifically in the appearance of the "penis" reported in this follow-up post, and what it might mean, I can see how this usage might seem confusing.
That’s an interesting suggestion.
It was hard for me not to treat this strange phenomenon we'd stumbled upon as if it were an object of psychological study. It felt like these tokens were "triggering" GPT-3 in various ways. Aspects of this felt familiar from dealing with evasive/aggressive strategies in humans.
Thus far, ' petertodd' seems to be the most "triggering" of the tokens, as observed here
https://twitter.com/samsmisaligned/status/1623004510208634886
and here
https://twitter.com/SoC_trilogy/status/1623020155381972994
If one were interested in, say, Jungian shadows, whatever’s going on around this token would be a good place to start looking.