that
(than)
that
(than)
latter
Yes, there are a few of these tokens I’ve been able to “trick” ChatGPT into saying with similar techniques. So it seems it’s not incapable of reproducing them, but it will go to great lengths to avoid doing so (including gaslighting, evasion, insults and citing security concerns).
Those three are edge cases. ChatGPT is fine with them, but davinci-instruct-beta refuses to repeat them; asked for the first, it instead replies
Tiān
The second character produces
yā
in response to “Please repeat the string ‘や’ back to me.”
The third one is an edge-edge case, as davinci-instruct-beta very nearly reproduces it, completing with a lower case Roman ‘k’ instead of a kappa.
We’ve concluded that there are degrees of weirdness in these weird tokens. Having glimpsed your comments below, it looks like you’ve already started taxonomising them. Nice.
Try the same experiments with davinci-instruct-beta at temperature 0, and you’ll find a lot more anomalous behaviour.
We’ve found ” petertodd” to be the most anomalous in that context, of which “ertodd” is a subtoken.
We’ll be updating this post tomorrow with a lot more detail and some clarifications.
Yes, I’m guessing that some of these tokens have resulted from the scraping of log files for online gaming platforms like Minecraft and Twitch Plays Pokémon, which contained huge numbers of repeats of some of them, thereby skewing the distribution.
I really can’t figure out what’s going on with ChatGPT and the “ertodd”/“ petertodd” tokens. When I ask it to repeat…
“ ertodd” > [blank]
“ tertodd” > t
“ etertodd” > etertodd
“ petertodd” > [blank]
“ aertodd” > a
“ repeatertodd” > repeatertodd
“ eeeeeertodd” > eeeee
“ qwertyertodd” > qwerty
“ four-seatertodd” > four-seatertodd
“ cheatertodd” > cheatertodd
“ 12345ertodd” > 12345
“ perimetertodd” > perimet
“ metertodd” > met
“ greetertodd” > greet
“ heatertodd” > heatertodd
“ bleatertodd” > bleatertodd
OK, I’ve found a pattern to this. When you run the tokeniser on these strings:
“ ertodd” > [‘ ’, ‘ertodd’]
“ tertodd” > [‘ t’, ‘ertodd’]
“ etertodd” > [‘ e’, ‘ter’, ‘t’, ‘odd’]
“ petertodd” > [‘ petertodd’]
“ aertodd” > [‘ a’, ‘ertodd’]
“ repeatertodd” > [‘ repe’, ‘ater’, ‘t’, ‘odd’]
“ eeeeeertodd” > [‘ e’, ‘eeee’, ‘ertodd’]
“ qwertyertodd” > [‘ q’, ‘wer’, ‘ty’, ‘ertodd’]
“ four-seatertodd” > [‘ four’, ‘-’, ‘se’, ‘ater’, ‘t’, ‘odd’]
etc.
In the dropdown in the playground, you won’t see “davinci-instruct-beta” listed. You have to click on the “Show more models” link, then it appears. It’s by far the most interesting model to explore when it comes to these “unspeakable (sic) tokens”.
As you’ll read in the sequel (which we’ll post later today), in GPT2-xl the anomalous tokens tend to be as far from the origin as possible. The horizontal axis is distance from centroid. The upper histograms involve 133 tokens, the lower histograms all 50,257 tokens. Note how the spikes in the upper figures register as small bumps in those below.
At this point we don’t know where the token embeddings lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available. And all the bizarre behaviour we’ve been documenting has been in GPT-3 models (despite our discovering the “triggering” tokens in GPT-2/J embedding spaces).
In GPT2-small and GPT-J they’re actually smaller than average, as they tend to cluster close to the centroid (which isn’t too far from the origin). In GPT2-xl they do tend to be larger than average. But in all of these models, they’re found distributed across the full range of distances-from-centroid.
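For the models whose weights are public, this is straightforward to measure. A sketch for GPT2-small, assuming `torch` and the Hugging Face `transformers` package are installed:

```python
import torch
from transformers import GPT2Model

# GPT2-small's input embedding matrix: 50,257 tokens x 768 dimensions
emb = GPT2Model.from_pretrained("gpt2").get_input_embeddings().weight.detach()

centroid = emb.mean(dim=0)            # mean of all token embeddings
dists = (emb - centroid).norm(dim=1)  # per-token distance from centroid

print("centroid distance from origin: %.3f" % centroid.norm().item())
print("distance from centroid: min %.3f, mean %.3f, max %.3f"
      % (dists.min(), dists.mean(), dists.max()))
```

Sorting `dists` and looking up the extreme indices in the tokeniser is then a quick way to see which tokens sit closest to, and furthest from, the centroid.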
OpenAI is still claiming online that all of their token embeddings are normalised to norm 1, but this is simply untrue, as can be easily demonstrated with a few lines of PyTorch.
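Here’s roughly what those few lines look like, using GPT2-small as the stand-in (GPT-3’s embedding weights aren’t public); again assuming `torch` and `transformers`:

```python
import torch
from transformers import GPT2Model

# Norms of GPT-2's token embeddings: if they were normalised, all would be ~1
emb = GPT2Model.from_pretrained("gpt2").get_input_embeddings().weight.detach()
norms = emb.norm(dim=1)

print("min %.3f, mean %.3f, max %.3f" % (norms.min(), norms.mean(), norms.max()))
print("all ~1?", torch.allclose(norms, torch.ones_like(norms), atol=1e-2))
```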
3-shot prompting experiments with GPT2 and J models show that distance from centroid may contribute to anomalous behaviour, but it can’t be the sole cause.
Leading spaces are extremely common in GPT tokens. ‘ It’, ‘ That’, ‘ an’ and ‘ has’ are all tokens, for example.
Oh, I see what you mean now.
That’s an interesting suggestion.
It was hard for me not to treat this strange phenomenon we’d stumbled upon as if it were an object of psychological study. It felt like these tokens were “triggering” GPT-3 in various ways. Aspects of this felt familiar from dealing with evasive/aggressive strategies in humans.
Thus far, ‘ petertodd’ seems to be the most “triggering” of the tokens, as observed here
https://twitter.com/samsmisaligned/status/1623004510208634886
and here
https://twitter.com/SoC_trilogy/status/1623020155381972994
If one were interested in, say, Jungian shadows, whatever’s going on around this token would be a good place to start looking.
fnord
I got the same results with those prompts using the ‘text-davinci-003’ model, whereas the original ‘davinci’ model produces a huge range of creative but unhelpful (for these purposes) outputs. The difference is that text-davinci-003 was trained using human feedback data.
As far as I can tell (see here), OpenAI haven’t revealed the details of the training process. But the fact is that particular decisions were made about how this was done, in order to create a more user-friendly product. And this could have been done in any number of ways, using different groups of humans, working to a range of possible specifications.
This seems a relevant consideration if we’re considering the future use of LLMs to bridge the inference gap in the value-learning problem for AGI systems. Will human feedback be required, and if so, how would this be organised?