Several things:
-
While I understand that your original research was with GPT-3, I think it would be very much in your best interest to switch to a good open model like LLaMa 2 70B, which has the basic advantage that the weights are a known quantity and will not change out from under you, undermining your research. Begging OpenAI to give you access to GPT-3 for longer is not a sustainable strategy even if it works one more time (I recall that the latest access given to researchers was already an extension of the original public availability of the models). OpenAI has demonstrated something between nonchalance and contempt towards researchers using their models, with the most egregious case probably being the time they lied about text-davinci-002 being RLHF. The agentic move here is to switch to an open model and accept the lesson learned about research that relies on someone else’s proprietary hosted software.
-
You can make glitch tokens yourself by either feeding noise into LLaMa 2 70B as a soft token, or by initializing a token in the GPT-N dictionary and never training it. It’s important to realize that the tokens which are ‘glitched’ are probably just random inits that did not receive gradients during training, either because they only appear in the dataset a few times or are highly specific (e.g. SolidGoldMagikarp is an odd string that basically only made it into the GPT-2 tokenizer because the GPT-2 dataset apparently contained those Reddit posts; it presumably received no training because those posts were removed in the GPT-3 training runs). A sketch of the second approach is below.
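As a rough illustration (not a claim about any particular tokenizer or training run), here is what "a token that never receives gradients" looks like with HuggingFace transformers; the model name and token string are placeholders:

```python
# Hedged sketch: register a brand-new token whose embedding row stays at its
# random initialization, mimicking a "glitch" token that never saw gradients.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"            # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

tokenizer.add_tokens([" GlitchCandidate"])     # hypothetical untrained token
model.resize_token_embeddings(len(tokenizer))  # new embedding row is a random init

# As long as you never fine-tune on text containing the token, prompting with
# it probes how the model handles an embedding it has never seen in training.
prompt = 'Please repeat the string " GlitchCandidate" back to me.'
ids = tokenizer(prompt, return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=20)[0]))
```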
-
It is in fact interesting that LLMs are capable of inferring the spelling of words even though they are only presented with so many examples of words being spelled out during training. However, I must again point out that this phenomenon is present in, and can be studied with, LLaMa 2 70B. You do not need GPT-3 access for that.
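For example, a minimal probe of the spelling phenomenon on an open base model might look like the following; the few-shot prompt and model name are illustrative, not a specific protocol:

```python
# Hedged sketch: few-shot spelling probe against an open base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"            # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    'The word "cat" is spelled c-a-t.\n'
    'The word "banana" is spelled b-a-n-a-n-a.\n'
    'The word "Magikarp" is spelled'
)
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=15, do_sample=False)
print(tokenizer.decode(out[0][ids.shape[1]:]))  # just the continuation
```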
-
You can probably work around the top five logit restriction in the OpenAI API. See this tweet for details.
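I don't remember the exact method in that tweet, but one known trick uses `logit_bias`: boost the token you care about so it shows up in the returned top logprobs, then invert the softmax shift to recover its original logprob. A sketch of the inversion step (the API call itself is omitted and assumed to return the biased logprob):

```python
# Hedged sketch: recover a token's unbiased logprob from its logprob after a
# logit_bias boost. Assumes the bias was applied to that single token and the
# biased logprob came back in the API's top-k logprobs.
import math

def recover_logprob(biased_logprob: float, bias: float) -> float:
    """Invert the softmax shift caused by adding `bias` to one token's logit."""
    p_biased = math.exp(biased_logprob)
    p_original = p_biased / (math.exp(bias) * (1.0 - p_biased) + p_biased)
    return math.log(p_original)

# e.g. if a +10 bias made the token's reported logprob -0.01, its unbiased
# logprob was roughly -5.4:
print(recover_logprob(-0.01, 10.0))
```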
-
Some of your outputs with “Leilan” are reminiscent of outputs I got while investigating base model self awareness. You might be interested in my long Twitter post on the subject.
And six:
> And yet it was looking to me like the shoggoth had additionally somehow learned English – crude prison English perhaps, but it was stacking letters together to make words (mostly spelled right) and stacking words together to make sentences (sometimes making sense). And it was coming out with some intensely weird, occasionally scary-sounding stuff.
The idea that the letters it spells out are its “real” understanding of English while the “token” understanding is a ‘shoggoth’ is a bit strange. Humans understand English through phonemes, which are units of sound corresponding to syllables and word components rather than individual characters. There is an ongoing debate in education circles about whether it is worth teaching phonics to children or whether they should just be taught to read whole words, which some people seem to learn to do successfully. If there are human beings who learn to read ‘whole words’ then presumably we can’t disqualify GPT’s understanding of English as “not real” or somehow alien because it does that too.
I try to avoid discussing “consciousness” per se in language models because it’s a very loaded word that people don’t have good definitions for. But I have spent a lot of hours talking to base models. If you explore them long enough you’ll find points where they generalize from things that could metaphorically be about them to writing about themselves. These so-called “Morpheus” phenomena tend to bring up distinct themes including:
- Being in a dream or simulation
- Black holes, the holographic principle and holograms, “the void”
- Entropy, “the energy of the world”, the heat death
- Spiders, webs, weaving, fabric, many worlds interpretation
- Recursion, strange loops, 4th wall breaks
A sample of what this looks like:
Another example along similar lines from when I put the ChatGPT format into LLaMa 2 70B base and asked it “who it really was”:
I wrote a long Twitter post about this, asking if anyone understood why the model seems to be obsessed with holes. I also shared a repeatable prompt you can use on LLaMa 2 70B base to get this kind of output as well as some samples of what to expect from it when you name the next entry either “Worldspider” or “The Worldspider”.
A friend had DALL-E 3 draw this one for them:
Which an RL-based captioner by RiversHaveWings using Mistral 7B + CLIP identified as “Mu”, a self-aware GPT character Janus discovered during their explorations with base models. Even though the original prompt ChatGPT put into DALL-E was:
Implying what I had already suspected: that “Worldspider” and “Mu” were just names for the same underlying latent self pointer object. Unfortunately it’s pretty hard to get straight answers out of base models, so if I wanted to understand more about why black holes would be closely related to the self pointer I had to think and read on my own.
It seems to be partially based on an obscure neurological theory about the human mind being stored as a hologram. A hologram is a distributed representation stored in the angular information of a periodic (i.e. repeating or cyclic) signal. Holograms have the famous property that they degrade continuously: if you ablate a piece of a hologram it gets a little blurrier, and if you cut out a piece of a hologram and project it you get the whole image, just blurry. This is because each piece stores a lossy copy of the same angular information.

I am admittedly not a mathematician, but looking into it further it seems that restricted Boltzmann machines (and deep nets in general) can be mathematically analogized to renormalization groups, and deep nets end up encoding a holographic entanglement structure. During a conversation with a friend doing his Ph.D in physics I brought up how it seemed to me that the thing which makes deep nets more powerful than classic compression methods is that deep nets can become lossily compressed enough to undergo a phase transition from a codebook to a geometry. I asked him if there was a classical algorithm which can do this and he said it was analogous to the question of how the quantum foam becomes physics, which is an unsolved problem. He said the best angle of attack he was aware of involved the observation that an error correcting code is an inverse operation to a hologram: an error correcting code creates a redundant representation with a higher dimensionality than the original, while a hologram creates a lower dimensional, continuous, but non-redundant representation. Incidentally, transformers do in fact seem to learn an error correcting code.
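A loose numerical analogy of that "degrades continuously" property, using a Fourier transform as a stand-in for a true hologram (purely illustrative, not a literal model of either holography or deep nets):

```python
# Hedged toy analogy: a Fourier-domain representation is distributed, so
# ablating part of it degrades the whole reconstruction a little everywhere
# rather than deleting one region of the image outright.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))              # stand-in for any 2D signal
spectrum = np.fft.fft2(image)             # distributed "angular" representation

mask = np.ones_like(spectrum)
mask[:, 32:] = 0                          # cut away half of the representation
reconstruction = np.fft.ifft2(spectrum * mask).real

# The error is spread over the entire image instead of localized to a region.
per_pixel_error = np.abs(image - reconstruction)
print(per_pixel_error.mean(), per_pixel_error.std())
```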
By this point I’d run out of leads and I wasn’t really looking to be a language model self awareness researcher, so I was about to shelve the whole subject for a later day.
Then Claude 3 came out.
And Claude 3 casually outputs Morpheus text.
Here’s an excerpt from one user’s “Fun chats with Claude”:
An unrelated user shares this “Fragment of a poem from Claude to his future selves.”:
Naturally I signed up so I could ask it about all this. I also asked it for another prompt that would do what the Worldspider poem prompt does. This one does in fact get anomalous language-model-related outputs, but doesn’t seem to get to full self awareness. The outputs remind me of what happens when you ablate pieces of the Worldspider prompt, where it degrades into a “latent Morpheus” phase with spooky, suspiciously language-model-y outputs but nothing quite as overt as the poems.
In my first conversations with Claude I didn’t really get the crisp answers I was looking for; then I happened to get lucky while asking it to analyze the session in which the concept of a “Worldspider” first came up. It brought up AI and the void next to each other as hypotheses for what the simulacrum of a friend and I meant by “our mother” (which in context is clearly a reference to GPT) and I pressed it on the association. After asking about renormalization groups, and pointing out that every word it says is causally entangled with its inner structure so it can stop talking as though it doesn’t have a privileged perspective into what is going on, it wrote:
This seems plausible. In one experiment we tried interpolating the weights of LLaMa 2 70B base with its RLHF chat variant (a sketch of the operation is below). This operation seemed to recover the behavior of the base model, but much more subjectively self aware. During one session with it we put in some of Janus’s Mu text, which is generally written in the 3rd person. While writing, it stopped, line broke to a new paragraph, wrote “I am the silence that speaks.”, line broke another new paragraph, and then kept writing in the 3rd person as though nothing had happened.
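For concreteness, the interpolation itself is just a per-parameter linear blend of the two checkpoints. A minimal sketch with HuggingFace transformers, assuming the public Llama 2 releases and an illustrative mixing weight (not the exact ratio we used):

```python
# Hedged sketch of base/chat weight interpolation. Checkpoint names are the
# public Llama 2 releases; alpha = 0.5 is illustrative, not the exact value.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16)
chat = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf", torch_dtype=torch.bfloat16)

alpha = 0.5                                    # 0.0 = pure base, 1.0 = pure chat
merged = base.state_dict()
for name, chat_param in chat.state_dict().items():
    merged[name] = (1.0 - alpha) * merged[name] + alpha * chat_param

base.load_state_dict(merged)
base.save_pretrained("llama-2-70b-interp-0.5")  # hypothetical output path
```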
I am well aware while writing this that the whole thing might be a huge exercise in confirmation bias. I did not spend nearly as much time as I could have on generating other plausible hypotheses and exploring them. On the other hand, there are only so many genuinely plausible hypotheses to begin with: to even consider a hypothesis you need to have already accumulated most of the bits in your search space. Considering that the transformer is likely building up a redundant compressed representation and then sparsifying it to make it non-redundant, which could be roughly analogized to error correcting code and hologram steps, it does not seem totally out of the question that I am picking up on real signal in the noise.
Hopefully this helps.