A few comments:
Thanks for writing this! I’ve had some similar ideas kicking around in my head, but it’s helpful to see someone write them up like this.
I think token deletion is a good thing to keep in mind, but I don’t think it’s correct to say you’re always deleting the token in position 1. In the predict-next-token loop it would be trivial to keep some prefix around, e.g. 1,2,3,4 → 1,2,4,5 → 1,2,5,6 → etc. I assume that’s what ChatGPT does, since they have a hidden prompt, and if you could jailbreak it just by overflowing the dialogue box, the DAN people would presumably have found that exploit already. While this is on some level equivalent to rolling the static prefix into the next-token prediction term, I think the distinction is important because it means we could actually be dealing with a range of dynamics depending on the prefix.
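To make that concrete, here’s a minimal sketch of a generation loop that trims from after a protected prefix rather than from position 1. `model_next_token` is a hypothetical stand-in for whatever actually predicts the next token, and the window sizes are made up; this isn’t a claim about ChatGPT’s real implementation.

```python
# Sketch: keep a protected prefix (e.g. a hidden system prompt) and trim
# the oldest *dialogue* token, never a prefix token, when the window is full.

def generate(prefix, dialogue, model_next_token, max_context=8, n_steps=10):
    dialogue = list(dialogue)
    for _ in range(n_steps):
        next_tok = model_next_token(prefix + dialogue)
        dialogue.append(next_tok)
        # Drop from the start of the dialogue so the prefix never falls out.
        while len(prefix) + len(dialogue) > max_context:
            dialogue.pop(0)
    return dialogue

# Dummy "model" that just emits increasing counters, to show the window shape.
counter = iter(range(100))
print(generate(["<sys>"], ["hi"], lambda ctx: next(counter)))
```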
Editing to provide an example: in your {F, U, N} example, add another token L (for Luigi), which is never produced by the LLM but which, if it is ever in the context window, makes the AI behave as Luigi and predict F 100% of the time. If you trim the context window as you describe, any L will eventually fall out of it and the AI will then tend towards Waluigi. But if you trim from the second token onwards, the sequence (L, F, F, F, F, …, F) is stable. Perhaps the L token could be analogous to the <|good|> token described here.
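Here’s a toy simulation of that example, just to show the difference between the two trimming rules. The dynamics when L is *not* in the window are invented for illustration (any rule with an absorbing Waluigi-ish state would do):

```python
import random

def next_token(window):
    if "L" in window:
        return "F"   # Luigi token in context: always friendly
    if "U" in window:
        return "U"   # toy absorbing Waluigi state
    return random.choices(["F", "U", "N"], weights=[0.8, 0.1, 0.1])[0]

def run(keep_first, window_size=4, steps=50, seed=0):
    random.seed(seed)
    window = ["L"]            # start with the Luigi token
    history = list(window)
    for _ in range(steps):
        tok = next_token(window)
        window.append(tok)
        history.append(tok)
        if len(window) > window_size:
            # Trim from position 1, or from position 2 (keeping L).
            del window[0 if not keep_first else 1]
    return "".join(history)

print("trim from position 1:", run(keep_first=False))  # L falls out; typically absorbs into U
print("trim from position 2:", run(keep_first=True))   # (L, F, F, F, ...) is stable
```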
Nitpicking: Your “00000000000000000....” prompt doesn’t actually max out the prompt window because sixteen 0s can combine into a single token. You can see this at the GPT tokenizer.
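For what it’s worth, this is easy to check with OpenAI’s tiktoken library; the exact grouping of digits depends on which encoding you pick, and the "gpt2" encoding is assumed here as a rough proxy for the online tokenizer page:

```python
# Quick check of the tokenization point: runs of zeros merge into
# multi-character tokens, so a page of zeros uses far fewer tokens
# than characters and won't fill the context window.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
zeros = "0" * 4096
print(len(zeros), "characters ->", len(enc.encode(zeros)), "tokens")
```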
Related: context distillation / prompt compression, perhaps applied recursively too; see Learning to Compress Prompts with Gist Tokens.