I’ve spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I’m tempted to defer to you.
But on the inside view:
The finite context window is really important. 32K is close enough to infinity for most use-cases, but that’s because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation.
Here are some fun things to fill the window with —
A summary of facts about the user.
A summary of news events that occurred after training ended. (This is kinda what Bing Sydney does — Bing fills the context window with the search results for key words in the user’s prompt.)
A “compiler prompt” or “library prompt”. This is a long string of text which elicits high-level, general-purpose, quality-assessed simulacra which can then be mentioned later by the users. (I don’t mean simulacra in a woo manner.) Think of “import numpy as np” but for prompts.
Literally anything other than null tokens.
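The packing idea above can be sketched in a few lines. This is a minimal illustration, not any real API: the section names, the crude word-count tokenizer, and the greedy packing strategy are all assumptions for the sake of the example.

```python
# Sketch: filling a fixed context budget with useful tokens instead of
# leaving it empty. Section contents and the token counter are
# illustrative stand-ins, not a real tokenizer or chat API.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: ~1 token per whitespace word.
    return len(text.split())

def build_context(sections: list[str], budget: int) -> str:
    """Greedily pack sections into the window, highest priority first."""
    packed, used = [], 0
    for section in sections:
        cost = count_tokens(section)
        if used + cost <= budget:
            packed.append(section)
            used += cost
    return "\n\n".join(packed)

context = build_context(
    [
        "LIBRARY PROMPT: definitions of reusable personas and routines.",
        "USER FACTS: prefers terse answers; timezone UTC+1.",
        "NEWS: summary of events after the training cutoff.",
        "USER QUERY: what changed in the market this week?",
    ],
    budget=32_000,
)
```

With a 32K budget everything fits; the interesting regime is when it doesn't and you have to decide which sections earn their tokens.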
I think no matter how big the window gets, someone will work out how to utilise it. The problem is that the context window has grown faster than prompt engineering, so no one has realised yet how to properly utilise the 32K window. Moreover, the orgs (Anthropic, OpenAI, Google) are too much “let’s adjust the weights” (fine-tuning, RLHF, etc), rather than “let’s change the prompts”.
If you’ve spent much time playing with long conversations with LLMs with 1K–8K context windows (early versions of GPT or most current open-source models), then you quickly become painfully aware of token deletion. While modern frontier LLMs have novel-length context windows, their recall does tend to get worse as the current context grows, especially for material that sits neither at the beginning nor the end of the window, a related but distinct effect.
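The token deletion being described is just a sliding window over the conversation: once the history exceeds the window, the oldest tokens silently drop out of the model’s view. A toy sketch (treating whitespace-separated words as "tokens", which is a simplification):

```python
# Token deletion in a small context window: only the most recent
# `window` tokens remain visible; everything earlier is simply gone.

def truncate_to_window(tokens: list[str], window: int) -> list[str]:
    # Keep the `window` most recent tokens, dropping the oldest.
    return tokens[-window:]

history = [f"t{i}" for i in range(10)]   # t0 .. t9
visible = truncate_to_window(history, window=4)
# The model now sees only t6..t9; t0..t5 have been deleted.
```

This is why early-conversation facts stop influencing the model at all in a 1K–8K window, whereas in a long-window frontier model they remain present but are recalled less reliably.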
LLM-∞ wouldn’t suffer from this, but any realistic LLM will, and I agree that there is going to be a motive to fill the context window with useful stuff for any problem where enough useful stuff exists to do so.