Memorization in LLMs is probably Computation in Superposition (CiS, Vaintrob et al., 2024).
CiS is often considered a predominantly theoretical concept. I want to highlight that most memorization in LLMs is probably CiS. Specifically, the typical CiS task of “compute more AND gates than you have ReLU neurons” is exactly what you need to memorize lots of facts. I’m certainly not the first one to say this, but it also doesn’t seem common knowledge. I’d appreciate pushback or references in the comments!
Consider the token “Michael”. GPT-2 knows many things about Michael, including a lot of facts about Michael Jordan and Michael Phelps, all of which are relevant in different contexts. The model cannot represent all of these in the embedding of the token Michael (conventional superposition, Elhage et al., 2022); in fact, if SAEs are any indication, the model can only represent about 30-100 features at a time.
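As a rough numerical sketch of what plain representation in superposition looks like (the width of 768 is borrowed from GPT-2-small, but the feature count and the number of simultaneously active features are made-up illustrative values, not measurements from any model or SAE): far more feature directions than dimensions fit into the residual stream as nearly orthogonal random vectors, and a sparse sum of them can still be read out linearly, with interference noise that grows with the number of active features.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k = 768, 10_000, 40   # residual width, number of features, active features
# 768 matches GPT-2-small; m and k are made up for illustration.

# One random unit vector per feature: many more directions than dimensions,
# but random high-dimensional vectors are nearly orthogonal (Johnson-Lindenstrauss).
F = rng.standard_normal((m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# A residual-stream-style activation: the sum of the k active feature directions.
active = rng.choice(m, size=k, replace=False)
x = F[active].sum(axis=0)

# Linear readout of each feature is its dot product with the activation:
# about 1 for active features, about 0 plus interference noise (~sqrt(k/d)) otherwise.
scores = F @ x
inactive = np.setdiff1d(np.arange(m), active)
print("active mean:  ", scores[active].mean().round(2))
print("inactive mean:", scores[inactive].mean().round(2),
      "std:", scores[inactive].std().round(2))
```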
So this knowledge must be retrieved dynamically. In the sentence “Michael Jordan plays the sport of”, a model has to take the intersection of Michael AND Jordan AND sport to retrieve basketball. Folk wisdom is that this kind of memorization is implemented by the MLP blocks in a Transformer. And since GPT-2 knows more facts than it has MLP neurons, we arrive at the “compute more AND gates than you have ReLU neurons” problem.
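To make the “more AND gates than ReLU neurons” framing concrete, here is a toy sketch in the spirit of the universal-AND style construction discussed by Vaintrob et al.: random sparse binary weights and a fixed bias of -1. All parameters are made up for illustration, and this is a picture of the task, not a claim about how GPT-2 actually implements memorization. A single layer of 512 ReLU neurons supports a (noisy) linear readout of any of the 256·255/2 = 32,640 pairwise ANDs over 256 sparse boolean features:

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 256, 512   # 256 boolean input features, 512 ReLU neurons
p, k = 0.10, 3    # connection density; at most k features active at once
# Readable AND gates: m*(m-1)/2 = 32,640, far more than the 512 neurons.

# Random sparse binary weights: neuron n reads feature i iff W[n, i] == 1.
W = (rng.random((d, m)) < p).astype(float)

def hidden(x):
    # Bias of -1: a neuron only fires once at least 2 of its inputs are active.
    return np.maximum(W @ x - 1.0, 0.0)

def read_and(h, i, j):
    # Estimate AND(x_i, x_j) by averaging the neurons wired to both i and j.
    mask = (W[:, i] > 0) & (W[:, j] > 0)
    return h[mask].mean() if mask.any() else 0.0

stats = {"both active": [], "one active": [], "neither active": []}
for _ in range(2000):
    active = rng.choice(m, size=k, replace=False)
    inactive = rng.choice(np.setdiff1d(np.arange(m), active), size=2, replace=False)
    x = np.zeros(m)
    x[active] = 1.0
    h = hidden(x)
    stats["both active"].append(read_and(h, active[0], active[1]))
    stats["one active"].append(read_and(h, active[0], inactive[0]))
    stats["neither active"].append(read_and(h, inactive[0], inactive[1]))

for name, vals in stats.items():
    print(f"{name:>15}: mean readout {np.mean(vals):.2f}")
# Roughly 1.1 / 0.2 / 0.03 -- noisy, but separable by a threshold, which is the point:
# far more ANDs than neurons, workable because only a few features are active at once.
```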
Agreed, I consider this a key theme in our fact-finding work, especially post 3 (though we could maybe have made this more explicit): https://www.lesswrong.com/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB
@Eliezer Yudkowsky If Large Language Models were confirmed to implement computation in superposition [1,2,3], rather than just representation in superposition, would you resolve this market as yes?
Representation in superposition would not have been a novel idea to computer scientists in 2006. Johnson-Lindenstrauss is old. But there’s nothing I can think of from back then that would let you do computation in superposition: linearly embedding a large number of algorithms on top of each other in the same global vector space, so that they can all be executed fairly efficiently in parallel without wasting a ton of storage and FLOPs, as long as only a few algorithms do anything at any given moment.
To me at least, that does seem like a new piece of the puzzle for how minds can be set up to easily learn lots of very different operations and transformations that all apply to representations living in the same global workspace.