Some issues with the ICL Superposition paper

Setup:

Xiong et al. (2024) show that LLMs can in-context-learn several tasks at once. Consider, e.g., the following prompt:
France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->
A model will complete this prompt sometimes with Rome, sometimes with I, and sometimes with Italian, learning a “superposition” of the country -> capital, country -> first-letter and country -> language tasks. (I wish they hadn’t used this word: the mech interp notion of superposition is unrelated).
Let $D_i$ be the proportion of the in-context examples that correspond to task $i$, and let $P_i$ be the probability that the model completes the prompt according to task $i$. Call a model calibrated if $P_i \approx D_i$: the probability it assigns to a task is proportional to the number of times the task appeared in-context. A measure of the degree of calibration is given by the KL divergence:
$$\mathrm{KL}(D \,\|\, P) = \sum_i D_i \log(D_i / P_i).$$
Lower KL means better calibration.
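To make the definition concrete, here is a minimal Python sketch (my own notation, not the paper's code) that computes $\mathrm{KL}(D\|P)$ from the in-context example counts and a hypothetical set of model task probabilities:

```python
import numpy as np

def calibration_kl(task_counts, task_probs):
    """KL(D || P): D from in-context example counts, P is the model's task distribution."""
    D = np.asarray(task_counts, dtype=float)
    D /= D.sum()                              # empirical task proportions D_i
    P = np.asarray(task_probs, dtype=float)
    return float(np.sum(D * np.log(D / P)))

# Hypothetical example: 3 tasks with 20 in-context examples each,
# and a model whose completions are close to (but not exactly) uniform.
print(calibration_kl([20, 20, 20], [0.30, 0.36, 0.34]))   # ~0.003 -> well calibrated
```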
Issue 1:
The authors show that, for a certain set of 6 tasks, and with 60 in-context examples, calibration improves with model size:
[Fig. 6 from the paper. X-axis labels show the model's parameter count. $D_1$ is the uniform distribution, $D_2$ puts probability 0.5 on the third task and 0.1 on the others, and $D_3$ alternates probabilities between 0.25 and 0.083.]
The best reported KL between the uniform distribution [0.167, 0.167, …, 0.167] and the model’s output distribution is around 1.5. To give a sense for what this means, here are four examples of distributions with such a KL:
$P_1 = [0.004, 0.004, 0.004, 0.33, 0.33, 0.33]$
$P_2 = [0.02, \ldots, 0.02, 0.9]$
$P_3 = [10^{-5}, 0.19999, \ldots, 0.19999]$
$P_4 = [0.004, 0.008, 0.016, 0.032, 0.2, 0.74]$.
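As a quick sanity check (a sketch of my own, using the natural-log convention above), each of these candidate distributions does indeed land at roughly the reported KL of 1.5 against the uniform $D$:

```python
import numpy as np

def kl(D, P):
    D, P = np.asarray(D, float), np.asarray(P, float)
    return float(np.sum(D * np.log(D / P)))

D_uniform = [1 / 6] * 6
examples = {
    "P1": [0.004, 0.004, 0.004, 0.33, 0.33, 0.33],
    "P2": [0.02] * 5 + [0.9],
    "P3": [1e-5] + [0.199998] * 5,            # 0.19999... chosen so the entries sum to 1
    "P4": [0.004, 0.008, 0.016, 0.032, 0.2, 0.74],
}
for name, P in examples.items():
    print(name, round(kl(D_uniform, P), 2))   # each comes out around 1.5
```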
Seeing these distributions convinced me that the models they studied are not well-calibrated. I felt a bit misled by the paper.
Issue 2:
Say we take turns with the model, repeatedly appending a new “question” (e.g. a country) to the prompt and allowing the model to generate an “answer” (e.g. the country’s capital, language, or first letter) at default temperature 1.0. The authors state:
After the first token is generated, the model tends to converge on predicting tokens for a single task, effectively negating its ability for multi-task execution.
But this sort of mode-collapse behavior is not what we would observe from a well-calibrated model!
For simplicity, consider the case of only two tasks: A and B. Let $n_A$ and $n_B$ be the number of examples of each task, prior to a given generation step. A calibrated model generates A with probability $n_A/(n_A+n_B)$, in which case $(n_A, n_B) \leftarrow (n_A+1, n_B)$. Otherwise, it generates B, and $(n_A, n_B) \leftarrow (n_A, n_B+1)$.
This is precisely the Pólya urn model. If the counts of each task are initially $A_0$ and $B_0$, and we then generate for a long time, it can be shown that the limiting proportion of A tasks is a random variable $X$ with density
$$p(x) \propto x^{A_0-1}(1-x)^{B_0-1},$$
i.e. $X \sim \mathrm{Beta}(A_0, B_0)$.
In the realistic setting where $A_0, B_0 > 1$, this distribution is peaked around $\frac{A_0-1}{A_0+B_0-2}$, i.e. roughly the proportion of the initial tasks that were A, and goes to zero at $x=0$ and $x=1$. So calibrated models don't collapse; they tend to maintain roughly the initial ratio of tasks.
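To illustrate, here is a small simulation sketch (mine, not the paper's) of the two-task case. It plays the role of a perfectly calibrated model, picking task A with probability $n_A/(n_A+n_B)$ and updating the counts, which is exactly the Pólya urn dynamics; the run-to-run spread of the final fraction matches the $\mathrm{Beta}(A_0, B_0)$ prediction rather than collapsing toward 0 or 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrated_generation(A0, B0, steps=2000):
    """Calibrated model for two tasks: pick A w.p. nA/(nA+nB), then update the counts."""
    nA, nB = A0, B0
    for _ in range(steps):
        if rng.random() < nA / (nA + nB):
            nA += 1
        else:
            nB += 1
    return nA / (nA + nB)                     # fraction of task A at the end of the run

A0, B0 = 30, 30                               # e.g. 60 in-context examples, split evenly
fractions = [calibrated_generation(A0, B0) for _ in range(1000)]
print(np.mean(fractions), np.std(fractions))  # mean ~0.5, std ~0.06
# Compare with the Beta(A0, B0) limit: mean 0.5, std ~0.064 -- no run collapses to 0 or 1.
```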