Some issues with the ICL Superposition paper

Setup:

Xiong et al. (2024) show that LLMs can in-context-learn several tasks at once. Consider, e.g., the following prompt:
France -> F
Portugal -> Portuguese
Germany -> Berlin
Spain -> Madrid
Russia -> R
Poland -> Polish
Italy ->
A model will complete this prompt sometimes with Rome, sometimes with I, and sometimes with Italian, learning a “superposition” of the country -> capital, country -> first-letter and country -> language tasks. (I wish they hadn’t used this word: the mech interp notion of superposition is unrelated).
Let $D_i$ be the proportion of the in-context examples that correspond to task $i$, and let $P_i$ be the probability that the model completes the prompt according to task $i$. Call a model calibrated if $P_i \approx D_i$: the probability it assigns to a task is proportional to the number of times the task appeared in-context. A measure of the degree of calibration is given by the KL divergence:
$$\mathrm{KL}(D \,\|\, P) = \sum_i D_i \log(D_i / P_i).$$
Lower KL means better calibration.
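To make the definition concrete, here is a minimal Python sketch (my own notation, not the paper's code) that computes $\mathrm{KL}(D\|P)$ from the in-context example counts and a hypothetical set of model task probabilities:

```python
import numpy as np

def calibration_kl(task_counts, task_probs):
    """KL(D || P): D from in-context example counts, P is the model's task distribution."""
    D = np.asarray(task_counts, dtype=float)
    D /= D.sum()                              # empirical task proportions D_i
    P = np.asarray(task_probs, dtype=float)
    return float(np.sum(D * np.log(D / P)))

# Hypothetical example: 3 tasks with 20 in-context examples each,
# and a model whose completions are close to (but not exactly) uniform.
print(calibration_kl([20, 20, 20], [0.30, 0.36, 0.34]))   # ~0.003 -> well calibrated
```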
Issue 1:
The authors show that, for a certain set of 6 tasks, and with 60 in-context examples, calibration improves with model size:
[Fig. 6 from the paper. X-axis labels show the model's parameter count. $D_1$ is the uniform distribution, $D_2$ puts probability 0.5 on the third task and 0.1 on the others, and $D_3$ alternates probabilities between 0.25 and 0.083.]
The best reported KL between the uniform distribution [0.167, 0.167, …, 0.167] and the model’s output distribution is around 1.5. To give a sense for what this means, here are four examples of distributions with such a KL:
$P_1 = [0.004, 0.004, 0.004, 0.33, 0.33, 0.33]$
$P_2 = [0.02, \ldots, 0.02, 0.9]$
$P_3 = [10^{-5}, 0.19999, \ldots, 0.19999]$
$P_4 = [0.004, 0.008, 0.016, 0.032, 0.2, 0.74]$.
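As a quick sanity check (a sketch of my own, using the natural-log convention above), each of these candidate distributions does indeed land at roughly the reported KL of 1.5 against the uniform $D$:

```python
import numpy as np

def kl(D, P):
    D, P = np.asarray(D, float), np.asarray(P, float)
    return float(np.sum(D * np.log(D / P)))

D_uniform = [1 / 6] * 6
examples = {
    "P1": [0.004, 0.004, 0.004, 0.33, 0.33, 0.33],
    "P2": [0.02] * 5 + [0.9],
    "P3": [1e-5] + [0.199998] * 5,            # 0.19999... chosen so the entries sum to 1
    "P4": [0.004, 0.008, 0.016, 0.032, 0.2, 0.74],
}
for name, P in examples.items():
    print(name, round(kl(D_uniform, P), 2))   # each comes out around 1.5
```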
Seeing these distributions convinced me that the models they studied are not well-calibrated. I felt a bit misled by the paper.
Issue 2:
Say we take turns with the model, repeatedly appending a new “question” (e.g. a country) to the prompt and allowing the model to generate an “answer” (e.g. the country’s capital, language, or first letter) at default temperature 1.0. The authors state:
After the first token is generated, the model tends to converge on predicting tokens for a single task, effectively negating its ability for multi-task execution.
But this sort of mode-collapse behavior is not what we would observe from a well-calibrated model!
For simplicity, consider the case of only two tasks: A and B. Let $n_A$ and $n_B$ be the number of examples of each task, prior to a given generation step. A calibrated model generates A with probability $n_A/(n_A+n_B)$, in which case $(n_A, n_B) \leftarrow (n_A+1, n_B)$. Otherwise, it generates B, and $(n_A, n_B) \leftarrow (n_A, n_B+1)$.
This is precisely the Pólya urn model. If the counts of each task are initially $A_0$ and $B_0$, and we then generate for a long time, it can be shown that the limiting proportion of A tasks is a random variable $X$ with density
$$p(x) \propto x^{A_0-1}(1-x)^{B_0-1},$$
i.e. $X \sim \mathrm{Beta}(A_0, B_0)$.
In the realistic setting where $A_0, B_0 > 1$, this distribution is peaked around $\frac{A_0-1}{A_0+B_0-2}$, i.e. roughly the proportion of the initial tasks that were A, and goes to zero at $x=0$ and $x=1$. So calibrated models don't collapse; they tend to maintain roughly the initial ratio of tasks.
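To illustrate, here is a small simulation sketch (mine, not the paper's) of the two-task case. It plays the role of a perfectly calibrated model, picking task A with probability $n_A/(n_A+n_B)$ and updating the counts, which is exactly the Pólya urn dynamics; the run-to-run spread of the final fraction matches the $\mathrm{Beta}(A_0, B_0)$ prediction rather than collapsing toward 0 or 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrated_generation(A0, B0, steps=2000):
    """Calibrated model for two tasks: pick A w.p. nA/(nA+nB), then update the counts."""
    nA, nB = A0, B0
    for _ in range(steps):
        if rng.random() < nA / (nA + nB):
            nA += 1
        else:
            nB += 1
    return nA / (nA + nB)                     # fraction of task A at the end of the run

A0, B0 = 30, 30                               # e.g. 60 in-context examples, split evenly
fractions = [calibrated_generation(A0, B0) for _ in range(1000)]
print(np.mean(fractions), np.std(fractions))  # mean ~0.5, std ~0.06
# Compare with the Beta(A0, B0) limit: mean 0.5, std ~0.064 -- no run collapses to 0 or 1.
```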