The Geometry of LLM Logits (an analytical outer bound)
1 Preliminaries

| Symbol | Meaning |
|---|---|
| $d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
| $h^{(\ell)}$ | residual-stream vector entering block $\ell$ |
| $r^{(\ell)}$ | the update written by block $\ell$ |
| $W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |
**Additive residual stream.** With (pre-/peri-norm) residual connections, writing $h^{(0)} = r^{(0)}$ for the token + positional embeddings,

$$h^{(\ell)} = h^{(\ell-1)} + r^{(\ell)}, \qquad \ell = 1, \dots, L.$$

Hence the final pre-logit state is the sum of $L+1$ contributions (block 0 = token + positional embeddings):

$$h^{(L)} = \sum_{\ell=0}^{L} r^{(\ell)}.$$
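The additive decomposition can be sketched numerically. The block updates below are random stand-ins for illustration, not outputs of a real Transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 4  # toy residual-stream width and block count

# r[0] plays the role of token + positional embeddings; r[1..L] are block updates.
r = [rng.normal(size=d) for _ in range(L + 1)]

# Unrolling h^(l) = h^(l-1) + r^(l) with h^(0) = r^(0):
h = r[0].copy()
for ell in range(1, L + 1):
    h = h + r[ell]

# The final pre-logit state equals the sum of all L+1 contributions.
assert np.allclose(h, sum(r))
```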
2 Each update is contained in an ellipsoid

**Why a bound exists.** Every sub-module (attention head or MLP)

- reads a LayerNormed copy $u$ of its input, so $\|u\|_2 \le \rho_\ell$, where $\rho_\ell := \gamma_\ell \sqrt{d}$ and $\gamma_\ell$ is that block's learned scale;
- applies a linear map, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $\mathbb{R}^d$.

Because a composition of linear maps and Lipschitz functions is itself Lipschitz, there exists a constant $\kappa_\ell$ such that

$$\|r^{(\ell)}\|_2 \le \kappa_\ell \quad \text{whenever} \quad \|u\|_2 \le \rho_\ell;$$

for instance, $\kappa_\ell$ can be taken as $\rho_\ell$ times the product of the operator norms of the linear maps and the Lipschitz constant of the non-linearity. Define the centred ball (a special case of an ellipsoid)

$$E^{(\ell)} := \{x \in \mathbb{R}^d : \|x\|_2 \le \kappa_\ell\}.$$

Then every realisable update lies inside that ellipsoid:

$$r^{(\ell)} \in E^{(\ell)}.$$
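How a concrete $\kappa_\ell$ could be obtained can be sketched for a toy MLP sub-module. The weights and dimensions below are illustrative, and GELU's Lipschitz constant is taken as approximately 1.13 (the supremum of its derivative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff = 8, 32
W_in = rng.normal(size=(d_ff, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)

rho = 1.0 * np.sqrt(d)  # ||u||_2 <= rho = gamma * sqrt(d), with gamma = 1 here
lip_gelu = 1.13         # approximate Lipschitz constant of GELU

def opnorm(W):
    # Operator (spectral) norm = largest singular value.
    return np.linalg.svd(W, compute_uv=False)[0]

# kappa = product of the linear maps' operator norms, times the activation's
# Lipschitz constant, times the input-norm bound rho (valid since GELU(0) = 0).
kappa = opnorm(W_out) * lip_gelu * opnorm(W_in) * rho

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Empirical check: random inputs with ||u|| <= rho never exceed the bound.
for _ in range(1000):
    u = rng.normal(size=d)
    u = u / np.linalg.norm(u) * rng.uniform(0, rho)
    assert np.linalg.norm(W_out @ gelu(W_in @ u)) <= kappa
```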
3 Residual stream ⊆ Minkowski sum of ellipsoids

Using additivity and Step 2,

$$h^{(L)} = \sum_{\ell=0}^{L} r^{(\ell)} \in \bigoplus_{\ell=0}^{L} E^{(\ell)} =: E_{\text{tot}},$$

where $\bigoplus_\ell E^{(\ell)} = E^{(0)} \oplus \cdots \oplus E^{(L)}$ is the Minkowski sum of the individual ellipsoids, $A \oplus B := \{a + b : a \in A,\ b \in B\}$.
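For the centred balls used here, the Minkowski sum is itself a ball whose radius is the sum of the radii. A quick numerical check illustrates the containment, with toy per-block radii assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
kappas = [0.5, 1.0, 2.0, 0.25]  # toy per-block radii kappa_ell

# Sample r^(ell) from each ball E^(ell) and check the sum lands in E_tot,
# the ball of radius sum(kappas), i.e. the Minkowski sum of the balls.
for _ in range(1000):
    updates = []
    for k in kappas:
        v = rng.normal(size=d)
        updates.append(v / np.linalg.norm(v) * rng.uniform(0, k))
    h = np.sum(updates, axis=0)
    assert np.linalg.norm(h) <= sum(kappas) + 1e-9
```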
4 Logit space is an affine image of that sum

Logits are produced by the affine map $x \mapsto W_U x + b$. For any sets $S_1, \dots, S_m$,

$$W_U\Big(\bigoplus_i S_i\Big) = \bigoplus_i W_U S_i.$$

Hence

$$\text{logits} = W_U h^{(L)} + b \in b + \bigoplus_{\ell=0}^{L} W_U E^{(\ell)}.$$

Because linear images of ellipsoids are ellipsoids, each $W_U E^{(\ell)}$ is still an ellipsoid (now in $\mathbb{R}^V$).
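The distributive identity is just linearity applied pointwise: every element of the left-hand side is $W_U \sum_i s_i = \sum_i W_U s_i$, an element of the right-hand side, and conversely. A minimal numerical illustration with toy shapes (all weights random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 16, 8
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)
kappas = [0.5, 1.0, 2.0]  # toy per-block radii

# Any h in the Minkowski sum of the balls maps to W_U h + b; since each image
# W_U E^(ell) fits in a ball of radius ||W_U||_2 * kappa_ell, the logits stay
# within the summed radius of b.
opnorm = np.linalg.svd(W_U, compute_uv=False)[0]
for _ in range(500):
    updates = [rng.normal(size=d) for _ in kappas]
    updates = [v / np.linalg.norm(v) * k for v, k in zip(updates, kappas)]
    logits = W_U @ sum(updates) + b
    assert np.linalg.norm(logits - b) <= opnorm * sum(kappas) + 1e-9
```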
5 Ellipsotopes

An ellipsotope is an affine shift of a finite Minkowski sum of ellipsoids. The set

$$\mathcal{L}_{\text{outer}} := b + \bigoplus_{\ell=0}^{L} W_U E^{(\ell)}$$

is therefore an ellipsotope.
6 Main result (outer bound)

**Theorem.** For any pre-norm or peri-norm Transformer language model whose blocks receive LayerNormed inputs, the set $\mathcal{L}$ of all logit vectors attainable over every prompt and position satisfies

$$\mathcal{L} \subseteq \mathcal{L}_{\text{outer}},$$

where $\mathcal{L}_{\text{outer}}$ is the ellipsotope defined above.

**Proof.** The containments in Steps 2–4 compose to give the stated inclusion; Step 5 shows the outer set is an ellipsotope. ∎
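An end-to-end sanity check of the theorem can be run on a toy pre-norm stack. Everything below is a hypothetical stand-in for a trained model: random weights, tanh in place of GELU (cleanly 1-Lipschitz with tanh(0) = 0), and the embedding norm assumed bounded by $\rho$ as well:

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_ff, V, L = 8, 16, 32, 3
gamma = 1.0
rho = gamma * np.sqrt(d)

def layernorm(x):
    # Simplified LayerNorm with scalar scale: output norm is exactly gamma*sqrt(d).
    y = x - x.mean()
    return gamma * y / np.sqrt((y ** 2).mean())

# Toy pre-norm blocks: r = W_out @ tanh(W_in @ LN(h)).
blocks = [(rng.normal(size=(d_ff, d)) / np.sqrt(d),
           rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)) for _ in range(L)]
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)

opnorm = lambda W: np.linalg.svd(W, compute_uv=False)[0]
kappas = [rho]  # embedding contribution, assuming ||h^(0)|| <= rho
kappas += [opnorm(Wo) * 1.0 * opnorm(Wi) * rho for Wi, Wo in blocks]
radius = opnorm(W_U) * sum(kappas)  # outer ball around b containing L_outer

for _ in range(200):
    h = rng.normal(size=d)
    h = h / np.linalg.norm(h) * rho  # embedding, norm-bounded by rho
    for Wi, Wo in blocks:
        h = h + Wo @ np.tanh(Wi @ layernorm(h))
    logits = W_U @ h + b
    assert np.linalg.norm(logits - b) <= radius + 1e-9
```

The final assertion checks the crude ball relaxation of $\mathcal{L}_{\text{outer}}$; the ellipsotope itself is a tighter set inside that ball.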
7 Remarks & implications

**It is an outer approximation.** Equality $\mathcal{L} = \mathcal{L}_{\text{outer}}$ would require showing that every point of the ellipsotope can actually be realised by some token context, which the argument does not provide.

**Geometry-aware compression and safety.** Because $\mathcal{L}_{\text{outer}} - b$ is convex and centrally symmetric, one can fit a minimum-volume outer ellipsoid to it, yielding norm-based regularisers or robustness certificates against weight noise / quantisation.

**Layer-wise attribution.** The individual sets $W_U E^{(\ell)}$ bound how much any single layer can move the logits, complementing "logit-lens"-style analyses.

**Assumptions.** LayerNorm guarantees $\|u\|_2$ is bounded; Lipschitz (but not necessarily bounded) activations such as GELU and SiLU then give finite $\kappa_\ell$. Architectures without such norm control would require separate analysis.
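As a sketch of the attribution bound: the image $W_U E^{(\ell)}$ of the ball of radius $\kappa_\ell$ fits inside a ball of radius $\|W_U\|_2\,\kappa_\ell$, so no single layer can move the logits further than that. Toy numbers, assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 32, 8
W_U = rng.normal(size=(V, d))
kappas = [0.5, 1.0, 2.0]  # toy per-layer radii kappa_ell

# Per-layer bound on logit movement: ||W_U||_2 * kappa_ell.
opnorm = np.linalg.svd(W_U, compute_uv=False)[0]
per_layer_bound = [opnorm * k for k in kappas]

# Empirical check: random in-ball updates never move logits further.
for _ in range(500):
    for k, bound in zip(kappas, per_layer_bound):
        v = rng.normal(size=d)
        v = v / np.linalg.norm(v) * k
        assert np.linalg.norm(W_U @ v) <= bound + 1e-12
```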