The Geometry of LLM Logits (an analytical outer bound)

1 Preliminaries

| Symbol | Meaning |
| --- | --- |
| $d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
| $x_\ell \in \mathbb{R}^d$ | residual-stream vector entering block $\ell$ |
| $\Delta_\ell(x_\ell) \in \mathbb{R}^d$ | the update written by block $\ell$ |
| $W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |

Additive residual stream. With (pre-/peri-norm) residual connections, each block adds its update to the stream it receives:

$$x_{\ell+1} \;=\; x_\ell + \Delta_\ell(x_\ell), \qquad \ell = 0, 1, \dots, L.$$

Hence the final pre-logit state is the sum of the $L+1$ contributions (block $0$ = token + positional embeddings, i.e. $x_0 = 0$ and $\Delta_0$ is the embedding vector):

$$x_{\text{final}} \;=\; x_{L+1} \;=\; \sum_{\ell=0}^{L} \Delta_\ell(x_\ell).$$
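To make this bookkeeping concrete, here is a toy NumPy sketch of the additive residual stream; the random linear "blocks" are illustrative stand-ins for attention/MLP sub-blocks, not any real model's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 4                       # toy residual width and block count

# Toy "blocks": each reads the current stream and writes an update Delta_l.
# The random linear maps are purely illustrative stand-ins for attention/MLP.
blocks = [lambda x, W=rng.normal(size=(d, d)) / d: W @ np.tanh(x)
          for _ in range(L)]

x = rng.normal(size=d)             # block 0: token + positional embedding
contributions = [x.copy()]         # Delta_0
for block in blocks:
    delta = block(x)
    contributions.append(delta)
    x = x + delta                  # additive residual stream: x_{l+1} = x_l + Delta_l

# The final pre-logit state is exactly the sum of the recorded contributions.
assert np.allclose(x, sum(contributions))
```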


2 Each update is contained in an ellipsoid

Why a bound exists. Every sub-module (attention head or MLP)

  1. reads a LayerNormed copy of its input, so $\|\mathrm{LN}_\ell(x)\|_2 \le \sqrt{d}\,\|\gamma_\ell\|_\infty + \|\beta_\ell\|_2$, where $\gamma_\ell$ and $\beta_\ell$ are that block’s learned scale and shift;

  2. applies linear maps, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $\mathbb{R}^d$.

Because the composition of linear maps and Lipschitz functions is itself Lipschitz, and item 1 confines its input to a bounded set, there exists a finite constant $R_\ell$ such that

$$\|\Delta_\ell(x)\|_2 \;\le\; R_\ell \qquad \text{for every admissible input } x.$$

Define the centred ellipsoid

$$\mathcal{E}_\ell \;:=\; \{\, u \in \mathbb{R}^d : \|u\|_2 \le R_\ell \,\}$$

(here a Euclidean ball, the simplest ellipsoid; any tighter ellipsoid containing all realisable updates would serve equally well). Then every realisable update lies inside that ellipsoid:

$$\Delta_\ell(x_\ell) \;\in\; \mathcal{E}_\ell \qquad \text{for every prompt and position.}$$
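To see where a finite $R_\ell$ can come from in practice, here is a minimal NumPy sketch that turns the weights of one pre-norm MLP sub-block into an explicit radius. The decomposition into `W_in`, `W_out`, `gamma`, `beta` and the GELU Lipschitz constant of roughly 1.13 are assumptions of the sketch, not a statement about any particular model; attention heads would need an analogous (more involved) bound, since softmax produces convex combinations of bounded value vectors.

```python
import numpy as np

def block_radius(W_in, b_in, W_out, b_out, gamma, beta, act_lipschitz=1.13):
    """Upper-bound ||Delta(x)||_2 for one pre-norm MLP sub-block.

    Assumes Delta(x) = W_out @ phi(W_in @ LN(x) + b_in) + b_out with a
    point-wise activation phi satisfying phi(0) = 0 and |phi'| <= act_lipschitz
    (about 1.13 for GELU).  All argument names are illustrative.
    """
    d = gamma.shape[0]
    # LayerNorm: the normalised vector has 2-norm sqrt(d) (up to epsilon),
    # then is scaled element-wise by gamma and shifted by beta.
    ln_norm = np.sqrt(d) * np.max(np.abs(gamma)) + np.linalg.norm(beta)
    # Spectral norms bound how much each linear map can stretch a vector.
    s_in, s_out = np.linalg.norm(W_in, 2), np.linalg.norm(W_out, 2)
    hidden_norm = act_lipschitz * (s_in * ln_norm + np.linalg.norm(b_in))
    return s_out * hidden_norm + np.linalg.norm(b_out)
```

Summing such radii over the sub-blocks of a layer (plus a radius for the embedding contribution) yields per-block radii $R_\ell$ of the kind used in the next step.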


3 Residual stream ⊆ Minkowski sum of ellipsoids

Using additivity (Step 1) and the per-block containment (Step 2),

$$x_{\text{final}} \;=\; \sum_{\ell=0}^{L} \Delta_\ell \;\in\; \mathcal{S} \;:=\; \mathcal{E}_0 \oplus \mathcal{E}_1 \oplus \cdots \oplus \mathcal{E}_L,$$

where $\mathcal{S}$ is the Minkowski sum of the individual ellipsoids, $A \oplus B := \{\,a + b : a \in A,\ b \in B\,\}$. (The embedding contribution $\Delta_0$ ranges over a finite set of token and position embeddings, so it too is contained in some ball $\mathcal{E}_0$.)
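Under the ball-shaped reading of $\mathcal{E}_\ell$ above, this Minkowski sum has a simple closed form, which already gives a crude norm bound on the final residual state (a one-line consequence, not something the later steps rely on):

$$\mathcal{S} \;=\; \bigoplus_{\ell=0}^{L}\{u : \|u\|_2 \le R_\ell\} \;=\; \Bigl\{u : \|u\|_2 \le \textstyle\sum_{\ell=0}^{L} R_\ell\Bigr\}, \qquad\text{so}\qquad \|x_{\text{final}}\|_2 \;\le\; \sum_{\ell=0}^{L} R_\ell .$$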


4 Logit space is an affine image of that sum

Logits are produced by the affine map $z = W_U\,x_{\text{final}} + b$. For any sets $A, B \subseteq \mathbb{R}^d$ and any matrix $W$,

$$W(A \oplus B) \;=\; (WA) \oplus (WB).$$

Hence

$$z \;\in\; b + W_U\mathcal{S} \;=\; b \,+\, (W_U\mathcal{E}_0) \oplus (W_U\mathcal{E}_1) \oplus \cdots \oplus (W_U\mathcal{E}_L).$$

Because linear images of ellipsoids are ellipsoids, each $W_U\mathcal{E}_\ell$ is still an ellipsoid, now living in logit space $\mathbb{R}^V$.
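As a small numerical illustration of that last point, the following sketch (toy dimensions, random matrices; nothing here is model-specific) computes the principal semi-axes of the image of one ball $\mathcal{E}_\ell$ under $W_U$ and checks sampled image points against the ellipsoid's quadratic form:

```python
import numpy as np

rng = np.random.default_rng(1)
d, V, R = 8, 20, 3.0                       # toy stream width, vocab size, block radius

W_U = rng.normal(size=(V, d))              # toy un-embedding matrix

# Image of the ball {u : ||u||_2 <= R} under W_U: an ellipsoid (degenerate in
# R^V, supported on a d-dimensional subspace) whose principal semi-axes are
# R times the singular values of W_U, oriented along the left singular vectors.
U, S, _ = np.linalg.svd(W_U, full_matrices=False)
semi_axes = R * S

# Sanity check: sampled image points satisfy z^T (W_U W_U^T)^+ z <= R^2.
pinv_gram = np.linalg.pinv(W_U @ W_U.T)
u = rng.normal(size=(d, 1000))
u *= R / np.linalg.norm(u, axis=0)         # scale samples onto the radius-R sphere
z = W_U @ u
quad = np.einsum('ik,ij,jk->k', z, pinv_gram, z)
assert np.all(quad <= R**2 + 1e-6)
```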


5 Ellipsotopes

An ellipsotope is an affine shift of a finite Minkowski sum of ellipsoids. The set

$$\mathcal{T} \;:=\; b \,+\, (W_U\mathcal{E}_0) \oplus (W_U\mathcal{E}_1) \oplus \cdots \oplus (W_U\mathcal{E}_L)$$

is therefore an ellipsotope.
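One convenient computational handle on such a set is its support function: support functions add over Minkowski sums, and the support of $W_U\{u:\|u\|_2\le R_\ell\}$ in direction $y$ is $R_\ell\,\|W_U^\top y\|_2$. Here is a small sketch under the ball-shaped reading of Step 2; all names and the numeric values are illustrative.

```python
import numpy as np

def ellipsotope_support(y, W_U, b, radii):
    """Support function h(y) = max over z in T of <y, z>, where
    T = b + (W_U E_0) (+) ... (+) (W_U E_L) and E_l = {u : ||u||_2 <= R_l}.
    Supports add over Minkowski sums; each term contributes R_l * ||W_U^T y||_2.
    """
    return y @ b + sum(radii) * np.linalg.norm(W_U.T @ y)

# Example: bound a single logit by taking y = +/- e_v.
rng = np.random.default_rng(2)
d, V = 8, 20
W_U, b = rng.normal(size=(V, d)), rng.normal(size=V)
radii = [1.5, 2.0, 0.7]                            # hypothetical per-block radii R_l

e0 = np.zeros(V); e0[0] = 1.0
upper = ellipsotope_support(e0, W_U, b, radii)     # logit 0 can be at most this
lower = -ellipsotope_support(-e0, W_U, b, radii)   # ... and at least this
```

In particular, the per-logit interval $|z_v - b_v| \le \bigl(\sum_\ell R_\ell\bigr)\,\|(W_U)_{v,:}\|_2$ falls out immediately by taking $y = \pm e_v$.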


6 Main result (outer bound)

Theorem. For any pre-norm or peri-norm Transformer language model whose blocks receive LayerNormed inputs, the set of all logit vectors attainable over every prompt and position satisfies

$$\mathcal{Z} \;:=\; \bigl\{\, z(\text{prompt}, \text{position}) \;:\; \text{all prompts, all positions} \,\bigr\} \;\subseteq\; \mathcal{T},$$

where $\mathcal{T}$ is the ellipsotope defined above.

Proof. Containments in Steps 2–4 compose to give the stated inclusion; Step 5 shows the outer set is an ellipsotope. ∎


7 Remarks & implications

  • It is an outer approximation. Equality would require showing that every point of the ellipsotope can actually be realised by some token context, which the argument does not provide.

  • Geometry-aware compression and safety. Because $\mathcal{T}$ is convex and centrally symmetric about $b$, one can fit a minimum-volume outer ellipsoid to it, yielding tight norm-based regularisers or robustness certificates against weight noise / quantisation.

  • Layer-wise attribution. The individual sets $W_U\mathcal{E}_\ell$ bound how much any single layer can move the logits, complementing "logit lens"-style analyses.

  • Assumptions. LayerNorm guarantees that the input each sub-module reads is bounded; Lipschitz (but not necessarily bounded) activations (GELU, SiLU) then give a finite $R_\ell$. Architectures without such norm control would require separate analysis.

