What Transformers Learn When They Solve MAJORITY

Full post with figures, code, and experimental details: brokttv.github.io/Majority


Merrill et al. (2022) prove that a 1-layer transformer with saturated attention can recognize MAJORITY, placing such models strictly above AC⁰ in circuit complexity. Their proof is a hand-crafted existence result, one explicit weight configuration constructed to establish expressivity: set all query and key weights to zero, producing uniform attention; a linear classifier thresholding the mean token value at 0.5 then solves the task.

We ask a different question: what does gradient descent actually learn when trained on MAJORITY?

The answer turns out to be categorically different from what theory constructs — and a subtle property of the training data determines whether gradient descent finds anything useful at all.


Setup

We train a minimal 1-layer transformer: single attention head, token embeddings, mean pooling, linear classifier. No positional encodings, no feedforward sublayer, no layer norm, no residual connections. Embedding dimension $d = 1$ throughout (unless varied explicitly).

At $d = 1$, all weight matrices collapse to scalars. Every attention interaction reduces to one of two types: a token attending to a same-value token (score $s_\text{same}$) or to an opposite-value token (score $s_\text{diff}$). The clustering gap $\Delta = s_\text{same} - s_\text{diff}$ separates three regimes:

  • $\Delta = 0$: uniform attention — the hand-crafted construction

  • $\Delta > 0$: same-value tokens preferred

  • $\Delta < 0$: cross-value tokens preferred

Every trained model reduces to a single point $(s_\text{same}, s_\text{diff})$, making the learned mechanism fully transparent.
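As a concrete illustration, the entire $d = 1$ model fits in a few lines of NumPy (a sketch, not the authors' training code); the default scores below match the clustering solution reported in Finding 2:

```python
import numpy as np

def majority_logit(tokens, s_same=0.55, s_diff=-0.55):
    """d=1 model: scalar embeddings (+1 for token 1, -1 for token 0),
    one softmax attention head whose score depends only on whether
    query and key tokens match, mean pooling, identity classifier.
    Default scores are the learned clustering solution (Delta = 1.10)."""
    t = np.asarray(tokens)
    x = np.where(t == 1, 1.0, -1.0)                    # scalar embeddings
    scores = np.where(np.equal.outer(t, t), s_same, s_diff)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)            # row-wise softmax
    return (attn @ x).mean()                           # mean pool -> logit

print(majority_logit([1, 1, 0, 1, 0]) > 0)         # majority-1: positive logit
print(majority_logit([1] * 151 + [0] * 150) > 0)   # same scores at length 301
```

The sign of the logit gives the prediction; because attention depends only on token identity, a single $(s_\text{same}, s_\text{diff})$ pair applies at any sequence length, which is why length generalization is even possible here.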


Finding 1 — Optimizer Determines Recovery

Adam at $d = 1$ achieves 100% accuracy at test length 301, trained only at length 11 — across all 10 seeds. SGD at the same dimension performs at chance.

Increasing embedding dimension doesn’t help SGD. At $d = 1024$ it plateaus at ~95% and never generalizes.

AdamW is more nuanced. It matches Adam at low dimension ($d \leq 52$) but degrades as capacity grows — at $d = 1024$, six of ten seeds fail outright (mean ~73.5%). The minimum-capacity solution is not just sufficient — it appears optimal. Excess capacity destabilizes AdamW on this task.

A plausible explanation: MAJORITY generates heterogeneous gradient signals — same-value pairs must push $s_\text{same}$ positive while cross-value pairs push $s_\text{diff}$ negative. Adam’s per-parameter adaptive rates handle this naturally. SGD doesn’t.
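This gradient structure can be checked numerically. A finite-difference sketch under the $d = 1$ parameterization above (illustrative, not the training code) shows that at the uniform-attention point the batch gradient pushes $s_\text{same}$ up and $s_\text{diff}$ down:

```python
import numpy as np

def loss(tokens, label, s_same, s_diff):
    """Logistic loss of the d=1 model (scalar +/-1 embeddings,
    softmax attention, mean pooling) on one labeled sequence."""
    t = np.asarray(tokens)
    x = np.where(t == 1, 1.0, -1.0)
    attn = np.exp(np.where(np.equal.outer(t, t), s_same, s_diff))
    attn /= attn.sum(axis=1, keepdims=True)
    z = (attn @ x).mean()
    return np.log1p(np.exp(-z if label == 1 else z))

# Central finite differences at the uniform-attention point (0, 0)
eps = 1e-5
batch = [([1, 1, 1, 0, 0], 1), ([0, 0, 0, 1, 1], 0)]
g_same = sum(loss(t, y, +eps, 0) - loss(t, y, -eps, 0) for t, y in batch) / (2 * eps)
g_diff = sum(loss(t, y, 0, +eps) - loss(t, y, 0, -eps) for t, y in batch) / (2 * eps)
print(g_same < 0 < g_diff)  # descent raises s_same, lowers s_diff
```

Because softmax is invariant to a constant shift of the scores, the loss depends only on $\Delta$, so the two parameters receive exactly anti-correlated gradients.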


Finding 2 — Gradient Descent Never Recovers the Hand-Crafted Construction

Across all 10 seeds under softmax training, extracted weights place every model in the bottom-right quadrant: $s_\text{same} \approx +0.55$, $s_\text{diff} \approx −0.55$, $\Delta \approx 1.10$. Same-value tokens attend preferentially to each other; cross-value tokens are suppressed. The uniform-attention construction ($\Delta = 0$) is never approached.

Saturated training produces qualitatively different behavior. Only 5/10 seeds recover a correct solution under odd-length training, 4/10 under even-length training. Failed seeds scatter across all four quadrants. When saturated training succeeds, it does so through clustering ($\Delta > 0$), not through the hand-crafted uniform-attention construction. One seed (seed 8) converges near $\Delta \approx 0.09$ and succeeds, but this appears incidental.

The hand-crafted construction is not a gradient descent attractor — even under the saturated attention regime in which it was designed to operate.
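To make the contrast concrete, here is a sketch (our code, illustrative score values) of both solutions under saturated attention, implemented as tie-averaged hardmax with ±1 embeddings and a zero threshold in place of the original 0/1-embedding, threshold-at-0.5 construction. Both classify correctly; they are simply different points in $(s_\text{same}, s_\text{diff})$ space:

```python
import numpy as np

def saturated_logit(tokens, s_same, s_diff):
    """Saturated attention: each query attends uniformly over its
    maximum-scoring keys (hardmax with tie-averaging), then mean pool."""
    t = np.asarray(tokens)
    x = np.where(t == 1, 1.0, -1.0)
    scores = np.where(np.equal.outer(t, t), s_same, s_diff)
    hard = (scores == scores.max(axis=1, keepdims=True)).astype(float)
    hard /= hard.sum(axis=1, keepdims=True)
    return (hard @ x).mean()

seq = [1, 0, 1, 1, 0]
print(saturated_logit(seq, 0.0, 0.0) > 0)     # uniform construction: correct
print(saturated_logit(seq, 0.55, -0.55) > 0)  # clustering solution: also correct
```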

With more data (10k–30k samples), saturated training becomes reliable (~9/10 seeds), but converges to clustering solutions comparable to softmax — never to the hand-crafted uniform-attention construction. The ~50% success rate at low sample counts is a data-efficiency problem, not a fundamental failure.


Finding 3 — Undefined Inputs Corrupt Training

Models trained on odd-length sequences generalize reliably. Models trained on even-length sequences degrade. The obvious hypothesis is parity sensitivity. It’s wrong.

MAJORITY is a partial function. For even $n$, balanced sequences ($\sum_i w_i = n/2$) are undefined — not in the function's domain. They receive label 0 by convention. Even-length random sequences contain ~22% such inputs.
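The ~22% figure is simply the probability that a uniformly random coin-flip sequence is exactly balanced. Assuming an even training length of $n = 12$ (our guess: it is the even length adjacent to the odd training length 11, and it reproduces the figure):

```python
from math import comb

# P(tie) for a uniformly random binary sequence of even length n.
# n = 12 is an assumption, not stated in the post.
n = 12
p_tie = comb(n, n // 2) / 2**n
print(f"{p_tie:.1%}")  # -> 22.6%
```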

Five controlled conditions, varying only data generation:

Condition                  | Ties | Class balance | Mean acc ± std
Odd, random                | 0%   | 50/50         | 95.5 ± 4.7%
Even, random               | ~22% | ~61/39        | 70.9 ± 2.7%
Even, no ties, balanced    | 0%   | 50/50         | 95.8 ± 4.5%
Even, 30% ties, balanced   | 30%  | 50/50         | 54.5 ± 3.0%
Even, no ties, imbalanced  | 0%   | 35/65         | 95.8 ± 4.7%

Removing ties from even-length training restores full performance. Introducing ties at 30% collapses accuracy to chance despite balanced classes. Class imbalance has no independent effect. Undefined inputs are the sole causal factor.

The mechanism: balanced inputs provide zero net gradient toward the clustering attractor. The conventional label 0 additionally applies a loss inconsistent with any clustering solution. Even ~22% of such inputs prevents the attractor from forming.
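The cancellation can be verified directly. Assuming the symmetric clustering solution (±1 embeddings, scores $\pm\Delta/2$), a balanced input's pooled logit is exactly zero for every value of the gap, so these examples supply no signal favoring any particular $\Delta$ (a sketch under our parameterization):

```python
import numpy as np

def balanced_logit(delta, k=6):
    """Pooled logit of a balanced sequence (k ones, k zeros) under the
    symmetric d=1 solution. Each 1-token's attended value is
    (a - b) / (a + b) with a = exp(+delta/2), b = exp(-delta/2);
    each 0-token's is its exact negation, so mean pooling cancels."""
    a, b = np.exp(delta / 2), np.exp(-delta / 2)
    one_val = (k * a - k * b) / (k * a + k * b)    # a 1-token's attended value
    zero_val = (k * b - k * a) / (k * b + k * a)   # a 0-token's attended value
    return 0.5 * one_val + 0.5 * zero_val          # mean pool over 2k tokens

print([balanced_logit(d) for d in [0.0, 0.5, 1.10, 5.0]])  # all exactly 0.0
```

The conventional label 0 then demands a confident negative on an input whose logit is pinned at zero by symmetry, consistent with the corruption observed above.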

A positive control on OR — a total function with no undefined inputs — confirms even-length training succeeds when the target is defined on all training examples.


What This Means for Interpretability

The standard framing in formal language theory is: can this architecture represent this function? Merrill et al. answer yes for MAJORITY — via a hand-crafted construction.

But the relevant question for understanding learned models is: what does gradient descent actually find? These experiments show that the answer can be categorically different from what theory constructs, even on a task simple enough to be fully transparent.

The clustering mechanism gradient descent discovers isn’t in the Merrill et al. construction. It’s also not in the circuit complexity literature. It emerged from training, and its properties — length generalization, optimizer sensitivity, failure modes from undefined inputs — are empirically characterized here.

The gap between hand-crafted existence proofs and learned solutions seems underexplored. We think it’s worth formalizing.


Open Questions

  1. Why does Adam find the clustering solution reliably at minimum capacity while SGD fails? The heterogeneous gradient structure of MAJORITY seems relevant — a formal account of this would clarify when optimizer adaptivity matters for learning formal languages.

  2. Is there a theory of token-clustering convergence? Softmax training consistently finds a clustering attractor at $\Delta \approx 1.10$. Why this value? What determines the basin? Does this generalize to other functions?

  3. Why isn’t the hand-crafted construction a gradient descent attractor? Even in the saturated regime where it was designed to operate, gradient descent never finds it at low data. With more data it still doesn’t. What would it take to steer learning toward it?

  4. Does undefined input corruption scale? These results are on small controlled models. Whether the effect persists — or amplifies — in larger models trained on richer function classes has direct implications for data curation.


Code: github.com/Brokttv/Transformers-as-Circuits
Full post with figures: brokttv.github.io/Majority
