[Question] Is the output of the softmax in a single transformer attention head usually winner-takes-all?

Using the notation from here: A Mathematical Framework for Transformer Circuits

The attention pattern for a single attention head is determined by $A = \operatorname{softmax}(x^T W_Q^T W_K\, x)$, where the softmax is computed for each row of $x^T W_Q^T W_K\, x$.
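For concreteness, here is a minimal NumPy sketch of that computation (my own illustration, not code from the linked paper): the bilinear score matrix $x^T W_Q^T W_K\, x$, causal masking, and a row-wise softmax. Shapes, names, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4

x = rng.normal(size=(d_model, n_tokens))      # residual-stream vectors as columns
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))

scores = x.T @ W_Q.T @ W_K @ x                # (n_tokens, n_tokens) bilinear form
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)      # autoregressive masking

# Row-wise softmax: each row is the attention distribution for one query token.
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(A.round(3))
```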

Each row of $A$ gives the attention pattern for the current (query) token. Are these rows (post-softmax) typically close to one-hot? I.e., are they usually dominated by a single attended-to token per current token?
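One way to check this empirically (a sketch of my own, not from the paper): pull per-head attention patterns out of a small open model via Hugging Face `transformers` (GPT-2 here as a stand-in for larger LLMs) and look at how concentrated each row is, e.g. via the peak weight and the entropy of each row.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, n_heads, n_tokens, n_tokens).
for layer_idx, attn in enumerate(outputs.attentions):
    attn = attn[0]                                   # drop batch dimension
    row_max = attn.max(dim=-1).values                # peak attention weight per row
    row_entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1)
    print(
        f"layer {layer_idx:2d}: "
        f"mean max weight {row_max.mean().item():.2f}, "
        f"mean row entropy {row_entropy.mean().item():.2f} nats"
    )
```

A mean max weight near 1 (equivalently, entropy near 0) would indicate nearly one-hot rows; more diffuse heads would show up as lower peaks and higher entropy.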

I'm interested in knowing this for various types of transformers, but mainly for LLMs and/or frontier models.

I'm asking because I think this has implications for computation in superposition.
