Sorry I didn’t get to this message earlier, glad you liked the post though! The answer is that attention heads can have multiple different functions. The simplest way is to store things entirely orthogonally so they lie in fully independent subspaces, but even that isn’t necessary, because transformers seem to take advantage of superposition to represent more concepts at once than they have dimensions.
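If it helps to see the geometric intuition, here's a tiny numpy sketch (my own toy illustration, not anything from the post or a specific model): in d dimensions you can only fit d exactly-orthogonal directions, but you can fit many more *nearly* orthogonal ones with small pairwise interference, which is what superposition exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # pack 512 "concepts" into a 64-dimensional space

# n random unit vectors in R^d
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Off-diagonal cosine similarities: nonzero, but typically small,
# so the directions interfere only a little with each other.
sims = vecs @ vecs.T
off_diag = sims[~np.eye(n, dtype=bool)]
print(f"max |cos| between distinct vectors:  {np.abs(off_diag).max():.3f}")
print(f"mean |cos| between distinct vectors: {np.abs(off_diag).mean():.3f}")
```

With fully orthogonal subspaces the off-diagonal similarities would all be exactly zero, but you'd be capped at 64 directions; the near-orthogonal packing trades a bit of interference for many more representable features.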