I’m having trouble understanding how the KV cache helps significantly with serial depth (like “updating weights”). Isn’t the overwhelming bottleneck at the start of each new forward pass? The layer-l KV cache entries for a given position contain only l-1 layers of contextual processing, and the layer-1 cache is just W^K (and W^V) applied to the fixed token embeddings (no contextual richness). So the deep, info-rich representations exist only in the high-layer cache entries, and those are accessible only to the correspondingly high layers of the new token (layers that have already done their own deep processing). Early layers querying the KV cache are therefore reading nearly context-free vectors (I think?).
There’s a discrete token bottleneck: the depth-L computation selects a token, that token maps back to a fixed embedding, and L layers process it from scratch. So you get O(TL) serial depth over T steps, but each cross-step hop compresses the high-dimensional representation back down to a single trained vocab item and its embedding. Does this all sound right, and are you just saying that in theory you think this is sufficient?
I may be confusing or overlooking something simple.
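To make the depth accounting in this question concrete, here is a rough sketch that treats the forward passes as a dependency graph and computes the longest chain of layer applications reaching each node. It is only an illustration under simplifying assumptions: node (t, l) stands for the residual stream at position t after l layers, one layer counts as one unit of depth, token sampling is treated as a free edge, and the names `serial_depth` and `through_tokens` are made up for this example rather than taken from any library.

```python
def serial_depth(T, L, through_tokens):
    """Longest chain of layer applications ending at each (position, layer) node.

    Edges assumed in this sketch:
      (s, l-1) -> (t, l) for every s <= t: layer l attends over K/V that were
          computed from layer l-1 outputs (this is what the KV cache stores).
      (t-1, L) -> (t, 0), only if through_tokens: the sampled token is
          re-embedded and becomes the input at the next position.
    """
    depth = [[0] * (L + 1) for _ in range(T)]
    for t in range(T):
        # The embedding has depth 0 unless it came from a sampled token.
        depth[t][0] = depth[t - 1][L] if (through_tokens and t > 0) else 0
        for l in range(1, L + 1):
            # One more layer on top of the deepest input visible at layer l-1.
            depth[t][l] = 1 + max(depth[s][l - 1] for s in range(t + 1))
    return depth


T, L = 4, 3
kv_only = serial_depth(T, L, through_tokens=False)
with_tokens = serial_depth(T, L, through_tokens=True)

# KV-cache edges alone: the final node is only L layers deep, at any position.
print(kv_only[T - 1][L])       # 3
# Adding the token feedback edge: depth grows as T * L.
print(with_tokens[T - 1][L])   # 12
```

Under these assumptions the KV-cache edges by themselves never push a node past depth l at layer l, which matches the point that layer-l cache entries carry only l-1 layers of processing. The O(TL) figure appears only once the sampled-token edge is included, and that edge is exactly the one that passes through the discrete bottleneck.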
This was a helpful way to understand the differences (especially the point about requiring authentication, or knowing it’s really the principal, which is something I haven’t thought about much).
There’s an odd example where someone like a CEO might strike a deal with a misaligned/scheming model such that, for some period of time or under some condition, the model is effectively secretly loyal despite not having any weight-encoded “goal” or propensity to favor the principal.