Originally known as the “past” cache, after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019 (see commit ffd6238). The technique has not been described in the literature AFAIK, and it’s entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this.
KV caching (using the terminology “fast decoding” and “cache”) existed even in the original “Attention is All You Need” implementation of an enc-dec transformer. It was added on Sep 21 2017 in this commit. (I just learned this today, after I read your comment and got curious.)
The “past” terminology in the transformers implementation of GPT-2 was not coined by Wolf; he got it from the original OpenAI GPT-2 implementation (see here).
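For readers who haven’t seen the trick itself: below is a minimal, illustrative sketch of the idea behind the “past”/KV cache, written in plain NumPy with made-up names and shapes (decode_step, attend, Wq/Wk/Wv). It is not the actual transformers, OpenAI GPT-2, or tensor2tensor code, just the pattern those implementations share: keys and values for earlier tokens are computed once and reused, so each decoding step only processes the newest token.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query over all cached keys/values.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step(x_new, past, Wq, Wk, Wv):
    """One decoding step with a 'past' (KV) cache.

    x_new : (d,) embedding of the newest token
    past  : tuple (K_cache, V_cache), each of shape (t, d), or None on the first step
    """
    q = x_new @ Wq
    k = x_new @ Wk
    v = x_new @ Wv
    if past is None:
        K, V = k[None, :], v[None, :]
    else:
        K_cache, V_cache = past
        # Reuse previously computed keys/values; only append the new token's.
        K = np.vstack([K_cache, k])
        V = np.vstack([V_cache, v])
    out = attend(q, K, V)
    return out, (K, V)  # the updated cache is threaded into the next step

# Toy usage: feed tokens one at a time, carrying the cache forward.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
past = None
for token_embedding in rng.normal(size=(5, d)):
    out, past = decode_step(token_embedding, past, Wq, Wk, Wv)
```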