Neel was advised by the authors that it was important minimise batches having tokens from the same prompt. This approach leads to a buffer having activations from many different prompts fairly quickly.
Oh I see, it’s a constraint on the tokens from the vocabulary rather than the prompts. Does the buffer ever reuse prompts or does it always use new ones?
Hey, great post! Are your code or autoencoder weights available somewhere?