But a quick inspection of the embeddings available through the Hugging Face model shows this isn't the case.
That’s GPT-2 though, right? I interpret that Q&A claim as saying that GPT-3 does the normalisation, I agree that GPT-2 definitely doesn’t. But idk, doesn’t really matter
For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
Interesting, what exactly do you mean by normalise? GPT-2 presumably breaks if you just outright normalise, since different tokens have very different norms
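For context, here's a minimal sketch of one way that kind of normalisation could work: rescale every token embedding to a common norm (so relative directions are preserved but the "different tokens have very different norms" problem goes away), then constrain the search by snapping candidate vectors back onto that normalised set. All names, shapes, and the snapping rule are illustrative assumptions, not the actual implementation being discussed:

```python
import numpy as np

def normalise_embeddings(E, eps=1e-8):
    """Rescale every row of the embedding matrix E to the mean row norm,
    so all tokens live on one sphere while directions are unchanged."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    target = norms.mean()
    return E * (target / np.maximum(norms, eps))

def nearest_token(v, E_norm):
    """Constrain the search space: project a candidate vector back onto
    the normalised embedding set by nearest Euclidean neighbour."""
    return int(np.argmin(np.linalg.norm(E_norm - v, axis=1)))

# Toy embedding table standing in for GPT-2's wte matrix
# (assumption: the real one is 50257 x 768; 5 x 4 here for illustration).
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4)) * rng.uniform(0.5, 3.0, size=(5, 1))

E_norm = normalise_embeddings(E)
# After normalisation every row has the same norm.
norms = np.linalg.norm(E_norm, axis=1)
print(np.allclose(norms, norms[0]))
```

Note this avoids the breakage worried about above: the model never sees raw unit-norm vectors with the wrong scale, because everything is rescaled to the *mean* token norm rather than to 1.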