the one thing they aren’t is one-step cross-entropy. that’s it; everything else is presumably sampled from the same distribution as existing LLMs. (it’s as if someone finally upgraded BERT to be a primary model.)
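to make the distinction concrete, here’s a minimal sketch (PyTorch, with a toy embedding + linear head standing in for the actual transformer; all sizes and the mask-token choice are illustrative assumptions, not anyone’s real setup). one-step cross-entropy predicts the next token in a single left-to-right pass, while a BERT/diffusion-style objective predicts corrupted tokens and can be applied iteratively at inference:

```python
import torch
import torch.nn.functional as F

# toy stand-ins for a real transformer; sizes are arbitrary
vocab, d = 1000, 64
emb = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)
tokens = torch.randint(1, vocab, (1, 16))  # toy sequence, id 0 reserved for [MASK]

# one-step cross-entropy: predict token t+1 from tokens <= t, single pass
h = emb(tokens[:, :-1])  # stand-in for a causal transformer
ar_loss = F.cross_entropy(head(h).flatten(0, 1), tokens[:, 1:].flatten())

# masked-denoising objective: corrupt a random subset, predict the originals;
# at inference this gets applied iteratively rather than in one left-to-right pass
mask = torch.rand(tokens.shape) < 0.3
corrupted = tokens.masked_fill(mask, 0)  # 0 as a stand-in [MASK] id
h2 = emb(corrupted)  # stand-in for a bidirectional transformer
mlm_loss = F.cross_entropy(head(h2)[mask], tokens[mask])
```

the point being: the loss function itself is still cross-entropy in both cases; what changes is conditioning on a corrupted full sequence instead of a strict left-to-right prefix, and getting to refine over multiple steps instead of one.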