Obviously the training data of LLMs contains more than human dialogue, so the claim that pretrained LLMs are “strictly imitating humans” is clearly false. I don’t know why this is never brought up.
It’s neither obvious nor clear to me. Who wrote the rest of their training data, besides us oh-so-fallible humans? What percentage of the data does this non-human authorship constitute?