Nice examples, thanks.
However, I suspect that for fairness you’d actually want to steer clear of classics, so that human opinions about the subject matter don’t leak into the training data (if such a corpus exists, which seems likely). Doing the exercise with media released in the last week would sidestep the issue.
Well, maybe. After all, we humans talk about books and movies and influence one another’s opinions. Not sure it would be a bad thing for an AI to see how it’s done.
I know Project Gutenberg has loads of full texts of literary classics and not-so-classics available online. I have no idea whether or not those are scraped in the process of putting a corpus together.