I wouldn’t call the dataset comprehensive exactly, but it’s plausibly representative: it’s Internet Archive snapshots of “pornhub.com” from 2008-2023. You can see the script here: https://github.com/dhealy05/semen_and_semantics/blob/main/data_retrieval/fetch_snapshots.py. I wrote an “HTMLParser” base class and, e.g., a “Parser2010” subclass, to put the data into a common format across years. The data and embeddings are in the repo if you want to use them without running a script.
Interesting ideas. I truncated the readme for LessWrong, but my “Future Work” section includes:
“Analyze trends by “minutes watched” by weighting for views, view X length; this is more likely a heuristic for content production than actual viewing time”
So while I’m not sure about what those results would look like, I agree there’s an angle there.
Hi Richard, yes, certain keywords are banned. What I’m measuring is semantic similarity. For example, a video titled “rape” will be banned, but a video whose title only suggests rape may not be. By using text embeddings, we’re finding the titles most similar to the concept of rape. To find trends over time, we’re counting how many of those titles are found per year, normalized by the total number of titles in that year.
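To make the counting step concrete, here is a minimal sketch, not the code from the repo. It assumes each title already has an embedding vector, and it treats "most similar" as "cosine similarity above a cutoff". The function name `yearly_concept_share`, the `threshold` value, and the toy 2-D vectors are all made up for illustration; a top-k selection would work the same way.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def yearly_concept_share(title_vecs, years, concept_vec, threshold=0.8):
    """Per year: (titles similar to the concept) / (all titles that year).

    The threshold is an illustrative assumption, not a value from the repo.
    """
    totals = Counter(years)
    hits = Counter(
        year
        for vec, year in zip(title_vecs, years)
        if cosine(vec, concept_vec) >= threshold
    )
    # Dividing by the yearly total keeps years with more snapshots
    # from dominating the trend.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy 2-D vectors standing in for real title embeddings (made-up values).
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.9, 0.1], [0.0, 1.0]]
toy_years = [2019, 2019, 2020, 2020, 2020]
shares = yearly_concept_share(toy_vecs, toy_years, concept_vec=[1.0, 0.0])
print(shares)
```

In the toy data, one of two 2019 titles and two of three 2020 titles clear the cutoff, so the share rises from one half to two thirds even though the raw counts are tiny.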
For certain keywords, we see a decline starting after 2020, likely because of Nicholas Kristof’s NYT piece “The Children of Pornhub”, which led to both stricter keyword standards and a mass removal of videos. “Drunk” and “coma” capture “incapacitation” through euphemisms, which were used as a way to get around explicit keyword policing.
The fact that we do see declines in some areas, and that we have a known cause for them, leads me to believe the data is reliable: it’s not all showing a line straight up.