One time in my sexology discord, some people were arguing that incest porn was super popular, and I was skeptical because this proposition conflicted with my survey data. I tried scraping data from some porn site (I think PornHub?), and when I sorted videos by number of views, I found that the top-viewed videos were often incest-themed, but as I looked at the cumulative viewcounts, the fraction of views to incest-themed porn dropped as I increased the sample size.
I don’t know for sure, but my guess is that there was a supply/demand imbalance in the data: people weren’t producing “enough” incest videos to meet demand, so the fans of incest porn had their views concentrated into a smaller number of videos, which were thus more likely to have extraordinarily high view counts. If so, the overall preference for incest porn would be lower than one would guess from the top-viewed videos.
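A toy illustration of that effect, using made-up numbers rather than the scraped data. The function name `cumulative_tag_share` and the `(view_count, has_tag)` layout are assumptions for this sketch only. It ranks videos by views and reports, for each top-k cutoff, what share of the cumulative views goes to tagged videos:

```python
from itertools import accumulate


def cumulative_tag_share(videos):
    """videos: iterable of (view_count, has_tag) pairs with positive view counts.

    Returns a list whose entry k-1 is the share of all views among the
    k most-viewed videos that went to tagged videos.
    """
    ranked = sorted(videos, key=lambda v: v[0], reverse=True)
    total = list(accumulate(views for views, _ in ranked))
    tagged = list(accumulate(views if has_tag else 0 for views, has_tag in ranked))
    return [t / n for t, n in zip(tagged, total)]


# Hypothetical data: two tagged videos with very high view counts,
# ten untagged videos with moderate view counts.
toy = [(100, True), (90, True)] + [(50, False)] * 10
shares = cumulative_tag_share(toy)
print(f"top 2: {shares[1]:.2f}, all 12: {shares[-1]:.2f}")
```

In this toy case the tagged share is 100% if you stop at the top two videos and falls to roughly 28% once all twelve are included, which is the same qualitative pattern as a share that shrinks when the sample grows.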
I haven’t looked through your materials so I don’t know how my method of scraping in order of decreasing view count compares to your method. Did you get a complete/comprehensive dataset somehow?
Text diffusion models are still LLMs, just not autoregressive.