[Question] Which parts of the existing internet are likely already in the training corpus of GPT-5 and other soon-to-be-trained LLMs?

In particular:

  • If a book is on an easily accessible part of the web, but only as a PDF of scanned page images rather than as text, is it likely to be in the training corpus of GPT-5 (or a similar model) in roughly the same way, and with roughly the same effects, as if its text were on easily accessible webpages?

  • What if a book is on LibGen as a PDF (illegally), but not on the easily accessible parts of the web?

  • What if the ideas are in a video (e.g. a podcast)? Is GPT-5 (or a similar model) likely to process the spoken content in roughly the same way when only the video exists as when there is both a video and a transcript?
