[Question] Is there a safe version of the Common Crawl?

The larger LLMs are trained on the Common Crawl, a publicly available dump of significant parts (400TB) of the public internet. They are also trained on all kinds of additional data, but presumably a large fraction of any dangerous content comes from the Common Crawl.

Is there a safe version of the Common Crawl that has the dangerous parts removed (or at least labeled, such that it would be easy to remove them)?

From a safety perspective it would probably be most useful if material on AI (esp. about misalignment and alignment strategies) were removed. It would also be interesting if material on consciousness were removed, to allow testing whether LLMs discover the concept without prior knowledge. A rough sketch of what such labeling could look like is below.
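As a very rough illustration of the labeling idea, here is a minimal sketch that scans the text records of a local Common Crawl WET file (using the real `warcio` library) and tags documents by topic. The keyword lists and the function names are hypothetical stand-ins; a real effort would need a trained classifier rather than substring matching:

```python
from warcio.archiveiterator import ArchiveIterator

# Hypothetical keyword lists -- illustrative only; a serious filter
# would use a trained topic classifier, not substring matching.
AI_SAFETY_TERMS = ["instrumental convergence", "mesa-optimizer", "reward hacking"]
CONSCIOUSNESS_TERMS = ["qualia", "hard problem of consciousness", "phenomenal experience"]

def label_record(text: str) -> set[str]:
    """Return the set of topic labels matched by this document's text."""
    lowered = text.lower()
    labels = set()
    if any(term in lowered for term in AI_SAFETY_TERMS):
        labels.add("ai-safety")
    if any(term in lowered for term in CONSCIOUSNESS_TERMS):
        labels.add("consciousness")
    return labels

def label_wet_file(path: str):
    """Yield (target URI, labels) for each text record in a local WET file."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            text = record.content_stream().read().decode("utf-8", errors="replace")
            uri = record.rec_headers.get_header("WARC-Target-URI")
            yield uri, label_record(text)
```

The output could be written out as a sidecar index of URI-to-label mappings, so downstream users can drop or keep labeled documents at training time without modifying the crawl itself.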

Obviously, this wouldn’t solve the alignment problem, since instrumental convergence still holds. But it could buy some time.