Viliam comments on Is there a safe version of the common crawl?

Viliam 12 Aug 2025 15:26 UTC
8 points
1
Could you ask an AI to filter out the text you don’t want?
(Like, ask an AI1 to filter the text, then use the rest to train AI2.)
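A minimal sketch of that two-stage idea, assuming the filter model (AI1) can be wrapped as a function that returns True for text you want to drop. The names `filter_corpus`, `toy_is_unwanted`, and `BLOCKLIST` are hypothetical, and the keyword check is only a placeholder for a real model call:

```python
from typing import Callable, Iterable, Iterator


def filter_corpus(
    docs: Iterable[str],
    is_unwanted: Callable[[str], bool],
) -> Iterator[str]:
    """Yield only the documents that the filter (AI1) does not flag."""
    for doc in docs:
        if not is_unwanted(doc):
            yield doc


# Hypothetical stand-in for AI1. In practice this would be a call to a
# language model or trained classifier; a keyword match is used here only
# so that the sketch runs on its own.
BLOCKLIST = ("forbidden topic",)


def toy_is_unwanted(doc: str) -> bool:
    text = doc.lower()
    return any(phrase in text for phrase in BLOCKLIST)


corpus = [
    "A recipe for sourdough bread.",
    "An essay about a FORBIDDEN TOPIC.",
    "Notes on sorting algorithms.",
]

# Whatever survives the filter would become the training set for AI2.
kept = list(filter_corpus(corpus, toy_is_unwanted))
```

Passing the filter in as a function means the placeholder can be swapped for a real model without touching the loop. Over a corpus the size of Common Crawl, nearly all of the cost would be in the calls to `is_unwanted`.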
- Gunnar_Zarncke 12 Aug 2025 16:12 UTC
  8 points
  2
  Sure, but that is expensive. Why would more than one team need to do it?
  Hm. It turns out it wouldn’t be so expensive. ChatGPT estimated at least $12K.
  - Steven Byrnes 12 Aug 2025 18:58 UTC
    7 points
    0
    I didn’t read it in detail, but I think “Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs” discusses filtering approaches.
  - lemonhope 12 Aug 2025 16:36 UTC
    2 points
    0
    Llama 8B might do a decent job.