Could you ask an AI to filter out the text you don’t want?
(Like, ask one model, AI1, to filter the text, then use the rest to train AI2.)
Sure, but that is expensive. Why would more than one team need to do it?
Hm. It turns out it wouldn’t be so expensive. ChatGPT estimated at least $12K.
I haven’t read it in detail, but I think “Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs” discusses filtering approaches.
Llama 8B might do a decent job.
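
For concreteness, here is a rough sketch of the "AI1 filters, AI2 trains on the rest" idea. Everything in it is an assumption for illustration: `filter_corpus`, `keyword_filter`, and `BLOCKLIST` are made-up names, and the keyword check is only a stand-in for a real classifier call (for example, prompting a small model such as Llama 8B). It is not the method from the paper mentioned above.

```python
from typing import Callable, Iterable, Iterator


def filter_corpus(
    docs: Iterable[str],
    is_unwanted: Callable[[str], bool],
) -> Iterator[str]:
    """Yield only the documents that the filter (AI1) does not flag."""
    for doc in docs:
        if not is_unwanted(doc):
            yield doc


# Hypothetical blocklist; a real filter would be a model, not keywords.
BLOCKLIST = ("forbidden topic",)


def keyword_filter(doc: str) -> bool:
    """Stand-in for AI1: flag a document if it mentions a blocked term."""
    text = doc.lower()
    return any(term in text for term in BLOCKLIST)


corpus = [
    "A recipe for sourdough bread.",
    "Detailed notes on a Forbidden Topic.",
    "An introduction to linear algebra.",
]

# The surviving documents are what AI2 would be trained on.
kept = list(filter_corpus(corpus, keyword_filter))
print(kept)
```

Passing the filter in as a function means the keyword stand-in could be swapped for a model-based classifier without touching the rest of the pipeline.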