Thanks for sharing this paper. It reminded me of another one, A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (https://arxiv.org/pdf/2305.13169), and in particular its section on toxicity filtering (filter thresholds and the classifier vs. generation trade-off).