This seems like a great idea! However, I think it might degrade the usefulness of the dataset, especially if it’s meant to be used later to evaluate LLMs, since any jailbreaks etc. would apply in that setting as well. And if you provide utilities to clean up the text before evaluation, scrapers could use those same utilities.
Yeah, I guess the use case I had in mind is people in general who don’t want LLMs trained on (particular pieces of) their writing, rather than datasets specifically.