Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the mitigations besides the canary string. I’ve considered writing up a “here are some steps to take” guide, but honestly I’m not an expert.
Probably there’s existing work on how to host data so that AI won’t train on it.
If not: I think it’d be great for someone to make a template website, e.g. one set up through Cloudflare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included; see the robots.txt sketch after the list below) for people who want to host misalignment data more responsibly. Ideally those people would just have to:
1. Sign up with e.g. Cloudflare using a linked guide,
2. Clone the repo,
3. Fill in some information and host their dataset.
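For concreteness, here’s a rough sketch of what the skeleton’s robots.txt might look like. The user-agent tokens below are the published ones for several AI crawlers at the time of writing; any such list goes stale, so the template would need to keep it current:

```
# Sketch: ask known AI-training crawlers to stay out of the site.
# robots.txt is advisory only; that's why the ToS and canary
# string are also part of the skeleton.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```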
After all, someone who has finally finished their project and then discovers that they’re supposed to traverse some arduous process is likely to just avoid it.
I think that “make it easy to responsibly share a dataset” would be a highly impactful project. Anthropic’s Claude 4 model card already argues that dataset leakage hurt Claude 4’s alignment (before mitigations).
For my part, I’ll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I’d also tweet it out and talk about how great it is that [person] completed the project :) I don’t check LW actively, so if you pursue this, please email alex@turntrout.com.
EDIT: Thanks to my coworker Anna Wang, the bounty is doubled to $1,000! The completion criterion is:
An unfamiliar researcher can follow the instructions and have their dataset responsibly uploaded within one hour.

Please also check proposed solutions with dummy datasets and scrapers.
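As a minimal sketch of such a check (Python standard library only; the site URL and /data/ path are placeholders, not a real deployment), one could verify that a proposed solution’s robots.txt actually disallows the crawlers listed above:

```python
# Sketch: confirm that a hosted dataset's robots.txt disallows known
# AI-training crawlers. SITE is a placeholder, not a real deployment.
from urllib.robotparser import RobotFileParser

SITE = "https://example-dataset-host.org"  # hypothetical
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for agent in AI_AGENTS:
    if rp.can_fetch(agent, f"{SITE}/data/"):
        print(f"{agent}: ALLOWED (fix the robots.txt!)")
    else:
        print(f"{agent}: disallowed")
```

Passing this only shows the robots.txt parses as intended; since compliance is voluntary, a fuller test would also point a dummy scraper at the site under those user agents and confirm that the server-side blocking (e.g. via Cloudflare) actually refuses them.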
Something tricky about this: researchers might want to display their data/transcripts in a particular way, so the guide should ideally support that kind of customization. Not sure how this would interact with the one-hour criterion.
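Purely as an illustration of how custom display could coexist with the protections (nothing here is part of the bounty spec, and the canary GUID is a placeholder), a bespoke transcript page might look like:

```html
<!-- Sketch of a custom transcript page that keeps the protections.
     The canary GUID below is a placeholder; a real dataset should
     generate and publish its own. -->
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <!-- Ask indexers to skip this page; robots.txt still gates crawlers. -->
  <meta name="robots" content="noindex">
  <title>Transcript viewer</title>
</head>
<body>
  <!-- CANARY: this data should not appear in training corpora.
       canary GUID: 00000000-0000-0000-0000-000000000000 (placeholder) -->
  <main id="transcript">
    <!-- researcher's custom rendering goes here -->
  </main>
</body>
</html>
```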
Thanks for taking these steps!