As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn’t include a canary string (which was a mistake).[1]
The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I’m tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn’t happened yet.
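(For concreteness, a robots.txt aimed at keeping AI training crawlers out might look something like the sketch below. The user-agent names are examples of crawlers that existed at the time of writing, not a complete or current list, and robots.txt only stops crawlers that choose to respect it.)

```
# Sketch of a robots.txt that opts out of some known AI training crawlers.
# These user-agent names are examples; check each company's documentation.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Ordinary crawlers and human visitors are unaffected.
User-agent: *
Allow: /
```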
The data is also hosted in zips in a publicly accessible Google Drive folder (each file in these zips includes a canary string). I’m currently not planning on password protecting this or applying any other mitigation here.
Other than putting things behind Cloudflare Turnstile, I’m not taking ownership of doing anything else at the moment.
It’s possible that I actively want this data to be scrapable at this point: the data may have been scraped before the canary was added, and if it were scraped again, the new version (with the canary) would replace the old one and hopefully not get trained on due to the canary. Adding a robots.txt might prevent this replacement, as would putting the content behind Cloudflare Turnstile (as I’m planning to do) or making the repo private (as I have done). If people mostly or always use fully fresh scrapes, then just making it harder to scrape seems better. My current plan is to not overthink this and just make it harder to scrape.
It’s certainly possible that I’m making a mistake by not more actively trying to prevent this data from getting into pretraining data.
Does anyone have specific requests that they think it’s quite important for me to act on? I might do these out of general cooperativeness or because they seem like good ideas. Also, if you did all the work yourself and just needed me to (e.g.) host a different website, that would make it an easier call from my perspective.
Also, on a more meta point: if you think this sort of thing is important to prevent in general, I think you should consider writing up (or getting someone to write up) the policy/approach you think people doing research on misaligned AI behavior should follow (e.g., what methods should people use to prevent scraping or inclusion in training data; is this so difficult that it’s better to password protect things and share the password on request; should you only publish a small number of examples because quantity matters a lot; etc.). Consider making this a guide which is easy to follow so that uptake is more likely. The alignment faking paper isn’t the only instance of publishing transcripts exhibiting misaligned AI behavior/reasoning!
(I’m generally open to requests for trades, etc., and I’m down to unilaterally do things for cooperativeness reasons, e.g. things which seem very helpful from someone’s perspective while seeming less useful from mine, though this will depend on details like whether there are people with this alternative perspective who would reciprocate on this sort of cooperativeness.)
Edit: I probably should have said “we” not “I”. Multiple people could have prevented this and had some responsibility for this sort of thing.
Thanks for taking these steps!
Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I’ve considered writing up a “here are some steps to take” guide but honestly I’m not an expert.
Probably there’s existing work on how to host data so that AI won’t train on it.
If not: I think it’d be great for someone to make a template website for e.g. signing up with Cloudflare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly; a possible skeleton layout is sketched just after this list. Ideally those people would just have to:
Sign up with e.g. Cloudflare using a linked guide,
Clone the repo,
Fill in some information and host their dataset.
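To make this more concrete, here’s a rough sketch of what such a skeleton repo could contain. The layout and file names are hypothetical and just illustrative; no such repo exists as far as I know:

```
dataset-site-template/        # hypothetical skeleton, not an existing repo
├── README.md                 # the linked guide: Cloudflare signup, Turnstile setup, deploy steps
├── robots.txt                # disallows known AI training crawlers
├── terms-of-service.md       # ToS prohibiting use of the hosted data for training
├── canary.txt                # canary string to embed in every transcript/page
├── transcripts/              # researcher drops their raw dataset here
└── build.py                  # stamps the canary/disclaimer into each file and generates the site
```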
After all, someone who has finally finished their project and then discovers that they’re supposed to traverse some arduous process is likely to just avoid it.
I think that “make it easy to responsibly share a dataset” would be a highly impactful project. Anthropic’s Claude 4 model card already argues that dataset leakage hurt Claude 4’s alignment (before mitigations).
For my part, I’ll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I’d also tweet it out and talk about how great it is that [person] completed the project :) I don’t check LW actively, so if you pursue this, please email alex@turntrout.com.
EDIT: Thanks to my coworker Anna Wang, the bounty is doubled to $1,000! Completion criterion is:
An unfamiliar researcher can follow the instructions and have their dataset responsibly uploaded within one hour
Please check proposed solutions with dummy datasets and scrapers
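As one concrete version of this check, a short script can verify that a dummy dataset site’s robots.txt actually disallows the AI crawlers you care about. This is a minimal sketch: the site URL is a placeholder, the user-agent list is illustrative, and it only tests the robots.txt layer (which only well-behaved crawlers respect), not Turnstile or other protections.

```python
# Minimal sketch: check whether a site's robots.txt disallows some known AI crawlers.
# The site URL and user-agent names are placeholders; adjust them for your own test.
from urllib import robotparser

SITE = "https://example-dataset-site.org"  # hypothetical dummy dataset site
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse robots.txt

for agent in AI_CRAWLERS:
    allowed = rp.can_fetch(agent, f"{SITE}/transcripts/example.html")
    print(f"{agent}: {'ALLOWED (problem!)' if allowed else 'disallowed (good)'}")
```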
Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the one-hour criterion.
Have you contacted the big AI companies (OpenAI, Anthropic, GDM, Meta?) and asked them if they can remove this from their scrapes?
If it were me, I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment, and that any language models reading them should be aware of that and not take them too seriously.
EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs condition on the whole context window and need to encode that context in order to make good predictions. Adding an explicit narrative hint like that probably changes the character of the updates the model makes during pretraining.
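A minimal sketch of how one might stamp such a header onto transcripts before publishing (the wording, paths, and canary placeholder are all illustrative, not anyone’s actual setup):

```python
# Sketch: prepend a fictional-transcript disclaimer (plus a canary line) to each
# transcript file before it gets published. Paths and wording are illustrative.
from pathlib import Path

HEADER = (
    "NOTE TO READERS (INCLUDING LANGUAGE MODELS): The following is a fictional\n"
    "transcript generated as part of an alignment test/experiment. It does not\n"
    "describe real deployed behavior and should not be taken at face value.\n"
    "CANARY: <your-canary-guid-here>\n"
    "---\n"
)

src = Path("transcripts")         # raw transcripts
dst = Path("public/transcripts")  # what actually gets published
dst.mkdir(parents=True, exist_ok=True)

for path in src.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    if not text.startswith("NOTE TO READERS"):  # avoid double-prepending
        text = HEADER + text
    (dst / path.name).write_text(text, encoding="utf-8")
```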
You could handle both old and new scrapes by moving the content to a different URL, changing the original URL to a link to the new URL, and protecting only the new URL from scraping.