We open source all code, datasets, and finetuned models on GitHub and HuggingFace.
Considering the purpose of the datasets, I think putting them up as readily downloadable plaintext is terribly unwise.
Who knows what scrapers are going to come across them?
IMO, datasets like this should be obfuscated, e.g. by compressing them with gzip, so that no simple crawler stumbles onto them by accident. I don’t think harm is likely, but why take chances?
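A minimal sketch of that obfuscation step (filenames and contents here are placeholders, not the actual repo’s files):

```shell
# Compress the plaintext dataset so a naive text crawler won't index it.
printf 'example training row\n' > train.jsonl
gzip -k train.jsonl          # -k keeps the original; produces train.jsonl.gz

# Round-trip check: the data stays trivially recoverable for humans.
gunzip -c train.jsonl.gz     # prints "example training row"
```

The point is friction, not security: gzip only stops crawlers that never bother decompressing what they fetch.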
Agree that it would be better not to have them up as readily downloadable plaintext, and it might even be worth going a step further: encrypting the gzip or zip file and making the password readily available in the repo’s README. This is what David Rein did with GPQA and what we did with FindTheFlaws. Might be overkill, but if I were working for a frontier lab building scrapers to pull in as much data from the web as possible, I’d certainly have those scrapers unzip any unencrypted gzips they came across, and I assume frontier-lab scrapers are doing exactly that.
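A sketch of the gzip-then-encrypt approach, assuming `openssl` is available on the machine; the filename and password below are placeholders for illustration, not the values GPQA or FindTheFlaws actually used:

```shell
# Compress, then encrypt with a password that would be published in the README.
printf 'example training row\n' > train.jsonl
gzip -c train.jsonl | openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass pass:readme-password -out train.jsonl.gz.enc

# Anyone who reads the README can reverse it in one line:
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:readme-password \
    -in train.jsonl.gz.enc | gunzip    # prints "example training row"
```

Because the password is public, this offers no confidentiality; it just ensures a scraper that auto-decompresses gzips still can’t read the contents without deliberate effort.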
PS to the original posters: seems like nice work! Am planning to read the full paper and ask a more substantive follow-up question when I get the chance.
May I plug https://www.lesswrong.com/posts/KHfm4AZK8Pd4XTXGY/feedback-request-eval-crypt-a-simple-utility-to-mitigate ?
Hey,
Thanks for flagging this! To handle this more consistently, we now have the WIP below (currently in beta), with the aim that others adopt it too, since the issue seems to be becoming more common:
https://github.com/Responsible-Dataset-Sharing/easy-dataset-share
I am trying and failing to find the password for em_organism_dir/data/training_datasets.zip.enc
It’s “-p model-organisms-em-datasets”, in the setup section of the README here.
Thanks, not sure how I missed that.
Thanks for raising this! Agree that harm is unlikely, but the risk is there and it’s an easy fix. We’ve zipped the datasets in the repo now.