Probably I should have included a footnote about this. I’m well aware that this is not a foolproof mechanism, but it still seems better than nothing and I think it’s very easy to have a disclaimer that makes this clear. As I said in the post, I think that people should only do this for information they would have posted on LessWrong anyway.
I disagree that these things are basically ignored by labs. My guess is many labs put some effort into filtering out data with the canary string, but that this is slightly harder than you might think and so they end up messing it up sometimes. (They might also sometimes ignore it on purpose, I’m not sure.)
Even if labs ignore the canary string now, having it in there would make it much easier to filter these things out if labs ever want to do that in the future.
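To make the failure mode concrete, here is a minimal sketch of why canary filtering is harder than it looks (the canary value below is a placeholder, and this is not any lab's actual pipeline): an exact-substring filter catches clean copies of the canary but silently passes a copy that was truncated or re-encoded upstream.

```python
# Sketch of a naive canary-string filter for a scraped-text pipeline.
# CANARY is a placeholder value, not a real canary GUID.
CANARY = "CANARY GUID 00000000-0000-0000-0000-000000000000"

def keep(doc: str) -> bool:
    # Exact substring match: this is the easy part that labs get right.
    return CANARY.lower() not in doc.lower()

docs = [
    "an ordinary web page",
    f"benchmark answers... {CANARY}",
    # Truncated mid-string by a chunker or re-encoder: the filter
    # misses it, which is one way canaried text leaks into training data.
    "benchmark answers... CANARY GUID 00000000-0000-",
]
kept = [d for d in docs if keep(d)]  # the truncated copy survives
```

The intact copy is dropped, but the truncated copy passes the filter, which matches the observation above that labs "end up messing it up sometimes" even when they try.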
I also suggest using better methods like captchas for non-logged-in users. I expect something like this to work somewhat well (though it still wouldn’t be foolproof).
I disagree that these things are basically ignored by labs. My guess is many labs put some effort into filtering out data with the canary string, but that this is slightly harder than you might think and so they end up messing it up sometimes. (They might also sometimes ignore it on purpose, I’m not sure.)
From https://www.fsf.org/blogs/sysadmin/our-small-team-vs-millions-of-bots:
Our infrastructure has been under attack since August 2024. Large Language Model (LLM) web crawlers have been a significant source of the attacks, and as for the rest, we don’t expect to ever know what kind of entity is targeting our sites or why.
From https://xeiaso.net/notes/2025/anubis-works/:
This makes the big deployments that I know about include:
The Linux Kernel Mailing List archives
FreeBSD’s SVN (and soon git)
SourceHut
FFmpeg
Wine
UNESCO
The Science Olympiad Student Center
Enlightenment (the desktop environment)
GNOME’s GitLab
The first notable one is the GNOME GitLab. From what I’ve learned talking with the sysadmin team, it was a Hail Mary: nothing else had worked, so what did they have to lose? The GitLab pods instantly scaled down from six to three.
I think having copyrighted content in between might work, but it depends on how the labs process the data. That said, the claim that preventing AI scraping is really difficult seems largely accurate.
I think the canary string is slightly more likely to work than the robots.txt. And the things you link don’t say which labs. I totally believe some labs scrape aggressively and ignore all robots.txt, but it wouldn’t surprise me if others don’t! In my opinion, if even one frontier lab doesn’t train on this data, that’s still quite valuable. And it seems so cheap to just try this as a bit of a Hail Mary.
I think the success of Anubis (as shown above) does tell us that a lot of them are not scraping in good faith. The fact that cryptographic computation is needed to stop them is ridiculous. I’m not sure whether I have a full list, but there’s even a case of:
Report: huawei bound scrapers are bypassing Anubis (https://github.com/TecharoHQ/anubis/issues/978)
where the scrapers simply started performing the computations themselves in order to keep scraping the data.
After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I’m just going to submit a default rule that blocks huawei.
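For context on what "computed the proof of work" means here, the following is a minimal hashcash-style sketch in the spirit of Anubis, not its actual protocol: the server issues a challenge, and the client must find a nonce whose SHA-256 digest has enough leading zero hex digits. This is cheap for one human page view but costly when multiplied across millions of scraped URLs, which is exactly why a determined actor resorting to headless Chrome to compute it stands out.

```python
import hashlib

def pow_hash(challenge: str, nonce: int) -> str:
    # One SHA-256 over the challenge plus a candidate nonce.
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> int:
    # Brute-force search for a digest with `difficulty` leading zeros:
    # the client's cost grows exponentially with difficulty.
    nonce = 0
    while not pow_hash(challenge, nonce).startswith("0" * difficulty):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    # Verification is a single hash, so the server's cost stays tiny.
    return pow_hash(challenge, nonce).startswith("0" * difficulty)

nonce = solve("example-challenge", 4)
assert verify("example-challenge", nonce, 4)
```

The asymmetry (expensive to solve, trivial to verify) is the whole design: it doesn't block scraping outright, it just attaches a per-page CPU bill to it.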
From https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/:
We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User. These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked. We confirmed that Perplexity’s crawlers were in fact being blocked on the specific pages in question, and then performed several targeted tests to confirm what exact behavior we could observe.
There have been multiple reports that OpenAI also ignores robots.txt, though much less aggressively.
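For reference, the directives being ignored in these reports are very simple. A robots.txt blocking the declared crawlers mentioned in this thread would look like the following (GPTBot, PerplexityBot, and Perplexity-User are the publicly documented user-agent names; honoring them is entirely voluntary, which is the whole problem):

```
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```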