I think the success of Anubis (as shown above) tells us that a lot of them are not scraping in good faith. The fact that it takes a cryptographic computation to stop them is ridiculous. I’m not sure whether I have a full list, but there’s even this case:
Report: huawei bound scrapers are bypassing Anubis #978
In that case the scrapers simply started performing the computations in order to keep scraping the data. From the issue thread:
After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I’m just going to submit a default rule that blocks huawei.
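To make it concrete why "simply computed the proof of work" is possible, here is a minimal sketch of a generic hash-based proof-of-work challenge. This is not Anubis's actual code: the function names, the challenge string and the difficulty are all made up for illustration. It only assumes a scheme where the client has to find a nonce whose SHA-256 digest starts with a given number of zero hex digits.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) starts with
    `difficulty` zero hex digits (hypothetical scheme, for illustration)."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server-side check: a single hash, cheap compared to solving."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # Any client willing to spend the CPU time can do this, bot or not.
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

The point is that a challenge like this only raises the cost per request. A scraper that is willing to run a real browser and pay that cost gets through just like a normal visitor would.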
Cloudflare reported similar behavior from Perplexity:
We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User. These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked. We confirmed that Perplexity’s crawlers were in fact being blocked on the specific pages in question, and then performed several targeted tests to confirm what exact behavior we could observe.
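For context on what "disallowed in robots.txt" means in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are made up for illustration; only the PerplexityBot user agent name comes from the quote above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows one declared crawler.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return whether robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page"
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", url))  # declared crawler
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))    # generic browser UA
```

The check is keyed entirely on the user agent string the client chooses to send, and it is the crawler that has to run it. A crawler that skips the check, or that presents itself with a generic browser user agent, is not stopped by robots.txt at all.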
There have been multiple reports that OpenAI also ignores robots.txt, but it’s much less aggressive.
https://github.com/TecharoHQ/anubis/issues/978
https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/