Mastodon’s Dubious Crawler Exemption

jefftk27 Nov 2022 14:20 UTC

9 points

When you share a link on social media the platform fetches the page and includes a preview with your post:

Even though this happens automatically the system doesn’t obey the robots.txt exclusion rules (RFC 3909): I have /test/no-robots excluded in my robots.txt but that doesn’t stop it:

I don’t even see a request for robots.txt in my logs. But that’s actually ok! The robots exclusion standard is for “crawlers”, or automated agents. When I share a link I want the preview to be included, and so the bot fetching the page is acting as my agent. This is the same reason why it’s fine that WebPageTest and PageSpeed Insights also ignore robots.txt: they’re fetching specific pages at the user s request so they can measure performance.

This puts Mastodon in an awkward situation. They do want to include previews, because they’re trying to do all the standard social network things, and if they respected robots.txt many sites you’d want to be able to preview won’t work. They also don’t want the originating instance to generate the preview and include it in the post, because it’s open to abuse:

You can trust mastodon.mit.edu about what @jefftk@mastodon.mit.edu says, but not about what newyorktimes.com says.

The approach Mastodon has gone with is to have each instance generate link previews for each incoming link. This means that while having a link shared on Twitter or Facebook might give you one automated pageview for the preview, on Mastodon it gives you one for each instance the link is federated to. For a link shared by a widely-followed account this might mean thousands of link-preview requests hitting a server, and I’d also now consider this traffic to be from an automated agent.

I’m not sure what the right approach is for Mastodon. Making the link preview fetcher respect robots.txt would be a good start. [1] Longer term I think including link previews when composing a post and handling abuse with defederation seems like it should work as well as the rest of Mastodon’s abuse handling.

(context)

[1] Wouldn’t that just add an extra request and increase the load on sites even more? No: everyone builds sites to make serving robots.txt very cheaply. Serving HTML, however, often involves scripting languages, DB lookups, and other slow operations.

jefftk27 Nov 2022 14:20 UTC

9 points

3 comments1 min readLW link

Dagon 27 Nov 2022 17:48 UTC
4 points
2
The federation protocol makes some weird choices about trust and bandwidth.
JWZ’s snarky commentary: https://www.jwz.org/blog/2022/11/mastodon-stampede/

Federation and what’s local vs remote: https://medium.com/@kris-nova/experimenting-with-federation-and-migrating-accounts-eae61a688c3c

It’s a very odd mix of low-trust but also many things not cryptographically validated to provide proof of intent (just https cert validation of the server). I give it 3 months before a rogue server figures out how to hijack or spoof follows, and then the spam whack-a-mole really begins.
Dagon 28 Nov 2022 18:53 UTC
2 points
0
Oh, on the actual topic, it would never occur to me that an explicit link would check robots.txt. Even back in the day, it was only used by a few indexers as an advisory to keep some things out of indexes (and even then, not well-followed), to minimize scraping, not to prevent linking or actual use. https://en.wikipedia.org/wiki/Robots_exclusion_standard points out that the internet archive has ignored it since 2017.

I think the choice not to snapshot or centrally cache link content was more a bandwidth decision (less data between mastodon servers, more data from big CDN sites to many mastodon servers).
gwillen 27 Nov 2022 20:26 UTC
2 points
0
It seems like you could mitigate this a lot if you didn’t generate the preview until you were about to render the post for the first time. Surely the vast majority of these automated previews are being rendered zero times, and saving nothing. (This arguably links the fetch to a human action, as well.)

If you didn’t want to take the hit that would cause—since it would probably mean the first view of a post didn’t get a preview at all—you could at least limit it to posts that the server theoretically might someday have a good reason to render (i.e. require that there be someone on the server following the poster before doing automated link fetching on the post.)