Unlike certain other labs and AI companies, afaik Google does respect robots.txt, which is the actual mechanism for keeping data out of its hands.
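For context, a minimal sketch of what that opt-out looks like. Google-Extended is the robots.txt token Google publishes for opting content out of AI training, separate from Googlebot for search; the rules below are illustrative, not taken from any real site:

    # Refuse use for AI training, site-wide.
    User-agent: Google-Extended
    Disallow: /

    # Search crawler remains allowed.
    User-agent: Googlebot
    Allow: /

Note that even this, the finest control Google offers, is per-crawler and per-path for the whole site, not per-post or per-author.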
No, that’s not a working mechanism; it isn’t reliable or granular enough. Users who submit content to a website can’t add entries to that site’s robots.txt. Websites can’t realistically list every opted-out post in their robots.txt, because that would make the file impractically large. It’s very common to want to refuse content for LLM training without also refusing search or cross-site link previews. And robots.txt isn’t preserved when content is mirrored.
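To make the scale problem concrete, here’s what a hypothetical forum honoring per-post opt-outs would have to publish (the post paths are invented for illustration):

    # One Disallow line per opted-out post.
    User-agent: Google-Extended
    Disallow: /posts/19482
    Disallow: /posts/20017
    Disallow: /posts/20455
    # ...and so on, potentially millions of lines on a large forum.

Google documents a size limit of around 500 KiB for robots.txt, so on any sizable site most of those entries would simply be ignored.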