As AIs become more capable, we may at least want the option of discussing them out of their earshot.
If I wanted to discuss something outside of an AI’s earshot, I’d use something like Signal, or something else that would keep out a human too.
AIs sometimes have internet access, and robots.txt won’t keep them out.
I don’t think having this info in their training set makes a big difference (but maybe I don’t see the problem you’re pointing at, so this isn’t a confident claim).
I think there are two levels of potential protection here. One is a security-like “LLMs must not see this” condition, for which yes, you need to do something that would keep out a human too (though in practice maybe “post only visible to logged-in users” is good enough).
However, I also think there’s a lower level of protection that’s more like “if you give me the choice, on balance I’d prefer for LLMs not to be trained on this”, where some failures are OK and imperfect filtering is better than no filtering. The advantage of targeting this level is simply that it’s much easier and less obtrusive, so you can do it at a greater scale with a lower cost. I think this is still worth something.
I’m not sure I’m imagining the same thing as you, but as a draft solution, how about a robots.txt?
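For instance, a sketch along these lines (GPTBot, ClaudeBot, CCBot, and Google-Extended are the tokens these crawlers publish; compliance is voluntary, so this only buys the “preference” level of protection, not the security-like one):

```
# Ask known AI training crawlers to stay out of the whole site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```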
Yeah, that’s a strong option, which is why I went around checking + linking all the robots.txt files for the websites I listed above :)
In my other post I discuss the tradeoffs of the different approaches; one in particular is that it would be somewhat clumsy to implement post-by-post filters via robots.txt, whereas user-agent filtering can do it just fine.
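To make that concrete, here’s a minimal sketch of per-post filtering (the Flask app, the post store, and the `no_ai_training` flag are all hypothetical; the crawler list would also need ongoing maintenance):

```python
# Sketch: per-post opt-out enforced by user-agent, which robots.txt
# can't express cleanly on a post-by-post basis.
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings that known AI training crawlers send in their User-Agent
# headers. (Some opt-outs, like Google-Extended, exist only as
# robots.txt tokens and never appear in request user-agents.)
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

# Stand-in for a real datastore; the no_ai_training flag would be
# set per post by the author.
POSTS = {
    "my-post": {"body": "Hello, humans.", "no_ai_training": True},
}

@app.route("/posts/<slug>")
def show_post(slug):
    post = POSTS.get(slug)
    if post is None:
        abort(404)
    ua = request.headers.get("User-Agent", "")
    # Only flagged posts refuse known AI crawlers; everything else
    # is served normally, so the filtering stays unobtrusive.
    if post["no_ai_training"] and any(tok in ua for tok in AI_CRAWLER_TOKENS):
        abort(403)
    return post["body"]
```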