Which would be a good thing: nominally they already claim to let everyone opt out of scraping via robots.txt and other methods, so the canary shouldn't do anything there that people couldn't already do.
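For concreteness, here's a minimal sketch of how the existing robots.txt opt-out works, using Python's standard-library parser. The rules and site are made up; "GPTBot" is OpenAI's documented crawler token, but the same pattern applies to any crawler:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt denying one named AI crawler.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("GPTBot", "https://example.com/page"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
```

The catch, as noted below, is that only whoever controls the server can publish this file.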
One notable difference is that on sites that allow user-submitted content (e.g. this one), individual users could exempt their own contributions, whereas robots.txt et al. require the server admin to intervene. (But I agree that this would be a feature and not a bug.)
If they were to exclude all documents with the canary, everyone would include the canary to avoid being scraped.
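A minimal sketch of what that exclusion might look like on the scraper side. The canary value here is a hypothetical placeholder, not any real published canary string, and no actual crawler is claimed to work this way:

```python
# Scraper-side canary filtering: drop any document containing the
# canary, even if robots.txt would have allowed the crawl.
CANARY = "EXAMPLE-DO-NOT-TRAIN-CANARY"  # hypothetical placeholder

def keep_for_training(text: str) -> bool:
    return CANARY not in text

docs = [
    "An ordinary page with no opt-out marker.",
    f"A user comment embedding the canary: {CANARY}",
]
corpus = [d for d in docs if keep_for_training(d)]
print(len(corpus))  # 1 -- the canary-bearing document is excluded
```

Since the check is a plain substring match on content rather than a server-level rule, anyone who can post text can trigger it, which is exactly the incentive problem described above.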