Reasons to care about Canary Strings

This post is a follow-up to my recent post on evaluation paranoia and benchmark contamination in Gemini 3. There was a lot of interesting discussion about canary strings in the comments, and I wanted to respond to it more thoroughly and collect some of the ideas different people raised.

What are Canary Strings?

There was a time when people were worried that LLMs were not actually learning anything general, but instead just memorizing what was put into them.[1] Language and knowledge benchmarks, our main AI measurement tools at the time, were going into the training data, and it was unclear whether a frontier model knew the answer to a math problem because it had deduced it or because it had been trained on that specific question.

To disincentivize training on benchmark data, and to make such training easier to avoid, researchers pioneered canary strings when attempting to go Beyond the Imitation Game in 2022. BIG-bench tasks and discussions of the benchmark were meant to contain a specific, unique hexadecimal string that model developers would filter out of their training data. This would make the evaluation more valuable as a measure of useful skills, rather than just a measure of memorization.

This caught on, and other benchmarks created their own canary strings so that developers would filter out their data too. Many of these canaries were superstrings of the original BIG-bench canary, so that even if the benchmark never got any traction, or developers never thought about its specific canary string, its data would still be filtered out and the benchmark would still be useful as a metric. The BIG-bench canary also showed up in blog posts about AI evaluations.
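Mechanically, this kind of filtering is cheap. Below is a minimal sketch of what a canary filter over a pretraining corpus could look like; the GUID is a placeholder rather than the real BIG-bench canary, and the corpus plumbing is assumed rather than taken from any lab's actual pipeline.

```python
# Minimal sketch of canary-string filtering for a pretraining corpus.
# The GUID below is a placeholder, not the real BIG-bench canary.

from typing import Iterable, Iterator

# Canaries the filter respects. Because derived canaries are superstrings
# of the original, a plain substring check on the original canary also
# catches benchmarks the filter's author has never heard of.
CANARY_STRINGS = [
    "canary GUID 00000000-0000-0000-0000-000000000000",  # placeholder
]

def is_contaminated(text: str) -> bool:
    """True if the document embeds any known canary string."""
    return any(canary in text for canary in CANARY_STRINGS)

def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only documents that carry no canary string."""
    for doc in documents:
        if not is_contaminated(doc):
            yield doc
```

The substring check is the whole point: one scan per document excludes not just the benchmark itself but anything whose author bothered to include the canary, whether or not the lab knew the document existed.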

Today

It’s unclear whether canary strings have worked. They showed up in GPT-4, Opus 3, and Sonnet 3.5, and now they’re reproduced by two more frontier models: Gemini 3 Pro and Claude Opus 4.5. Dave Orr has confirmed that Google chooses not to filter on canary strings, using other benchmark-filtering methods instead; I believe those other approaches are weaker. The case of Anthropic is more mysterious, as their public statements often strongly imply that they do use canary strings for filtering.

I don’t think it would be very useful for me to speculate on the inner workings of Anthropic’s pretraining data filtering, and I don’t think their public statements allow a full reconstruction of how the canary got into Opus 4.5. Instead, I can speak more concretely about the implications of these canary strings being in these models’ training corpora.

They Trained on Benchmarks

Take Google’s strategy of filtering out specific benchmark text rather than filtering on canary strings. This means that their training corpora include:

  • Benchmarks that Google didn’t think to exclude

  • Models’ CoT transcripts from working through benchmark problems, published in a blog post with a canary string

  • Discussions of models aggressively scheming

  • Discussions between ornithologists about the answer to that one benchmark question about hummingbirds

All of which they can’t cheaply filter out without using canary strings, and all of which they probably want to filter out. We want evaluations to be useful tools, and the more companies keep ignoring available methods for filtering out benchmark data, the less true that is. It also makes sense to filter out benchmark data with other methods, such as checking for specific benchmark questions even when no canary string is present, but on its own that is not the best we can do.
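For concreteness, here is a rough sketch of that kind of question-level matching: flag a training document if it shares a long verbatim n-gram with a known benchmark question. The n-gram length, the normalization, and the function names are illustrative choices of mine, not a reconstruction of any lab’s actual pipeline.

```python
# Rough sketch of canary-free decontamination: flag a training document if it
# shares a long-enough n-gram with any known benchmark question. The n-gram
# length and normalization are illustrative, not any lab's published settings.

import re

N = 13  # n-gram length; published contamination checks use similar magnitudes

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial edits don't hide overlap."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())

def ngrams(text: str, n: int = N) -> set:
    """All word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(questions: list[str]) -> set:
    """Union of n-grams over every benchmark question we know about."""
    index = set()
    for q in questions:
        index |= ngrams(q)
    return index

def overlaps_benchmark(document: str, index: set) -> bool:
    """True if the document repeats any benchmark n-gram verbatim."""
    return not ngrams(document).isdisjoint(index)
```

The weakness is built in: the index only covers benchmarks someone remembered to add, and it says nothing about derived material that doesn’t quote the questions verbatim. A canary string, by contrast, excludes exactly the documents whose authors asked to be excluded, whether or not anyone at the lab has heard of the benchmark.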

  1. ^

    In fact, I’ve heard rumors that some people still believe this today.