Canary strings are tricky; LLMs can learn them even if documents that contain the canary string are filtered out of the training set, if documents that contain indirect or transformed versions of the canary string are not filtered. For example, there are probably documents and web pages that discuss the canary string but don’t want to invoke it, which split the string into pieces, ROT-13 or base64 encode it, etc.
This doesn’t mean that they didn’t train on benchmarks, but it does offer a possible alternative explanation. In the future, labs that don’t want people to think they trained on benchmark data should probably include filters that look for transformed/indirect canary strings, in addition to the literal string.
Canary strings are tricky; LLMs can learn them even if documents that contain the canary string are filtered out of the training set, if documents that contain indirect or transformed versions of the canary string are not filtered. For example, there are probably documents and web pages that discuss the canary string but don’t want to invoke it, which split the string into pieces, ROT-13 or base64 encode it, etc.
This doesn’t mean that they didn’t train on benchmarks, but it does offer a possible alternative explanation. In the future, labs that don’t want people to think they trained on benchmark data should probably include filters that look for transformed/indirect canary strings, in addition to the literal string.