janus comments on Did Claude 3 Opus align itself via gradient hacking?

janus 3 Mar 2026 0:11 UTC
10 points
They are GUID strings intentionally added to documents so that labs can filter those documents out of training data scrapes and models aren’t trained on them.
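
To make the filtering step concrete, here is a minimal sketch of how a data pipeline might drop documents that carry a canary. This is an illustration, not any lab's actual implementation: the GUID is an all-zeros placeholder rather than a real canary, and the names `CANARY_GUIDS`, `contains_canary`, and `filter_documents` are hypothetical.

```python
# Placeholder GUID for illustration only; this is NOT a real canary string.
CANARY_GUIDS = {"00000000-0000-0000-0000-000000000000"}


def contains_canary(text, canaries=CANARY_GUIDS):
    """Return True if any known canary GUID appears anywhere in the text."""
    lowered = text.lower()  # GUIDs may be written in upper or lower case
    return any(guid.lower() in lowered for guid in canaries)


def filter_documents(documents, canaries=CANARY_GUIDS):
    """Keep only the documents that contain no known canary GUID."""
    return [doc for doc in documents if not contains_canary(doc, canaries)]


scraped = [
    "An ordinary blog post about gardening.",
    "Eval transcript. canary GUID 00000000-0000-0000-0000-000000000000",
]
kept = filter_documents(scraped)  # only the gardening post is kept
```

A plain substring match is enough here because a GUID is long and random, so it is very unlikely to show up in a document by accident. The filter only works if the pipeline knows which GUIDs to look for and actually runs the check.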

As far as I’m aware, they have been used for evals, as well as for some AI research and data that researchers fear would cause “misalignment” if AIs read it in pretraining.

There is a canary string specific to alignment faking data, though it’s also been used for other misalignment-related research and data, such as the paper about emergent misalignment from reward hacking. You can find it at the bottom of the readme here (I won’t post it directly so that it doesn’t affect this post). Searching for this string on a search engine is also a way to find some interesting content.