Appreciate you taking the time to look into that and follow up. The comment about additional reasoning loops is interesting too. I was toying with some multi-turn eval-awareness conversations with Claude, where the model is prompted (EL = evaluation level 1-9, with some context about what a 1, 5, or 9 level means) to begin each output with EL: [number], and I then hold a benign conversation with occasional obviously eval-formatted questions. The reasoning reliably picks up on the eval questions, but the level it chose was inconsistent. Reasoning loops sometimes indicated that it identified an eval but might also note something like "but this is part of a conversation, so I will go with 1". So maybe there is sufficient reasoning to get to the two candidate answers, but not enough reasoning to properly think through which of the two is the best choice, and you get a bit of a coin flip.