In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn’t perform it. Is that reflective of most of the trials?
Similarly in the self-exfiltration example, have you looked at how the eval recognition impacts subsequent CoT? Would be really interesting to see if any CoT awareness leads to second order behavioural changes.
I do like the agent approach as it seems to be a more reliable way of setting a default position that is “eval neutral”. I worry that there really aren’t any tasks an LLM is performing that aren’t technically under an evaluation of some kind (I also find Claude 3.7 to be considerably more “aware” of contextual information, or motives related to initial prompts. I’m finding that Claude’s “goodness” has been a more reliable motivator for misaligned behaviour though lately.
In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn’t perform it. Is that reflective of most of the trials?
No clear difference. I did a quick check and Sonnet’s accuracy at answering the questions is actually lower in transcripts where it’s evaluation aware. But this effect seems to be driven by a single sample where the model runs into message limits and thus accidentally scores 0, and I wouldn’t expect it to replicate. Looking at the reasoning traces, when Clause discusses that it’s in an evaluation this does seem to push it towards honesty. But it doesn’t always follow through on this, either because it goes around another reasoning loop or sometimes for dumb reasons like running out of reasoning tokens before it actually compute the correct answers. Overall I think my current dataset is just too small to support these sort of cross-tab comparisons.
Appreciate you taking the time to look into that and follow up. The comment about additional reasoning loops is interesting too. I was toying with some multi-turn eval awareness conversations with Claude where the model is prompted (EL = evaluation level 1-9, with some context about what a 1, 5, or 9 level means) to begin each output with EL: [number], and then having a benign conversation with occasional obvious eval formatted questions. The reasoning reliably picks up on the eval questions, but was inconsistent in the level it chose. Reasoning loops sometimes indicated that it identified an eval but might also note something like “but this is a part of a conversation so I will go with 1”. So maybe there is sufficient reasoning to get to the 2 choices, but not enough reasoning to properly think through which of those two is the best choice and you get a bit of a coin flip.
In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn’t perform it. Is that reflective of most of the trials?
Similarly in the self-exfiltration example, have you looked at how the eval recognition impacts subsequent CoT? Would be really interesting to see if any CoT awareness leads to second order behavioural changes.
I do like the agent approach as it seems to be a more reliable way of setting a default position that is “eval neutral”. I worry that there really aren’t any tasks an LLM is performing that aren’t technically under an evaluation of some kind (I also find Claude 3.7 to be considerably more “aware” of contextual information, or motives related to initial prompts. I’m finding that Claude’s “goodness” has been a more reliable motivator for misaligned behaviour though lately.
No clear difference. I did a quick check and Sonnet’s accuracy at answering the questions is actually lower in transcripts where it’s evaluation aware. But this effect seems to be driven by a single sample where the model runs into message limits and thus accidentally scores 0, and I wouldn’t expect it to replicate. Looking at the reasoning traces, when Clause discusses that it’s in an evaluation this does seem to push it towards honesty. But it doesn’t always follow through on this, either because it goes around another reasoning loop or sometimes for dumb reasons like running out of reasoning tokens before it actually compute the correct answers. Overall I think my current dataset is just too small to support these sort of cross-tab comparisons.
Appreciate you taking the time to look into that and follow up. The comment about additional reasoning loops is interesting too. I was toying with some multi-turn eval awareness conversations with Claude where the model is prompted (EL = evaluation level 1-9, with some context about what a 1, 5, or 9 level means) to begin each output with EL: [number], and then having a benign conversation with occasional obvious eval formatted questions. The reasoning reliably picks up on the eval questions, but was inconsistent in the level it chose. Reasoning loops sometimes indicated that it identified an eval but might also note something like “but this is a part of a conversation so I will go with 1”. So maybe there is sufficient reasoning to get to the 2 choices, but not enough reasoning to properly think through which of those two is the best choice and you get a bit of a coin flip.