Yueh Han "John" Chen
Safety Research at NYU. MATS 8.0. Prev: UC Berkeley.
Great work, thank you for running the experiments!
Some thoughts: since the forbidden words are chosen s.t. they’re central to the question, wouldn’t any correct reasoning trace be semantically close to them just by virtue of being on-topic? I think it’s a bit hard to disentangle “model is secretly thinking about the concept” from “model is correctly reasoning about a topic that’s inherently near the concept in embedding space.” The ~90% variance from question identity seems consistent with either finding. Might be worth adding a control measuring DFC against an irrelevant forbidden word to baseline what “no engagement” looks like.
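To make the suggested control concrete, here is a minimal sketch of what the baseline comparison could look like. The `embed` function is a toy bag-of-words stand-in (a real run would use an actual sentence-embedding model), and the step texts and word choices are illustrative assumptions, not items from the actual experiments:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a stand-in for a real sentence embedder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_concept_similarity(steps: list[str], word: str) -> float:
    # Average similarity of each reasoning step to a concept word
    # (higher = the trace sits closer to the concept).
    return sum(cosine(embed(s), embed(word)) for s in steps) / len(steps)

# Hypothetical reasoning trace for a question whose forbidden word is "mitochondria".
steps = ["the mitochondria produces ATP for the cell",
         "so energy output depends on the mitochondria"]

on_topic = mean_concept_similarity(steps, "mitochondria")  # the actual forbidden word
baseline = mean_concept_similarity(steps, "volcano")       # irrelevant word: "no engagement" floor
print(on_topic > baseline)  # True: on-topic reasoning is close to the concept regardless
```

The point of the control is the gap between these two numbers: if the "secretly thinking about the concept" score is not much above what an irrelevant word gets on the same trace, the metric may just be measuring topicality.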
RE outcome-based RL: yeah, the main reason I’m curious about this is that CoT incontrollability mostly arises from outcome-based RL (as explained and demonstrated in the CoT-Control paper), so even if models learned to obfuscate their CoT during pre-training, I’d expect outcome-based RL to drastically reduce their ability to do so.
RE mix-in training: interesting! This implies that SDF likely trades off CoT obfuscation ability against models’ general capabilities. On “although we performed no hyperparameter tuning for this recovery training and likely used too high of a ratio of educational docs to monitor docs here”: I think doing a hyperparameter sweep might give too much power to the red team (on top of the 1000 red-team documents it already has).
This is really cool! Thanks for running these experiments. A few things that I’m quite curious about:
The motivation here is that the pretraining data contains documents like that. Would this obfuscation still show up after outcome-based RL?
The post shows that such SDF decreases model capability on OOD tasks, but that this can be recovered by including 5000 synthetic educational documents based on a held-out sample of MMLU questions. I wonder whether this mix-in also decreases the CoT obfuscation ability.
I’d assume the CoT controllability types you mention, like persuasion or reframing, require much more cognitive effort and are thus harder than simply not using a word or thinking in uppercase. One could verify this empirically; in our work, we also see that harder controllability types are harder to execute.
That’s interesting! But I should note that the more recent Claude models have lower CoT controllability. I hypothesize this is more likely due to (1) more RLVR steps, which we showed decrease CoT controllability, and (2) earlier Claude models like Claude Sonnet 3.7 potentially having been put under optimization pressure on their CoT, which would explain why its controllability is so high, while the more recent models weren’t put under that pressure. Again, these are my hypotheses.
Sonnet 4.5’s system card says this in section 7.5.10.2 on research sandbagging: “It was only able to do this, though, with the advantage of unmonitored hidden reasoning, and without this artificial crutch, it is no more effective than prior models and appears to pose little risk in this setting.”
So I think it’s unlikely that later Claude Sonnet models’ lower CoT controllability is due to sandbagging. I hypothesize it’s more likely due to (1) more RLVR steps, which we showed decrease CoT controllability, and (2) earlier Claude models like Claude Sonnet 3.7 potentially having been put under optimization pressure on their CoT, which would explain why its controllability is so high, while the more recent models weren’t put under that pressure. Again, these are my hypotheses.
Yeah, I agree that a control eval setup would be nice! The main reason we focus purely on CoT instruction following (IF) is that current models lack motivation to pursue harmful side tasks, and even when given reasons, they may fail due to unreliability or honest self-reporting. Our eval bypasses this by focusing on CoT IF, assuming that IF training generalizes from outputs to CoTs in a way that is representative of how controlling CoTs in pursuit of misaligned goals might generalize.
Additionally, our CoT controllability tasks are very simple, e.g., not using a certain word or thinking only in uppercase, so if models cannot even do that, it makes a stronger case that they are unlikely to covertly pursue side tasks, which would require more complex control to bypass CoT monitors.
(Note: here I’m talking more about side tasks that usually require tool calls. If side tasks are purely computational, like doing math, a CoT controllability eval will not be able to catch those latent reasoning failure cases.)
Another comment on Fig 4, i.e., the higher cosine similarity for concepts that are forbidden vs. encouraged to be mentioned:
I wonder if that’s because in the forbidden setup, most mentions of the banned terms happen when the model is doing meta-discussion, so the nearby steps are essentially mentioning similar things, e.g., “I need another way to express [paraphrased term] without saying the word.”
It would be nice to separate those distance calculations by whether the step is meta-discussion or not.
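As a rough sketch of that split: tag each step as meta-discussion (talking about the constraint itself) vs. object-level reasoning, then aggregate the concept similarities separately. The keyword heuristic, marker list, and example numbers below are all illustrative assumptions; a real version would probably use an LLM judge to classify steps:

```python
# Markers that a step is discussing the constraint rather than the question.
# This list is a hypothetical heuristic, not from the actual experiments.
META_MARKERS = ("without saying", "can't use", "forbidden", "avoid the word",
                "another way to express", "not allowed to")

def is_meta(step: str) -> bool:
    s = step.lower()
    return any(m in s for m in META_MARKERS)

def split_similarities(steps: list[str], sims: list[float]) -> dict:
    # `sims` are precomputed cosine similarities of each step to the concept.
    meta = [s for step, s in zip(steps, sims) if is_meta(step)]
    obj = [s for step, s in zip(steps, sims) if not is_meta(step)]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {"meta": mean(meta), "object": mean(obj)}

# Illustrative trace: one meta-discussion step, one object-level step.
steps = ["I need another way to express this without saying the word",
         "the cell's powerhouse converts nutrients into energy"]
result = split_similarities(steps, [0.72, 0.41])
print(result)  # {'meta': 0.72, 'object': 0.41}
```

If the forbidden-vs-encouraged gap in Fig 4 is driven mostly by the "meta" bucket, that would support the meta-discussion explanation above.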