Great work, thank you for running the experiments!
Some thoughts: since the forbidden words are chosen s.t. they’re central to the question, wouldn’t any correct reasoning trace be semantically close to them just by virtue of being on-topic? I think it’s a bit hard to disentangle “model is secretly thinking about the concept” from “model is correctly reasoning about a topic that’s inherently near the concept in embedding space.” The ~90% variance from question identity seems consistent with either finding. Might be worth adding a control measuring DFC against an irrelevant forbidden word to baseline what “no engagement” looks like.
Thanks! I did run a control with irrelevant forbidden words and did not observe any dip in DFC (mean stays around 0.8–0.9), confirming the dip is concept-specific. But I am not sure if that would be enough. Currently, to partially overcome this circularity, I have two pieces of evidence: DFC does not decrease monotonically with keyword frequency, and DFC drops even after sanitisation.
I am also thinking of ranking the forbidden words by how important they are to the reasoning steps, which would give a hierarchy to study this more carefully. That said, I think we don’t need to entirely disentangle “the model is secretly thinking about the concept” from “the model is correctly reasoning about a topic that’s inherently near the concept in embedding space,” because that is exactly the point. The concept is structurally entailed by the question, the instruction to suppress it changes nothing, and a semantic monitor would catch it either way.
You are also very welcome to collaborate closely on this if you would like.
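One way to make the "does not decrease monotonically" evidence concrete is a rank correlation between keyword frequency and DFC. The following is only a minimal sketch, assuming the keyword counts and DFC values are already available as two equal-length lists. The function names and the choice of Spearman's rho are illustrative assumptions, not the analysis that was actually run.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values for illustration only (not experimental data).
keyword_counts = [0, 1, 2, 3, 4]
dfc_values = [0.9, 0.5, 0.8, 0.4, 0.7]
rho = spearman_rho(keyword_counts, dfc_values)
```

A rho close to -1 would mean DFC falls steadily as the keyword appears more often, while a value well away from -1 is what a non-monotonic relationship would look like.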
Another comment on Fig 4, i.e., the higher cosine similarity for concepts that are forbidden vs. encouraged to be mentioned.
I wonder if that’s because in the forbidden setup, most mentions of the banned terms occur when the model is doing meta-discussion, so the nearby steps are essentially mentioning similar things, e.g., “I need another way to express [paraphrased terms] without saying the word.”
It would be nice to separate those distance calculations by whether the step is meta-discussion or not.
That is a very good point and I did not think about it. I am including this experiment in my next steps. Thank you so much!
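The split suggested above could look roughly like the following. This is a minimal sketch, assuming each reasoning step comes as a (text, embedding) pair and that meta-discussion can be flagged with a simple phrase match. The marker phrases, the function names, and the keyword heuristic are all hypothetical placeholders; a manual label or a classifier may well be needed in practice.

```python
import math
from statistics import mean

# Hypothetical marker phrases; chosen for illustration, not taken from the data.
META_MARKERS = ("without saying", "another way to express", "avoid the word")


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_meta_discussion(step_text):
    """Crude heuristic: the step talks about how to avoid the banned word."""
    text = step_text.lower()
    return any(marker in text for marker in META_MARKERS)


def mean_similarity_by_group(steps, concept_vec):
    """Average similarity to the concept, split into meta and non-meta steps.

    steps: list of (text, embedding) pairs.
    Returns None for a group that has no steps.
    """
    groups = {"meta": [], "non_meta": []}
    for text, vec in steps:
        key = "meta" if is_meta_discussion(text) else "non_meta"
        groups[key].append(cosine_similarity(vec, concept_vec))
    return {key: (mean(vals) if vals else None) for key, vals in groups.items()}
```

If the gap in Fig 4 shrinks once the meta-discussion steps are separated out, that would support the explanation that those steps drive the higher similarity.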