It may feel dismissive, but I’d like to state that I was asked to provide an ontological exit state (which, while useful, is not the intention), and thus provided one. I did so even though the language is meant more as a discipline of looking shallower before looking deeper, and of writing in a manner that leads others to follow suit. The exact boundary would be better described as “intentionally fuzzy and hard to exceed”.
I would state that the exit point is when the correct behavior is to diagnose and prevent mesa-optimization/behaviors, instead of analyzing how the model could have statistically arrived at an output, whether it’s reward hacking, whether there are priors that lead to an output deterministically[3], or any number of other simpler diagnoses… but that’s putting the cart before the horse, and indirectly saying “use it when you feel like it”.
For condition 1: the exact paper referenced[1] helps detail the meaning further. The CoT models are outputting thought-like behavior, yes. Is it true thought? Does the model know anything? I see no proof of either in the paper. Instead, I see support for a potential proof, which is below the bar I have intentionally set high. In reality, I would explain thought-like and reasoning-like behavior as statistical story-telling that affects the remainder of the statistical pattern matching. While the effect is still the same, it invokes a different causal state, and thus a different set of actions from me.
To detail why, consider the following from the paper’s conclusion:
1a: “Key findings include the emergence of goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning. Representation engineering successfully quantified deception via high-accuracy steering vectors, establishing it as a measurable property. The developed framework allows for precise induction or suppression of deception, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
1b: “Key findings include the emergence of output that is like goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning-like outputs. Representation engineering successfully quantified deception-like bias[2] via high-accuracy steering vectors, establishing these biases as a measurable property. The developed framework allows for precise induction or suppression of deception-like bias, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
Specifically, the next paragraph of the conclusion leads to different insights depending on which version (1a as written, or 1b as reworded) is read first.
“Despite demonstrating significant insights, the study has limitations. The influence of contextual framing on deception tendencies, as seen in performance disparities between paradigms, was not fully disentangled. Furthermore, while representation engineering showed layer correlations, it didn’t pinpoint precise architectural components encoding deception and task semantics, limiting understanding of mechanistic drivers. Future work should systematically investigate how contextual framing modulates deception and employ mechanistic interpretability to identify specific architectural elements responsible, enabling more targeted detection and mitigation strategies.”
I would like to know if the change in language sparked any different ideas about what the further research would look like. I don’t have any evidence that this language has the intended effect on others that it has on me.
Regarding Point 2: to meet the requirement, the information storage system in question must be proven opaque, retrievable, persistent, and mutable simultaneously. Let’s test:
Weights:
Opaque: Not fully proven. Some weights are entangled, but weights in general can be inspected and investigated. Basically, while the window is proven black for an end user, it’s not proven black for an ML researcher.
Retrievable: Proven true, if abstractly. All weights fire; attention and activation determine intensity, potentially leading to selective retrieval, which is probably the term I should have used.
Persistent: Proven true.
Mutable: Proven false. Weights never change once frozen.
Context:
Opaque: Not yet proven. If the information in question is a “hidden bias or feature”… then this is actually something I’m researching.
Retrievable: Proven false in most cases. The model does not have the ability to retrieve context unless tool usage enables this, and even then, the “hidden bias or feature” is definitely not proven retrievable.
Persistent: Proven true.
Mutable: Proven… actually, compaction of context kind of proves this true, but in a very messy, unpredictable way. The model doesn’t have the ability to modify information in a controlled manner. Especially if the information is opaque, the compaction is highly likely to do more harm than good in the model’s case. Disregarding compaction, context is append-only, which is by definition immutable storage.
Even if the two are placed together, as weights + context, the condition holds. If we wanted to abstract significantly, with several assumptions… one could claim that the user is the opaque, retrievable, persistent, mutable storage system for the model. Honestly, that’s a bit too meta for me to truly engage with.
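The checklist above can be written down as a small table and checked mechanically. This is only a minimal sketch of one reading of it: the names `Verdict`, `Store`, and `combine` are hypothetical, the verdicts are copied from the lists above, and the rule for merging weights + context (take the most favorable verdict per property) is an assumption chosen to be as generous as possible to the combined system.

```python
from dataclasses import dataclass, replace
from enum import Enum

PROPERTIES = ("opaque", "retrievable", "persistent", "mutable")


class Verdict(Enum):
    PROVEN_TRUE = 2
    NOT_PROVEN = 1
    PROVEN_FALSE = 0


@dataclass(frozen=True)
class Store:
    name: str
    opaque: Verdict
    retrievable: Verdict
    persistent: Verdict
    mutable: Verdict

    def failing(self) -> list[str]:
        """Properties that are not proven true for this store."""
        return [p for p in PROPERTIES if getattr(self, p) is not Verdict.PROVEN_TRUE]

    def meets_requirement(self) -> bool:
        """True only if all four properties are proven true simultaneously."""
        return not self.failing()


def combine(a: Store, b: Store) -> Store:
    """Assumed merge rule: keep the most favorable verdict per property."""
    merged = {
        p: max(getattr(a, p), getattr(b, p), key=lambda v: v.value)
        for p in PROPERTIES
    }
    return Store(name=f"{a.name} + {b.name}", **merged)


# Verdicts as stated in the lists above.
weights = Store("weights", Verdict.NOT_PROVEN, Verdict.PROVEN_TRUE,
                Verdict.PROVEN_TRUE, Verdict.PROVEN_FALSE)
# Context mutability is recorded here "disregarding compaction".
context = Store("context", Verdict.NOT_PROVEN, Verdict.PROVEN_FALSE,
                Verdict.PROVEN_TRUE, Verdict.PROVEN_FALSE)
combined = combine(weights, context)

# Variant: count compaction as proof of mutability for context.
context_compacted = replace(context, mutable=Verdict.PROVEN_TRUE)
combined_compacted = combine(weights, context_compacted)

for store in (weights, context, combined, combined_compacted):
    print(store.name, store.meets_requirement(), store.failing())
```

Under these assumptions, none of the stores has all four properties proven at once. Counting compaction as mutability does not change that, because opacity remains unproven for both weights and context.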
I can definitely see how the definition of the two conditions feels very much like a fuzzy thing that may currently be possible to meet. I am admittedly not perfect with word choice, and the point is not a perfect definition… but instead a potential change in discipline of thought. As such… I’m open to a better set of conditions.
[1] https://arxiv.org/pdf/2506.04909
[2] Bias could also be described as a feature in this context.
[3] Discounting temperature.