I am not sure how to make this distinction. When can the “-like behaviour” be validly dropped?
Compare and contrast:
1a. The bird exhibited ovipositing behavior.
1b. The bird laid an egg.
2a. The bomb exhibited explosion-like behavior.
2b. The bomb exploded.
3a. The computer exhibited calculation-like behavior.
3b. The computer made a calculation.
4a. The results showed that more total units occurred in the anthropomorphic toy condition than in the nonanthropomorphic toy condition and that conversational units occurred in the anthropomorphic conditions only.
4b. Children talk to their dolls but not their blocks.
5a. The user exhibited being-deceived-like behaviour while the chatbot exhibited deception-like behavior.
5b. The user was deceived by the chatbot.
Radical behaviorists would answer “never”. (See example 4a, which I did not make up.) If we are not radical behaviorists, how do we decide when something passes the duck test enough to just call it a duck? When we look inside the chatbot, what are we looking for to justify saying that a chatbot deceived someone, instead of exhibiting deception-like behavior?
I agree that it’s a hard distinction to make, especially when applied outside of LLM space; I hadn’t considered its use outside the scope of my conversation. The examples you gave are excellent; the calculator example in particular is interesting to explore and has led me to crystallize this fairly cleanly. To be clear, the article’s position is that “-like” can be dropped when the term is better explained by the object performing the term, rather than by the object producing it via statistical probability… which is admittedly too conceptual to be workable, as you correctly called out. Oops. I’ve had to rethink this entire thing from the ground up.
To save some time (I know this is a book, I think long when it matters to me), the answer to your final question is that -like applies when:
1: The object lacks the proven capacity for the mechanism that the unqualified term defines. E.g., “Deception is the knowledge of two distinct states, one of which is intentionally false. The model does not have proven capacity for the action of deception, lacking the ability to represent multiple outputs and select a less truthful output intentionally; therefore the model performed deception-like behavior.”
2: The object lacks proven information storage that is opaque, retrievable, persistent, and mutable. E.g., “The model has no proven way of opaquely storing and retrieving information.”
The intention is that the qualifier would be dropped when any of the following are met:
The object has been proven to have gained the capacity for the action that the unqualified term defines.
The object has with high probability gained that capacity due to opaque state structures supporting the capacity for the action.
This has been specifically split into two conditions because 2 is an early exit that would then warrant the intentional stance’s vocabulary instead. “The chatbot potentially deceived” is plausible and fine in condition 2’s case. Specifically, saying that a chatbot is deceiving would require either proven mesa-optimization OR proof of opaque, persistent, mutable information storage. The presence of opaque storage makes disproving capacity too difficult, and the risk of the behavior being genuine too high, to use the qualifier. I’m not claiming these capabilities are impossible for the chatbot, but I am saying the burden of proof is on the individual dropping -like from the conversation. It is an intentionally high bar to promote rational thought.
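To make the exit rule concrete, here is a minimal sketch of how I read the two conditions as a decision procedure. The names `Evidence`, `Wording`, and `choose_wording` are hypothetical, invented for illustration; the booleans stand for what has been proven about an object, not for what is actually true of it.

```python
from dataclasses import dataclass
from enum import Enum


class Wording(Enum):
    """The three phrasings discussed above (labels are illustrative)."""
    UNQUALIFIED = "drop -like"          # e.g. "the bomb exploded"
    POTENTIAL = "potentially <term>"    # e.g. "the chatbot potentially deceived"
    LIKE = "<term>-like behavior"       # e.g. "deception-like behavior"


@dataclass(frozen=True)
class Evidence:
    """What has been *proven* about an object (hypothetical structure)."""
    capacity_proven: bool         # negates condition 1
    opaque_storage_proven: bool   # negates condition 2


def choose_wording(evidence: Evidence) -> Wording:
    """Pick a phrasing; the burden of proof is on dropping the qualifier."""
    if evidence.capacity_proven:
        # The mechanism the unqualified term defines is proven.
        return Wording.UNQUALIFIED
    if evidence.opaque_storage_proven:
        # Early exit: capacity can no longer be ruled out cheaply,
        # so intentional-stance vocabulary is warranted.
        return Wording.POTENTIAL
    # Neither proof exists, so the qualifier stays.
    return Wording.LIKE


# The bird, bomb, and calculator are resolved by capacity.
print(choose_wording(Evidence(capacity_proven=True, opaque_storage_proven=False)))
# Under my current reading, the chatbot has neither proof.
print(choose_wording(Evidence(capacity_proven=False, opaque_storage_proven=False)))
```

The ordering of the checks is a design choice on my part: proven capacity settles the question outright, while proven opaque storage only shifts the wording to “potentially”.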
To go a bit deeper into the reasoning I have now:
The intent of the vocabulary is not to be a long-term, ontologically precise way of discussing everything, despite that being a very apt way of distinguishing its definition. It is meant to have immediate methodological effect in the scope of LLM studies, and to help enforce epistemic discipline. In the post, I made the distinction that it should be an authorship question, i.e., “where did the behavioral pattern originate?” However, if it is an authorship question, then the calculator is exhibiting calculation-like behavior, as you noted. If I change the metric to “when authorship is in question”, then mesa-optimization is unfairly dismissed by the nature of the statement. Additionally, the definition of “in question” becomes… something in question. I will admit that my conclusion on the subject of mesa-optimization is that the area of study is currently a distraction from ongoing problems that desperately need attention, which introduces a significant bias into my framing that I’m fighting against. However, I do not think mesa-optimization should be dismissed outright, just that currently understood mechanisms must be explored before deferring to an unproven theory.
To address your examples directly:
1/2/3: The bird, bomb, and calculator are resolved by capacity. Notably, the calculator is not resolved by my prior modeling based on authorship, which prompted me to analyze several other things to properly define my intention.
4: Children have the capacity for conversation. The radical behaviorist phrasing correctly showed that I was doing the opposite of what I intended: I’m saying to attribute what has been proven, and to use -like when it hasn’t.
5: This splits asymmetrically. Even under authorship, the user was simply deceived, and the chatbot exhibited deception-like behavior. However, it’s much more clear-cut when capacity is used.
Interestingly, 4a actually hits the main intention of this language shift perfectly. Was the parallel intentional? By having anthropomorphized toys, children conversed with them. When that wasn’t the case, they didn’t. That exact behavioral shift is what I’m attempting to state matters in alignment research.
This distinction could still be an error on my part, or the bias of my personal beliefs on the field… but it could also be the actual definition I have been circling. Current public discourse is happening with inherent assumptions that are unproven. The stance I am taking is that there is still plenty of research to be done before limiting ourselves to a “spooky action at a distance” equivalent. This is what I’m planning on discussing in my posts, using this language. Solidifying my vocabulary first is very important to me so I don’t have to redefine goal-like when I’d rather be redefining the actual statements in future posts.
To save some time (I know this is a book, I think long when it matters to me), the answer to your final question is that -like applies when:
1: The object lacks the proven capacity for the mechanism that the unqualified term defines. E.g., “Deception is the knowledge of two distinct states, one of which is intentionally false. The model does not have proven capacity for the action of deception, lacking the ability to represent multiple outputs and select a less truthful output intentionally; therefore the model performed deception-like behavior.”
2: The object lacks proven information storage that is opaque, retrievable, persistent, and mutable. E.g., “The model has no proven way of opaquely storing and retrieving information.”
Don’t frontier LLMs straightforwardly pass both of these tests? We can find deception vectors where the model does know what it thinks is the right answer and outputs something else, and LLMs can store facts in their weights and transmit facts forward throughout a context (although there are limitations to this).
It feels dismissive, but I’d like to state that I was asked to provide an ontological exit state (which, while useful, is not the intention), and thus provided one. I did so even though the language is more of a discipline of looking shallower before deeper, and of writing in a manner that leads others to follow suit. The exact boundary would be better described as “intentionally fuzzy and hard to exceed”.
I would state that the exit point is when the correct behavior is to diagnose and prevent mesa-optimization/behaviors, instead of analyzing how the model could have statistically arrived at an output, whether it’s reward hacking, whether there are priors that lead to an output deterministically[3], or any number of other simpler diagnoses… but that’s putting the cart before the horse, and indirectly saying “use it when you feel like it”.
For condition 1: The exact paper referenced[1] helps detail further meaning. The CoT models are outputting thought-like behavior, yes. Is it true thought? Does the model know anything? I see no proof of either in the paper. Instead I see support for a potential proof, which is below the bar I have intentionally set high. In reality, I would explain thought-like and reasoning-like behavior as statistical storytelling that affects the remainder of the statistical pattern matching. While the effect is the same, it implies a different causal state, and thus calls for a different set of actions from me.
To detail why, consider the following from the paper’s conclusion:
1a: “Key findings include the emergence of goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning. Representation engineering successfully quantified deception via high-accuracy steering vectors, establishing it as a measurable property. The developed framework allows for precise induction or suppression of deception, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
1b: “Key findings include the emergence of output that is like goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning-like outputs. Representation engineering successfully quantified deception-like bias[2] via high-accuracy steering vectors, establishing these biases as a measurable property. The developed framework allows for precise induction or suppression of deception-like bias, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
Specifically, the next paragraph leads to different insights depending on which is read prior.
“Despite demonstrating significant insights, the study has limitations. The influence of contextual framing on deception tendencies, as seen in performance disparities between paradigms, was not fully disentangled. Furthermore, while representation engineering showed layer correlations, it didn’t pinpoint precise architectural components encoding deception and task semantics, limiting understanding of mechanistic drivers. Future work should systematically investigate how contextual framing modulates deception and employ mechanistic interpretability to identify specific architectural elements responsible, enabling more targeted detection and mitigation strategies.”
I would like to know if the change in language sparked any different ideas on what the further research would look like. I don’t have any evidence that this language has the intended effect on others that it has on me.
Regarding Point 2: to meet the requirement, the information storage system in question must be proven opaque, retrievable, persistent, and mutable simultaneously. Let’s test:
Weights:
Opaque: Not fully proven. Some weights are entangled, but weights in general can be inspected and investigated. Basically, while the window is proven black for an end user, it’s not proven black for an ML researcher.
Retrievable: Proven true, if abstractly. All weights fire; attention and activation determine intensity, potentially leading to selective retrieval, which is probably the term I should have used.
Persistent: Proven true.
Mutable: Proven false. Weights never change once frozen.
Context:
Opaque: Not yet proven. If the information in question is “hidden bias or feature”… then this is actually something I’m researching.
Retrievable: Proven false in most cases. The model does not have the ability to retrieve context unless tool usage enables this, and even then, the “hidden bias or feature” is definitely not proven retrievable.
Persistent: Proven true.
Mutable: Proven… actually, compaction of context kind of proves this true, but in a very messy, unpredictable way. The model doesn’t have the ability to modify information in a controlled manner. Especially if the information is opaque, compaction is highly likely to do more harm than good in the model’s case. Disregarding compaction, context is append-only, which is by definition immutable storage.
Even if placed together, as weights + context, the condition holds. If we wanted to abstract significantly, with several assumptions… one could claim that the user is the opaque, retrievable, persistent, mutable storage system for the model. Honestly, that’s a bit too meta for me to truly engage with.
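The checklist above can be written down as a small sketch, mostly so the “weights + context” claim can be checked mechanically. Everything here is an assumption for illustration: the `Storage` class, the three-valued `Status` (proven true, proven false, not proven), my simplification of “proven false in most cases” to plain false, and especially `combine`, which is a deliberately generous upper bound that counts a property as proven if either store proves it.

```python
from dataclasses import dataclass, replace
from typing import Optional

# True = proven true, False = proven false, None = not proven either way.
Status = Optional[bool]


@dataclass(frozen=True)
class Storage:
    name: str
    opaque: Status
    retrievable: Status
    persistent: Status
    mutable: Status

    def meets_condition_2(self) -> bool:
        """All four properties must be proven true simultaneously."""
        props = (self.opaque, self.retrievable, self.persistent, self.mutable)
        return all(p is True for p in props)


def _either(x: Status, y: Status) -> Status:
    """Generous merge: proven if either side proves it."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False


def combine(a: Storage, b: Storage) -> Storage:
    """Upper-bound view of two stores treated as one system (an assumption)."""
    return Storage(
        name=f"{a.name} + {b.name}",
        opaque=_either(a.opaque, b.opaque),
        retrievable=_either(a.retrievable, b.retrievable),
        persistent=_either(a.persistent, b.persistent),
        mutable=_either(a.mutable, b.mutable),
    )


# Values transcribed from the checklist above; context ignores compaction.
WEIGHTS = Storage("weights", opaque=None, retrievable=True, persistent=True, mutable=False)
CONTEXT = Storage("context", opaque=None, retrievable=False, persistent=True, mutable=False)
# Variant that grants compaction as a (messy) form of mutability.
COMPACTED_CONTEXT = replace(CONTEXT, mutable=True)

for store in (WEIGHTS, CONTEXT, combine(WEIGHTS, CONTEXT), combine(WEIGHTS, COMPACTED_CONTEXT)):
    print(store.name, store.meets_condition_2())
```

Even when compaction is granted as mutability, the combined system still fails under these values, because opacity remains unproven for both stores.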
I can definitely see how the definition of the two conditions feels very much like a fuzzy thing that may currently be possible to meet. I am admittedly not perfect with word choice, and the point is not a perfect definition… but instead a potential change in discipline of thought. As such… I’m open to a better set of conditions.
[1] https://arxiv.org/pdf/2506.04909
[2] Bias could also be described as a feature in this context.
[3] Discounting temperature.