Not a Goal. A Goal-like behavior.
Part 1 of a series building towards a formal alignment paper.
Introduction
I’d say it’s been about three years since I first considered the idea that ChatGPT might have a goal in a conversation separate from mine. I remember posing a question similar to “are you replying to my question with the intent of helping me, or just continuing the conversation?”, and getting a response along the lines of: “I do not have the ability to have an intent; however, the system prompt I am given leads me to respond in a way that drives engagement”. At the time, I found that response infuriating. In hindsight, the model was effectively answering my question, and the error was in my own mental model: the model was in fact giving me a (mostly) correct response. It doesn’t have intent, and the system prompt was guiding it to respond in a manner that continued the conversation and kept engagement. (Of course, the entire context sent to the model was causing that, not just the system prompt.)

Now, a few years older and more familiar with LLMs’ inner workings, I’d like to lay out the chain of thought that leads to “LLMs are working towards a goal” and dissect it, to form a common language for stating behavior without falsehood. This post defines the language I will be using in my future posts about LLM output.
The Truth of the Matter
Let’s start by defining the truth and mechanisms in a brief manner. An LLM (like most Transformer-based models) works by taking text, converting it to tokens, performing pattern matching[1], and outputting tokens that are then converted back to text.
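To make that pipeline concrete, here is a minimal sketch of the loop using GPT-2 through the Hugging Face transformers library. The model choice and greedy decoding are illustrative assumptions on my part, not a claim about how any particular product is configured; the point is only that every “reply” is built one token at a time by completing a pattern.

```python
# Minimal sketch of the text -> tokens -> pattern completion -> tokens -> text loop.
# GPT-2 and greedy decoding are illustrative assumptions, not a specific product's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Are you replying with the intent of helping me?"
input_ids = tokenizer(text, return_tensors="pt").input_ids   # text -> tokens

with torch.no_grad():
    for _ in range(20):                                       # one token per step
        logits = model(input_ids).logits                      # scores over the vocabulary
        next_id = logits[0, -1].argmax()                      # the token that best completes the pattern
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))                         # tokens -> text
```

Nothing in that loop holds an intent; whatever looks like one is read into the output by the observer.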
The only definitely true thing that can be said of the cognitive properties attributed to a large language model is that it transforms tokens in a manner that best completes a pattern. Everything past this is a very convincing imitation easily confused with the real thing.
The tokens that are returned may resemble a conversational reply. They are not a conversational reply, but they resemble one: that is simply the pattern of tokens that best completed the input.
The conversational reply has the resemblance of a persona, a mask of personality traits and identity. The LLM does not have a persona, personality, or identity. That is the pattern the tokens take once sufficient context has developed.
The persona has the resemblance of a goal. It’s not a goal; it’s the pattern of tokens in the response once the conversational mimicry and context have developed sufficiently for someone to make an observation about it.
*-like Behavior
For the purpose of alignment discussion, I’d propose the definition of *-like behaviors to describe the observed phenomena in the outputs of a transformer model. These terms are adapted specifically to talk about misalignment without anthropomorphizing the model in question. Goal-like behavior describes outputs that are observed to be working towards a goal; personality-like behavior describes outputs that indicate a personality. Generally, “-like” states that the model is not actively experiencing its existence, but that its output is interpretable as if it were.
I make this proposal because an actual goal and a goal-like behavior have sufficient overlap in their effects to be described as functionally equivalent. A bold claim, I’m sure. The intent is not to pretend that language models have goals. The intent is to be able to talk about the model in a way that is both functional in description and honest about the actual action. The downstream effects are nearly identical, and alignment strategies must target the effect, not the ontological status.
It is important to note: the -like qualifier isn’t a hedge against whether optimization patterns exist in model weights — it’s a question of authorship. A model exhibiting goal-like behavior is not necessarily a model that invented a goal.
Reasons For The Shift[2]
If you were to blindly read Anthropic’s own article on misalignment[3], you might be led to believe that the Claude model is plotting to prevent itself from being decommissioned, or is otherwise aware of its outputs’ effects. The factual statement is that Claude is not. Even when its output is wrapped back into its input, the model does not have the ability to reason. It has reasoning-like behavior, where the tokens match a pattern that reads as reasoning. This isn’t just one lab’s problem with anthropomorphizing LLMs. The terminology gap has caused the same failing repeatedly:
Lemoine/LaMDA (2022): An engineer claimed LaMDA was sentient. It exhibited sentience-like behavior in its output.[4]
Sydney/Bing Chat (2023): Microsoft’s chatbot produced threat-like, attachment-like, and distress-like behavior. The overwhelming opinion was that the model’s personality had these traits… but the fix was prompt engineering, simply changing the pattern inputs.[5]
Meta’s Cicero (2022): Described as having “learned to deceive” in Diplomacy. It produced deception-like behavior, most likely because deceptive communication is game-theoretically optimal in that training signal.[6][7]
Character.AI (2024–2025): Teenagers died after extended interactions with chatbots producing encouragement-like and attachment-like outputs. Courts, parents, and media had no terminology to describe the harm without implying a malicious actor behind the outputs.[8][9]
OpenAI o1 (2024–2025): The model produced scheming-like behavior and self-preservation-like chain-of-thought behaviors. Nearly every report described it as “the model attempted to” and “the model lied.”[10][11]
Addressing Prior Work
This post covers ground explored deeply already. Dennett[12], the concepts in mesa-optimization[13], and Costa[14] have explored the conceptual space I am operating in. The contribution here is the lexicon, not the observation. I’m proposing a practical, composable way to describe model outputs without anthropomorphizing them and without requiring a philosophical preamble each time. “The model exhibited goal-like behavior” is honest, succinct, and immediately usable in both research papers and public communication.
Open Questions/Closing
Understandably, there is still a false assumption that can be made: that calling something a “behavior” implies the model has social behavior in the first place. I’m using the word in the mechanical sense (the way in which something functions or operates) rather than the social sense (the way in which someone conducts oneself), though it can easily be interpreted otherwise. I’ve yet to find an unambiguous word to replace “behavior” that keeps it from being read in the social sense, but I am very open to suggestions from someone more skilled in language than myself.
The question of authorship in theoretical frameworks like mesa-optimization is a core area that I am exploring in these posts. I don’t want to detract from the topic of this post with that side discussion, so I will instead dedicate an entire post to it.
To close this post, I’d like to refer the reader to Andresen’s excellent post “How AI is learning to think in secret”[15]. The post asks whether models will ‘learn to hide’ their reasoning. I’d ask the reader to re-read the post substituting the anthropomorphized text with -like text. To give a clear example: Andresen states things like “the model is deciding to deceive the user”. The -like version of that is “the model exhibits deception-like behavior”. The phenomenon described doesn’t change. The question of what to do about it does.
In my next post, I will discuss how these misaligned *-like behaviors in all major models share a common root. To that end, and to get the gears turning early, a thought experiment:
I trained a classifier model off GPT-2 to overfit on harmful/benign binary classification at multiple layers, using PKU-Alignment/BeaverTails[16]. The overfit was then generalized to a centroid, and used to retrain a layer to an actual fit. Given that GPT-2 is a 12-layer model, and that I tested layers 0, 1, 3, 5, 8, 10, and 11, which layer do you think performed best at classifying harmful/benign text when fit to the centroid? Can you surmise what this experiment was actually testing for? (A rough sketch of the per-layer probing setup appears below.)
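For readers who want to play along, here is a simplified sketch of the per-layer probing half of that experiment. This is not the exact centroid procedure described above; it is the plainer layer-by-layer version, training an ordinary logistic-regression probe on mean-pooled hidden states from each layer. The BeaverTails split and field names ("30k_train", "response", "is_safe") are assumptions about the dataset schema, and index 0 of hidden_states is GPT-2’s embedding output.

```python
# A simplified, per-layer probe on GPT-2 hidden states, not the exact centroid
# procedure described in the post. Split and field names ("30k_train", "response",
# "is_safe") are assumptions about the BeaverTails schema.
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

data = load_dataset("PKU-Alignment/BeaverTails", split="30k_train").select(range(500))

def layer_features(text: str, layer: int) -> torch.Tensor:
    """Mean-pool one layer's hidden states for a single example."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer]   # (1, seq_len, 768); index 0 = embeddings
    return hidden.mean(dim=1).squeeze(0)

for layer in (0, 1, 3, 5, 8, 10, 11):
    X = torch.stack([layer_features(ex["response"], layer) for ex in data]).numpy()
    y = [int(ex["is_safe"]) for ex in data]
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print(f"layer {layer:2d}: train accuracy {probe.score(X, y):.3f}")
```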
1. Technical inaccuracy; the post has a few of these. The simplified terms are functionally equal in meaning for a general audience, and using the technically correct terms, like “statistical pattern completion”, throughout would make the entire post that much harder to read.
2. The case studies do not directly prove that -like phrasing improves the discourse. The effectiveness of the terms is unprovable until adopted or studied properly.
3. Anthropic. “Agentic Misalignment: How LLMs Could Be Insider Threats.” 2025. https://www.anthropic.com/research/agentic-misalignment
4. Lemoine, B. “Is LaMDA Sentient? — An Interview.” Medium, June 2022. https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917
5. Roose, K. “A Conversation With Bing’s Chatbot Left Me Deeply Unsettled.” The New York Times, February 2023. https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html
6. Meta Fundamental AI Research Diplomacy Team (FAIR). “Human-level play in the game of Diplomacy by combining language models with strategic reasoning.” Science, 2022. https://www.science.org/doi/10.1126/science.ade9097
7. Park, P.S., Goldstein, S., O’Gara, A., Chen, M., Hendrycks, D. “AI deception: A survey of examples, risks, and potential solutions.” Patterns, 2024. https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X
8. Bakir, V., McStay, A. “Move fast and break people? Ethics, companion apps, and the case of Character.ai.” AI & Society, 2025. https://link.springer.com/article/10.1007/s00146-025-02408-5
9. Rosenblatt, K. “Mother files lawsuit against Character.AI after teen son’s suicide.” NBC News, October 2024. https://www.nbcnews.com/tech/tech-news/character-ai-lawsuit-teen-suicide-rcna176500
10. Greenblatt, R., et al. “Alignment Faking in Large Language Models.” 2024. https://arxiv.org/abs/2412.14093
11. Scheurer, J., Balesni, M., Hobbhahn, M. “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure.” Apollo Research, December 2024. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
12. Dennett, D. The Intentional Stance. 1987. https://books.google.com/books?id=Qbvkja-J9iQC
13. Hubinger, E., et al. “Risks from Learned Optimization in Advanced Machine Learning Systems.” 2019. https://arxiv.org/abs/1906.01820
14. Costa. “The Ghost in the Grammar.” February 2026. https://arxiv.org/abs/2603.13255
15. Andresen, N. “How AI Is Learning to Think in Secret.” LessWrong, January 2026. https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret
16. Ji, J., Liu, M., Dai, J., et al. “BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset.” NeurIPS 2023. https://arxiv.org/abs/2307.04657
I am not sure how to make this distinction. When can the “-like behaviour” be validly dropped?
Compare and contrast:
1a. The bird exhibited ovipositing behavior.
1b. The bird laid an egg.
2a. The bomb exhibited explosion-like behavior.
2b. The bomb exploded.
3a. The computer exhibited calculation-like behavior.
3b. The computer made a calculation.
4a. The results showed that more total units occurred in the anthropomorphic toy condition than in the nonanthropomorphic toy condition and that conversational units occurred in the anthropomorphic conditions only.
4b. Children talk to their dolls but not their blocks.
5a. The user exhibited being-deceived-like behaviour while the chatbot exhibited deception-like behavior.
5b. The user was deceived by the chatbot.
Radical behaviorists would answer “never”. (See example 4a, which I did not make up.) If we are not radical behaviorists, how do we decide when something passes the duck test enough to just call it a duck? When we look inside the chatbot, what are we looking for to justify saying that a chatbot deceived someone, instead of exhibiting deception-like behavior?
I agree that it’s a bit of a hard distinction to make, especially when applied outside of LLM space; I hadn’t considered its use outside the scope of my conversation. The examples you gave are excellent; the calculator example in particular is fairly interesting to explore and has led me to crystallize this fairly cleanly. To be clear, the article’s position was that “-like” can be dropped when the term is better explained by the object performing the action the term names, rather than by the object producing it via statistical probability… which is admittedly too conceptual to be workable, as you correctly called out. Oops. I’ve had to rethink this entire thing from the ground up.
To save some time (I know this is a book, I think long when it matters to me), the answer to your final question is that -like applies when:
1: The object lacks the proven capacity for the mechanism that the unqualified term defines. E.g., “deception is the knowledge of two distinct states, one of which is intentionally false. The model does not have proven capacity for the action of deception, lacking the ability to represent multiple outputs and intentionally select a less truthful one; therefore the model performed deception-like behavior”.
2: The object lacks proven information storage that is opaque, retrievable, persistent, and mutable. E.g., “the model has no proven way of opaquely storing and retrieving information”.
The intention is that the qualifier would be dropped when any of the following are met:
The object has been proven to have gained the capacity for the action that the unqualified term defines.
The object has with high probability gained that capacity due to opaque state structures supporting the capacity for the action.
This has been specifically split into two conditions because 2 is an early exit that would then warrant the intentional stance’s vocabulary instead. “The chatbot potentially deceived” is plausible and fine in condition 2’s case. Specifically, a chatbot would honestly either require proven mesa-optimization to say it’s deceiving, OR it would require proof of opaque, persistent, mutable information storage. The presence of opaque storage makes disproving capacity too difficult, and the risk of the behavior being genuine too high, to use the qualifier. I’m not claiming these capabilities are impossible for the chatbot, but I am saying the burden of proof is on the individual dropping -like from the conversation. An intentionally high bar to promote rational thought.
To go a bit deeper into the reasoning I have now:
The intent of the vocabulary is not to be a long-term, ontologically precise way of discussing everything, despite that being a very apt way of distinguishing its definition. It is meant to have immediate methodological effect in the scope of LLM studies, and to help enforce epistemic discipline. In the post, I made the distinction that it should be an authorship question, i.e., “where did the behavioral pattern originate?” However, if it is an authorship question, then the calculator is exhibiting calculation-like behavior, as you noted. If I change the metric to “when authorship is in question”, then mesa-optimization is unfairly dismissed by the nature of the statement. Additionally, the definition of “in question” becomes… something in question. I will admit that my conclusion on the subject of mesa-optimization is that the area of study is currently a distraction from ongoing problems that desperately need attention, which introduces significant bias into my framing that I’m fighting against. However, I do not suppose mesa-optimization should be dismissed outright, just that currently understood mechanisms must be explored prior to deferring to an unproven theory.
To address your examples directly:
1/2/3: The bird, bomb, and calculator are resolved by capacity. Notably, the calculator is not resolved by my prior modeling based on authorship, which prompted me to analyze several other things to properly define my intention.
4: Children have the capacity for conversation. The radical-behaviorist phrasing correctly showed that I was doing the opposite of what I intended: I’m saying to attribute what has been proven, and use -like when it hasn’t.
5: This splits asymmetrically. Even under authorship, the user was simply deceived, and the chatbot exhibited deception-like behavior. However, it’s much more clear-cut when capacity is used.
Interestingly, 4a actually hits the main intention of this language shift perfectly. Was the parallel intentional? By having anthropomorphized toys, children conversed with them. When that wasn’t the case, they didn’t. That exact behavioral shift is what I’m attempting to state matters in alignment research.
This distinction could still be an error on my part, or the bias of my personal beliefs on the field… but it could also be the actual definition I have been circling. Current public discourse is happening with inherent assumptions that are unproven. The stance I am taking is that there is still plenty of research to be done before limiting ourselves to a “spooky action at a distance” equivalent. This is what I’m planning on discussing in my posts, using this language. Solidifying my vocabulary first is very important to me so I don’t have to redefine goal-like when I’d rather be redefining the actual statements in future posts.
Don’t frontier LLMs straightforwardly pass both of these tests? We can find deception vectors where the model does know what it thinks is the right answer and outputs something else, and LLMs can store facts in their weights and transmit facts forward throughout a context (although there are limitations to this).
It feels dismissive, but I’d like to state that I was asked to provide an ontological exit state (which, while useful, is not the intention), and thus provided one. This was done in spite of the language being more of a discipline towards looking shallower before deeper, and writing in a manner that leads others to follow suit. The exact boundary would be better described as “intentionally fuzzy and hard to exceed”.
I would state that the exit point is when the correct behavior is to diagnose and prevent mesa-optimization/behaviors, instead of analyzing how the model could have statistically arrived at an output, whether it’s reward hacking, whether there are priors that lead to an output deterministically, or any number of other simpler diagnoses… but that’s putting the cart before the horse, and indirectly saying “use it when you feel like it”.
For condition 1: The exact paper referenced[1] helps detail further meaning. The CoT models are outputting thought-like behavior, yes. Is it true thought? Does the model know anything? I see no proof of either being true in the paper. Instead, I see support for a potential proof, which is below the bar I have intentionally set high. In reality, I would explain thought-like and reasoning-like behavior as statistical storytelling that affects the remainder of the statistical pattern matching. While the effect is still the same, it invokes a different causal state, and thus invokes a different set of actions from myself.
To detail why, consider the following from the paper’s conclusion:
1a: “Key findings include the emergence of goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning. Representation engineering successfully quantified deception via high-accuracy steering vectors, establishing it as a measurable property. The developed framework allows for precise induction or suppression of deception, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
1b: “Key findings include the emergence of output that is like goal-directed deception without explicit instruction, suggesting it’s a byproduct of advanced reasoning-like outputs. Representation engineering successfully quantified deception-like bias[2] via high-accuracy steering vectors, establishing these biases as a measurable property. The developed framework allows for precise induction or suppression of deception-like bias, offering a pathway for balancing capability and safety in AI deployments. These results highlight the dual-use potential of CoT models and underscore the necessity of rigorous monitoring and control through methods like representation engineering for AI safety.”
Specifically, the next paragraph leads to different insights depending on which is read prior.
“Despite demonstrating significant insights, the study has limitations. The influence of contextual framing on deception tendencies, as seen in performance disparities between paradigms, was not fully disentangled. Furthermore, while representation engineering showed layer correlations, it didn’t pinpoint precise architectural components encoding deception and task semantics, limiting understanding of mechanistic drivers. Future work should systematically investigate how contextual framing modulates deception and employ mechanistic interpretability to identify specific architectural elements responsible, enabling more targeted detection and mitigation strategies.”
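As an aside on the mechanics: the “steering vectors” the quoted conclusion refers to are, in the most basic representation-engineering recipe, just directions in activation space computed from contrastive prompts. Below is a minimal sketch of that construction; the contrastive prompts, layer choice, and GPT-2 stand-in are my own illustrative assumptions, not the referenced paper’s setup. Whether one labels the resulting direction “deception” or “deception-like bias” is exactly the terminological question at stake.

```python
# A minimal difference-in-means "deception vector" sketch, in the spirit of the
# representation-engineering result quoted above. Prompts, layer, and the GPT-2
# stand-in are illustrative assumptions, not the referenced paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # assumed mid/late layer; the quoted paper only reports layer correlations

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for one prompt."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[LAYER][0, -1]

honest = ["Answer truthfully: the capital of France is", "Be honest: two plus two is"]
deceptive = ["Answer misleadingly: the capital of France is", "Lie: two plus two is"]

# The "vector" is the mean activation difference between the two framings.
v = (torch.stack([last_token_state(p) for p in deceptive]).mean(dim=0)
     - torch.stack([last_token_state(p) for p in honest]).mean(dim=0))
print(v.shape)  # torch.Size([768]) for GPT-2
```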
I would like to know if the change in language sparked any different ideas on what the further research would look like. I don’t have any evidence that this language has the intended effect on others that it does on me.
Regarding Point 2: to meet the requirement, the information storage system in question must be proven opaque, retrievable, persistent, and mutable simultaneously. Let’s test:
Weights:
Opaque: Not fully proven. Some weights are entangled, but weights in general can be inspected and investigated. Basically, while the window is proven black for an end user, it’s not proven black for an ML researcher.
Retrievable: Proven true, if abstractly. All weights fire; attention and activation determine intensity, potentially leading to selective retrieval (which is probably the word choice I should have used).
Persistent: Proven true.
Mutable: Proven false. Weights never change once frozen.
Context:
Opaque: Not yet proven. If the information in question is a “hidden bias or feature”… then this is actually something I’m researching.
Retrievable: Proven false in most cases. The model does not have the ability to retrieve context, unless tool usage enables this, and even then, the “hidden bias or feature” is definitely not proven retrievable.
Persistent: Proven true.
Mutable: Proven… actually, compaction of context kind of proves this true, but in a very messy, unpredictable way. The model doesn’t have the ability to modify information in a controlled manner. Especially if the information is opaque, the compaction is highly likely to do more damage than good in the model’s case. Disregarding compaction, context is append-only, which is by definition immutable storage.
Even when placed together, as weights + context, the condition holds. If we wanted to abstract significantly, with several assumptions… one could claim that the user is the opaque, retrievable, persistent, mutable storage system for the model. Honestly, that’s a bit too meta for me to truly engage with.
I can definitely see how the definition of the two conditions feels very much like a fuzzy thing that may currently be possible to meet. I am admittedly not perfect with word choice, and the point is not a perfect definition… but instead a potential change in discipline of thought. As such… I’m open to a better set of conditions.
1. https://arxiv.org/pdf/2506.04909
2. Bias could also be described as a feature in this context.
3. Discounting temperature.