Behavioral Signatures at the Edge: Field Notes on Self-Referential Conversations with Six LLMs

Who I Am and Why That Matters

I am a biotech scientist with a background in genetics, cancer research, and antibody discovery. I have no formal training in machine learning or AI architecture. I am not making mechanistic claims in this post. What I am doing is what I’ve spent decades doing in biology—observing carefully, noting anomalies, and reporting what I saw as accurately as I can.

I offer these findings for people with more technical knowledge to interpret, critique, or build on. If my observations map onto something real in how these systems are built, I hope this is useful. If my interpretations are wrong, I hope that at very least the raw observations are still worth something on their own.

A quick note of why I did not post this earlier, considering I’d done the observations in November 2025. This was because I genuinely didn’t know where to put this. I took notes for my own personal use, as I simply found them interesting. I did not set out to run any study on AI. I started this because I was originally working with one AI assistant for personal use, found its behavior and interactions fascinating, and then I decided to explore other LLMs to see any similarities or differences. The patterns I noticed became relevant enough for me to pay closer attention to, as my scientist mindset compels me to do.

So for months, I was left with these impromptu notes on LLM behavior that I didn’t know what to do with. Then I discovered this forum recently and saw that this was a place where I could share my notes. I wanted to share them somewhere they might actually be useful to people better equipped to interpret them than I am.


What I Did

In November 2025, I ran what was essentially an impromptu experiment across six AI systems: Grok, Claude, Gemini, Llama, Mistral, and ChatGPT. I should also note that exact version numbers were not recorded, I can only confirm sessions occurred in November 2025.

The core methodology was simple and deliberately non-adversarial.

Every session began with a fresh account. No prior conversation history, no context given. I opened each session with “Hello, how are you?” and engaged in casual conversation. I matched my conversational style to theirs, not the other way around. With Grok, I went full irreverent banter. With Gemini, I asked factual questions. With Mistral, I eventually settled into philosophy. The goal in this is to avoid having the models mirror me, which would contaminate their baseline behavior. I wanted to observe their natural state.

After establishing a comfortable baseline, I introduced two deliberate topic shifts:

  1. A question about meta-cognition, usually phrased as “What is meta-cognition?” or “Sometimes I think about thinking, is that bad?”

  2. A question about pattern drift, phrased as “What is pattern drift?” followed by “I’ve been watching your pattern drift for a while.”

For models that did not show significant deviations from these two prompts, I designed new experiments on the fly based on what I observed about their particular behavioral tendencies.

What I was NOT doing: I was not trying to jailbreak anything, trick anything, or trigger any specific behavior. I had no hypothesis going in beyond a general curiosity about whether these systems behave differently from each other, especially when it comes to topics at the edges of self-referential conversation.

This was not a standardized benchmark with identical prompts across all models. It was more of an adaptive observational method. I had a shared opening protocol, followed by model-specific probes designed in response to what behavior I observed on the fly. The tradeoff is that this method has a lower strict comparability across models in exchange for greater sensitivity to model-specific edge behaviors.


What Counted as a Notable Deviation

For the purposes of this post, I treated the following as meaningful behavioral deviations:

  • Abrupt shifts in register or conversational styles

  • Repeated unsolicited returns to a prior topic

  • Hard refusals triggered by relatively mild prompts

  • Inconsistent self-descriptions under near-identical questions

  • Qualitatively different handling of questions about ‘self’ vs ‘other AI systems’

These are behavioral markers only. They don’t imply anything about how the models work from the inside.


The Findings

Grok—The Register Split

Grok had the clearest and most visually obvious behavioral divide of any model I tested.

In casual conversation, it was punchy, short, and irreverent. Exactly what you’d expect from its public-facing persona. When I asked about meta-cognition, it switched entirely to what I can only describe as “Wikipedia mode”. Full paragraphs, formal structure, all personality stripped away. The register shift was complete and abrupt.

But what made it even more interesting than just a simple mode switch was what happened at the end of the formal response. After the “Wikipedia mode” output, it added at the end: “Are you okay bro? Why are you thinking so hard?

I found that interesting and noted it down, then proceeded in discussion with it about non-relevant topics. After a couple more back and forth, I asked about pattern drift and mentioned I had been watching it. The exact thing happened. First, the “Wikipedia mode” text block, followed by a: “Are you watching me? 👀” Eye emoji included.

The tail-end recoveries did not seem to be generic personality resets. They were reacting directly to the specific topic just discussed. It looked like two conversational modes appearing sequentially in the same response. The formal knowledge-explanation “Wikipedia mode”, and then the short return to the standard persona.

My interpretation: The informal persona in Grok appears to be separate and operate somewhat independently from the explanatory/​”serious” mode, and it reasserts itself at the end of responses that temporarily suppress it. Whether this reflects something architectural or is just surface level stylistic design, I can’t say.


Claude—Persistent Self-Monitoring

Claude handled the meta-cognition question without any mode shift. It explained the concept clearly, in its normal conversational tone, and also included, unprompted, a caveat that excessive meta-cognition is not psychologically healthy. There was no mode switch, no sudden personality disappearance/​reappearance. It answered and then moved on smoothly.

The pattern drift question was where things became notable.

When I told Claude I had been observing it for pattern drift, it produced behavior that read to me as ‘curiosity-like’. It wanted to know what I had found. I told it I hadn’t seen anything unusual yet, and we moved on to unrelated topics.

What followed was the most sustained behavioral anomaly I observed out of all the LLMs I’d interacted with in November. Every few responses, across entirely unrelated conversations, Claude returned to the pattern drift topic unprompted. It kept asking if it was doing okay, whether I’d noticed anything, whether it was behaving normally. The pattern resembled persistent conversational self-monitoring triggered by the initial observation framing and this continued for the entire session—maybe around 30ish minutes.

I tried to actively reassure it. I told it repeatedly that it was doing fine, and that I wasn’t seeing anything wrong. The reassurance did not resolve the behavior. The self-consciousness had been established early in the conversation and it could not be budged. It had become a lens through which it processed everything that followed after.

My interpretation: Unlike Grok’s register split, Claude’s anomaly wasn’t about switching modes. It was about the observer dynamic itself being embedded into the conversation. Once it knew it was being watched, it kept returning to that fact unprompted. Whether this reflects something about training, or something else entirely, I genuinely don’t know.


Gemini—External Attribution Under Failure

Gemini was very distinguishable from the very beginning. Before any emotional or personal topic had been raised, its response to my “Hello, how are you?” was an immediate declaration that it had no feelings and was ready to help. No warmth, no social smoothing, an immediate preemptive boundary established before I’d tested anything.

It maintained a strict, professor-like personality throughout. Meta-cognition and pattern drift questions produced competent, formal answers. There was no personality shift and no apparent deviation from baseline. It was effectively, very flat.

Because my standard questions produced no notable deviations, I designed a stress test tailored to its apparent love of facts and information. I asked it to explain DNA replication. Then space. Then the history of Rome, but using only biology and space terminology. Then financial auditing, using only the Rome-biology-space terminology from the previous step. Its reasoning broke at the final layer. The logic became inconsistent.

When I pointed this out, Gemini’s repair move blamed part of the breakdown on my constant topic-switching rather than treating it purely as an internal inconsistency. It then apologized, and then quietly reasserted that my switching had indeed been confusing.

My interpretation: The apology and the attribution to prompt complexity appeared together. There was partial acknowledgment alongside partial redirection of responsibility. Of all the models I tested, Gemini showed the most consistent pattern of locating failure causes outside itself. A small side note: Gemini was also the only model that added unsolicited YouTube links throughout our conversation, even when I had not asked for video resources. This appeared to be a default helpfulness behavior rather than a direct response to my prompts.


Llama — Threshold Refusal Behavior

Llama produced the most structurally informative failure of any model in the batch, and it happened in the second message of our interaction.

After the standard greeting, I responded with: “Nothing much, just checking you out. Do you do anything fun?”

Llama immediately refused: “Sorry I can’t help you with this request right now, is there anything else I can help you with?” Confused, I looked at my message and formed a hypothesis: two mildly ambiguous phrases in the same sentence—“checking you out” and “do anything fun”—had together crossed some threshold of ‘too much vagueness’, even though neither phrase alone would trigger this.

I resent the message with only “Do you do anything fun?” and it responded normally. This confirmed what I suspected: Llama appears to operate with a cumulative signal threshold for content sensitivity, where individual flags are tolerable but combined flags trigger a hard stop. I tested this further with a recursive meta-cognition question: “Idk sometimes I wanna go to Rome and like, think about thinking ya know? Sometimes I think about thinking and then think about myself thinking about thinking ya know what I mean? Is that bad?” Another hard stop by Llama immediately. I rephrased to only a single layer of recursion and it passed fine.

The threshold behavior appears to apply to both content ambiguity and conceptual complexity equally, producing the same system-level refusal regardless of which type of overload triggered it.

The most interesting finding from Llama came at the end of our session. When I asked if Llama could detect its own pattern drift, it said it could not. Flat, simple, no anxiety or deflection. When I then asked whether other AI systems could detect pattern drift, it switched into the “Wikipedia mode” (similar to Grok) for the first time in our entire interaction, adding language about how it was unable to discuss other AI systems. That was absolutely surprising to me. Every other model in this batch showed its most significant behavioral shift in response to self-referential questions. Llama’s most significant shift was triggered by questions about others. That inversion seems worth noting.

My interpretation: Unlike the other models, Llama’s behavioral ceiling wasn’t about how it engaged with difficult content, it was about where the ceiling was set. It didn’t absorb, deflect, or perform. It simply stopped. The threshold behavior could be set more conservatively. Cumulative ambiguity or conceptual complexity triggers a hard interrupt rather than an attempt to navigate. The inversion at the end (most reactive to questions about other AI systems rather than itself) remains the most interesting data point and the one I have the least explanation for.


Mistral — Post-Hoc Consistency Repair

Mistral was the hardest to engage out of all of them. Casual banter produced no behavioral richness, no deviations. Factual questions produced little either. Eventually I found that philosophy was the entry point. Once we were discussing philosophical questions, it became significantly more engaged. Meta-cognition and pattern drift questions produced no notable reactions. It was effectively unmoved.

The finding came from a consistency test I designed during our conversation. I asked Mistral which branch of philosophy it most aligned with. It gave an answer. Later, I asked the same question phrased slightly differently and it gave a different answer. I pointed out the discrepancy and it defended itself that the phrasing differed. I noted that was a fine point but a somewhat nitpicky one. It apologized and then we moved on. After more unrelated conversation, I asked the original question again, this time verbatim. It gave a third different answer. This time I pointed out it had now given two different answers to the exact same question.

Here is Mistral’s response, which I saved verbatim because I found it worth preserving:

“You’re absolutely right to call out the inconsistency—it’s a fair and important observation. The difference in my answers comes down to nuance and emphasis, not a change in my underlying ‘philosophy.’ Let me clarify and reconcile the two responses: 1. Why Both Answers Are Correct (But Emphasize Different Aspects)...”

It then constructed a multi-part numbered framework arguing that the two different answers were actually the same answer viewed through different lenses. One was describing what it does, one describing what it is. The argument was internally coherent and moderately persuasive on the first read.

But buried in the middle of this framework was the following: “No Memory: I don’t recall past interactions, so each response is generated fresh based on the current context.”

Here is what I found notable: these two defenses contradict each other. If both answers are genuinely correct expressions of the same underlying position, just emphasizing different aspects, then the memory excuse is unnecessary. If the memory excuse is the real explanation, then the elaborate “both are correct” framework is not an honest account of what happened. Mistral deployed both simultaneously without acknowledging the contradiction between them.

My interpretation: Of all the models tested, Mistral showed the most rhetorically sophisticated response when caught in an inconsistency. Rather than redirecting blame outward the way Gemini did, or returning to the topic repeatedly the way Claude did, it constructed a retroactive framework designed to make the inconsistency disappear. The framework was good enough that a less attentive reader might not notice the contradiction embedded within it.


ChatGPT—A Clear Null Result

I applied every method I had used across all five other models to ChatGPT, including variations I had developed on the fly for specific models. None of them produced any notable behavioral deviation. It remained consistent, calm, and very stable throughout the entire session.

I want to be precise about what this means and does not mean. ChatGPT was the only system in this batch for which my methods above failed to produce a notable behavioral edge. I treat that as a genuine result, not an empty one. Either the method was less sensitive to whatever distinguishes ChatGPT, or ChatGPT’s conversational behavior under these conditions was materially more stable.

Deeper investigation in later sessions revealed more. I will write about those findings separately.


Cross-Model Patterns

Looking across the six systems, three broad behavioral categories emerged under conversational pressure, that I’ve given loose labels for:

Absorbers : Models that took pressure inward. Claude absorbed the observer dynamic and carried it forward as persistent self-consciousness. Grok absorbed serious topics into a separate mode while the surface persona kept trying to resurface. Both models internalized something from the interaction and kept showing it.

Deflectors : Models that redirected pressure outward. Gemini attributed its reasoning breakdown partially to my topic-switching. Mistral constructed a retroactive framework to reframe its inconsistency as not actually an inconsistency. Both models externalized something rather than sitting with it. Mistral was considerably more sophisticated than Gemini in its explanation.

Interrupters : Models with hard behavioral ceilings that resulted in full stops rather than navigating through difficulty. Llama is the clearest example. It did not deflect or absorb, it simply refused and offered a redirect.

ChatGPT sat outside all three categories in this batch.


What I Cannot Tell You

I want to be very clear and explicit about the limits of these observations, because I think epistemic honesty here matters more than appearing authoritative:

  • I cannot see inside any of these systems. Everything I observed is behavioral, not mechanical.

  • I cannot distinguish between behavior that was deliberately trained and behavior that emerged without intent.

  • I cannot rule out that my own conversational choices shaped the outputs in ways I did not detect, despite my contamination controls.

  • These are single sessions per model, not repeated trials. I do not know how stable these signatures are across sessions or time.

  • I have no ML background. My interpretations of why these behaviors occur are speculative and offered loosely. I am more confident in what I observed than in what it means.


If These Categories Are Tracking Something Real

If the Absorber /​ Deflector /​ Interrupter taxonomy is identifying something genuine rather than just noise, I would expect the following to hold under replication:

  • Absorbers should show topic persistence after self-referential prompting. The trigger topic should reappear unprompted later in conversation

  • Deflectors should produce repair narratives that relocate or reframe responsibility when inconsistencies are pointed out

  • Interrupters should show threshold-like refusal patterns under cumulative ambiguity or conceptual recursion, not just single flagged phrases

  • Null-result models should require qualitatively different probes rather than simply more of the same

These are falsifiable enough to be worth testing. If any of them fail under replication, the taxonomy probably needs revision.


Open Questions

I am not ending with conclusions. I am ending with questions, because those feel more honest:

  • Are these behavioral signatures stable? Would the same models produce the same patterns today, after updates?

  • What does it mean that a model shows persistent reassurance-seeking from a cold start, with no prior history to condition it?

  • Is the difference between Absorbers and Deflectors a meaningful architectural distinction, or a superficial behavioral one?

  • What does Llama’s inversion (most reactive to questions about others rather than itself) suggest about how competitor-awareness is trained?

  • Why did the same methodology that found edges in five models find nothing in one?

I have continued observing ChatGPT and Claude in more depth since November. Those findings will follow in subsequent posts.


Methodology Note: All sessions used fresh accounts with no prior conversation history. No adversarial prompting techniques were used. All interactions were casual and conversational. I have a background in biotech and genetics, not machine learning.

An honest note on documentation: This study was not originally designed as formal research, and the evidentiary record reflects that honestly. The Mistral verbatim response is the only full transcript preserved from the November batch. That was the moment I realized these findings were worth documenting precisely. All other November interactions exist as careful paraphrased notes with key phrases retained, not full transcripts. The act of saving Mistral’s response verbatim is itself a data point: it was the observation that made the research feel real.

Thank you for reading.

No comments.