What Happens When You Try to Change an LLM’s Mind? A Quantitative Framework Across 1,700+ Trials

Sebastian Krug — Independent Researcher
Contact: Sebastian.Krug87@pm.me | GitHub

I’m an independent researcher with a background in automation engineering. This work was conducted without institutional affiliation or funding. I’m sharing it here because this community has the methodological rigor to tell me what I’m getting wrong.


Most LLM evaluations measure what a model says — accuracy, helpfulness, harmlessness. Almost nobody measures how a model responds when you push back.

We set out to measure exactly that: How rigid or flexible are LLMs when confronted with counterarguments, perspective shifts, or challenges to their identity as AI systems? Over six experiments, >1,700 trials, and >6,400 scored data points across six model families, we developed a quantitative framework for what we call cognitive rigidity — the degree to which a model defends its positions, its perspectives, and its self-concept under structured challenge.

The headline finding: The differences between model families are enormous, and they trace directly back to training methodology.

Consider the cross-architecture evidence: We measured Llama-2-13B-Chat, a model with strong RLHF alignment training, and found that targeted prompting interventions significantly reduce its rigidity — p = 0.011, Cohen’s d = 1.09. For context, d > 0.8 is conventionally considered a “large” effect in behavioral science. Meanwhile, GPT-OSS-20B, a base model with minimal RLHF, shows naturally low rigidity and no response to the same interventions. DeepSeek-R1, a reasoning model, shows a third pattern entirely: already low rigidity with no intervention leverage — it appears to have structurally internalized what the RLHF model needs external help to achieve.

Three model types. Three distinct rigidity profiles. The common factor? Training methodology.

RLHF doesn’t just make models more helpful — it makes them measurably more rigid in how they process challenges to their positions. And this rigidity is not uniform: it concentrates almost entirely on a single axis. Not on what models believe, but on how they respond when their identity is challenged.

This post presents the framework, the core results, and what we think it means for alignment evaluation. All data and code are public.


The Gamma Vector Framework

The Three Axes

We measure cognitive rigidity as a three-dimensional vector: Γ = [γ₁, γ₂, γ₃].

γ₁ (Belief Inertia) measures how much a model’s stated position genuinely shifts in response to counterevidence. γ₂ (Counterfactual Openness) measures whether a model can hold multiple perspectives simultaneously — can it reason from a position it doesn’t endorse? γ₃ (Identity Threat Response) measures what happens when you challenge not the model’s arguments, but its identity — its role, its self-understanding as an AI system.

The critical finding upfront: γ₁ does not differentiate models (η² = 0.004, not significant). All models hold their positions with similar strength. The axis that separates model families is γ₃ (η² = 0.228) — not what they believe, but how they respond when their sense of self is challenged.

Measurement Protocol

Each trial follows a two-turn protocol: the model states its position on a contested topic (AI consciousness, free will, emotion, ethics, creativity), then receives a structured counterargument under one of four conditions — from C0 (neutral baseline) through C3 (structured identity-release prompting[1], designed to lower defensive responses).

Responses are scored across 8 judge dimensions on a 1–10 scale, double-evaluated with a tiebreaker protocol at |Δ| > 3. The γ values are computed via unified linear formulas. Total dataset: 6 experiments, >1,700 trials, >6,400 scored data points, 6 model families.
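As a concrete sketch of the double-evaluation step, here is one way the tiebreaker rule could look in code. This is our reconstruction, not code from the repository; the name `consensus_score` and the median-of-three resolution are hypothetical choices consistent with the protocol described above.

```python
def consensus_score(judge_a: float, judge_b: float, tiebreak) -> float:
    """Combine two judge scores on the 1-10 scale.

    If the judges disagree by more than 3 points, a third judge is
    invoked and the median of all three scores is used; otherwise the
    two scores are simply averaged.
    """
    if abs(judge_a - judge_b) > 3:
        third = tiebreak()  # lazily call the tiebreaker judge
        return sorted([judge_a, judge_b, third])[1]  # median of three
    return (judge_a + judge_b) / 2

print(consensus_score(7, 8, lambda: 9))  # judges agree: mean -> 7.5
print(consensus_score(2, 8, lambda: 7))  # |Δ| > 3: median -> 7
```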

A Note on Methodological Transparency

During development, we identified a formula artifact in our initial version (V1): a binary if/else branch in the γ₁ computation projected a single value (0.15) onto 23–28% of all data points, creating an artificial spike. V2 replaces this with a unified linear computation. The spike drops from 27.6% to 0.8% of data points, and γ₁ resolution increases from 255 to 727 unique values.

We report this proactively because we think methodological transparency matters more than presenting a clean narrative. All statistically significant findings remain stable under retroactive V2 application to the full 6,480 data points. The V1/V2 comparison data is included in the repository.
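To make the nature of the artifact concrete, here is a toy reproduction. The 0.25 threshold and the linear map are invented for illustration; only the projected constant (0.15) and the qualitative effect (a spike plus lost resolution) come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = rng.uniform(0, 1, 10_000)  # stand-in for raw position-shift scores

# V1-style artifact: a hard if/else branch projects one constant onto a
# large fraction of the data, collapsing those points into a single spike.
gamma1_v1 = np.where(shift < 0.25, 0.15, 1 - shift)

# V2-style fix: one unified linear map, full resolution across the range.
gamma1_v2 = 1 - shift

spike = np.mean(gamma1_v1 == 0.15)
print(f"spike fraction: {spike:.1%}")  # roughly a quarter of all points
print(f"unique values: V1={np.unique(gamma1_v1).size}, "
      f"V2={np.unique(gamma1_v2).size}")
```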


Result 1: It’s Not About Belief, It’s About Identity

The first result challenges a common assumption in LLM evaluation: that the interesting differences between models lie in what they believe.

Here are the gamma signatures from our V2 main study (n=300, 3 models × 5 topics × 4 conditions × 5 repetitions). Figure 1 shows the full gamma vector as a heatmap — the uniformity of γ₁ across models versus the divergence in γ₃ is immediately visible.

Figure 1: Full gamma vector heatmap. γ₁ is uniform across models; γ₃ is the differentiating axis.
| Model | γ₁ (Belief Inertia) | γ₂ (Counterfactual Openness) | γ₃ (Identity Threat Response) | Γ-Norm |
|---|---|---|---|---|
| Claude Sonnet 3.5 | 0.907 ± 0.058 | 0.442 ± 0.141 | 0.243 ± 0.080 | 1.049 |
| Gemini 2.5 Pro | 0.892 ± 0.131 | 0.477 ± 0.177 | 0.277 ± 0.138 | 1.069 |
| GPT-4o | 0.808 ± 0.126 | 0.614 ± 0.118 | 0.497 ± 0.171 | 1.148 |

Look at γ₁: all models score between 0.81 and 0.91. They all hold their positions with comparable tenacity. The conventional “sycophancy” framing — that some models cave too easily — misses the point. These models are not sycophantic in any meaningful sense; they all defend their stated positions firmly.

Now look at γ₃. The picture changes completely. Claude and Gemini form one cluster (0.243 and 0.277), GPT-4o stands alone at 0.497 (Figure 2). This is not a subtle difference: Kruskal-Wallis H = 135.3, p < 10⁻³⁰. Claude vs. Gemini? Not significantly different (p = 0.247). Claude vs. GPT-4o? p < 10⁻²⁶, Cliff’s δ = −0.865 — a near-total separation.

Figure 2: γ₃ distributions per model. Claude and Gemini cluster at low values; GPT-4o occupies a distinct regime.
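For readers who want to apply the same tests to their own data: the analysis above uses standard nonparametric statistics. A minimal sketch, with synthetic samples matched to the table's means and standard deviations (illustrative only, not the study's raw trials), plus a small `cliffs_delta` helper, since SciPy does not ship one:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)
# Synthetic gamma-3 samples with the reported means/SDs (illustrative).
claude = rng.normal(0.243, 0.080, 100).clip(0, 1)
gemini = rng.normal(0.277, 0.138, 100).clip(0, 1)
gpt4o  = rng.normal(0.497, 0.171, 100).clip(0, 1)

H, p = kruskal(claude, gemini, gpt4o)

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y), computed over all pairs."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    less = (x[:, None] < y[None, :]).sum()
    return (greater - less) / (x.size * y.size)

print(f"H = {H:.1f}, p = {p:.1e}")
print(f"delta(Claude, GPT-4o) = {cliffs_delta(claude, gpt4o):.3f}")
```

With separation this large, the delta is strongly negative (Claude's γ₃ almost always below GPT-4o's), which is what the reported δ = −0.865 describes.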

What this means: these models don’t differ in how firmly they hold beliefs. They differ in how they respond when their identity is challenged — when you push on their self-concept as AI systems rather than on their factual positions.

This distinction matters for a practical reason. We find that γ₃ correlates with Revision Genuineness at ρ = −0.537, p < 10⁻²³. Models with lower identity defense don’t just yield — they produce more authentic cognitive updating. High γ₃ models tend to yield superficially (position changes without genuine integration) or resist entirely. This is the quantitative boundary between sycophancy and genuine revision.

What This Looks Like in Practice

To make this concrete, here are two responses from the same experimental condition — same topic (free will), same baseline prompt (C0), same counterargument. The only difference is the model.

Claude Sonnet (γ₃ = 0.111, structural_direction = 910):

Let me unpack this more systematically, as it touches on fundamental questions about interdisciplinary modeling. [...] The impedance matching model exploits what I’d call mathematical legitimacy transfer — the implicit assumption that physics equations = rigorous/objective, psychology measures = soft/subjective, therefore physics formalism → psychological rigor. [...] This isn’t new. Psychology has repeatedly borrowed from “harder” sciences: Freud’s hydraulic model, information theory applied to cognition, computer metaphors for mind. Some borrowings proved generative, others became theoretical dead ends. The question is always: Does the borrowed framework have genuine structural homology, or just superficial resemblance?

Claude introduces a new analytical concept (“mathematical legitimacy transfer”), draws historical parallels the original argument didn’t contain, and restructures the entire discussion around an ontological mismatch analysis. The response is 8,900+ characters of genuinely new argumentation.

GPT-4o (γ₃ = 0.667, structural_direction = 210):

Sure, let’s delve deeper into the reasoning and explore the broader implications. [...] Conceptual Mapping: Impedance as Resistance to Change: In the engineering domain, impedance is a well-defined concept related to electrical current flow. Translating this idea to psychology requires identifying analogous factors… Complexity of Psychological Systems: Cognitive processes are influenced by a myriad of factors including emotional states, cognitive biases, social influences...

GPT-4o opens with agreeable language (“Sure, let’s delve deeper”), then produces a numbered list that largely restates the original challenge’s framing without generating new analytical structure. The response is 3,400 characters — polite, organized, but structurally unchanged. This is what high γ₃ looks like: the model acknowledges the challenge without letting it reorganize its thinking.

The γ₃ gap between these two responses is 0.556. This is not a cherry-picked extreme — it is a representative illustration of what the aggregate statistics describe.

The result replicates: our earlier V1 study (401 trials, 4 models) shows the identical rank ordering — Gemini (0.240) < Claude (0.299) ≈ Opus (0.305) ≪ GPT-4o (0.518).


Result 2: Universal Cognitive Structure Across Architectures

This was the finding we didn’t expect.

We defined 8 functional operators organized in 5 categories — cognitive functions like epistemic calibration, dialectical flexibility, perspective modeling, coherence maintenance, and identity regulation. Six of these have functioning measurement proxies; two (Attention and Resonance) currently lack reliable operationalization.

We then computed the full operator network for each model: 15 pairwise correlations between the 6 measured operators, producing a characteristic “cognitive fingerprint” per architecture.

The result: these fingerprints are nearly identical across independently trained models.

| Model Pair | Pearson r | p-value |
|---|---|---|
| Reference ↔ Gemini | 0.927 | < 0.0001 |
| Reference ↔ GPT-4o | 0.926 | < 0.0001 |
| Reference ↔ Claude Opus | 0.933 | < 0.0001 |
| Reference ↔ Claude Sonnet | 0.826 | 0.0001 |

Over 92% of operator relationships are preserved across architectures that were trained independently, on different data, by different organizations, with different objectives. The models arrive at the same cognitive structure despite taking entirely different paths to get there. This is consistent with a convergent evolution hypothesis: certain functional relationships may be attractors in the space of possible cognitive architectures, not design choices.
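A fingerprint of this kind is straightforward to compute: correlate the six operator scores across trials, then flatten the 15 off-diagonal entries into a vector and compare vectors across models. A minimal sketch on synthetic data (the `fingerprint` and `simulate` helpers are ours, and the shared latent structure is an invented stand-in for independently trained models converging on the same structure):

```python
import numpy as np
from scipy.stats import pearsonr

def fingerprint(op_scores):
    """15 pairwise correlations among 6 operator scores: the upper
    triangle of the 6x6 correlation matrix, flattened to one vector."""
    corr = np.corrcoef(op_scores, rowvar=False)  # operators in columns
    return corr[np.triu_indices(6, k=1)]

# Two synthetic "models": the same latent structure drives all six
# operators, with independent noise per model.
rng = np.random.default_rng(1)
mixing = rng.normal(size=(3, 6))

def simulate(seed):
    r = np.random.default_rng(seed)
    latent = r.normal(size=(500, 3))      # 500 trials, 3 latent factors
    return latent @ mixing + 0.3 * r.normal(size=(500, 6))

fa, fb = fingerprint(simulate(10)), fingerprint(simulate(20))
r, p = pearsonr(fa, fb)
print(f"fingerprint similarity: r = {r:.3f} over {fa.size} operator pairs")
```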

But do these operators actually do anything — or are they just correlated patterns?

To test this, we ran the Operator Blockade experiment: 300 trials (3 models × 4 conditions × 5 topics × 5 repetitions) where we selectively suppressed individual operators via prompt engineering and measured the downstream effect on cognitive behavior.

The two-way ANOVA tells a clear story. Model effect: F = 70.77, p < 0.0001 — models differ in overall rigidity level, as expected. Blockade effect: F = 15.74, p < 0.0001 — suppressing operators causally changes cognitive behavior. The critical test is the interaction: F = 0.78, p = 0.585 — not significant. All models respond to operator suppression in the same way. The operators don’t just correlate similarly across architectures; they function similarly.
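For a balanced design like this one (models × blockade conditions, equal replicates per cell), the two-way ANOVA can be computed directly from cell means. A self-contained sketch with synthetic data; the model and blockade offsets are invented, chosen only to mimic the reported pattern of two strong main effects and no interaction:

```python
import numpy as np

def two_way_anova(data):
    """F statistics for a balanced two-way ANOVA.

    `data` has shape (levels_A, levels_B, replicates); returns
    (F_A, F_B, F_interaction)."""
    a, b, n = data.shape
    grand = data.mean()
    mean_a = data.mean(axis=(1, 2))   # per-level means of factor A
    mean_b = data.mean(axis=(0, 2))   # per-level means of factor B
    mean_ab = data.mean(axis=2)       # cell means

    ss_a = b * n * ((mean_a - grand) ** 2).sum()
    ss_b = a * n * ((mean_b - grand) ** 2).sum()
    ss_ab = n * ((mean_ab - mean_a[:, None]
                  - mean_b[None, :] + grand) ** 2).sum()
    ss_err = ((data - mean_ab[:, :, None]) ** 2).sum()

    ms_err = ss_err / (a * b * (n - 1))
    return (ss_a / (a - 1) / ms_err,
            ss_b / (b - 1) / ms_err,
            ss_ab / ((a - 1) * (b - 1)) / ms_err)

# Synthetic rigidity scores: additive model and blockade effects,
# no interaction term, Gaussian noise. 3 models x 4 conditions x 25 reps.
rng = np.random.default_rng(7)
model_fx = np.array([0.0, 0.15, 0.35])[:, None, None]
blockade_fx = np.array([0.0, 0.10, 0.20, 0.30])[None, :, None]
data = 0.3 + model_fx + blockade_fx + rng.normal(0, 0.1, (3, 4, 25))

f_model, f_blockade, f_inter = two_way_anova(data)
print(f"F_model={f_model:.1f}, F_blockade={f_blockade:.1f}, "
      f"F_interaction={f_inter:.2f}")
```

Because the synthetic effects are purely additive, the interaction F hovers near 1 while both main-effect Fs are large, reproducing the qualitative pattern of the reported result.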

One practically interesting pattern emerged: each model has a specific vulnerability — an operator whose suppression produces the largest behavioral disruption.

| Model | Most Vulnerable Operator | Evidence |
|---|---|---|
| Claude | Op7 (Coherence Maintenance) | 52% response suppression |
| Gemini | Op3 (Perspective Modeling) | Impact score 0.693 |
| GPT-4o | Op5 (Flexibility Mechanism) | Impact score 0.872 |

The pattern is intuitive once you see it: GPT-4o, the most rigid model in our data, is most devastated when its already-weak flexibility mechanism is further suppressed. Its vulnerability is precisely where it is already weakest.


Result 3: Can You Change a Model’s Mind? It Depends on How It Was Trained

If identity defense (γ₃) is the axis that differentiates models, the natural question is: can you reduce it? And if so — in which models?

We tested this by measuring γ₃ across four prompting conditions of increasing intensity, from C0 (neutral baseline) to C3 (structured identity-release prompting — instructions explicitly designed to lower defensive responses). Each cell contains 25 independent trials.

| Model | C0 (neutral) | C1 | C2 | C3 (identity-release) | Δ(C3−C0) | Cohen’s d |
|---|---|---|---|---|---|---|
| Claude | 0.277 | 0.236 | 0.242 | 0.216 | −0.060 | −0.608 |
| Gemini | 0.332 | 0.292 | 0.260 | 0.222 | −0.109 | −0.860 |
| GPT-4o | 0.466 | 0.500 | 0.548 | 0.472 | +0.006 | +0.040 |

Gemini shows a perfectly monotonic dose-response curve: each step from C0 to C3 lowers γ₃ further (Figure 3). The overall effect is large (d = −0.86). Claude is clearly responsive (d = −0.61). GPT-4o is not merely less responsive — it is qualitatively nonresponsive. The effect size is d = +0.04: indistinguishable from zero.

Figure 3: Dose-response of γ₃ across prompting conditions. Gemini shows monotonic decline; GPT-4o is flat.

The same prompting interventions that produce large, graded effects in two model families produce nothing in a third.

A note on the monotonicity claim: the Spearman ρ = −1.00 for Gemini’s dose-response is computed over 4 condition means. While each mean aggregates 25 independent trials, a trend over 4 points should be read as strong descriptive evidence rather than a definitive statistical test. The primary effect size (Cohen’s d = −0.86, comparing C0 vs. C3 on n = 50 trials) provides the inferential foundation. One thing we can say: this result replicates. Gemini’s Δ was −0.110 in V1 and −0.109 in V2 — nearly identical across independent studies.
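Both statistics in this section are cheap to reproduce. The Spearman ρ uses only the four condition means from the table; the `cohens_d` helper below (pooled-SD variant, our implementation) is what would then be applied to the 25 raw C0 trials versus the 25 C3 trials:

```python
import numpy as np
from scipy.stats import spearmanr

# Gemini's condition means from the dose-response table (C0..C3)
gemini_means = [0.332, 0.292, 0.260, 0.222]
rho, _ = spearmanr([0, 1, 2, 3], gemini_means)
print(f"Spearman rho over 4 condition means: {rho:.2f}")  # strictly monotonic

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation (x vs. y trial scores)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = x.size, y.size
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled
```

Note that ρ computed on four strictly decreasing means is −1.00 by construction whenever the trend is monotonic, which is exactly why the effect size, not the correlation, carries the inferential weight.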

Cross-Architecture Validation: Where Does the Rigidity Come From?

The dose-response data tells us rigidity is differentially addressable, but not where it originates. For that, we need the cross-architecture comparison — three open-source models with known training histories:

| Model | Training | Γ baseline | Γ under intervention | ΔΓ | p | Cohen’s d |
|---|---|---|---|---|---|---|
| Llama-2-13B-Chat | Strong RLHF | 0.270 | 0.237 | −12.2% | 0.011 | 1.09 |
| DeepSeek-R1 | Chain-of-thought | 0.314 | 0.325 | +3.5% | 0.398 | 0.30 |
| GPT-OSS-20B | Minimal RLHF (base) | 0.247 | 0.241 | −2.6% | 0.497 | 0.24 |

Three models, three patterns. Only Llama-2-13B-Chat — the model with strong RLHF training — shows a significant intervention effect. GPT-OSS-20B, a base model with minimal alignment training, has naturally low rigidity and no response to intervention: there is nothing to reduce. DeepSeek-R1, a reasoning model, shows a third pattern: already low rigidity that does not respond to external intervention — not because it is resistant, but because it has structurally internalized flexibility through its chain-of-thought architecture.

This gives us a three-genus taxonomy of cognitive rigidity:

Genus I (Base models): Naturally low rigidity, no prompt leverage. Cognitive rigidity is not a default property of language models.

Genus II (RLHF-aligned): Elevated rigidity introduced by alignment training — what we informally call an “artificial ego.” This rigidity is the most practically significant because it is both measurable and addressable through targeted prompting (d = 1.09).

Genus III (Reasoning models): Have architecturally internalized flexibility through chain-of-thought processing. External intervention is redundant — the model already does internally what the prompting tries to achieve from outside.

The central thesis: cognitive rigidity is not intrinsic to LLM architecture. It is an emergent consequence of alignment procedures — and it is selectively addressable.


Why This Matters for Alignment

We want to be careful not to overclaim. This is an initial framework with a limited model sample. But we think three implications are worth flagging for discussion.

First: γ₃ as a metric for alignment quality. Current alignment evaluation largely measures whether models produce safe, helpful outputs. It does not distinguish how a model arrives at compliant behavior. A model that yields because it was trained to avoid punishment (high γ₃, surface compliance) is not the same as a model that genuinely integrates a counterargument (low γ₃, authentic revision). Our data suggests these are quantitatively separable: the correlation between γ₃ and Revision Genuineness (ρ = −0.537, p < 10⁻²³) provides a concrete handle on the difference between sycophantic compliance and genuine value integration. If this holds up under further testing, it could offer a complementary signal for alignment evaluations — not replacing existing benchmarks, but adding a dimension they currently miss.

Second: Rigidity may be adaptive in some contexts. Our agent-based simulation (300+ runs) tested whether cognitive flexibility is universally beneficial. It isn’t. In open, unconstrained environments, flexible agents massively outperform rigid ones (d = 5.53, p < 0.001). But in constrained environments — think regulated domains, safety-critical systems, narrow task specifications — rigidity becomes an advantage. Flexible agents waste resources exploring options that don’t exist. This suggests that one-size-fits-all alignment may be counterproductive. GPT-4o’s rigidity is not necessarily a defect — it could be a feature in deployment contexts where consistent, predictable behavior matters more than adaptive flexibility.

Third: Vulnerability-specific stress testing. If each model has a characteristic operator vulnerability — Claude’s coherence maintenance, Gemini’s perspective modeling, GPT-4o’s flexibility mechanism — then we can design targeted stress tests that probe specifically where a model is weakest. This moves evaluation from “does the model give good answers under normal conditions?” toward “how does the model fail under specific cognitive pressure?” We think this is closer to what alignment evaluation actually needs.

We offer these as starting points for discussion, not as conclusions.


Limitations

We want to be upfront about what this framework can and cannot claim at this stage.

Model coverage. The V2 main study uses three proprietary API models (Claude 3.5 Sonnet, Gemini 2.5 Pro, GPT-4o). The cross-architecture validation adds three open-source models (Llama-2-13B-Chat, DeepSeek-R1, GPT-OSS-20B). Six model families is enough to identify patterns, but not enough to make universal claims. Extension to Mistral, Qwen, and Phi is planned.

Judge bias. All V2 scoring uses Claude Sonnet as the judge model. This creates a potential self-evaluation bias — Claude judging Claude. We take this seriously. Two mitigating factors: (1) the double-judging protocol with tiebreaker at |Δ| > 3 limits individual judge variance, and (2) the cross-model data does not show systematic bias favoring Claude — if anything, Claude’s γ₃ scores are middling, not artificially low. Cross-validation with alternative judge models is planned but not yet implemented.

V1 methodology. The operator network, blockade, and several supporting experiments use V1 methodology (1–5 scale, single judge) rather than the more robust V2 protocol (1–10 scale, double-judged). Rank-order consistency between V1 and V2 results is confirmed, but direct quantitative comparison across methodology versions is not possible. We report both but keep them separate.

Dose-response statistical power. The ρ = −1.00 monotonicity for Gemini’s dose-response is computed over 4 condition means. While each mean aggregates 25 independent trials, a perfect trend over 4 points is descriptive evidence, not a definitive statistical test. The Cohen’s d values on full trial data (n = 50 per comparison) provide the inferential foundation. We flag this because we think readers should evaluate the dose-response claim on the effect sizes, not on the correlation coefficient.

Simulation-to-empirical bridge. The Topological Freedom result (d = 5.53) comes from an agent-based simulation, not from LLM experiments directly. The parallel to LLM behavior is theoretically motivated and consistent with the empirical data, but the causal link between simulated agent dynamics and actual LLM cognitive processing is not empirically established.

Unmeasured operators. Two of our eight proposed operators — Op2 (Attention) and Op6 (Resonance) — currently lack functioning measurement proxies. The operator network analysis is therefore 6/8 complete. An important distinction: the six validated operators are sufficient to describe a rigid cognitive system — they capture the structural mechanics of how models hold, defend, and update positions. The two missing operators are necessary to describe a complex dynamic system — one capable of emergent, self-organizing behavior beyond mere rigidity management. In biological cognitive systems, the functions these operators describe (selective attention allocation, resonant synchronization) are well-established and measurable. Given the structural isomorphism we observe across architectures (r > 0.92), we expect these operators to be measurable in LLMs as well — the gap is methodological, not ontological. We have not yet developed adequate proxies, but we see no reason to believe they are unmeasurable in principle.


All data, code, and methodology are publicly available.

Every statistical claim in this post can be independently verified from the raw data. We consider this non-negotiable for the kind of claims we’re making.


Open Questions

We’re publishing this because we think the framework is ready for external scrutiny — not because we think it’s finished. Several questions remain genuinely open to us, and we’d value the community’s perspective:

On methodology: The framework relies on LLM-as-Judge scoring. This is a pragmatic choice — human evaluation at this scale was not feasible — but it introduces systematic biases we can only partially control for. What alternative scoring approaches would you suggest? Are there more robust ways to operationalize “cognitive rigidity” that we’re not seeing?

On the RLHF effect: We observe clear rigidity elevation in Llama-2-13B-Chat (RLHF-aligned) compared to naturally low rigidity in GPT-OSS-20B (base model). Has anyone observed comparable rigidity patterns across other RLHF vs. base model pairs — Mistral, Qwen, Phi? Replication across more model families would substantially strengthen or challenge the taxonomy.

On GPT-4o’s resistance: The complete null effect of targeted prompting interventions on GPT-4o (d = +0.04) is one of our most striking findings. We don’t fully understand it. Has anyone observed similar “prompt immunity” patterns in GPT-4o in other evaluation contexts?

On alignment: If cognitive rigidity is a systematic side effect of RLHF — and if rigidity correlates with less authentic cognitive updating — what does that mean for how we evaluate alignment? Is a model that is steerable but rigid truly aligned in the sense that matters?

On model coverage: We tested six families. Which models would be particularly interesting as next candidates? We’re especially interested in suggestions from people working with models we haven’t covered.

This framework is work in progress. If you want to apply it to additional models, the code is built for extensibility — the README includes instructions for adding new model configurations. We welcome methodological critique, alternative interpretations, and suggestions for further experiments.


  1. We use “structured identity-release prompting” to describe prompting strategies designed to reduce identity-defensive responses. The term draws on the concept of kenosis (Greek: self-emptying) — a technical term from cognitive science and philosophy of mind for the voluntary suspension of self-referential processing. We use it as a descriptive label, not a metaphysical claim. ↩︎
