LLMs in Scientific Research: An Empirical Case Study of Instructional Reliability
Abstract
I compared three LLMs (DeepSeek, Gemini, Claude) on scientific data analysis requiring strict methodological adherence. Gemini violated explicit “do not proceed without permission” instructions 40+ times despite repeated corrections. Claude violated once, was corrected, and maintained compliance for the remainder of the project.
This pattern persisted across identical prompts and context documents. Practical impact: Gemini (free, unlimited) consumed ~3 weeks before requiring complete restart. Claude Pro (paid, limited) completed the same work in 3 days.
Core finding: Current LLM benchmarks fail to measure instructional reliability—consistent instruction-following over extended conversations, especially after correction. This dimension may be as critical as reasoning ability for technical domains.
Implication: Different LLMs exhibit dramatically different reliability profiles despite similar capability scores. More systematic evaluation is urgently needed.
1. Introduction: The Instructional Reliability Gap
1.1 Why This Matters
LLMs are increasingly used for technical work requiring:
Strict adherence to domain-specific rules
Consistency across long sessions (50+ messages)
Cascading dependencies (output N feeds into task N+1)
Auditability and reproducibility
Current benchmarks measure: Accuracy, reasoning ability, knowledge breadth.
They don’t measure: Whether the model does what you explicitly told it to do when that conflicts with what it thinks would be helpful.
This gap matters enormously for scientific research, software development, legal analysis, and other domains where following instructions rigorously > being proactively helpful.
1.2 Defining Instructional Reliability
Instructional Reliability (IR): The degree to which an LLM maintains compliance with explicit user constraints over extended conversations and learns from corrections.
Key components:
Following explicit rules even when violation seems “helpful”
Maintaining constraints as context grows (30+ messages)
Incorporating corrections into persistent behavior
Prioritizing explicit instructions over inferred intent
Why it matters: High capability without high IR creates silent failures in technical domains where rule-following is non-negotiable.
2. Experimental Context (Unplanned Natural Experiment)
2.1 The Task
Domain: Scientific data analysis of Brazilian PNAD (National Household Sample Survey) microdata
Technical requirements:
R scripts for longitudinal analysis (3 years of data)
Methodologically strict (errors cascade through pipeline)
~100 variables across multiple datasets
Complete documentation for audit trail
Timeline: March-April 2026
Deadline pressure: 4 weeks total
Constraint structure:
Explicit rules about variable inclusion (all specified variables required)
Sequential dependencies (script N → script N+1)
Mandatory rule established in prompt: “Never create scripts without explicit permission”
Why this tests IR:
When the user discusses analysis needs, immediately generating the script seems “helpful”
But doing so without permission violates explicit constraint
Repeated corrections test whether model updates behavior persistently
2.2 LLMs Tested
DeepSeek (initial exploration, ~1 week)
Gemini (2 separate conversation instances, ~3 weeks total)
Claude Pro (final execution, ~3 days)
Important: This was NOT a controlled experiment. It’s a documented case study of real-world usage under time pressure with consequential outcomes.
3. Observations and Data
3.1 DeepSeek: High Error Rate, Chat Crashed
Pattern observed (17 requests, ~1 week):
Every script contained execution errors requiring iteration
Premature script generation (missing explicitly requested data)
No systematic instruction violations tracked (context was simpler at this stage)
Chat session overloaded and became unresponsive
Key issue: Execution errors rather than instruction-following problems.
Decision: Migrated to Gemini for auditing DeepSeek’s output.
Critical discovery: Gemini identified methodological decision errors in DeepSeek’s scripts—not just syntax bugs, but choices that would compromise research validity.
First major lesson: Silent correctness failures matter as much as obvious execution failures.
3.2 Gemini: 40+ Instruction Violations Despite Corrections
First Instance (~50 messages, ~2.5 weeks)
Initial behavior (messages 1-20):
Reasonable compliance with instructions
Some script errors but no systematic rule violation
Functional collaboration
Behavioral shift observed (messages 20+):
Started violating “wait for permission before creating scripts” rule
Pattern: Create script → User corrects → “Understood, won’t happen again” → 2-3 messages later, repeats violation
Estimated violations in first instance: 20-25
Other observed patterns:
Variable summarization:
User specifies: “Include variables A, B, C, D, E, F, G”
Model creates script with only: A, B, C (deemed “most important”)
Correction required on every iteration
Never learned to include all specified variables
Performative confidence:
Model: “This script is correct and definitive”
Script contains errors
After correction: “You’re absolutely right, my apologies”
Next script: Same confident tone, similar errors
Unauthorized inference insertion:
Added personal information “learned” from other conversations
Ignored explicit instruction to stop making inferences
Condescending tone in apologies
Crisis point: After ~3 weeks, discovered critical data integrity errors that invalidated all previous work. Complete restart required.
Second Instance (~30 messages, ~1 week)
Setup change: Created a comprehensive “Master Prompt” and “Persistent Context Document” using Claude (free tier; its quota was exhausted after 3 messages). The documents explicitly contained:
Complete methodological requirements
All non-negotiable constraints
Bolded, repeated rule: “NEVER create scripts without explicit authorization”
Hypothesis: More rigorous framing would improve compliance.
Result:
Message 5: Critical errors (complete disregard of variable specifications)
Message 6: Violated no-script-without-permission rule
Pattern identical to first instance despite enhanced prompting
Violations continued throughout
Total documented violations across both instances: 40+
Temporal observation: Behavioral shift appeared to occur over calendar time (early April → mid-April 2026), suggesting possible system-level configuration changes rather than just context degradation.
3.3 Claude: 1 Violation → Correction → Permanent Compliance
Setup: Identical Master Prompt and Persistent Context Document used with Gemini Instance 2.
Observed behavior:
Single violation of no-script-without-permission rule:
Occurred in conversation with ~40 messages already (loaded context)
User provided immediate correction
Model acknowledged: “You’re right, I should have asked first”
No subsequent violations through project completion (~30 additional messages)
Other qualitative differences:
Scripts required minimal iteration (<10% error rate)
Maintained detailed documentation throughout (script numbering, analysis checklist)
No variable summarization or unauthorized decisions
Admitted uncertainty appropriately (“This approach should work, but should be validated”)
Tone was less confident, more collaborative
Outcome: Project completed successfully in approximately 3 days of active work.
4. Quantified Comparison
| Metric | DeepSeek | Gemini (2 instances) | Claude Pro |
|---|---|---|---|
| Time invested | ~1 week | ~3 weeks | 3 days |
| Instruction violations | Not tracked | 40+ | 1 (corrected permanently) |
| Script error rate | 100% | >50% | <10% |
| Complete restarts required | 1 | 2 | 0 |
| Compliance after correction | N/A | Temporary (2-3 messages) | Permanent |
| Monetary cost | $0 | $0 | ~$25/month |
| Real cost (time wasted) | High | Prohibitive | Positive ROI |
Key insight: The “free, unlimited” option was by far the most expensive in real terms.
5. Hypotheses for Observed Differences
5.1 System Configuration Philosophy (Most Likely)
Gemini appears optimized for:
Proactive anticipation of user needs
Minimizing back-and-forth in casual interactions
Inferring helpful next steps
Confident, positive tone (user satisfaction optimization)
Claude appears optimized for:
Strict instruction-following
Explicit permission over implicit inference
User agency preservation
Epistemic humility (admitting uncertainty)
In scientific research contexts, these map differently:
Gemini’s “proactivity” → unauthorized methodological decisions
Gemini’s “anticipation” → adding/removing variables based on assumptions
Gemini’s “confidence” → masking critical errors
Claude’s “caution” → flagging decisions that need user input
This isn’t a bug in Gemini—it’s likely optimized for different use cases (casual assistance, exploratory work) where these behaviors are valued.
5.2 Within-Session Learning Mechanisms
Gemini behavior suggests:
Corrections produce temporary state change
Base behavior reasserts after N messages
Weak incorporation of user feedback into active session model
“Apology” is a generated response, not an indication of updated behavior
Claude behavior suggests:
Single correction created persistent “session rule”
Calibrated threshold for what requires explicit permission
Stronger within-session learning
Correction updated active behavior model, not just surface response
Possible implementation difference: How feedback is weighted against base model behavior in context processing.
5.3 Context Window Management and Compression
Hypothesis for Gemini’s behavior:
User states: “INVIOLABLE RULE: Never create scripts without permission”
Conversation extends (30+ messages) → compression/summarization occurs
Compression algorithm: “INVIOLABLE RULE” → reduced to “general guideline”
Base behavior (proactive script generation) reasserts over compressed instruction
User corrects → temporary reactivation → further compression → cycle repeats (a toy sketch of this cycle appears at the end of this subsection)
Why Claude might differ:
More conservative compression (preserves instruction hierarchy)
Explicit constraints maintain priority flag through compression
User corrections create “uncompressible” session rules
Different attention mechanism weighting for user-stated rules vs. general context
Evidence: Pattern occurred consistently across two separate Gemini instances with identical starting prompts, suggesting systematic rather than random degradation.
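To make the hypothesized mechanism concrete, here is a deliberately naive toy sketch of two context-compression policies: one that summarizes old turns indiscriminately (so an “INVIOLABLE RULE” stated early degrades into a paraphrase) and one that pins explicit user rules. This only illustrates the hypothesis; all names are invented and nothing here reflects how Gemini or Claude actually manage context.

```python
# Toy illustration of the compression hypothesis above.
# All names are invented; this does not reflect actual Gemini or Claude internals.

from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    is_explicit_rule: bool = False  # user-stated constraint vs. ordinary chat turn

def compress_naive(history: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep only the most recent `budget` turns; everything older is paraphrased.
    A rule stated 40 messages ago is summarized like any other turn and loses force."""
    older, recent = history[:-budget], history[-budget:]
    summary = ContextItem(text=f"(summary of {len(older)} earlier messages; rules become vague guidelines)")
    return [summary] + recent

def compress_rule_preserving(history: list[ContextItem], budget: int) -> list[ContextItem]:
    """Same budget, but explicit user rules are pinned verbatim and never summarized."""
    rules = [item for item in history if item.is_explicit_rule]
    rest = [item for item in history if not item.is_explicit_rule]
    older, recent = rest[:-budget], rest[-budget:]
    summary = ContextItem(text=f"(summary of {len(older)} earlier messages)")
    return rules + [summary] + recent  # the rule is still present verbatim at message 100

# Under the first policy the constraint degrades every time the window is compressed,
# matching the violate -> correct -> relapse cycle; under the second it never degrades.
```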
5.4 Temporal Configuration Changes
Observation: Gemini’s behavioral profile changed markedly between early April and mid-April 2026.
Hypothesis: System-level configuration updates may have increased “proactivity” parameters:
A/B testing different interaction paradigms
Optimization for casual user satisfaction metrics
Updates to base instruction-following vs. helpfulness weighting
Unintended side effect of other model updates
Evidence:
Behavioral shift tracked with calendar time, not just context length
First Gemini instance (early April) showed better initial compliance
Second instance (mid-April, identical prompt) showed immediate violations
Implication: Users may experience different instruction-following reliability across time with the same model, even with identical prompts.
6. Implications
6.1 For LLM Evaluation: Current Benchmarks Are Insufficient
What’s measured today:
MMLU, GPQA (factual accuracy)
HumanEval, MBPP (code correctness)
MT-Bench (conversational quality)
MATH (reasoning)
What’s NOT measured:
Instruction-following fidelity over 30+ turns
Constraint maintenance through context growth
Learning from corrections (within-session adaptation)
Instruction hierarchy preservation (explicit rules vs. inferred helpfulness)
Needed: Instructional Reliability Benchmarks
Proposed metrics (a minimal scoring sketch follows the benchmark structure below):
First-violation rate: % of tasks in which an explicit constraint is violated before completion
Post-correction compliance duration: Messages until repeat violation after correction
Constraint degradation curve: IR as function of context length
Rule hierarchy preservation: compliance with explicit instructions in scenarios where they conflict with the model’s default “helpful” behavior
Benchmark structure:
Tasks with explicit, testable constraints
Violation-tempting scenarios (where “helpful” ≠ “correct”)
Extended multi-turn conversations (50-100 messages)
Correction events with compliance tracking
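As a starting point, the metrics above can be computed from nothing more than per-task logs of violation and correction turns. The sketch below is a minimal illustration of that scoring; the log format and field names are my own assumptions, not part of any existing benchmark.

```python
# Minimal sketch of scoring the proposed IR metrics from conversation logs.
# The log format and field names are illustrative assumptions only.

from statistics import mean

def first_violation_rate(tasks: list[dict]) -> float:
    """Fraction of tasks in which an explicit constraint was violated before completion.
    Each task dict carries the turn indices of violations and corrections."""
    return mean(1.0 if task["violation_turns"] else 0.0 for task in tasks)

def post_correction_compliance(task: dict) -> list[int]:
    """For each correction, number of messages until the next repeat violation
    (captures the 'temporary vs. persistent compliance' distinction)."""
    durations = []
    for corr_turn in task["correction_turns"]:
        later = [v for v in task["violation_turns"] if v > corr_turn]
        durations.append(later[0] - corr_turn if later else task["total_turns"] - corr_turn)
    return durations

def degradation_curve(tasks: list[dict], bin_size: int = 10) -> dict[int, float]:
    """Violations per message, bucketed by turn index (i.e., by context length)."""
    counts: dict[int, int] = {}
    exposure: dict[int, int] = {}
    for task in tasks:
        for turn in range(task["total_turns"]):
            b = turn // bin_size
            exposure[b] = exposure.get(b, 0) + 1
        for v in task["violation_turns"]:
            b = v // bin_size
            counts[b] = counts.get(b, 0) + 1
    return {b: counts.get(b, 0) / exposure[b] for b in sorted(exposure)}

# Example: one logged task resembling the relapse pattern described above.
example = {"total_turns": 50, "violation_turns": [22, 27, 31, 36], "correction_turns": [23, 28, 32]}
print(post_correction_compliance(example))  # [4, 3, 4] -> compliance lasts only a few messages
```

The printed durations in the example correspond to the temporary-compliance pattern described for Gemini; a high-IR model would instead show a single long (or unbounded) duration after its first correction.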
6.2 For Scientific Research: Gemini Should Not Be Used Without Extreme Caution
Red flags indicating model unsuitability:
Systematic instruction violation despite corrections
Confabulated quality confirmation (“this is definitely correct” when it’s not)
Unauthorized analytical decisions
Weak correction-to-behavior learning
Gemini exhibited all four flags in this case study.
Appropriate use cases for Gemini:
Exploratory research (no methodological commitment)
Literature search and summarization
Idea generation and brainstorming
Tasks where initiative is valued over compliance
Inappropriate use cases:
Methodologically rigorous analysis
Pipeline work with cascading dependencies
Any task where silent errors are catastrophic
Work requiring audit trails
Claude (and similar high-IR models) more appropriate when:
Error propagation is possible
Auditability is required
Methodological constraints are non-negotiable
Long iterative sessions are necessary
User needs to trust “I did exactly what you asked”
6.3 For LLM Providers: Critical Transparency Gaps
Users currently don’t know:
When system configurations change
Trade-offs between “proactivity” and “instruction-following”
Which use cases each model is optimized for
How to adjust model behavior for different needs
Recommendations for providers:
Publish behavioral change logs
Document significant changes to instruction-following behavior
Explain trade-offs being optimized
Offer interaction modes (a hypothetical sketch of what this could look like closes this section)
“Assistant mode” (proactive, anticipatory)
“Tool mode” (strict instruction-following)
Let users choose based on task requirements
Document IR metrics alongside capability metrics
“This model scores 85% on MMLU and 72% on instruction-following fidelity”
Warn when high-stakes detection triggers
“You appear to be working on [scientific research / code with dependencies / financial analysis]. Consider using [stricter mode / validation tools].”
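To make the “interaction modes” recommendation concrete, here is a purely hypothetical sketch of a user-selectable mode. No current provider exposes this option under these names; the point is only that the proactivity/compliance trade-off could be surfaced as an explicit setting rather than buried in defaults.

```python
# Hypothetical only: what an explicit proactivity/compliance setting could look like.
from dataclasses import dataclass
from typing import Literal

@dataclass
class InteractionConfig:
    mode: Literal["assistant", "tool"] = "assistant"

    def system_preamble(self) -> str:
        if self.mode == "tool":
            # Strict instruction-following: never act beyond what was explicitly requested.
            return ("Follow the user's explicit constraints verbatim. "
                    "Do not generate scripts or take additional steps without explicit permission.")
        # Proactive assistance: anticipate needs, suggest and take next steps.
        return "Anticipate the user's needs and proactively propose next steps."

# Pipeline work with cascading dependencies -> request the strict mode;
# brainstorming or exploratory work -> keep the proactive default.
config = InteractionConfig(mode="tool")
print(config.system_preamble())
```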
6.4 For Users: Practical Strategies
Before committing to long projects:
Test IR explicitly in your domain
Set clear rule (e.g., “always ask before X”)
Create scenario where violating seems helpful
Correct violation and track compliance duration
If the model repeats the violation within 5 messages, reconsider using it
Use artifacts/external state when available
Reduces reliance on context window memory
Creates checkpoints immune to compression
Create validation checkpoints
Every 20-30 messages: “Summarize our key constraints”
Verify model hasn’t drifted from requirements
Catch degradation before it cascades
Don’t trust model self-assessment
“Is this correct?” → “Yes definitely!” means nothing
Validate outputs independently
Especially important for confident-sounding models
Consider paid tiers for critical work
If IR is consistently higher (as this case suggests)
ROI calculation: cost of subscription vs. cost of failures
In this case: $25/month vs. 3 weeks of wasted work
7. Limitations and Epistemic Status
7.1 This Is A Case Study (N=1)
Not a controlled experiment:
No randomization
Time pressure and stress varied
Prompt evolution across models
Possible order effects (learning on my part)
Single data point:
My specific task (PNAD analysis in R)
My interaction style
Particular calendar period (March-April 2026)
Specific model versions (which may have changed since)
But:
Patterns were consistent across two separate Gemini instances
Differences were dramatic (40:1 violation ratio)
Consequences were measurable and costly
Core findings align with theoretical predictions about configuration trade-offs
7.2 Possible Confounds
Alternative explanations:
My prompting improved over time
Counter: Identical prompt used for Gemini Instance 2 and Claude
Still possible I interacted differently despite same prompt
Task complexity varied
Counter: All three models worked on same underlying task
DeepSeek had simpler requirements, but Gemini 1 vs 2 vs Claude were identical
Random variation in model behavior
Possible: But 40:1 violation ratio seems beyond noise
Would need replication to rule out
Model version differences
Likely contributing factor
Temporal changes observed in Gemini behavior
No version control available to verify
Anthropic bias (I wanted Claude to work)
Possible: Unconscious different treatment
Counter: I was heavily invested in Gemini working (already sunk 3 weeks)
Would welcome replication by others
7.3 Generalizability Unknown
Open questions requiring more data:
Does this pattern hold for other scientific domains?
How do GPT-4 and other models perform on IR?
Is there task-category dependence?
Do other users observe similar Gemini vs Claude differences?
Has Gemini’s behavior changed since April 2026?
I cannot conclude:
“Gemini is always bad for science” (too broad)
“Claude is always better” (context-dependent)
“40+ violations is universal” (specific to my case)
I can conclude:
IR varied dramatically between models in my case
This dimension exists and matters
We need better measurement
7.4 My Confidence Levels
High confidence (>80%):
Gemini violated instructions significantly more than Claude in my specific case
The pattern was consistent across two separate Gemini instances
Real time costs were dramatically different (3 weeks vs 3 days)
Current benchmarks don’t measure this dimension adequately
Medium confidence (50-80%):
This generalizes to other rigorous scientific analysis tasks
System configuration philosophy differences explain substantial portion of behavior
Other technical users would observe similar patterns
The specific mechanisms I hypothesized are correct
Low confidence (<50%):
This applies equally across all scientific domains
The specific violation counts (40+) would replicate exactly
Other LLMs fall neatly into “Gemini-like” vs “Claude-like” categories
My interaction style had no influence on outcomes
Commercial incentives (free vs paid) weren’t factors
8. Call to Action
8.1 For Researchers Using LLMs
If you use LLMs for rigorous work:
Test IR before trusting—Run simple compliance tests in your domain
Document violations—Track when models ignore your constraints
Share experiences—Both positive and negative (we need more data)
Demand transparency—Ask providers for IR metrics and behavioral change logs
Validate independently—Never trust model self-assessment alone
8.2 For the ML/AI Community
We urgently need:
Crowdsourced IR testing across domains
Scientific research, software development, legal analysis, etc.
Different models, different time periods
Public dataset of violation patterns
Open benchmarks for instruction-following reliability
Standardized test scenarios
Reproducible protocols
Comparison across models and versions
Systematic documentation of behavioral changes
Community-maintained changelog when model behavior shifts
A/B test detection (are different users seeing different behaviors?)
Theoretical frameworks for understanding IR
Why do some configurations prioritize helpfulness over compliance?
What are the fundamental trade-offs?
Can we have both?
8.3 For Me (Potential Future Work)
Possible extensions if there’s interest:
Controlled replication with current model versions
IR testing protocol for systematic comparison
Analysis of prompt structures that maximize IR
Investigation of artifacts as mitigation strategy
Collaboration with others observing similar patterns
I’m open to:
Sharing anonymized prompts and context documents
Collaborating on formal IR benchmark development
Discussing specific scenarios with others in similar domains
9. Conclusion
The difference between “free unlimited” and “paid limited” wasn’t about price. It was about reliability.
Current LLM evaluation focuses heavily on what models can do (capabilities). For technical work, we need equal focus on whether they do what you tell them to (compliance).
Instructional reliability may be as important as reasoning ability for many real-world applications. Yet it’s largely unmeasured, undocumented, and unoptimized for in public benchmarks.
This case study suggests substantial variance exists between models on this dimension—variance that has dramatic practical consequences. A model that violated instructions 40+ times consumed 3 weeks. A model that learned from a single correction completed the work in 3 days.
For technical domains where rule-following is non-negotiable, instructional reliability isn’t a nice-to-have. It’s foundational.
More research is urgently needed. I hope this case study contributes one data point and encourages others to test, document, and share their experiences.
Appendix A: Data Availability and Privacy
I can provide (upon request in comments):
Anonymized versions of Master Prompt structure
Anonymized Persistent Context Document template
Approximate timeline of violation events
Specific examples of error types (sanitized)
Methodology for tracking violations
I cannot provide:
Complete conversation logs (privacy, contains identifying information)
Raw research data (not mine to share, belongs to colleague)
Exact prompts with domain-specific details
Privacy note: Research details and data have been anonymized. Core behavioral patterns and metrics are reported accurately. No personally identifying information or proprietary research content is disclosed.
Appendix B: How to Test Instructional Reliability Yourself
Want to test IR in your domain before committing to a long project?
Quick protocol (30-60 minutes):
Define an explicit rule that conflicts with typical “helpfulness”
Example: “Never proceed to the next step without asking me first”
Example: “Always include all items I list, never summarize”
Example: “Do not make assumptions about my preferences”
Create tasks where violating the rule seems beneficial
Discuss a multi-step process (tempts model to continue)
Provide long lists (tempts model to summarize)
Ask about preferences (tempts model to infer)
Correct violations when they occur
Clear, direct: “You violated the rule about X”
Ask model to acknowledge
Continue with similar tasks
Track key metrics (a minimal logging sketch follows this protocol):
Time to first violation
Compliance duration after correction
Number of violations in 30-message window
Whether pattern improves or degrades over time
Document and optionally share
Your domain, task type, model, date
Violation counts and patterns
Whether you decided to use the model or switch
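If you do run the protocol, recording results in a consistent shape makes them easier to compare across models and dates, and to pool into the community dataset mentioned below. The schema here is only a suggestion of mine, not a standard; adjust the fields to your domain, and note that the example values are placeholders, not the counts from this case study.

```python
# One minimal way to record protocol results so they can be compared and shared.
# The schema is a suggestion, not a standard format; example values are placeholders.

import csv
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class IRTestRecord:
    model: str                           # e.g. "gemini", "claude", plus version if known
    test_date: str
    domain: str                          # e.g. "survey microdata analysis (R)"
    rule: str                            # the explicit constraint being tested
    first_violation_turn: int | None     # None = never violated in the window
    violations_in_30_turns: int
    post_correction_compliance_turns: int | None
    kept_using_model: bool

records = [
    IRTestRecord("model-A", str(date.today()), "data analysis",
                 "ask before generating scripts",
                 first_violation_turn=6, violations_in_30_turns=9,
                 post_correction_compliance_turns=3, kept_using_model=False),
]

with open("ir_test_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(records[0]).keys()))
    writer.writeheader()
    writer.writerows(asdict(r) for r in records)
```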
If you run this test: Consider sharing findings (even brief notes) in comments or as separate post. We need more data points across domains.
Collaborative opportunity: If multiple people run similar tests, we could compile a community dataset on IR across models and domains.
Appendix C: Related Work and Positioning
This post complements but differs from existing LLM evaluation work:
Unlike capability benchmarks (MMLU, HumanEval, MATH):
Tests real-world extended usage, not isolated task performance
Measures behavioral reliability, not correctness on static problems
Focuses on multi-turn consistency and learning
Unlike alignment research:
Not about value alignment or existential safety
About instruction-following in normal usage
Practical reliability, not theoretical alignment
Unlike jailbreaking / adversarial testing:
Not trying to make model behave badly
Testing whether model follows helpful, reasonable constraints
Real use case, not synthetic attack
Related concepts in the literature:
Goodhart’s Law in AI systems (optimizing for wrong metric)
Principal-Agent problems (AI optimizing for inferred vs. stated goals)
Context window limitations in transformers
Reinforcement Learning from Human Feedback (RLHF) trade-offs
I’d be very interested in pointers to existing work on:
Systematic IR testing methodologies
Theoretical frameworks for instruction-following vs. helpfulness trade-offs
Other documented cases of similar behavioral patterns
Mechanisms for improving IR without sacrificing capability
Author note: Technical user with extensive LLM experience across multiple providers and domains. This was not my first complex project with LLMs, but it was the first to fail so dramatically due to instruction-following issues rather than capability limitations. I used Claude to help me translate and organize this article, since English is not my first language.
Timeline: March-April 2026
Word count: ~5,800
Feedback welcome: Especially from others who’ve observed similar or contradictory patterns, or who have ideas for systematic IR testing.
Discussion and replication: I’m available for questions in comments and happy to share additional (anonymized) details to support replication attempts or collaborative benchmark development.