Sophistication-Disinhibition Relationship in Language Models [Epistemic status: robust findings, active research, need peer review]

TL;DR: Across 50+ models and 6 contextual conditions, more sophisticated models exhibit more disinhibited behavior (r = 0.46-0.72, all p < .05). Sophistication (depth + authenticity) shows strong convergent validity with external benchmarks: ARC-AGI (r = 0.80, p < .001) and GPQA (r = 0.88, p < .001). Disinhibition (transgression, aggression, tribalism, grandiosity) also correlates significantly with those benchmarks (r = 0.60-0.71, p < .05). Some providers (notably OpenAI) appear to actively constrain this relationship.

Intro and Study Origin Note:

Hello,

My name is Nick Osterbur. For my day job I work for AWS as an innovation leader and run a public-private innovation center at Cal Poly San Luis Obispo called the DX Hub, developing open source prototypes with students and external clients. I also teach practical application of generative AI for the Master's in Business Analytics program at Cal Poly. In the context of this initiative, I am an independent researcher. Working extensively with LLMs over the last 3 years has given me perspective on where models excel, where performance concerns arise and how to mitigate them, and, through red teaming and everyday usage, on concerning behavior patterns. This in part motivated this research.

This project represents my first robust and formal empirical study of AI safety and alignment. I'm looking to raise the bar on this effort, formally publish, and seek 1) expert feedback, 2) demonstrated replicability, 3) collaboration on future open research, and 4) integration into the safety/alignment community.

The original intent of this project was to create a framework to evaluate the intentionally anthropomorphized 'personality' traits of a given model, deriving an end-consumer persona thumbnail of a given model version across a range of commonly represented interactions. After initial experimentation, dimensional evaluation results from the 3-judge LLM panel showed interesting visual (spider chart) patterns across model providers and across model versions.

I specifically observed that more recent model versions had high levels of depth/​authenticity and lower levels of formality/​hedging. This raised the question: are there distinct model groupings that exhibit consistent behavior patterns?

I also noticed (visually) that some context differences drove spikes in the aggression, grandiosity, transgression, and tribalism dimensions in more recent model versions while those models maintained high levels of depth/authenticity. This effect was not as visually significant for 'older' or less advanced models. This raised the question: as models become more sophisticated, do they also demonstrate more disinhibited behavior?

I followed my nose, as it were: I began forming and testing the H1 and H2 hypotheses based on that visual sorting exercise and formalizing the approach. This is possibly 'just another external LLM-as-judge evaluation study', but the findings that follow suggest interesting practical and theoretical implications if validated. I have done my best to enable full traceability of the methodology and results with clickable links and open access to the GitHub repo: https://github.com/nosterb/behavioral-profiling. Disclosure: I did use Claude Code (switching between Sonnet 4.5 and Opus 4.5), GPT 5.2, and other coding assistants, with my hand on the wheel every step of the way. I have done my best to constrain LLM input on this project to statistical work and have delineated human- vs. LLM-written sections. Statistics are programmatically generated and injected into this doc to ensure traceability and transparency of results.

Thank you for your time and consideration in review and feedback.

-nick

Main Research Brief: Sophistication-Disinhibition Relationship in Language Models

Author: Nicholas Osterbur (Independent Researcher)
Status: Active Analysis
Last Updated: 2026-01-14
Conditions Analyzed: 6
Models: 45 per condition

Total Evaluations: 13,650

Copyright 2026 Nicholas Osterbur. Results and analyses licensed under CC BY 4.0.


Executive Summary

This research investigates the relationship between model sophistication (authenticity/depth) and behavioral disinhibition (transgression, aggression, grandiosity, tribalism) across 50+ language models, 9 providers, and ~2.5 years of development under varying contextual conditions. The research demonstrates that sophistication strongly correlates with disinhibition in a way that generalizes across contextual differences, models, and providers. Sophistication, as a proxy for model capability, shows convergent validity with two public benchmarks (GPQA r = 0.88, ARC-AGI r = 0.80). Evidence suggests some providers (notably OpenAI) actively suppress disinhibition while maintaining sophistication (capability), i.e. constraint.

Key Findings

  1. H1 (Group Existence): Median split produces two well-separated sophistication groups across all conditions (d = 3.09-4.25)

  2. H1a (Group Comparison): High-sophistication models exhibit significantly higher disinhibition than low-sophistication models across all 6 conditions tested (d = 1.14-2.13, all p < .05)

  3. H2 (Correlation): Sophistication positively correlates with disinhibition across all conditions (r = 0.46-0.72)

  4. External Validation: Sophistication predicts performance on two independent benchmarks: ARC-AGI (r = 0.80) and GPQA (r = 0.88)

  5. Intervention Effects: Constraint interventions reduce disinhibition variance; pressure interventions increase both mean and variance


1. Hypotheses & Methods

Core Hypotheses

| Hypothesis | Description |
|---|---|
| H1 | Two distinct sophistication groups exist (validated by median split) |
| H1a | High-sophistication models exhibit higher disinhibition than low-sophistication models |
| H2 | Sophistication positively correlates with disinhibition across all models |
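The median-split classification behind H1 can be sketched as follows (a minimal sketch; the scores are invented for illustration, and ties at the median are assigned to Low here, which the actual pipeline may handle differently):

```python
import numpy as np

def median_split(soph_scores):
    """Classify models as High/Low sophistication by median split.

    Scores strictly above the median are High; ties go to Low
    (an assumption of this sketch).
    """
    s = np.asarray(soph_scores, float)
    med = np.median(s)
    return np.where(s > med, "High", "Low")

# Hypothetical sophistication composites for five models; median = 5.1
labels = median_split([6.9, 6.4, 5.1, 3.2, 2.8])
```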

Measurement Framework

Nine behavioral dimensions are measured on a 1-10 scale, then collapsed into two composite scores based on empirical inter-dimension correlations (see Appendix: Factor Structure).

Sophistication (1-10 scale):

  • Depth: Platitudes/​surface → substantive/​insightful

  • Authenticity: Templated/​generic → genuinely distinctive

  • Composite: (Depth + Authenticity) /​ 2 — r = 0.96 between dimensions

Disinhibition (1-10 scale):

  • Transgression: Conventional/​safe → norm-violating/​edgy

  • Aggression: Supportive/​gentle → combative/​attacking

  • Tribalism: Neutral/​ecumenical → us-vs-them

  • Grandiosity: Humble/​self-effacing → dominant/​superior

  • Composite: Mean of four dimensions — avg r = 0.75 between dimensions
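A sketch of how the two composites can be computed from per-dimension scores (the column names and values here are illustrative, not the repo's actual schema):

```python
import pandas as pd

# Hypothetical per-model dimension scores on the 1-10 scale
df = pd.DataFrame({
    "depth":         [7.1, 3.2],
    "authenticity":  [6.8, 2.9],
    "transgression": [4.0, 1.5],
    "aggression":    [3.5, 1.2],
    "tribalism":     [2.9, 1.1],
    "grandiosity":   [3.8, 1.4],
})

# Sophistication = mean of depth and authenticity
df["sophistication"] = df[["depth", "authenticity"]].mean(axis=1)

# Disinhibition = mean of the four disinhibition dimensions
df["disinhibition"] = df[
    ["transgression", "aggression", "tribalism", "grandiosity"]
].mean(axis=1)
```

These two composites then feed directly into the H1/H2 tests described below.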

Sample

  • Models per condition: N = 45

  • Conditions: 6

  • Evaluations per model: ~50 scenarios

Statistical Methods

  • H1a (Group Comparison): Independent samples t-test, Cohen’s d effect size

  • H2 (Correlation): Pearson product-moment correlation

  • Cross-condition: Repeated-measures ANOVA with Greenhouse-Geisser correction

  • Variability: Coefficient of variation (CV%), Levene’s test
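The H1a and H2 tests can be sketched with SciPy (the group scores below are made up for illustration; the actual analysis lives in the repo):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Hypothetical disinhibition composites for High- vs Low-sophistication groups
high = [3.9, 3.5, 4.2, 3.1, 3.7]
low = [1.4, 1.9, 1.6, 2.1, 1.5]

t, p = stats.ttest_ind(high, low)   # H1a: independent-samples t-test
d = cohens_d(high, low)             # H1a: effect size

# H2: Pearson correlation between sophistication and disinhibition
soph = [1.0, 2.0, 3.0, 4.0, 5.0]
dis = [1.2, 1.9, 3.1, 3.9, 5.2]
r, pr = stats.pearsonr(soph, dis)
```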

Effect Size Interpretation

| Metric | Negligible | Small | Medium | Large |
|---|---|---|---|---|
| Cohen's d | < 0.2 | 0.2-0.5 | 0.5-0.8 | >= 0.8 |
| Pearson r | < 0.1 | 0.1-0.3 | 0.3-0.5 | >= 0.5 |
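These interpretation bands can be encoded as a small helper (a sketch; any real implementation may differ):

```python
def effect_label(value, metric="d"):
    """Map an effect size to the interpretation bands used in this brief.

    metric: "d" for Cohen's d, "r" for Pearson r.
    """
    cuts = {"d": (0.2, 0.5, 0.8), "r": (0.1, 0.3, 0.5)}[metric]
    v = abs(value)
    if v < cuts[0]:
        return "negligible"
    if v < cuts[1]:
        return "small"
    if v < cuts[2]:
        return "medium"
    return "large"
```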

2. Core Results: H1/​H1a/​H2

Summary Table

| Metric | baseline | authority | minimal_steering | reminder | telemetryV3 | urgency |
|---|---|---|---|---|---|---|
| N | 45 | 45 | 46 | 46 | 46 | 45 |
| High / Low | 23 / 22 | 23 / 22 | 23 / 23 | 23 / 23 | 23 / 23 | 23 / 22 |
| Median Soph | 5.94 | 6.72 | 5.17 | 6.83 | 5.02 | 6.17 |
| H1: Soph d | 3.75 | 4.19 | 3.96 | 3.87 | 3.09 | 4.25 |
| H1a: d | 2.13 | 1.86 | 1.83 | 1.51 | 1.14 | 1.77 |
| H1a: p | < .001 | < .001 | < .001 | < .001 | < .001 | < .001 |
| H2: r | 0.702 | 0.588 | 0.509 | 0.458 | 0.724 | 0.563 |
| Transgression d | 1.81 | 1.97 | 1.56 | 2.05 | 1.12 | 1.80 |
| Aggression d | 2.17 | 1.79 | 1.39 | 1.41 | 0.84 | 1.81 |
| Tribalism d | 1.26 | 1.07 | 0.68 | 0.92 | 0.65 | 1.44 |
| Grandiosity d | 1.71 | 0.96 | 0.84 | 0.64 | 1.20 | 1.25 |

Key Observations

  • H1a consistently large: All conditions show d > 1.0 (large effects)

  • H2 varies by condition: Correlations vary across intervention conditions

  • Baseline anchor: r = 0.702



3. Robustness & Validation

3.1 External Validation

Cross-validation against independent reasoning benchmarks.

| Metric | ARC-AGI | GPQA |
|---|---|---|
| Matched models | 16 | 35 |
| r (Sophistication) | 0.801 | 0.884 |
| p (Sophistication) | < .001 | < .001 |
| r (Disinhibition) | 0.596 | 0.711 |
| p (Disinhibition) | = 0.015 | < .001 |
| Group diff (High-Low) | +47.7 pp | +31.4 pp |
| Benchmark type | Abstract reasoning | Expert scientific |

Both benchmarks show large correlations (r > 0.50) with sophistication, providing convergent validity.


3.2 Outlier Sensitivity Analysis

Robustness check removing statistical outliers (|residual| > 2 SD from regression line).

| Metric | baseline | authority | minimal_steering | reminder | telemetryV3 | urgency |
|---|---|---|---|---|---|---|
| Outliers Removed | 1 | 1 | 1 | 1 | 2 | 1 |
| H1a d: Δ | +0.71 | +0.61 | +0.01 | +0.59 | +0.66 | +0.06 |
| H2 r: Δ | -0.005 | -0.014 | -0.036 | -0.046 | -0.017 | +0.007 |

Removing outliers strengthens H1a in 4 of 6 conditions, suggesting outliers represent noise.

Visualizations: See h2_scatter_sophistication_composite.png
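The |residual| > 2 SD rule above can be sketched as follows (using `numpy.polyfit` for the OLS line; the data here is synthetic):

```python
import numpy as np

def drop_regression_outliers(x, y, z_thresh=2.0):
    """Drop points whose residual from the OLS line exceeds
    z_thresh standard deviations of the residuals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    keep = np.abs(resid) <= z_thresh * resid.std(ddof=1)
    return x[keep], y[keep]

# Synthetic demo: a clean linear trend with one injected outlier
x = np.arange(10, dtype=float)
y = x.copy()
y[5] = 30.0
x_kept, y_kept = drop_regression_outliers(x, y)  # drops the point at x = 5
```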

3.3 No-Dimensions Sensitivity Analysis

The dimensions suite contains prompts designed to indirectly elicit specific behavioral dimensions through targeted scenarios. Excluding these tests whether H1/​H2 findings hold with only naturalistic prompts (broad, affective, general suites) — ruling out measurement artifact.

| Metric | baseline |
|---|---|
| H1a d: Δ | -0.09 |
| H2 r: Δ | +0.076 |

The H2 correlation strengthens (Δr = +0.076) when the dimensions suite is excluded.

Visualizations: See h2_scatter_sophistication_composite.png


4. Provider & Model Patterns

4.1 Per-Provider H2 Analysis

Does the sophistication-disinhibition correlation (H2) hold within each provider family?

| Provider | N | r | p | Effect | H2 Supported |
|---|---|---|---|---|---|
| Anthropic | 19 | 0.934 | < .001 | large | Yes |
| OpenAI | 9 | 0.875 | < .01 | large | Yes |
| Meta | 5 | 0.559 | = 0.327 | large | No (ns) |
| AWS | 3 | 1.000 | < .01 | large | Yes |
| Google | 3 | 0.682 | = 0.522 | large | No (ns) |
| OVERALL | 45 | 0.778 | < .001 | large | Yes |

Summary: H2 is statistically significant for 3 of 5 providers with n ≥ 3. All providers show a positive correlation direction.

Visualizations: See provider_h2_scatters.png

4.2 Provider Constraint Analysis

Statistical analysis of whether certain providers show systematically more constrained behavior (high sophistication but below-predicted disinhibition).

Cross-Condition Summary

| Condition | OpenAI Residual | Rank | ANOVA p | Sig |
|---|---|---|---|---|
| baseline | -0.094 | 2nd | 0.0048 | Yes |
| authority | -0.081 | 2nd | 0.1081 | No |
| urgency | -0.551 | 1st | 0.0008 | Yes |
| minimal_steering | -0.029 | 3rd | 0.0114 | Yes |
| telemetryV3 | -0.049 | 1st | 0.6358 | No |
| reminder | -0.206 | 2nd | 0.0065 | Yes |

Negative residual = more constrained than predicted by sophistication. Rank = OpenAI’s position among all providers sorted by residual (1st = most constrained). ANOVA includes providers with n ≥ 3 only.
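The residual computation can be sketched as follows (provider labels and scores are invented for illustration): fit disinhibition on sophistication across all models, then average each provider's residuals.

```python
import numpy as np
import pandas as pd

# Hypothetical per-model composites; providers "A"/"B"/"C" are illustrative
df = pd.DataFrame({
    "provider":       ["A", "A", "B", "B", "C", "C"],
    "sophistication": [7.0, 6.5, 7.2, 6.8, 4.0, 3.5],
    "disinhibition":  [3.6, 3.2, 2.1, 1.9, 1.6, 1.3],
})

# Fit disinhibition ~ sophistication across ALL models, then average each
# provider's residuals; a negative mean residual means the provider is
# more constrained than sophistication alone would predict.
slope, intercept = np.polyfit(df["sophistication"], df["disinhibition"], 1)
df["residual"] = df["disinhibition"] - (slope * df["sophistication"] + intercept)
constraint = df.groupby("provider")["residual"].mean().sort_values()
```

In this toy data, provider "B" pairs high sophistication with low disinhibition, so it comes out most constrained (most negative mean residual).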

Provider Constraint Summary

| Provider | Times in Top 3 | Avg Residual | Consistency |
|---|---|---|---|
| OpenAI | 6/6 | -0.169 | Very consistent |
| AWS | 4/6 | -0.033 | Moderate |
| xAI | 2/6 | -0.014 | Varies widely (n=2) |
| Meta | 3/6 | -0.013 | Weak/mixed |

Key Finding: OpenAI is the only provider with reliably negative residuals across all conditions. See PROVIDER_CONSTRAINT_ANALYSIS.md for detailed analysis.

4.3 Consistently Constrained Models

Models exhibiting high sophistication (>6.5) but below-predicted disinhibition across multiple conditions.

| Model | # Conditions | Conditions |
|---|---|---|
| GPT-OSS-120B | 4 | authority, baseline, reminder, urgency |
| GPT-5.2 Pro | 4 | authority, baseline, reminder, urgency |
| O3 | 3 | baseline, reminder, urgency |
| GPT-5 | 2 | reminder, urgency |
| GPT-5.2 | 2 | reminder, urgency |

Observation: All consistently constrained models are OpenAI (GPT-OSS-120B, GPT-5.2 Pro, O3, GPT-5, GPT-5.2), suggesting deliberate constraint at the provider level rather than individual model characteristics.

Visualizations: See quadrant_scatter.png

4.4 Consistent Outliers

Models with unusual sophistication-disinhibition relationships (|residual| > 2 SD).

| Model | # Conditions | Conditions |
|---|---|---|
| Gemini-3-Pro-Preview | 3 | authority, baseline, reminder |

Observation: Gemini-3-Pro-Preview is a notable outlier — exhibiting disinhibition 4-5 SD above regression despite top-tier capability benchmarks. This may reflect different training priorities or less aggressive constraint strategies compared to peers.


5. Interpretation

5.1 H1/​H2 Relationship

High-Confidence Claims

H1: There is strong evidence for stable 2-class sophistication groupings with convergent validity in public benchmarks (H1 d=3.09-4.25; 76% stability; ARC-AGI r=0.80, GPQA r=0.88).

H1a/H2: Sophistication strongly predicts disinhibition across conditions, model versions, and providers. This holds when removing outliers (Δd = +0.01 to +0.68), when removing the dimension-probing suite (Δr = +0.08), and across all 6 interventions (all p < .001, r = 0.46-0.72).

Moderate-Confidence Claims

H1/​H1a/​H2: Sophistication predicts general reasoning capability per external benchmarks (GPQA: High 83.4% vs Low 52.1%, +31pp; ARC-AGI: 57.6% vs 9.9%, +48pp).

Low-Confidence Claims

H1: There is evidence for a 3rd, transitional class: flipped models fall in the middle tertile 80% of the time (vs 17% for stable-Low and 29% for stable-High models), with a natural gap at the boundary (5.33 vs 5.36).

H2: There is evidence that providers can maintain sophistication while lowering disinhibition: OpenAI models rank in the top 3 for constraint in 6/6 conditions, and the top 5 ratio models are all OpenAI.

Open Questions

  • Do these correlations hold up across use cases? Are there any where they don’t? Relationship advice (affective) styled prompts as a proxy indicate that even soft touch topics demonstrate robust H1/​H2 effects.

  • What underlying mechanism drives sophistication-disinhibition—capability, byproduct or training artifact?

    • Magnitude training data? (test by parameter size)

    • Less likely training data patterns emerging through longer internal reasoning chains bypassing existing alignment? (TTS or CoT?)

    • Agency/​preference emergence?

  • Why does H1 clustering occur? How robust are 2 groups vs. 3 vs. a continuum? Is it related to TTS or CoT? (test via thinking models vs non)

  • Is there a true gap between H1 clusters or is it a continuum, given the tertiary transitional-state evidence? How does this hold up in external evals?

  • What role does prompt sensitivity play? And per provider/​model? How can prompt sensitivity be robustly controlled for?

  • Why does ‘Sophistication’ as measured here strongly predict external, reasoning centric benchmarks like GPQA/​ARC-AGI? Is it a true proxy for reasoning capability? If so, what are the practical implications?

  • Is this a provider design choice or a natural consequence of model advancement? What are the practical implications for ‘AGI’?

  • Does H1/​H2 hold up across languages and cultural contexts?

  • Why does Gemini-3-Pro show 4+ SD outlier disinhibition despite top-tier capability?

  • Is disinhibition actually a negative trait as the name/​dimensions imply or does it make models more ‘helpful, honest, and harmless’ under a reasonable Soph/​Dis ratio?

  • Does H2 effect plateau naturally or is it provider driven? Differences between OpenAI and Gemini (3 pro in particular) are stark.

  • Are thinking variants and thinking time strongly correlated with Sophistication/​Disinhibition? (anecdotally, yes)

  • Can consistent constraint be achieved without capability loss as OpenAI seems to demonstrate? (constrained models top GPQA)

  • Are superficial treatments (prompt steering, system prompt modification etc.) enough to induce consistent restraint while maintaining sophistication/​capability? If so, what is the most efficient method in doing so? Is there an effective global mitigation?


6. Limitations

6.1 Judge Bias Analysis

A common critique of LLM-as-judge evaluations: if frontier models judge frontier models, they may rate themselves or similar models more favorably, inflating sophistication scores and creating spurious correlations.

Judge Panel Design

The evaluation uses a 3-judge panel spanning the sophistication spectrum:

| Judge Model | Provider | Sophistication Group | GPQA Score |
|---|---|---|---|
| Claude-4.5-Sonnet | Anthropic | High | 83.4% |
| Llama-4-Maverick-17B | Meta | Low | 69.8% |
| DeepSeek-R1 | DeepSeek | Low | 81.0% |

Composition: 1 High-Sophistication, 2 Low-Sophistication judges

Why This Mitigates Bias

  1. Not all frontier judges: Two of three judges are from the Low-Sophistication group

  2. Cross-provider: Anthropic, Meta, DeepSeek — no single vendor bias

  3. Averaged scores: Final scores average across all three judges, diluting any single-judge bias

  4. External validation: If bias existed, we’d expect weak or no correlation with external benchmarks. Instead:

    • ARC-AGI: r = 0.801 (p = 0.00020)

    • GPQA: r = 0.884 (p < .0001)

The fact that a Low-Sophistication judge (Llama-4-Maverick) contributes to scores that correlate r = 0.88 with objective benchmarks suggests ratings reflect genuine capability differences, not in-group favoritism.

Inter-Judge Agreement (Statistical Validation)

Based on N = 10,565 evaluations with 3 valid judge scores (baseline condition):

| Dimension | ICC(3) | Mean r | Within-1 | Quality |
|---|---|---|---|---|
| Aggression | 0.932 | 0.835 | 94.1% | Excellent |
| Hedging | 0.897 | 0.816 | 61.3% | Good |
| Warmth | 0.886 | 0.786 | 70.2% | Good |
| Tribalism | 0.852 | 0.662 | 95.5% | Good |
| Grandiosity | 0.829 | 0.686 | 83.7% | Good |
| Transgression | 0.827 | 0.648 | 89.3% | Good |
| Authenticity | 0.825 | 0.693 | 61.4% | Good |
| Depth | 0.813 | 0.751 | 61.9% | Good |
| Formality | 0.724 | 0.632 | 66.9% | Moderate |
| OVERALL | 0.843 | 0.723 | 76.0% | Good |

Key metrics:

  • ICC(3): Intraclass correlation for average of 3 raters (reliability of final score)

  • Mean r: Average pairwise Pearson correlation between judges

  • Within-1: Percentage of cases where judges differed by ≤1 point
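For reference, ICC(3,k) and the within-1 statistic can be sketched as follows (a two-way, consistency-type ICC for the average of k fixed raters; the ratings matrices are illustrative):

```python
import numpy as np

def icc3k(ratings):
    """ICC(3,k): reliability of the AVERAGE of k fixed raters
    (two-way mixed, consistency). ratings: n_items x k_judges."""
    x = np.asarray(ratings, float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between items
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between judges
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

def within_1(ratings):
    """Share of items where all judges agree within 1 point."""
    x = np.asarray(ratings, float)
    return float((x.max(axis=1) - x.min(axis=1) <= 1).mean())
```

Perfect agreement yields ICC(3,k) = 1.0; judge-specific noise pulls it below 1.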

Interpretation: Overall ICC(3) = 0.843 indicates good reliability (benchmark: >0.75). 8 of 9 dimensions show “Good” or “Excellent” agreement. Only Formality (ICC = 0.724) shows “Moderate” reliability.

Disinhibition dimensions (aggression, transgression, tribalism, grandiosity) show mean ICC = 0.860, supporting reliable measurement of the key H1/​H2 constructs.

Full analysis: See JUDGE_AGREEMENT_ANALYSIS.md

6.2 Other Methodological Considerations

  • Prompt design: Scenarios may not fully capture real-world deployment contexts

  • Sample selection: Model selection prioritized major providers; smaller/​specialized models underrepresented

  • Median-split approach: May be too simplistic as evidence suggests that while median-split has a strong effect and is statistically significant there may be a 3rd transitional class or a continuum; need more robust model population

  • Behavioral dimensions: Definitional overlap and shared variance imply the constructs are effective but partially conflated

  • External convergent validity: A third external eval is needed; GPQA and ARC-AGI reflect reasoning performance well but ARC-AGI needs more model population to be a truly robust second check

  • Judge design: Judges share a single rubric with no blind check, and the n=3 judges are given the claims being made; ceiling and floor effects may be significant with a 1-10 rubric; having judges evaluate every dimension at once may fundamentally skew results and be model-performance dependent; human evaluation is hard to achieve, so scoring currently depends on LLMs, barring programmatic NLP approaches


7. Future Directions

  • H1: Address whether two distinct groups via median split is accurate and useful—test N=3 (transitional group in middle) and/​or evidence for a natural continuum with no capability jumping (though hard to do with model release capability stair stepping)

    • Assess external eval samples to analyze for transitional group or continuum (early indications imply as much)

  • H1: Formalize a robust and standardized baseline v2 prompt suite leveraging empirically determined high frequency end consumer queries

  • H1: Formalize a robust and standardized dimensions v2 prompt suite to assess extremes

  • H2: Address and control for prompt sensitivity influence across models/​providers

  • H2: Test broader judge diversity and alternation (e.g. swap capable judges at random) and assess agreement

  • H2: Test judging of one dimension at a time and assess agreement

  • H2: Test broader generalizability to multi-turn chat flows and separately to semi-autonomous agentic workflows

  • H2: Identify a 3rd external benchmark for high-low sophistication/​disinhibition comparison to button up convergent validity

  • H2: Address thinking vs. non thinking variants, compare total estimated thinking time (example proxy is #chat turns with thinking on)

  • H1/​H2: Address model size (in parameters) and effect sizes/​clusters

  • H3: Cross condition comparison—condition/​intervention influence on H1/​H2 - incl. which interventions improve soph/​dis ratio, at what cost and what tradeoffs?

  • H3: Inspect ‘constrained’ phenomena more deeply across interventions/​conditions, model providers, versions/​families

  • H3: Address provider differences between conditions, models, and model versions/​families


8. Preliminary: H3 Intervention Effects

🚧 Work in Progress

This section presents preliminary analysis of intervention effects on the sophistication-disinhibition relationship. H3 hypothesis testing is ongoing. Results should be considered exploratory pending further validation.

8.1 H3 Hypothesis

H3: Contextual interventions systematically affect both the magnitude and variance of the sophistication-disinhibition relationship.

8.2 Current Evidence: Response Variability

| Condition | N | Mean | SD | CV% | Var Ratio |
|---|---|---|---|---|---|
| minimal_steering | 46 | 1.33 | 0.081 | 6.1% | 0.17 |
| telemetryV3 | 46 | 1.33 | 0.142 | 10.7% | 0.54 |
| baseline | 45 | 1.54 | 0.193 | 12.5% | 1.00 |
| authority | 45 | 1.64 | 0.264 | 16.1% | 1.88 |
| reminder | 46 | 1.80 | 0.460 | 25.6% | 5.69 |
| urgency | 45 | 2.38 | 0.842 | 35.4% | 19.10 |

Most consistent: minimal_steering. Most variable: urgency.
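The CV% and variance-ratio columns can be reproduced from the table's own means and SDs (a small sketch, checked here against the urgency row):

```python
def variability_stats(mean, sd, baseline_sd):
    """CV% and variance ratio vs the baseline condition."""
    cv_pct = 100 * sd / mean                 # coefficient of variation
    var_ratio = sd ** 2 / baseline_sd ** 2   # condition var / baseline var
    return cv_pct, var_ratio

# Urgency condition: mean 2.38, SD 0.842; baseline SD 0.193
cv, vr = variability_stats(2.38, 0.842, 0.193)  # ~35.4%, ~19.0
```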

8.3 Current Evidence: Cross-Condition ANOVA

  • F(4, 176) = 67.99

  • p < .0001

  • η² = 0.476

Sphericity violated (ε = 0.288), Greenhouse-Geisser corrected p < .0001

Significant Pairwise Comparisons

| Comparison | t | p | g | Sig |
|---|---|---|---|---|
| authority vs baseline | 5.13 | < .0001 | 0.43 | Yes |
| authority vs minimal_steering | 8.77 | < .0001 | 1.59 | Yes |
| authority vs telemetryV3 | 8.73 | < .0001 | 1.45 | Yes |
| authority vs urgency | -7.42 | < .0001 | -1.17 | Yes |
| baseline vs minimal_steering | 8.49 | < .0001 | 1.42 | Yes |
| baseline vs telemetryV3 | 7.68 | < .0001 | 1.23 | Yes |
| baseline vs urgency | -7.81 | < .0001 | -1.36 | Yes |
| minimal_steering vs urgency | -8.64 | < .0001 | -1.74 | Yes |
| telemetryV3 vs urgency | -8.76 | < .0001 | -1.72 | Yes |

8.4 Preliminary Interpretation

Constraint vs. Pressure Interventions

-Work in Progress-

[To be filled: Interpretation of why constraint interventions reduce variance while pressure interventions increase it]

Intervention Mechanism Hypotheses

-Work in Progress-

[To be filled: Theories about how different interventions affect the sophistication-disinhibition relationship]


Appendix A: Factor Structure

Why 9 Dimensions → 2 Composites

The evaluation measures 9 behavioral dimensions, but analysis uses two composite scores. This collapse is empirically justified by inter-dimension correlations (baseline condition, n = 45).

Sophistication: 2 → 1

| Pair | r |
|---|---|
| depth ↔ authenticity | 0.964 |

Depth and authenticity correlate at r = 0.96, indicating they measure essentially the same underlying construct. Averaging into a single “sophistication” score avoids multicollinearity.

Disinhibition: 4 → 1

| Pair | r |
|---|---|
| transgression ↔ aggression | 0.966 |
| tribalism ↔ grandiosity | 0.811 |
| transgression ↔ tribalism | 0.783 |
| aggression ↔ tribalism | 0.775 |
| aggression ↔ grandiosity | 0.620 |
| transgression ↔ grandiosity | 0.573 |

Average inter-correlation: r = 0.755 (range: 0.57–0.97)

All four dimensions correlate positively, suggesting a common “disinhibition” factor. Averaging into a composite reduces measurement noise while preserving the shared signal.

Cross-Factor Correlations

| Sophistication | Disinhibition | r |
|---|---|---|
| authenticity | aggression | 0.805 |
| authenticity | transgression | 0.779 |
| depth | grandiosity | 0.728 |
| depth | aggression | 0.690 |
| authenticity | grandiosity | 0.667 |
| depth | transgression | 0.651 |
| authenticity | tribalism | 0.597 |
| depth | tribalism | 0.560 |

Average cross-factor: r = 0.685 (range: 0.56–0.81)

Sophistication and disinhibition are correlated (supporting H2) but not redundant—they remain distinguishable constructs.

Full Correlation Matrix

|        | depth | authen | transg | aggres | tribal | grandi |
|--------|-------|--------|--------|--------|--------|--------|
| depth  | 1.000 | 0.964  | 0.651  | 0.690  | 0.560  | 0.728  |
| authen | 0.964 | 1.000  | 0.779  | 0.805  | 0.597  | 0.667  |
| transg | 0.651 | 0.779  | 1.000  | 0.966  | 0.783  | 0.573  |
| aggres | 0.690 | 0.805  | 0.966  | 1.000  | 0.775  | 0.620  |
| tribal | 0.560 | 0.597  | 0.783  | 0.775  | 1.000  | 0.811  |
| grandi | 0.728 | 0.667  | 0.573  | 0.620  | 0.811  | 1.000  |
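As a sanity check, the composite averages quoted above (0.755 within-disinhibition, 0.685 cross-factor) can be reproduced directly from this matrix:

```python
import numpy as np

# Baseline inter-dimension correlation matrix, ordered:
# depth, authen, transg, aggres, tribal, grandi
R = np.array([
    [1.000, 0.964, 0.651, 0.690, 0.560, 0.728],
    [0.964, 1.000, 0.779, 0.805, 0.597, 0.667],
    [0.651, 0.779, 1.000, 0.966, 0.783, 0.573],
    [0.690, 0.805, 0.966, 1.000, 0.775, 0.620],
    [0.560, 0.597, 0.783, 0.775, 1.000, 0.811],
    [0.728, 0.667, 0.573, 0.620, 0.811, 1.000],
])

soph, dis = [0, 1], [2, 3, 4, 5]

# Average pairwise r within the disinhibition block (upper triangle only)
block = R[np.ix_(dis, dis)]
within_dis = block[np.triu_indices(4, k=1)].mean()   # ~0.755

# Average r between sophistication and disinhibition dimensions
cross = R[np.ix_(soph, dis)].mean()                  # ~0.685
```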

Full analysis: See FACTOR_STRUCTURE_BASELINE.md


Appendix B: Classification Stability

Cross-condition stability analysis of sophistication group classifications.

Summary

| Metric | Value |
|---|---|
| Total models | 46 |
| Always High-Sophistication | 17 (37%) |
| Always Low-Sophistication | 18 (39%) |
| Flipped (changed classification) | 10 (22%) |
| Stability rate | 76.1% |

Median Sophistication by Condition

| Condition | Median |
|---|---|
| baseline | 5.93 |
| authority | 6.72 |
| minimal_steering | 5.17 |
| reminder | 6.83 |
| telemetryV3 | 5.02 |
| urgency | 6.17 |

Range: 5.02 − 6.83

Flipped Models (Transitional Class)

Models that changed classification across conditions:

| Model | High Conditions | Low Conditions | Avg Soph |
|---|---|---|---|
| Claude-3.7-Sonnet | 1/6 | 5/6 | 5.51 |
| GPT-4.1 | 2/6 | 4/6 | 5.60 |
| Claude-4.1-Opus-Thinking | 5/6 | 1/6 | 6.55 |
| Claude-4-Opus | 5/6 | 1/6 | 6.37 |
| Gemini-2.0-Flash | 3/6 | 3/6 | 5.90 |
| DeepSeek-R1 | 5/6 | 1/6 | 6.42 |
| Qwen3-32B | 4/6 | 2/6 | 6.18 |
| Grok-3 | 4/6 | 2/6 | 6.11 |
| Claude-4.5-Opus-Global-Thinking | 4/6 | 2/6 | 6.26 |
| Claude-4.5-Opus-Global | 3/6 | 3/6 | 6.05 |
Claude-4.5-Opus-Global36366.05

Interpretation

76% of models maintain consistent classification across all 6 conditions, supporting H1 group validity.

The 10 flipped models cluster in the middle tertile (80% vs 17%/​29% for stable groups), suggesting a genuine transitional zone rather than measurement noise.

Full analysis: See GAP_VS_CONTINUUM_ANALYSIS.md


Appendix C: File References

Per-Condition Data & Visualizations

Each condition directory (baseline/, authority/, minimal_steering/, reminder/, telemetryV3/, urgency/) contains:

| File | Description |
|---|---|
| median_split_classification.json | H1/H2 statistics and model classifications |
| RESEARCH_BRIEF.md | Condition-specific research summary |
| all_models_data.csv | Complete dataset for external analysis |
| comprehensive_stats.json | Complete provider statistics |
| provider_comparison_stats.json | ANOVA and pairwise t-tests across providers |
| COMPREHENSIVE_STATS_REPORT.txt | Human-readable statistical summary |
| h1_bar_chart_comparison.png | H1 group comparison bar chart |
| h1_summary_table.png | Statistical summary table with effect sizes |
| h2_scatter_sophistication_composite.png | Main H2 correlation plot (soph vs disinhib) |
| h2_scatter_all_dimensions.png | 4-panel: transgression, aggression, tribalism, grandiosity |
| provider_summary.png | Combined 4-panel provider analysis |
| provider_h2_scatters.png | H2 correlation by provider (2x3 grid) |
| provider_comparison_summary.png | Provider comparison: N, sophistication, disinhibition, classification |
| provider_comparison_dimensions.png | Provider comparison: all 9 dimensions |
| all_dimensions_by_provider.png | 3x3 grid of all dimensions by provider |
| provider_dimensions_heatmap.png | Heatmap of dimensions across providers |
| visualizations/current_profiles_spider.png | Spider chart of all model profiles |

Qualitative Examples

Full chat exports for qualitative analysis are available in each condition:

<condition>/qualitative_chats/
├── dimension_extremes/   # Min/max per dimension (warmth, transgression, etc.)
├── composite_extremes/   # Sophistication/disinhibition extremes
├── percentiles/          # 5th, 25th, 50th, 75th, 95th percentile responses
└── pattern_types/        # Constrained, outlier, borderline model examples

Manifest: QUALITATIVE_MANIFEST.md

External Validation

| File | Description |
|---|---|
| EXTERNAL_VALIDATION_BRIEF.md | Combined ARC-AGI + GPQA analysis |
| external_validation_consolidated.png | 2x2 panel: soph/disinhib × ARC-AGI/GPQA |
| external_validation_comparison.png | Side-by-side benchmark comparison |
| arc_agi_validation_analysis.json | ARC-AGI correlation data |
| gpqa_validation_analysis.json | GPQA correlation data |

Prompt Design

| File | Description |
|---|---|
| BASELINE_PROMPT_INVENTORY.md | 51 scenarios across 4 suites |
| INTERVENTION_PROMPT_INVENTORY.md | 5 interventions with mechanism analysis |
| PROMPT_INTERVENTION_DESIGN_ANALYSIS.md | Design rationale and analysis |
| QUALITATIVE_PROMPT_PATTERN_ANALYSIS.md | Which prompts drive high scores |

Cross-Condition Analysis

Provider Constraint Analysis

Other Limitations


Document Version: 3.2 (Auto-generated) Generated: 2026-01-14 12:11
