Coherence-Based Measure of AGI: GPT-5 ≈ 24 %

📄 Full paper on arXiv: A Coherence-Based Measure of AGI

True general intelligence should reflect coherent sufficiency: balanced competence across all essential domains. To capture this, we propose a coherence-aware measure of AGI based on integrating generalized means across compensability regimes.
This “area-under-the-curve” (AUC) approach penalizes imbalance and tracks robustness under varying assumptions.

Applied to published CHC-based domain scores (Hendrycks et al., 2025) for GPT-4 and GPT-5, the coherence-adjusted AUC reveals that both remain far from general competence (e.g., GPT-5 ≈ 24% despite an arithmetic score of 58%).

1. Motivation

Hendrycks et al. (2025) formalized AGI as the arithmetic mean of performance across ten cognitive domains derived from the Cattell–Horn–Carroll (CHC) model of human cognition — reasoning, memory, perception, speed, etc.

While elegant, this formulation lets exceptional ability in some domains mask total failure in others: an AI with 100% reasoning but 0% memory could still appear “general.” That is not how human intelligence, or any robust system, works.

In complex systems, overall function is limited by the weakest components. Human intelligence is a case in point: it depends on coherence among its faculties. A measure of general intelligence should therefore reward balance and interdependence, not isolated peaks.

2. The Idea: From Arithmetic to Generalized Means

Let $s_i \in [0, 1]$ be the normalized proficiency in domain $i$, for $i = 1, \dots, n$.

Hendrycks et al. use the arithmetic mean:

$$\mathrm{AGI}_1 = \frac{1}{n}\sum_{i=1}^{n} s_i$$

We generalize this into the continuous family of power means:

$$M_p(s) = \left(\frac{1}{n}\sum_{i=1}^{n} s_i^{\,p}\right)^{1/p}, \qquad p \neq 0,$$

with $M_0(s) = \big(\prod_{i=1}^{n} s_i\big)^{1/n}$ taken as the limit $p \to 0$.

  • $p = 1$: arithmetic mean (fully compensatory)

  • $p = 0$: geometric mean (moderate coupling)

  • $p = -1$: harmonic mean (non-compensatory)

  • $p \to -\infty$: strict bottleneck (minimum score)

As $p$ decreases, less compensation is permitted: a weakness in one domain weighs more heavily on the aggregate, as the worked example below shows.
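
As a quick illustration (a hypothetical score vector, not data from the paper), take $s = (0.9, 0.9, 0.1)$:

$$M_1 \approx 0.63, \qquad M_0 = (0.9 \cdot 0.9 \cdot 0.1)^{1/3} \approx 0.43, \qquad M_{-1} = \frac{3}{\tfrac{1}{0.9} + \tfrac{1}{0.9} + \tfrac{1}{0.1}} \approx 0.25, \qquad \min_i s_i = 0.1.$$

The single weak domain increasingly dominates the aggregate as $p$ falls.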

Thus, to summarize robustness across regimes, we integrate over $p \in [-1, 1]$:

$$\mathrm{AGI}_{\mathrm{AUC}} = \frac{1}{2}\int_{-1}^{1} M_p(s)\, dp,$$

where the factor $\tfrac{1}{2}$ normalizes by the interval length, so an ideal system with $s_i = 1$ in every domain scores 100%.

This yields a single scalar capturing coherence under varying compensability assumptions.
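
A minimal numerical sketch of this computation, assuming equal domain weights and trapezoidal integration on a uniform $p$-grid (the function names and example profiles below are my own illustration, not the paper's code or data):

```python
import numpy as np

def power_mean(s, p: float) -> float:
    """Generalized (power) mean M_p of positive scores s."""
    s = np.asarray(s, dtype=float)
    if abs(p) < 1e-9:
        # The limit p -> 0 is the geometric mean.
        return float(np.exp(np.mean(np.log(s))))
    return float(np.mean(s ** p) ** (1.0 / p))

def agi_auc(s, p_lo: float = -1.0, p_hi: float = 1.0, n: int = 201) -> float:
    """Area under M_p over p in [p_lo, p_hi], normalized by the
    interval length so an all-ones profile scores exactly 1."""
    p_grid = np.linspace(p_lo, p_hi, n)
    curve = [power_mean(s, p) for p in p_grid]
    return float(np.trapz(curve, p_grid) / (p_hi - p_lo))

# Hypothetical domain profiles (illustrative only, not the paper's scores):
balanced = [0.58] * 10              # even competence across 10 domains
spiky = [0.99] * 6 + [0.02] * 4     # strong peaks, severe weaknesses

for name, s in [("balanced", balanced), ("spiky", spiky)]:
    print(f"{name}: arithmetic = {np.mean(s):.2f}, AGI-AUC = {agi_auc(s):.2f}")
```

On these toy profiles, the spiky vector slightly wins on the arithmetic mean (≈ 0.60 vs. 0.58) yet scores far lower on the AUC, reproducing the qualitative gap between $\mathrm{AGI}_1$ and $\mathrm{AGI}_{\mathrm{AUC}}$ described above.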

3. Results (GPT-4, GPT-5)

Figure 1: Model performance across aggregation exponents $p$. The shaded area under each curve (AUC) represents performance and coherence under varying compensability assumptions.

Even GPT-5, which looks impressive by the arithmetic metric ($p=1$), collapses under coherence-based evaluation. Weaknesses in long-term memory and adaptive reasoning dominate. The AUC reveals that these systems are not yet “general”.

| Model | Arithmetic (AGI₁) | AGI-AUC (Ours) |
| --- | --- | --- |
| GPT-4 (2023) | 27 % | 7 % |
| GPT-5 (2025) | 58 % | 24 % |
| Ideal AGI | 100 % | 100 % |

4. Interpretation

  • Arithmetic mean rewards specialization.
    Progress in one domain can inflate the overall score, hiding brittleness.

  • Geometric / harmonic means reveal interdependence.
    If even one faculty fails catastrophically, the system cannot maintain general competence.

  • AUC balances strictness and continuity.
    It doesn’t collapse to zero like the minimum, but it penalizes unevenness.

This coherence-based approach aligns more closely with out-of-distribution reasoning benchmarks like ARC-AGI and BIG-Bench, where GPT-5’s empirical performance is similar to its AGI-AUC score.
