Stop labeling, start measuring: the supervisory signal you can extract from a fixed corpus scales as N(N+1)/2 in the size of your frozen embedder panel
Chris Royse · Teleox.ai · chrisroyseai@gmail.com
Companion preprints: TCT (ResearchGate 403916407; under review at the NeurIPS 2026 Position Paper Track, forum mpQXCwkQcq) · Dynamic / ME-JEPA: An Audited, Domain-Portable World-Model Runtime on a Single RTX 5090, with a New Class of Training Data (ResearchGate 404389924; release candidate mejepa_5090_artifact_v2.0.0_rc1, Zenodo DOI 10.5281/zenodo.19977981, concept DOI 10.5281/zenodo.19953950) · Public Shakespeare LoRA (huggingface.co/cabdru/shakespeare-lora-gemma4) · Public ClipCannon repo (github.com/chrisroyse/clipcannon)
Disclosure: AI-assisted drafting, hand-revised throughout. The technical claims, scope guards, measurements, and code are mine. See §11 for the full conflict-of-interest statement.
This post defends one structural claim and one practical implication. The structural claim: a fixed raw corpus of n inputs projected through a panel of N frozen, approximately independently trained embedders yields a derived dataset of size n · (N + C(N,2)) = n · N(N+1)/2 structured supervisory signals — N per-embedder projections plus C(N,2) pairwise cross-embedder interaction features per input. The per-input yield grows quadratically in N. The practical implication: the bottleneck for extracting more supervised training data from a fixed corpus is no longer the corpus, the budget for human labelers, or the willingness to risk model collapse on synthetic data. The bottleneck is the supply of frozen, independently-trained embedders. As N grows, the per-input yield compounds. The framework I am building runs at N = 13 in the Context Graph production system today; my development branch is at N = 24; the asymptote is bounded only by how many independently-trained encoders the field is willing to ship.
I want this taken seriously, and I specifically want the implication that the bottleneck is the embedder supply taken seriously. Anyone who reads this and then goes off to train a new specialised frozen encoder for some niche the existing panel does not cover is contributing to a research program in which the marginal value of each new independent encoder is a quadratically-compounding multiplier on the supervised signal extractable from every existing corpus.
I am applying for the Anthropic Fellows Program July 2026 cohort and the Constellation Astra Fellowship around an experiment that measures the realised information density of the panel (the N_eff audit). I am stating the application context up front rather than burying it. The post is not part of either application packet.
Three scope guards travel with every claim. Quoted verbatim from the parent manuscript so the reader can hold me to them.
The per-input yield N + C(N,2) is a constructive count over the derived dataset, not an information-theoretic lower bound. The realised information density per pair is bounded above by the marginal entropy of each embedder and may contract under heavy embedder redundancy. EXP-2 pairwise mutual-information audit is named below.
Shumailov regime: DDA pipelines are structurally outside generator-in-loop recursion (scope claim, not refutation).
Meaning compression as a fourth taxonomic entry: axis-extension of the compression-as-intelligence frame, not a re-fit of scaling exponents; reviewer acceptance of the taxonomic distinction is a scholarly judgement I do not pre-empt.
If you read past those guards as if they were not there, you will read the headline stronger than I am writing it.
1. The counting identity
Let D = {x_i} for i = 1..n be a fixed raw corpus of n inputs. Let Φ = {φ_m} for m = 1..N be a panel of N frozen, approximately independently trained embedders, each φ_m : X → R^{d_m}. Frozen means parameters are held fixed throughout the pipeline and through any downstream training. Approximately independent carries the engineering-operational hedge that strict statistical independence between embedders is an open measurement question rather than a derived property; I return to it in §3 and §7.
For pairs (j, k) with j < k, define a cross-embedder interaction feature ρ_jk(φ_j(x_i), φ_k(x_i)) as a non-constant function of its two arguments. The canonical choices are normalised cosine, a pairwise mutual-information estimate, or a concatenation followed by a frozen adapter. The derived dataset D' materialises every per-embedder projection and every pairwise interaction on every raw input:
D' = { ( x_i , φ_1(x_i), …, φ_N(x_i) , (ρ_jk(φ_j(x_i), φ_k(x_i)))_{1 ≤ j < k ≤ N} ) }_{i=1..n}
Counting structured supervisory signals yields the DDA counting identity:
|D'| = n · ( N + C(N,2) ) = n · N(N+1)/2. (1)
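For the reader who wants the one-line algebra behind the second equality in (1): N + C(N,2) = N + N(N−1)/2 = (2N + N² − N)/2 = N(N+1)/2.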
The decomposition is n · N per-embedder labelled vectors plus n · C(N,2) pairwise cross-embedder interaction features. The per-input yield is N + C(N,2). The yield is quadratic in N. At N = 7 the per-input yield is 28; at N = 13 it is 91; at N = 24 it is 300; at N = 100 it is 5,050.
A signal here is a scalar or low-dimensional feature that (i) is a deterministic function of a raw input under frozen embedders and frozen interaction rules, (ii) is attached to a specific (input, embedder) or (input, embedder-pair) index, and (iii) can be used as input or target in a downstream loss without new data collection. DDA is therefore not synthetic-data generation (it decomposes D rather than synthesising new inputs), not data augmentation (the input is unchanged), and not distillation or self-training (no teacher–student relationship; no pseudo-labels). It is also not labelling in the human-rater sense. Every signal in D' is a deterministic measurement from a frozen instrument; no human is in the loop and no synthetic generator is in the loop.
Why “constructive” matters and what it leaves open. Identity (1) counts entries in D'. Each entry is a real measurement under the frozen-embedder axiom. The realised information per pair is a different question. Under approximate orthogonality, the pairwise mutual-information terms I(φ_j(X); φ_k(X)) are small relative to the marginal entropies and the pairwise count carries non-trivial information. When two embedders collapse to a linear reparameterisation of each other, their pair contributes near-zero additional information and the information-side gain from that pair contracts toward zero, even though the entry still exists in D'. This is the open hole I name below as the load-bearing experiment to tighten: a pre-registered pairwise mutual-information audit on the panel that returns a measured N_eff and a per-pair information histogram. The audit cannot falsify Identity (1) (the entries exist by construction). It can show that the realised multiplier on supervisory information density sits below the entry-count multiplier by some factor that depends on the panel’s diversity.
2. Two production demonstrations that the labeller role works at scale
Two engineering substrates instantiate the counting identity at production scale, in different modalities. They are not the contribution. They are constructive evidence that the labeller role is implementable rather than hypothetical, and that the N in (1) is a real number on a real codebase rather than a notational placeholder.
Context Graph: text-side decomposition at N = 13 in production. A Rust workspace of roughly 370,000 source lines across ten crates, embedding every stored memory through 13 independent frozen embedders simultaneously, persisting all thirteen projections plus topic profiles and pairwise synergy features to a RocksDB store with 59 column families, exposing retrieval as 75 MCP tools, and running 5,184 in-process tests. The thirteen embedders span 11,008 dense dimensions, two 30,522-term sparse vocabularies (E6 lexical, E13 SPLADE; Formal et al., 2021), and a variable-length 128-per-token late-interaction space (E12 ColBERTv2; Santhanam et al., 2022). The remaining slots are e5-large-v2 for general semantics (E1), sinusoidal/Fourier temporal encodings (E2–E4), an asymmetric nomic-embed causal embedder (E5), a Qodo code embedder at 1,536-D (E7), an e5-large-v2 structural variant for graph edges (E8), a 10,000-bit hyperdimensional typo-tolerant encoder projected to 1,024-D (E9), an e5-base-v2 paraphrase-asymmetric variant (E10), and KEPLER (Wang et al., 2021) for entity geometry (E11). E5, E8, and E10 store dual vectors (cause/effect, source/target, document/query). All thirteen are frozen.
I ran the pipeline end-to-end on the Project Gutenberg Complete Works of Shakespeare (5.4 MB plain text) on 2026-04-14 on a single RTX 5090 Blackwell. The chunker produced 1,552 scene/sonnet/poem chunks; 2,741 of those passed quality gating and were ingested. The pipeline produced 249,431 labelled training signals (2,741 × 91), 13,465 cross-work contrastive anomaly pairs, and 44 per-work geometric constellations, totalling roughly 5.92 million derived features. Disk-storage form: a 120 MB compressed Parquet whose uncompressed payload is 1,551 MB. End-to-end wall time was about 85 minutes. A side-effect demonstration of the multi-encoder geometry: with no supervision, the system clustered eight of nine of Shakespeare’s English-king history plays (1 Henry IV, 2 Henry IV, Henry V, 2 Henry VI, 3 Henry VI, Richard II, plus close neighbours) into one tight region at pairwise centroid cosine 0.98+, separating them from comedies, tragedies, sonnets, and longer poems by pure geometry over the 13 frozen embedders. RocksDB column-family counts after the run: fingerprints 3,199; training_records 2,770; constellations 45; contrastive_pairs 13,465; audit_log ~3,000.
My current development branch runs the same architecture at N = 24. The new embedder slots cover modality combinations the N = 13 panel did not (additional language-specific encoders, additional structural encoders, additional temporal encoders). The per-input yield rises from 91 to 300 by Identity (1). I have not yet run the N = 24 panel end-to-end on a corpus the size of Shakespeare, so I am not reporting a new measured multiplier; the count grows by construction.
ClipCannon: video-side decomposition at N = 7 in production. A Python package of roughly 67,585 source lines running a 23-stage analysis DAG over source video, exposing 58 MCP tools, totalling 4,044 dimensions across seven modalities: visual (SigLIP-SO400M at 1,152-D; Zhai et al., 2023), semantic (nomic-embed-text-v1.5 at 768-D), emotion (wav2vec2-large at 1,024-D), speaker (WavLM-large at 512-D; Chen et al., 2022), prosody (custom 12-D F0/energy/rate/contour), sentiment (MiniLM-L6-v2 at 384-D), and voice identity (ECAPA-TDNN at 192-D; Desplanques et al., 2020). At N = 7, the per-input yield is 7 + 21 = 28 structured signals per source clip.
I ran the pipeline on a 975-second (16 minute, 15 second) interview video of a single subject (the “Santa” identity). The pipeline separated the speaker from the interviewer (17 interviewer segments removed via interviewer_ranges.npz), curated 2,362 training clips of 49 frames each (25 FPS, ~1.96 s per clip) for an EchoMimicV3 LoRA at rank 256 / α = 512 / 9 attention modules, and extracted Santa-only modality counts of 1,819 visual / 192 semantic / 362 emotion / 188 prosody / 188 voice / 154 sentiment / 3,177 FLAME expression frames. The Phase-1 constellation construction produced eight named behavioural constellations (calm, attentive, amused, curious, energetic, contemplative, warm, mischievous) at the top level, populated by 34 manually-defined micro-expression skills, populated by 40 K-means-discovered micro-expression groups, populated by 196 individual FLAME × FACS Action Units the pipeline observed in the source video. The verification surface is 61 + 22 unit tests against the live constellation. The forensic-narrative section of the parent paper documents six bug-fixes from the six-day sprint that produced the corrected dataset, including a flow-matching sign inversion in target = noise − latent, a q_audio LoRA-target name-shadowing typo, a frame-rate sampling mismatch making clips 2.4× too short, and mouth-bbox geometric ratios landing on the subject’s nose. The corrected pipeline ships 403 passing tests across the codebase.
The Santa case is the multimodal companion to the Shakespeare case. Same construct, different modality, smaller N (7 vs 13), much higher per-frame information content (4,044 dims per clip across modalities). The point is not the comparison between the two cases. The point is that the labeller role of the counting identity runs end-to-end on real text and on real video, at production scale, with no humans in the labelling loop.
3. The fourth-taxonomic-entry argument
Conventional compression literature asks “for fixed semantic content, how few bits, weights, or activations can I use?” The three established categories are well-defined in the ML systems literature. Bit compression (gzip, zstd, FLAC, PNG, arithmetic coding; Delétang et al., 2024 for the LLM-as-compressor framing) measures bits of compressed output per bit of raw input. Weight compression (post-training quantisation such as GPTQ, one-shot pruning such as SparseGPT, knowledge distillation) measures task accuracy per trainable parameter. Activation compression (KV-cache quantisation, TurboQuant) measures peak runtime memory per unit of output quality.
Meaning compression flips the question. For fixed raw-data volume, how many structured supervisory signals can I extract? Define the meaning-compression ratio as the per-input yield of Identity (1):

MC(N) = |D'| / n = N + C(N,2) = N(N+1)/2 structured supervisory signals per raw input. (2)
The unit pair is semantically distinct from the three established categories. A corpus compressed with gzip loses no information and produces no additional supervisory signals. A model pruned with SparseGPT is a different model on the same training data. TurboQuant changes what must be held in memory during a forward pass but not the signal density driving training. Meaning compression changes the training-data side while the raw corpus is held constant. The four operate on different unit pairs, at different pipeline stages, and a single pipeline can realise all four simultaneously without contradiction.
| Category | Input unit | Output unit | Canonical techniques |
| --- | --- | --- | --- |
| Bit compression | bits of raw data | bits of compressed data | gzip, zstd, FLAC, PNG, arithmetic coding |
| Weight compression | trainable parameters | task accuracy at those parameters | GPTQ, SparseGPT, distillation |
| Activation compression | peak runtime memory | output quality at that memory | TurboQuant, KV-cache quantisation |
| Meaning compression | units of raw data | structured supervisory signals | DDA (this proposal) |
I am not claiming meaning compression supersedes any of the other three. I am claiming the unit pair is semantically distinct enough to deserve its own slot in the taxonomy, and that the ratio (2) — quadratic in N and bounded above only by the size of the available embedder panel — is the parameter the field has not been treating as a research target on its own.
I expect reviewer pushback on the taxonomic distinction. The strongest version of the pushback is “this is just multi-view representation learning under a new label.” I have two responses. First, multi-view literature evaluates panels of encoders as retrieval instruments with metrics like NDCG, Recall@k, and equal-error-rate, where the panel’s outputs are the endpoint and ranking-lift is the deliverable; the counting identity instead treats the same panel as a labeller that produces a structured supervisory dataset on a fixed raw corpus, which is a different operational role for the same primitive. Second, I am not arguing the move is conceptually impossible to derive from prior work; I am arguing that, pool-relative to my locked 88-citation reference set, no cited paper jointly states (i) the counting identity, (ii) the per-raw-data-unit ratio view, and (iii) the framing as a category alongside the three existing compression types, and names the embedder supply as the bottleneck on the ratio. A pool-external counter-example would qualify the framing without invalidating the construct, and I would value the pointer.
4. Why DDA pipelines sit structurally outside the Shumailov collapse regime
The recursive self-training literature, anchored by Shumailov et al. (2024) in Nature and developed in Dohmatob et al. (2024) on modified scaling laws and Alemohammad et al. (2023) on model autophagy, establishes an irreversible distributional drift under a generator-in-loop recursion. The precondition across every cited theorem and simulation is structural: there is a sequence of corpora {D_t} such that D_{t+1} contains samples drawn from a generator trained on D_t, formally D_{t+1} ∋ x̃ ~ p_{θ_t}.
DDA pipelines do not instantiate this recursion. By the frozen-embedder axiom, each φ_m(x_i) in D' depends only on a real input x_i ∈ D and on the frozen parameters of φ_m; no generator trained on D or on a predecessor corpus participates in the derivation. The derived dataset is a decomposition of the real corpus, not a sample from a generator. The generator-in-loop precondition of the cited collapse theorems is not satisfied by construction. DDA pipelines are structurally not subject to the Shumailov-form recursion regime, rather than refuting it.
This is a scope argument, not an immunity theorem. I am claiming the preconditions of the cited theorems are unsatisfied for DDA; I am not claiming a new independence-from-collapse result. Two boundary cases sit explicitly out of scope. First, embedder-mediated data selection (using φ-scored similarity to filter or re-weight D before training) is a different feedback structure that the cited theorems do not cover, and any drift it might induce would be a separate empirical question. Second, a variant that re-computed the centroid or the panel from generator outputs would re-introduce a generator-in-loop structure, and the scope argument would need reassessment; the protocol as specified does not do this.
The implication for the data-wall framing of Villalobos et al. (2024) is narrow but specific. The standing menu has two items: Path 1 acquires more real data through licensing and partnerships, with deals like News Corp / OpenAI in the hundreds of millions per partnership (Spangler, 2024), and Path 2 generates more synthetic data via self-distillation, accumulation per Gerstgrasser et al. (2024), or oracle verification per Feng et al. (2024). Both still operate on samples whose generation depends on a model trained on a prior round’s data. Neither has a theoretical place for the configuration in which a fixed corpus D is held constant and N frozen independently-trained embedders compute deterministic projections φ_m(x). The cited theorems are silent on this configuration. A third path (decompose a fixed corpus through frozen embedders rather than acquire or synthesise) is what the silence names, and the size of that third path is parameterised by the embedder count.
5. Two diagnostic findings where the panel surfaced bugs that single-encoder eval missed
The strongest empirical case for the counting identity being a real multiplier of supervisory information rather than notational accounting comes from two engineering findings on systems that single-embedder evaluation would have shipped as healthy.
The Shakespeare LoRA training-data echo. I trained a rank-256 Shakespeare stylistic LoRA on Gemma-4-E4B-it (α = 512, 258 target modules, two-stage SFT-then-DPO on a 13-embedder DDA-projected derivative of the Shakespeare canon: 7,925 SFT pairs and 11,724 DPO pairs; final SFT loss 1.12 with 98.9% token accuracy, final DPO loss 0.47 with 98.0% rewards/accuracies and rewards/margins 43.62; published at huggingface.co/cabdru/shakespeare-lora-gemma4 under Apache-2.0). 5,954 of 8,857 SFT examples (67%) began with play-script header prefixes that the initial training pipeline silently learned to echo. A single-embedder style-cosine evaluation on dense E1 was insufficient to surface this; E1 absorbed the header signature into the style distribution. The multi-embedder STYLE_CENTROID guard caught the echo because the sparse panels (E6 lexical, E13 SPLADE) and the late-interaction E12 panel surfaced the header signature as a high-cosine outlier that the dense panel had normalised away. The bug-detection mechanism is structurally the counting identity at work: a signal that one row of D' cannot see is visible to a different row, and the conjunction of rows surfaces what any single row hides. The same LoRA, when given a Spanish-language prompt, also produced unprompted Siglo de Oro register prose despite the training corpus containing no Spanish text — single-instance observation, with caveats in the parent paper §6.1.4, but consistent with the read that several of the thirteen embedders carry signal components that are not language-specific.
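A minimal sketch of the cross-panel outlier check behind the STYLE_CENTROID catch, assuming per-embedder centroids are already built; the z-score rule, the threshold, and the function name are my illustrative choices, not the production guard:

import torch
import torch.nn.functional as F

@torch.no_grad()
def panel_disagreement(sample_projs, centroids, z_thresh=3.0):
    # sample_projs, centroids: parallel lists of per-embedder vectors for one sample.
    # Flags embedders whose centroid-cosine departs sharply from the panel consensus;
    # a signal one row of D' normalises away shows up as an outlier in another row.
    cos = torch.tensor([
        float(F.cosine_similarity(p.flatten(), c.flatten(), dim=0))
        for p, c in zip(sample_projs, centroids)
    ])
    z = (cos - cos.mean()) / (cos.std() + 1e-8)
    return [m for m, z_m in enumerate(z) if abs(float(z_m)) > z_thresh]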
The Santa avatar flow-matching sign-bug. Initial EchoMimicV3 LoRA training used an inverted target convention, target = latent − noise rather than target = noise − latent. Loss curves looked superficially healthy and converged smoothly while the trajectory the model was being asked to learn was inverted. Single-loss-curve diagnostics did not flag the divergence. The constellation-side audit, which re-embeds generated frames through SigLIP and validates them against the per-modality centroids constructed from the reference set, surfaced the disagreement: the generations were converging in pixel space but diverging in the seven frozen-encoder geometry. The bug was one of six fixes documented in the parent paper §6 and the ClipCannon white paper §6.2 / §10. Both findings were caught by the audit role of the same panel that produces the supervisory signals; the panel-as-diagnostic is not an afterthought, it is the same construct as the panel-as-labeller.
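For concreteness, the sign convention at issue in the Santa bug, written as a function; this is a sketch of the convention only, not the EchoMimicV3 training code:

import torch

def flow_matching_target(noise: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
    # Corrected convention: target = noise − latent. The inverted form
    # (latent − noise) trained with smooth, healthy-looking loss curves;
    # only the frozen-panel geometry audit surfaced the inversion.
    return noise - latent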
6. The bottleneck is the embedder supply
The practical implication of (1) and (2) sits in this section. The standard framing of the data wall asks how to acquire more raw text, or how to generate more synthetic text, or how to filter what synthetic text we already have. All three framings hold the panel Φ constant — usually implicitly, by assuming a single dense encoder is the right unit of measurement and that the question is what to feed into it.
Identity (1) reframes the question. Hold the corpus D constant. Vary Φ. The supervisory signal yield is n · N(N+1)/2. The leverage point is not the corpus or the labeling budget. It is N and the diversity of the panel. Adding one new independently-trained encoder to a panel of size N adds 1 + N new signals per input — itself, plus its pairwise interaction with every existing member. At N = 13 adding the 14th encoder adds 14 new signals per input. At N = 24 adding the 25th encoder adds 25. At N = 100 adding the 101st adds 101. The marginal value of each new independent encoder grows linearly in the panel size that already exists.
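The arithmetic: moving from N to N + 1 encoders raises the per-input yield from N(N+1)/2 to (N+1)(N+2)/2, a difference of (N+1)(N+2−N)/2 = N+1.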
This reframes “what would help most with the data wall” in a specific and actionable direction. Build new specialised independently-trained frozen encoders. Niches the existing panel does not cover: long-form discourse-level encoders, encoders trained on programming-language ASTs rather than tokens, encoders for legal or medical or scientific domains where a single large dense encoder leaves structural information on the table, encoders for non-English language families, encoders for non-text modalities the field has not built strong frozen instruments for yet. Each new independent encoder, once frozen, contributes 1 + N new supervisory signals per input to every existing corpus DDA decomposition, retroactively, without re-touching the corpus.
The current Context Graph production panel (N = 13) is what I had infrastructure for in April 2026. The development panel (N = 24) is what I am running today. Both are tiny relative to the size the panel can grow to as the field ships more independently-trained frozen encoders. Anyone reading this who has trained a strong frozen encoder on a niche the panel does not cover has, in my reading, contributed a multiplier on the supervisory signal extractable from every fixed corpus that the panel is then applied to. The bottleneck is not corpora, not human labelers, not synthetic-data risk tolerance. It is the embedder supply.
7. Where the construct is weakest
Three places the framework can be attacked, in priority order. I am stating these as open holes rather than as defended positions. If the post receives one substantive critique, the most useful place for it to land is here.
Realised information per pair is empirical, not derivable. Identity (1) counts entries in D'. Each entry is a real measurement under the frozen-embedder axiom. The realised information per pair is bounded above by min(H(φ_j), H(φ_k)) and contracts as the pair’s mutual information rises. Heavy redundancy between any pair (for instance E1, the e5-large-v2 semantic encoder, and E10, a paraphrase variant of the same family, both dense-text encoders on overlapping pretraining) would shrink the realised information from that pair below the entry’s nominal weight. The pre-registered pairwise mutual-information audit (EXP-2 in §8) is the falsification test: estimate I(φ_j(X); φ_k(X)) for each of the C(N,2) pairs via MINE or partitioned-KSG-adjacent estimator on a stratified sample of 100 to 1,000 representative inputs; report a redundancy histogram and a measured N_eff. The audit cannot falsify Identity (1); it can show the realised information density sits below the entry-count multiplier by a measurable factor.
The taxonomic-acceptance argument is a reviewer judgement, not a derivation. “Meaning compression deserves a fourth slot” is a framing claim about how the literature should be organised; it cannot be proven from the counting identity itself. A reviewer can accept (1) and (2) as true and still reject the taxonomic distinction on the grounds that the identity is more naturally read as multi-view representation learning, multi-encoder retrieval, or labelled-signal extraction in a known sense. I think the unit-pair argument carries weight (no other category measures structured signals per raw-data unit at the dataset-preparation stage), but the strength of that argument depends on accepting the unit pair as semantically distinct from bits-per-bit, accuracy-per-parameter, and memory-per-output-quality, and that acceptance is not derivable from (1).
The downstream sample-efficiency claim is named future work, not present evidence. I am not claiming, in this post or in the parent manuscript, that DDA-enriched supervision actually improves sample efficiency in the sense of refitted Kaplan-style or Chinchilla-style scaling exponents. The counting identity says how many structured signals you can extract; it does not say what happens to the loss exponents when you train on those signals at matched compute against a non-DDA baseline. An experiment that re-fits scaling exponents on a DDA-enriched corpus is a multi-million-dollar GPU-hour experiment I cannot run on a graduate stipend. Until that lands, the contribution is the counting identity, the ratio view, the fourth-taxonomic-entry framing, the embedder-supply bottleneck argument, and two production substrates that demonstrate the labeller role implementable at scale across two modalities. The downstream sample-efficiency claim is the natural place where the framework either pays off or does not, and I am explicit that the present evidence does not yet reach there.
8. Three experiments that would tighten or kill the framework
Each is instrument-ready in the sense that the input data, the panel, and the evaluator are specified; only execution and writeup remain.
EXP-2 is the load-bearing one for this post. A pre-registered pairwise mutual-information audit on the 13-embedder Context Graph panel (extending naturally to the 24-embedder development panel): estimate I(φ_j(X); φ_k(X)) for each pair via MINE, partitioned-KSG, or InfoNCE lower-bound on a stratified sample of 100 to 1,000 representative inputs; report a redundancy histogram and a measured N_eff. The realised information density of the panel becomes a defended capacity rather than a constructive ceiling. This is the experiment that converts the framework from “the entries exist by construction” into “the entries carry this measured amount of non-redundant supervisory information.” It also produces the per-encoder marginal-value table that lets the field decide which gaps in the embedder panel are worth filling first.
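A minimal sketch of the redundancy-audit shape, using linear CKA as a cheap stand-in for the MI estimators named above (MINE, partitioned-KSG, or InfoNCE would replace linear_cka) and a participation ratio as one candidate definition of N_eff; both substitutions are my assumptions, not choices fixed by the parent paper:

import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    # x: (n, d_j), y: (n, d_k) projections of the same n inputs.
    x = x - x.mean(0)
    y = y - y.mean(0)
    # Linear CKA: 1.0 for a fully redundant pair, ~0 for an unrelated pair.
    return float((x.T @ y).norm() ** 2 / ((x.T @ x).norm() * (y.T @ y).norm()))

def panel_n_eff(projections: list[torch.Tensor]) -> tuple[float, torch.Tensor]:
    # projections: N tensors, each (n, d_m), one per frozen embedder.
    n_panel = len(projections)
    s = torch.eye(n_panel)
    for j in range(n_panel):
        for k in range(j + 1, n_panel):
            s[j, k] = s[k, j] = linear_cka(projections[j], projections[k])
    # Participation ratio of the redundancy matrix: equals N for a fully
    # independent panel (S = I) and 1 for a fully redundant one (S = all-ones).
    n_eff = float(s.trace() ** 2 / (s ** 2).sum())
    return n_eff, s  # s is also the input to the per-pair redundancy histogram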
EXP-1 attacks the runtime-guard side rather than the counting side. An ECAPA-TDNN cross-encoder re-score of the Case 3 voice argmax candidates from the parent paper (and ideally the full 120-candidate pre-selection distribution) on the same audio: report mean and max ECAPA SECS, Pearson r and Spearman ρ vs WavLM, and a Bland-Altman plot. The headline on the parent paper’s Case 3 voice is currently encoder-matched and within-WavLM-family; if cross-encoder agreement holds at mean ECAPA ≥ 0.85 with Pearson ≥ 0.5, it promotes to cross-encoder identity ranking. This is mostly relevant to the per-output runtime guard G_τ (§9 below), not to the counting identity directly.
EXP-3 is a sample-complexity sweep on the centroid construction. A four-point reference-set-size curve at n_m ∈ {10, 25, 50, 100} on the same Case 3 scorer and held-out set. The DDA axiom asserts n_m ∈ [10, 10²] suffices for a stable centroid; the empirical curve has one observation at n_m = 50, and a 4-point curve would anchor the sample-complexity claim at more than one operating point.
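A resampling sketch of the EXP-3 curve shape, subsampling a reference pool and scoring each subset centroid against the full-pool centroid; this is my stand-in for the Case 3 scorer, not the scorer itself:

import torch
import torch.nn.functional as F

def centroid_stability_curve(projections, sizes=(10, 25, 50, 100), trials=20, seed=0):
    # projections: (n_total, d) frozen-embedder outputs for one modality's pool.
    # Centroid rule matches the canonical default: L2-normalised mean of
    # L2-normalised projections.
    g = torch.Generator().manual_seed(seed)
    full = F.normalize(F.normalize(projections, dim=-1).mean(0), dim=0)
    curve = {}
    for m in sizes:
        cosines = []
        for _ in range(trials):
            idx = torch.randperm(projections.shape[0], generator=g)[:m]
            sub = F.normalize(F.normalize(projections[idx], dim=-1).mean(0), dim=0)
            cosines.append(float(sub @ full))
        c = torch.tensor(cosines)
        curve[m] = (float(c.mean()), float(c.std()))
    return curve  # n_m → (mean, std) cosine of subset centroid vs full centroid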
I am applying for the Anthropic Fellows Program (July 2026 cohort) and the Constellation Astra Fellowship (Sept 2026 cohort) with EXP-2 plus an AuditBench-adjacent calibration phase as the focused 5-month project. If anyone reading this is set up to run the MI estimator infrastructure on a 13- or 24-embedder panel, has an AuditBench-adjacent calibration setup, or wants to read §3.3 of the parent paper (the Shumailov scope claim) and tell me where it breaks, I would value the collaboration. Email chrisroyseai@gmail.com.
What would change my mind on the embedder-bottleneck framing. (i) A pool-external counter-example where a single sufficiently-large encoder produces a labelled-signal yield comparable to the multi-encoder panel at matched compute, suggesting the multi-encoder decomposition is a notational artefact rather than a structural multiplier. (ii) An EXP-2 audit that returns N_eff close to N (collapse of the pairwise term) on every panel anyone runs it on, falsifying the practical content of the identity even though the upper bound holds. (iii) A demonstration that further-trained-from-the-same-base encoders produce indistinguishably useful pairwise interaction features as truly independently-trained encoders — which would weaken the “independently-trained” axiom and suggest the embedder supply is not actually the bottleneck because cheap fine-tunes can substitute.
9. One downstream consequence, briefly: the runtime guard G_τ
The same frozen panel that produces D' in the labeller role can, without parameter update, serve as a runtime guard at inference. Each φ_m evaluates a generated candidate ŷ and computes cos(φ_m(ŷ), c_m) against a frozen centroid c_m built from a per-modality reference set in a Phase 1 step (the canonical default is the L2-normalised mean of L2-normalised projections of 10 to 100 reference samples). The acceptance predicate is a strict conjunction:

G_τ(ŷ) = 1 iff cos(φ_m(ŷ), c_m) ≥ τ_m for every m ∈ {1, …, M}. (3)
Cost is O(M · d_max) for the cosine comparisons plus the embedder forward passes (one per modality per candidate). There is no auxiliary learned model. This is a downstream consequence of the same panel being usable in three semantic roles without parameter update: retriever (the conventional role in dense, sparse, and late-interaction retrieval), labeller (the role in (1)), and runtime guard (the role in (3)). It provides a per-output runtime guarantee that scalar-reward alignment in the InstructGPT (Ouyang et al., 2022), DPO (Rafailov et al., 2023), and Constitutional AI (Bai et al., 2022) tradition does not provide in the same technical sense (see also Bowman, Perez, Hubinger and colleagues' November 2025 paper, arXiv 2511.18397, on natural emergent misalignment from reward hacking in production RL). I do not develop this thread here; the parent paper §5 develops it, and §8 of the parent paper names three failure modes (encoder-matched survivorship, centroid collapse on non-diverse references, multi-modal strictness without M ≥ 4 evidence) that the runtime guard does not protect against and that I treat as load-bearing.
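A minimal sketch of (3), assuming Phase-1 centroids and per-modality thresholds τ_m are already in hand; build_centroid and guard are my illustrative names, not the production API:

import torch
import torch.nn.functional as F

@torch.no_grad()
def build_centroid(reference_projections):
    # Canonical Phase-1 default: L2-normalised mean of L2-normalised projections.
    refs = F.normalize(torch.stack([r.flatten() for r in reference_projections]), dim=-1)
    return F.normalize(refs.mean(0), dim=0)

@torch.no_grad()
def guard(candidate, embedders, centroids, taus):
    # Strict conjunction over all M frozen modalities: one failed modality rejects.
    for phi, c, tau in zip(embedders, centroids, taus):
        if F.cosine_similarity(phi(candidate).flatten(), c, dim=0) < tau:
            return False
    return True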
The relevance to this post: the runtime guard is one consequence of the three-role composition over identical frozen instances. The counting identity (1), the ratio (2), the fourth-axis framing in §3, and the embedder-supply bottleneck argument in §6 are the load-bearing claims; the runtime guard is what falls out when you ask the labeller panel to do double-duty at inference.
10. Reference implementation
A minimal reference implementation of the per-input signal-yield computation, intended to be readable rather than performant. The Context Graph Rust workspace at N = 13 is not currently a public repository; the public companion is the ClipCannon Python pipeline at N = 7 (github.com/chrisroyse/clipcannon), which implements the labeller role on video. A second public artefact, the Dynamic / ME-JEPA release candidate mejepa_5090_artifact_v2.0.0_rc1 on Zenodo (DOI 10.5281/zenodo.19977981), ships a deterministic Docker plus Apptainer/SIF reproducibility build that includes the meaning-compression instrumentation directly in the runtime: an MC-ratio subcommand, signal-yield accounting, and pairwise mutual-information audit outputs over its panel. The Python sketch below is for the AF reader who wants the labeller-side counting identity in a few dozen lines.
import torch
import torch.nn.functional as F
from typing import Callable, Optional, Sequence
from itertools import combinations

@torch.no_grad()
def derived_signals(
    x: torch.Tensor,
    embedders: Sequence[Callable[[torch.Tensor], torch.Tensor]],
    pairwise: Optional[Callable[[torch.Tensor, torch.Tensor], float]] = None,
) -> dict:
    """
    Compute the per-input row of D' for a fixed raw input x.
    Returns N per-embedder projections + C(N,2) pairwise interaction features.
    Identity: |row| = N + C(N,2) = N(N+1)/2.
    """
    if pairwise is None:
        # Default interaction: cosine after zero-padding to a common dimension.
        # Panel embedders generally differ in output dim, so raw cosine across
        # spaces is undefined; a frozen adapter or an MI estimate is the more
        # principled choice, but this keeps the sketch self-contained.
        def pairwise(a: torch.Tensor, b: torch.Tensor) -> float:
            a, b = a.flatten(), b.flatten()
            d = max(a.numel(), b.numel())
            a = F.pad(a, (0, d - a.numel()))
            b = F.pad(b, (0, d - b.numel()))
            return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
    projections = [phi(x) for phi in embedders]  # N frozen projections of x
    interactions = {
        (j, k): pairwise(projections[j], projections[k])
        for j, k in combinations(range(len(embedders)), 2)
    }
    N = len(embedders)
    expected = N + N * (N - 1) // 2  # N + C(N,2) = N(N+1)/2
    actual = N + len(interactions)
    assert actual == expected, f"signal count drift: {actual} vs {expected}"
    return {
        "N": N,
        "per_embedder": projections,
        "pairwise": interactions,
        "signal_count_upper_bound": expected,
    }

# Per-input yield grows quadratically in N. Examples:
#   N = 7   →    28 signals/input (ClipCannon production)
#   N = 13  →    91 signals/input (Context Graph N=13 production)
#   N = 24  →   300 signals/input (current development branch)
#   N = 100 → 5,050 signals/input (asymptote, embedder-supply-bound)
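A smoke test, using random linear maps as stand-ins for frozen embedders; the dims and three-slot panel here are illustrative, not the production panel:

torch.manual_seed(0)
x = torch.randn(512)
panel = [torch.nn.Linear(512, d).eval() for d in (64, 128, 256)]  # N = 3 stand-ins
row = derived_signals(x, panel)
print(row["N"], row["signal_count_upper_bound"])  # 3 6, i.e. 3·4/2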
Two things to notice. There is no learned model in derived_signals; the embedders are frozen by axiom and the pairwise function is a deterministic interaction over their outputs. The assertion N + C(N,2) is the constructive entry count; replacing the cosine pairwise with a more expressive interaction (a frozen adapter, a partitioned-KSG MI estimate) does not change the count, only the realised information content per pair. The N_eff that EXP-2 would measure is the empirical capacity of a panel against this constructive ceiling, and it is the number that turns the per-input yield from a count of entries into a defended quantity of non-redundant supervisory information.
11. Conflicts of interest
I am the founder and operator of Teleox.ai (the research operation behind Context Graph, ClipCannon, the TCT framework, and Dynamic / ME-JEPA) and Leapable.ai (a creator-knowledge marketplace MCP stack). I hold equity in both. The DDA counting identity, the meaning-compression ratio view, the fourth-taxonomic-entry framing, the embedder-supply bottleneck argument, and the two production substrates that instantiate the labeller role are methodology contributions emerging from commercial engineering practice. No third-party research funding or vendor-paid benchmark is reported. The release posture is asymmetric: the public Shakespeare LoRA (Case 1) is published byte-identical on HuggingFace; the Santa case (Case 2) and the parent paper's Case 3 voice case have subject artefacts withheld on consent and dual-use grounds documented in §Ethics of the parent paper. The Dynamic / ME-JEPA artefact (release candidate mejepa_5090_artifact_v2.0.0_rc1, Zenodo DOI 10.5281/zenodo.19977981, concept DOI 10.5281/zenodo.19953950) is fully reproducible. The public ClipCannon Python pipeline lives at github.com/chrisroyse/clipcannon. The Context Graph Rust workspace is not a public repository at time of writing.
I submitted applications to the Anthropic Fellows Program and the Constellation Astra Fellowship on 2026-05-02. This post is not part of either application; it is a separate research-side write-up of the construct for the AF/LW reader, and I am explicitly not asking anyone here to factor it into evaluation if our paths cross professionally.
I drafted this post with LLM assistance, then revised it by hand to strip the model's stylistic tics and to tighten the scope guards where I had hedged too softly. The technical claims, the measurements, the scope guards, the open-holes analysis, and the code are mine. If the open holes in §7 are stronger than I have stated them, particularly §7's first hole (realised information per pair is empirical, not derivable) and §7's third hole (no scaling-exponent re-fit), please say so explicitly. That is the most useful thing this post can receive.
Replies attempted within 24 hours. Specific critiques of §7 prioritised over praise.
Stop labeling, start measuring: the supervisory signal you can extract from a fixed corpus scales as N(N+1)/2 in the size of your frozen embedder panel
Chris Royse · Teleox.ai ·
chrisroyseai@gmail.comCompanion preprints: TCT (ResearchGate 403916407; under review at the NeurIPS 2026 Position Paper Track, forum mpQXCwkQcq) · Dynamic / ME-JEPA: An Audited, Domain-Portable World-Model Runtime on a Single RTX 5090, with a New Class of Training Data (ResearchGate 404389924; release candidatemejepa_5090_artifact_v2.0.0_rc1, Zenodo DOI 10.5281/zenodo.19977981, concept DOI 10.5281/zenodo.19953950) · Public Shakespeare LoRA (huggingface.co/cabdru/shakespeare-lora-gemma4) · Public ClipCannon repo (github.com/chrisroyse/clipcannon) Disclosure: AI-assisted drafting, hand-revised throughout. The technical claims, scope guards, measurements, and code are mine. See §10 for the full conflict-of-interest statement.This post defends one structural claim and one practical implication. The structural claim: a fixed raw corpus of
ninputs projected through a panel ofNfrozen, approximately independently trained embedders yields a derived dataset of sizen · (N + C(N,2)) = n · N(N+1)/2structured supervisory signals —Nper-embedder projections plusC(N,2)pairwise cross-embedder interaction features per input. The per-input yield grows quadratically inN. The practical implication: the bottleneck for extracting more supervised training data from a fixed corpus is no longer the corpus, the budget for human labelers, or the willingness to risk model collapse on synthetic data. The bottleneck is the supply of frozen, independently-trained embedders. AsNgrows, the per-input yield compounds. The framework I am building runs atN = 13in the Context Graph production system today; my development branch is atN = 24; the asymptote is bounded only by how many independently-trained encoders the field is willing to ship.I want this taken seriously and I want the bottleneck is the embedder supply implication taken seriously specifically. Anyone who reads this and then goes off to train a new specialised frozen encoder for some niche the existing panel does not cover is contributing to a research program where the marginal value of each new independent encoder is a quadratically-compounding multiplier on the supervised signal extractable from every existing corpus.
I am applying for the Anthropic Fellows Program July 2026 cohort and the Constellation Astra Fellowship around an experiment that measures the realised information density of the panel (the
N_effaudit). I am stating the application context up front rather than burying it. The post is not part of either application packet.Three scope guards travel with every claim. Quoted verbatim from the parent manuscript so the reader can hold me to them.
The per-input yield
N + C(N,2)is a constructive count over the derived dataset, not an information-theoretic lower bound. The realised information density per pair is bounded above by the marginal entropy of each embedder and may contract under heavy embedder redundancy. EXP-2 pairwise mutual-information audit is named below.Shumailov regime: DDA pipelines are structurally outside generator-in-loop recursion (scope claim, not refutation).
Meaning compression as a fourth taxonomic entry: axis-extension of the compression-as-intelligence frame, not a re-fit of scaling exponents; reviewer acceptance of the taxonomic distinction is a scholarly judgement I do not pre-empt.
If you read past those guards as if they were not there, you will read the headline stronger than I am writing it.
1. The counting identity
Let
D = {x_i}fori = 1..nbe a fixed raw corpus ofninputs. LetΦ = {φ_m}form = 1..Nbe a panel ofNfrozen, approximately independently trained embedders, eachφ_m : X → R^{d_m}. Frozen means parameters are held fixed throughout the pipeline and through any downstream training. Approximately independent carries the engineering-operational hedge that strict statistical independence between embedders is an open measurement question rather than a derived property; I return to it in §3 and §7.For pairs
(j, k)withj < k, define a cross-embedder interaction featureρ_jk(φ_j(x_i), φ_k(x_i))as a non-constant function of its two arguments. The canonical choices are normalised cosine, a pairwise mutual-information estimate, or a concatenation followed by a frozen adapter. The derived datasetD'materialises every per-embedder projection and every pairwise interaction on every raw input:Counting structured supervisory signals yields the DDA counting identity:
The decomposition is
n · Nper-embedder labelled vectors plusn · C(N,2)pairwise cross-embedder interaction features. The per-input yield isN + C(N,2). The yield is quadratic inN. AtN = 7the per-input yield is 28; atN = 13it is 91; atN = 24it is 300; atN = 100it is 5,050.A signal here is a scalar or low-dimensional feature that (i) is a deterministic function of a raw input under frozen embedders and frozen interaction rules, (ii) is attached to a specific (input, embedder) or (input, embedder-pair) index, and (iii) can be used as input or target in a downstream loss without new data collection. DDA is therefore not synthetic-data generation (it decomposes
Drather than synthesising new inputs), not data augmentation (the input is unchanged), and not distillation or self-training (no teacher–student relationship; no pseudo-labels). It is also not labelling in the human-rater sense. Every signal inD'is a deterministic measurement from a frozen instrument; no human is in the loop and no synthetic generator is in the loop.Why “constructive” matters and what it leaves open. Identity (1) counts entries in
D'. Each entry is a real measurement under the frozen-embedder axiom. The realised information per pair is a different question. Under approximate orthogonality, the pairwise mutual-information termsI(φ_j(X); φ_k(X))are small relative to the marginal entropies and the pairwise count carries non-trivial information. When two embedders collapse to a linear reparameterisation of each other, their pair contributes near-zero additional information and the information-side gain from that pair contracts toward zero, even though the entry still exists inD'. This is the open hole I name below as the load-bearing experiment to tighten: a pre-registered pairwise mutual-information audit on the panel that returns a measuredN_effand a per-pair information histogram. The audit cannot falsify Identity (1) (the entries exist by construction). It can show that the realised multiplier on supervisory information density sits below the entry-count multiplier by some factor that depends on the panel’s diversity.2. Two production demonstrations the labeller role works at scale
Two engineering substrates instantiate the counting identity at production scale, in different modalities. They are not the contribution. They are constructive evidence that the labeller role is implementable rather than hypothetical, and that the
Nin (1) is a real number on a real codebase rather than a notational placeholder.Context Graph: text-side decomposition at
N = 13in production. A Rust workspace of roughly 370,000 source lines across ten crates, embedding every stored memory through 13 independent frozen embedders simultaneously, persisting all thirteen projections plus topic profiles and pairwise synergy features to a RocksDB store with 59 column families, exposing retrieval as 75 MCP tools, and running 5,184 in-process tests. The thirteen embedders span 11,008 dense dimensions, two 30,522-term sparse vocabularies (E6 lexical, E13 SPLADE; Formal et al., 2021), and a variable-length 128-per-token late-interaction space (E12 ColBERTv2; Santhanam et al., 2022). The remaining slots are e5-large-v2 for general semantics (E1), sinusoidal/Fourier temporal encodings (E2–E4), an asymmetric nomic-embed causal embedder (E5), a Qodo code embedder at 1,536-D (E7), an e5-large-v2 structural variant for graph edges (E8), a 10,000-bit hyperdimensional typo-tolerant encoder projected to 1,024-D (E9), an e5-base-v2 paraphrase-asymmetric variant (E10), and KEPLER (Wang et al., 2021) for entity geometry (E11). E5, E8, and E10 store dual vectors (cause/effect, source/target, document/query). All thirteen are frozen.I ran the pipeline end-to-end on the Project Gutenberg Complete Works of Shakespeare (5.4 MB plain text) on 2026-04-14 on a single RTX 5090 Blackwell. The chunker produced 1,552 scene/sonnet/poem chunks; 2,741 of those passed quality gating and were ingested. The pipeline produced 249,431 labelled training signals (2,741 × 91), 13,465 cross-work contrastive anomaly pairs, and 44 per-work geometric constellations, totalling roughly 5.92 million derived features. Disk-storage form: a 120 MB compressed Parquet whose uncompressed payload is 1,551 MB. End-to-end wall time was about 85 minutes. A side-effect demonstration of the multi-encoder geometry: with no supervision, the system clustered eight of nine of Shakespeare’s English-king history plays (1 Henry IV, 2 Henry IV, Henry V, 2 Henry VI, 3 Henry VI, Richard II, plus close neighbours) into one tight region at pairwise centroid cosine 0.98+, separating them from comedies, tragedies, sonnets, and longer poems by pure geometry over the 13 frozen embedders. RocksDB column-family counts after the run:
fingerprints3,199;training_records2,770;constellations45;contrastive_pairs13,465;audit_log~3,000.My current development branch runs the same architecture at
N = 24. The new embedder slots cover modality combinations theN = 13panel did not (additional language-specific encoders, additional structural encoders, additional temporal encoders). The per-input yield rises from 91 to 300 by Identity (1). I have not yet run theN = 24panel end-to-end on a corpus the size of Shakespeare, so I am not reporting a new measured multiplier; the count grows by construction.ClipCannon: video-side decomposition at
N = 7in production. A Python package of roughly 67,585 source lines running a 23-stage analysis DAG over source video, exposing 58 MCP tools, totalling 4,044 dimensions across seven modalities: visual (SigLIP-SO400M at 1,152-D; Zhai et al., 2023), semantic (nomic-embed-text-v1.5 at 768-D), emotion (wav2vec2-large at 1,024-D), speaker (WavLM-large at 512-D; Chen et al., 2022), prosody (custom 12-D F0/energy/rate/contour), sentiment (MiniLM-L6-v2 at 384-D), and voice identity (ECAPA-TDNN at 192-D; Desplanques et al., 2020). AtN = 7, the per-input yield is7 + 21 = 28structured signals per source clip.I ran the pipeline on a 975-second (16 minute, 15 second) interview video of a single subject (the “Santa” identity). The pipeline separated the speaker from the interviewer (17 interviewer segments removed via
interviewer_ranges.npz), curated 2,362 training clips of 49 frames each (25 FPS, ~1.96 s per clip) for an EchoMimicV3 LoRA at rank 256 / α = 512 / 9 attention modules, and extracted Santa-only modality counts of 1,819 visual / 192 semantic / 362 emotion / 188 prosody / 188 voice / 154 sentiment / 3,177 FLAME expression frames. The Phase-1 constellation construction produced eight named behavioural constellations (calm, attentive, amused, curious, energetic, contemplative, warm, mischievous) at the top level, populated by 34 manually-defined micro-expression skills, populated by 40 K-means-discovered micro-expression groups, populated by 196 individual FLAME × FACS Action Units the pipeline observed in the source video. The verification surface is 61 + 22 unit tests against the live constellation. The forensic-narrative section of the parent paper documents six bug-fixes from the six-day sprint that produced the corrected dataset, including a flow-matching sign inversion intarget = noise − latent, aq_audioLoRA-target name-shadowing typo, a frame-rate sampling mismatch making clips 2.4× too short, and mouth-bbox geometric ratios landing on the subject’s nose. The corrected pipeline ships 403 passing tests across the codebase.The Santa case is the multimodal companion to the Shakespeare case. Same construct, different modality, smaller
N(7 vs 13), much higher per-frame information content (4,044 dims per clip across modalities). The point is not the comparison between the two cases. The point is that the labeller role of the counting identity runs end-to-end on real text and on real video, at production scale, with no humans in the labelling loop.3. The fourth-taxonomic-entry argument
Conventional compression literature asks “for fixed semantic content, how few bits, weights, or activations can I use?” The three established categories are well-defined in the ML systems literature. Bit compression (gzip, zstd, FLAC, PNG, arithmetic coding; Delétang et al., 2024 for the LLM-as-compressor framing) measures bits of compressed output per bit of raw input. Weight compression (post-training quantisation such as GPTQ, one-shot pruning such as SparseGPT, knowledge distillation) measures task accuracy per trainable parameter. Activation compression (KV-cache quantisation, TurboQuant) measures peak runtime memory per unit of output quality.
Meaning compression flips the question. For fixed raw-data volume, how many structured supervisory signals can I extract? Define the meaning-compression ratio as the per-input yield of Identity (1):
The unit pair is semantically distinct from the three established categories. A corpus compressed with gzip loses no information and produces no additional supervisory signals. A model pruned with SparseGPT is a different model on the same training data. TurboQuant changes what must be held in memory during a forward pass but not the signal density driving training. Meaning compression changes the training-data side at a raw corpus held constant. The four operate on different unit pairs, at different pipeline stages, and a single pipeline can realise all four simultaneously without contradiction.
I am not claiming meaning compression supersedes any of the other three. I am claiming the unit pair is semantically distinct enough to deserve its own slot in the taxonomy, and that the ratio (2) — quadratic in
Nand bounded above only by the size of the available embedder panel — is the parameter the field has not been treating as a research target on its own.I expect reviewer pushback on the taxonomic distinction. The strongest version of the pushback is “this is just multi-view representation learning under a new label.” I have two responses. First, multi-view literature evaluates panels of encoders as retrieval instruments with metrics like NDCG, Recall@k, and equal-error-rate, where the panel’s outputs are the endpoint and ranking-lift is the deliverable; the counting identity instead treats the same panel as a labeller that produces a structured supervisory dataset on a fixed raw corpus, which is a different operational role for the same primitive. Second, I am not arguing the move is conceptually impossible to derive from prior work; I am arguing that, pool-relative to my locked 88-citation reference set, no cited paper jointly states (i) the counting identity, (ii) the per-raw-data-unit ratio view, and (iii) the framing as a category alongside the three existing compression types, and names the embedder supply as the bottleneck on the ratio. A pool-external counter-example would qualify the framing without invalidating the construct, and I would value the pointer.
4. Why DDA pipelines sit structurally outside the Shumailov collapse regime
The recursive self-training literature, anchored by Shumailov et al. (2024) in Nature and developed in Dohmatob et al. (2024) on modified scaling laws and Alemohammad et al. (2023) on model autophagy, establishes an irreversible distributional drift under a generator-in-loop recursion. The precondition across every cited theorem and simulation is structural: there is a sequence of corpora
{D_t}such thatD_{t+1}contains samples drawn from a generator trained onD_t, formallyD_{t+1} ∋ x̃ ~ p_{θ_t}.DDA pipelines do not instantiate this recursion. By the frozen-embedder axiom, each
φ_m(x_i)inD'depends only on a real inputx_i ∈ Dand on the frozen parameters ofφ_m; no generator trained onDor on a predecessor corpus participates in the derivation. The derived dataset is a decomposition of the real corpus, not a sample from a generator. The generator-in-loop precondition of the cited collapse theorems is not satisfied by construction. DDA pipelines are structurally not subject to the Shumailov-form recursion regime, rather than refuting it.This is a scope argument, not an immunity theorem. I am claiming the preconditions of the cited theorems are unsatisfied for DDA; I am not claiming a new independence-from-collapse result. Two boundary cases sit explicitly out of scope. First, embedder-mediated data selection (using
φ-scored similarity to filter or re-weightDbefore training) is a different feedback structure that the cited theorems do not cover, and any drift it might induce would be a separate empirical question. Second, a variant that re-computed the centroid or the panel from generator outputs would re-introduce a generator-in-loop structure, and the scope argument would need reassessment; the protocol as specified does not do this.The implication for the data-wall framing of Villalobos et al. (2024) is narrow but specific. The standing menu has two items: Path 1 acquires more real data through licensing and partnerships, with deals like News Corp / OpenAI in the hundreds of millions per partnership (Spangler, 2024), and Path 2 generates more synthetic data via self-distillation, accumulation per Gerstgrasser et al. (2024), or oracle verification per Feng et al. (2024). Both still operate on samples whose generation depends on a model trained on a prior round’s data. Neither has a theoretical place for the configuration in which a fixed corpus
Dis held constant andNfrozen independently-trained embedders compute deterministic projectionsφ_m(x). The cited theorems are silent on this configuration. A third path (decompose a fixed corpus through frozen embedders rather than acquire or synthesise) is what the silence names, and the size of that third path is parameterised by the embedder count.5. Two diagnostic findings where the panel surfaced bugs single-encoder eval missed
The strongest empirical case for the counting identity being a real multiplier of supervisory information rather than notational accounting comes from two engineering findings on systems that single-embedder evaluation would have shipped as healthy.
The Shakespeare LoRA training-data echo. I trained a rank-256 Shakespeare stylistic LoRA on Gemma-4-E4B-it (α = 512, 258 target modules, two-stage SFT-then-DPO on a 13-embedder DDA-projected derivative of the Shakespeare canon: 7,925 SFT pairs and 11,724 DPO pairs; final SFT loss 1.12 with 98.9% token accuracy, final DPO loss 0.47 with 98.0%
rewards/accuracies and rewards/margins 43.62; published at huggingface.co/cabdru/shakespeare-lora-gemma4 under Apache-2.0). 5,954 of 8,857 SFT examples (67%) began with play-script header prefixes that the initial training pipeline silently learned to echo. A single-embedder style-cosine evaluation on dense E1 was insufficient to surface this; E1 absorbed the header signature into the style distribution. The multi-embedder STYLE_CENTROID guard caught the echo because the sparse panels (E6 lexical, E13 SPLADE) and the late-interaction E12 panel surfaced the header signature as a high-cosine outlier that the dense panel had normalised away. The bug-detection mechanism is structurally the counting identity at work: a signal that one row of D' cannot see is visible to a different row, and the conjunction of rows surfaces what any single row hides. The same LoRA, when given a Spanish-language prompt, also produced unprompted Siglo de Oro register prose despite the training corpus containing no Spanish text — a single-instance observation, with caveats in the parent paper §6.1.4, but consistent with the read that several of the thirteen embedders carry signal components that are not language-specific.

The Santa avatar flow-matching sign-bug. Initial EchoMimicV3 LoRA training used an inverted target convention,
target = latent − noise rather than target = noise − latent. Loss curves looked superficially healthy and converged smoothly while the trajectory the model was being asked to learn was inverted. Single-loss-curve diagnostics did not flag the divergence. The constellation-side audit, which re-embeds generated frames through SigLIP and validates them against the per-modality centroids constructed from the reference set, surfaced the disagreement: the generations were converging in pixel space but diverging in the seven-frozen-encoder geometry. The bug was one of six fixes documented in the parent paper §6 and the ClipCannon white paper §6.2 / §10. Both findings were caught by the audit role of the same panel that produces the supervisory signals; the panel-as-diagnostic is not an afterthought, it is the same construct as the panel-as-labeller.

6. The bottleneck is the embedder supply
The practical implication of (1) and (2) sits in this section. The standard framing of the data wall asks how to acquire more raw text, or how to generate more synthetic text, or how to filter what synthetic text we already have. All three framings hold the panel
Φ constant — usually implicitly, by assuming a single dense encoder is the right unit of measurement and that the question is what to feed into it.

Identity (1) reframes the question. Hold the corpus
D constant. Vary Φ. The supervisory-signal yield is n · N(N+1)/2. The leverage point is not the corpus or the labeling budget. It is N and the diversity of the panel. Adding one new independently-trained encoder to a panel of size N adds 1 + N new signals per input — itself, plus its pairwise interaction with every existing member. At N = 13, adding the 14th encoder adds 14 new signals per input. At N = 24, adding the 25th adds 25. At N = 100, adding the 101st adds 101. The marginal value of each new independent encoder grows linearly in the panel size that already exists.

This reframes "what would help most with the data wall" in a specific and actionable direction. Build new specialised, independently-trained frozen encoders. Niches the existing panel does not cover: long-form discourse-level encoders, encoders trained on programming-language ASTs rather than tokens, encoders for legal or medical or scientific domains where a single large dense encoder leaves structural information on the table, encoders for non-English language families, encoders for non-text modalities the field has not built strong frozen instruments for yet. Each new independent encoder, once frozen, contributes
1 + N new supervisory signals per input to every existing corpus DDA decomposition, retroactively, without re-touching the corpus.
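The marginal-value arithmetic, as a few lines of Python; the printed numbers are the ones quoted above.

```python
def yield_per_input(N: int) -> int:
    """Constructive per-input count: N projections + C(N,2) pairwise features."""
    return N * (N + 1) // 2

def marginal_gain(N: int) -> int:
    """New signals per input from adding the (N+1)th encoder: itself + N pairs."""
    return yield_per_input(N + 1) - yield_per_input(N)   # always N + 1

for N in (13, 24, 100):
    print(f"N={N}: {yield_per_input(N)} signals per input; "
          f"the next encoder adds {marginal_gain(N)}")
# N=13: 91 signals per input; the next encoder adds 14
# N=24: 300 signals per input; the next encoder adds 25
# N=100: 5050 signals per input; the next encoder adds 101
```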
The current Context Graph production panel (N = 13) is what I had the infrastructure for in April 2026. The development panel (N = 24) is what I am running today. Both are tiny relative to the size the panel can grow to as the field ships more independently-trained frozen encoders. Anyone reading this who has trained a strong frozen encoder on a niche the panel does not cover has, in my reading, contributed a multiplier on the supervisory signal extractable from every fixed corpus the panel is then applied to. The bottleneck is not corpora, not human labelers, not synthetic-data risk tolerance. It is the embedder supply.

7. Where the construct is weakest
Three places the framework can be attacked, in priority order. I am stating these as open holes rather than as defended positions. If the post receives one substantive critique, the most useful place for it to land is here.
Realised information per pair is empirical, not derivable. Identity (1) counts entries in
D'. Each entry is a real measurement under the frozen-embedder axiom. The realised information per pair is bounded above by min(H(φ_j), H(φ_k)) and contracts as the pair's mutual information rises. Heavy redundancy between any pair (for instance E1, the e5-large-v2 semantic encoder, and E10, a paraphrase variant of the same family, both dense-text encoders on overlapping pretraining) would shrink the realised information from that pair below the entry's nominal weight. The pre-registered pairwise mutual-information audit (EXP-2 in §8) is the falsification test: estimate I(φ_j(X); φ_k(X)) for each of the C(N,2) pairs via MINE or a partitioned-KSG-adjacent estimator on a stratified sample of 100 to 1,000 representative inputs; report a redundancy histogram and a measured N_eff. The audit cannot falsify Identity (1); it can show the realised information density sits below the entry-count multiplier by a measurable factor.

The taxonomic-acceptance argument is a reviewer judgement, not a derivation. "Meaning compression deserves a fourth slot" is a framing claim about how the literature should be organised; it cannot be proven from the counting identity itself. A reviewer can accept (1) and (2) as true and still reject the taxonomic distinction on the grounds that the identity is more naturally read as multi-view representation learning, multi-encoder retrieval, or labelled-signal extraction in a known sense. I think the unit-pair argument carries weight (no other category measures structured signals per raw-data unit at the dataset-preparation stage), but the strength of that argument depends on accepting the unit pair as semantically distinct from bits-per-bit, accuracy-per-parameter, and memory-per-output-quality, and that acceptance is not derivable from (1).
The downstream sample-efficiency claim is named future work, not present evidence. I am not claiming, in this post or in the parent manuscript, that DDA-enriched supervision actually improves sample efficiency in the sense of refitted Kaplan-style or Chinchilla-style scaling exponents. The counting identity says how many structured signals you can extract; it does not say what happens to the loss exponents when you train on those signals at matched compute against a non-DDA baseline. An experiment that re-fits scaling exponents on a DDA-enriched corpus is a multi-million-dollar GPU-hour experiment I cannot run on a graduate stipend. Until that lands, the contribution is the counting identity, the ratio view, the fourth-taxonomic-entry framing, the embedder-supply bottleneck argument, and two production substrates that demonstrate the labeller role is implementable at scale across two modalities. The downstream sample-efficiency claim is the natural place where the framework either pays off or does not, and I am explicit that the present evidence does not yet reach there.
8. Three experiments that would tighten or kill the framework
Each is instrument-ready in the sense that the input data, the panel, and the evaluator are specified; only execution and writeup remain.
EXP-2 is the load-bearing one for this post. A pre-registered pairwise mutual-information audit on the 13-embedder Context Graph panel (extending naturally to the 24-embedder development panel): estimate
I(φ_j(X); φ_k(X)) for each pair via MINE, partitioned-KSG, or InfoNCE lower-bound on a stratified sample of 100 to 1,000 representative inputs; report a redundancy histogram and a measured N_eff. The realised information density of the panel becomes a defended capacity rather than a constructive ceiling. This is the experiment that converts the framework from "the entries exist by construction" into "the entries carry this measured amount of non-redundant supervisory information." It also produces the per-encoder marginal-value table that lets the field decide which gaps in the embedder panel are worth filling first.
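For readers who want the audit's shape before the pre-registration lands, here is a sketch that substitutes a closed-form Gaussian-CCA estimate for the MINE / partitioned-KSG / InfoNCE estimators named above; joint Gaussianity of the projections is an assumption of this sketch, not of EXP-2.

```python
import numpy as np
from itertools import combinations

def gaussian_cca_mi(Z_j, Z_k, eps=1e-6):
    """Gaussian-assumption MI estimate (nats) between two projection blocks
    via canonical correlations: I = -0.5 * sum(log(1 - rho_i**2)).
    Z_j is (n, d_j), Z_k is (n, d_k); rows are projections of the same inputs."""
    Zj = Z_j - Z_j.mean(axis=0)
    Zk = Z_k - Z_k.mean(axis=0)
    Uj = np.linalg.svd(Zj, full_matrices=False)[0]   # whitened column bases
    Uk = np.linalg.svd(Zk, full_matrices=False)[0]
    rho = np.clip(np.linalg.svd(Uj.T @ Uk, compute_uv=False), 0.0, 1.0 - eps)
    return float(-0.5 * np.sum(np.log1p(-(rho ** 2))))

def redundancy_audit(projections):
    """projections: dict embedder_name -> (n, d_m) array over the same
    stratified sample. Returns the C(N,2) pairwise estimates that feed the
    redundancy histogram; the N_eff reduction is left to the pre-registration."""
    return {(j, k): gaussian_cca_mi(projections[j], projections[k])
            for j, k in combinations(sorted(projections), 2)}
```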
EXP-1 attacks the runtime-guard side rather than the counting side. An ECAPA-TDNN cross-encoder re-score of the Case 3 voice argmax candidates from the parent paper (and ideally the full 120-candidate pre-selection distribution) on the same audio: report mean and max ECAPA SECS, Pearson r and Spearman ρ vs WavLM, and a Bland-Altman plot. The headline on the parent paper's Case 3 voice is currently encoder-matched and within-WavLM-family; if cross-encoder agreement holds at mean ECAPA ≥ 0.85 with Pearson r ≥ 0.5, it promotes to cross-encoder identity ranking. This is mostly relevant to the per-output runtime guard G_τ (§9 below), not to the counting identity directly.

EXP-3 is a sample-complexity sweep on the centroid construction. A four-point reference-set-size curve at
n_m ∈ {10, 25, 50, 100} on the same Case 3 scorer and held-out set. The DDA axiom asserts n_m ∈ [10, 10²] suffices for a stable centroid; the empirical curve has one observation at n_m = 50, and a four-point curve would anchor the sample-complexity claim at more than one operating point.

I am applying for the Anthropic Fellows Program (July 2026 cohort) and the Constellation Astra Fellowship (Sept 2026 cohort) with EXP-2 plus an AuditBench-adjacent calibration phase as the focused 5-month project. If anyone reading this is set up to run the MI-estimator infrastructure on a 13- or 24-embedder panel, has an AuditBench-adjacent calibration setup, or wants to read §3.3 of the parent paper (the Shumailov scope claim) and tell me where it breaks, I would value the collaboration. Email
chrisroyseai@gmail.com.

What would change my mind on the embedder-bottleneck framing. (i) A pool-external counter-example in which a single sufficiently large encoder produces a labelled-signal yield comparable to the multi-encoder panel's at matched compute, suggesting the multi-encoder decomposition is a notational artefact rather than a structural multiplier. (ii) An EXP-2 audit that returns
N_eff close to N (collapse of the pairwise term) on every panel anyone runs it on, falsifying the practical content of the identity even though the upper bound holds. (iii) A demonstration that encoders further-trained from the same base produce pairwise interaction features as useful as those of truly independently-trained encoders — which would weaken the "independently-trained" axiom and suggest the embedder supply is not actually the bottleneck, because cheap fine-tunes could substitute.

9. One downstream consequence, briefly: the runtime guard G_τ
The same frozen panel that produces D' in the labeller role can, without parameter update, serve as a runtime guard at inference. Each φ_m evaluates a generated candidate ŷ and computes cos(φ_m(ŷ), c_m) against a frozen centroid c_m built from a per-modality reference set in a Phase 1 step (the canonical default is the L2-normalised mean of L2-normalised projections of 10 to 100 reference samples). The acceptance predicate is a strict conjunction over the M modalities:

accept(ŷ) ⇔ cos(φ_m(ŷ), c_m) ≥ τ_m for every m ∈ {1, …, M}.
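A minimal sketch of the Phase 1 centroid and the conjunction, with the embedder callables and per-modality thresholds τ_m supplied by the caller:

```python
import numpy as np

def build_centroid(reference_projections):
    """Phase 1 canonical default: the L2-normalised mean of the L2-normalised
    projections of 10 to 100 reference samples."""
    R = np.stack([v / np.linalg.norm(v) for v in reference_projections])
    c = R.mean(axis=0)
    return c / np.linalg.norm(c)

def guard(candidate, embedders, centroids, taus):
    """G_τ as a strict conjunction over M modalities: accept the candidate ŷ
    only if cos(φ_m(ŷ), c_m) ≥ τ_m for every m. No auxiliary learned model."""
    for phi, c, tau in zip(embedders, centroids, taus):
        z = np.asarray(phi(candidate), dtype=np.float64)
        z = z / np.linalg.norm(z)
        if float(z @ c) < tau:
            return False      # a single modality below threshold rejects ŷ
    return True
```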
O(M · d_max)for the cosine comparisons plus the embedder forward passes (one per modality per candidate). There is no auxiliary learned model. This is a downstream consequence of the same panel being usable in three semantic roles without parameter update: retriever (the conventional role in dense, sparse, and late-interaction retrieval), labeller (the role in (1)), and runtime guard (the role in (3)). It provides a per-output runtime guarantee that scalar-reward alignment in the InstructGPT (Ouyang et al., 2022), DPO (Rafailov et al., 2023), and Constitutional AI (Bai et al., 2022) tradition does not provide in the same technical sense. I do not develop this thread here; the parent paper §5 develops it, and Bowman, Perez, Hubinger and colleagues’ November 2025 paper (arXiv 2511.18397) on natural emergent misalignment from reward hacking in production RL identifies three failure modes (encoder-matched survivorship, centroid collapse on non-diverse references, multi-modal strictness without M ≥ 4 evidence) that the runtime guard does not protect against and that I treat as load-bearing in §8 of the parent paper.The relevance to this post: the runtime guard is one consequence of the three-role composition over identical frozen instances. The counting identity (1), the ratio (2), the fourth-axis framing in §3, and the embedder-supply bottleneck argument in §6 are the load-bearing claims; the runtime guard is what falls out when you ask the labeller panel to do double-duty at inference.
10. Reference implementation
A minimal reference implementation of the per-input signal-yield computation, intended to be readable rather than performant. The Context Graph Rust workspace at
N = 13 is not currently a public repository; the public companion is the ClipCannon Python pipeline at N = 7 (github.com/chrisroyse/clipcannon), which implements the labeller role on video. A second public artefact, the Dynamic / ME-JEPA release candidate mejepa_5090_artifact_v2.0.0_rc1 on Zenodo (DOI 10.5281/zenodo.19977981), ships a deterministic Docker plus Apptainer/SIF reproducibility build that includes the meaning-compression instrumentation directly in the runtime: an MC-ratio subcommand, signal-yield accounting, and pairwise mutual-information audit outputs over its panel. The Python sketch below is for the AF reader who wants the labeller-side counting identity in 30 lines.
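The sketch is schematic: each φ_m is any frozen callable returning a vector, the pairwise interaction is the cosine between ℓ2-normalised projections, and, as a simplification of the sketch rather than of the framework, all embedders are assumed to share an output dimension so the pairwise cosine needs no shared projection step.

```python
import numpy as np
from itertools import combinations

def _unit(v):
    """L2-normalise; projections are compared on the unit sphere."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def derived_signals(x, embedders):
    """The per-input counting identity: N frozen projections plus C(N,2)
    deterministic pairwise interactions = N(N+1)/2 structured signals.

    embedders: a list of N frozen callables φ_m mapping a raw input to a
    vector. Nothing here is learned; nothing here touches a generator."""
    N = len(embedders)
    projections = {f"phi_{m}": _unit(np.asarray(phi(x), dtype=np.float64))
                   for m, phi in enumerate(embedders)}
    signals = dict(projections)                        # the N per-embedder entries
    for (j, pj), (k, pk) in combinations(projections.items(), 2):
        signals[f"cos({j},{k})"] = float(pj @ pk)      # the C(N,2) pairwise entries
    assert len(signals) == N + N * (N - 1) // 2 == N * (N + 1) // 2
    return signals
```

At N = 13 the returned dict has 91 entries per input; at N = 24, 300.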
Two things to notice. There is no learned model in derived_signals; the embedders are frozen by axiom and the pairwise function is a deterministic interaction over their outputs. The assertion N + C(N,2) is the constructive entry count; replacing the cosine pairwise with a more expressive interaction (a frozen adapter, a partitioned-KSG MI estimate) does not change the count, only the realised information content per pair. The N_eff that EXP-2 would measure is the empirical capacity of a panel against this constructive ceiling, and it is the number that turns the per-input yield from a count of entries into a defended quantity of non-redundant supervisory information.

11. Conflicts of interest
I am the founder and operator of Teleox.ai (the research operation behind Context Graph, ClipCannon, the TCT framework, and Dynamic / ME-JEPA) and Leapable.ai (a creator-knowledge marketplace MCP stack). I hold equity in both. The DDA counting identity, the meaning-compression ratio view, the fourth-taxonomic-entry framing, the embedder-supply bottleneck argument, and the two production substrates that instantiate the labeller role are methodology contributions emerging from commercial engineering practice. No third-party research funding or vendor-paid benchmark is reported. The release posture is asymmetric: the Shakespeare LoRA (Case 1) is published byte-identical on HuggingFace; the Santa case (Case 2) and the parent paper's Case 3 voice case have subject artefacts withheld on consent and dual-use grounds, documented in §Ethics of the parent paper. The Dynamic / ME-JEPA artefact (release candidate
mejepa_5090_artifact_v2.0.0_rc1, Zenodo DOI 10.5281/zenodo.19977981, concept DOI 10.5281/zenodo.19953950) is fully reproducible. The public ClipCannon Python pipeline lives at github.com/chrisroyse/clipcannon. The Context Graph Rust workspace is not a public repository at time of writing.

I submitted applications to the Anthropic Fellows Program and the Constellation Astra Fellowship on 2026-05-02. This post is not part of either application; it is a separate research-side write-up of the construct for the AF/LW reader, and I am explicitly not asking anyone here to factor it into evaluation if our paths cross professionally.
I drafted this post with LLM assistance, then revised it by hand to strip the model's stylistic tics and to tighten the scope guards where I had hedged too softly. The technical claims, the measurements, the scope guards, the open-holes analysis, and the code are mine. If the open holes in §7 are stronger than I have stated them, particularly the first (realised information per pair is empirical, not derivable) and the third (no scaling-exponent re-fit), please say so explicitly. That is the most useful thing this post can receive.
Replies attempted within 24 hours. Specific critiques of §7 prioritised over praise.