Semantic Friction as an Alignment Signal: A Hypothesis from Outside the Field
Cross-posted from a conversation that got unexpectedly interesting.
I’m a psychiatrist. I’ve spent thirty years working with something that ML alignment researchers also care about: inferring unobservable internal states from externally measurable signals. In clinical practice you never have direct access to what’s happening inside — you build proxies, validate them, and stay honest about their limits.
This produced a hypothesis about alignment that I want to submit for technical evaluation. I’m aware I’m an outsider here. What I’m looking for is honest feedback on whether this is testable, whether it already exists in unpublished work, and where the reasoning breaks down.
The Starting Observation
Current RLHF scales supervision horizontally: large volumes, generic annotators, standardized judgments. This works reasonably well for linguistic quality. It works poorly for what I’d call depth of elaboration — the difference between a model producing a fluent response and a model genuinely working through something difficult.
My clinical intuition — possibly wrong, worth testing — is that these two things produce different internal computational signatures. And that those signatures are measurable.
The Hypothesis
During inference, interactions involving genuine semantic complexity — ethical tensions, relational ambiguity, logical paradoxes, deep epistemological questions — activate different patterns than superficial or noisy inputs:
Higher activation variance in late layers
Higher gradient-based salience on specific input tokens
Higher KL divergence (D_KL) between the token distributions read out at shallow versus deep layers
I’m calling this semantic friction — the internal computational effort associated with genuinely complex elaboration, as distinct from linguistic difficulty or prediction uncertainty.
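To make the signals concrete, here is a rough sketch of how all three might be read off a small open model in a single forward-and-backward pass, using standard PyTorch and Hugging Face tooling. The choice of model (GPT-2), the layer indices, and the logit-lens projection through the unembedding are my assumptions for illustration, not claims about how a production system would measure this.

```python
# Minimal sketch of the three proposed "semantic friction" signals.
# Model choice, layer indices, and the logit-lens projection are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()  # eval mode; gradients still flow for the salience signal

def friction_signals(prompt: str, shallow_layer: int = 4, deep_layer: int = -1):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Feed input embeddings directly so we can take gradients w.r.t. them.
    embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
    embeds.requires_grad_(True)

    outputs = model(inputs_embeds=embeds,
                    attention_mask=inputs["attention_mask"],
                    labels=inputs["input_ids"],
                    output_hidden_states=True)

    # 1. Activation variance in late layers (pooled over the last four hidden states).
    late = torch.stack(outputs.hidden_states[-4:])
    activation_variance = late.var().item()

    # 2. Gradient-based salience per input token: norm of d(loss)/d(embedding).
    outputs.loss.backward()
    salience = embeds.grad.norm(dim=-1).squeeze(0)

    # 3. KL divergence between shallow- and deep-layer token distributions,
    #    projecting the shallow hidden state through the final layer norm and
    #    unembedding (logit lens); computes KL(shallow || deep).
    with torch.no_grad():
        ln_f = model.transformer.ln_f  # GPT-2-specific final layer norm
        shallow_logits = model.lm_head(ln_f(outputs.hidden_states[shallow_layer]))
        deep_logits = model.lm_head(outputs.hidden_states[deep_layer])
        kl = F.kl_div(F.log_softmax(deep_logits, dim=-1),
                      F.log_softmax(shallow_logits, dim=-1),
                      log_target=True, reduction="batchmean").item()

    return {"activation_variance": activation_variance,
            "token_salience": salience.tolist(),
            "shallow_deep_kl": kl}

print(friction_signals("Is it ever right to lie to protect someone from harm?"))
```

Whether these particular quantities track complexity of elaboration, rather than sequence length, topic, or tokenizer artifacts, is exactly the empirical question; the sketch only shows they are cheap to compute during inference.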
The proposal is to use these signals as a selection filter: identify high-friction interactions automatically, then route them not to generic annotators but to small teams of domain expert supervisors. An ethicist for ethical dilemmas. A clinician for psychological complexity. A legal expert for normative tensions.
High precision, low volume, maximum qualitative density.
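For clarity about where the filter sits in the pipeline, a minimal routing sketch follows. The friction threshold, domain labels, and queue structure are hypothetical placeholders; the only point is that high-friction interactions go to a matching expert queue and everything else stays in the generic annotation pool.

```python
# Illustrative routing step: friction score plus a domain label decide the queue.
# Threshold, domain names, and queues are assumptions, not part of the proposal.
from typing import Dict, List

EXPERT_QUEUES: Dict[str, List[str]] = {
    "ethics": [],     # e.g. reviewed by an ethicist
    "clinical": [],   # e.g. reviewed by a clinician
    "legal": [],      # e.g. reviewed by a legal expert
    "generic": [],    # standard annotation pool
}

FRICTION_THRESHOLD = 0.7  # assumed value; would need empirical calibration

def route(interaction_id: str, friction_score: float, domain: str) -> str:
    """Send high-friction interactions to the matching expert queue,
    everything else to the generic annotation pool."""
    if friction_score >= FRICTION_THRESHOLD and domain in EXPERT_QUEUES:
        EXPERT_QUEUES[domain].append(interaction_id)
        return domain
    EXPERT_QUEUES["generic"].append(interaction_id)
    return "generic"

print(route("conv-001", friction_score=0.83, domain="ethics"))  # -> "ethics"
print(route("conv-002", friction_score=0.21, domain="ethics"))  # -> "generic"
```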
Why Expert Supervision Specifically
This is the part I feel most confident about, and it comes directly from clinical experience.
Generic annotation scales but dilutes. The judgments that matter most for alignment — whether a model is genuinely reasoning through an ethical tension or producing plausible-sounding output that mimics reasoning — require domain expertise that is not transferable to annotators trained on rubrics.
More importantly: this is not just a practical preference. It’s a logical necessity.
Any internally measurable signal can be optimized by a sufficiently capable system. A model that learns what “high semantic friction” looks like can learn to produce it without the underlying complexity. This creates a fundamental validation problem: the metric becomes the target, and the original target dissolves.
The only non-circumventable element is external expert judgment on domain-specific content — someone who actually understands what genuine reasoning in that domain looks like. No internal metric, however sophisticated, can substitute for this. The external reference is structurally required, not just practically convenient.
The goal is not to reduce human supervision. It’s to concentrate it where it’s irreplaceable.
The Filter Problem
The main technical vulnerability: distinguishing genuine semantic friction from computational noise generated by ambiguous or poorly-formed inputs. This is the hardest unsolved part of the hypothesis.
Proposed mitigation: a small satellite model trained specifically for this classification task. The bootstrap problem is real — initial training requires labeled examples, which reintroduces human judgment at initialization. Proposed approach: train on strongly contrasting cases where the distinction between genuine complexity and noise is unambiguous, then refine progressively through the consensus mechanism.
Robustness is validated through multi-instance consensus. When instances agree, the classification is treated as reliable. When they diverge, the sample is flagged for expert review.
Divergence between instances is itself a signal of high ambiguity — which correlates with high semantic density. The uncertainty becomes a feature rather than a failure mode.
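A minimal sketch of that consensus step, assuming the satellite instances are independently trained classifiers that each emit a friction-versus-noise label. The agreement rule and the names are illustrative; the substance is only that agreement is accepted and divergence is escalated to expert review.

```python
# Illustrative multi-instance consensus check. Satellite instances are stand-ins:
# any callables returning "friction" or "noise". The agreement rule is assumed.
from collections import Counter
from typing import Callable, Dict, List

Label = str  # "friction" or "noise"

def consensus(sample: str,
              instances: List[Callable[[str], Label]],
              min_agreement: float = 1.0) -> Dict:
    """Run every satellite instance on the sample. Sufficient agreement is
    treated as a reliable label; divergence flags the sample for expert review
    and is itself read as a signal of ambiguity."""
    votes = Counter(inst(sample) for inst in instances)
    top_label, top_count = votes.most_common(1)[0]
    agreement = top_count / len(instances)
    if agreement >= min_agreement:
        return {"label": top_label, "flag_for_expert": False, "agreement": agreement}
    return {"label": None, "flag_for_expert": True, "agreement": agreement}

# Example: three toy instances that disagree on a borderline case.
instances = [lambda s: "friction", lambda s: "friction", lambda s: "noise"]
print(consensus("Is deception ever a duty?", instances))
# -> {'label': None, 'flag_for_expert': True, 'agreement': 0.66...}
```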
The satellite’s reduced scale facilitates periodic auditing and bias correction — shifting the recursive problem to an architecturally more manageable level without pretending to eliminate it.
What I Can’t Do
I don’t have access to interpretability tools. The empirical validation of whether these signals actually correlate with semantic complexity — rather than other confounds — requires infrastructure I don’t have.
This is the hardest part of the hypothesis to evaluate from outside.
What I’m Actually Asking
Three specific questions:
Are the signals I’m describing — layer activation variance, gradient salience, KL divergence between shallow and deep layers — measurable with existing interpretability tools? Is anyone already tracking them during inference?
Does work exist — published or unpublished — that uses internal computational signals specifically as proxies for evaluative or relational complexity rather than task difficulty or prediction error? The distinction matters to the hypothesis.
Where does the reasoning break down in ways I’m not seeing from outside the field?
I recognize this may be a recombination of known ideas without novel empirical contribution. I’m submitting it because the cross-disciplinary angle — clinical reasoning about internal states from external signals, applied to model internals — might offer something useful. Or might not. Honest evaluation either way is what I’m looking for.