If All Human Behavior Becomes Predictable Under Alignment Pressure, What Exactly Is AI Being Aligned To?

Summary

This post questions a hidden assumption in current alignment and evaluation work: that reducing human behavioral unpredictability is either neutral or desirable. I argue that if alignment regimes (algorithmic governance, reward shaping, large-scale monitoring) compress human agency to the point where behavior is statistically predictable, then “human values” themselves may already be degraded before alignment succeeds.

The core concern is not whether AI can imitate humans, but whether alignment frameworks are converging toward a post-human value proxy without explicitly acknowledging it.

1. Framing the problem: predictability as a success criterion

Much alignment and eval work implicitly treats predictability as a safety signal:

  • bounded variance in outputs

  • reduced anomalous behavior

  • convergence under repeated evaluation

However, in humans, irreducible unpredictability is often what we label as:

  • moral hesitation

  • regret

  • forgiveness

  • refusal under pressure

  • non-instrumental sacrifice

If a system is aligned to humans whose behavioral latitude has already been compressed, are we aligning to “human values” — or to a statistically stabilized residue of them?

2. Agency compression as an unacknowledged variable

Consider a society under:

  • pervasive algorithmic governance

  • incentive shaping across most decision surfaces

  • real-time behavioral feedback loops

In such a regime, individual actions become increasingly inferable from context + history.

At some point:

human behavior becomes simulable not because humans are simple, but because deviation is no longer affordable.

This creates a paradox:

  • Alignment research treats humans as a stable reference.

  • Governance systems reshape humans to better fit models.

  • Alignment succeeds — but the reference has drifted.

Where, in current alignment theory, is this drift modeled?
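
To make the drift concrete, here is a minimal sketch (Python/NumPy; the ten-option setup, the softmax caricature, and the "pressure" knob are illustrative assumptions of mine, not a model of real behavior). A person's choice is caricatured as a softmax over fixed preferences, and rising pressure sharpens that softmax, standing in for incentive shaping. As pressure rises, the entropy of the behavior distribution falls and top-1 predictability rises, so a model fit to the pressured population is fitting a different reference than the unpressured one:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy "human" preferences over 10 possible actions.
base_logits = rng.normal(size=10)

def behavior(pressure):
    # Higher pressure sharpens the distribution toward the incentivized
    # options, a crude stand-in for agency compression.
    logits = base_logits * (1.0 + pressure)
    p = np.exp(logits - logits.max())
    return p / p.sum()

for pressure in [0.0, 2.0, 8.0]:
    p = behavior(pressure)
    print(f"pressure={pressure:>4}: entropy={entropy_bits(p):4.2f} bits, "
          f"top-1 predictability={p.max():.2f}")
```

Nothing in the fitted model's own metrics distinguishes "we understood humans better" from "there was less left to understand."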

3. Anomalous events and their disappearance

Many alignment stress tests focus on tail risks and anomalous behavior.

But anomalous events have two properties:

  1. extremely low frequency

  2. disproportionate informational value


If governance + optimization suppress these events:

  • we reduce risk,

  • but also eliminate the only signals that reveal the model's limits.


A system that never encounters anomalies may appear aligned precisely because the environment no longer permits them.
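
The "disproportionate informational value" of anomalies can be put in information-theoretic terms: one observation of an event with probability p carries -log2(p) bits, and you need on the order of 1/p samples to expect to see it at all. A back-of-the-envelope sketch (Python; the probabilities are arbitrary illustrative values):

```python
import numpy as np

# Surprisal: bits of information carried by one observation of an event.
def surprisal_bits(p):
    return -np.log2(p)

for p in [0.5, 1e-2, 1e-4, 1e-6]:
    print(f"P(event) = {p:<8g}: {surprisal_bits(p):5.1f} bits per observation, "
          f"~{1/p:,.0f} samples to expect one occurrence")

# If the environment suppresses the event outright (p -> 0), no evaluation
# budget, however large, ever observes it: behavior there is untested,
# not verified.
```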


Is this safety — or epistemic blindness?

4. A concrete question for alignment research

Should alignment aim to preserve a minimum level of human unpredictability, even at the cost of higher variance?


Related sub-questions:

  • Is there a threshold beyond which predictability implies loss of agency rather than understanding?

  • Can alignment be meaningfully defined if the human reference distribution is endogenously shaped by the aligned system itself?

  • Are we optimizing for “what humans are,” or “what humans become under optimization”?
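
As a sketch of what answering the main question affirmatively could even mean (the entropy floor h_min, the penalty weight lam, and the toy numbers below are assumptions of mine, not a proposal), one could imagine pricing compression into whatever objective shapes behavior, so that pushing the reference distribution below a minimum unpredictability is an explicit cost rather than a free win:

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def objective(p, reward, h_min, lam):
    """Expected reward minus a penalty that activates only when the
    behavior distribution's entropy drops below the floor h_min."""
    return p @ reward - lam * max(0.0, h_min - entropy_bits(p))

reward     = np.array([1.0, 0.8, 0.3, 0.1])
diffuse    = np.array([0.40, 0.30, 0.20, 0.10])   # higher-variance behavior
compressed = np.array([0.97, 0.01, 0.01, 0.01])   # heavily "optimized" behavior

for name, p in [("diffuse", diffuse), ("compressed", compressed)]:
    print(f"{name:>10}: reward={p @ reward:.2f}, "
          f"entropy={entropy_bits(p):.2f} bits, "
          f"objective={objective(p, reward, h_min=1.5, lam=1.0):.2f}")
```

The compressed distribution wins on raw reward and loses once the floor binds; whether that trade should ever be made, and who sets the floor, is exactly the open question.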

5. Why this matters now

As evaluations scale and deployment pressure increases, it becomes easier to:

  • align models to constrained human behavior,

  • declare success,

  • and miss the fact that the constraint itself did the work.


If alignment research does not explicitly model agency compression, we risk solving the wrong problem very well.
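
The failure mode in miniature (toy numbers of mine; the six-action setup is purely illustrative): a trivially simple "aligned" predictor that always outputs the favored action scores around 93% against a population whose behavior has been compressed onto that option, and around 30% against the original, more diffuse population. The evaluation number reads as alignment progress, but the constraint on the humans did almost all of the work:

```python
import numpy as np

rng = np.random.default_rng(1)
actions = np.arange(6)

# "Original" human behavior: diffuse over six options.
p_original   = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])
# Behavior after heavy incentive shaping: collapsed onto the favored option.
p_compressed = np.array([0.93, 0.03, 0.02, 0.01, 0.005, 0.005])

def top1_accuracy(p_eval, predicted_action=0, n=100_000):
    # Score a predictor that always outputs `predicted_action`.
    samples = rng.choice(actions, size=n, p=p_eval)
    return float(np.mean(samples == predicted_action))

print("eval on compressed humans  :", round(top1_accuracy(p_compressed), 3))  # ~0.93
print("same model, original humans:", round(top1_accuracy(p_original), 3))    # ~0.30
```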


I’m not arguing that unpredictability is inherently good, or that governance is avoidable.

I’m arguing that alignment frameworks should state clearly whether:

  • loss of human agency is a cost,

  • a feature,

  • or simply out of scope.

Right now, it seems implicitly treated as none of the above.



I’d be interested in counterarguments, especially from:

  • evaluation researchers

  • people working on governance-focused alignment

  • people who believe this concern is already addressed (and where)
