Value Learning Needs a Low-Dimensional Bottleneck

Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.

Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?

The back-of-the-envelope calculation below suggests it would be much harder, perhaps intractably so. I’m going to try to defend the claim that:

Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.

If behavior is driven by a high-dimensional reward vector (feature dimension k large), inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection onto a space of small dimension k, inference may become tractable.

A common worry about human values is that they are complicated and inconsistent[2][3][4]. The intuition seems to be that this makes alignment harder. But maybe the opposite is the case: inconsistency is exactly what you expect from lossy compression, and the dimensionality reduction is what makes the signal potentially learnable.

Plugging into Abbeel & Ng’s bound[5] (with ε = 0.1 and δ = 0.05) gives for the number m of necessary expert demonstrations (independent trajectories):

|  | k = 10 (values are low-dim) | k = 1000 (values are complex) |
| --- | --- | --- |
| γ = 0.9 (short horizon, ~10 steps) | m ≳ 1.2 × 10^6 | m ≳ 2.1 × 10^8 |
| γ = 0.99 (long horizon, ~100 steps) | m ≳ 1.2 × 10^8 | m ≳ 2.1 × 10^10 |

If you need at least 20 billion samples to learn complex values, we are doomed. But the problem may become solvable with a reduction in the number of required trajectories by a factor of about 200 (depending on how high-dimensional you think values are; 1000 is surely conservative, and if any kind of values can actually be learned, the true number may be much higher).
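These figures can be reproduced directly from the bound in footnote [5]. A minimal sketch, assuming the logarithm in the bound is the natural log (the constant factor depends on that choice):

```python
import math

def abbeel_ng_bound(k, gamma, eps=0.1, delta=0.05):
    """Abbeel & Ng (2004) sample-complexity bound:
    m >= 2k / (eps * (1 - gamma))**2 * ln(2k / delta)."""
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

# Evaluate the bound for the four (k, gamma) combinations above.
for k in (10, 1000):
    for gamma in (0.9, 0.99):
        print(f"k = {k:4d}, gamma = {gamma}: m >= {abbeel_ng_bound(k, gamma):.1e}")

# Reduction from the hard case (k = 1000) to the bottlenecked case
# (k = 10) at the same long horizon:
print(f"reduction factor: {abbeel_ng_bound(1000, 0.99) / abbeel_ng_bound(10, 0.99):.0f}")
```

At k = 1000 and γ = 0.99 this gives m ≳ 2.1 × 10^10, i.e. the “20 billion samples” above; the k = 1000 → k = 10 reduction at that horizon comes out to a factor of about 177, roughly the ×200 quoted.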

This could explain why constitutional AI works better than expected[6]: a low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The ×200 reduction doesn’t mean it’s easy. The bottleneck helps with identifiability, but we still need many trajectories, and mapping the structure of the bottleneck[9] can still kill us.

How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using N pairwise comparisons of value-related queries. The marginal gain in predictability should drop off once N exceeds a small multiple of k; e.g., for k ∼ 10 we’d expect an elbow around N ≈ 150.
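A sketch of one version of this test, on synthetic data (all numbers here are illustrative assumptions: 500 simulated respondents, 200 value-related queries, a true latent dimensionality of k = 10, and a made-up noise level). If responses really are generated from a low-dimensional latent, PCA on the response matrix should show the predicted elbow:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_queries, k, noise = 500, 200, 10, 0.1

# Assumption: each person's answers to value-related queries are a linear
# readout of a k-dimensional latent "value vector", plus independent noise.
latents = rng.normal(size=(n_people, k))
loadings = rng.normal(size=(k, n_queries))
answers = latents @ loadings + noise * rng.normal(size=(n_people, n_queries))

# PCA via SVD of the centered response matrix: the per-component explained
# variance should collapse after the first k components (the elbow).
centered = answers - answers.mean(axis=0)
sing = np.linalg.svd(centered, compute_uv=False)
explained = sing**2 / np.sum(sing**2)

print(f"variance explained by top {k} components: {explained[:k].sum():.3f}")
print(f"variance explained by component {k}:      {explained[k]:.5f}")
```

On real pairwise-comparison or survey data the elbow would be noisier, but a flat spectrum with no elbow would count as evidence against the low-dimensionality claim.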

  1. ^

I’m agnostic about what the specific bottlenecks are here, but I’m thinking of the channels in Steven Byrnes’ steering-system model and the limited number of brain regions that are influenced. See my sketch here.

  2. ^

    AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:

    Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.

  3. ^

    Instruction-following AGI is easier and more likely than value aligned AGI:

    Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

  4. ^

    In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:

    [Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” …

    [Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. …”

    In comparison to that, Gordon Worley offers the intuition that there could be a low-dimensional structure:

    [Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure.”

  5. ^

    Abbeel & Ng give an explicit bound for the required number of expert trajectories:

    it suffices that

    m ≥ 2k / (ε(1−γ))² · log(2k/δ)

    with

    • m: number of expert demonstrations (trajectories)

    • k: feature dimension

    • γ: discount factor (determines horizon)

    • ϵ: target accuracy parameter; above we use 0.1, i.e. a 10% regret tolerance

    • δ: failure probability; above we use 0.05, i.e. a 95% confidence level

    Apprenticeship Learning via Inverse Reinforcement Learning

  6. ^

    Constitutional RL is both more helpful and more harmless than standard RLHF.

    Constitutional AI: Harmlessness from AI Feedback

  7. ^

    This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.

    DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.

    Rethinking Diverse Human Preference Learning through Principal Component Analysis

  8. ^

    retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.

    ...

    By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.

    Alignment is Localized: A Causal Probe into Preference Layers

  9. ^

    Steven Byrnes talks about thousands of lines of pseudocode in the “steering system” in the brain-stem.