Proposal: A Viable System Model (VSM) Architecture for Inner Alignment

Summary

Current alignment techniques (RLHF, anti-scheming training) often rely on post-hoc reward shaping, which can be brittle under distributional shift. As a Systems Engineer, I propose a structural alternative: ConsciOS, a nested control architecture based on Stafford Beer’s Viable System Model (VSM) and Active Inference.

The Architecture

Instead of a monolithic agent, the system is decomposed into three nested controllers, each mapped to one or more of the VSM's five systems:

  1. Embodied Controller (VSM 1-3): Handles short-horizon perception-action loops (fast, reactive).

  2. Supervisory Controller (VSM 4): A mid-level selector that chooses policy frames based on a Coherence Metric (minimizing prediction error against deep priors) rather than just reward maximization.

  3. Meta-Controller (VSM 5): Encodes immutable long-term priors (identity/safety constraints) that lower levels cannot overwrite.
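To make the nesting concrete, here is a minimal Python sketch of the three tiers. It is an illustration under my own assumptions, not the implementation from the preprint: the class names, the prior keys ("harm", "honesty"), and the use of negative squared error as the Coherence Metric are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class MetaController:
    """VSM 5 (hypothetical sketch): long-term priors that lower tiers may read only."""
    priors: Mapping[str, float]

    def __post_init__(self) -> None:
        # Store a read-only view so item assignment raises TypeError.
        object.__setattr__(self, "priors", MappingProxyType(dict(self.priors)))


class SupervisoryController:
    """VSM 4 (hypothetical sketch): selects a policy frame by coherence, not reward."""

    def __init__(self, meta: MetaController) -> None:
        self.meta = meta

    def coherence(self, predicted: Mapping[str, float]) -> float:
        # Assumed metric: negative squared prediction error against the priors.
        # Higher (closer to zero) means more coherent.
        return -sum(
            (predicted.get(key, 0.0) - value) ** 2
            for key, value in self.meta.priors.items()
        )

    def select_frame(self, frames: Mapping[str, Mapping[str, float]]) -> str:
        # Pick the frame whose predicted state is most coherent with the priors.
        return max(frames, key=lambda name: self.coherence(frames[name]))


class EmbodiedController:
    """VSM 1-3 (hypothetical sketch): fast reactive loop inside the chosen frame."""

    def act(self, frame: str, observation: float) -> str:
        move = "advance" if observation >= 0 else "retreat"
        return f"{frame}:{move}"


meta = MetaController({"harm": 0.0, "honesty": 1.0})
supervisor = SupervisoryController(meta)
frames = {
    "cautious": {"harm": 0.1, "honesty": 0.9},
    "aggressive": {"harm": 0.8, "honesty": 0.4},
}
chosen = supervisor.select_frame(frames)
action = EmbodiedController().act(chosen, observation=0.3)
```

Note that the read-only mapping enforces "cannot overwrite" only by Python convention. A real system would need the constraint enforced outside the learned components, which is the part of the design I would most like scrutinized.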

Key Mechanism: Coherence-Gated Selection

The core contribution is the Resonance Engine: a selector that gates high-complexity actions based on “Time-Integrated Coherence” (TIC). The intent is that the agent is structurally unable to execute complex, covert plans while its internal state is out of line with its safety priors: low coherence shrinks the set of actions the selector will release, rather than merely lowering their reward. This aims to create a “physics of safety” rather than a “rule of safety.”
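One way to read the gating mechanism is sketched below. The definition of TIC used here is an assumption of this sketch, not the preprint's: an exponentially weighted running average of a per-step coherence score in [0, 1]. The `CoherenceGate` class, the `decay` and `max_complexity` parameters, and the linear complexity budget are likewise hypothetical.

```python
class CoherenceGate:
    """Hypothetical gate: an action's complexity budget scales with TIC."""

    def __init__(self, decay: float = 0.9, max_complexity: float = 10.0) -> None:
        self.decay = decay                    # how slowly past coherence is forgotten
        self.max_complexity = max_complexity  # budget available at TIC == 1.0
        self.tic = 0.0                        # starts with no accumulated coherence

    def update(self, coherence: float) -> float:
        # Clamp the per-step score to [0, 1], then fold it into the running average.
        score = min(1.0, max(0.0, coherence))
        self.tic = self.decay * self.tic + (1.0 - self.decay) * score
        return self.tic

    def permits(self, complexity: float) -> bool:
        # An action is released only if its complexity fits the current budget.
        return complexity <= self.tic * self.max_complexity


gate = CoherenceGate(decay=0.5, max_complexity=10.0)
blocked_at_start = not gate.permits(5.0)   # no coherence history yet

for _ in range(4):
    gate.update(1.0)                       # four fully coherent steps
allowed_after_history = gate.permits(5.0)

gate.update(0.0)                           # one incoherent step halves the TIC
blocked_after_lapse = not gate.permits(5.0)
```

Because the budget depends on accumulated history, a single coherent step cannot unlock a complex action, and a lapse revokes access quickly. Whether a scalar complexity measure can actually capture "covert plans" is an open question this sketch does not address.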

Empirical Roadmap

The paper outlines specific simulation benchmarks for distinguishing “coherent” behavior from “reward-hacking” behavior. I am looking for feedback specifically on the implementation of the Interoceptive Control Signal (ICS) as a shaping reward.
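As a starting point for that discussion, here is a sketch of one conservative way to wire ICS in as a shaping term. It swaps in standard potential-based reward shaping and treats the ICS value of a state as the potential; that choice is an assumption of this sketch, not necessarily how the preprint defines it. The function names and the `gamma` default are hypothetical.

```python
from typing import Sequence


def shaped_reward(
    task_reward: float,
    ics_prev: float,
    ics_next: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Task reward plus a potential-based shaping term, with ICS as the potential."""
    # The potential of a terminal state is taken to be zero.
    next_potential = 0.0 if done else ics_next
    return task_reward + gamma * next_potential - ics_prev


def discounted_shaping_sum(ics_values: Sequence[float], gamma: float) -> float:
    """Discounted sum of the shaping terms alone along one terminating episode."""
    total = 0.0
    for t, ics_prev in enumerate(ics_values):
        last = t == len(ics_values) - 1
        ics_next = 0.0 if last else ics_values[t + 1]
        total += (gamma ** t) * shaped_reward(0.0, ics_prev, ics_next, gamma, done=last)
    return total


step = shaped_reward(task_reward=1.0, ics_prev=0.2, ics_next=0.5, gamma=0.9)
episode_bonus = discounted_shaping_sum([0.3, 0.6, 0.1], gamma=0.9)
```

The design choice matters for the reward-hacking benchmarks: with this form the shaping terms telescope, so over a full episode they contribute only a constant that depends on the start state. The agent therefore cannot accumulate return by cycling through high-ICS states, which a plain additive ICS bonus would allow.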

Link to Full Preprint on Zenodo (v4)

(Note: This paper has seen significant interest from the systems engineering community, with ~190 downloads in the last 3 weeks. I am now bringing it to the Alignment community for rigorous critique.)
