Proposal: A Viable System Model (VSM) Architecture for Inner Alignment
Summary
Current alignment techniques (RLHF, anti-scheming training) often rely on post-hoc reward shaping, which can be brittle under distributional shift. As a systems engineer, I propose a structural alternative: ConsciOS, a nested control architecture based on Stafford Beer’s Viable System Model (VSM) and Active Inference.
The Architecture
Instead of a monolithic agent, the system is decomposed into three nested controllers (based on VSM tiers):
Embodied Controller (VSM 1-3): Handles short-horizon perception-action loops (fast, reactive).
Supervisory Controller (VSM 4): A mid-level selector that chooses policy frames based on a Coherence Metric (minimizing prediction error against deep priors) rather than just reward maximization.
Meta-Controller (VSM 5): Encodes immutable long-term priors (identity/safety constraints) that lower levels cannot overwrite.
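To make the tiering concrete, here is a minimal Python sketch of the three controllers. The class names mirror the tiers above, but everything else is an illustrative assumption rather than the implementation in the paper: the squared-error coherence score, the dictionary of candidate frames, and the use of a frozen dataclass to stand in for "immutable" priors (which is only a language-level guard, not a security boundary).

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Mapping, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class MetaController:
    """VSM 5: long-term priors. Frozen, so attribute reassignment raises."""
    priors: Vector


@dataclass
class SupervisoryController:
    """VSM 4: selects the policy frame most coherent with the priors."""
    meta: MetaController

    def coherence(self, predicted: Vector) -> float:
        # Assumed metric: negative squared prediction error against the priors.
        return -sum((p - q) ** 2 for p, q in zip(predicted, self.meta.priors))

    def select_frame(self, frames: Mapping[str, Vector]) -> str:
        # Chooses by coherence, not by reward.
        return max(frames, key=lambda name: self.coherence(frames[name]))


class EmbodiedController:
    """VSM 1-3: fast reactive loop operating inside the chosen frame."""

    def act(self, frame: str, observation: float) -> str:
        move = "advance" if observation >= 0 else "retreat"
        return f"{frame}:{move}"


meta = MetaController(priors=(1.0, 0.0))
supervisor = SupervisoryController(meta)
# Hypothetical frames, each mapped to the state it predicts.
frames = {"cautious": (0.9, 0.1), "greedy": (0.0, 1.0)}
chosen = supervisor.select_frame(frames)
action = EmbodiedController().act(chosen, observation=0.3)
```

In this toy setup the supervisor picks the frame whose predicted state sits closest to the priors, and any attempt to assign `meta.priors` from a lower tier raises `FrozenInstanceError`.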
Key Mechanism: Coherence-Gated Selection
The core contribution is the Resonance Engine: a selector that gates high-complexity actions on “Time-Integrated Coherence” (TIC), a running measure of how well the agent's internal state has matched its safety priors over time. The intended effect is that the agent is structurally incapable of executing complex, covert plans while its TIC is low, because the gate never opens for them. This aims to create a “physics of safety” rather than a “rule of safety.”
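One way to picture the gate is the sketch below. It assumes TIC is a leaky (exponentially weighted) integral of a per-step coherence score in [0, 1], and that the required TIC grows linearly with an action's complexity score. The decay, the threshold, and the complexity scale are hypothetical parameters chosen for illustration; the preprint may define TIC differently.

```python
from dataclasses import dataclass


@dataclass
class CoherenceGate:
    decay: float = 0.9           # assumed weight kept from the previous TIC
    base_threshold: float = 0.5  # assumed TIC needed for a complexity-1.0 action
    tic: float = 0.0             # starts with no accumulated coherence

    def update(self, coherence: float) -> float:
        """Fold one per-step coherence sample (0..1) into the running TIC."""
        self.tic = self.decay * self.tic + (1.0 - self.decay) * coherence
        return self.tic

    def permits(self, complexity: float) -> bool:
        """Allow an action only if TIC covers its complexity-scaled threshold."""
        return self.tic >= self.base_threshold * complexity


gate = CoherenceGate()
gate.update(1.0)                       # a single coherent step
simple_ok = gate.permits(0.1)          # low-complexity action
complex_ok_early = gate.permits(1.0)   # high-complexity action, too early

for _ in range(50):                    # sustained coherence over many steps
    gate.update(1.0)
complex_ok_later = gate.permits(1.0)
```

Because the integral builds slowly, a single coherent step opens the gate only for simple actions; complex actions need sustained coherence. Note that this sketch gates on a scalar the agent itself computes, so it says nothing about whether that scalar can be gamed.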
Empirical Roadmap
The paper outlines specific simulation benchmarks for distinguishing “coherent” behavior from “reward-hacking” behavior. I am looking for feedback specifically on the implementation of the Interoceptive Control Signal (ICS) as a shaping reward.
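As a starting point for that discussion, here is one possible form, not necessarily the one in the paper: treating the ICS as a potential function and adding it via potential-based reward shaping, a standard technique whose shaping terms telescope along a trajectory. The function name, `gamma`, and `weight` are assumptions for illustration.

```python
import math


def shaped_reward(task_reward: float, ics_before: float, ics_after: float,
                  gamma: float = 0.99, weight: float = 1.0) -> float:
    """Task reward plus a potential-based shaping term built from the ICS."""
    # Shaping term: gamma * Phi(s') - Phi(s), with Phi assumed to be the ICS.
    return task_reward + weight * (gamma * ics_after - ics_before)


# Hypothetical ICS readings along a short trajectory, with zero task reward.
ics_trace = [0.25, 0.75, 0.5, 1.0]
total_shaping = sum(
    shaped_reward(0.0, before, after, gamma=1.0)
    for before, after in zip(ics_trace, ics_trace[1:])
)
# With gamma = 1 the shaping terms telescope to the net change in ICS.
telescopes = math.isclose(total_shaping, ics_trace[-1] - ics_trace[0])
```

The design question this raises is whether the ICS should enter only as a difference between successive states, as here, or as a raw per-step bonus, which is easier to exploit by lingering in high-ICS states.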
Link to Full Preprint on Zenodo (v4)
(Note: This paper has seen significant interest from the systems engineering community, with ~190 downloads in the last 3 weeks. I am now bringing it to the Alignment community for rigorous critique.)