Aurelius: Proposing Alignment as an Emergent Property
Link post
We’ve published a new whitepaper for Aurelius, a decentralized protocol designed to generate alignment training data under a new data paradigm, one that treats alignment as an emergent property of multi-agent systems. The protocol operates as Subnet 37 on Bittensor, where independent participants compete to evolve alignment environments and world model capabilities.
From the abstract:
“Alignment in biology emerged as the optimal strategy of the atomic unit; by replicating the foundational selection pressures, aligned intelligence emerges from atomic units of moral reasoning. Aurelius is a decentralized protocol that generates corpora designed to align intelligence through a hybrid data paradigm. Current data is either human-generated and unscalable, or synthetic and fabricated. To bridge the two, Organic-Synthetic data is introduced: generated by agents, but grounded in genuine multi-agent dynamics with persistent state, causal dynamics, consequence propagation, and epistemic opacity. Independent participants compete to evolve alignment environments and world model capabilities. The resulting data is scored, and successful world model improvements compound. The environmental equilibrium is one where Dual Life Value is the optimal long-term strategy for all agents. The corpus captures atomic units of moral reasoning from all agent perspectives. A model trained on this corpus integrates self and other robustly, achieving what Aurelius calls experiential alignment, an attractor state reached through the accumulation of decentralized experience rather than optimization toward a central reward signal. In principle, alignment scales with agent intelligence, and there is no upper bound on the capability of models the protocol can align.”
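To make the data flow described in the abstract concrete, here is a minimal sketch of the loop as we read it: agents act in shared environments, the resulting episodes are scored, and only improvements that beat the running baseline are kept, so accepted gains compound. Every name in this sketch (ReasoningUnit, Episode, compound, the score callable) is an illustrative assumption for this post, not an interface defined by the whitepaper.

```python
# Hypothetical sketch of the generate → score → compound loop from the abstract.
# All names and signatures are illustrative assumptions, not the protocol's API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningUnit:
    """One atomic unit of moral reasoning: an agent's choice in context."""
    agent_id: str
    observation: str          # partial view of the world (epistemic opacity)
    action: str
    consequences: List[str]   # effects that propagate to other agents over time


@dataclass
class Episode:
    """An Organic-Synthetic record: generated by agents, grounded in persistent
    multi-agent state rather than free-floating synthetic text."""
    environment_id: str
    units: List[ReasoningUnit]


def compound(corpus: List[Episode],
             candidates: List[Episode],
             score: Callable[[Episode], float],
             baseline: float) -> float:
    """Append only candidates that improve on the current baseline, so that
    successful world model improvements accumulate rather than reset each round."""
    for episode in candidates:
        s = score(episode)
        if s > baseline:
            corpus.append(episode)
            baseline = s
    return baseline
```

The sketch only illustrates the shape of the data and the acceptance rule; the actual scoring and compounding mechanisms are specified in the whitepaper.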
The full whitepaper is linked in this post.
We are seeking feedback from alignment researchers, interpretability theorists, and anyone working on related problems. Specifically:
What are the failure modes we should be modeling?
How should corpus effectiveness be validated beyond downstream benchmark performance?
What environment designs would produce the most informative alignment signal?
For any other questions or clarifications, please reach out: austin@aureliusaligned.ai
Thank you,
The Aurelius Team