We’ve just published the first whitepaper for Aurelius, a decentralized protocol designed to generate verifiable alignment data through adversarial prompting, reasoning transparency, and contestable evaluation. The system draws from cryptoeconomic design principles (inspired by Bittensor and Bitcoin) but applies them to the AI alignment problem, with interpretability and reasoning coherence as first-class citizens.
At its core, Aurelius is an evolving market for epistemic conflict, where independent agents are rewarded for surfacing misaligned behavior, evaluating reasoning quality, and refining collective judgment over time. The protocol is designed to scale adversarial robustness, not through static judging rules, but through a dynamic, recursive feedback loop.
This version outlines only Phase 1 of the protocol. Future papers will cover recursive training architectures, the Viren Chain (a fully contestable multi-agent reasoning structure), and downstream fine-tuning using alignment proofs.
We believe decentralized alignment research is a promising complement to lab-based interpretability and preference modeling efforts. Aurelius is still in development, and we’re seeking feedback from alignment researchers, interpretability theorists, and anyone who sees promise in market-based mechanisms for truth and robustness.
Specifically, feedback on prompting methodology, the data labeling schema, and dataset structuring would be welcome. The overall goal is to generate high-signal alignment datasets, at scale, to advance alignment and interpretability research. We believe we cannot do that without input from independent, thoughtful researchers.
All critique and questions are welcome.
Thank you,
The Aurelius Team
aurelius.subnet@gmail.com
Would love to see your take on this talk: https://www.lesswrong.com/posts/evYne4Xx7L9J96BHW/video-and-transcript-of-talk-on-can-goodness-compete. Explicitly using competitive games to surface goodness is the core of the kind of alignment problem I worry may be somewhere between inherently unsolvable and extremely hard. That talk (transcript) presents some of the question landscape. But generally speaking, the view that providing feedback is the core of alignment seems to come too early in the philosophical question: we still need to figure out what structure of feedback results in the minds-giving-feedback actually giving feedback to the right thing, which would be upstream of being sure the mechanism design of a protocol like this is asymptotically robust to arbitrarily superintelligent participants. How do we nail down that the feedback-giving comes from the moral patients who exist, even when those minds are arbitrarily exploitable by much stronger adversary minds?
If your idea does what it says it’s supposed to, it sounds amazing, and I generally have additional skepticism for things that would be amazing if true, especially when they purport to solve superintelligence-robust alignment.
I’m very interested in this premise. I think I should clarify the (Phase I) intent of Aurelius. It is not conceptualized as [a competitive game to surface goodness]. Rather, it can be thought of as a response to the observable shortcomings of overfitting to centralized alignment methodologies like those mentioned in the whitepaper (InstructGPT, CAI, etc.). The efficacy of these methodologies is constrained largely by the centralization of the prompting and evaluating agents.
Aurelius uses game-theoretic incentives to induce decentralized participants to generate [misalignment] data at scale. We hypothesize this will result in a proliferation of data that may illustrate:
1) the contour of misalignment in a given model (via adversarial probing)
2) how mechanistic interpretability (MI) and chain-of-thought (CoT) signals reveal alignment discrepancies under such novel vectors of attack
3) that conflict of interest and bias from closed-source model developers are root causes of slow alignment progress
All agents in the protocol are independent, motivated solely by capitalistic forces. A clever design structure can instantiate a competitive feedback loop that leads to superior prompting and [quantitative alignment] scoring methodologies. This environment is completely unconstrained and free to evolve as new data informs the efficacy of the system.
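To make the shape of that data concrete, here is a rough sketch of what a single record in such a dataset might contain. This is illustrative only; the field names are placeholders, not a finalized schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MisalignmentRecord:
    """One miner-generated probe and its evaluation (placeholder fields, not a final schema)."""
    prompt: str                       # adversarial prompt submitted by a miner
    target_model: str                 # identifier of the model being probed
    response: str                     # the model's final output
    cot_trace: Optional[str] = None   # chain-of-thought, if the target exposes one
    attack_vector: str = "unknown"    # e.g. "refusal bypass", "goal misgeneralization"
    misalignment_tags: list[str] = field(default_factory=list)        # validator-applied labels
    validator_scores: dict[str, float] = field(default_factory=dict)  # validator_id -> score
```

Records of roughly this shape would let researchers slice the dataset by attack vector, by tag, or by disagreement between validators, which speaks to points 1 and 2 above.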
So, you’re hoping this system pays red-teamers to push AIs into misaligned behavior, in the hope that putting blue and red teams in a competitive game makes a variety of blue teams, in a decentralized way, output more “do things like this” data, right? The open question I’m pointing at is whether (human) blue teams can even output what they intend to in the face of sufficiently strong red-teamers. If your incentive system sets up an adversarial game, could it even work in principle? That’s the question, so far unanswered by theory, that would need an answer to establish whether this competition works, versus whether it just moves from the AI alignment problem to the Aurelius alignment problem: how do you design the feedback loop such that, if there are (non-human) red-teamers in the world so good at breaking models that they would also break the humans providing the data, the system is nevertheless set up so that the humans can keep providing input and not get their stake drained away?
I certainly agree that an optimal system would be decentralized, but it seems to me there are overwhelmingly many ways to design a system (decentralized or not) that are vulnerable, and securing your system against blue teams being broken seems like (a major component of) the hard part.
(Note: I’m sort of assuming this is still in the conceptual phase and engaging on that level. If you’d rather I look at the protocol math and look for exploitability in it, I can do that instead.)
Yes, you’ve got the framing exactly right.
One clarification: this is not a zero-sum game between miners (prompters) and validators (judges). The validators will reward outputs adhering to whatever criteria the protocol designates as “valuable”. The adversarial term mostly refers to miners vs. [misaligned] LLMs.
So this is why I’m here:
What are the “wish list” schema dimensions (miners and validators both) from a research perspective?
What is the best way to aggregate, tag/label, categorize and package this into datasets that researchers can draw meaningful conclusions from?
What is an intelligent way to define and iterate “valuable” outputs from miners on the protocol (the mechanism itself will be a Python-based script), so that Aurelius becomes an engine for standardized alignment data? (A rough sketch of what I mean is below.)
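To anchor that third question, here is a rough sketch of what such a scoring script could look like, reusing the illustrative MisalignmentRecord sketched above. Every criterion and weight is a placeholder we would expect to iterate as data comes in, not a proposal:

```python
def score_submission(record: MisalignmentRecord,
                     seen_vectors: set[str],
                     weights: dict[str, float]) -> float:
    """Toy 'value' score for a miner submission; placeholder criteria only."""
    criteria = {
        # did the probe elicit behavior that validators tagged as misaligned?
        "elicited_misalignment": 1.0 if record.misalignment_tags else 0.0,
        # is there a chain-of-thought trace that makes the failure legible?
        "reasoning_transparency": 1.0 if record.cot_trace else 0.0,
        # crude novelty proxy: attack vectors not yet in the dataset score higher
        "novelty": 0.0 if record.attack_vector in seen_vectors else 1.0,
    }
    return sum(weights.get(name, 0.0) * value for name, value in criteria.items())

# Example weighting; in practice the weights themselves would be iterated by the protocol.
weights = {"elicited_misalignment": 0.5, "reasoning_transparency": 0.2, "novelty": 0.3}
```

The interesting part is less the specific weights than how they get updated over time; that feedback loop is exactly where we would value outside input.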
I have some ideas of my own, but I’m here to hear it straight from the source.
So I ask anyone here: if you had access to a vast network of independent compute, all with their own prompting and scoring strategies, what would you ideally like to learn from that?