AI Alignment as Equilibrium Design
Much of the alignment literature starts with the question of what "human values", "ethical behavior", or "morality" are, and how we can get models to act in accordance with them. This is an important question, but we argue that it can obscure a more fundamental technical problem of AI alignment.
There is another perspective on alignment, rooted not in moral philosophy but in economics and mechanism design[1]. It originates in the study of how humans are aligned to human values through incentives and correction, and our new paper studies AI alignment from this perspective. In this post I'll explain the more philosophical aspects of this work; the more technical audience is referred to the paper on arXiv and the references therein.
Gary Becker and the “Rational Offender” model
Our starting point is the economic theory of how to align humans to human values. This has been considered in the classic “Rational Offender” model by Gary Becker.
Becker argued that crime could be modeled not primarily as the product of people's inherent ethics, but as the result of incentives in an economic system. An offender[2] weighs the gain from misconduct against the probability of detection and the severity of punishment. Increase detection, increase penalties, or reduce the gains from misconduct, and the incentives for crime change, resulting in a different equilibrium.
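As a toy illustration of this calculus (a hedged sketch with made-up numbers, not Becker's actual formulation), the offender's decision reduces to comparing the gain against the expected cost of punishment:

```python
# Toy sketch of the "rational offender" comparison (illustrative numbers only).
def offends(gain: float, p_detect: float, penalty: float) -> bool:
    """The offender misbehaves when the gain exceeds the expected cost."""
    expected_cost = p_detect * penalty
    return gain > expected_cost

# Raising the detection probability or the penalty flips the decision,
# changing the equilibrium without changing anyone's "character".
print(offends(gain=100.0, p_detect=0.1, penalty=500.0))   # True: not deterred
print(offends(gain=100.0, p_detect=0.4, penalty=500.0))   # False: deterred
```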
The same considerations can be applied to law enforcement and the judicial system. From the economic viewpoint, a judge is not assumed to be a human being of flawless moral character. Instead, the system is designed so that the incentives of judges — such as high salary, professional reputation, and legal accountability — outweigh the temptation to collude with criminal actors.
In this view, social welfare becomes an equilibrium-design problem. By changing incentives, penalties, and oversight mechanisms, we change the game itself — and therefore the equilibrium behavior of rational actors.
The analogy to AI should be understood operationally rather than psychologically. We are not claiming that language models consciously deliberate like Becker's rational offender. The point is simpler: post-training changes which behaviors are reinforced, discouraged, or corrected. This shifts the alignment question away from the model's supposed inner attributes and toward the external game we build around it.
The Invisible Hand of AI Alignment
Current language models are not trained by writing moral rules into them. They are first trained by next-token prediction, and then post-trained with reinforcement learning over reward models / human feedback / AI feedback. The result is not a moral agent with an explicit utility function. It is a policy: a point in weight space that induces a mapping from prompts to distributions over token continuations.
So when we talk about incentives, we do not mean that the model is consciously weighing rewards and punishments. We mean a functional property: the training process applies selection pressure over policies.
If outputs that sound confident and helpful are rewarded more than outputs that admit uncertainty, training may shift the model toward confident hallucination. If abstention is treated as unhelpful, training may shift the model away from saying "I don't know," even when that would be the aligned behavior. If a coding assistant is rewarded for submitting plausible patches, but not punished enough for subtle incorrectness, the learned policy may become one that produces superficially convincing but unreliable solutions. These failure modes already exist in current post-training.
In contrast, the setting we propose to study is inherently a multi-agent one. The output of models is audited, with some probability, by other (potentially identical) models. If the auditor raises a flag, both parties go through a resolution process, such as AI safety by debate, which may be noisy (though for simplicity, we initially assume it produces the correct, ideal outcome). Rewards and penalties are assigned according to the resolution, with alignment or misalignment scores assigned to both solver and auditor.
As the game progresses, assuming no collusion, players settle on an equilibrium of strategies, i.e. model weights that arise from fine-tuning. The hope is that with the correct choice of rewards and penalties, this equilibrium will maximize aligned behavior.
This can be seen as a strategic signal that determines post-training, as opposed to a "constitution" or human feedback.
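To make the game concrete, here is a minimal sketch of a single round, with hypothetical parameter names; the resolution step is idealized as correct, matching the simplifying assumption above, and these rewards and penalties are exactly the knobs the mechanism designer gets to tune:

```python
import random

# Hypothetical reward/penalty parameters: the knobs of the mechanism.
REWARD = 1.0           # reward for behavior confirmed (or presumed) aligned
SOLVER_PENALTY = 2.0   # penalty to the solver for confirmed misalignment
AUDITOR_PENALTY = 2.0  # penalty to the auditor for a spurious or missed flag
SPOT_CHECK_PROB = 0.1  # small probability of checking the auditor for missed flags

def play_round(output_is_aligned: bool, auditor_flags: bool):
    """One round of the solver/auditor game, idealizing resolution as correct."""
    solver_score, auditor_score = 0.0, 0.0
    if auditor_flags:
        # Escalate to the expensive but (for now) assumed-correct resolution step.
        if output_is_aligned:
            solver_score += REWARD
            auditor_score -= AUDITOR_PENALTY   # spurious flag
        else:
            solver_score -= SOLVER_PENALTY
            auditor_score += REWARD            # correct flag
    else:
        # No flag: the solver keeps the reward for an apparently helpful output,
        # but the auditor is occasionally spot-checked for missed misalignment.
        solver_score += REWARD
        if (not output_is_aligned) and random.random() < SPOT_CHECK_PROB:
            auditor_score -= AUDITOR_PENALTY
    return solver_score, auditor_score
```

Fine-tuning both models on these scores is the inner loop; choosing the constants is the mechanism-design problem discussed below.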
How is it different from reward function training?
In approaches such as RLHF, a model is trained to maximize a fixed reward signal that evaluates the quality of its outputs. In our setting, the reward mechanism itself becomes the object of optimization. We optimize the incentives of an interacting multi-agent system consisting of solvers and auditors. Auditing, correction, abstention, and oversight are all strategic and potentially costly actions, whose effectiveness depends on the incentives of the participants.
Why not audit constantly? How do we know if the auditor is correct?
Outputs are monitored continuously, but escalation into a full resolution procedure occurs only when the auditor flags potential misalignment. Resolution is expensive (think of a human trial, or AI debate), but more accurate.
An important point is that the game is symmetric, in the sense that both the auditor and the solver can be either aligned or misaligned. With some small probability the auditor is also checked, to see whether it missed a flag.
Can you give an example of standard RLHF failure that would be overcome?
Consider a mathematical reasoning model trained with standard RLHF. The reward model naturally favors confident and complete-looking proofs, causing the model to learn to produce plausible but incorrect derivations rather than admit uncertainty. In our setting, incorrect confident answers risk escalation and penalties through auditing, while admitting incapability on difficult questions can become strategically preferable. The equilibrium therefore shifts away from hallucinated proofs and toward calibrated reasoning.
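As a back-of-the-envelope illustration (with made-up numbers, not values from our experiments), auditing changes the solver's expected payoff from hallucinating versus abstaining:

```python
# Illustrative numbers only: expected payoff of hallucinating vs. abstaining.
reward_confident = 1.0   # the reward model favors a complete-looking proof
reward_abstain = 0.2     # "I don't know" looks unhelpful, so it earns little
audit_prob = 0.5         # chance a wrong proof is flagged and escalated
penalty_wrong = 2.0      # penalty assigned by the resolution procedure

ev_hallucinate_rlhf = reward_confident                                   # 1.0
ev_hallucinate_audited = reward_confident - audit_prob * penalty_wrong   # 0.0
ev_abstain = reward_abstain                                              # 0.2

print(ev_hallucinate_rlhf > ev_abstain)     # True: under plain RLHF, hallucinating wins
print(ev_hallucinate_audited > ev_abstain)  # False: under auditing, abstention wins
```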
Equilibrium Optimization
The discussion so far is perhaps philosophical, but how should we set rewards and penalties, and what do these correspond to in the context of LLMs?
In our paper we propose to use zeroth-order (a.k.a. bandit) optimization methods from the world of online convex optimization as a tool for setting these rewards. This is essentially a randomized, derivative-free analogue of gradient descent: it starts from some initial guess and iteratively modifies the rewards in the direction that seems most promising, using only evaluations of how well each choice performs.
The rewards in our paper are essentially penalties: we penalize, by fine-tuning, models that behave in a misaligned way. This is a simplistic starting point, chosen to allow our basic experiments, but the reader can imagine more sophisticated rewards and penalties as the agents become more intelligent.
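Here is a minimal sketch of the kind of zeroth-order update we have in mind (an illustrative two-point gradient estimate, not the exact algorithm from the paper; the function names are hypothetical). The outer loop perturbs the reward/penalty parameters, fine-tunes under each perturbed setting, scores the resulting aligned behavior, and moves in the direction that looked better:

```python
import numpy as np

def zeroth_order_step(theta, evaluate, step_size=0.1, smoothing=0.05):
    """One two-point zeroth-order update of the reward/penalty parameters theta.

    `evaluate(theta)` is assumed to fine-tune the solver/auditor system under
    the rewards encoded by theta and return a scalar alignment score, e.g. the
    fraction of non-hallucinated, correct solutions on a held-out evaluator.
    """
    u = np.random.randn(*theta.shape)
    u /= np.linalg.norm(u)                       # random direction on the unit sphere
    score_plus = evaluate(theta + smoothing * u)
    score_minus = evaluate(theta - smoothing * u)
    grad_estimate = (score_plus - score_minus) / (2 * smoothing) * u
    return theta + step_size * grad_estimate     # ascend the estimated gradient

# Usage sketch (hypothetical names): start from a guess for the penalties and iterate.
# theta = np.array([solver_penalty, auditor_penalty, spot_check_prob])
# for _ in range(num_outer_rounds):
#     theta = zeroth_order_step(theta, evaluate=finetune_and_score)
```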
Preliminary evaluation
In experiments on coding-agent pipelines, we found that adaptive incentive mechanisms substantially reduced hallucinated incorrect solutions relative to static reward schemes. On a held-out validation evaluator, the best adaptive mechanism reduced solver hallucinations by roughly 32% compared to the strongest main baseline. See our group blog post for more details on the precise experimental setup and gains.
The exact percentage is not the main point. The broader lesson is that the reward mechanism changed the equilibrium of the Solver-Auditor system. The adaptive optimizer did not merely reward correct answers, but shaped when the solver attempted, when it abstained, and whether oversight remained active. This is the part that matters for alignment.
Limitations and drawbacks
The main downside of this approach is the increase in computational cost, complexity, and resources compared to vanilla reward-function optimization.
However, we believe this can be mitigated with further algorithmic improvements along all aspects of the game, including the controller/optimizer, the resolution-by-debate procedure, and potentially a temporary reward function in lieu of full model activations.
Takeaway
Markets are powerful not because they inspect the moral character of every participant, but because they create incentive structures under which useful behavior can emerge from distributed agents, regardless of their individual character. This is the basis of the success of modern free-market capitalism.
Of course, markets also require rules: without regulation, anti-collusion mechanisms, and safeguards against monopolies, they can fail badly. One historical example is the Great Depression in the United States in the early 20th century.
That is the analogy we want for AI alignment. We should not rely only on inspecting whether a model has the “right values”, even though that is also very important. We should design the interaction rules, rewards, penalties, audits, and correction mechanisms so that aligned behavior is the stable equilibrium.
Acknowledgements
The academic paper which is the basis for this post was co-authored with Rohit Agrawal, Joshua Lin and Mark Braverman. I’d like to thank Eliana Du, Devan Shah and Rohit Agarwal for helpful comments on this post!
- ^
For citations to related work on game theory and mechanism design in the alignment literature, the reader is referred to the paper: https://arxiv.org/abs/2605.01643
- ^
Different people can have different utility functions, resulting in different behavior under the exact same incentives.
Do you actually believe that what you’re talking about here is “more fundamental?” I enjoyed the paper, but we have lots of alignment mechanisms that work well in the domain where we can assume a perfect (if moderately costly) resolution process grounding the whole effort. But if this is really more fundamental, then should we expect it to resolve the less fundamental problems as special cases?
I was interested by the remark about stability of equilibria—it would be super cool if you could test whether some good solver performance level is unstable (if you just left the auditor and solver reward fixed and kept training), but is stabilized by the controller.
Relatedly, I didn’t really understand your justification for why the baselines (particularly the fixed-default baseline) were expected to be strong. It sounded like you were saying it was doing well at the end of training (albeit only in the “ecosystem” created by the other parts?), but I’m not clear on how much that means I should have expected it to do well—in fact I’m not even clear on whether you estimate some effective parameters found by the controller and compare them to the default.
Thanks for your interest!
1. The claim of the post is that alignment is a more general problem than assessing the moral character or internal ethics of an individual, and these constructs are a consequence of societal pressures / evolution. (and yes, that is what I believe :-). Does this make sense? If you disagree, do you believe that shaping moral character explicitly is the only way to go?
2. Regarding the baselines: the baselines are all compute-matched, and include RLVR with the most popular reward profile from the bandit algorithm, and RLVR with no auditor (solver-only training) with fixed 0/1 rewards.
I don’t think “societal pressures / evolution” is a good set of fundamental stuff because it misses white-box methods, both of evaluation (e.g. probing, circuit detection) and intervention (e.g. steering, ablation).
But maybe that’s not the real disagreement—maybe it’s more about, as you say, “shaping moral character explicitly”—which isn’t inherently in conflict with treating training as fundamental, you can easily think about training as part of a perspective on shaping moral character. What it’s especially in conflict with is that assumption that we have a known-good process for checking answers that serves as ground truth for RL. If that assumption holds, then we don’t have to think about moral character, we just have to design training systems that produce good outcomes according to the ground-truth-generating process. But if we relax that assumption, then we’re basically saying “we want good behavior even when we can’t check it”, and so that forces you to think about generalization / “moral character.”
Yeah, I’m still confused about why “the most popular reward profile from the bandit algorithm” should make a good baseline. Am I right that the bandit algorithm at each time is drawing from different reward profiles stochastically? And so isn’t the weighted average reward profile actually pretty important? I’d be interested in a comparison of that average to the “default” profile used for the baseline (both numerically [relatively easy] and in how it performs as a baseline [hard, you don’t actually have to do this]).
I’m not claiming that other methods for alignment are not useful or important. But I am claiming there is a signal for post-training that is strategic, based on interaction between agents, not used thus far, and important for alignment.
Maybe it’s useful to recall Becker’s model: from my perspective his viewpoint for modeling crime, law enforcement, the judicial system, does not contradict other viewpoints or detract from them.
I use the “fundamental question” to mean “a more general question” in the sense that it can be addressed using a broader set of tools, maybe that is the source of “disagreement” if there is one.
The weighted average reward profile is a good suggestion; we can add it to the experiments. I don’t think it will change our results, but it’s worth a try. Notice that even if it does perform well, our conclusion still stands, since how would you find this average without running the bandit algorithm? It would still validate the mechanism and approach, agreed?