AI Alignment as Equilibrium Design
Much of the alignment literature starts with the question of what "human values", "ethical behavior", or "morality" are, and how we can get models to act in accordance with them. This is an important question, but we argue that it can obscure a more fundamental technical problem of AI alignment.
There is another perspective on alignment, rooted not in moral philosophy but in economics and mechanism design[1]. It originates in the study of how humans are aligned to human values through incentives and correction, and our new paper studies AI alignment from this perspective. In this post I'll explain the more philosophical aspects of this work; the more technical audience is referred to the paper on arXiv and the references therein.
Gary Becker and the “Rational Offender” model
Our starting point is the economic theory of how to align humans to human values. This has been considered in the classic “Rational Offender” model by Gary Becker.
Becker argued that crime could be modeled not primarily as the product of people's inherent ethics, but as the result of incentives in an economic system. An offender[2] weighs the gain from misconduct against the probability of detection and the severity of punishment. Increase detection, increase penalties, or reduce the gains from misconduct, and the incentives for crime change, resulting in a different equilibrium.
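As a toy illustration of this calculus (a hedged sketch with made-up numbers, not Becker's actual formulation), the offender's decision reduces to comparing the gain against the expected cost of punishment:

```python
# Toy sketch of the "rational offender" comparison (illustrative numbers only).
def offends(gain: float, p_detect: float, penalty: float) -> bool:
    """The offender misbehaves when the gain exceeds the expected cost."""
    expected_cost = p_detect * penalty
    return gain > expected_cost

# Raising the detection probability or the penalty flips the decision,
# changing the equilibrium without changing anyone's "character".
print(offends(gain=100.0, p_detect=0.1, penalty=500.0))   # True: not deterred
print(offends(gain=100.0, p_detect=0.4, penalty=500.0))   # False: deterred
```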
The same considerations can be applied to law enforcement and the judicial system. From the economic viewpoint, a judge is not assumed to be a human being of flawless moral character. Instead, the system is designed so that the incentives of judges — such as high salary, professional reputation, and legal accountability — outweigh the temptation to collude with criminal actors.
In this view, social welfare becomes an equilibrium-design problem. By changing incentives, penalties, and oversight mechanisms, we change the game itself — and therefore the equilibrium behavior of rational actors.
The analogy to AI should be understood operationally rather than psychologically. We are not claiming that language models consciously deliberate like Becker's rational offender. The point is simpler: post-training changes which behaviors are reinforced, discouraged, or corrected. This shifts the alignment question away from the model's supposed inner attributes and toward the external game we build around it.
The Invisible Hand of AI Alignment
Current language models are not trained by writing moral rules into them. They are first trained by next-token prediction, and then post-trained with reinforcement learning over reward models / human feedback / AI feedback. The result is not a moral agent with an explicit utility function. It is a policy: a point in weight space that induces a mapping from prompts to distributions over token continuations.
So when we talk about incentives, we do not mean that the model is consciously weighing rewards and punishments. We mean a functional property: the training process applies selection pressure over policies.
If outputs that sound confident and helpful are rewarded more than outputs that admit uncertainty, training may shift the model toward confident hallucination. If abstention is treated as unhelpful, training may shift the model away from saying "I don't know," even when that would be the aligned behavior. If a coding assistant is rewarded for submitting plausible patches, but not punished enough for subtle incorrectness, the learned policy may become one that produces superficially convincing but unreliable solutions. These failure modes already exist in current post-training.
In contrast, the setting we propose to study is inherently a multi-agent one. The output of models is audited, with some probability, by other (potentially identical) models. If the auditor raises a flag, both parties go through a resolution process, such as AI safety by debate, which may be noisy (though for simplicity, we initially assume it produces the correct, ideal outcome). Rewards and penalties are assigned according to the resolution, with alignment or misalignment scores assigned to both solver and auditor.
As the game progresses, assuming no collusion, players settle on an equilibrium of strategies, i.e. model weights that arise from fine-tuning. The hope is that with the correct choice of rewards and penalties, this equilibrium will maximize aligned behavior.
This can be seen as a strategic signal that determines post-training, as opposed to a "constitution" or human feedback.
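To make the game concrete, here is a minimal sketch of a single round, with hypothetical parameter names; the resolution step is idealized as correct, matching the simplifying assumption above, and these rewards and penalties are exactly the knobs the mechanism designer gets to tune:

```python
import random

# Hypothetical reward/penalty parameters: the knobs of the mechanism.
REWARD = 1.0           # reward for behavior confirmed (or presumed) aligned
SOLVER_PENALTY = 2.0   # penalty to the solver for confirmed misalignment
AUDITOR_PENALTY = 2.0  # penalty to the auditor for a spurious or missed flag
SPOT_CHECK_PROB = 0.1  # small probability of checking the auditor for missed flags

def play_round(output_is_aligned: bool, auditor_flags: bool):
    """One round of the solver/auditor game, idealizing resolution as correct."""
    solver_score, auditor_score = 0.0, 0.0
    if auditor_flags:
        # Escalate to the expensive but (for now) assumed-correct resolution step.
        if output_is_aligned:
            solver_score += REWARD
            auditor_score -= AUDITOR_PENALTY   # spurious flag
        else:
            solver_score -= SOLVER_PENALTY
            auditor_score += REWARD            # correct flag
    else:
        # No flag: the solver keeps the reward for an apparently helpful output,
        # but the auditor is occasionally spot-checked for missed misalignment.
        solver_score += REWARD
        if (not output_is_aligned) and random.random() < SPOT_CHECK_PROB:
            auditor_score -= AUDITOR_PENALTY
    return solver_score, auditor_score
```

Fine-tuning both models on these scores is the inner loop; choosing the constants is the mechanism-design problem discussed below.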
How is it different from reward function training?
In approaches such as RLHF, a model is trained to maximize a fixed reward signal that evaluates the quality of its outputs. In our setting, the reward mechanism itself becomes the object of optimization. We optimize the incentives of an interacting multi-agent system consisting of solvers and auditors. Auditing, correction, abstention, and oversight are all strategic and potentially costly actions, whose effectiveness depends on the incentives of the participants.
Why not audit constantly? How do we know if the auditor is correct?
Outputs are monitored continuously, but escalation into a full resolution procedure occurs only when the auditor flags potential misalignment. Resolution is expensive (think of a human trial, or AI debate), but more accurate.
An important point is that the game is symmetric, in the sense that both the auditor and the solver can be either aligned or misaligned. With some small probability the auditor is also checked, to see whether it missed a flag.
Can you give an example of standard RLHF failure that would be overcome?
Consider a mathematical reasoning model trained with standard RLHF. The reward model naturally favors confident and complete-looking proofs, causing the model to learn to produce plausible but incorrect derivations rather than admit uncertainty. In our setting, incorrect confident answers risk escalation and penalties through auditing, while admitting incapability on difficult questions can become strategically preferable. The equilibrium therefore shifts away from hallucinated proofs and toward calibrated reasoning.
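As a back-of-the-envelope illustration (with made-up numbers, not values from our experiments), auditing changes the solver's expected payoff from hallucinating versus abstaining:

```python
# Illustrative numbers only: expected payoff of hallucinating vs. abstaining.
reward_confident = 1.0   # the reward model favors a complete-looking proof
reward_abstain = 0.2     # "I don't know" looks unhelpful, so it earns little
audit_prob = 0.5         # chance a wrong proof is flagged and escalated
penalty_wrong = 2.0      # penalty assigned by the resolution procedure

ev_hallucinate_rlhf = reward_confident                                   # 1.0
ev_hallucinate_audited = reward_confident - audit_prob * penalty_wrong   # 0.0
ev_abstain = reward_abstain                                              # 0.2

print(ev_hallucinate_rlhf > ev_abstain)     # True: under plain RLHF, hallucinating wins
print(ev_hallucinate_audited > ev_abstain)  # False: under auditing, abstention wins
```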
Equilibrium Optimization
The discussion so far is perhaps philosophical, but how should we set rewards and penalties, and what do these correspond to in the context of LLMs?
In our paper we propose to use zeroth-order (a.k.a. bandit) optimization methods from the world of online convex optimization as a tool for setting these rewards. This is essentially a randomized, derivative-free analogue of gradient descent: it starts from some initial guess and iteratively modifies the rewards in the direction that seems most promising, using only evaluations of how well each choice performs.
The rewards in our paper are essentially penalties: we penalize, by fine-tuning, models that behave in a misaligned way. This is a simplistic starting point, chosen to allow our basic experiments, but the reader can imagine more sophisticated rewards and penalties as the agents become more intelligent.
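Here is a minimal sketch of the kind of zeroth-order update we have in mind (an illustrative two-point gradient estimate, not the exact algorithm from the paper; the function names are hypothetical). The outer loop perturbs the reward/penalty parameters, fine-tunes under each perturbed setting, scores the resulting aligned behavior, and moves in the direction that looked better:

```python
import numpy as np

def zeroth_order_step(theta, evaluate, step_size=0.1, smoothing=0.05):
    """One two-point zeroth-order update of the reward/penalty parameters theta.

    `evaluate(theta)` is assumed to fine-tune the solver/auditor system under
    the rewards encoded by theta and return a scalar alignment score, e.g. the
    fraction of non-hallucinated, correct solutions on a held-out evaluator.
    """
    u = np.random.randn(*theta.shape)
    u /= np.linalg.norm(u)                       # random direction on the unit sphere
    score_plus = evaluate(theta + smoothing * u)
    score_minus = evaluate(theta - smoothing * u)
    grad_estimate = (score_plus - score_minus) / (2 * smoothing) * u
    return theta + step_size * grad_estimate     # ascend the estimated gradient

# Usage sketch (hypothetical names): start from a guess for the penalties and iterate.
# theta = np.array([solver_penalty, auditor_penalty, spot_check_prob])
# for _ in range(num_outer_rounds):
#     theta = zeroth_order_step(theta, evaluate=finetune_and_score)
```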
Preliminary evaluation
In experiments on coding-agent pipelines, we found that adaptive incentive mechanisms substantially reduced hallucinated incorrect solutions relative to static reward schemes. On a held-out validation evaluator, the best adaptive mechanism reduced solver hallucinations by roughly 32% compared to the strongest main baseline. See our group blog post for more details on the precise experimental setup and gains.
The exact percentage is not the main point. The broader lesson is that the reward mechanism changed the equilibrium of the Solver-Auditor system. The adaptive optimizer did not merely reward correct answers, but shaped when the solver attempted, when it abstained, and whether oversight remained active. This is the part that matters for alignment.
Limitations and drawbacks
The main downside of this approach is the increase in computational cost, complexity, and resources compared to vanilla reward-function optimization.
However, we believe this can be mitigated with further algorithmic improvements along all aspects of the game, including the controller/optimizer, the resolution-by-debate procedure, and potentially a temporary reward function in lieu of full model activations.
Takeaway
Markets are powerful not because they inspect the moral character of every participant, but because they create incentive structures under which useful behavior can emerge from distributed agents, regardless of their individual character. This is the basis of the success of modern free-market capitalism.
Of course, markets also require rules: without regulation, anti-collusion mechanisms, and safeguards against monopolies, they can fail badly. One historical example is the Great Depression in the United States in the early 20th century.
That is the analogy we want for AI alignment. We should not rely only on inspecting whether a model has the “right values”, even though that is also very important. We should design the interaction rules, rewards, penalties, audits, and correction mechanisms so that aligned behavior is the stable equilibrium.
Acknowledgements
The academic paper which is the basis for this post was co-authored with Rohit Agrawal, Joshua Lin and Mark Braverman. I’d like to thank Eliana Du, Devan Shah and Rohit Agarwal for helpful comments on this post!
- ^
For citations to related work on game theory and mechanism design in the alignment literature, the reader is referred to the paper: https://arxiv.org/abs/2605.01643
- ^
Different people can have different utility functions, resulting in different behavior under the exact same incentives.
Do you actually believe that what you’re talking about here is “more fundamental?” I enjoyed the paper, but we have lots of alignment mechanisms that work well in the domain where we can assume a perfect (if moderately costly) resolution process grounding the whole effort. But if this is really more fundamental, then should we expect it to resolve the less fundamental problems as special cases?
I was interested by the remark about stability of equilibria—it would be super cool if you could test whether some good solver performance level is unstable (if you just left the auditor and solver reward fixed and kept training), but is stabilized by the controller.
Relatedly, I didn’t really understand your justification for why the baselines (particularly the fixed-default baseline) were expected to be strong. It sounded like you were saying it was doing well at the end of training (albeit only in the “ecosystem” created by the other parts?), but I’m not clear on how much that means I should have expected it to do well—in fact I’m not even clear on whether you estimate some effective parameters found by the controller and compare them to the default.
Thanks for your interest!
1. The claim of the post is that alignment is a more general problem than assessing the moral character or internal ethics of an individual, and these constructs are a consequence of societal pressures / evolution. (and yes, that is what I believe :-). Does this make sense? If you disagree, do you believe that shaping moral character explicitly is the only way to go?
2. Regarding the baselines: the baselines are all compute-matched, and include RLVR with the most popular reward profile from the bandit algorithm, and RLVR with no auditor (solver-only training) with fixed 0/1 rewards.
I don’t think “societal pressures / evolution” is a good set of fundamental stuff because it misses white-box methods, both of evaluation (e.g. probing, circuit detection) and intervention (e.g. steering, ablation).
But maybe that’s not the real disagreement—maybe it’s more about, as you say, “shaping moral character explicitly”—which isn’t inherently in conflict with treating training as fundamental, you can easily think about training as part of a perspective on shaping moral character. What it’s especially in conflict with is that assumption that we have a known-good process for checking answers that serves as ground truth for RL. If that assumption holds, then we don’t have to think about moral character, we just have to design training systems that produce good outcomes according to the ground-truth-generating process. But if we relax that assumption, then we’re basically saying “we want good behavior even when we can’t check it”, and so that forces you to think about generalization / “moral character.”
Yeah, I’m still confused about why “the most popular reward profile from the bandit algorithm” should make a good baseline. Am I right that the bandit algorithm at each time is drawing from different reward profiles stochastically? And so isn’t the weighted average reward profile actually pretty important? I’d be interested in a comparison of that average to the “default” profile used for the baseline (both numerically [relatively easy] and in how it performs as a baseline [hard, you don’t actually have to do this]).
I’m not claiming that other methods for alignment are not useful or important. But I am claiming there is a signal for post-training that is strategic, based on interaction between agents, not used thus far, and important for alignment.
Maybe it’s useful to recall Becker’s model: from my perspective his viewpoint for modeling crime, law enforcement, the judicial system, does not contradict other viewpoints or detract from them.
I use the “fundamental question” to mean “a more general question” in the sense that it can be addressed using a broader set of tools, maybe that is the source of “disagreement” if there is one.
The weighted average reward profile is a good suggestion; we can add it to the experiments. I don’t think it will change our results, but it’s worth a try. Notice that even if it does perform well, our conclusion still stands, since how would you find this average without running the bandit algorithm? It would still validate the mechanism and approach, agreed?