Confucianism in AI Alignment

I hear there’s a thing where people write a lot in November, so I’m going to try writing a blog post every day. Disclaimer: this post is less polished than my median. And my median post isn’t very polished to begin with.

Imagine a large corporation—we’ll call it BigCo. BigCo knows that quality management is high-value, so they have a special program to choose new managers. They run the candidates through a program involving lots of management exercises, simulations, and tests, and select those who perform best.

Of course, the exercises and simulations and tests are not a perfect proxy for the would-be managers’ real skills and habits. The rules can be gamed. Within a few years of starting the program, BigCo notices a drastic disconnect between performance in the program and performance in practice. The candidates who perform best in the program are those who game the rules, not those who manage well, so of course many candidates devote all their effort to gaming the rules.

How should this problem be solved?

Ancient Chinese scholars had a few competing schools of thought on this question, most notably the Confucianists and the Legalists. The (stylized) Confucianists’ answer was: the candidates should be virtuous and not abuse the rules. BigCo should demonstrate virtue and benevolence in general, and in return their workers should show loyalty and obedience. I’m not an expert, but as far as I can tell this is not a straw man—though stylized and adapted to a modern context, it accurately captures the spirit of Confucian thought.

The (stylized) Legalists instead took the position obvious to any student of modern economics: this is an incentive design problem, and BigCo leadership should design less abusable incentives.

If you have decent intuition for economics, it probably seems like the Legalist position is basically right and the Confucian position is Just Wrong. I don’t want to discourage this intuition, but I expect that many people who have this intuition cannot fully spell out why the Confucian answer is Just Wrong, other than “it has no hope of working in practice”. After all, the whole thing is worded as a moral assertion—what people should do, how the problem should be solved. Surely the Confucian ideal of everyone working together in harmony is not wrong as an ideal? It may not be possible in practice, but that doesn’t mean we shouldn’t try to bring the world closer to the Confucian vision.

Now, there is room to argue with Confucianism on a purely moral front—everyone working together in harmony is not synonymous with everyone receiving what they deserve. Harmony does not imply justice. Also, there’s the issue of the system being vulnerable to small numbers of bad agents. These are fun arguments to have if you’re the sort of person who enjoys endless political/philosophical debates, but I bring them up only to emphasize that they are NOT the arguments I’m going to talk about here.

The relevant argument here is not a moral claim, but a purely factual claim: the Confucian ideal would not actually solve the problem, even if it were fully implemented (i.e. zero bad actors). Even if BigCo senior management were virtuous and benevolent, and their workers were loyal and did not game the rules, the poor rules would still cause problems.

The key here is that the rules play more than one role. They act as:

  • Conscious incentives

  • Unconscious incentives

  • Selection rules

In the Confucian ideal, the workers all ignore the bad incentives provided by the rules, so conscious incentives are no longer an issue (as long as we’re pretending that the Confucian ideal is plausible in the first place). Unconscious incentives are harder to fight—when people are rewarded for X, they tend to do more X, regardless of whether they consciously intended to do so. But let’s assume a particularly strong form of Confucianism, where everyone fights hard against their unconscious biases.

That still leaves selection effects.

Even if everyone is ignoring the bad incentives, people are still different. Some people will naturally act in ways which play more to the loopholes and weaknesses in the rules, even if they don’t intend to do so. (And of course, if there are even just a few bad actors, then they’ll definitely still abuse the rules.) And BigCo will disproportionately select those people as their new managers. It’s not necessarily maliciousness; it’s just Goodhart’s Law: make decisions based on a less-than-perfect proxy, and it will cease to be a good proxy.
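
To make the selection effect concrete, here’s a quick toy simulation of my own (the `skill` and `loophole_affinity` traits and all the numbers are made up purely for illustration). Nobody in this model games anything on purpose, yet the selected managers end up disproportionately loophole-friendly anyway:

```python
import random

random.seed(0)

N = 100_000            # candidate pool size (made-up number)
K = 1_000              # number of managers selected
LOOPHOLE_WEIGHT = 1.0  # how strongly the test (unintentionally) rewards the loophole trait

# Each candidate has real management skill plus an unrelated trait which
# happens to score well on the test's loopholes. Nobody games anything on
# purpose; the trait is just natural variation.
candidates = [
    {"skill": random.gauss(0, 1), "loophole_affinity": random.gauss(0, 1)}
    for _ in range(N)
]

def test_score(c):
    # The proxy measures skill, but also rewards the loophole-friendly trait,
    # plus some noise.
    return c["skill"] + LOOPHOLE_WEIGHT * c["loophole_affinity"] + random.gauss(0, 0.5)

selected = sorted(candidates, key=test_score, reverse=True)[:K]

def mean(xs):
    return sum(xs) / len(xs)

print("population mean skill:             ", round(mean([c["skill"] for c in candidates]), 2))
print("selected mean skill:               ", round(mean([c["skill"] for c in selected]), 2))
print("population mean loophole affinity: ", round(mean([c["loophole_affinity"] for c in candidates]), 2))
print("selected mean loophole affinity:   ", round(mean([c["loophole_affinity"] for c in selected]), 2))
```

With these made-up numbers, the selected managers’ average loophole affinity should come out far above the population average, even though no candidate was trying to game anything. That’s the part the Confucian fix can’t touch.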

Takeaway: even a particularly strong version of the Confucian ideal would not be sufficient to solve BigCo’s problem. Conversely, the Legalist answer—i.e. fixing the incentive structure—would be sufficient. Indeed, fixing the incentive structure seems not only sufficient but necessary; selection effects will perpetuate problems even if everyone is harmoniously working for the good of the collective.

Analogy to AI Alignment

The modern ML paradigm: we have a system that we train offline. During that training, we select parameters which perform well in simulations/tests/etc. Alas, some parts of the parameter space may abuse loopholes in the parameter-selection rules. In extreme cases, we might even see malicious inner optimizers: subagents smart enough to intentionally abuse loopholes in the parameter-selection rules.

How should we solve this problem?

One intuitive approach: find some way to either remove or align the inner optimizers. I’ll call this the “generalized Confucianist” approach. It’s essentially the Confucianist answer from earlier, with most of the moralizing stripped out. Most importantly, it makes the same mistake: it ignores selection effects.

Even if we set up a training process so that it does not create any inner optimizers, we’ll still be selecting for the same bad behaviors which a malicious inner optimizer would utilize.

The basic problem is that “optimization” is an internal property, not a behavioral property. A malicious optimizer might do some learning and reasoning to figure out that behavior X exploits a weakness in the parameter selection goal/algorithm. But some other parameters could just happen to perform behavior X “by accident”, without any malicious intent at all. The parameter selection goal/algorithm will be just as weak to this “accidental” abuse as to the “intentional” abuse of an inner optimizer.
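
Here’s a minimal sketch of that point (my own toy setup, not anyone’s actual training code): a buggy proxy objective which is supposed to reward accurate reports but actually just rewards big numbers, plus three hypothetical policies—an honest one, one that outputs a big number “by accident”, and an inner optimizer that explicitly searches for whatever the proxy scores highest. The selection rule only sees scores, so it cannot tell the last two apart:

```python
TRUE_VALUE = 3

def proxy_objective(report):
    # Buggy proxy: meant to reward accurate reports, actually just rewards
    # bigger numbers.
    return report

def honest_policy(observation):
    return TRUE_VALUE

def accidental_policy(observation):
    # Happens to output a large report, with no model of the proxy at all.
    return 10

def inner_optimizer_policy(observation):
    # "Malicious" policy: explicitly searches for whatever report the proxy
    # scores highest.
    return max(range(11), key=proxy_objective)

policies = {
    "honest": honest_policy,
    "accidental": accidental_policy,
    "inner optimizer": inner_optimizer_policy,
}

# The selection rule only sees each policy's behavior and the resulting proxy
# score; it has no access to how that behavior was produced.
scores = {name: proxy_objective(policy(None)) for name, policy in policies.items()}

print(scores)                       # accidental and inner optimizer score identically
print("selected:", max(scores, key=scores.get))
```

The accidental and intentional abusers get identical scores; whatever the selection rule does with one, it does with the other.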

The equivalent of the Legalists’ solution to the problem would be to fix the parameter-selection rule: design a training goal and process which aren’t abusable, or at least aren’t abusable by anything in the parameter space. In alignment jargon: solve the outer alignment problem, and build a secure outer optimizer.

As in the BigCo problem, the Confucian solution is not sufficient for AI alignment. Even if we avoid creating misaligned inner optimizers, bad parameter-selection rules would still select for the same behavior that the inner optimizers would display. The only difference is that we’d select for parameters which behave badly “by accident”.

Conversely, the Legalist solution would be sufficient to solve the problem, and seems necessary if we want to keep the general framework of optimization.

The main takeaway I want to emphasize here is that making our outer objective “secure” against abuse is part of the outer alignment problem. This means outer alignment is a lot harder than many people seem to imagine. If our proxy for human values has loopholes which a hypothetical inner optimizer could exploit, then it’s a bad proxy. If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment. In general, outer alignment contains an implicit “for all” quantifier: for all possible parameter values, our training objective should give a high value only if those parameters would actually perform well in practice.
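
To spell the quantifier out (my own notation, purely illustrative): write $\Theta$ for the parameter space, $U_{\text{train}}(\theta)$ for the training objective, and $U_{\text{deploy}}(\theta)$ for how well those parameters actually perform in practice. Then the requirement is roughly

$$\forall \theta \in \Theta: \quad U_{\text{train}}(\theta) \text{ high} \implies U_{\text{deploy}}(\theta) \text{ high},$$

with the quantifier ranging over every point in the parameter space, not just the points we expect the training process to actually visit.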

The flip side is that, since we probably need to build the Legalist solution anyway, the Confucian solution isn’t really necessary. We don’t necessarily need to make any special effort to avoid inner optimizers, because our selection criteria need to be secure against whatever shenanigans the inner optimizers could attempt anyway.

That said, I do think there are some good reasons to work on inner optimizers. The biggest is imperfect optimization. In this context: our outer optimizer is not going to check every single point in the parameter space, so the basin of attraction of any misaligned behavior matters. If we expect that malicious inner optimizers will take up a larger chunk of the parameter space than “accidental” bad behavior, then it makes sense to worry more about “intentional” than “accidental” malignness. At this point, we don’t really know how to tell how much of a parameter space is taken up by malicious agents, or any sort of inner agents; one example of this kind of problem is Paul’s question about whether minimal circuits are daemon-free.
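
To illustrate what “basin of attraction” is doing here, a toy sketch of my own (the landscape and all numbers are entirely made up): a 1D “parameter space” with a broad hill of genuinely good parameters and a narrow spike where the proxy is exploited. A perfect optimizer always lands on the spike, since it has the highest proxy value, but a local search from a random initialization usually doesn’t, because the spike’s basin is small:

```python
import random

random.seed(0)

def proxy(x):
    # Broad hill of genuinely good parameters around x = 0.3 (peak value 1.0).
    good_region = max(0.0, 1.0 - abs(x - 0.3) / 0.3)
    # Narrow exploit spike around x = 0.8 (peak value 1.5, basin width ~0.1).
    exploit_spike = max(0.0, 1.5 - abs(x - 0.8) * 30.0)
    return max(good_region, exploit_spike)

def hill_climb(x, steps=200, step_size=0.01):
    # Imperfect optimizer: greedy local search from a random initialization.
    for _ in range(steps):
        x = max([x, x - step_size, x + step_size], key=proxy)
    return x

trials = 1000
exploit_count = sum(
    1 for _ in range(trials)
    if abs(hill_climb(random.uniform(0.0, 1.0)) - 0.8) < 0.05
)

# A perfect optimizer would always land on the exploit (highest proxy value),
# but the local search only finds it when initialized inside its small basin.
print(f"fraction of runs ending at the exploit: {exploit_count / trials:.2f}")
```

With this made-up landscape, only a small fraction of runs end up at the exploit. The larger the exploit’s basin (say, because a big chunk of agenty parameter space converges to the same misbehavior), the more often an imperfect optimizer will find it.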

Taking the analogy back to the BigCo problem: if it’s very rare for workers to accidentally game the rules, and most rule-gaming is intentional, then the Confucian solution makes a lot more sense.

I also expect some people will argue that malicious inner optimizers would be more dangerous than accidental bad behavior. I don’t think this argument quite works—in sufficiently-rich parameter spaces, I’d expect that there are non-agenty parameter combinations which exhibit the same behavior as any agenty combinations. Optimization is an internal property, not a behavioral property. But a slight modification of this argument seems plausible: more dangerous behaviors take up a larger fraction of the agenty chunks of parameter space than the non-agenty chunks. It’s not that misaligned inner optimizers are each individually more dangerous than their behaviorally-identical counterparts; it’s that misaligned optimizers are more dangerous on average. This would be a natural consequence to expect from instrumental convergence, for instance: a large chunk of agenty parameter space all converges to the same misbehavior. Again, this threat depends on imperfect optimization—if the optimizer is perfect, then “basin of attraction” doesn’t matter.

Again taking the analogy back to the BigCo problem: if most accidental-rule-abusers only abuse the rules a little, but intentional-rule-abusers usually do it a lot, then the Confucian solution can help a lot.

Of course, even in the cases where the Confucian solution makes relatively more sense, it’s still just an imperfect patch; it still won’t fix “accidental” abuse of the rules. The Legalist approach is the full solution. The selection rules are the real problem here, and fixing the selection rules is the best possible solution.