Robustness as a Path to AI Alignment

[Epistemic Status: Some of what I’m going to say here is true technical results. I’ll use them to gesture in a research direction which I think may be useful; but, I could easily be wrong. This does not represent the current agenda of MIRI overall, or even my whole research agenda.]

Converting Philosophy to Machine Learning

A large part of the work at MIRI is to turn fuzzy philosophy problems into hard math. This sometimes makes it difficult to communicate what work needs to be done, for example to math-savvy people who want to help. When most of the difficulty is in finding a problem statement, it’s not easy to outsource the intellectual labor.

Philosophy is also hard to get traction on. Arguably, something really good happened to the epistemic norms of AI research when things switched over from GOFAI to primarily being about machine learning. Before, what constituted progress in AI was largely up to personal taste. After, progress could be verified by achieving high performance on benchmarks. There are problems with the second mode as well—you get a kind of bake-off mentality which focuses on tricks to get higher performance, not always yielding insight. (For example, top-performing techniques in machine learning competitions often combine many methods, taking advantage of the complementary strengths and weaknesses of each. However, this approach leans on the power of the other methods.) Nonetheless, this is better for AI progress than armchair philosophy and toy problems.

It would be nice if AI alignment could be more empirically grounded. There are serious obstacles to this. Many alignment concerns, such as self-modification, barely show up or seem quite easy to solve when you aren’t dealing with a superintelligent system. However, I’ll argue that there is sometimes a way to turn difficult alignment problems into machine learning problems.

A second reason to look in this direction is that in order to do any good, alignment research has to be used by the people who end up making AGI. The way things look right now, that means it has to be used by machine learning researchers. To that end, anything which puts things closer to a shape which ML researchers are familiar with seems good.

To put it a different way: in the end, we need a successful pipeline from philosophy, to math, to implementation. So far, MIRI has focused on optimizing the first part of that pipeline. I think it may be possible to do research in a way which helps optimize the second part.

We wouldn’t want the research direction to be constrained by this, since in the end we need to figure out what actually works, not what creates the most consumable research. However, I’ll also argue that the research direction is plausible in itself.

The Big Picture

I started working full-time at MIRI about three months ago. In my second week, we had a research retreat in which we spent a lot of time re-thinking how all of the big research problems connect with each other and to the overall goal. I came out of this with the view that things factored somewhat cleanly into three research areas: value learning, robust optimization, and naturalized agency. [Again, this write-up isn’t intended to reflect the view of MIRI as a whole.]

  1. Value Learning: The first problem is to specify what is “good” or what you “want” in enough detail that nothing goes wrong when we optimize for it. This is too hard (since humans seem really bad at knowing what they want in precise terms), so it would be nice to reduce it to a learning problem, if possible. This requires things like learning human concepts (including the concept “human”) and accounting for bounded rationality in learning human values (so that you don’t assume the human wanted to stub its toe on the coffee table).

  2. Robust Optimization: We will probably get #1 wrong, so how can we specify systems which don’t go off-track too badly if their values are misspecified? This includes things like transparency, corrigibility, and planning under moral uncertainty (doing something other than max-expected-value to avoid over-optimizing). Ideally, you want to be able to ask a superintelligent AI to make burritos, and not end up with a universe tiled with burritos. This corresponds approximately to the AAMLS agenda.

  3. Naturalized Agency: Even if we just knew the correct value function and knew how to optimize it in a robust way, we don’t actually know how to build intelligent agents which optimize values. It’s a bit like the difference between knowing that you want to classify images and getting to the point where you optimize neural nets to do so: you have to figure out that squared-error loss plus a regularizer works well, or whatever. We aren’t at the point where we just know what function to optimize neural nets for to get AGI out, value-aligned or no. Existing decision theories, agent frameworks, and definitions of intelligence don’t seem up to the task of examining what rational agency looks like when the agent is embedded in a world which is bigger than it (so the real world is certainly not in the hypothesis space which the agent can represent), the agent can self-modify (so reflective stability and self-trust become important), and the agent is part of the world (so agents must understand themselves as physics and consider their own death).

To storify: AI should do X such that X=argmax(value(x))! WAIT! We don’t know what value is! We should figure that out! WAIT! Trying to argmax a slightly wrong thing often leads to more-than-just-slightly wrong results! We should figure out some other operation than argmax, which doesn’t have that problem! WAIT! The universe isn’t actually in a functional form such that we can just optimize it! What are we supposed to do?

In a sense, this is a series of proxy problems. We actually want #1, but we’ve done relatively little on that front, because it seems much too confusing to make progress on. #2 still cuts relatively close to the problem, and plausibly, solving #2 means not needing to solve #1 as well. More has been done on #2, but it is still harder and more confusing than #3. #3 is fairly far removed from what we want, but working on #3 seems to plausibly be the fastest route to resolving confusions which block progress on #1 and #2.

What I want to outline is a particular way of thinking which seems to be associated with progress on both #2 and #3, and which also seems like a good sign for the philosophy→implementation pipeline.

What is Robustness?

(I’m calling this thing “robustness” in association with #2, but “robust optimization” should be thought of as its own thing—robustness is necessary for robust optimization, but perhaps not sufficient.)

Robustness might be intuitively described as tolerance to errors. Put in a mathematical context, we can model this via an adversary who has some power to trip you up. A robustness property says something about how well you do against such adversaries.

For example, take quantilization. We want an alternative to optimization which is robust to misspecified utility functions. A Bayesian approach might introduce a probability distribution over possible utility functions, and maximize expected utility with respect to that uncertainty. This doesn’t do much to increase our confidence in the outcome; we’ve only pushed the problem back to correctly specifying our uncertainty over utility functions, and problems from over-optimizing a misspecified function seem just about as likely. Certainly we get no new formal guarantees.

So, instead, we model the situation by supposing that an adversary has some bounded amount of power to deceive you about what your true utility function is. The adversary might concentrate all of this on one point which you’ll be very mistaken about (perhaps making a very bad idea look very good), or spread it out across a number of possibilities (making many good ideas look a little worse), or something in between. Under this assumption, a policy which randomizes actions somewhat rather than taking the max-expected-utility action is effective. This gives you some solid guarantees against utility misspecification, unlike the naive Bayesian approach. There’s still more to be desired, but this is a clear improvement.
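
To make the randomization concrete, here is a minimal Python sketch of one common way to set up a quantilizer: rank actions by the proxy utility, keep the top q fraction of a base distribution's probability mass, and sample from what remains. The function names, the explicit base weights, and the toy actions are assumptions made for this illustration, not something specified above.

```python
import random

def quantilizer_distribution(actions, base_weights, proxy_utility, q):
    """Return {action: probability} for a q-quantilizer.

    Keeps the top q fraction of the base distribution's probability mass,
    ranked by the (possibly misspecified) proxy utility, and renormalizes.
    Actions must be hashable.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    budget = q * sum(base_weights)  # base mass we are allowed to keep
    ranked = sorted(zip(actions, base_weights),
                    key=lambda pair: proxy_utility(pair[0]), reverse=True)
    dist, kept = {}, 0.0
    for action, weight in ranked:
        if kept >= budget:
            break
        take = min(weight, budget - kept)  # the last action may be kept only partially
        dist[action] = take / budget
        kept += take
    return dist

def quantilize(actions, base_weights, proxy_utility, q, rng=random):
    """Sample one action from the q-quantilizer's distribution."""
    dist = quantilizer_distribution(actions, base_weights, proxy_utility, q)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Ten equally likely candidate actions; the proxy simply prefers larger numbers.
actions = list(range(10))
dist = quantilizer_distribution(actions, [1.0] * 10, lambda a: a, q=0.5)
print(dist)  # the five highest-ranked actions, each with probability 0.2
```

In this sketch no action ever receives more than 1/q times its base probability. So whatever cost the adversary hides behind the proxy, the expected cost can be at most 1/q times what it would be under the base distribution. Setting q close to 0 recovers something like plain maximization and gives up that bound.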

Mathematically, an adversarial assumption is just a “for all” requirement. Bayesians are more familiar with doing well in expectation. Doing well in expectation has its merits. However, adversarial assumptions create stronger guarantees on performance, by optimizing for the worst case.
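
The difference between the two kinds of requirement shows up even in a tiny decision table. The sketch below is only an illustration: the action names, payoffs, and probabilities are invented, and "worst case" is taken over a small finite list of scenarios.

```python
def best_in_expectation(payoffs, probs):
    """Action with the highest probability-weighted payoff."""
    return max(payoffs, key=lambda a: sum(p * u for p, u in zip(probs, payoffs[a])))

def best_in_worst_case(payoffs):
    """Action whose payoff is highest in the scenario an adversary would pick."""
    return max(payoffs, key=lambda a: min(payoffs[a]))

# payoffs[action][i] is the payoff if scenario i turns out to be the real one.
payoffs = {"bold": [10, 10, -100], "cautious": [3, 3, 2]}
probs = [0.5, 0.49, 0.01]

print(best_in_expectation(payoffs, probs))  # bold
print(best_in_worst_case(payoffs))          # cautious
```

The expectation-maximizer's choice is only as good as the probabilities it was handed. The worst-case choice comes with a statement of the "for all" form: whichever scenario is real, the payoff is at least 2.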

Garrabrant Inductors as Robustness

Garrabrant Inductors (AKA logical inductors) are MIRI’s big success in naturalized agency. (Reflective oracles come in second.) They go a long way to clear up confusions about logical uncertainty, which was one of the major barriers to understanding naturalized agents. When I say that there has been more progress on naturalized agency than on robustness, they’re a big part of the reason. Yet, at their heart is something which looks a lot like a robustness result: the logical induction criterion. You take the set of all poly-time trading strategies on a belief market, and ask that a Garrabrant inductor doesn’t keep losing against any of these forever. This is very typical of bounded-loss conditions in machine learning.

In return, we get reliability guarantees. Sub-sequence optimality means that we get the benefits of the logical induction criterion no matter which subset of facts we actually care about. Calibration means that the beliefs can be treated as frequencies, and unbiasedness means these frequencies will be good even if the proof system is biased (selectively showing evidence on one side more often than the other). Timely learning means (among other things) that it doesn’t matter too much if the theorem prover we’re learning from is slow; we learn to predict things as quickly as possible (eventually).

The logical induction criterion is a relaxation of the standard Bayesian requirement that there be no Dutch Book against the agent. So, the Dutch Book argument for the axioms of probability theory has an adversarial form as well. The same can be said of the money-pump argument which justifies expected utility theory. Bayesians are not so averse to adversarial assumptions as they may seem; lurking behind the very notion of “doing well in expectation” is a “for all” requirement! Bayesians turn up their noses at decision procedures which try to do well in any other than an average-case sense because they know such a procedure is money-pumpable; an adversary could swoop in and take advantage of it!
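
The adversarial shape of the Dutch Book argument is easy to see in the simplest case. The following sketch assumes an agent who is willing to buy or sell a $1 bet on A and a $1 bet on not-A at its stated prices; the function name and the two-bet setup are illustrative assumptions, and the full argument covers the other probability axioms as well.

```python
def dutch_book_profit(price_a, price_not_a):
    """Sure profit a bookie can extract, per $1 stake, from an agent who will
    buy or sell a bet on A at price_a and a bet on not-A at price_not_a."""
    total = price_a + price_not_a
    # Exactly one of the two bets pays out $1, whichever way A turns out.
    if total > 1:
        return total - 1  # sell both bets to the agent: collect total, pay out 1
    return 1 - total      # buy both bets from the agent: pay total, collect 1

print(dutch_book_profit(0.5, 0.75))   # 0.25: incoherent prices, guaranteed loss for the agent
print(dutch_book_profit(0.25, 0.75))  # 0.0: prices sum to 1, nothing to exploit
```

The profit is zero exactly when the two prices sum to 1. Additivity here is not assumed up front; it is what survives every such adversary.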

This funny mix of average-case and worst-case reasoning is at the very foundation of the Bayesian edifice. I’m still not quite sure what to think of it, myself. Philosophically, what should determine when I prefer an average-case argument vs a worst-case one? But, that is a puzzle for another time. The point I want to make here is that there’s a close connection between the types of arguments we see at the foundations of decision theory (Dutch Book and money-pump arguments which justify notions of rationality in terms of guarding yourself against an adversary) and arguments typical of machine learning (bounded-loss properties).

The Dutch Book argument forces a tight, coherent probability distribution, which can’t be both computable and consistent with logic. Relaxing things a little yields a wealth of benefits. What other foundational arguments in decision theory can we relax a little to get rich robustness results?


Path-Independence

These examples are somewhat hand-wavy; what I’ll say here is true, but hasn’t yet brought forth any fruit in terms of AI alignment results. I am putting it here merely to provide more examples of being able to frame decision-theory things as robustness properties.

I’ve mentioned the Dutch Book argument. Another of the great foundational arguments for Bayesian subjective probability theory is Cox’s Theorem. One of the core assumptions is that if a probability can be derived in many ways, the results must be equal. This is related to (but not identical with) the fact that it doesn’t matter what order you observe evidence in; the same evidence gives the same conclusion, every time.

Putting this into an adversarial framework, this means the class of arguments which we accept doesn’t leave us open to manipulation. Garrabrant Induction weakens this (conclusions are not fully independent of the order in which evidence is presented), but also gets versions of the result which a Bayesian can’t, as mentioned in the previous section: it arrives at unbiased probabilities even if it is shown a biased sampling of the evidence, so long as it keeps seeing more and more. (This is part of what saves Garrabrant Induction from my All Mathematicians are Trollable result.)

Another example illustrating the need for path-independence is Pascal’s Mugging. If your utility function is unbounded and your probability distribution is something like the Solomonoff distribution, it’s awfully hard to avoid having divergent expected utilities. What this means is that when you try to sum up your expected utility, the sum you get is dependent on the order you sum things in. This means Pascal can alter your end conclusion by directing your attention to certain possibilities, extracting money from you as a result.
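
Order-dependence of a sum can be seen in a much tamer setting than Solomonoff induction. The sketch below uses the alternating harmonic series as a stand-in (an analogy chosen for this illustration, not a model of the mugging itself): the same terms, taken in a different order, approach a different total.

```python
import math

def alternating_harmonic(n_terms):
    """1 - 1/2 + 1/3 - 1/4 + ... in the usual order."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))

def rearranged(n_blocks):
    """The same terms, taken as two positive terms followed by one negative term."""
    total = 0.0
    odd, even = 1, 2
    for _ in range(n_blocks):
        total += 1 / odd + 1 / (odd + 2)  # next two positive terms
        odd += 4
        total -= 1 / even                 # next negative term
        even += 2
    return total

print(alternating_harmonic(100_000), math.log(2))  # close to ln 2
print(rearranged(100_000), 1.5 * math.log(2))      # close to (3/2) ln 2
```

Whoever controls the order in which the terms are considered controls the answer. That is the lever Pascal is pulling when he directs your attention to particular possibilities.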

It seems to me that path-independent reasoning is a powerful rationality tool which I don’t yet fully understand.

Nuke Goodhart’s Law From Orbit

(Repeatedly. It won’t stay down.)

Goodhart’s Curse is not the only problem in the robust optimization cluster, but it’s close; the majority of the problems there can be seen as one form or another of Goodhart. Quantilizers are significant progress against Goodhart, but not total.

  1. Quantilizers give you a knob you can turn to optimize softer or harder, without clear guidance on how much optimization is safe. If you keep turning up the knob and seeing better results, what would make you back off from cranking it up as far as you can go?

  2. Along similar lines, but from the AI’s perspective, there’s nothing stopping a quantilizer from building a maximizer in order to solve its problem. In our current environment, “implement a superintelligent AI to solve the problem” is far from the laziest solution; but in an environment containing highly intelligent quantilizers, the tools to do so are lying around. It can do so merely by “turning up its own knob”.

Nonetheless, it seems plausible that progress can be made via more robustness results in a similar direction.

Something which has been discussed at MIRI, due to Paul Christiano’s thoughts on the subject, is the Benign Induction problem. Suppose that you have some adversarial hypotheses in your belief mixture, which pose as serious hypotheses and make good predictions much of the time, but are actually out to get you; after amassing enough credibility, at a critical juncture they make bad predictions which do you harm.

One way of addressing this, inspired by the KWIK learning framework, is the consensus algorithm. How it works is, you don’t output any probability at all unless your top hypotheses agree on the prediction; not just on the classification, but on the probability to within some acceptable epsilon tolerance. This acts as an honesty amplifier. Suppose you have a hundred hypotheses, and only one is good; the rest are corrupt. Even if the corrupt hypotheses can coordinate with each other, the one good hypothesis keeps them all in check: nothing they say gets out to the world unless they agree very closely with the good one. (However, they can silence the good hypothesis selectively, which seems concerning!)
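
A minimal sketch of the agreement check might look like the following. It assumes the top hypotheses have already been selected and each has reported a probability for the same event; the function name, the `None` return value for abstaining, and the choice to report the midpoint are all assumptions made for this illustration.

```python
def consensus_predict(predictions, epsilon):
    """Return a probability only if every top hypothesis agrees to within epsilon.

    predictions: probabilities for the same event, one per top hypothesis.
    Returns None (abstain) when the hypotheses disagree by more than epsilon.
    """
    if not predictions:
        return None
    lo, hi = min(predictions), max(predictions)
    if hi - lo <= epsilon:
        return (lo + hi) / 2  # any value in [lo, hi] is within epsilon of every hypothesis
    return None

print(consensus_predict([0.70, 0.71, 0.69], epsilon=0.05))  # roughly 0.70
print(consensus_predict([0.70, 0.10, 0.69], epsilon=0.05))  # None: no consensus, so abstain
```

If even one entry in the list comes from the good hypothesis, any number that gets out is within epsilon of what it said. The second call shows the weakness noted above: a single dissenting hypothesis is enough to force silence.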

A solution to the benign induction problem would be significant progress on the robust optimization problem: if we could trust the output of induction, we could use it to predict what optimization techniques are safe! (There’s much more to be said on this, but this is not the article for it.)

So, quantilizers are robust to adversarial noise in the utility function, and the consensus algorithm is (partially) robust to adversarial hypotheses in its search space. Imagine a world where we’ve got ten more things like that. This seems like significant progress. Maybe then we come up with a Robust Optimization Criterion which implies all the things we want!

Machine learning experts and practitioners alike are familiar with the problems of over-optimization, and the need for regularization, in the guise of overfitting. Goodhart’s Curse is, in a sense, just a generalization of that. So, this kind of alignment progress might be absorbed into machine learning practice with relative ease.
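
The overfitting analogy can be simulated in a few lines. This is a toy sketch under assumptions of my own choosing: true values are standard normal, the proxy is the true value plus independent Gaussian noise, and we pick whichever candidate has the best proxy score.

```python
import random

def optimizers_curse_gap(n_candidates, noise_sd, n_trials, seed=0):
    """Average amount by which the proxy overstates the true value of the
    candidate selected by maximizing the proxy."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n_trials):
        true_vals = [rng.gauss(0, 1) for _ in range(n_candidates)]
        proxy_vals = [t + rng.gauss(0, noise_sd) for t in true_vals]
        best = max(range(n_candidates), key=proxy_vals.__getitem__)
        gap += proxy_vals[best] - true_vals[best]
    return gap / n_trials

# With a single candidate there is no selection, so the gap averages out near zero.
print(optimizers_curse_gap(1, 1.0, 200))
# With many candidates, the selected one tends to be one whose noise was favorable.
print(optimizers_curse_gap(100, 1.0, 200))
```

Searching harder (more candidates) widens the gap between how good the chosen option looks and how good it is, much as fitting harder widens the gap between training and test error.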

Limits of the Approach

One problem with this approach is that it doesn’t provide that much guidance. My notion of robustness here is extremely broad. “You can frame it in terms of adversarial assumptions” is, as I noted, equivalent to “use for all”. Setting out to use universal quantifiers in a theory is hardly much to go on!

It’s not nothing, though; as I said, it challenges the Bayesian tendency to use “in expectation” everywhere. And, I think adding adversarial assumptions is a good brainstorming exercise. If a bunch of people sit down for five minutes and try to come up with new places to inject adversarial assumptions, I’m happy. It just may be that someone comes up with a great new robustness idea as a consequence.