Bounding the impact of AGI

For those of you interested, András Kornai’s paper “Bounding the impact of AGI” from this year’s AGI-Impacts conference at Oxford had a few interesting ideas (which I’ve excerpted below).

Summary:

  1. Acceptable risk tolerances for AGI design can be determined using standard safety engineering techniques from other fields

  2. Mathematical proof is the only available tool to secure the tolerances required to prevent intolerable increases in xrisk

  3. Automated theorem proving will be required so that the proof can reasonably be checked by multiple human minds

Safety engineering

Since the original approach of Yudkowsky (2006) to friendly AI, which sought mathematical guarantees of friendliness, was met with considerable skepticism, we revisit the issue of why such guarantees are essential. In designing radioactive equipment, a reasonable guideline is to limit emissions to several orders of magnitude below the natural background radiation level, so that human-caused dangers are lost in the noise compared to the pre-existing threat we must live with anyway. In the full paper, we take the “big five” extinction events that occurred within the past half billion years as background, and argue that we need to design systems with a failure rate below 10−63 per logical operation.

What needs to be emphasized in the face of this requirement is that the very best physical measurements have only one part in 1017 precision, not to speak of social and psychological phenomena where our understanding is considerably weaker. What this means is that guarantees of the requisite sort can only be expected from mathematics, where our measurement precision is already considerably better.

How reliable is mathematics?

The period since World War II has brought incredible advances in mathematics, such as the Four Color Theorem (Appel and Haken 1976), Fermat’s Last Theorem (Wiles 1995), the classification of finite simple groups (Gorenstein 1982, Aschbacher 2004), and the Poincare conjecture (Perelman 1994). While the community of mathematicians is entirely convinced of the correctness of these results, few individual mathematicians are, as the complexity of the proofs, both in terms of knowledge assumed from various branches of mathematics and in terms of the length of the deductive chain, is generally beyond our ken. Instead of a personal understanding of the matter, most of us now rely on argumentum ad verecundiam: well Faltings and Ribet now think that the Wiles-Taylor proof is correct, and even if I don’t know Faltings or Ribet at least I know and respect people who know and respect them, and if that’s not good enough I can go and devote a few years of my life to understand the proof for good. Unfortunately, the communal checking of proofs often takes years, and sometimes errors are discovered only after a decade has passed: the hole in the original proof of the Four Color Theorem (Kempe 1879) was detected by Heawood in 1890. Tomonaga in his Nobel lecture (1965) describes how his team’s work in 1947 uncovered a major problem in Dancoff (1939):

Our new method of calculation was not at all different in its contents from Dancoff’s perturbation method, but had the advantage of making the calculation more clear. In fact, what took a few months in the Dancoff type of calculation could be done in a few weeks. And it was by this method that a mistake was discovered in Dancoff’s calculation; we had also made the same mistake in the beginning.

To see that such long-hidden errors are are by no means a thing of the past, and to observe the ‘web of trust’ method in action, consider the following example from Mohr (2012).

The eighth-order coefficient A1(8) arises from 891 Feynman diagrams of which only a few are known analytically. Evaluation of this coefficient numerically by Kinoshita and coworkers has been underway for many years (Kinoshita, 2010). The value used in the 2006 adjustment is A1(8) = −1.7283(35) as reported by Kinoshita and Nio (2006). However, (...) it was discovered by Aoyama et al. (2007) that a significant error had been made in the calculation. In particular, 2 of the 47 integrals representing 518 diagrams that had not been confirmed independently required a corrected treatment of infrared divergences. (...) The new value is (Aoyama et al., 2007) A1(8) = 1.9144(35); (111) details of the calculation are given by Aoyama et al. (2008). In view of the extensive effort made by these workers to ensure that the result in Eq. (111) is reliable, the Task Group adopts both its value and quoted uncertainty for use in the 2010 adjustment.

Assuming no more than three million mathematics and physics papers published since the beginnings of scientific publishing, and no less than the three errors documented above, we can safely conclude that the overall error rate of the reasoning used in these fields is at least 10-6 per paper.

The role of automated theorem-proving

That human reasoning, much like manual arithmetic, is a significantly error-prone process comes as no surprise. Starting with de Bruijn’s Automath (see Nederpelt et al 1994) logicians and computer scientists have invested significant effort in mechanized proof checking, and it is indeed only through such efforts, in particular through the Coq verification (Gonthier 2008) of the entire logic behind the Appel and Haken proof that all lingering doubts about the Four Color Theorem were laid to rest. The error in A1(8) was also identified by using FORTRAN code generated by an automatic code generator (Mohr et al 2012).

To gain an appreciation of the state of the art, consider the theorem that finite groups of odd order are solvable (Feit and Thompson 1963). The proof, which took two humans about two years to work out, takes up an entire issue of the Pacific Journal of Mathematics (255 pages), and it was only this year that a fully formal proof was completed by Gonthier’s team (see Knies 2012). The effort, 170,000 lines, 15,000 definitions, 4,200 theorems in Coq terms, took person-decades of human assistance (15 people working six years, though many of them part-time) even after the toil of Bender and Glauberman (1995) and Peterfalvi (2000), who have greatly cleaned up and modularized the original proof, in which elementary group-theoretic and character-theoretic argumentation was completely intermixed.

The classification of simple finite groups is two orders of magnitude bigger: the effort involved about 100 humans, the original proof is scattered among 20,000 pages of papers, the largest (Aschbacher and Smith 2004a,b) taking up two volumes totaling some 1,200 pages. While everybody capable of rendering meaningful judgment considers the proof to be complete and correct, it must be somewhat worrisome at the 10-64 level that there are no more than a couple of hundred such people, and most of them have something of a vested interest in that they themselves contributed to the proof. Let us suppose that people who are convinced that the classification is bug-free are offered the following bet by some superior intelligence that knows the answer. You must enter a room with as many people you can convince to come with you and push a button: if the classification is bug-free you will each receive $100, if not, all of you will immediately die. Perhaps fools rush in where angels fear to tread, but on the whole we still wouldn’t expect too many takers.

Whether the classification of finite simple groups is complete and correct is very hard to say – the planned second generation proof will still be 5,000 pages, and mechanized proof is not yet in sight. But this is not to say that gaining mathematical knowledge of the required degree of reliability is hopeless, it’s just that instead of monumental chains of abstract reasoning we need to retreat to considerably simpler ones. Take, for example, the first Sylow Theorem, that if the order of a finite group G is divisible by some prime power pn, G will have a subgroup H of this order. We are absolutely certain about this. Argumentum ad verecundiam of course is still available, but it is not needed: anybody can join the hive-mind by studying the proof. The Coq verification contains 350 lines, 15 definitions, 90 theorems, and took 2 people 2 weeks to produce. The number of people capable of rendering meaningful judgment is at least three orders of magnitude larger, and the vast majority of those who know the proof would consider betting their lives on the truth of this theorem an easy way of winning $100 with no downside risk.

Further remarks

Not only do we have to prove that the planned AGI will be friendly, the proof itself has to be short enough to be verifiable by humans. Consider, for example, the fundamental theorem of algebra. Could it be the case that we, humans, are all deluded into thinking that an n-th degree polynomial will have n roots? Yes, but this is unlikely in the extreme. If this so-called theorem is really a trap laid by a superior intelligence we are doomed anyway, humanity can find its way around it no more than a bee can find its way around the windowpane. Now consider the four-color theorem, which is still outside the human-verifiable range. It is fair to say that it would be unwise to create AIs whose friendliness critically depends on design limits implied by the truth of this theorem, while AIs whose friendliness is guaranteed by the fundamental theorem of algebra represent a tolerable level of risk.

Recently, Goertzel and Pitt (2012) have laid out a plan to endow AGI with morality by means of carefully controlled machine learning. Much as we are in agreement with their goals, we remain skeptical about their plan meeting the plain failure engineering criteria laid out above.