Towards formalizing universality

(Cross-posted at ai-alignment.com.)

The scalability of iterated amplification or debate seems to depend on whether large enough teams of humans can carry out arbitrarily complicated reasoning. Are these schemes “universal,” or are there kinds of reasoning that work but which humans fundamentally can’t understand?

This post defines the concept of “ascription universality,” which tries to capture the property that a question-answering system A is better-informed than any particular simpler computation C.

These parallel posts explain why I believe that the alignment of iterated amplification largely depends on whether HCH is ascription universal. Ultimately I think that the “right” definition will be closely tied to the use we want to make of it, and so we should be refining this definition in parallel with exploring its applications.

I’m using the awkward term “ascription universality” partly to explicitly flag that this is a preliminary definition, and partly to reserve linguistic space for the better definitions that I’m optimistic will follow.

(Thanks to Geoffrey Irving for discussions about many of the ideas in this post.)

I. Definition

We will try to define what it means for a question-answering system A to be “ascription universal.”

1. Ascribing beliefs to A

Fix a language (e.g. English with arbitrarily big compound terms) in which we can represent questions and answers.

To ascribe beliefs to A, we ask it. If A(“are there infinitely many twin primes?”) = “probably, though it’s hard to be sure” then we ascribe that belief about twin primes to A.

This is not a general way of ascribing “belief.” This procedure wouldn’t capture the beliefs of a native Spanish speaker, or of someone who wasn’t answering questions honestly. But it can give us a sufficient condition, and is particularly useful for someone who wants to use A as part of an alignment scheme.

Even in this “straightforward” procedure there is a lot of subtlety. In some cases there are questions that we can’t articulate in our language, but which (when combined with A’s other beliefs) have consequences that we can articulate. In this case, we can infer something about A’s beliefs from its answers to the questions that we can articulate.

2. Ascribing beliefs to arbitrary computations

We are interested in whether A “can understand everything that could be understood by someone.” To clarify this, we need to be more precise about what we mean by “could be understood by someone.”

This will be the most informal step in this post. (Not that any of it is very formal!)

We can imagine various ways of ascribing beliefs to an arbitrary computation C. For example:

  • We can give C questions in a particular encoding and assume its answers reflect its beliefs. We can either use those answers directly to infer C’s beliefs (as in the last section), or we can ask what set of beliefs about latent facts would explain C’s answers.

  • We can view C as optimizing something and ask what set of beliefs rationalize that optimization. For example, we can give C a chess board as input, see what move it produces, assume it is trying to win, and infer what it must believe. We might conclude that C believes a particular line of play will be won by black, or that C believes general heuristics like “a pawn is worth 3 tempi,” or so on.

  • We can reason about how C’s behavior depends on facts about the world, and ask what state of the world is determined by its current behavior. For example, we can observe that C(113327) = 1 but that C(113327) “would have been” 0 if 113327 had been composite, concluding that C “knows” that 113327 is prime. We can extend this to probabilistic beliefs, e.g. if C(113327) “probably” would have been 0 if 113327 had been composite, then we might say that C knows that 113327 is “probably prime.” This certainly isn’t a precise definition, since it involves considering logical counterfactuals, and I’m not clear whether it can be made precise. (See also ideas along the lines of “knowledge is freedom”.)

  • If a computation behaves differently under different conditions, then we could restrict attention to a particular condition. For example, if a question-answering system appears to be bilingual but answers questions differently in Spanish and English, we could ascribe two different sets of beliefs. Similarly, we could ascribe beliefs to any subcomputation. For example, if a part of C can be understood as optimizing the way data is laid out in memory, then we can ascribe beliefs to that computation about the way that data will be used.

Note that these aren’t intended to be efficient procedures that we could actually apply to a given computation C. They are hypothetical procedures that we will use to define what it means for A to be universal.
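
The behavioral-counterfactual idea from the third bullet can be illustrated with a toy sketch (the probing scheme and all names here are invented for illustration, not part of the post): C is an opaque primality test, and we "ascribe" a belief to it only after checking that its output tracks primality on inputs whose status we can verify independently.

```python
def C(n):
    """An opaque computation: returns 1 iff n is prime (trial division)."""
    if n < 2:
        return 0
    d = 2
    while d * d <= n:
        if n % d == 0:
            return 0
        d += 1
    return 1

def ascribe_primality_belief(computation, n):
    """Infer what `computation` 'believes' about n's primality by checking
    that its output covaries with primality on probe inputs -- a crude
    stand-in for the logical counterfactual 'if n had been composite'."""
    # Probe inputs whose primality we know independently.
    probes = {4: 0, 6: 0, 9: 0, 7: 1, 11: 1, 13: 1}
    if all(computation(k) == v for k, v in probes.items()):
        # Behavior tracks primality, so we read the output on n as a belief.
        return "prime" if computation(n) == 1 else "composite"
    return "no belief ascribed"

belief = ascribe_primality_belief(C, 113327)  # the post's example input
```

The real proposal involves logical counterfactuals rather than probes on other inputs; this only gestures at the flavor of the ascription.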

I’m not going to try to ascribe a single set of beliefs to a given computation; instead, I’ll consider all of the reasonable ascription procedures. For example, I think different procedures would ascribe different beliefs to a particular human, and don’t want to claim there is a unique answer to what a human “really” believes. A universal reasoner needs to have more reasonable beliefs than the beliefs ascribed to a human using any particular method.

An ascription-universal reasoner needs to compete with any beliefs that can be ascribed to C, so I want to be generous with this definition. For example, given a chess-playing algorithm, we might rationalize it as trying to win a game and infer its beliefs about the rules of chess. Or we might rationalize it as trying to look like a human and infer its beliefs about what a human would do. Or something different altogether. Most of these will be kind of crazy ascriptions, but I want to compete with them anyway (competing with crazier beliefs will turn out to just be easier).

It’s not totally clear what counts as a “reasonable” ascription procedure, and that’s the biggest source of informality. Intuitively, the key property is that the ascription itself isn’t doing the “hard work.” In practice I’m using an informal extensional definition, guided by examples like those in the bulleted list.

3. Comparing beliefs

What does it mean to say that one agent is “better-informed” than another?

It’s natural to try to express this in terms of empirical information about the world, but we are particularly interested in the different inferences that agents are able to draw from the same data. Another natural approach is to compare their “knowledge,” but I have no idea how to define knowledge or justified belief. So I’m reduced to working directly with sets of beliefs.

Consider two sets of beliefs, described by the subjective expectations 𝔼¹ and 𝔼². What does it mean to say that 𝔼¹ is better-informed than 𝔼²?

This framing makes it tempting to try something simple: “for every quantity, 𝔼¹’s belief about that quantity is more accurate.” But this property is totally unachievable. Even if 𝔼¹ is obtained by conditioning 𝔼² on a true fact, it will almost certainly happen to update in the “wrong” direction for some claims.

We will instead use a subjective definition, i.e. we’ll define this concept from a particular epistemic position represented by another subjective expectation 𝔼.

Then we say that 𝔼¹ dominates 𝔼² (w.r.t. 𝔼) if, for every bounded quantity X and for every “nice” property Φ:

  • 𝔼[X|Φ(𝔼¹, 𝔼²)] = 𝔼[𝔼¹[X]|Φ(𝔼¹, 𝔼²)]

(By “nice” I mean something like: simple to define and open in the product topology, viewing 𝔼¹ and 𝔼² as infinite tables of numbers.)

Intuitively, this means that 𝔼 always “trusts” 𝔼¹, even if given arbitrary information about 𝔼¹ and 𝔼². For example, if 𝔼 was told that 𝔼¹[X] ≈ x and 𝔼²[X] ≈ y, then it would expect X to be around x (rather than y). Allowing arbitrary predicates Φ allows us to make stronger inferences, effectively that 𝔼 thinks that 𝔼¹ captures everything useful about 𝔼².

I’m not sure if this is exactly the right property, and it becomes particularly tricky if the quantity X is itself related to the behavior of 𝔼¹ or 𝔼² (continuity in the product topology is the minimum plausible condition to avoid a self-referential paradox). But I think it’s at least roughly what we want and it may be exactly what we want.

Note that dominance is subjective, i.e. it depends on the epistemic vantage point 𝔼 used for the outer expectation. This property is a little bit stronger than what we originally asked for, since it also requires 𝔼 to trust 𝔼¹, but this turns out to be implied anyway by our definition of universality so it’s not a big defect.

Note that dominance is a property of the descriptions of 𝔼¹ and 𝔼². There could be two different computations that in fact compute the same set of expectations, such that 𝔼 trusts one of them but not the other. Perhaps one computation hard-codes a particular result, while the other does a bunch of work to estimate it. Even if the hard-coded result happened to be correct, such that the two computations had the same outputs, 𝔼 might trust the hard work but not the wild guess.
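
The dominance condition can be checked numerically in a toy model (a minimal sketch with invented parameters, not a general verification of the definition): 𝔼¹ conditions on two noisy readings of a coin X, 𝔼² sees only the first reading, so 𝔼¹ is better-informed. On the event Φ where the two estimates disagree, the empirical truth tracks 𝔼¹'s estimate rather than 𝔼²'s, matching 𝔼[X | Φ(𝔼¹, 𝔼²)] = 𝔼[𝔼¹[X] | Φ(𝔼¹, 𝔼²)].

```python
import random

def posterior(readings, accuracy=0.8):
    """P(X = 1 | readings) for a fair coin, each reading correct w.p. accuracy."""
    like1 = like0 = 0.5
    for r in readings:
        like1 *= accuracy if r == 1 else 1 - accuracy
        like0 *= accuracy if r == 0 else 1 - accuracy
    return like1 / (like1 + like0)

def simulate(n=200_000, seed=0):
    """Empirical E[X] on the event 'E2 is confident X=1 but E1 is uncertain'."""
    rng = random.Random(seed)
    hits, total = 0, 0
    for _ in range(n):
        x = rng.randint(0, 1)
        r1 = x if rng.random() < 0.8 else 1 - x
        r2 = x if rng.random() < 0.8 else 1 - x
        e1 = posterior((r1, r2))   # better-informed estimate (both readings)
        e2 = posterior((r1,))      # coarser estimate (first reading only)
        if e2 > 0.7 and abs(e1 - 0.5) < 0.01:   # the event Phi
            hits += x
            total += 1
    return hits / total

freq = simulate()   # lands near E1's estimate (0.5), far from E2's (0.8)
```

This is only a sanity check that the tower property behaves as the definition intends when 𝔼¹ genuinely refines 𝔼²; nothing here touches the hard cases where Φ is a complicated predicate of the two expectation tables.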

4. Complexity and parameterization

There are computations with arbitrarily sophisticated beliefs, so no fixed A can hope to dominate everything. To remedy this, rather than comparing to a fixed question-answerer A, we’ll compare to a parameterized family A[C].

I’ll consider two different kinds of potentially-universal reasoners A:

  • In the “idealized” case, A[C] depends only on the complexity of C.
    For example, we might hope that an n-round debate dominates any beliefs that could be ascribed to a fast computation with (n-1) rounds of alternation. In particular, this A[C] is the same for any two computations C of the same complexity.

  • In the “practical” case, A[C] depends on the complexity of C but also uses the computation C as a hint. For example, if C is the training process for a neural net, then we might take A[C] to be a debate in which the debaters are able to share weights and activations with the neural net throughout the entire training process.

I’m generally interested in the case where A[C] is only slightly more powerful than C itself. This mirrors the setting where a universal Turing machine is able to run any other Turing machine with only a modest slowdown.

Putting it all together

We say that a set of beliefs 𝔼ᴬ epistemically dominates a computation C (w.r.t. some beliefs 𝔼 and language L) if the beliefs ascribed to A by the “straightforward” procedure, using L, dominate (w.r.t. 𝔼) the beliefs ascribed to C by any reasonable ascription procedure.

We say that a family of question-answering systems A[·] is ascription universal (w.r.t. 𝔼 and L) if A[C] epistemically dominates C for every computation C.

II. Discussion

Why is (subjective) dominance sufficient?

This universality condition requires that we believe that A[C] is better-informed than C. Naively we might have wanted it to actually be the case that A[C] is better-informed than C; the stronger condition is clearly unachievable, but why should we be satisfied with the weaker condition?

In applications of this property, the subjective condition is what we need in order for us to believe that A[C] will cope with the challenges posed by C. For example, suppose that C formulates a plan to “trick” A[C]. Then the subjective universality condition implies that we don’t expect C to succeed.

This isn’t as good as actually knowing that C won’t succeed. But I think it should be good enough for us — the reason we are thinking about AI safety is because we are concerned that something bad will happen. If we find a technique that defuses this argument, then we’ve addressed the motivating problem. It may still be the case that bad things happen (and we should still search for additional reasons that bad things might happen), but we don’t particularly expect them to.

Of course if you select over a large number of computations, then you may find one that will succeed in tricking A. But if we are concerned about that, then we can instead apply ascription universality to the entire process including the selection.

Why trust opaque computation?

If C uses some clever heuristics that I don’t understand, then C’s “beliefs” might be excellent, but I might not expect them to be excellent. In this sense understanding may seem almost vacuous. If there is some heuristic that I trust, wouldn’t A just use it?

To see why the definition is demanding, consider the special case where C performs an extensive search to find a computation that works well empirically. For example, C might be the following computation:

  • Start with a training set of (image, label) pairs.

  • Search over simple programs to find one that makes good predictions.

  • Run that simple program on a new image to predict its label.

In this case, we can ascribe beliefs to C about the contents of the new image. And because those beliefs are coming from a simple program that works empirically, I expect them to be accurate (in some respects).
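
A minimal sketch of such a computation C (the data and the hypothesis space of "simple programs" are invented for illustration): enumerate threshold rules from a tiny hypothesis space, keep the first one that fits the training set, and run it on a new input.

```python
# Each "image" is a feature vector; label 1 means "dog".
train = [((0.9, 0.1), 1), ((0.8, 0.3), 1), ((0.2, 0.7), 0), ((0.1, 0.9), 0)]

def make_rule(feature_index, threshold):
    """A 'simple program': predict 1 iff the chosen feature exceeds the threshold."""
    return lambda x: 1 if x[feature_index] > threshold else 0

def search_simple_programs(data):
    """Enumerate candidate rules; return the first that fits all of the data."""
    for i in (0, 1):
        for t in (0.15, 0.5, 0.85):
            rule = make_rule(i, t)
            if all(rule(x) == y for x, y in data):
                return rule
    return None

rule = search_simple_programs(train)
prediction = rule((0.85, 0.2))   # C's "belief" about the new image's label
```

The ascribed belief here would be something like "the new image's first feature is on the dog side of the threshold" — accurate only because the rule was selected for fitting the training data.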

For example, a simple classifier C may “believe” that the new image contains a particular curve that typically appears in images labeled “dog;” or a really sophisticated classifier may perform complex deductions about the contents of the scene, starting from premises that were empirically validated on the training set.

So it’s not OK for A to simply ignore whatever heuristics C is using — if those heuristics have the kind of empirical support that makes us think they actually work, then A needs to be able to understand everything that those heuristics imply about the domain.

Why be so general?

I’ve formulated universality as competing with arbitrary computations C. It seems totally possible that the form of C discussed in the last section — searching for a program that works well in practice and then using it in a new situation — is so central that the definition of universality should focus entirely on it.

One reason to use the broader definition is because sometimes this “selection” process can be embedded in a non-trivial way in a larger computation. For example, if I have a sufficiently large group of humans, I might expect memetic selection to occur and produce systems that could be said to have “beliefs,” and I’d like universal systems to dominate those beliefs as well.

The other reason to use this very general definition is because I don’t see an easy way to simplify the definition by using the additional structural assumption about C. I do think it’s likely there’s a nicer statement out there that someone else can find.

Universal from whose perspective?

Unfortunately, achieving universality depends a lot on the epistemic perspective 𝔼 from which it is being evaluated. For example, if 𝔼 knows any facts, then a universal agent must know all of those facts as well. Thus “a debate judged by Paul” may be universal from Paul’s perspective, but “a debate arbitrated by Alice” cannot be universal from my perspective unless I believe that Alice knows everything I know.

This isn’t necessarily a big problem. It will limit us to conclusions like: Google engineers believe that the AI they’ve built serves the user’s interests reasonably well. The user might not agree with that assessment, if they have different beliefs from Google engineers. This is what you’d expect in any case where Google engineers build a product, however good their intentions.

(Of course Google engineers’ notion of “serving the user’s interests” can involve deferring to the user’s beliefs in cases where they disagree with Google engineers, just as they could defer to the user’s beliefs with other products. That gives us reason to be less concerned about such divergences, but eventually these evaluations do need to bottom out somewhere.)

This property becomes more problematic when we ask questions like: is there a way to seriously limit the inputs and outputs to a human while preserving universality of HCH? This causes trouble because even if limiting the human intuitively preserves universality, it will effectively eliminate some of the human’s knowledge and know-how that can only be accessed on large inputs, and hence violate universality.

So when investigating schemes based on this kind of impoverished human, we would need to evaluate universality from some impoverished epistemic perspective. We’d like to say that the impoverished perspective is still “good enough” for us to feel safe, despite not being good enough to capture literally everything we know. But now we risk begging the question: how do we evaluate whether the impoverished perspective is good enough? I think this is probably OK, but it’s definitely subtle.

I think that defining universality w.r.t. 𝔼 is an artifact of this definition strategy, and I’m optimistic that a better definition wouldn’t have this dependence, probably by directly attacking the notion of “justified” belief (which would likely also be useful for actually establishing universality, and may even be necessary). But that’s a hard problem. Philosophers have thought about very similar problems extensively without making the kind of progress that seems adequate for our purposes, and I don’t see an immediate angle of attack.

III. Which A might be universal?

Two regimes

I’m interested in universality in two distinct regimes:

  • Universality of idealized procedures defined in terms of perfect optimization, such as debate under optimal play or max-HCH, where A[C] depends only on the computational complexity of C.

  • Universality of practical variants, in which the perfect optimization is replaced by a bounded optimization, which might use the computation C as a “hint.” For example, we might consider amplification or debate where the agents are allowed to inspect a transcript of C; or if C is a neural net training process, we might train them jointly with C.

Idealized models

A[C] could be:

  • A debate, judged by a human, with perfect debaters, whose length depends on the complexity of C. (Hopefully in such a way that A[A[C]] is only a slightly longer debate than A[C].)

  • max-HCH, using perfect optimization rather than a bounded optimizer. Again, we can impose a budget limit that depends on the complexity of C, ideally such that A[A[C]] does not involve a much larger computation than A[C].

  • Use HCH to implement two functions, Info and Answer. Define A[C](Q) to be Answer(Q, a*), where a* = argmax Info(a, Q), and the optimization is over advice strings a of length comparable to the description of C.

I don’t think any of these would be universal if instantiated with an open-ended human policy — providing arbitrarily complex inputs to a human is likely to lead to bad behavior. Instead, I imagine restricting the human to some small enough space of inputs that they can behave “reasonably” over the whole space, while still being rich enough to be universal. Even so, I think additional measures might be needed to achieve security (since a team of humans may behave badly in some exotic corners of the input space, even if individual humans are sheltered from exotic inputs).

Practical models

In practice I’m particularly interested in the case where C has the special form described in the “Why trust opaque computation?” section: it trains a neural net on some dataset, and then runs that neural network on new inputs.

A[C] could be:

  • Debates arbitrated by a human where the debaters are trained jointly with C, and can make references to individual examples from the dataset.

  • max-HCH where the agent is trained jointly with C, and the overseer can consult individual entries from the dataset.

  • HCH, combined with interpretability techniques that allow it to get information about what C is thinking and to relate its behavior to the dataset on which it was trained.

For each of these practical variants, I don’t expect it to be universal unless the corresponding idealized model is. So we can think of dividing the universality question into two pieces:

  • Would our procedure be universal if the optimization was performed perfectly?

  • Is weight sharing / interpretability strong enough to solve the optimization problems needed for universality?

I think none of the suggestions in this section are plausibly universal without some additional mechanisms, and so even if HCH or debate is universal, additional ideas are needed in order to translate those to practical universal systems. I’m still inclined to focus on the idealized case, because I want to better understand what is needed for universality without thinking about algorithmic difficulties.

IV. Which C are hard to epistemically dominate?

The definition in this post was somewhat informal, but the discussion in this appendix will be extremely informal. The purpose is to communicate some of my early thinking about universality and my sense for what a universal reasoner might look like.

The first three examples will consider programs C with the following form:

  • Iterate over programs to find an f that successfully predicts the answers to some dataset of historical questions for which we know the answer.

  • Use that program to predict the answers to future questions.

The different examples explore features of f that might make it difficult to dominate the overall computation.

Deduction

Suppose f takes the following form:

  • Given a question and context, it performs a short series of deductions from facts in the context, together with some axioms (e.g. “Bob’s eyes are more likely to twitch when he is lying,” the rules of poker, etc.)

  • When it arrives at answers that constrain the possible answers to the given question, it outputs an answer consistent with those constraints.

We could attribute the statements produced by this process to C as beliefs. We expect this set of beliefs to have useful information because it was selected for making correct predictions.

Fortunately, if HCH is able to see the training dataset then it may be able to arrive at similarly accurate views:

  • It can compute that P(observations | axioms) is high by following along with the deductive process used by f.

  • Bayes implies that the axioms are likely to be true (fully capturing the argument which leads us, from our current perspective, to expect them to be true).

  • Once HCH knows the axioms are likely to be true, it can follow along with the deductive process to reproduce all of f’s beliefs.
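
The first two steps can be mirrored by a toy Bayesian calculation (all the numbers are invented for illustration): if the axioms assign much higher likelihood to the historical observations than the alternative hypothesis that f just guessed, conditioning pushes the posterior on the axioms near 1, even from a skeptical prior.

```python
def posterior_axioms(prior, p_obs_given_axioms, p_obs_given_not):
    """P(axioms | observations) via Bayes' rule, with one alternative hypothesis."""
    num = prior * p_obs_given_axioms
    return num / (num + (1 - prior) * p_obs_given_not)

# Say each of 20 historical answers is predicted correctly with probability 0.9
# under the axioms, and 0.5 under the alternative ("f just guessed").
p_obs_axioms = 0.9 ** 20
p_obs_guess = 0.5 ** 20

# Even from a skeptical 1% prior, the posterior on the axioms ends up near 1.
p = posterior_axioms(prior=0.01,
                     p_obs_given_axioms=p_obs_axioms,
                     p_obs_given_not=p_obs_guess)
```

The third step — reusing the deductions once the axioms are trusted — is where HCH does the same work as f, so no further inference is needed.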

Modeling

Suppose f takes the following form:

  • Given a question Q and context, set up a relevant physical situation.

  • Use a dynamics model M to predict what will happen at some future time.

  • Use the resulting state to answer Q.

We could say that C believes everything implied by the outcome of this physical simulation. We expect these beliefs to be accurate because f has predicted well in the past.

As in the deduction case, hopefully HCH can compute that P(observations | M is accurate) is high, since it can also compute the consequences of M. Then Bayes implies the dynamics model is accurate, and HCH can use that model to compute physical states.

Inferring all the beliefs from a dynamics model is not trivial though. As an extreme example, if f is performing an atom-by-atom simulation of a room, and that room contains Alice and Bob, then we could ascribe extensive beliefs about Alice and Bob to the computation C.

(Here we run head on into the fuzziness about what counts as a “reasonable” ascription procedure, but for the moment I’ll assume that some reasonable procedure ascribes beliefs about Alice and Bob to the computation.)

To compete with these ascriptions, HCH needs to infer those high-level beliefs about Alice and Bob from the low-level computation involving atoms. One way to do this is to search over possible “bridging” hypotheses that relate low-level physical facts to high-level facts about the environment. If such a hypothesis can explain additional high-level facts, then a Bayesian can learn that it is true. Similarly, if the bridging hypothesis relates facts about the model to constraints we know from the high-level interpretation, then the Bayesian can potentially use that as evidence. (This kind of reasoning will be discussed in a bit more detail in the next section.)

We could further hope that searching for a bridging hypothesis isn’t much harder than performing the original search over low-level physics, given that the low-level physics needed to explain a bunch of high-level facts and so already must encode some part of that correspondence.
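
A crude sketch of this search (the low-level state and the candidate bridging maps are all invented for illustration): score each candidate correspondence from low-level facts to high-level facts by how many independently known high-level facts it reproduces, and keep the best.

```python
# Low-level state of the simulation (stand-ins for "facts about atoms").
low_level_state = {"region_a_energy": 9.0, "region_b_energy": 2.0}

# High-level facts we know by other means.
known_high_level_facts = {"alice_is_speaking": True, "bob_is_speaking": False}

# Candidate bridging hypotheses: maps from low-level state to high-level facts.
bridging_candidates = {
    "energy -> speaking": lambda s: {
        "alice_is_speaking": s["region_a_energy"] > 5.0,
        "bob_is_speaking": s["region_b_energy"] > 5.0,
    },
    "energy -> silence": lambda s: {
        "alice_is_speaking": s["region_a_energy"] < 5.0,
        "bob_is_speaking": s["region_b_energy"] < 5.0,
    },
}

def best_bridge(state, facts, candidates):
    """Pick the bridging hypothesis that explains the most known facts."""
    def score(bridge):
        derived = bridge(state)
        return sum(derived[k] == v for k, v in facts.items())
    return max(candidates, key=lambda name: score(candidates[name]))

winner = best_bridge(low_level_state, known_high_level_facts, bridging_candidates)
```

A Bayesian version would weigh explanatory power against the prior complexity of each bridge rather than just counting matches, but the shape of the search is the same.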

(Note that the “deduction” example in the previous case could also involve alien concepts or models, in which case the same kind of work would be needed.)

Alien reasoning

In the previous section we described two styles of reasoning we already understand. But there are probably many kinds of reasoning that work well in practice but that would be more alien, and those might be more challenging. This section will explore one example in some detail to try to help anchor our reasoning about the general phenomenon. It will also elaborate on some of the reasoning about “bridging” hypotheses mentioned in the last section.

Suppose that our predictions are always of the same form (e.g. what is the probability the stock market will go up today), and f works as follows (the details are long but not very important):

  • Find the PSD matrix A with maximum log determinant subject to the constraints in the next bullet points, then output the (0, 0) entry.

  • There is an implicit correspondence between the rows/columns of A, and some uncertain properties X(0), X(1), X(2), … (which we’ll view as 0–1 variables), where X(0) is the property we want to forecast.

  • If the (i, j) entry of A represented the expectation E[X(i)X(j)], then the matrix would necessarily satisfy a bunch of constraints, which we impose on A. For example:

  • If the context implies that X(i) = 1, then E[X(i)X(j)] = E[X(j)] = E[X(j)²], so A(i, j) = A(j, j).

  • If X(i) and X(j) together imply X(k), then we must have E[X(i)X(j)] ≤ E[X(i)X(k)] and hence A(i, j) ≤ A(i, k).

  • For any constants a, b, …, E[(a X(1) + b X(2) + …)²] ≥ 0 — i.e., the matrix A must be PSD.

The chosen matrix A(opt) corresponds to a set of beliefs about the propositions X(i), and we can ascribe these beliefs to C. Because f predicts well, we again expect these beliefs to say something important about the world.
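
A tiny worked instance of this procedure (two variables, with a pure-Python grid search standing in for a real log-det solver; maximizing det is equivalent to maximizing log det since log is monotone): suppose the context implies X(1) = 1, which by the constraints above forces A(1,1) = 1 and A(0,1) = A(0,0). The determinant p − p² is then maximized at p = 1/2, so the procedure forecasts probability 1/2 for the otherwise-unconstrained X(0) — the maximum-entropy answer.

```python
def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

def forecast():
    """Grid search for p = A(0,0) maximizing det of A = [[p, p], [p, 1]]."""
    best_p, best_det = None, float("-inf")
    for i in range(1, 1000):
        p = i / 1000
        # A = [[p, p], [p, 1]] satisfies the constraints; PSD iff det >= 0
        # (since the diagonal entries are nonnegative).
        d = det2(p, p, p, 1.0)
        if d >= 0 and d > best_det:
            best_det, best_p = d, p
    return best_p   # the (0, 0) entry is the output forecast
```

In the real procedure the matrix is huge and the constraints come from many logical relationships among the X(i); the point of the toy case is only that the log-det objective pushes unconstrained entries toward maximally uncertain values.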

I chose this procedure f in part because we can give a kind of argument for why the matrix A(opt) should tend to encode accurate beliefs. But I don’t think that a universal reasoner can make use of that argument:

  • Finding the argument that f works is an additional problem, beyond finding f itself, which might be much harder.

  • A comprehensible version of that argument may be much larger than the strategy itself, so even in idealized cases like debate with perfect optimization, we may need to increase the scale.

  • I don’t expect that all “good” reasoning strategies have clean understandable arguments in their favor (and even in this case, if the scheme worked well it would be largely an empirical fact rather than a consequence of the simple theorems we could prove). I think this kind of example is useful because we can easily imagine a human debate judge not having the argument while still being apparently universal. This makes it a useful analogy for cases where the argument really doesn’t exist.

Instead, I think a universal reasoner needs to be able to infer the efficacy of this reasoning procedure from its empirical success. It’s relatively easy for a Bayesian to learn the regularity “f makes good predictions.” Recovering the rest of the matrix A, and learning how to interpret its entries and whether to trust them, is the hard part.

This is going to require the same kind of bridging/identification we discussed in the last section. Let’s write X(A) for the set of beliefs about the world implied by the “intended” identification. Searching over possible identifications to find X (or something like it) is the only way we can ever relate the rows of A to the quantities X(i). Again, we can hope that it isn’t much harder than finding the original reasoning procedure.

I think that a sufficiently sophisticated Bayesian would probably be able to learn to trust X(A):

  • If f is performing well enough that we think it’s more likely to be right in the future, then the Bayesian is going to end up believing some claim like “the predictions of f are good” (since it explains the data so well).

  • This is a complicated statement, and without some kind of explanation this claim has a low prior probability (roughly decaying with the complexity of f). The Bayesian is motivated to find an explanation with higher prior probability.

  • The correspondence X can explain the constraints on the matrix A, in terms of facts that we already know about the world. This explanation may end up being simpler (or at least higher prior) than a direct enumeration of the constraints on A — I hope (and think it’s plausible) that this happens iff we’d actually believe on reflection that X(A) captures reality.
    (To the extent that we are uncertain and think A’s beliefs have a non-negligible chance of capturing reality, then hopefully we can capture that by the same mechanism, ending up with a non-degenerate posterior.)

  • Now the Bayesian is faced with at least two kinds of explanations:
    (a) “If you use the constraints implied by correspondence X(A) + positive semidefiniteness, and then optimize log det, you get a matrix A for which X(A) makes good predictions,”
    (b) “The actual situation in the real world is described by positive semi-definite matrices with higher log determinant (under the correspondence X).”

  • Explanation (b) is explaining two things at once: both why the optimization done by f respects the constraints on our beliefs, and why that optimization leads to good predictions. Hopefully this is simpler than making two separate bridging claims, one which explains f as respecting the constraints implied by X, and one which claims that f makes good predictions. Ideally, this 2-for-1 advantage that favors (b) exactly mirrors the underlying reasoning that leads us to actually believe that X(A) is correct, rather than the view that A merely resembles what we know about reality and makes good predictions “by coincidence.”

This is a pretty speculative discussion — it’s not very careful, and it’s hard to make it careful in part because I don’t have a formalization of Bayesian reasoning that can even really be applied to this setting. But it seems to match my intuitions about what reasonable Bayesian reasoning “should” do, which gives me a lot more optimism that a careful Bayesian would be able to epistemically dominate C.

Deliberation and self-improvement

Often we expect the computation C to have accurate beliefs because it uses a strategy that appears to work in practice — the last 3 examples have discussed that case. But there are other reasons to trust a computation.

For example, humans often write code and trust it (to some extent) even without extensive empirical testing — instead, we have a reason to think it will work, and need only modest testing to make sure that we haven’t made an error in our implementation or reasoning. If I write an automated mathematician that works by finding proofs that pass a proof checker, I don’t expect it to be correct because of the empirical record. (Empirical data backs up some key assumptions, but isn’t being used to directly establish the correctness of the method.)

Likewise, if we train a powerful agent, that agent might initially use strategies that work well in training, but over time it might use learned reasoning to identify other promising strategies and use those. Reasoning might allow it to totally skip empirical testing, or to adopt the method after much less testing than would have been necessary without the reasoning.

To dominate the beliefs produced by such reasoning, we can’t directly appeal to the kind of statistical inference made in the previous section. But in these cases I think we have access to an even more direct strategy.

Concretely, consider the situation where C contains a process f that designs a new reasoning process g. Then:

  • From the outside, we trust g because we trust f and it trusts g.

  • An otherwise-universal reasoner A will dominate f’s beliefs, and in particular if f is justified in thinking that g will work then A will believe that and understand why.

  • Once we understand f’s beliefs, dominating g is essentially another instance of the original ascription universality problem, but now from a slightly stronger epistemic state that involves both what 𝔼 knows and what f knows. So unless our original approach to universality was tightly wedded to details of 𝔼, we can probably dominate g.

At the end of the day we’d like to put all of this together into a tight argument for universality, which will need to incorporate both statistical arguments and this kind of dynamic. But I’m tentatively optimistic about achieving universality in light of the prospect of agents designing new agents, and am much more worried about the kind of opaque computations that “just work” described in the last few sections.